Auto-synchronized Python bindings for llama.cpp
llama-cpp-py-sync provides Python bindings for llama.cpp that are kept up-to-date automatically. It generates bindings from upstream headers using CFFI ABI mode, and ships prebuilt wheels.
- Automatic upstream sync and binding regeneration
- Prebuilt wheels built by CI
- CPU wheels published to PyPI
- Backend-specific wheels published to GitHub Releases: Linux CUDA (12.2) and Vulkan, Windows CUDA (12.4) and Vulkan, macOS Apple Silicon Metal
- CI checks that the generated CFFI surface matches the upstream C API (functions, structs, enums, and signatures)
- A small, explicit Python API (Llama.generate, tokenize, get_embeddings, etc.)
- This project binds to the public C API that llama.cpp exposes in llama.h.
- It does not attempt to bind llama.cpp’s internal C++ implementation, such as private headers, C++ classes/templates, or functions that never appear in llama.h.
- We use CFFI ABI mode: Python loads a prebuilt shared library at runtime (no compiled Python extension module for the bindings).
- Because of that, you still need a compatible llama.cpp shared library available, either bundled in the wheel or via LLAMA_CPP_LIB.
- You get a small high-level API (llama_cpp_py_sync.Llama) for common tasks, and an “escape hatch” to call the low-level C functions directly via CFFI when needed.
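To make the "bundled in the wheel or via LLAMA_CPP_LIB" rule concrete, here is a minimal sketch of how such a lookup could work. It is not the package's actual loader: the function name `resolve_library_path`, the candidate file names, and the assumption that LLAMA_CPP_LIB is an environment variable that takes priority over a bundled library are all illustrative.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical file names; the library actually bundled in a wheel may be named differently.
_CANDIDATE_NAMES = ("libllama.so", "libllama.dylib", "llama.dll")


def resolve_library_path(
    package_dir: Path,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the first usable shared-library path, or None if none is found."""
    env = os.environ if env is None else env

    # 1. An explicit override through LLAMA_CPP_LIB (assumed here to win).
    override = env.get("LLAMA_CPP_LIB")
    if override:
        candidate = Path(override)
        return candidate if candidate.is_file() else None

    # 2. Otherwise look for a library shipped next to the Python package.
    for name in _CANDIDATE_NAMES:
        candidate = package_dir / name
        if candidate.is_file():
            return candidate
    return None
```

Returning None instead of raising keeps the sketch easy to test; a real loader would more likely raise an error that names the paths it tried.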
- High-level API: llama_cpp_py_sync.Llama is the recommended entry point for typical usage such as generation, tokenization, and embeddings.

import llama_cpp_py_sync as llama

with llama.Llama("path/to/model.gguf", n_ctx=2048, n_gpu_layers=0) as llm:
    print(llm.generate("Hello", max_tokens=64))

- Low-level API: llama_cpp_py_sync._cffi_bindings exposes CFFI access to the underlying llama.cpp C API for advanced use.

from llama_cpp_py_sync._cffi_bindings import get_ffi, get_lib

ffi = get_ffi()
lib = get_lib()
print(ffi.string(lib.llama_print_system_info()).decode("utf-8", errors="replace"))

This project supports Python 3.8 through 3.14. CI builds wheels with Python 3.13.13 for reproducibility; the published wheels are intended to work across supported Python versions.
pip install llama-cpp-py-sync

This installs the CPU wheel.
Note: depending on CI configuration and platform support, additional wheels may also be published to PyPI.
After installing from PyPI, you can start an interactive chat session with:
python -m llama_cpp_py_sync chat

If you do not pass --model (and LLAMA_MODEL is not set), the CLI prompts before downloading a default GGUF model and caches it locally for future runs.
To auto-download without prompting, pass --yes.
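The model-selection rules above can be sketched as a small pure function. This is an illustration, not the CLI's real code: `select_model`, `DEFAULT_MODEL`, and the `confirm` callback are hypothetical names, and the precedence of --model over LLAMA_MODEL is an assumption the text does not state.

```python
from typing import Callable, Mapping, Optional

DEFAULT_MODEL = "<default>"  # placeholder marker, not a real model file name


def select_model(
    cli_model: Optional[str],
    env: Mapping[str, str],
    default_cached: bool,
    assume_yes: bool,
    confirm: Callable[[str], bool],
) -> Optional[str]:
    """Return the model to load, DEFAULT_MODEL for the default, or None to abort."""
    if cli_model:
        # --model was passed (assumed to take priority over the environment).
        return cli_model
    env_model = env.get("LLAMA_MODEL")
    if env_model:
        return env_model
    if default_cached or assume_yes:
        # Already downloaded, or --yes was passed: no prompt is needed.
        return DEFAULT_MODEL
    # Otherwise ask before downloading anything.
    return DEFAULT_MODEL if confirm("Download the default GGUF model?") else None
```

Passing `confirm` in as a callback keeps the decision logic separate from terminal I/O, so it can be exercised without a user at the keyboard.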
One-shot prompt:
python -m llama_cpp_py_sync chat --prompt "Say 'ok'." --max-tokens 32

Use a specific local model:

python -m llama_cpp_py_sync chat --model path/to/model.gguf

Download the wheel for your platform/backend from GitHub Releases and install the .whl:

pip install path/to/llama_cpp_py_sync-*.whl

git clone https://github.com/FarisZahrani/llama-cpp-py-sync.git
cd llama-cpp-py-sync
# Sync upstream llama.cpp
python scripts/sync_upstream.py
# Regenerate CFFI bindings from the synced llama.cpp headers
# (Optional) record the exact llama.cpp commit SHA in the generated file.
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
# Build the shared library
python scripts/build_llama_cpp.py
# Install the package
pip install -e .

vendor/llama.cpp is cloned locally by scripts/sync_upstream.py (and in CI during builds) and is not committed to this repository.
import llama_cpp_py_sync as llama
# Load a model
llm = llama.Llama("path/to/model.gguf", n_ctx=2048, n_gpu_layers=35)
# Generate text
response = llm.generate("Hello, world!", max_tokens=100)
print(response)
# Streaming generation
for token in llm.generate("Write a poem:", max_tokens=100, stream=True):
    print(token, end="", flush=True)
# Clean up
llm.close()

with llama.Llama("model.gguf", n_gpu_layers=35) as llm:
    print(llm.generate("Once upon a time"))

# Load an embedding model
with llama.Llama("embed-model.gguf", embedding=True) as llm:
    emb = llm.get_embeddings("Hello, world!")
    print(f"Embedding dimension: {len(emb)}")

from llama_cpp_py_sync import get_available_backends, get_backend_info
print(get_available_backends()) # ['cuda', 'blas'] or similar
info = get_backend_info()
print(f"CUDA available: {info.cuda}")
print(f"Metal available: {info.metal}")

Full API
import llama_cpp_py_sync as llama
# Versions
llama.__version__
llama.__llama_cpp_commit__
# Main class
llm = llama.Llama(
    model_path="path/to/model.gguf",
    n_ctx=512,
    n_batch=512,
    n_threads=None,
    n_gpu_layers=0,
    n_ubatch=None,
    n_threads_batch=None,
    seed=-1,
    use_mmap=True,
    use_mlock=False,
    verbose=False,
    embedding=False,
    flash_attn_type=None,
)
text = llm.generate(
    "Hello",
    max_tokens=256,
    temperature=0.8,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    repeat_penalty=1.1,
    repeat_last_n=64,
    stop_sequences=None,
    stream=False,
    seed=None,
)
stream = llm.generate(
    "Hello",
    max_tokens=256,
    stream=True,
)
tokens = llm.tokenize("Hello", add_special=True, parse_special=False)
text = llm.detokenize(tokens, remove_special=False, unparse_special=True)
piece = llm.token_to_piece(tokens[0])
llm.get_model_desc()
llm.get_model_size()
llm.get_model_n_params()
# Properties
llm.n_vocab
llm.n_ctx
llm.n_embd
llm.n_layer
llm.bos_token
llm.eos_token
# Embeddings (requires embedding=True)
emb = llm.get_embeddings("Hello")
llm.close()
# Module-level embeddings helpers
llama.get_embeddings("path/to/model.gguf", "Hello")
llama.get_embeddings_batch("path/to/model.gguf", ["Hello", "World"])
# Backend helpers
llama.get_available_backends()
llama.get_backend_info()
llama.is_cuda_available()
llama.is_metal_available()
llama.is_vulkan_available()
llama.is_rocm_available()
llama.is_blas_available()

- Scheduled Checks: GitHub Actions checks upstream llama.cpp on a schedule
- Tag Mirroring: When an upstream tag exists, the workflow can mirror it into this repository
- Wheel Building: CI builds wheels for all platforms/backends
- Release Publishing: GitHub Releases are created only for tags that exist upstream
- PyPI Publishing: CPU-only wheels are published to PyPI for upstream tags (if configured)
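The tag rules in this list reduce to two set operations. The sketch below is only an illustration of that logic under assumed inputs (plain lists of tag names); the function names are hypothetical and the real workflow lives in GitHub Actions, not in this code.

```python
from typing import Iterable, List


def tags_to_mirror(upstream_tags: Iterable[str], local_tags: Iterable[str]) -> List[str]:
    """Upstream tags that do not exist in this repository yet."""
    return sorted(set(upstream_tags) - set(local_tags))


def releasable_tags(upstream_tags: Iterable[str], local_tags: Iterable[str]) -> List[str]:
    """Local tags that also exist upstream; only these would get a release."""
    return sorted(set(local_tags) & set(upstream_tags))
```

A tag that exists only locally (for example, one created by hand) falls out of `releasable_tags`, which matches the rule that releases are created only for tags that exist upstream.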
To keep the Python bindings aligned with upstream, CI runs a validation step that compares upstream llama.h to the generated CFFI cdef.
It checks:
- Public function coverage (missing/extra)
- Struct and enum coverage (missing fields/members)
- Function signatures (return + parameter types)
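As a rough picture of what such a comparison involves, the sketch below diffs simple one-line C prototypes from two text sources. It is a deliberately simplified stand-in, not scripts/validate_cffi_surface.py: the regex handles only plain prototypes, struct and enum checks are omitted, and the names `parse_prototypes`, `diff_surface`, and `SurfaceDiff` are hypothetical.

```python
import re
from typing import Dict, NamedTuple, Set

# Matches simple prototypes such as "int32_t name(const char * s);".
_PROTO = re.compile(r"([\w\s\*]+?)\b(\w+)\s*\(([^)]*)\)\s*;")


def parse_prototypes(source: str) -> Dict[str, str]:
    """Map each function name to a normalized 'return_type(params)' string."""
    protos = {}
    for ret, name, params in _PROTO.findall(source):
        ret = ret.replace("LLAMA_API", "").strip()  # drop the export macro
        signature = f"{ret}({params.strip()})"
        signature = re.sub(r"\s*\*\s*", "*", signature)  # "char * x" == "char *x"
        protos[name] = re.sub(r"\s+", " ", signature)
    return protos


class SurfaceDiff(NamedTuple):
    missing: Set[str]     # in the header but not in the cdef
    extra: Set[str]       # in the cdef but not in the header
    mismatched: Set[str]  # in both, with different signatures


def diff_surface(header: str, cdef: str) -> SurfaceDiff:
    upstream, bound = parse_prototypes(header), parse_prototypes(cdef)
    common = set(upstream) & set(bound)
    return SurfaceDiff(
        missing=set(upstream) - set(bound),
        extra=set(bound) - set(upstream),
        mismatched={name for name in common if upstream[name] != bound[name]},
    )
```

A CI step built on this idea would fail the build whenever any of the three sets is non-empty.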
Local run (after syncing upstream headers):
python scripts/sync_upstream.py
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
python scripts/validate_cffi_surface.py --check-structs --check-enums --check-signatures

Unlike pybind11 or manual ctypes, CFFI ABI mode:
- Reads C declarations directly (no compilation needed for bindings)
- Loads the shared library at runtime via ffi.dlopen()
- Automatically handles type conversions
- Works across platforms without modification
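To show these points outside of llama.cpp, here is a small sketch that uses CFFI ABI mode against the C standard library instead. It assumes the cffi package is installed and a POSIX system where `ffi.dlopen(None)` is supported; the helper name `c_strlen` is made up for this example, and it returns None when either assumption fails.

```python
from typing import Optional

try:
    from cffi import FFI
except ImportError:  # cffi is treated as optional in this sketch
    FFI = None


def c_strlen(data: bytes) -> Optional[int]:
    """Call the C library's strlen through CFFI ABI mode, or return None if unavailable."""
    if FFI is None:
        return None
    ffi = FFI()
    # Declare the C signature as text; nothing is compiled here.
    ffi.cdef("size_t strlen(const char *s);")
    try:
        # dlopen(None) opens the default C namespace (not supported on Windows).
        libc = ffi.dlopen(None)
    except OSError:
        return None
    # bytes is passed as const char *, and size_t comes back as a Python int.
    return int(libc.strlen(data))
```

The same three steps (declare with `cdef`, open with `dlopen`, call) apply to any shared library, which is why no extension module has to be compiled for the bindings.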
Check which llama.cpp version you're running:
import llama_cpp_py_sync as llama
print(f"Package version: {llama.__version__}")
print(f"llama.cpp commit: {llama.__llama_cpp_commit__}")
print(f"llama.cpp tag: {getattr(llama, '__llama_cpp_tag__', '')}")

The build system automatically detects available backends:
| Backend | Platform | Detection |
|---|---|---|
| CUDA | Linux, Windows | CUDA_HOME or /usr/local/cuda |
| ROCm | Linux | ROCM_PATH or /opt/rocm |
| Metal | macOS | Xcode SDK |
| Vulkan | All | VULKAN_SDK environment variable |
| BLAS | All | OpenBLAS, MKL, or Accelerate |
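The table can be read as a set of environment and path checks. The sketch below encodes those rows as a pure function; it is not the logic in scripts/build_llama_cpp.py. The name `detect_backends` is hypothetical, Metal is assumed available on any macOS host (the Xcode SDK check is skipped), and BLAS is left out because probing for libraries does not fit a short example.

```python
from typing import Callable, List, Mapping


def detect_backends(
    system: str,                       # a value such as platform.system() returns
    env: Mapping[str, str],
    path_exists: Callable[[str], bool],
) -> List[str]:
    """Return backend names that appear available, following the table's rules."""
    found = []
    if system in ("Linux", "Windows") and (
        env.get("CUDA_HOME") or path_exists("/usr/local/cuda")
    ):
        found.append("cuda")
    if system == "Linux" and (env.get("ROCM_PATH") or path_exists("/opt/rocm")):
        found.append("rocm")
    if system == "Darwin":
        found.append("metal")  # simplification: no Xcode SDK check
    if env.get("VULKAN_SDK"):
        found.append("vulkan")
    return found
```

In real use the arguments would be `platform.system()`, `os.environ`, and `os.path.exists`; injecting them keeps the function testable on any machine.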
# Use GPU acceleration
llm = llama.Llama("model.gguf", n_gpu_layers=35)
# CPU only (no GPU offload)
llm = llama.Llama("model.gguf", n_gpu_layers=0)
# Full GPU offload (all layers)
llm = llama.Llama("model.gguf", n_gpu_layers=-1)

class Llama:
    def __init__(
        self,
        model_path: str,
        n_ctx: int = 512,              # Context window size
        n_batch: int = 512,            # Logical max batch size for prompt processing
        n_threads: int = None,         # CPU threads (auto-detect if None)
        n_gpu_layers: int = 0,         # Layers to offload to GPU
        n_ubatch: int = None,          # Physical microbatch size (defaults to n_batch)
        n_threads_batch: int = None,   # Threads for batch processing (defaults to n_threads)
        seed: int = -1,                # Random seed (-1 for random)
        use_mmap: bool = True,         # Memory map model file
        use_mlock: bool = False,       # Lock model in RAM
        verbose: bool = False,         # Print loading info
        embedding: bool = False,       # Enable embedding mode
        flash_attn_type: int = None,   # Flash attention type (None = use env var)
    ): ...

    def generate(
        self,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.8,
        top_k: int = 40,
        top_p: float = 0.95,
        min_p: float = 0.05,
        repeat_penalty: float = 1.1,
        repeat_last_n: int = 64,
        stop_sequences: List[str] = None,
        stream: bool = False,
        seed: int = None,
    ) -> Union[str, Iterator[str]]: ...

    def tokenize(self, text: str, add_special: bool = True, parse_special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], remove_special: bool = False, unparse_special: bool = True) -> str: ...
    def token_to_piece(self, token: int) -> str: ...
    def get_embeddings(self, text: str) -> List[float]: ...
    def get_model_desc(self) -> str: ...
    def get_model_size(self) -> int: ...
    def get_model_n_params(self) -> int: ...
    def close(self): ...

    # Properties
    n_vocab: int
    n_ctx: int
    n_embd: int
    n_layer: int
    bos_token: int
    eos_token: int

def get_available_backends() -> List[str]: ...
def get_backend_info() -> BackendInfo: ...
def is_cuda_available() -> bool: ...
def is_metal_available() -> bool: ...
def is_vulkan_available() -> bool: ...
def is_rocm_available() -> bool: ...
def is_blas_available() -> bool: ...

def get_embeddings(model: Union[str, Llama], text: str) -> List[float]: ...
def get_embeddings_batch(model: Union[str, Llama], texts: List[str]) -> List[List[float]]: ...

See the examples/ directory:
- basic_generation.py - Simple text generation
- streaming_generation.py - Real-time token streaming
- embeddings_example.py - Generate and compare embeddings
- backend_info.py - Check available GPU backends
- benchmark.py - Measure token throughput
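For readers wondering what "compare embeddings" usually means, a common choice is cosine similarity between two vectors such as those returned by get_embeddings. The helper below is a generic sketch and not the contents of examples/embeddings_example.py; the function name and the zero-vector convention are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

Values near 1 indicate texts the embedding model considers similar; values near 0 indicate unrelated texts.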
This repository includes an interactive smoke test that can run either as a one-shot prompt (CI-friendly) or as a back-and-forth chat.
# Interactive chat (Ctrl+C or blank line to exit)
python -m llama_cpp_py_sync chat
# One-shot prompt
python -m llama_cpp_py_sync chat --prompt "Say 'ok'." --max-tokens 16
# Use a specific model
python -m llama_cpp_py_sync chat --model path/to/model.gguf

By default it uses LLAMA_MODEL if set. Otherwise it downloads a default GGUF model and caches it locally.
If the default model is missing, the CLI will prompt before downloading it. To auto-download without prompting, pass --yes.
Model cache location:
- Windows: %LOCALAPPDATA%\llama-cpp-py-sync\models\
- Linux/macOS: ~/.cache/llama-cpp-py-sync/models/
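If you want to locate the cache from your own scripts, the two locations above can be computed as in the following sketch. The function `model_cache_dir` is hypothetical rather than part of the package, and the fallback used when LOCALAPPDATA is unset is an assumption.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional

APP_DIR = "llama-cpp-py-sync"


def model_cache_dir(
    platform: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the model cache directory described above for the given platform."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    if platform.startswith("win"):
        local_app_data = env.get("LOCALAPPDATA")
        # Assumed fallback when LOCALAPPDATA is not set.
        base = Path(local_app_data) if local_app_data else home / "AppData" / "Local"
        return base / APP_DIR / "models"
    # Linux and macOS share the ~/.cache location.
    return home / ".cache" / APP_DIR / "models"
```

Called with no arguments, it uses the current platform, environment, and home directory.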
- Python 3.8+
- Ninja
- CMake (configure step)
- C/C++ compiler (GCC, Clang, MSVC)
- Git
# Clone repository
git clone https://github.com/FarisZahrani/llama-cpp-py-sync.git
cd llama-cpp-py-sync
# Sync upstream llama.cpp
python scripts/sync_upstream.py
# Regenerate bindings from the synced llama.cpp headers
# (Optional) record the exact llama.cpp commit SHA in the generated file.
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
# Build with auto-detected backends
python scripts/build_llama_cpp.py
# Build a specific backend
python scripts/build_llama_cpp.py --backend cuda
python scripts/build_llama_cpp.py --backend vulkan
python scripts/build_llama_cpp.py --backend cpu
# On Windows, the build script bundles required runtime DLLs (MSVC/OpenMP and backend runtimes)
# next to the built library by default. You can disable this behavior with:
python scripts/build_llama_cpp.py --no-bundle-runtime-dlls
# Detect available backends without building
python scripts/build_llama_cpp.py --detect-only
# Build wheel
pip install build
python -m build --wheel

If you need direct access to the underlying C API (beyond the high-level Llama wrapper), you can use the generated CFFI bindings:
from llama_cpp_py_sync._cffi_bindings import get_ffi, get_lib
ffi = get_ffi()
lib = get_lib()
print(ffi.string(lib.llama_print_system_info()).decode("utf-8", errors="replace"))

llama-cpp-py-sync/
├── src/llama_cpp_py_sync/ # Python package
│ ├── __init__.py # Public API
│ ├── _cffi_bindings.py # Auto-generated CFFI bindings
│ ├── _version.py # Version info
│ ├── llama.py # High-level Llama class
│ ├── embeddings.py # Embedding utilities
│ └── backends.py # Backend detection
├── scripts/ # Build and sync scripts
│ ├── sync_upstream.py # Sync upstream llama.cpp
│ ├── gen_bindings.py # Generate CFFI bindings
│ ├── build_llama_cpp.py # Build shared library
│ └── auto_version.py # Version generation
├── examples/ # Example scripts
├── vendor/llama.cpp/ # Upstream source (cloned at build time)
├── .github/workflows/ # CI/CD pipelines
├── pyproject.toml # Package metadata
└── README.md # This file
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run checks:
python scripts/run_tests.py
Optionally also verify wheel packaging locally:
python scripts/run_tests.py
- Submit a pull request
MIT License - see LICENSE for details.
This project uses llama.cpp which is also MIT licensed.
Third-party license notices are included in THIRD_PARTY_NOTICES.txt.
- ggml-org/llama.cpp - The upstream C/C++ implementation
- CFFI - C Foreign Function Interface for Python