llama.cpp: LLM inference in C/C++
Quantized LLM inference with hardware-accelerated CPU/GPU backends.
Learn more about llama.cpp
llama.cpp is a C/C++ library and command-line tool for running large language model inference without external dependencies. It supports integer quantization from 1.5-bit to 8-bit and includes hardware-specific optimizations via ARM NEON, Accelerate, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. The project supports numerous model architectures, including LLaMA variants, Mistral, Mixtral, and Falcon, and can run inference on CPU, on GPU, or in hybrid CPU+GPU modes. Common deployment scenarios include local inference on consumer hardware, cloud-based inference services, and integration into applications requiring on-device language model execution.
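For the server and application-integration scenarios above, the repository also includes an HTTP server example. A minimal sketch, assuming a recent release (the binary is named llama-server in current versions, server in older ones):
./llama-server -m models/llama-2-7b.Q4_K_M.gguf --port 8080 -c 2048
# Serves an HTTP completion API (OpenAI-compatible in recent releases) on port 8080
# -c: context window size in tokens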
Zero External Dependencies
Pure C/C++ implementation requires no external libraries for compilation or runtime. Simplifies deployment across embedded systems, servers, and consumer devices without dependency management.
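Building from source reflects this: only a C/C++ toolchain and CMake are needed. A sketch assuming a recent checkout (older releases also shipped a plain Makefile):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# No third-party libraries are fetched or linked for the core build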
Multi-Backend Hardware Acceleration
Single codebase supports ARM NEON, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. Automatically leverages available hardware acceleration without code changes across CPU, GPU, and specialized accelerators.
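Backends are selected at build time through CMake options, while GPU offload is controlled at runtime. A sketch assuming a recent release; option names have changed across versions (for example, older releases used LLAMA_CUBLAS for CUDA):
cmake -B build -DGGML_CUDA=ON
# or: -DGGML_METAL=ON (Apple GPUs), -DGGML_VULKAN=ON (cross-vendor Vulkan)
cmake --build build --config Release
./main -m models/llama-2-7b.Q4_K_M.gguf -p "Hello" -ngl 35
# -ngl: number of model layers to offload to the GPU; the rest stay on the CPU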
Flexible Quantization Formats
Supports integer quantization formats from 1.5-bit to 8-bit, chosen when a model is quantized to GGUF. Lets engineers balance model size, memory footprint, and inference speed against target hardware constraints.
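Formats are applied offline with the bundled quantization tool; each GGUF file is produced at a chosen bit width. A sketch, assuming the binary name llama-quantize used by recent releases (older ones named it quantize):
./llama-quantize models/llama-2-7b.f16.gguf models/llama-2-7b.Q4_K_M.gguf Q4_K_M
# Re-encodes a 16-bit GGUF model into the 4-bit Q4_K_M format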
./main -m models/llama-2-7b.Q4_K_M.gguf -p "Hello, my name is" -n 128
# Load a quantized model and generate 128 tokens
# -m: model path
# -p: prompt text
# -n: number of tokens to generate
Release Notes
Release notes do not specify breaking changes, requirements, or new capabilities.
- No actionable migration steps or configuration changes are documented for this release.
- Consult the commit history or changelog for details before upgrading production systems.
Related Repositories
Discover similar tools and frameworks used by developers
text-generation-webui
Feature-rich Gradio-based UI for running and interacting with LLMs locally, supporting multiple model formats and extensions.
openpose
Real-time multi-person keypoint detection for body, face, hands, and feet (135 keypoints) in C++.
pix2pix
Torch implementation for paired image-to-image translation using cGANs.
Kimi-K2
Trillion-parameter MoE model with Muon-optimized training.
crewAI
Python framework for autonomous multi-agent AI collaboration.