llama.cpp: LLM inference in C/C++
Quantized LLM inference with hardware-accelerated CPU/GPU backends.
Learn more about llama.cpp
llama.cpp is a C/C++ library and command-line tool for executing large language model inference without external dependencies. It implements quantization support ranging from 1.5-bit to 8-bit integer formats and includes hardware-specific optimizations via ARM NEON, Accelerate, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. The project supports numerous model architectures including LLaMA variants, Mistral, Mixtral, Falcon, and others, with capabilities for both CPU and GPU acceleration as well as hybrid inference modes. Common deployment scenarios include local inference on consumer hardware, cloud-based inference services, and integration into applications requiring on-device language model execution.
Zero External Dependencies
Pure C/C++ implementation requires no external libraries for compilation or runtime. Simplifies deployment across embedded systems, servers, and consumer devices without dependency management.
Multi-Backend Hardware Acceleration
Single codebase supports ARM NEON, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. Automatically leverages available hardware acceleration without code changes across CPU, GPU, and specialized accelerators.
Flexible Quantization Formats
Supports 1.5-bit to 8-bit integer quantization with runtime format selection. Enables engineers to balance model size, memory footprint, and inference speed based on target hardware constraints.
#include "llama.h"

#include <iostream>
#include <string>
#include <vector>

int main() {
    // Initialize the llama.cpp backend
    llama_backend_init();

    // Set up model parameters
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0; // CPU only

    // Load the model from a GGUF file
    llama_model* model = llama_load_model_from_file("model.gguf", model_params);
    if (!model) {
        std::cerr << "Failed to load model" << std::endl;
        return 1;
    }

    // Create an inference context
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx = 2048;
    llama_context* ctx = llama_new_context_with_model(model, ctx_params);
    if (!ctx) {
        std::cerr << "Failed to create context" << std::endl;
        llama_free_model(model);
        return 1;
    }

    // Tokenize the prompt (reserve room for the BOS token)
    std::string prompt = "Hello, how are you?";
    std::vector<llama_token> tokens(prompt.length() + 1);
    int n_tokens = llama_tokenize(model, prompt.c_str(), prompt.length(),
                                  tokens.data(), tokens.size(), true, false);
    if (n_tokens < 0) {
        std::cerr << "Failed to tokenize prompt" << std::endl;
        return 1;
    }
    tokens.resize(n_tokens);

    // Evaluate the whole prompt as a single batch
    if (llama_decode(ctx, llama_batch_get_one(tokens.data(), (int32_t)tokens.size()))) {
        std::cerr << "Failed to decode" << std::endl;
    } else {
        // Greedily sample the next token from the last position's logits
        llama_sampler* smpl = llama_sampler_init_greedy();
        llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
        std::cout << "Generated token: " << new_token << std::endl;
        llama_sampler_free(smpl);
    }

    // Cleanup
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}

Added CMake variable to allow downstream packagers to skip installing test files.
Added support for Qwen 3.5 dense and MoE models, excluding vision capabilities.
- Unified delta net handling
- Remove old methods
- Refactor and optimize
- Adapt autoregressive version
- Change to decay mask approach
Fixed CUDA non-contiguous rope operations and improved variable naming consistency.
- Rename variables + fix rope_neox
- Fix rope_multi
- Fix rope_vision
- Fix rope_norm
- Rename ne to ne0 for consistent variable naming
Related Repositories
Discover similar tools and frameworks used by developers
vLLM
Fast, memory-efficient LLM inference engine with PagedAttention for production deployments at scale.
Crush
LLM-powered coding agent with LSP and MCP integration.
Magenta
Google Brain research project using ML to generate music, images, and creative content with TensorFlow.
Llama
PyTorch inference for Meta's Llama language models.
TTS
PyTorch toolkit for deep learning text-to-speech synthesis.