
llama.cpp: LLM inference in C/C++

Quantized LLM inference with hardware-accelerated CPU/GPU backends.

LIVE RANKINGS (10:20 AM, steady)
  OVERALL RANK: #28
  AI & ML RANK: #16
  STARS: 96.0K (+856 over 7 days)
  FORKS: 15.1K (+163 over 7 days)

Learn more about llama.cpp

llama.cpp is a C/C++ library and command-line tool for executing large language model inference without external dependencies. It implements quantization support ranging from 1.5-bit to 8-bit integer formats and includes hardware-specific optimizations via ARM NEON, Accelerate, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. The project supports numerous model architectures including LLaMA variants, Mistral, Mixtral, Falcon, and others, with capabilities for both CPU and GPU acceleration as well as hybrid inference modes. Common deployment scenarios include local inference on consumer hardware, cloud-based inference services, and integration into applications requiring on-device language model execution.
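For a sense of the typical workflow, a CMake-based quick-start looks roughly like the following (the model path is a placeholder; exact build options vary by platform and release):

```shell
# Clone and build with the default (CPU) backend
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Run single-shot inference against a local GGUF model (placeholder path)
./build/bin/llama-cli -m ./models/model.gguf -p "Hello, how are you?" -n 64
```

The `-n` flag caps the number of generated tokens; `llama-cli` also supports an interactive chat mode.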

Key features of llama.cpp

1. Zero External Dependencies

Pure C/C++ implementation requires no external libraries for compilation or runtime. Simplifies deployment across embedded systems, servers, and consumer devices without dependency management.

2. Multi-Backend Hardware Acceleration

Single codebase supports ARM NEON, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. Automatically leverages available hardware acceleration without code changes across CPU, GPU, and specialized accelerators.
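Backend selection happens at build time through CMake options. A few common configurations are sketched below (option names follow current ggml/llama.cpp builds; check the build docs for your release):

```shell
# CUDA backend (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON

# Vulkan backend (portable GPU support; Metal is enabled by default on macOS)
cmake -B build -DGGML_VULKAN=ON

# SYCL backend (e.g. Intel GPUs)
cmake -B build -DGGML_SYCL=ON

cmake --build build --config Release
```

At runtime, the `-ngl`/`--n-gpu-layers` option controls how many layers are offloaded to the GPU, enabling hybrid CPU/GPU inference.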

3. Flexible Quantization Formats

Supports 1.5-bit to 8-bit integer quantization with runtime format selection. Enables engineers to balance model size, memory footprint, and inference speed based on target hardware constraints.
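As a back-of-the-envelope illustration of why quantization matters, the weight-only memory footprint scales linearly with bits per weight. A small sketch (the bits-per-weight figures are rough approximations for illustration, not exact per-format overheads):

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 7e9  # a 7B-parameter model

# Approximate effective bits per weight, including quantization metadata
for fmt, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ1_S", 1.6)]:
    print(f"{fmt:8s} ~{weight_gib(N_PARAMS, bpw):.1f} GiB")
```

Dropping from 16-bit to roughly 4-bit weights cuts the footprint by about 4x, which is often what makes a 7B model fit in consumer RAM or VRAM.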


A minimal example using the llama.cpp C API (function names follow recent releases; older versions differ, notably around batching and sampling):

#include "llama.h"
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Initialize llama backend
    llama_backend_init();

    // Set up model parameters
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0; // CPU only

    // Load the model
    llama_model* model = llama_load_model_from_file("model.gguf", model_params);
    if (!model) {
        std::cerr << "Failed to load model" << std::endl;
        return 1;
    }

    // Create context (sampling seeds now belong to the sampler API,
    // not the context parameters)
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx = 2048;

    llama_context* ctx = llama_new_context_with_model(model, ctx_params);
    if (!ctx) {
        std::cerr << "Failed to create context" << std::endl;
        llama_free_model(model);
        return 1;
    }

    // Tokenize input (true = add BOS token, false = no special-token parsing)
    std::string prompt = "Hello, how are you?";
    std::vector<llama_token> tokens(prompt.length() + 1);
    int n_tokens = llama_tokenize(model, prompt.c_str(), prompt.length(),
                                  tokens.data(), tokens.size(), true, false);
    tokens.resize(n_tokens);

    // Evaluate the whole prompt in a single batch
    if (llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tokens))) {
        std::cerr << "Failed to decode" << std::endl;
    } else {
        // Greedily sample the next token; the sampler must be freed after use
        llama_sampler* sampler = llama_sampler_init_greedy();
        llama_token new_token = llama_sampler_sample(sampler, ctx, -1);
        std::cout << "Generated token: " << new_token << std::endl;
        llama_sampler_free(sampler);
    }

    // Cleanup
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();

    return 0;
}

vb7974

Added CMake variable to allow downstream packagers to skip installing test files.

vb7973

Added support for Qwen 3.5 dense and MoE models, excluding vision capabilities.

  • Unified delta net handling
  • Remove old methods
  • Refactor and optimize
  • Adapt autoregressive version
  • Change to decay mask approach
vb7972

Fixed CUDA non-contiguous rope operations and improved variable naming consistency.

  • Rename variables + fix rope_neox
  • Fix rope_multi
  • Fix rope_vision
  • Fix rope_norm
  • Rename ne to ne0 for consistent variable naming



[ EXPLORE MORE ]

Related Repositories

Discover similar tools and frameworks used by developers