llama.cpp: LLM inference in C/C++
Quantized LLM inference with hardware-accelerated CPU/GPU backends.
Learn more about llama.cpp
llama.cpp is a C/C++ library and command-line tool for executing large language model inference without external dependencies. It implements quantization support ranging from 1.5-bit to 8-bit integer formats and includes hardware-specific optimizations via ARM NEON, Accelerate, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. The project supports numerous model architectures including LLaMA variants, Mistral, Mixtral, Falcon, and others, with capabilities for both CPU and GPU acceleration as well as hybrid inference modes. Common deployment scenarios include local inference on consumer hardware, cloud-based inference services, and integration into applications requiring on-device language model execution.
Zero External Dependencies
Pure C/C++ implementation requires no external libraries for compilation or runtime. Simplifies deployment across embedded systems, servers, and consumer devices without dependency management.
Multi-Backend Hardware Acceleration
Single codebase supports ARM NEON, Metal, AVX/AVX2/AVX512, CUDA, HIP, Vulkan, and SYCL backends. Automatically leverages available hardware acceleration without code changes across CPU, GPU, and specialized accelerators.
Flexible Quantization Formats
Supports 1.5-bit to 8-bit integer quantization with runtime format selection. Enables engineers to balance model size, memory footprint, and inference speed based on target hardware constraints.
#include "llama.h"

#include <iostream>
#include <string>
#include <vector>

int main() {
    // Initialize the llama.cpp backend
    llama_backend_init();

    // Set up model parameters
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0; // CPU only

    // Load the model from a GGUF file
    llama_model* model = llama_load_model_from_file("model.gguf", model_params);
    if (!model) {
        std::cerr << "Failed to load model" << std::endl;
        return 1;
    }

    // Create an inference context
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx = 2048;
    llama_context* ctx = llama_new_context_with_model(model, ctx_params);
    if (!ctx) {
        std::cerr << "Failed to create context" << std::endl;
        llama_free_model(model);
        return 1;
    }

    // Tokenize the prompt (reserve room for the BOS token)
    std::string prompt = "Hello, how are you?";
    std::vector<llama_token> tokens(prompt.length() + 1);
    int n_tokens = llama_tokenize(model, prompt.c_str(), prompt.length(),
                                  tokens.data(), tokens.size(), true, false);
    if (n_tokens < 0) {
        std::cerr << "Failed to tokenize prompt" << std::endl;
        return 1;
    }
    tokens.resize(n_tokens);

    // Evaluate the whole prompt as a single batch
    if (llama_decode(ctx, llama_batch_get_one(tokens.data(), (int32_t)tokens.size()))) {
        std::cerr << "Failed to decode" << std::endl;
    } else {
        // Greedily sample the next token from the last position's logits
        llama_sampler* smpl = llama_sampler_init_greedy();
        llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
        std::cout << "Generated token: " << new_token << std::endl;
        llama_sampler_free(smpl);
    }

    // Cleanup
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}

Added CMake variable to allow downstream packagers to skip installing test files.
Added support for Qwen 3.5 dense and MoE models, excluding vision capabilities.
- Unified delta net handling
- Remove old methods
- Refactor and optimize
- Adapt autoregressive version
- Change to decay mask approach
Fixed CUDA non-contiguous rope operations and improved variable naming consistency.
- Rename variables + fix rope_neox
- Fix rope_multi
- Fix rope_vision
- Fix rope_norm
- Rename ne to ne0 for consistent variable naming
Related Repositories
Discover similar tools and frameworks used by developers
vLLM
Fast, memory-efficient LLM inference engine with PagedAttention for production deployments at scale.
Crush
LLM-powered coding agent with LSP and MCP integration.
Magenta
Google Brain research project using ML to generate music, images, and creative content with TensorFlow.
Llama
PyTorch inference for Meta's Llama language models.
TTS
PyTorch toolkit for deep learning text-to-speech synthesis.