KoboldCpp: GGUF model inference with web UI
Self-contained llama.cpp distribution with KoboldAI API for running LLMs on consumer hardware.
Learn more about KoboldCpp
KoboldCpp is an inference engine for quantized language models that packages llama.cpp with additional features into a single executable. It runs on CPU or GPU with optional layer offloading, and serves a web interface for model interaction. The application supports GGML and GGUF model formats with backward compatibility for older model versions. Common deployment contexts include local development, cloud platforms like Google Colab and RunPod, and containerized environments via Docker.
Zero-Setup Single Executable
KoboldCpp ships as a self-contained executable with no installation required. Simply download, run, and start using LLMs immediately. The portable design eliminates dependency management, virtual environments, and configuration headaches. Perfect for users who want to experiment with AI without wrestling with Python environments or complex build processes.
Hardware Acceleration Support
Leverages multiple acceleration backends including CUDA for NVIDIA GPUs, OpenCL for AMD cards, and Vulkan for cross-platform GPU support. Intelligently splits processing between CPU and GPU for optimal performance on mixed hardware. Supports quantized models (4-bit, 5-bit, 8-bit) to run larger models on consumer-grade hardware with limited VRAM.
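To see why 4-bit, 5-bit, and 8-bit quantization matters on limited VRAM, a back-of-the-envelope estimate of weight memory helps (illustrative arithmetic only, not KoboldCpp code; it ignores KV cache and runtime overhead):

```python
# Rough weight-memory estimate for a 7B-parameter model at different
# quantization levels (weights only; KV cache and overhead excluded).
def weight_gib(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 2**30

params = 7e9
for bits in (16, 8, 5, 4):
    print(f"{bits}-bit: {weight_gib(params, bits):.1f} GiB")
```

At 16-bit a 7B model needs roughly 13 GiB for weights alone, while 4-bit quantization brings that down to about 3.3 GiB, which is why quantized models fit on consumer GPUs.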
KoboldAI API Compatibility
Provides full compatibility with the KoboldAI ecosystem and API specification, enabling integration with popular frontends like SillyTavern, Agnaistic, and other community tools. Supports OpenAI-compatible endpoints for drop-in replacement scenarios. Includes built-in web UI for immediate text generation without additional clients, plus streaming support for real-time token generation.
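The KoboldAI-native generate call is shown in the example below; for the OpenAI-compatible route, here is a minimal sketch of the request shape (the /v1/chat/completions path and payload fields follow the OpenAI API convention; the model name is a placeholder, since a single-model server uses whatever model is loaded):

```python
import json

# Payload for KoboldCpp's OpenAI-compatible chat endpoint.
payload = {
    "model": "koboldcpp",  # placeholder; the loaded model is used
    "messages": [
        {"role": "user", "content": "Once upon a time"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}

# With a server running on the default port, this would be posted via:
#   requests.post("http://localhost:5001/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```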
import requests

# Query a locally running KoboldCpp server (default port 5001)
# via the KoboldAI v1 generate endpoint.
response = requests.post('http://localhost:5001/api/v1/generate', json={
    'prompt': 'Once upon a time',
    'max_length': 100,
    'temperature': 0.7
})
generated_text = response.json()['results'][0]['text']
print(generated_text)

Add Vulkan GPU support for older PCs and enable pipeline parallel and flash attention by default
- Added a new "Vulkan (Older PC)" option in the oldpc builds. This provides GPU support via Vulkan without requiring any CPU intrinsics.
- Pipeline parallel is now enabled by default in the CLI. Disable it in the launcher or with --nopipelineparallel.
- Flash attention is now enabled by default in the CLI. Disable it in the launcher or with --noflashattention.
- Fixes for mcp.json importing and the MCP tool-listing handshake.
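Since pipeline parallel and flash attention are now on by default, opting out happens at launch. A sketch of such an invocation (the model path is illustrative; the flag names come from the notes above):

```shell
# Launch with the new CLI defaults disabled
# (model filename is a hypothetical example)
./koboldcpp --model ./models/example-7b.Q4_K_M.gguf \
  --nopipelineparallel --noflashattention
```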
Add MCP (Model Context Protocol) server and client support for external tools and services
- NEW: MCP server and client support added to KoboldCpp, serving as a direct drop-in replacement for Claude Desktop.
- KoboldCpp can connect to any HTTP or STDIO MCP server, using an mcp.json config format compatible with Claude Desktop.
- Multiple servers are supported; KoboldCpp will automatically combine their tools and dispatch requests appropriately.
- CAUTION: Running ANY MCP server gives it full access to your system. Be sure to only run servers you trust!
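Since the mcp.json format is compatible with Claude Desktop, an existing config can be reused as-is. A minimal sketch of one STDIO server entry (the server name, command, and path are illustrative):

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  }
}
```

Each key under "mcpServers" names one server; KoboldCpp merges the tools from every listed server into a single tool set.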
Add --gendefaults flag for API parameter overrides and introduce Adaptive-P sampler
- --sdgendefaults has been deprecated and merged into this flag.
- StableUI (SDUI): Fixed generation queue stacking, allowed requesting AVI-formatted videos, and added a dismiss button.
- Minor fixes to tool calling.
- Fixed LoRA loading issues with some Qwen Image LoRAs.
Related Repositories
Discover similar tools and frameworks used by developers
Stable Diffusion
Text-to-image diffusion in compressed latent space.
LivePortrait
PyTorch implementation for animating portraits by transferring expressions from driving videos.
xFormers
Memory-efficient PyTorch components for transformer architectures.
InvokeAI
Node-based workflow interface for local Stable Diffusion deployment.
WiFi DensePose
System for real-time human pose tracking using WiFi Channel State Information without cameras or wearables.