
KoboldCpp: GGUF model inference with web UI

Self-contained llama.cpp distribution with KoboldAI API for running LLMs on consumer hardware.

Rankings: #135 overall · #55 in AI & ML (30-day trend: steady)
Stars: 9.6K (+66 in the last 7 days) · Forks: 620 (+5 in the last 7 days)

Learn more about KoboldCpp

KoboldCpp is an inference engine for quantized language models that packages llama.cpp with additional features into a single executable. It runs on CPU or GPU with optional layer offloading, and serves a web interface for model interaction. The application supports GGML and GGUF model formats with backward compatibility for older model versions. Common deployment contexts include local development, cloud platforms like Google Colab and RunPod, and containerized environments via Docker.


1. Zero-Setup Single Executable

KoboldCpp ships as a self-contained executable with no installation required. Simply download, run, and start using LLMs immediately. The portable design eliminates dependency management, virtual environments, and configuration headaches. Perfect for users who want to experiment with AI without wrestling with Python environments or complex build processes.
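The workflow can be sketched as follows (the model filename is a placeholder; `--model` and `--port` are standard KoboldCpp launch flags):

```shell
# Download the release binary for your platform, make it executable,
# then point it at a GGUF model file. No Python environment or build
# step is required.
./koboldcpp --model ./my-model.gguf --port 5001
# The web UI is then served at http://localhost:5001
```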

2. Hardware Acceleration Support

Leverages multiple acceleration backends including CUDA for NVIDIA GPUs, OpenCL for AMD cards, and Vulkan for cross-platform GPU support. Intelligently splits processing between CPU and GPU for optimal performance on mixed hardware. Supports quantized models (4-bit, 5-bit, 8-bit) to run larger models on consumer-grade hardware with limited VRAM.
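A rough back-of-the-envelope sketch of why quantization matters for limited VRAM; the bytes-per-weight figures below are approximations, and real GGUF files add overhead for embeddings and metadata:

```python
# Approximate in-memory size of a model at different quantization levels.
# Bytes-per-weight values are rough averages for common GGUF quant formats;
# actual sizes vary with the specific scheme (e.g. K-quants mix bit widths).
BYTES_PER_WEIGHT = {
    'fp16': 2.0,
    'q8_0': 1.0,   # ~8-bit
    'q5_k': 0.69,  # ~5.5 bits effective
    'q4_k': 0.56,  # ~4.5 bits effective
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Approximate model size in GiB for a parameter count and quant level."""
    bytes_total = n_params_billion * 1e9 * BYTES_PER_WEIGHT[quant]
    return bytes_total / (1024 ** 3)

# A 7B model drops from ~13 GiB at fp16 to under 4 GiB at ~4-bit,
# which is the difference between needing a datacenter GPU and
# fitting on a consumer card.
for quant in ('fp16', 'q8_0', 'q5_k', 'q4_k'):
    print(f"7B @ {quant}: ~{approx_size_gb(7, quant):.1f} GiB")
```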

3. KoboldAI API Compatibility

Provides full compatibility with the KoboldAI ecosystem and API specification, enabling integration with popular frontends like SillyTavern, Agnaistic, and other community tools. Supports OpenAI-compatible endpoints for drop-in replacement scenarios. Includes built-in web UI for immediate text generation without additional clients, plus streaming support for real-time token generation.


import requests

# Query a locally running KoboldCpp server (default port 5001)
# via the KoboldAI generate endpoint.
response = requests.post(
    'http://localhost:5001/api/v1/generate',
    json={
        'prompt': 'Once upon a time',
        'max_length': 100,    # maximum number of tokens to generate
        'temperature': 0.7,   # sampling temperature
    },
    timeout=120,
)
response.raise_for_status()

# The API returns a list of results; take the text of the first one.
generated_text = response.json()['results'][0]['text']
print(generated_text)
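For the streaming mode mentioned above, tokens arrive as server-sent events. A minimal parser sketch for SSE-style `data:` lines, assuming each event payload is JSON with a `token` field (adjust the field name if your server version differs):

```python
import json

def parse_sse_tokens(stream_lines):
    """Collect generated text from SSE 'data:' event lines.

    Assumes each event payload is a JSON object with a 'token' field;
    this matches how KoboldCpp's streaming endpoint is commonly
    described, but the exact shape is an assumption here.
    """
    text = []
    for line in stream_lines:
        line = line.strip()
        if line.startswith('data:'):
            payload = json.loads(line[len('data:'):].strip())
            text.append(payload.get('token', ''))
    return ''.join(text)

# Example with canned events; in practice you would iterate the lines
# of a streaming HTTP response instead.
sample = [
    'event: message',
    'data: {"token": "Once"}',
    '',
    'data: {"token": " upon"}',
]
print(parse_sse_tokens(sample))  # -> Once upon
```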


v1.107.3

Add Vulkan GPU support for older PCs and enable pipeline parallel and flash attention by default

  • Added a new option for Vulkan (Older PC) in the oldpc builds. This provides GPU support via Vulkan without any CPU intrinsics
  • Pipeline parallel is now enabled by default in the CLI. Disable it in the launcher or with --nopipelineparallel
  • Flash attention is now enabled by default in the CLI. Disable it in the launcher or with --noflashattention
  • Fixes for mcp.json importing and MCP tool listing handshake
v1.106.2

Add MCP (Model Context Protocol) server and client support for external tools and services

  • NEW: MCP Server and Client Support Added to KoboldCpp - serves as a direct drop-in replacement for Claude Desktop
  • KoboldCpp can connect to any HTTP or STDIO MCP server, using a mcp.json config format compatible with Claude Desktop
  • Multiple servers are supported; KoboldCpp will automatically combine their tools and dispatch requests appropriately
  • CAUTION: Running ANY MCP SERVER gives it full access to your system. Be sure to only run servers you trust!
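A minimal mcp.json sketch in the Claude Desktop-compatible format described above (the filesystem server and path are illustrative; any stdio or HTTP MCP server you trust can be listed the same way):

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```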
v1.105.4

Add --gendefaults flag for API parameter overrides and introduce Adaptive-P sampler

  • --sdgendefaults has been deprecated and merged into this flag
  • StableUI SDUI: Fixed generation queue stacking, allowed requesting AVI-formatted videos, added a dismiss button
  • Minor fixes to tool calling
  • Fixed LoRA loading issues with some Qwen Image LoRAs

