
KoboldCpp: GGUF model inference with web UI

Self-contained distribution of llama.cpp with KoboldAI-compatible API server for running large language models locally on consumer hardware.

OVERALL RANK: #90 · AI & ML RANK: #41
STARS: 9.2K (+17 over the last 7 days)
FORKS: 602 (+1 over the last 7 days)
DOWNLOADS: 3

Learn more about koboldcpp

KoboldCpp is an inference engine for quantized language models that packages llama.cpp with additional features into a single executable. It runs on CPU or GPU with optional layer offloading, and serves a web interface for model interaction. The application supports GGML and GGUF model formats with backward compatibility for older model versions. Common deployment contexts include local development, cloud platforms like Google Colab and RunPod, and containerized environments via Docker.


1. Zero-Setup Single Executable

KoboldCpp ships as a self-contained executable with no installation required: download it, run it, and start using LLMs immediately. The portable design eliminates dependency management, virtual environments, and configuration headaches, making it ideal for anyone who wants to experiment with AI without wrestling with Python environments or complex build processes.
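As a rough sketch, a script can launch the downloaded binary directly. The --model and --port flags below are KoboldCpp's documented options, but the binary name and model path are placeholders:

import subprocess

# Start KoboldCpp with a local GGUF model on the default port.
# No install step: the executable is fully self-contained.
server = subprocess.Popen([
    './koboldcpp',                               # downloaded single-file binary
    '--model', 'models/llama-7b.Q4_K_M.gguf',    # placeholder model path
    '--port', '5001',                            # default KoboldAI API port
])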

2. Hardware Acceleration Support

Leverages multiple acceleration backends, including CUDA (CuBLAS) for NVIDIA GPUs, OpenCL (CLBlast) for AMD and other GPUs, and Vulkan for cross-platform GPU support. Processing can be split between CPU and GPU by offloading a configurable number of model layers, which gets the most out of mixed hardware. Quantized models (4-bit, 5-bit, 8-bit) let larger models run on consumer-grade hardware with limited VRAM.
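The same launch pattern extends to GPU offload. As a hedged sketch: --usecublas, --usevulkan, and --gpulayers are KoboldCpp flags, though which backends are available depends on the build you download:

import subprocess

# Offload 32 transformer layers to an NVIDIA GPU via CuBLAS; layers
# that do not fit in VRAM remain on the CPU. On AMD or Intel hardware,
# swap --usecublas for --usevulkan or --useclblast.
subprocess.Popen([
    './koboldcpp',
    '--model', 'models/llama-13b.Q4_K_M.gguf',  # placeholder model path
    '--usecublas',
    '--gpulayers', '32',                        # tune to available VRAM
])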

3. KoboldAI API Compatibility

Provides full compatibility with the KoboldAI ecosystem and API specification, enabling integration with popular frontends like SillyTavern, Agnaistic, and other community tools. Supports OpenAI-compatible endpoints for drop-in replacement scenarios. Includes built-in web UI for immediate text generation without additional clients, plus streaming support for real-time token generation.
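For example, a minimal call to the native generate endpoint, assuming a server is already running on the default port 5001: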


import requests

# Generate up to 100 tokens from the locally running KoboldCpp server.
response = requests.post('http://localhost:5001/api/v1/generate', json={
    'prompt': 'Once upon a time',
    'max_length': 100,       # maximum number of tokens to generate
    'temperature': 0.7,      # sampling temperature; higher = more random
})
response.raise_for_status()  # fail loudly on HTTP errors

# The KoboldAI API returns generations under 'results'.
generated_text = response.json()['results'][0]['text']
print(generated_text)
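Because the server also exposes OpenAI-compatible endpoints, the same generation can go through any OpenAI-style client. A sketch against the /v1/completions route (the field names follow the OpenAI completions schema that KoboldCpp mirrors; the model name is a placeholder):

import requests

# Same request via the OpenAI-compatible completions endpoint.
response = requests.post('http://localhost:5001/v1/completions', json={
    'model': 'koboldcpp',        # placeholder; single-model servers often ignore it
    'prompt': 'Once upon a time',
    'max_tokens': 100,           # OpenAI schema uses max_tokens, not max_length
    'temperature': 0.7,
})
print(response.json()['choices'][0]['text'])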


