whisper.cpp: C/C++ port of OpenAI's Whisper model
Lightweight, cross-platform speech recognition engine delivering real-time transcription with minimal dependencies and optimized inference performance.
Learn more about whisper.cpp
whisper.cpp is a high-performance C/C++ implementation of OpenAI's Whisper automatic speech recognition model. It provides efficient inference without Python dependencies, using optimized CPU and GPU backends including Metal, CUDA, and Vulkan. The project supports multiple quantization formats (4-bit to 16-bit) to reduce memory footprint, enables real-time streaming transcription, and offers bindings for several programming languages. It ships command-line tools, a server implementation, and mobile platform support (iOS, Android). The architecture prioritizes minimal resource consumption while maintaining accuracy, making it well suited to edge devices, embedded systems, and production deployments that need fast, offline speech-to-text.
Zero Python Dependencies
Unlike the original Whisper implementation, whisper.cpp runs entirely in C/C++ with no Python runtime required. This dramatically reduces deployment complexity, memory overhead, and startup latency. The self-contained binary approach enables seamless integration into embedded systems, mobile applications, and resource-constrained environments where Python installations are impractical or impossible.
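For a concrete picture of the self-contained approach, here is a minimal sketch using the C API from `whisper.h` (the model path and the silent placeholder buffer are assumptions; real input must be 16 kHz mono float32 PCM decoded by your own code):

```c
#include <stdio.h>
#include "whisper.h"

int main(void) {
    // load a ggml model file (path is a placeholder)
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context *ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (!ctx) return 1;

    // whisper expects 16 kHz mono float32 samples; this placeholder
    // buffer is 5 seconds of silence in lieu of real decoded audio
    static float pcm[16000 * 5];

    struct whisper_full_params wparams =
        whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    // run the full encode/decode pipeline and print each segment
    if (whisper_full(ctx, wparams, pcm, 16000 * 5) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}
```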
Real-Time Performance Optimization
Delivers highly optimized inference through hardware-accelerated backends including Apple Metal, NVIDIA CUDA, and Arm NEON intrinsics. Model quantization support (4-bit, 5-bit, 8-bit) reduces memory usage by up to 75% while maintaining transcription quality. Achieves real-time factor (RTF) below 0.5 on modern CPUs, enabling live streaming transcription and interactive voice applications.
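As one example of the quantization workflow, the repository ships a quantize tool; a sketch assuming a CMake build layout and an already-downloaded base.en model:

```sh
# produce a 5-bit (q5_0) variant of the f16 base.en model in place
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
```

The quantized file is then passed to the CLI or API exactly like the original model.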
Cross-Platform Production Ready
Supports deployment across Linux, macOS, Windows, iOS, Android, and WebAssembly with consistent APIs. Includes production-ready HTTP server implementation, language bindings (Python, Node.js, Go, Rust, Swift), and integration examples. The lightweight footprint enables edge deployment in IoT devices, desktop applications, and cloud services with straightforward build and deployment processes.
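To illustrate, a sketch of the bundled HTTP server example (binary name, port, and the sample file are assumptions; check your build output for the exact paths):

```sh
# start the HTTP server with a base.en model on the default port
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080

# transcribe a WAV file via the /inference endpoint
curl 127.0.0.1:8080/inference -F file="@samples/jfk.wav" -F response_format="json"
```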
```python
from whispercpp import Whisper

# load a pretrained English base model and transcribe a WAV file
model = Whisper.from_pretrained("base.en")
result = model.transcribe("audio.wav")
print(result["text"])
```

Patch release fixing a CPU scalar operator bug in the ggml norm implementation.
- Upgrade to resolve incorrect behavior in ggml norm CPU scalar operations that may affect inference accuracy.
- No breaking changes or new requirements; this is a drop-in bug-fix update from v1.8.1.
Patch release fixes Vulkan build failures, VAD memory leaks, and adds initial-prompt carry-over support.
- Rebuild with the Vulkan backend to resolve the compilation errors introduced in v1.8.0.
- Pass `--carry-initial-prompt` to preserve context across segments (see the sketch below); the VAD memory leaks are now patched.
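A hypothetical invocation of the new flag with whisper-cli (model path, sample file, and prompt text are placeholders):

```sh
# re-inject the initial prompt at every segment instead of only the first
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav \
  --prompt "Glossary: ggml, GGUF, Vulkan" --carry-initial-prompt
```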
Flash attention now enabled by default, delivering 30–50% faster encoding across all models on Apple Silicon.
- Expect encoder speedups of 30–50% on M1/M2 hardware with no configuration changes required.
- Benchmarks show medium-model encoding dropping from 580 ms to 395 ms on an M1 Pro with flash attention; the sketch after this list shows how embedders can still toggle the feature.
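For programs that embed the library, flash attention can still be controlled at context creation; a minimal sketch, assuming the `flash_attn` field in recent versions of `whisper_context_params`:

```c
// flash attention now defaults to enabled; opt out explicitly if a
// backend misbehaves or for A/B benchmarking
struct whisper_context_params cparams = whisper_context_default_params();
cparams.flash_attn = false; // assumption: field name per recent whisper.h
struct whisper_context *ctx =
    whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
```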
Related Repositories
Discover similar tools and frameworks used by developers
mediapipe
Graph-based framework for streaming media ML pipelines.
koboldcpp
Self-contained distribution of llama.cpp with KoboldAI-compatible API server for running large language models locally on consumer hardware.
tesseract
LSTM-based OCR engine supporting 100+ languages.
optuna
Define-by-run Python framework for automated hyperparameter tuning.
langchain
Modular framework for chaining LLMs with external data.