Whisper: General-purpose speech recognition model
Speech recognition system supporting multilingual transcription, translation, and language ID.
Whisper is a Transformer-based sequence-to-sequence model developed by OpenAI for automatic speech recognition and related tasks. The model converts audio into log-Mel spectrograms and generates text tokens autoregressively, so a single architecture handles several speech processing tasks. It comes in six sizes ranging from 39M to 1.55B parameters, with both English-only and multilingual variants available. The system processes audio in 30-second windows and can perform transcription, translation to English, spoken language identification, and voice activity detection.
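To make the 30-second windowing concrete, here is a minimal sketch in plain Python. The 16 kHz sample rate and the 160-sample spectrogram hop are assumptions taken from Whisper's reference preprocessing, not from the text above, and split_into_windows is a hypothetical helper. The real transcription loop advances its window based on predicted timestamps, so the fixed-stride split below is a simplification.

```python
# Constants assumed to match Whisper's reference preprocessing (16 kHz mono input).
SAMPLE_RATE = 16_000                      # samples per second (assumption)
CHUNK_SECONDS = 30                        # window length stated in the description
HOP_LENGTH = 160                          # spectrogram hop in samples (assumption)
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # samples per 30-second window
N_FRAMES = N_SAMPLES // HOP_LENGTH        # spectrogram frames per window


def split_into_windows(samples):
    """Split a sequence of samples into 30-second windows.

    Hypothetical helper: the final window is zero-padded so every window
    has exactly N_SAMPLES samples.
    """
    windows = []
    for start in range(0, len(samples), N_SAMPLES):
        chunk = list(samples[start:start + N_SAMPLES])
        chunk.extend([0.0] * (N_SAMPLES - len(chunk)))  # pad a short final window
        windows.append(chunk)
    return windows


# 70 seconds of silence: two full windows plus one 10-second remainder, padded.
audio = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(audio)
print(len(windows), len(windows[-1]), N_FRAMES)
```

Under these assumptions each window holds 480,000 samples and maps to 3,000 spectrogram frames, which is the fixed input length the encoder sees.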
Multitask Architecture
A single model handles transcription, translation, language identification, and voice activity detection, using special tokens as task specifiers. This replaces traditional multi-stage speech processing pipelines with a unified sequence-to-sequence approach.
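The idea of special tokens as task specifiers can be illustrated with a short sketch. The token spellings below follow Whisper's published tokenizer as far as I know them; build_decoder_prompt is a hypothetical helper written for illustration, not part of the library. It only assembles the prefix that tells the decoder which task to run.

```python
def build_decoder_prompt(task="transcribe", language="en", timestamps=False):
    """Return the special-token prefix that selects a task (illustrative only).

    Hypothetical helper: the real tokenizer maps these strings to integer ids.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    tokens = [
        "<|startoftranscript|>",  # marks the start of decoding
        f"<|{language}|>",        # language tag, e.g. <|en|> or <|fr|>
        f"<|{task}|>",            # task specifier
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # request text without timestamp tokens
    return tokens


print(build_decoder_prompt())
print(build_decoder_prompt(task="translate", language="fr", timestamps=True))
```

In the same scheme, language identification corresponds to letting the model predict the language tag itself, and voice activity detection to a dedicated no-speech token. Both are described here only at the conceptual level.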
Weak Supervision Training
Trained on large-scale diverse audio data without requiring perfectly aligned transcripts. This approach enables robust performance across various audio conditions and speaking styles.
Multiple Model Sizes
Offers six model variants from tiny (39M parameters) to large (1.55B parameters) with different speed-accuracy tradeoffs. Includes specialized English-only models and an optimized turbo variant.
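As a rough aid for choosing among the variants, the sketch below encodes the sizes in a lookup table. Only the tiny and large parameter counts come from the text above. The other figures and the set of sizes with English-only checkpoints are taken from the project's published model table and should be treated as assumptions. model_name and largest_within are hypothetical helpers.

```python
# Approximate parameter counts in millions. Only "tiny" and "large" are stated
# in the description; the rest are assumed from the project's model table.
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
    "turbo": 809,
}

# Sizes assumed to ship an English-only ".en" checkpoint.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def model_name(size, english_only=False):
    """Return the checkpoint name for a size, e.g. 'base.en' (hypothetical helper)."""
    if size not in PARAMS_MILLIONS:
        raise ValueError(f"unknown size: {size!r}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no English-only variant assumed for {size!r}")
        return f"{size}.en"
    return size


def largest_within(max_params_millions):
    """Pick the largest variant whose parameter count fits the given budget."""
    fitting = [s for s, p in PARAMS_MILLIONS.items() if p <= max_params_millions]
    if not fitting:
        raise ValueError("no variant fits the given budget")
    return max(fitting, key=PARAMS_MILLIONS.get)


print(model_name("base", english_only=True))
print(largest_within(300))
```

Parameter count is only a proxy for the speed-accuracy tradeoff. Actual memory use and throughput depend on hardware and decoding settings.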
# Install Whisper
pip install -U openai-whisper
# Install required ffmpeg dependency
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows (Chocolatey):
choco install ffmpeg
# Transcribe audio files using the turbo model
whisper audio.flac audio.mp3 audio.wav --model turbo
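The same transcription can also be driven from Python. The sketch below wraps the package's load_model and transcribe calls in a hypothetical helper, transcribe_files. It assumes openai-whisper and ffmpeg are installed as shown above, and that the listed audio files exist. The first call downloads the model weights.

```python
def transcribe_files(paths, model_name="turbo"):
    """Transcribe each file and return a {path: text} mapping (hypothetical helper)."""
    import whisper  # provided by the openai-whisper package; imported lazily

    model = whisper.load_model(model_name)  # loads (and on first use downloads) the weights
    results = {}
    for path in paths:
        result = model.transcribe(path)  # returns a dict that includes a "text" field
        results[path] = result["text"]
    return results


# Example usage, mirroring the command-line call above:
# texts = transcribe_files(["audio.flac", "audio.mp3", "audio.wav"])
# for path, text in texts.items():
#     print(path, text)
```

Loading the model once and reusing it across files avoids paying the load cost per file, which is the main practical difference from invoking the command-line tool once per file.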
Related Repositories
Discover similar tools and frameworks used by developers
Triton
Domain-specific language and compiler for writing GPU deep learning primitives with higher productivity than CUDA.
DeepFace
Python library wrapping multiple face recognition deep learning models.
LivePortrait
PyTorch implementation for animating portraits by transferring expressions from driving videos.
tiktoken
Fast BPE tokenizer for OpenAI language models.
DINOv2
PyTorch vision transformers pretrained on 142M unlabeled images.