tiktoken: BPE tokenizer for OpenAI models
Fast BPE tokenizer for OpenAI language models.
tiktoken is a tokenization library that implements byte pair encoding (BPE), a compression-style algorithm that converts text into sequences of integer tokens. The library is written in Rust with Python bindings and provides both standard encodings for OpenAI models and an extensible architecture for custom tokenizers. Tokenization is lossless and reversible, and it works on arbitrary text. It also compresses the input by mapping text to subword units: the token sequence is shorter than the underlying bytes, with each token covering about 4 bytes of text on average. The library is commonly used to count tokens for API billing, prepare text for language models, or implement custom tokenization schemes.
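To make the idea of byte pair encoding concrete, here is a minimal pure-Python sketch of the technique. It is not tiktoken's implementation: the function names (train_bpe, merge, encode, decode) are illustrative, and it skips the regex pre-splitting and special-token handling that a production tokenizer performs. It only shows how repeated pair merges shorten a byte sequence while keeping the mapping reversible.

```python
from collections import Counter


def merge(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` merges; merge number i creates token id 256 + i."""
    tokens = list(data)  # start from raw bytes (ids 0..255)
    merges: list[tuple[int, int]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:  # nothing repeats, so merging would not compress
            break
        tokens = merge(tokens, best, 256 + len(merges))
        merges.append(best)
    return merges


def encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    """Apply the learned merges, in the order they were learned, to UTF-8 bytes."""
    tokens = list(text.encode("utf-8"))
    for i, pair in enumerate(merges):
        tokens = merge(tokens, pair, 256 + i)
    return tokens


def decode(tokens: list[int], merges: list[tuple[int, int]]) -> str:
    """Expand each token id back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for i, (a, b) in enumerate(merges):
        vocab[256 + i] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


corpus = "low lower lowest low low"
merges = train_bpe(corpus.encode("utf-8"), num_merges=10)
ids = encode("lowest low", merges)
print(len("lowest low".encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges))
```

Because the base vocabulary covers all 256 byte values, any input can be encoded even if no merge applies, which is what makes the scheme work on arbitrary text.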
Rust-Backed Performance
The core is written in Rust and exposed through Python bindings rather than implemented in pure Python. The project's README reports tokenization 3 to 6 times faster than a comparable open source tokenizer, which keeps overhead low for large-scale text processing in production workloads.
Pre-Built Model Encodings
Includes built-in encodings for OpenAI models, such as o200k_base (used by gpt-4o) and cl100k_base (used by gpt-4 and gpt-3.5-turbo), so token counts computed locally can be used to estimate API usage and billing. An educational submodule provides BPE visualization tools for understanding tokenization mechanics.
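As a rough illustration of the billing use case, the sketch below turns a token count into a cost estimate and a context-window check. The helper names, the price, and the context size are all hypothetical placeholders, not values from tiktoken or any price list. The encoder is passed in as a plain callable so the example runs without tiktoken installed; the stand-in used here simply emits one token per UTF-8 byte.

```python
from typing import Callable, Sequence


def count_tokens(text: str, encode: Callable[[str], Sequence[int]]) -> int:
    """Count tokens using whatever encoder callable is supplied."""
    return len(encode(text))


def estimate_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to a cost, given a (hypothetical) per-million price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


def fits_context(num_tokens: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check that the prompt plus the reserved output budget fits in the window."""
    return num_tokens + reserved_for_output <= context_window


def byte_encoder(text: str) -> list[int]:
    """Stand-in encoder for this sketch: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


prompt = "Hello, how are you doing today?"
n = count_tokens(prompt, byte_encoder)
print("tokens:", n)
print("estimated cost (USD):", estimate_cost(n, usd_per_million_tokens=1.0))
print("fits:", fits_context(n, context_window=8192, reserved_for_output=512))
```

With tiktoken installed, the bound method tiktoken.encoding_for_model("gpt-4o").encode can be passed as the encode argument in place of byte_encoder to get counts from the model's actual encoding.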
Plugin-Based Extensibility
Supports custom tokenizer encodings through a plugin architecture. Add proprietary model tokenizers or modified encoding schemes without forking the core library, enabling experimentation with novel tokenization approaches.
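A sketch of what such a plugin module might look like, following the tiktoken_ext convention described in the project's README: a module inside a tiktoken_ext namespace package (no __init__.py) exposes an ENCODING_CONSTRUCTORS dictionary that maps encoding names to zero-argument functions returning keyword arguments for tiktoken.Encoding. The file name, the encoding name toy_bytes, the split pattern, and the special token id are assumptions made for illustration only.

```python
# Hypothetical file: tiktoken_ext/my_encodings.py
# (tiktoken_ext is a namespace package, so it should not contain __init__.py)


def toy_bytes() -> dict:
    """Return keyword arguments for constructing a byte-level toy encoding."""
    # One token per byte value and no learned merges: ranks 0..255.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "toy_bytes",
        # Illustrative split pattern: runs of non-whitespace or whitespace.
        "pat_str": r"\S+|\s+",
        "mergeable_ranks": mergeable_ranks,
        # Special tokens take ids after the mergeable ranks (assumed id).
        "special_tokens": {"<|endoftext|>": 256},
    }


# The registry that tiktoken looks for in plugin modules.
ENCODING_CONSTRUCTORS = {"toy_bytes": toy_bytes}
```

Once a package containing this module is installed, the README indicates that the encoding becomes available through tiktoken.get_encoding("toy_bytes"), with no changes to the core library.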
import tiktoken

# Load the cl100k_base encoding
encoding = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you doing today?"

# Convert the text into a list of integer token IDs
tokens = encoding.encode(text)
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")