
tiktoken: BPE tokenizer for OpenAI models

Fast BPE tokenizer for OpenAI language models.

Overall rank: #94 · AI & ML rank: #44
[30-day ranking trend chart]
Stars: 17.0K · Forks: 1.4K · Downloads: 817.3K · 7-day stars: +24 · 7-day forks: 0

Learn more about tiktoken

tiktoken is a tokenization library that implements byte pair encoding (BPE), a compression algorithm that converts text into sequences of numeric tokens. The library is written in Rust with Python bindings, providing both the standard encodings used by OpenAI models and an extensible architecture for custom tokenizers. Tokenization is lossless and works on arbitrary text: decoding a token sequence recovers the original input exactly, and because BPE maps text to subword units, one token typically represents about four bytes of text on average. The library is commonly used in applications that need to count tokens for API billing, prepare text for language models, or implement custom tokenization schemes.
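As a minimal sketch of that lossless round trip, using tiktoken's encode/decode API (the sample string is illustrative):

import tiktoken

# Load one of the standard encodings; cl100k_base is used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Byte pair encoding is lossless and reversible."
tokens = enc.encode(text)      # text -> list of integer token ids
restored = enc.decode(tokens)  # token ids -> original text

assert restored == text        # decode(encode(x)) == x for arbitrary text
print(f"{len(tokens)} tokens for {len(text.encode('utf-8'))} bytes of text")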


1

Rust-Backed Performance

Written in Rust with Python bindings rather than pure Python, delivering significantly faster tokenization than transformers library implementations. Handles large-scale text processing with minimal overhead for production workloads.
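A brief sketch of pushing larger batches through the Rust core from Python; encode_batch is part of the Encoding API, though the document list here is made up:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Illustrative corpus; in practice these would be your documents.
docs = ["first document", "second document", "third document"] * 1000

# encode_batch tokenizes a whole list of strings in one call,
# letting the Rust core do the heavy lifting.
token_lists = enc.encode_batch(docs)
total = sum(len(t) for t in token_lists)
print(f"{total} tokens across {len(docs)} documents")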

2

Pre-Built Model Encodings

Ships the standard OpenAI encodings (such as o200k_base, used by GPT-4o, and cl100k_base, used by GPT-4 and GPT-3.5-turbo), so token counts match what the API bills for. An educational submodule provides a plain-Python BPE implementation and visualization tools for understanding tokenization mechanics.
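A short sketch of the model-name lookup and the educational submodule; encoding_for_model and SimpleBytePairEncoding follow the documented usage, and the sample strings are illustrative:

import tiktoken
from tiktoken._educational import SimpleBytePairEncoding

# Resolve the encoding that matches a model name, so counts line up with billing.
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.name)  # o200k_base
print(len(enc.encode("How many tokens is this?")))

# The educational submodule reimplements BPE in plain Python and shows
# the merge steps as it encodes, which makes the mechanics visible.
simple = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
simple.encode("hello world")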

3

Plugin-Based Extensibility

Supports custom tokenizer encodings through a plugin architecture. Add proprietary model tokenizers or modified encoding schemes without forking the core library, enabling experimentation with novel tokenization approaches.
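A sketch of defining a custom encoding on top of cl100k_base, following the pattern from tiktoken's documentation; the encoding name, extra special tokens, and their ids are illustrative:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Reuse the base regex and merge ranks, but register two extra special tokens
# with ids placed just above the base vocabulary.
enc = tiktoken.Encoding(
    name="cl100k_custom",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)

print(enc.encode("<|im_start|>hi<|im_end|>", allowed_special="all"))

To ship such an encoding as a plugin rather than constructing it inline, its constructor can be exposed from a module under the tiktoken_ext namespace package so tiktoken's registry discovers it without any change to the core library.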


import tiktoken

# Load the cl100k_base encoding (used by GPT-4 and GPT-3.5-turbo).
encoding = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you doing today?"
tokens = encoding.encode(text)  # list of integer token ids

print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")


v0.12.0

Release notes do not specify breaking changes, new requirements, or feature details for this version.

  • Review the commit history or changelog file directly to identify actual changes before upgrading.
  • Test thoroughly in a staging environment as the scope of modifications is undocumented.
v0.11.0

Release notes do not specify breaking changes, new requirements, or feature additions for this version.

  • Review the commit history or changelog file directly to identify changes before upgrading production systems.
  • Test thoroughly in staging as the scope of modifications and potential compatibility impacts are undocumented.
v0.9.0

Release notes do not specify breaking changes, requirements, or new features for this version.

  • Review the commit history or changelog file directly to identify changes before upgrading.
  • Test thoroughly in a non-production environment as impact and compatibility are undocumented.

