Tesseract OCR: Open source optical character recognition engine
LSTM-based OCR engine supporting 100+ languages.
Tesseract is an open-source optical character recognition engine that converts images containing text into machine-readable character data. The system employs a Long Short-Term Memory neural network architecture as its primary recognition engine, processing text line images through multiple layers that analyze character patterns and linguistic context to produce accurate transcriptions. It maintains a modular design that supports over 100 languages through trained data models, processes standard image formats, and generates structured output in multiple document formats including searchable PDFs and XML-based representations. The engine implements a multi-stage pipeline that performs image preprocessing, layout analysis to detect text regions, line segmentation, and finally character recognition through the neural network. Originally developed at Hewlett-Packard and later maintained by Google, it balances recognition accuracy with processing speed by leveraging both statistical language models and neural network predictions.
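The layout-analysis stage described above can be adjusted from the command line through the page segmentation mode (`--psm`). The following is a minimal sketch, not the project's own tooling: `build_psm_command` is a hypothetical helper that only assembles the argument list, and the file name `line.png` is a placeholder. The mode numbers are the ones printed by `tesseract --help-psm`.

```python
# Subset of page segmentation modes listed by `tesseract --help-psm`.
PSM_AUTO = 3          # fully automatic page segmentation (the default)
PSM_SINGLE_BLOCK = 6  # assume a single uniform block of text
PSM_SINGLE_LINE = 7   # treat the image as a single text line


def build_psm_command(image_path, psm=PSM_AUTO, output_base="stdout"):
    """Hypothetical helper: return the argv list for one tesseract run.

    Passing "stdout" as the output base makes tesseract print the result
    instead of writing a file.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be in 0..13, got {psm}")
    return ["tesseract", image_path, output_base, "--psm", str(psm)]


# Example: an image that is already cropped to one line of text.
print(build_psm_command("line.png", PSM_SINGLE_LINE))
```

The returned list can be handed to `subprocess.run` on a machine where the `tesseract` binary is installed. Choosing a narrower mode such as single line is usually only worthwhile when the input is already cropped, since it bypasses most of the automatic region detection.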
Dual Recognition Engines
Includes both LSTM neural network and legacy pattern recognition engines with runtime switching via --oem flag. Enables modern accuracy while maintaining compatibility with older trained models and specialized use cases.
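As an illustration of the runtime switch, the sketch below assembles a command line for each engine mode. The helper name `build_oem_command` and the file name `document.png` are assumptions made for this example; the numeric values are those listed by `tesseract --help-oem`. Note that the legacy engine is only usable when the installed traineddata file actually contains legacy models.

```python
# Engine modes as listed by `tesseract --help-oem`.
OEM_LEGACY = 0    # legacy pattern recognition engine only
OEM_LSTM = 1      # LSTM neural network engine only
OEM_COMBINED = 2  # legacy and LSTM engines together
OEM_DEFAULT = 3   # whatever the installed traineddata supports


def build_oem_command(image_path, oem=OEM_DEFAULT, output_base="stdout"):
    """Hypothetical helper: return the argv list selecting an OCR engine."""
    if oem not in (OEM_LEGACY, OEM_LSTM, OEM_COMBINED, OEM_DEFAULT):
        raise ValueError(f"unknown engine mode: {oem}")
    return ["tesseract", image_path, output_base, "--oem", str(oem)]


# Show the command line for each engine mode.
for mode in (OEM_LEGACY, OEM_LSTM, OEM_COMBINED, OEM_DEFAULT):
    print(" ".join(build_oem_command("document.png", mode)))
```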
100+ Language Support
Recognizes text in over 100 languages out-of-the-box using pre-trained data files. Custom language training supported through documented pipeline for specialized fonts, domains, or historical scripts.
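Languages are selected with the `-l` option, and several can be combined with `+`. The sketch below builds that argument; `language_args` is a hypothetical helper, and it assumes the matching traineddata files (here `eng` and `deu`) are installed, which `tesseract --list-langs` can confirm.

```python
def language_args(*codes):
    """Hypothetical helper: build the `-l` argument for one or more languages.

    Codes are traineddata names such as "eng" or "deu"; tesseract accepts
    several of them joined with "+".
    """
    if not codes:
        raise ValueError("at least one language code is required")
    return ["-l", "+".join(codes)]


# Command line for a document that mixes English and German text.
command = ["tesseract", "document.png", "stdout", *language_args("eng", "deu")]
print(" ".join(command))
```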
Multiple Output Formats
Generates plain text, hOCR with positioning data, searchable PDFs, TSV structured output, and PAGE/ALTO XML. Integrates directly into document processing workflows without format conversion layers.
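On the command line, output formats are requested by appending config names after the output base, and several can be produced in one run. The following sketch is illustrative only: the helper names and the `OUTPUT_CONFIGS` table are assumptions for this example, and PAGE XML is left out because its config name is not confirmed here.

```python
# Config names passed on the command line, mapped to the file suffix
# tesseract appends to the output base (illustrative subset).
OUTPUT_CONFIGS = {
    "txt": ".txt",
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "alto": ".xml",
}


def build_export_command(image_path, output_base, formats):
    """Hypothetical helper: argv that renders several formats in one run."""
    unknown = [name for name in formats if name not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError(f"unsupported output formats: {unknown}")
    # Config names go last, after the image path and the output base.
    return ["tesseract", image_path, output_base, *formats]


def expected_outputs(output_base, formats):
    """Return the file names the run is expected to create."""
    return [output_base + OUTPUT_CONFIGS[name] for name in formats]


print(build_export_command("scan.png", "scan", ["txt", "pdf", "hocr"]))
print(expected_outputs("scan", ["txt", "pdf", "hocr"]))
```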
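# Requires the pytesseract wrapper, Pillow, and a Tesseract binary on PATH.
import pytesseract
from PIL import Image
# Load the page image and run OCR with the default language and engine settings.
image = Image.open('document.png')
text = pytesseract.image_to_string(image)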
print(text)
Patch release fixing colormap handling, ALTO XML duplicate IDs, random number generation, and crashes in binarization; adds `-c` CLI parameter support.
- Update colormap processing and fix pixSauvolaBinarizeTiled crashes to avoid errors on certain image inputs.
- Use `-c` parameter to initialize config vectors at runtime; ALTO output now avoids duplicate IDs across multi-page documents.
Removes TensorFlow support and modernizes internals; no breaking changes to core OCR APIs reported.
- Remove TensorFlow dependencies if present; neural-net training workflows relying on TF will break.
- Use symbolic values for --oem and --psm flags (e.g., names instead of integers) for clearer CLI invocations.
Fixes a floating-point overflow regression, introduced in 5.4.0, that broke legacy and mixed OCR models.
- Upgrade immediately if using legacy or mixed models; 5.4.0 caused crashes due to FP overflow in NormEvidenceOf.
- No configuration changes required; beyond the overflow fix, the patch contains only code-quality fixes and build improvements.