
Tesseract OCR: Open source optical character recognition engine

LSTM-based OCR engine supporting 100+ languages.

Rankings: #113 overall · #50 in AI & ML
Stars: 71.8K · Forks: 10.5K · Downloads: 110 · 7-day stars: +45 · 7-day forks: 0

Learn more about tesseract

Tesseract is an open-source optical character recognition (OCR) engine that converts images containing text into machine-readable character data. Its primary recognition engine is a Long Short-Term Memory (LSTM) neural network that passes text line images through multiple layers, using character patterns and linguistic context to produce a transcription. Recognition runs as a multi-stage pipeline: image preprocessing, layout analysis to detect text regions, line segmentation, and finally character recognition by the neural network.

The design is modular. Trained data models provide support for over 100 languages, input can be any standard image format, and output is available in several structured document formats, including searchable PDFs and XML-based representations. Tesseract was originally developed at Hewlett-Packard and later maintained by Google. It balances recognition accuracy against processing speed by combining statistical language models with neural network predictions.
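The layout analysis stage can be steered from the command line with the `--psm` (page segmentation mode) option. The sketch below only builds an argument list. The `PageSegMode` enum and `psm_args` helper are illustrative names rather than part of any Tesseract binding, and the file names are hypothetical.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """A few of the values accepted by --psm (see `tesseract --help-psm`)."""

    AUTO = 3          # fully automatic page segmentation, the CLI default
    SINGLE_BLOCK = 6  # assume a single uniform block of text
    SINGLE_LINE = 7   # treat the image as a single text line
    SPARSE_TEXT = 11  # find as much text as possible, in no particular order


def psm_args(mode: PageSegMode) -> list[str]:
    """Return the CLI arguments that select a page segmentation mode."""
    return ["--psm", str(int(mode))]


# Hypothetical input image; the result would be written to line.txt.
command = ["tesseract", "line.png", "line", *psm_args(PageSegMode.SINGLE_LINE)]
print(command)
```

Passing the list to `subprocess.run(command, check=True)` would run the OCR, assuming the `tesseract` binary is on the PATH. Choosing a narrower mode such as `SINGLE_LINE` skips most of the layout analysis when the input is already cropped.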


1. Dual Recognition Engines

Includes both an LSTM neural network engine and a legacy pattern recognition engine, selectable at runtime with the --oem flag. This provides the accuracy of the LSTM engine while keeping compatibility with older trained models and specialized use cases.
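As a minimal sketch of engine switching, the helper below builds a `tesseract` command line with an explicit `--oem` value. The constant names and the `ocr_command` and `run_ocr` functions are assumptions made for illustration. The numeric values are the ones listed by `tesseract --help-oem`.

```python
import subprocess

# Engine modes accepted by --oem.
OEM_LEGACY = 0       # legacy engine only
OEM_LSTM = 1         # LSTM neural network only
OEM_LEGACY_LSTM = 2  # legacy and LSTM combined
OEM_DEFAULT = 3      # default, based on what the trained data provides


def ocr_command(image_path: str, output_base: str, oem: int = OEM_DEFAULT) -> list[str]:
    """Build a tesseract command that pins the recognition engine."""
    if oem not in (OEM_LEGACY, OEM_LSTM, OEM_LEGACY_LSTM, OEM_DEFAULT):
        raise ValueError(f"unsupported --oem value: {oem}")
    return ["tesseract", image_path, output_base, "--oem", str(oem)]


def run_ocr(image_path: str, output_base: str, oem: int = OEM_DEFAULT) -> None:
    """Run the command; requires the tesseract binary on the PATH."""
    subprocess.run(ocr_command(image_path, output_base, oem), check=True)
```

For example, `run_ocr("scan.png", "scan", OEM_LSTM)` would write `scan.txt` using only the LSTM engine. The legacy modes work only when the installed trained data includes a legacy model, which is not the case for every published data set.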

2. 100+ Language Support

Recognizes text in over 100 languages out of the box using pre-trained data files. Custom language training is supported through a documented pipeline for specialized fonts, domains, or historical scripts.
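Languages are chosen on the command line with `-l`, and several can be combined with `+`. The sketch below assumes the matching `.traineddata` files are installed. `language_args` is an illustrative helper, and the directory in the usage note is hypothetical.

```python
from typing import Optional, Sequence


def language_args(languages: Sequence[str], tessdata_dir: Optional[str] = None) -> list[str]:
    """Build the -l argument, optionally pointing at a custom tessdata directory."""
    if not languages:
        raise ValueError("at least one language code is required")
    args = ["-l", "+".join(languages)]
    if tessdata_dir is not None:
        # --tessdata-dir tells tesseract where to look for *.traineddata files.
        args = ["--tessdata-dir", tessdata_dir, *args]
    return args


# English and German in one pass, using the default tessdata location.
print(language_args(["eng", "deu"]))
```

A custom-trained model would be used the same way, for instance `language_args(["myfont"], "/opt/tessdata")`, where both the model name and the path are placeholders.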

3. Multiple Output Formats

Generates plain text, hOCR with positioning data, searchable PDFs, TSV structured output, and PAGE/ALTO XML. Integrates directly into document processing workflows without format conversion layers.
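On the command line, output formats are selected by naming one or more config files after the output base. The sketch below covers the config names I am confident of. PAGE XML is left out because its config name is not confirmed here. The helper names and file names are illustrative.

```python
# Config names accepted after the output base, mapped to the file
# extension each renderer is expected to write.
OUTPUT_EXTENSIONS = {
    "txt": ".txt",    # plain text
    "hocr": ".hocr",  # HTML with word and line bounding boxes
    "pdf": ".pdf",    # searchable PDF
    "tsv": ".tsv",    # tab-separated table of words and boxes
    "alto": ".xml",   # ALTO XML
}


def export_command(image_path: str, output_base: str, formats: list[str]) -> list[str]:
    """Build one tesseract call that writes every requested format."""
    unknown = [name for name in formats if name not in OUTPUT_EXTENSIONS]
    if unknown:
        raise ValueError(f"unknown output format(s): {unknown}")
    return ["tesseract", image_path, output_base, *formats]


def expected_files(output_base: str, formats: list[str]) -> list[str]:
    """List the files the command above should produce."""
    return [output_base + OUTPUT_EXTENSIONS[name] for name in formats]


print(export_command("invoice.png", "invoice", ["pdf", "hocr"]))
print(expected_files("invoice", ["pdf", "hocr"]))
```

Requesting several formats in one call means the image is recognized once and rendered several times, which is usually cheaper than running the OCR once per format.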


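# Requires the pytesseract and Pillow packages plus a local Tesseract install.
import pytesseract
from PIL import Image

# Load the page image and run OCR with the default language and engine.
image = Image.open('document.png')
text = pytesseract.image_to_string(image)
print(text)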

v5.5.1

Patch release fixing colormap handling, ALTO XML duplicate IDs, random number generation, and crashes in binarization; adds `-c` CLI parameter support.

  • Updates colormap processing and fixes crashes in pixSauvolaBinarizeTiled, avoiding errors on certain image inputs.
  • Use `-c` parameter to initialize config vectors at runtime; ALTO output now avoids duplicate IDs across multi-page documents.
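
The general `-c VAR=VALUE` form of the CLI can be sketched as follows. This is a hedged illustration of how configuration variables are passed, not a description of what changed in this release. `config_var_args` is an illustrative helper, and the two variables shown are common examples chosen for the sketch.

```python
def config_var_args(variables: dict[str, str]) -> list[str]:
    """Turn a mapping of config variables into repeated -c VAR=VALUE arguments."""
    args: list[str] = []
    for name, value in variables.items():
        if "=" in name or not name:
            raise ValueError(f"invalid variable name: {name!r}")
        args += ["-c", f"{name}={value}"]
    return args


# Restrict recognition to digits and keep spacing between words.
settings = {
    "tessedit_char_whitelist": "0123456789",
    "preserve_interword_spaces": "1",
}
command = ["tesseract", "meter.png", "meter", *config_var_args(settings)]
print(command)
```

Each variable gets its own `-c` flag, so the order of the mapping is the order of the arguments on the command line.
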
v5.5.0

Removes TensorFlow support and modernizes internals; no breaking changes to core OCR APIs reported.

  • TensorFlow support is removed: drop any TensorFlow dependencies, since neural-net training workflows that rely on TF will break.
  • The --oem and --psm flags now accept symbolic values (names instead of integers) for clearer CLI invocations.
v5.4.1

Fixes a floating-point overflow regression, introduced in 5.4.0, that broke legacy and mixed OCR models.

  • Upgrade immediately if using legacy or mixed models; 5.4.0 caused crashes due to FP overflow in NormEvidenceOf.
  • No configuration changes are required; apart from the overflow fix, the patch contains only code-quality fixes and build improvements.
