PaddleOCR: Optical character recognition and document parsing

Multilingual OCR toolkit with document structure extraction.

Stars: 67.7K (+187 in the last 7 days)
Forks: 9.6K (+19 in the last 7 days)
Downloads: 124

Learn more about PaddleOCR

PaddleOCR is an optical character recognition system implemented in Python using the PaddlePaddle deep learning framework. It combines text detection and recognition models to process document images end-to-end, extracting both raw text and structured layout information. The toolkit includes pre-trained models for multiple languages, handwriting detection, and document structure analysis (tables, forms, key-value pairs). Common deployment scenarios include document digitization pipelines, PDF extraction for RAG systems, and integration with language models for document understanding tasks.


1. Multi-Language Pre-Trained Models

Ships with production-ready models for 100+ languages including CJK, Arabic, and Latin scripts. Eliminates cold-start training and dataset collection for most deployment scenarios.
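
As a rough illustration of how a caller might choose one of the pre-trained models, the sketch below maps a script family to a `lang` code before constructing the recognizer. The `SCRIPT_TO_LANG` table and the `pick_lang` and `build_ocr` helpers are hypothetical names introduced here. The codes shown have been accepted by PaddleOCR releases, but the supported list varies by version, so treat the table as an assumption to check against the installed release.

```python
# Hypothetical helper: choose a PaddleOCR `lang` code from a script family.
# The table is illustrative and incomplete; verify codes for your version.
SCRIPT_TO_LANG = {
    "chinese": "ch",
    "english": "en",
    "japanese": "japan",
    "korean": "korean",
    "arabic": "arabic",
    "cyrillic": "cyrillic",
    "devanagari": "devanagari",
}


def pick_lang(script: str, default: str = "en") -> str:
    """Return the lang code for a script family, falling back to `default`."""
    return SCRIPT_TO_LANG.get(script.strip().lower(), default)


def build_ocr(script: str):
    """Construct a PaddleOCR instance for the given script (imports lazily)."""
    from paddleocr import PaddleOCR  # requires `pip install paddleocr`

    return PaddleOCR(lang=pick_lang(script))
```

Calling `build_ocr("japanese")` would download the matching model on first use; `pick_lang` alone can be exercised without PaddleOCR installed.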

2. Modular Detection-Recognition Pipeline

Decouples text localization from character recognition into swappable components. Enables per-region model selection and independent optimization of detection versus recognition accuracy.
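
The decoupling described above can be sketched in a few lines. This is not PaddleOCR's internal API: `run_pipeline`, `OcrLine`, and the toy detector and recognizers are hypothetical stand-ins, used only to show how a detector and a per-region recognizer can be swapped independently.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class OcrLine:
    box: Box
    text: str
    confidence: float


# A detector finds text regions; a recognizer reads one region.
Detector = Callable[[Any], List[Box]]
Recognizer = Callable[[Any, Box], Tuple[str, float]]


def run_pipeline(image: Any, detect: Detector,
                 pick_recognizer: Callable[[Box], Recognizer]) -> List[OcrLine]:
    """Detect regions, then read each one with a recognizer chosen per region."""
    lines = []
    for box in detect(image):
        recognize = pick_recognizer(box)
        text, confidence = recognize(image, box)
        lines.append(OcrLine(box, text, confidence))
    return lines


# Toy stand-ins so the sketch runs without any model weights.
def toy_detect(image: Any) -> List[Box]:
    return [(0, 0, 300, 20), (0, 30, 40, 50)]


def wide_recognizer(image: Any, box: Box) -> Tuple[str, float]:
    return ("wide line", 0.97)


def narrow_recognizer(image: Any, box: Box) -> Tuple[str, float]:
    return ("narrow line", 0.88)


def pick_by_width(box: Box) -> Recognizer:
    # Route regions wider than 100 px to one model and the rest to another.
    return wide_recognizer if box[2] - box[0] > 100 else narrow_recognizer


lines = run_pipeline(None, toy_detect, pick_by_width)
print([line.text for line in lines])  # ['wide line', 'narrow line']
```

Because the detector and recognizers are plain callables, either side can be replaced or benchmarked without touching the other.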

3. Document Structure Extraction

Parses tables, forms, and key-value pairs beyond raw text output. Produces structured JSON suitable for direct ingestion into RAG pipelines or database workflows.
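
To make the "structured JSON" idea concrete, here is a minimal sketch that turns recognized lines into JSON records a retrieval pipeline could ingest. It assumes each line has the `[box, (text, confidence)]` layout with a four-point box; the record schema, the `lines_to_records` helper, and the sample data are made up for illustration and are not the schema PaddleOCR's structure pipelines emit.

```python
import json
from typing import Any, Dict, List, Sequence


def lines_to_records(lines: Sequence[Any], source: str,
                     min_confidence: float = 0.5) -> List[Dict[str, Any]]:
    """Convert [box, (text, confidence)] lines into flat JSON-ready records."""
    records = []
    for box, (text, confidence) in lines:
        if confidence < min_confidence:
            continue  # drop lines the recognizer was unsure about
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        records.append({
            "source": source,
            "text": text,
            "confidence": round(float(confidence), 4),
            # Axis-aligned bounding box: [x_min, y_min, x_max, y_max].
            "bbox": [min(xs), min(ys), max(xs), max(ys)],
        })
    return records


# Made-up sample lines in the assumed layout.
sample = [
    [[[10, 10], [200, 10], [200, 40], [10, 40]], ("Invoice #1234", 0.98)],
    [[[10, 50], [120, 50], [120, 80], [10, 80]], ("T0tal", 0.31)],
]
payload = json.dumps(lines_to_records(sample, "invoice.jpg"), indent=2)
print(payload)
```

Keeping the source file name and bounding box on every record lets a downstream system cite where a retrieved passage came from.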


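# Written against the 2.x-style API; newer releases may deprecate ocr() and
# the cls argument, so check the documentation for your installed version.
from paddleocr import PaddleOCR

# Load English models and enable the angle classifier for rotated text lines.
ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('invoice.jpg', cls=True)

# result[0] holds the lines for the first image; it can be None if no text is found.
for line in result[0] or []:
    # Each line is [box, (text, confidence)].
    text, confidence = line[1]
    print(f"{text} (confidence: {confidence:.2f})")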
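
The example above reads each line as `[box, (text, confidence)]`. A common follow-up step is to drop low-confidence lines and put the rest into top-to-bottom, left-to-right order. The sketch below does that with a simple sort on each box's top-left corner; `to_reading_order` is a hypothetical helper, the sample data is invented, and the heuristic assumes a single-column page.

```python
from typing import List, Optional, Sequence, Tuple


def to_reading_order(page: Optional[Sequence],
                     min_confidence: float = 0.0) -> List[Tuple[str, float]]:
    """Filter lines by confidence and sort them top-to-bottom, left-to-right."""
    if not page:
        return []  # covers both None and an empty page
    kept = [line for line in page if line[1][1] >= min_confidence]
    # Sort by the smallest y of the box, then by the smallest x.
    kept.sort(key=lambda line: (min(p[1] for p in line[0]),
                                min(p[0] for p in line[0])))
    return [(line[1][0], line[1][1]) for line in kept]


# Invented sample lines in the assumed [box, (text, confidence)] layout.
sample_page = [
    [[[10, 60], [90, 60], [90, 80], [10, 80]], ("second", 0.95)],
    [[[10, 10], [90, 10], [90, 30], [10, 30]], ("first", 0.99)],
    [[[10, 110], [90, 110], [90, 130], [10, 130]], ("smudge", 0.20)],
]
print(to_reading_order(sample_page, min_confidence=0.5))
# [('first', 0.99), ('second', 0.95)]
```

Multi-column or rotated layouts need a more careful ordering than this corner-based sort.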

v3.3.1

Patch release fixing a broken document preprocessing switch in PP-StructureV3 and PaddleOCR-VL; adds offline deployment guidance.

  • Update to restore document image preprocessing functionality that was previously ignored in PP-StructureV3 and PaddleOCR-VL pipelines.
  • Consult new offline environment setup instructions if deploying PaddleOCR-VL without internet access.

v3.3.0

Introduces PaddleOCR-VL, a 0.9B-parameter vision-language model for document parsing with 109-language support, plus PP-OCRv5 multilingual recognition (2M params, 40%+ accuracy gain).

  • Deploy PaddleOCR-VL-0.9B from HuggingFace for SOTA element recognition (text, tables, formulas, charts) across 109 languages with low resource usage.
  • Upgrade to PP-OCRv5 recognition models for Latin, Cyrillic, Arabic, Devanagari, Telugu, and Tamil scripts with 2M parameters and 40%+ accuracy improvements.

v3.2.0

Adds PP-OCRv5 English/Thai/Greek models (11% English improvement), requires PaddlePaddle 3.1.0/3.1.1, splits core/optional deps, and upgrades C++ deployment to Linux/Windows parity with Python.

  • Upgrade to PaddlePaddle 3.1.0 or 3.1.1; install only core dependencies for basic OCR, add optional packages for document parsing as needed.
  • Deploy PP-OCRv5 via upgraded C++ solution (Linux/Windows) with CUDA 12 support and choice of Paddle Inference or ONNX Runtime backends.
