PaddleOCR: Optical character recognition and document parsing
Multilingual OCR toolkit with document structure extraction.
PaddleOCR is an optical character recognition system implemented in Python using the PaddlePaddle deep learning framework. It combines text detection and recognition models to process document images end-to-end, extracting both raw text and structured layout information. The toolkit includes pre-trained models for multiple languages, handwriting detection, and document structure analysis (tables, forms, key-value pairs). Common deployment scenarios include document digitization pipelines, PDF extraction for RAG systems, and integration with language models for document understanding tasks.
Multi-Language Pre-Trained Models
Ships with production-ready models for 100+ languages including CJK, Arabic, and Latin scripts. Eliminates cold-start training and dataset collection for most deployment scenarios.
Modular Detection-Recognition Pipeline
Decouples text localization from character recognition into swappable components. Enables per-region model selection and independent optimization of detection versus recognition accuracy.
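The decoupling described above can be illustrated with a minimal sketch. This is not PaddleOCR's internal API: `Region`, `run_pipeline`, `fake_detect`, and `fake_recognize` are hypothetical names, and the box format is an assumption chosen for brevity. The point is only that detection and recognition sit behind separate callables, so either one can be replaced without touching the other.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[int, int, int, int]

@dataclass
class Region:
    box: Box
    text: str
    confidence: float

# A detector maps an image to boxes; a recognizer maps (image, box) to (text, confidence).
Detector = Callable[[object], List[Box]]
Recognizer = Callable[[object, Box], Tuple[str, float]]

def run_pipeline(image: object, detect: Detector, recognize: Recognizer) -> List[Region]:
    """Run detection once, then recognize each detected box independently."""
    regions = []
    for box in detect(image):
        text, confidence = recognize(image, box)
        regions.append(Region(box, text, confidence))
    return regions

# Stub components standing in for real models (illustration only).
def fake_detect(image: object) -> List[Box]:
    return [(0, 0, 100, 20), (0, 30, 100, 50)]

def fake_recognize(image: object, box: Box) -> Tuple[str, float]:
    return (f"line at y={box[1]}", 0.9)

regions = run_pipeline(None, fake_detect, fake_recognize)
for region in regions:
    print(region)
```

Because `recognize` receives the box, a wrapper could dispatch to a different recognizer per region (for example by region size or position), which is one way to read the per-region model selection mentioned above.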
Document Structure Extraction
Parses tables, forms, and key-value pairs beyond raw text output. Produces structured JSON suitable for direct ingestion into RAG pipelines or database workflows.
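As a rough sketch of the ingestion step, the snippet below serializes recognized lines to JSON. It assumes each line arrives as a `[box, (text, confidence)]` pair, where `box` is a list of four corner points. The helper `lines_to_records`, the `min_confidence` threshold, and the record field names are hypothetical, and the sample data is made up for illustration. It covers plain text lines only, not the table or key-value output of the structure models.

```python
import json

def lines_to_records(lines, min_confidence=0.5):
    """Convert [box, (text, confidence)] pairs into JSON-serializable dicts.

    Lines below min_confidence are dropped (hypothetical threshold).
    """
    records = []
    for box, (text, confidence) in lines:
        if confidence < min_confidence:
            continue
        records.append({
            "text": text,
            "confidence": round(float(confidence), 4),
            # Cast coordinates to plain floats so json.dumps accepts them
            "box": [[float(x), float(y)] for x, y in box],
        })
    return records

# Made-up sample in the assumed shape (not real OCR output).
sample_lines = [
    [[[10, 10], [200, 10], [200, 40], [10, 40]], ("Invoice No. 1042", 0.98)],
    [[[10, 60], [120, 60], [120, 90], [10, 90]], ("T0tal", 0.31)],
]

# ensure_ascii=False keeps non-Latin scripts readable in the output
payload = json.dumps(lines_to_records(sample_lines), ensure_ascii=False)
print(payload)
```

Filtering on confidence before serialization keeps low-quality lines out of a downstream index; the right threshold depends on the documents and is best tuned on real data.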
from paddleocr import PaddleOCR

# Load the English models with the text-angle classifier enabled
ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('invoice.jpg', cls=True)

# result[0] holds the lines for the first image; each line is [box, (text, confidence)]
for line in result[0]:
    text = line[1][0]
    confidence = line[1][1]
    print(f"{text} (confidence: {confidence:.2f})")

Patch release fixing a broken document preprocessing switch in PP-StructureV3 and PaddleOCR-VL; adds offline deployment guidance.
- Update to restore the document image preprocessing switch, which the PP-StructureV3 and PaddleOCR-VL pipelines previously ignored.
- Consult the new offline environment setup instructions if deploying PaddleOCR-VL without internet access.
Introduces PaddleOCR-VL, a 0.9B-parameter vision-language model for document parsing with 109-language support, plus PP-OCRv5 multilingual recognition (2M params, 40%+ accuracy gain).
- Deploy PaddleOCR-VL-0.9B from HuggingFace for SOTA element recognition (text, tables, formulas, charts) across 109 languages with low resource usage.
- Upgrade to PP-OCRv5 recognition models for Latin, Cyrillic, Arabic, Devanagari, Telugu, and Tamil scripts with 2M parameters and 40%+ accuracy improvements.
Adds PP-OCRv5 English, Thai, and Greek models (11% improvement for English), requires PaddlePaddle 3.1.0 or 3.1.1, splits dependencies into core and optional sets, and brings the C++ deployment on Linux and Windows to parity with Python.
- Upgrade to PaddlePaddle 3.1.0 or 3.1.1; install only the core dependencies for basic OCR and add optional packages for document parsing as needed.
- Deploy PP-OCRv5 via the upgraded C++ solution (Linux/Windows) with CUDA 12 support and a choice of Paddle Inference or ONNX Runtime backends.