Docling: Document parsing for generative AI
Fast document parser for RAG and AI workflows.
Docling is a Python library for parsing and converting documents across multiple formats into structured representations suitable for AI applications. It uses layout analysis models and OCR to extract content from PDFs, scanned documents, and other file types, producing a unified DoclingDocument format. The tool supports local execution for sensitive data processing and includes integrations with frameworks like LangChain, LlamaIndex, and Haystack. Common deployment contexts include document preprocessing pipelines for retrieval-augmented generation, knowledge extraction workflows, and document conversion services.
Multi-Format Processing Pipeline
Handles PDF, DOCX, PPTX, XLSX, HTML, audio, and image formats through a single interface. Format-specific handling, such as PDF layout analysis and OCR for scanned documents, is built in, so no separate tools are needed.
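In a batch job it can help to sort input files before handing them to a converter. The sketch below is one way to do that using only the standard library. The helper name `partition_by_suffix` and the suffix set are illustrative assumptions derived from the document formats named above. Image and audio suffixes are left out because the text does not list them.

```python
from pathlib import Path

# Assumption: suffixes inferred from the document formats named in the text.
# Extend this set for image and audio inputs as your deployment requires.
DOCUMENT_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def partition_by_suffix(paths):
    """Split paths into (supported, skipped) by case-insensitive file suffix."""
    supported, skipped = [], []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() in DOCUMENT_SUFFIXES:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


supported, skipped = partition_by_suffix(["report.PDF", "slides.pptx", "notes.txt"])
print("convert:", [p.name for p in supported])
print("skip:", [p.name for p in skipped])
```

Each path in `supported` could then be passed to `converter.convert(...)` as in the quick-start snippet. A suffix check is only a cheap pre-filter and says nothing about whether the file contents are valid.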
Unified Document Structure
Converts all input formats into a consistent DoclingDocument representation with standardized extraction APIs. Exports to Markdown, HTML, JSON, or DocTags without format-specific parsing logic.
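Because every input ends up as the same document object, one export routine can serve all formats. The following is a minimal sketch, assuming the document exposes `export_to_markdown()`, `export_to_html()`, and `export_to_dict()`; check these names against your installed Docling version. The helper `export_all` is hypothetical, and DocTags export is omitted here.

```python
import json
from pathlib import Path


def export_all(document, out_dir, stem):
    """Write Markdown, HTML, and JSON renderings of one converted document.

    `document` is expected to behave like a DoclingDocument, that is, to
    expose export_to_markdown(), export_to_html(), and export_to_dict().
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    renderings = {
        ".md": document.export_to_markdown(),
        ".html": document.export_to_html(),
        ".json": json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
    }

    written = []
    for suffix, text in renderings.items():
        target = out / f"{stem}{suffix}"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Usage with a real conversion result (requires Docling to be installed):
#   result = DocumentConverter().convert("document.pdf")
#   export_all(result.document, "out", "document")
```

The function relies only on the export methods, so it does not need to know which input format produced the document.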
Local-First Architecture
Processes all documents locally without sending data to external services. Sensitive documents remain on-premises throughout the parsing pipeline, which helps regulated industries and privacy-conscious organizations meet their compliance requirements.
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text)

Switches OCR engine default to EasyOCR when running on Python 3.14.
- Verify OCR behavior if you rely on the previous default engine under Python 3.14.
- Release notes do not specify breaking changes or migration steps beyond the engine switch.
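One way to avoid depending on whichever engine is the default is to select the OCR engine explicitly. The sketch below pins EasyOCR for PDF input. The import paths and option classes (`PdfPipelineOptions`, `EasyOcrOptions`, `PdfFormatOption`, `InputFormat`) reflect Docling's pipeline options API as commonly documented, but treat them as assumptions to verify against your installed version. The function name `build_converter_with_easyocr` is hypothetical.

```python
def build_converter_with_easyocr():
    """Return a DocumentConverter whose PDF pipeline uses EasyOCR explicitly."""
    # Imports are local so this module can be imported without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Pin the engine instead of relying on the version-dependent default.
    pipeline_options.ocr_options = EasyOcrOptions()

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (requires Docling and the EasyOCR package to be installed):
#   converter = build_converter_with_easyocr()
#   result = converter.convert("scanned.pdf")
```

To keep a different engine, the same pattern should apply with that engine's options class substituted for `EasyOcrOptions`.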
Fixes slow table parsing in DOCX and HTML documents; no breaking changes or new requirements.
- Upgrade to resolve performance bottlenecks when parsing tables in DOCX files.
- Upgrade to resolve performance bottlenecks when parsing tables in HTML files.
Adds VLM token tracking and applies two dependency/OCR fixes; no breaking changes reported.
- Track generated tokens and stop reasons when using VLM models for enhanced observability.
- Pin NuExtract to a working revision and fix OCR PSM integer handling to resolve runtime issues.