Docling: Document parsing for generative AI

Fast document parser for RAG and AI workflows.

Rankings: #2 overall, #1 in Data Engineering
Stars: 49.5K (+376 in the last 7 days)
Forks: 3.4K (+16 in the last 7 days)

Learn more about docling

Docling is a Python library for parsing and converting documents across multiple formats into structured representations suitable for AI applications. It uses layout analysis models and OCR to extract content from PDFs, scanned documents, and other file types, producing a unified DoclingDocument format. The tool supports local execution for sensitive data processing and includes integrations with frameworks like LangChain, LlamaIndex, and Haystack. Common deployment contexts include document preprocessing pipelines for retrieval-augmented generation, knowledge extraction workflows, and document conversion services.


1. Multi-Format Processing Pipeline

Handles PDF, DOCX, PPTX, XLSX, HTML, audio, and image formats through a single interface. Format-specific optimizations, such as advanced PDF layout analysis and OCR for scanned documents, are built in, so there is no need to switch tools per format.
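
As a rough sketch of what the single interface means in practice, the snippet below routes files of several types through one `DocumentConverter`. The suffix list and the helper names `collect_inputs` and `convert_to_markdown` are illustrative assumptions, not part of docling; only `DocumentConverter.convert` and `export_to_markdown` are taken from the library's own quick-start usage.

```python
from pathlib import Path

# Assumed suffix list for illustration, based on the formats named above;
# the set docling actually accepts may differ by version.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".png", ".jpg"}


def collect_inputs(paths):
    """Keep only paths whose suffix is in the assumed supported set."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


def convert_to_markdown(paths):
    """Convert each file with one shared converter; return {filename: markdown}."""
    # Imported lazily so collect_inputs can be used without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    results = {}
    for path in collect_inputs(paths):
        result = converter.convert(str(path))
        results[path.name] = result.document.export_to_markdown()
    return results
```

The same converter instance is reused for every file, so no per-format setup is needed in the calling code.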

2. Unified Document Structure

Converts all input formats into a consistent DoclingDocument representation with standardized extraction APIs. Exports to Markdown, HTML, JSON, or DocTags without format-specific parsing logic.
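
To illustrate the export side, here is a minimal sketch that renders one converted document in each of the four formats. The helper `export_all` is hypothetical; the method names are the ones recalled from `DoclingDocument` (`export_to_markdown`, `export_to_html`, `export_to_dict`, `export_to_doctags`), and `export_to_doctags` in particular may be named differently in older releases, so check them against the installed version.

```python
import json


def export_all(document):
    """Return the document rendered in each export format as a string."""
    return {
        "markdown": document.export_to_markdown(),
        "html": document.export_to_html(),
        # export_to_dict() returns a plain dict; serialize it to JSON text here.
        "json": json.dumps(document.export_to_dict()),
        "doctags": document.export_to_doctags(),
    }
```

Because every input format ends up as the same document type, the helper does not need to know whether the source was a PDF, a DOCX file, or an image.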

3. Local-First Architecture

Runs the parsing pipeline locally, so documents do not have to be sent to external services. Sensitive documents can stay on-premises throughout parsing, which helps regulated and privacy-conscious organizations meet their compliance requirements.


from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text)


v2.61.2

Switches OCR engine default to EasyOCR when running on Python 3.14.

  • Verify OCR behavior if you rely on the previous default engine under Python 3.14.
  • Release notes do not specify breaking changes or migration steps beyond the engine switch.
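
One way to avoid depending on whichever engine is the default is to name the OCR engine explicitly. The sketch below assumes docling's pipeline-options API (`PdfPipelineOptions`, `EasyOcrOptions`, `PdfFormatOption`, `InputFormat`) as recalled from its documentation; the helper names `default_may_have_changed` and `build_converter_with_pinned_ocr` are hypothetical, and `EasyOcrOptions` is used only as an example of a pinned engine.

```python
import sys


def default_may_have_changed(version_info=sys.version_info):
    """True on Python 3.14, where the release note says the default switched."""
    return tuple(version_info[:2]) == (3, 14)


def build_converter_with_pinned_ocr():
    """Build a converter whose PDF pipeline names its OCR engine explicitly."""
    # Lazy imports so the version check above works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Swap in a different options class to pin another engine;
    # EasyOcrOptions is only an example choice here.
    pipeline_options.ocr_options = EasyOcrOptions()

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With the engine pinned this way, OCR output should not change just because the interpreter version does.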
v2.61.1

Fixes slow table parsing in DOCX and HTML documents; no breaking changes or new requirements.

  • Upgrade to resolve performance bottlenecks when parsing tables in DOCX and HTML files.
v2.61.0

Adds VLM token tracking and applies two dependency/OCR fixes; no breaking changes reported.

  • Track generated tokens and stop reasons when using VLM models for enhanced observability.
  • Pin NuExtract to a working revision and fix OCR PSM integer handling to resolve runtime issues.
