
Docling: Document parsing for generative AI

Fast document parser for RAG and AI workflows.

Rankings: #12 overall, #9 in AI & ML
Stars: 54.3K (+1.0K in 7 days) · Forks: 3.7K (+44 in 7 days)

Learn more about Docling

Docling is a Python library for parsing and converting documents across multiple formats into structured representations suitable for AI applications. It uses layout analysis models and OCR to extract content from PDFs, scanned documents, and other file types, producing a unified DoclingDocument format. The tool supports local execution for sensitive data processing and includes integrations with frameworks like LangChain, LlamaIndex, and Haystack. Common deployment contexts include document preprocessing pipelines for retrieval-augmented generation, knowledge extraction workflows, and document conversion services.

1. Multi-Format Processing Pipeline

Handles PDF, DOCX, PPTX, XLSX, HTML, audio, and image formats through a single unified interface. Includes format-specific optimizations such as advanced PDF layout analysis and OCR for scanned documents, so no separate tools are needed.

2. Unified Document Structure

Converts all input formats into a consistent DoclingDocument representation with standardized extraction APIs. Exports to Markdown, HTML, JSON, or DocTags without format-specific parsing logic.

3. Local-First Architecture

Processes documents locally without sending data to external services. Sensitive documents remain on-premises throughout the parsing pipeline, which helps meet compliance requirements in regulated industries and privacy-conscious organizations.
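
As a rough illustration of the single-interface idea, the sketch below filters a mixed batch of files down to the document types named above and passes each one to the same converter. The SUPPORTED_SUFFIXES set and the select_supported and convert_all helpers are hypothetical names introduced for this example, and the suffix list covers only part of what the library accepts (audio and image types are left out).

```python
from pathlib import Path

# Hypothetical allowlist built from the formats named above
# (audio and image extensions are omitted for brevity).
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def select_supported(paths):
    """Keep only paths whose suffix is in SUPPORTED_SUFFIXES (case-insensitive)."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


def convert_all(paths):
    """Convert every supported file and return {file name: Markdown text}."""
    # Imported lazily so select_supported() works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    return {
        p.name: converter.convert(p).document.export_to_markdown()
        for p in select_supported(paths)
    }
```

Reusing one DocumentConverter for the whole batch should avoid repeating any per-instance setup for every file.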


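from docling.document_converter import DocumentConverter

# Create a converter with default settings
converter = DocumentConverter()

# Parse the file into a DoclingDocument (available as result.document)
result = converter.convert("document.pdf")

# Export to markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text)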
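
The other export formats mentioned above can be reached in a similar way. The following is a minimal sketch, assuming the DoclingDocument methods export_to_html, export_to_doctags, and export_to_dict are available in the installed version; the EXPORTERS table and the export_document helper are hypothetical names used only for this example.

```python
import json

# Output format -> DoclingDocument method name (assumed; check your installed version).
EXPORTERS = {
    "markdown": "export_to_markdown",
    "html": "export_to_html",
    "doctags": "export_to_doctags",
}


def export_document(document, fmt):
    """Serialize a converted document to the requested text format."""
    if fmt == "json":
        # export_to_dict() is assumed to return plain data that json.dumps accepts.
        return json.dumps(document.export_to_dict())
    if fmt not in EXPORTERS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    # Look up the export method by name and call it with default arguments.
    return getattr(document, EXPORTERS[fmt])()
```

With the converter result from the snippet above, export_document(result.document, "html") would return the HTML rendering as a string.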


v2.72.0

Add chart extraction models and improve Excel table bounds detection

  • Add chart extraction models
  • backend: Improve Excel table bounds detection and flatten merged cells
  • pptx: Handle picture shapes with external image references
  • Add granite vision for charts
v2.71.0

Add WebVTT support, Word document comments extraction, and Ollama presets

  • WebVTT and source tracker
  • Add support for Word document comments extraction
  • Allow newer typer versions
  • rapidocr: Use new model links for RapidOCR
  • Presets for Ollama
v2.70.0

Drop Python 3.9 support and improve PPTX parsing with comprehensive documentation updates

  • Drop support for Python 3.9
  • md: Handle pipe symbols that are not table markers
  • Remove direct vllm dependency
  • pptx: Fix bullet points not grouped correctly under subheadings
  • Add comprehensive docstrings to PdfPipelineOptions
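
Because v2.70.0 drops Python 3.9, a project that pins this release or later may want to check the interpreter version early. The sketch below assumes the new minimum is Python 3.10, which is inferred from the release note rather than stated in it; MIN_PYTHON and python_supported are hypothetical names.

```python
import sys

# Assumed minimum after the 3.9 drop; confirm against the package metadata.
MIN_PYTHON = (3, 10)


def python_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_supported():
    print("Warning: this interpreter is older than the assumed minimum of Python 3.10")
```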
