Docling: Document parsing for generative AI
Fast document parser for RAG and AI workflows.
Docling is a Python library for parsing and converting documents across multiple formats into structured representations suitable for AI applications. It uses layout analysis models and OCR to extract content from PDFs, scanned documents, and other file types, producing a unified DoclingDocument format. The tool supports local execution for sensitive data processing and includes integrations with frameworks like LangChain, LlamaIndex, and Haystack. Common deployment contexts include document preprocessing pipelines for retrieval-augmented generation, knowledge extraction workflows, and document conversion services.
Multi-Format Processing Pipeline
Handles PDF, DOCX, PPTX, XLSX, HTML, audio, and image formats through a single unified interface. Format-specific handling, such as layout analysis for PDFs and OCR for scanned documents, is built in, so there is no need to switch tools per format.
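As a rough illustration of the single-interface idea, the sketch below gathers mixed-format files from a folder and passes each one to the same DocumentConverter. The helper names (collect_inputs, convert_folder) and the suffix list are assumptions made for this example; the list is illustrative and is not Docling's authoritative set of supported extensions.

```python
from pathlib import Path

# Assumption: an illustrative suffix list based on the formats named above.
# It is NOT Docling's authoritative list of supported extensions.
ASSUMED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".png", ".jpg"}


def collect_inputs(folder):
    """Return files under `folder` whose suffix is in ASSUMED_SUFFIXES, sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in ASSUMED_SUFFIXES
    )


def convert_folder(folder):
    """Convert every collected file with one converter; returns {path: markdown}."""
    # Imported lazily so collect_inputs() works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    outputs = {}
    for path in collect_inputs(folder):
        # Same call for every format; the converter picks the backend.
        result = converter.convert(str(path))
        outputs[path] = result.document.export_to_markdown()
    return outputs
```

Calling convert_folder("docs/") would return a dictionary mapping each input path to its Markdown rendering. Error handling is left out to keep the sketch short.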
Unified Document Structure
Converts all input formats into a consistent DoclingDocument representation with standardized extraction APIs. Exports to Markdown, HTML, JSON, or DocTags without format-specific parsing logic.
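A minimal sketch of what format-independent export can look like: once a document has been converted, the same three calls produce Markdown, HTML, and JSON regardless of the input type. The helper export_document and its parameters are hypothetical names for this example. DocTags export is left out so the sketch does not assume a specific method name for it.

```python
import json
from pathlib import Path


def export_document(document, out_dir, stem):
    """Write Markdown, HTML, and JSON renderings of a converted document.

    `document` is expected to be a DoclingDocument (e.g. `result.document`);
    the helper itself is a hypothetical convenience, not part of Docling.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    renderings = {
        ".md": document.export_to_markdown(),
        ".html": document.export_to_html(),
        # export_to_dict() returns plain data that can be serialized as JSON.
        ".json": json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
    }

    written = []
    for suffix, text in renderings.items():
        target = out / f"{stem}{suffix}"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With a conversion result in hand, export_document(result.document, "out", "document") would write out/document.md, out/document.html, and out/document.json.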
Local-First Architecture
Processes documents locally without sending data to external services. Sensitive documents stay on-premises throughout the parsing pipeline, which helps regulated industries and privacy-conscious organizations meet their compliance requirements.
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text)
Add chart extraction models and improve Excel table bounds detection
- Add chart extraction models
- backend: Improve Excel table bounds detection and flatten merged cells
- pptx: Handle picture shapes with external image references
- Add granite vision for charts
Add WebVTT support, Word document comments extraction, and Ollama presets
- WebVTT and source tracker
- Add support for Word document comments extraction
- Allow newer typer versions
- rapidocr: Use new model links for RapidOCR
- Presets for Ollama
Drop Python 3.9 support and improve PPTX parsing with comprehensive documentation updates
- Drop support for Python 3.9
- md: Handle pipe symbols that are not table markers
- Remove direct vllm dependency
- PPTX parsing: bullet points not grouped correctly under subheadings
- Add comprehensive docstrings to PdfPipelineOptions
Related Repositories
Discover similar tools and frameworks used by developers
StabilityMatrix
Multi-backend inference UI manager with embedded dependencies.
OpenVINO
Convert and deploy deep learning models across Intel hardware.
Civitai
Community platform for sharing Stable Diffusion models, embeddings, and AI generation assets.
Text Generation WebUI
Gradio-based UI for running LLMs locally with multiple model format and extension support.
Pi Mono
Monorepo providing AI agent development tools, unified LLM API, and deployment management for multiple providers.