pdfplumber: Extract text and tables from PDFs
Python library for extracting PDF text and tables.
Learn more about pdfplumber
pdfplumber is a Python library built on pdfminer.six that parses PDF documents to extract structured information about their contents. It works by analyzing the low-level objects within a PDF, including individual characters, rectangles, lines, and curves, then exposes this data through a Python API. The library includes specialized algorithms for table detection and extraction, as well as text extraction with layout preservation. It is designed for machine-generated PDFs rather than scanned documents, and supports password-protected files and customizable layout analysis parameters.
Granular Object Access
Exposes detailed properties of individual text characters, rectangles, and lines with precise coordinates and styling. Enables custom analysis of document structure beyond basic text extraction, useful for complex layout parsing.
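As a rough illustration of what this coordinate data makes possible, the sketch below groups character objects into lines of text by vertical position. It assumes each character is a dict with `text`, `x0`, and `top` keys, as in pdfplumber's `page.chars`. The helper name `group_chars_into_lines` and its `tolerance` parameter are hypothetical and not part of the library.

```python
def group_chars_into_lines(chars, tolerance=3):
    """Group char dicts into lines of text using their `top` coordinate.

    Hypothetical helper: `chars` is assumed to resemble pdfplumber's
    `page.chars`, i.e. dicts with at least "text", "x0", and "top".
    """
    lines = []  # each entry is (top of the line's first char, [chars])
    for char in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Same line if the char sits within `tolerance` points of the line's top.
        if lines and abs(char["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(char)
        else:
            lines.append((char["top"], [char]))
    # Within each line, order characters left to right before joining.
    return [
        "".join(c["text"] for c in sorted(row, key=lambda c: c["x0"]))
        for _, row in lines
    ]


# Hand-written sample data standing in for `page.chars`.
sample_chars = [
    {"text": "i", "x0": 16, "top": 100.5},
    {"text": "!", "x0": 10, "top": 120},
    {"text": "H", "x0": 10, "top": 100},
]
print(group_chars_into_lines(sample_chars))
```

With a real document, the same function could be called as `group_chars_into_lines(page.chars)`. The tolerance would likely need tuning per document.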
Built-in Table Detection
Includes algorithms that automatically detect and extract tabular data without manual cell coordinate specification. Handles merged cells, implicit borders, and varied table layouts that other libraries often cannot parse without preprocessing.
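A minimal sketch of working with an extracted table follows. It assumes the table has the shape returned by `page.extract_table()`: a list of rows, each a list of strings or `None`. The helper names `table_to_records` and `extract_first_table` are hypothetical, and treating the first row as a header is an assumption that will not hold for every table.

```python
def table_to_records(rows):
    """Convert a table (list of rows) into dicts keyed by the header row.

    Assumes the first row is a header, which is not true of every table.
    Empty cells (None) become empty strings.
    """
    if not rows:
        return []
    header = [(cell or "").strip() for cell in rows[0]]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in rows[1:]
    ]


def extract_first_table(path, table_settings=None):
    """Extract the first detected table on page one as a list of dicts."""
    # Imported lazily so table_to_records works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        table = pdf.pages[0].extract_table(table_settings)
    # extract_table returns None when no table is found.
    return table_to_records(table or [])


# Hand-written sample standing in for real extraction output.
sample_rows = [["Name", "Qty"], ["Widget", "2"], ["Gadget", None]]
print(table_to_records(sample_rows))
```

For tables without ruling lines, passing something like `{"vertical_strategy": "text", "horizontal_strategy": "text"}` as `table_settings` may work better than the default line-based strategies.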
Visual Debugging Tools
Renders detected objects directly onto PDF pages for inspection during development. Displays character bounding boxes, table boundaries, and extracted regions visually, reducing guesswork when tuning extraction parameters.
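The debugging workflow might look roughly like the sketch below. It renders a page to an image, outlines each character's bounding box, overlays the table finder's results, and writes a PNG. The wrapper name `save_debug_image`, its parameters, and the file names are hypothetical. Rendering needs pdfplumber's image dependencies to be installed.

```python
def save_debug_image(pdf_path, out_path, page_number=0, resolution=150):
    """Write a PNG of one page with character boxes and detected tables drawn on it."""
    # Imported lazily so this module can be loaded without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        im = page.to_image(resolution=resolution)
        im.draw_rects(page.chars)   # outline every character's bounding box
        im.debug_tablefinder()      # overlay the lines and cells the table finder detected
        im.save(out_path)
    return out_path
```

A call such as `save_debug_image("document.pdf", "page1-debug.png")` would then produce an image to inspect while adjusting table settings.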
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    print(text)

v0.11.9
- Upgrade `pdfminer.six` from `20251107` to `20251230`. (75bbed3 + 1524ce4 + 26687c3 + 9555532)
v0.11.8
- Upgrade `pdfminer.six` from `20250506` to `20251107` (h/t @henry-renner-v). (0079187)
v0.11.7
- Add access to `Page.trimbox`, `Page.bleedbox`, and `Page.artbox` (h/t @samuelbradshaw). (#1313 + 7e364e6)
- Upgrade `pdfminer.six` from `20250327` to `20250506`. (4c7e092)
- Remove `stroking_pattern` and `non_stroking_pattern` object attributes, due to changes in `pdfminer.six`. (4c7e092)
Top in Data Engineering
Related Repositories
Discover similar tools and frameworks used by developers
patroni
Automates PostgreSQL failover using distributed consensus systems.
dbt-core
SQL-based transformation framework for analytics data warehouses.
Fiona
Python library for reading and writing geographic data files like GeoPackage and Shapefile.
flyway
Version-controlled SQL migrations with automated execution tracking.
n8n
Node-based automation platform with JavaScript and Python scripting.