
pdfplumber: Extract text and tables from PDFs

Python library for extracting PDF text and tables.

Rankings: #109 overall, #9 in Data Engineering
Stars: 9.5K (+15 in the last 7 days)
Forks: 855 (+1 in the last 7 days)
Downloads: 113

Learn more about pdfplumber

pdfplumber is a Python library built on pdfminer.six that parses PDF documents to extract structured information about their contents. It works by analyzing the low-level objects within a PDF, including individual characters, rectangles, lines, and curves, then exposes this data through a Python API. The library includes specialized algorithms for table detection and extraction, as well as text extraction with layout preservation. It is designed for machine-generated PDFs rather than scanned documents, and supports password-protected files and customizable layout analysis parameters.


1. Granular Object Access

Exposes detailed properties of individual text characters, rectangles, and lines with precise coordinates and styling. Enables custom analysis of document structure beyond basic text extraction, useful for complex layout parsing.
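As a rough illustration of working with character-level objects, the sketch below operates on a list of character dictionaries shaped like the entries in `page.chars` (keys such as `text`, `fontname`, `size`, `x0`, `x1`, `top`, `bottom`). The helper names and the sample data are hypothetical and hand-made for this example; they are not part of pdfplumber.

```python
from collections import Counter


def summarize_fonts(chars):
    """Count characters per (fontname, size) pair, e.g. to spot heading fonts."""
    return Counter((c["fontname"], round(c["size"], 1)) for c in chars)


def text_in_box(chars, x0, top, x1, bottom):
    """Join the text of characters whose boxes lie fully inside a rectangle."""
    inside = [
        c for c in chars
        if c["x0"] >= x0 and c["x1"] <= x1 and c["top"] >= top and c["bottom"] <= bottom
    ]
    # Approximate reading order: top to bottom, then left to right.
    inside.sort(key=lambda c: (round(c["top"]), c["x0"]))
    return "".join(c["text"] for c in inside)


# Hand-made sample shaped like page.chars entries (illustrative values only).
sample_chars = [
    {"text": "H", "fontname": "Helvetica-Bold", "size": 14.0,
     "x0": 72.0, "x1": 82.0, "top": 50.0, "bottom": 64.0},
    {"text": "i", "fontname": "Helvetica-Bold", "size": 14.0,
     "x0": 82.0, "x1": 86.0, "top": 50.0, "bottom": 64.0},
    {"text": "x", "fontname": "Helvetica", "size": 9.0,
     "x0": 72.0, "x1": 77.0, "top": 700.0, "bottom": 709.0},
]

font_counts = summarize_fonts(sample_chars)
header_text = text_in_box(sample_chars, 0, 0, 600, 100)
```

With a real document, the same helpers could be called on `pdf.pages[0].chars` after opening the file with `pdfplumber.open(...)`.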

2. Built-in Table Detection

Includes algorithms that automatically detect and extract tabular data without manual cell coordinate specification. Handles merged cells, implicit borders, and varied table layouts that other libraries require preprocessing to parse.
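A minimal sketch of consuming extracted tables, assuming `page.extract_tables()` returns a list of tables where each table is a list of rows and each row is a list of cell strings (or `None` for empty cells). The helper names `table_to_records` and `extract_records` are hypothetical, and treating the first row as a header is an assumption that will not hold for every table.

```python
def table_to_records(table):
    """Turn one extracted table into a list of dicts keyed by its first row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells come back as None; normalize them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_records(page, table_settings=None):
    """Run table extraction on a page-like object and convert every table."""
    tables = page.extract_tables(table_settings=table_settings)
    return [table_to_records(t) for t in tables]


# Hand-made rows shaped like one table from extract_tables() (illustrative only).
sample_table = [["Name", "Qty"], ["Apple", "3"], ["Pear", None]]
sample_records = table_to_records(sample_table)
```

With a real file, `extract_records(pdf.pages[0])` would be called inside a `with pdfplumber.open(...) as pdf:` block.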

3. Visual Debugging Tools

Renders detected objects directly onto PDF pages for inspection during development. Displays character bounding boxes, table boundaries, and extracted regions visually, reducing guesswork when tuning extraction parameters.
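One way the debugging workflow might look, sketched with pdfplumber's `Page.to_image()`, `PageImage.draw_rects()`, `PageImage.debug_tablefinder()`, and `PageImage.save()`. The function name `save_debug_image`, the output path, and the resolution value are assumptions made for this example.

```python
def save_debug_image(page, out_path, resolution=150):
    """Render a page, overlay character boxes and detected tables, and save it."""
    im = page.to_image(resolution=resolution)
    # Outline every character's bounding box.
    im.draw_rects(page.chars)
    # Overlay the lines, intersections, and cells found by the table finder.
    im.debug_tablefinder()
    im.save(out_path)
    return out_path
```

With a real file, this could be called as `save_debug_image(pdf.pages[0], "page1-debug.png")` inside a `with pdfplumber.open(...) as pdf:` block. Passing the same table settings to `debug_tablefinder()` that are used for extraction should keep the picture in sync with the extracted output.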


import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    print(text)

v0.11.8

Adds table setting for capturing small edge segments; upgrades pdfminer.six dependency to 20251107.

  • Set `edge_min_length_prefilter` below 1 to capture dashed lines or other small edge segments in table extraction.
  • Upgrade pdfminer.six from 20250506 to 20251107; review compatibility if you pin transitive dependencies.
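A hedged sketch of how the new setting might be passed, assuming it is supplied through the `table_settings` dictionary like other table options. The value `0.5` is only an illustrative choice below 1, the function name is hypothetical, and versions before 0.11.8 are likely to reject the unknown key.

```python
def extract_tables_with_short_edges(page, prefilter=0.5):
    """Extract tables while keeping edge segments shorter than 1 unit."""
    settings = {
        # Line-based detection on both axes.
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        # Values below 1 keep short segments such as the pieces of a dashed rule.
        "edge_min_length_prefilter": prefilter,
    }
    return page.extract_tables(table_settings=settings)
```

With a real file, `extract_tables_with_short_edges(pdf.pages[0])` would be called inside a `with pdfplumber.open(...) as pdf:` block.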
v0.11.7

Breaking: stroking_pattern and non_stroking_pattern attributes removed due to pdfminer.six upgrade to 20250506.

  • Remove any code referencing stroking_pattern or non_stroking_pattern attributes on PDF objects.
  • Access new Page.trimbox, Page.bleedbox, and Page.artbox properties for additional PDF box dimensions.
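The box properties can be collected defensively so the same code runs on older releases too. This is a sketch only: `page_boxes` is a hypothetical helper, and it assumes each box is exposed as a plain attribute on the page object, falling back to `None` where the attribute is absent.

```python
BOX_NAMES = ("mediabox", "cropbox", "trimbox", "bleedbox", "artbox")


def page_boxes(page):
    """Return a dict of PDF box names to values, with None for missing boxes."""
    return {name: getattr(page, name, None) for name in BOX_NAMES}
```

With a real file, `page_boxes(pdf.pages[0])` would be called inside a `with pdfplumber.open(...) as pdf:` block. Using `getattr` with a default avoids an `AttributeError` on versions that predate the new properties.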
v0.11.6

Upgrades pdfminer.six to 20250327 and fixes text extraction bugs with use_text_flow and malformed PDFs.

  • Update pdfminer.six dependency to 20250327 to pick up upstream fixes and changes.
  • Fix use_text_flow=True extraction bug and improve error handling for malformed PDFs and recursion limits.

