
pdfplumber: Extract text and tables from PDFs

Python library for extracting PDF text and tables.

Overall rank: #133

Data Engineering rank: #4

Stars: 9.8K

Forks: 862

7-day stars: +72

Tags: Data Engineering

Learn more about pdfplumber

pdfplumber is a Python library built on pdfminer.six that parses PDF documents to extract structured information about their contents. It works by analyzing the low-level objects within a PDF, including individual characters, rectangles, lines, and curves, then exposes this data through a Python API. The library includes specialized algorithms for table detection and extraction, as well as text extraction with layout preservation. It is designed for machine-generated PDFs rather than scanned documents, and supports password-protected files and customizable layout analysis parameters.

What makes pdfplumber different?

Granular Object Access

Exposes detailed properties of individual text characters, rectangles, and lines with precise coordinates and styling. Enables custom analysis of document structure beyond basic text extraction, useful for complex layout parsing.
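As a rough illustration of what per-character access makes possible, the sketch below picks out text set noticeably larger than the body font, a simple way to guess at headings. It works on plain dicts shaped like the entries of `page.chars` (the `text`, `size`, `x0`, and `top` keys). The function names, the `min_ratio` threshold, and the sample data are assumptions made for this example, not part of pdfplumber.

```python
from collections import Counter


def largest_font_text(chars, min_ratio=1.2):
    """Return the text of characters at least `min_ratio` times the most common font size."""
    if not chars:
        return ""
    # Round sizes so small floating-point differences do not split one font size into many.
    sizes = Counter(round(c["size"], 1) for c in chars)
    body_size = sizes.most_common(1)[0][0]
    big = [c for c in chars if c["size"] >= body_size * min_ratio]
    # Read top-to-bottom, then left-to-right ("top" is measured from the top of the page).
    big.sort(key=lambda c: (round(c["top"]), c["x0"]))
    return "".join(c["text"] for c in big)


def headings_from_pdf(path, page_number=0):
    """Hypothetical convenience wrapper: apply the heuristic to one page of a real PDF."""
    import pdfplumber  # imported lazily so the helper above runs without the library

    with pdfplumber.open(path) as pdf:
        return largest_font_text(pdf.pages[page_number].chars)


# Hand-made sample data mimicking page.chars: a large "Hi" above smaller body text.
sample_chars = [
    {"text": "H", "size": 18.0, "x0": 72.0, "top": 50.0},
    {"text": "i", "size": 18.0, "x0": 85.0, "top": 50.0},
    {"text": "a", "size": 10.0, "x0": 72.0, "top": 80.0},
    {"text": "b", "size": 10.0, "x0": 78.0, "top": 80.0},
    {"text": "c", "size": 10.0, "x0": 84.0, "top": 80.0},
    {"text": "d", "size": 10.0, "x0": 90.0, "top": 80.0},
    {"text": "e", "size": 10.0, "x0": 96.0, "top": 80.0},
]
print(largest_font_text(sample_chars))
```

The same pattern extends to other character properties, such as filtering on `fontname` to isolate bold text.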

Built-in Table Detection

Includes algorithms that detect and extract tabular data without manually specified cell coordinates. Handles merged cells, implicit borders, and varied table layouts that would otherwise need preprocessing before they could be parsed.
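A minimal sketch of working with an extracted table, assuming the first row is a header: `Page.extract_table()` returns a list of rows, each a list of cell strings with `None` for empty cells, or `None` when no table is found. The helper names `rows_to_records` and `first_table_records`, the `col_N` fallback for blank headers, and the sample rows are assumptions for illustration only.

```python
def rows_to_records(rows):
    """Turn [[header...], [cell...], ...] into a list of dicts, one per non-empty data row."""
    if not rows:
        return []
    # Blank or missing header cells get a positional fallback name.
    header = [(h or "").strip() or f"col_{i}" for i, h in enumerate(rows[0])]
    records = []
    for row in rows[1:]:
        # Empty cells arrive as None; normalise them to empty strings.
        cells = [(c or "").strip() for c in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


def first_table_records(path, page_number=0, table_settings=None):
    """Hypothetical wrapper: extract the largest table on a page and convert it to records."""
    import pdfplumber  # imported lazily so rows_to_records runs without the library

    with pdfplumber.open(path) as pdf:
        rows = pdf.pages[page_number].extract_table(table_settings=table_settings)
    return rows_to_records(rows or [])


# Hand-made sample shaped like extract_table() output.
sample_rows = [
    ["Name", "Qty", None],
    ["Widget", " 3 ", None],
    [None, None, None],
    ["Gadget", "5", "x"],
]
print(rows_to_records(sample_rows))
```

For tables without drawn borders, passing `table_settings={"vertical_strategy": "text", "horizontal_strategy": "text"}` asks the table finder to infer cell boundaries from text alignment instead of ruling lines.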

Visual Debugging Tools

Renders detected objects directly onto PDF pages for inspection during development. Displays character bounding boxes, table boundaries, and extracted regions visually, reducing guesswork when tuning extraction parameters.
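One way this might look in practice: render a page, outline every character, overlay the table finder's view, and save the result as an image. `Page.to_image()`, `draw_rects()`, `draw_rect()`, `debug_tablefinder()`, and `save()` are pdfplumber calls. The helper names, the padding value, and the assumption that the page origin is at (0, 0) are choices made for this sketch.

```python
def padded_bbox(objs, page_width, page_height, pad=5):
    """Bounding box (x0, top, x1, bottom) around `objs`, padded and clamped to the page.

    Assumes the page's own bounding box starts at (0, 0).
    """
    x0 = min(o["x0"] for o in objs)
    top = min(o["top"] for o in objs)
    x1 = max(o["x1"] for o in objs)
    bottom = max(o["bottom"] for o in objs)
    return (
        max(0, x0 - pad),
        max(0, top - pad),
        min(page_width, x1 + pad),
        min(page_height, bottom + pad),
    )


def save_debug_image(path, out_path, page_number=0, resolution=150):
    """Hypothetical helper: write an annotated rendering of one page to `out_path`."""
    import pdfplumber  # imported lazily so padded_bbox runs without the library

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        im = page.to_image(resolution=resolution)
        im.draw_rects(page.chars)  # outline each character's bounding box
        if page.chars:
            # One extra rectangle around all text on the page.
            im.draw_rect(padded_bbox(page.chars, page.width, page.height))
        im.debug_tablefinder()  # overlay the lines and cells the table finder detected
        im.save(out_path)


# Hand-made objects with the coordinate keys pdfplumber uses.
sample_objs = [
    {"x0": 10, "top": 20, "x1": 50, "bottom": 30},
    {"x0": 2, "top": 25, "x1": 60, "bottom": 95},
]
print(padded_bbox(sample_objs, page_width=62, page_height=100))
```

Comparing the saved image before and after changing a table setting shows directly which lines the finder picked up.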

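Example code snippets

import pdfplumber

# Open the PDF; the context manager closes the file when the block exits.
with pdfplumber.open("document.pdf") as pdf:
    first_page = pdf.pages[0]  # pages are zero-indexed
    # extract_text() returns the page's text as a single string.
    text = first_page.extract_text()
    print(text)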

Recent Changes

v0.11.9

This release upgrades pdfminer.six from 20251107 to 20251230.

– Upgrade pdfminer.six from 20251107 to 20251230

v0.11.8

This release upgrades pdfminer.six from 20250506 to 20251107.

– Upgrade pdfminer.six from 20250506 to 20251107

v0.11.7

This release adds access to additional PDF page boxes and upgrades the pdfminer.six dependency.

– Add access to Page.trimbox, Page.bleedbox, and Page.artbox
– Upgrade pdfminer.six from 20250327 to 20250506
– Remove stroking_pattern and non_stroking_pattern object attributes, due to changes in pdfminer.six
