
Crawl4AI: Open-source web crawler for LLM applications

Async browser automation extracting web content for LLMs.

Rankings: #15 overall · #7 in AI & ML
Stars: 58.3K (+169 in the last 7 days)
Forks: 5.9K (+31 in the last 7 days)
Downloads: 107

Learn more about crawl4ai

Crawl4AI is an open-source Python library designed to crawl and extract web content optimized for consumption by large language models. It operates through asynchronous browser automation to render JavaScript-heavy pages, capturing dynamic content that traditional HTTP-based scrapers cannot access. The crawler processes web pages to extract clean, structured content while removing navigation elements, advertisements, and other noise that would interfere with LLM processing. It implements configurable extraction strategies to transform raw HTML into markdown or structured data formats suitable for embedding in vector databases or direct LLM prompts.


1. LLM-Ready Markdown Output

Extracts web content into structured Markdown with preserved semantic elements such as headings, tables, and code blocks. Designed specifically for RAG systems and language-model ingestion rather than general HTML parsing.

2. Async Browser Pooling

Manages concurrent crawl requests through a pool of reusable browser instances with asynchronous execution. Reduces startup overhead and enables parallel processing compared to sequential single-browser approaches (a concurrency sketch follows the basic example below).

3. Programmable Extraction Hooks

Inject custom JavaScript, define site-specific behaviors, and chain LLM-based extraction strategies through a hook system. Enables adaptive crawling logic and intelligent content filtering without forking the codebase (see the JavaScript-injection sketch below).


import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_page():
    # The async context manager launches and cleanly shuts down the browser.
    async with AsyncWebCrawler() as crawler:
        # arun() renders the page (including JavaScript) and cleans the HTML.
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown output

asyncio.run(crawl_page())
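
The Async Browser Pooling feature builds directly on this. Below is a minimal batch-crawling sketch; it assumes the library's arun_many method, which dispatches a list of URLs across pooled browser instances, so verify the exact signature and concurrency options against the project documentation.

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_many():
    urls = [
        "https://example.com",
        "https://example.org",
        "https://example.net",
    ]
    async with AsyncWebCrawler() as crawler:
        # arun_many() fans the URLs out across the browser pool
        # instead of crawling them sequentially.
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(result.url, result.success)

asyncio.run(crawl_many())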
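
For the hook system, a common pattern is injecting JavaScript before extraction, for example to trigger lazy-loaded content. This sketch assumes arun accepts a js_code argument holding a script to execute after page load; hook registration and LLM-based extraction strategies follow similar patterns covered in the project docs.

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_js():
    # Scroll to the bottom so lazy-loaded content renders before extraction.
    scroll_script = "window.scrollTo(0, document.body.scrollHeight);"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code=scroll_script,  # assumed keyword; check docs for the exact name
        )
        print(result.markdown)

asyncio.run(crawl_with_js())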

v0.7.6

Adds webhook support to Docker job queue API endpoints, enabling real-time notifications with automatic retry instead of polling.

  • Configure webhooks for /crawl/job and /llm/job endpoints with custom headers and full payload delivery.
  • Set global webhook URLs in config.yml; delivery includes exponential backoff retry on failure (a minimal receiver sketch follows this list).
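
A minimal sketch of a receiver for these notifications, assuming the webhook delivers a JSON payload over HTTP POST; the job_id and status fields shown here are hypothetical, so consult the API docs for the actual payload schema.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body posted by the job queue.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Hypothetical fields; the real schema is defined by the API.
        print("job update:", payload.get("job_id"), payload.get("status"))
        self.send_response(200)  # a 2xx response stops the retry backoff
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()
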
v0.7.5

Requires Python 3.10+; deprecates proxy parameter in favor of proxy_config structure and adds cssselect dependency.

  • Upgrade to Python 3.10 or later and migrate proxy usage to the new proxy_config structure before deploying (a migration sketch follows this list).
  • Use Docker hooks at 8 pipeline points for auth or performance; fixes JWT validation and URL query parameter handling.
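
A minimal migration sketch for the proxy change, assuming proxy_config is passed through BrowserConfig as a dict with a server field and optional credentials; verify the exact field names against the 0.7.5 release notes.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def crawl_via_proxy():
    # Deprecated in 0.7.5: proxy="http://user:pass@proxy.example:8080"
    # Replacement: a structured proxy_config (field names assumed here).
    browser_config = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example:8080",
            "username": "user",
            "password": "pass",
        }
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success)

asyncio.run(crawl_via_proxy())
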
v0.7.4

Release notes do not specify breaking changes, new requirements, or feature details; consult CHANGELOG.md for actual changes.

  • Install via PyPI with `pip install crawl4ai==0.7.4` or pull Docker image `unclecode/crawl4ai:0.7.4`.
  • Review the project CHANGELOG.md on GitHub to identify breaking changes, deprecations, or new capabilities before upgrading.


