Crawl4AI: Open-source web crawler for LLM applications
Async browser automation extracting web content for LLMs.
Crawl4AI is an open-source Python library that crawls web pages and extracts their content in a form optimized for large language models. It drives a browser asynchronously to render JavaScript-heavy pages, capturing dynamic content that plain HTTP-based scrapers cannot access. During processing it strips navigation elements, advertisements, and other noise that would interfere with LLM processing, leaving clean, structured content. Configurable extraction strategies then transform the raw HTML into Markdown or structured data suitable for embedding in vector databases or for direct use in LLM prompts.
LLM-Ready Markdown Output
Converts web content to structured Markdown, preserving semantic elements such as headings, tables, and code blocks. Designed for RAG systems and language-model ingestion rather than general-purpose HTML parsing.
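To make the idea of Markdown output with preserved semantics concrete, here is a minimal sketch built only on Python's standard `html.parser`. It is not Crawl4AI's implementation: the `MarkdownSketch` class, the `html_to_markdown` helper, and the choice of noise tags are all assumptions made for illustration, and the sketch handles only headings, paragraphs, and code blocks (no tables).

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as noise for the purpose of this sketch.
NOISE_TAGS = {"nav", "script", "style", "footer", "aside"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this listing tidy


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished Markdown blocks
        self.text = []        # text collected for the block being read
        self.noise_depth = 0  # greater than 0 while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in HEADINGS or tag in ("p", "pre"):
            self.text = []  # start collecting a new block

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
            return
        if self.noise_depth:
            return
        body = "".join(self.text)
        if tag in HEADINGS:
            self.blocks.append(HEADINGS[tag] + body.strip())
        elif tag == "p" and body.strip():
            self.blocks.append(" ".join(body.split()))  # collapse whitespace
        elif tag == "pre":
            self.blocks.append(FENCE + "\n" + body.strip("\n") + "\n" + FENCE)
        else:
            return  # inline or unknown tag: keep collecting text
        self.text = []


def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = (
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Title</h1><p>Some   text.</p>"
    "<pre><code>x = 1</code></pre>"
    "<script>track()</script>"
)
markdown = html_to_markdown(sample)
print(markdown)
```

The navigation link and the script body are dropped, while the heading level and the code block survive as Markdown structure, which is the property that matters for chunking and embedding.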
Async Browser Pooling
Manages concurrent crawl requests through a pool of reusable browser instances with asynchronous execution. Reusing instances avoids repeated browser startup, and concurrent execution lets pages be processed in parallel rather than one at a time in a single browser.
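The pooling pattern itself can be sketched with nothing but `asyncio`. The example below is an assumption-laden illustration, not Crawl4AI's internals: `FakeBrowser` is a hypothetical stand-in for a real browser instance, and `crawl_with_pool` is a made-up helper that hands instances out through an `asyncio.Queue`.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for a real browser instance."""

    def __init__(self, name):
        self.name = name
        self.pages_served = 0

    async def fetch(self, url):
        await asyncio.sleep(0.01)  # simulate the time a page load takes
        self.pages_served += 1
        return f"{self.name} rendered {url}"


async def crawl_with_pool(urls, pool_size=2):
    # Create the browsers once, then hand them out through a queue.
    pool = asyncio.Queue()
    browsers = [FakeBrowser(f"browser-{i}") for i in range(pool_size)]
    for browser in browsers:
        pool.put_nowait(browser)

    async def worker(url):
        browser = await pool.get()  # wait until an instance is free
        try:
            return await browser.fetch(url)
        finally:
            pool.put_nowait(browser)  # return the instance for reuse

    # gather() runs the workers concurrently and keeps results in input order.
    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, browsers


urls = [f"https://example.com/{i}" for i in range(6)]
results, browsers = asyncio.run(crawl_with_pool(urls, pool_size=2))
for line in results:
    print(line)
```

Six URLs are served by two instances, so no request pays for a fresh browser launch, and at most `pool_size` pages are in flight at any moment.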
Programmable Extraction Hooks
Inject custom JavaScript, define site-specific behaviors, and chain LLM-based extraction strategies through a hook system. Enables adaptive crawling logic and intelligent content filtering without forking the codebase.
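A hook system of this kind usually comes down to named stages with user-supplied callables chained at each stage. The sketch below shows that pattern in plain Python. Everything in it is hypothetical (the `HookedCrawler` class, the `add_hook` method, the stage names, and the placeholder fetch) and it should not be read as Crawl4AI's actual hook API.

```python
import asyncio
import re
from collections import defaultdict


class HookedCrawler:
    """Hypothetical crawler skeleton that only demonstrates the hook pattern."""

    def __init__(self):
        self.hooks = defaultdict(list)  # stage name -> list of async callables

    def add_hook(self, stage, func):
        self.hooks[stage].append(func)

    async def _run_hooks(self, stage, value):
        # Hooks on the same stage are chained: each receives the previous result.
        for func in self.hooks[stage]:
            value = await func(value)
        return value

    async def crawl(self, url):
        url = await self._run_hooks("before_fetch", url)
        # Placeholder for a real page load; no network access happens here.
        html = f"<html><body>Content of {url}</body></html>"
        return await self._run_hooks("after_fetch", html)


async def force_https(url):
    # Site-specific behavior: upgrade the scheme before fetching.
    return url.replace("http://", "https://", 1)


async def strip_tags(html):
    # Content filtering: drop markup and keep only the text.
    return re.sub(r"<[^>]+>", "", html)


crawler = HookedCrawler()
crawler.add_hook("before_fetch", force_https)
crawler.add_hook("after_fetch", strip_tags)
hooked_result = asyncio.run(crawler.crawl("http://example.com"))
print(hooked_result)
```

Because behavior is attached from the outside, site-specific logic lives in small functions like `force_https` rather than in a modified copy of the crawler.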
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_page():
    # The context manager starts the crawler and cleans it up on exit.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(crawl_page())
Related Repositories
Discover similar tools and frameworks used by developers
Ollama
Go-based CLI for local LLM inference and management.
TTS
PyTorch toolkit for deep learning text-to-speech synthesis.
OpenVINO
Convert and deploy deep learning models across Intel hardware.
PaddleOCR
Multilingual OCR toolkit with document structure extraction.
Chart-GPT
AI tool that generates charts from natural language text descriptions.