Crawl4AI - An LLM-Friendly Asynchronous Web Crawler

编程乐园

Last modified: 2025-04-26 08:34:36

First published: 2024-10-03 08:15:00

Original link: https://github.com/unclecode/crawl4ai

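I. About Crawl4AI (Async Version) 🕷️🤖

Crawl4AI is an open-source, LLM-friendly web crawler. It simplifies asynchronous web crawling and data extraction, making them practical for large language models (LLMs) and AI applications. 🆓🌐

Synchronous version: see README.sync.md. Earlier versions are also available on the V0.2.76 branch.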

GitHub: https://github.com/unclecode/crawl4ai
Try it on Colab: https://colab.research.google.com/drive/1REChY6fXQf-EaVYLv0eHEWvzlYxGm0pd
Official documentation: https://crawl4ai.com/mkdocs/
Contributing guide | License | Twitter: @unclecode

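Features ✨

🆓 Completely free and open source
🚀 Very fast performance, outperforming many paid services
🤖 LLM-friendly output formats (JSON, cleaned HTML, Markdown)
🌍 Supports crawling multiple URLs simultaneously
🎨 Extracts and returns all media tags (images, audio, and video)
🔗 Extracts all external and internal links
📚 Extracts metadata from the page
🔄 Custom hooks for authentication, headers, and page modifications before crawling
🕵️ User-agent customization
🖼️ Takes screenshots of the page
📜 Executes multiple custom JavaScript snippets before crawling
📊 Generates structured output without an LLM, using JsonCssExtractionStrategy
📚 Various chunking strategies: topic-based, regex, sentence, and more
🧠 Advanced extraction strategies: cosine clustering, LLM, and more
🎯 CSS selector support for precise data extraction
📝 Accepts instructions/keywords to refine extraction
🔒 Proxy support for enhanced privacy and access
🔄 Session management for complex multi-page crawling scenarios
🌐 Asynchronous architecture for improved performance and scalability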

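II. Installation 🛠️

Crawl4AI offers flexible installation options to suit different use cases. You can install it as a Python package or use Docker.

1. Using pip 🐍

Choose the installation option that best fits your needs:

1.1 Basic installation

For basic web crawling and scraping tasks: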

pip install crawl4ai

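By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.

👉 Note: When you install Crawl4AI, the setup script should install and configure Playwright automatically. If you still run into Playwright-related errors, you can install it manually using one of the following methods:

1. Via the command line: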

playwright install

2. If that does not work, try this more specific command:

python -m playwright install chromium

The second method has proven more reliable in some cases.
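
Before reinstalling, it can help to check what the current interpreter actually sees. The following is a minimal sketch using only the standard library; the helper name `playwright_status` is hypothetical, and the check only covers the Python package and the `playwright` command, not whether the browser binaries have been downloaded.

```python
import importlib.util
import shutil


def playwright_status() -> dict:
    """Report whether the Playwright package and its CLI are visible from this interpreter."""
    return {
        # True when `import playwright` would succeed in this environment.
        "package_installed": importlib.util.find_spec("playwright") is not None,
        # Path to the `playwright` command, or None if it is not on PATH.
        "cli_path": shutil.which("playwright"),
    }


if __name__ == "__main__":
    status = playwright_status()
    print(status)
    if not status["package_installed"]:
        print("Playwright package not found; try: pip install crawl4ai")
    elif status["cli_path"] is None:
        print("CLI not on PATH; try: python -m playwright install chromium")
```

If the package is installed but the command is not on PATH, the `python -m playwright install chromium` form is the one that still works, which may be why the second method is described as more reliable.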

1.2 Installing the synchronous version

If you need the synchronous version, which uses Selenium:

pip install "crawl4ai[sync]"

1.3 Development installation

For contributors who plan to modify the source code:

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .

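2. Using Docker 🐳

We are building Docker images and pushing them to Docker Hub. This will provide an easy way to run Crawl4AI in a containerized environment. Stay tuned for updates!

For more detailed installation instructions and options, see our installation guide.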

III. Quick Start 🚀

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
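
The feature list mentions crawling multiple URLs at the same time. One generic way to do that with asyncio is to run several crawl coroutines under a concurrency limit. The sketch below is an illustration, not Crawl4AI's own mechanism: `crawl_many` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real crawl so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, List


async def crawl_many(
    fetch: Callable[[str], Awaitable[object]],
    urls: List[str],
    max_concurrency: int = 5,
) -> List[object]:
    """Run `fetch(url)` for every URL, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> object:
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failing URL from cancelling the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl so the sketch runs without network access."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    pages = asyncio.run(crawl_many(fake_fetch, ["https://a.example", "https://b.example"]))
    print(pages)
```

With a real crawler, `fetch` could be something like `lambda u: crawler.arun(url=u)` inside the `async with AsyncWebCrawler(...)` block. Whether a single crawler instance copes well with concurrent `arun` calls is an assumption to verify against the official documentation.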

IV. Advanced Usage 🔬

1. Executing JavaScript and using CSS selectors

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector="article.tease-card",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
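
Because `js_code` is a list of plain strings, the snippets can be generated in Python. The following sketch builds the "click the button with this label" snippet for an arbitrary label; the helper name `click_button_js` is hypothetical. It uses `json.dumps` to escape the label and wraps the snippet in a function scope so that several snippets do not clash over variable names.

```python
import json


def click_button_js(label: str) -> str:
    """Build a JavaScript snippet that clicks the first <button> whose text contains `label`."""
    # json.dumps produces a quoted, escaped string that is also a valid JS string literal.
    js_label = json.dumps(label)
    return (
        "(() => {"
        " const btn = Array.from(document.querySelectorAll('button'))"
        ".find(b => b.textContent.includes(" + js_label + "));"
        " if (btn) btn.click();"
        " })();"
    )


# A list like this could be passed as js_code=... in the example above.
js_code = [click_button_js("Load More")]
print(js_code[0])
```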

2. Using a proxy

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
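
Hard-coding the proxy address ties the script to one machine. A small sketch of reading it from the environment instead is shown below; `proxy_from_env` is a hypothetical helper, and the variable names it checks are the conventional ones used by many tools rather than anything specific to Crawl4AI.

```python
import os
from typing import Mapping, Optional


def proxy_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first proxy URL found in common environment variables, else None."""
    for name in ("HTTPS_PROXY", "https_proxy", "HTTP_PROXY", "http_proxy"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    # Falls back to the address used in the example above (an assumed local proxy).
    proxy = proxy_from_env() or "http://127.0.0.1:7890"
    print(f"Using proxy: {proxy}")
```

The resulting value can then be passed as `AsyncWebCrawler(verbose=True, proxy=proxy)`, as in the example above.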

3. Extracting structured data without an LLM

JsonCssExtractionStrategy lets you extract structured data from web pages precisely, using CSS selectors.

import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
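
The schema is a plain nested dictionary, so mistakes such as a missing "attribute" key are easy to make. The sketch below is a hypothetical pre-flight check, not part of Crawl4AI. It only knows the three field types that appear in the example above ("text", "attribute", "nested"); the library may accept other types, which this check would flag as unrecognised.

```python
from typing import Any, Dict, List


def schema_problems(schema: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a CSS extraction schema (empty if none)."""
    problems: List[str] = []
    if not schema.get("baseSelector"):
        problems.append("schema: missing 'baseSelector'")
    _check_fields(schema.get("fields", []), "fields", problems)
    return problems


def _check_fields(fields: List[Dict[str, Any]], path: str, problems: List[str]) -> None:
    """Check each field and recurse into the sub-fields of nested ones."""
    for i, field in enumerate(fields):
        where = f"{path}[{i}]"
        if not field.get("name"):
            problems.append(f"{where}: missing 'name'")
        kind = field.get("type")
        if kind == "text":
            continue
        if kind == "attribute":
            if not field.get("attribute"):
                problems.append(f"{where}: 'attribute' type needs an 'attribute' key")
        elif kind == "nested":
            if not field.get("fields"):
                problems.append(f"{where}: 'nested' type needs a 'fields' list")
            else:
                _check_fields(field["fields"], f"{where}.fields", problems)
        else:
            problems.append(f"{where}: unrecognised type {kind!r}")


good = {"baseSelector": ".item", "fields": [{"name": "title", "selector": "h2", "type": "text"}]}
bad = {"fields": [{"name": "link", "selector": "a", "type": "attribute"}]}
print(schema_problems(good))
print(schema_problems(bad))
```

Running a check like this before calling `JsonCssExtractionStrategy(schema, ...)` surfaces schema typos without needing a network round trip.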

For more advanced usage examples, see the Examples section of the documentation.

4. Extracting structured data with OpenAI

import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),            
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
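
The example above prints `result.extracted_content` as raw text. A sketch of turning that text into typed records follows. It assumes the content is a JSON array of objects with the three keys named in the instruction; that format is an assumption based on the prompt, not something confirmed here, and `ModelFee` and `parse_fees` are hypothetical names. The sample string reuses the illustrative values from the instruction.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ModelFee:
    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content: str) -> List[ModelFee]:
    """Turn the JSON text into ModelFee records, skipping entries that lack a field."""
    items = json.loads(extracted_content)
    if isinstance(items, dict):  # tolerate a single object instead of a list
        items = [items]
    fees = []
    for item in items:
        try:
            fees.append(ModelFee(item["model_name"], item["input_fee"], item["output_fee"]))
        except (KeyError, TypeError):
            continue  # incomplete or non-object entry; ignore it
    return fees


sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}, {"error": true}]'
)
print(parse_fees(sample))
```

Skipping incomplete entries rather than raising is a design choice for this sketch, since LLM output is not guaranteed to match the requested shape on every item.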

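5. Session management and dynamic content crawling

Crawl4AI handles complex scenarios well, such as crawling multiple pages whose content is loaded dynamically via JavaScript. Here is an example that crawls GitHub commits across multiple pages: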

import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""
    async def on_execution_started(page):
        nonlocal first_commit 
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js_code=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())

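This example demonstrates how Crawl4AI handles complex scenarios in which content loads asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to make sure the data has loaded before continuing.

For more advanced usage examples, see the Examples section of the documentation.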
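
The hook in the example polls until the first commit differs from the previous one, with no upper bound on how long it waits. The core of that pattern can be sketched on its own with a timeout added. `wait_for_change` is a hypothetical helper, and the `demo` function simulates changing page content so the sketch runs without a browser.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_change(
    read_value: Callable[[], Awaitable[str]],
    previous: str,
    timeout: float = 10.0,
    interval: float = 0.5,
) -> str:
    """Poll `read_value` until it returns a non-empty value different from `previous`.

    Raises asyncio.TimeoutError if nothing changes within `timeout` seconds.
    """

    async def poll() -> str:
        while True:
            value = await read_value()
            if value and value != previous:
                return value
            await asyncio.sleep(interval)

    return await asyncio.wait_for(poll(), timeout=timeout)


async def demo() -> str:
    values = iter(["abc", "abc", "def"])  # simulated page content over time

    async def read_value() -> str:
        return next(values)

    return await wait_for_change(read_value, previous="abc", interval=0.01)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real hook, `read_value` would wrap the `page.query_selector(...)` and `evaluate(...)` calls from the example, and `previous` would be the last value of `first_commit`.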

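V. Speed Comparison 🚀

Crawl4AI is designed with speed as a primary focus. Our goal is to deliver the fastest possible response with high-quality data extraction, minimizing the abstractions between the data and the user.

We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results show Crawl4AI's performance advantage: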

Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49

Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49

Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
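
To put the timings above in relative terms, the ratios can be computed directly. This is plain arithmetic on the reported numbers; the helper name `speedup` is ours.

```python
# Timings in seconds, copied from the results above.
timings = {
    "Firecrawl": 7.02,
    "Crawl4AI (simple crawl)": 1.60,
    "Crawl4AI (with JavaScript execution)": 4.64,
}


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return baseline / candidate


baseline = timings["Firecrawl"]
for name, seconds in timings.items():
    if name != "Firecrawl":
        print(f"{name}: {speedup(baseline, seconds):.2f}x faster than Firecrawl")
```

On these figures the simple crawl is roughly 4.4 times faster and the JavaScript-enabled crawl roughly 1.5 times faster. Note that the simple crawl returned 18238 characters against Firecrawl's 42074, so the JavaScript-enabled run (40869 characters) is the closer like-for-like comparison.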