Firecrawl In Depth: The Fundamentals

FeelTouch Labs

Last modified: 2025-04-13 00:54:35

Category: AI. Tags: Firecrawl, LLM, scraping

First published: 2025-04-12 14:04:47

Article link: https://blog.csdn.net/feeltouch/article/details/147162284

1. What Is Firecrawl

2. Main Features of Firecrawl (Using the Cloud Service)

2.1 Scrape: Scraping a Single URL

2.2 Scrape: Batch Scraping Multiple URLs

2.3 Scrape with Extract (LLM)

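1. What Is Firecrawl

Firecrawl is a cloud crawling service built around an AI-driven scraping engine. Without requiring a sitemap, it can crawl an entire website, scrape a single page, or map a site's URLs. Its built-in LLM support lets you describe in natural language what to extract from the scraped content. Firecrawl also has an open-source version: you can download the source code and run it yourself.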

2. Main Features of Firecrawl (Using the Cloud Service)

2.1 Scrape: Scraping a Single URL

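Turn any URL into clean data.

Firecrawl converts web pages into markdown, which is well suited to LLM applications.

It manages the complexity: proxies, caching, rate limits, JS-blocked content
It handles dynamic content: dynamic websites, JS-rendered sites, PDFs, images
It outputs clean markdown, structured data, screenshots, or HTML.

The endpoint is called as follows: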

curl --request POST \
  --url https://host:port/v1/scrape \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "<string>",
  "formats": [
    "markdown"
  ],
  "onlyMainContent": true,
  "includeTags": [
    "<string>"
  ],
  "excludeTags": [
    "<string>"
  ],
  "headers": {},
  "waitFor": 0,
  "mobile": false,
  "skipTlsVerification": false,
  "timeout": 30000,
  "jsonOptions": {
    "schema": {},
    "systemPrompt": "<string>",
    "prompt": "<string>"
  },
  "actions": [
    {
      "type": "wait",
      "milliseconds": 2,
      "selector": "#my-element"
    }
  ],
  "location": {
    "country": "US",
    "languages": [
      "en-US"
    ]
  },
  "removeBase64Images": true,
  "blockAds": true,
  "proxy": "basic"
}'
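
The same request can be assembled from Python. The sketch below is a minimal illustration, not an official client: it uses only the standard library and builds the POST request for `/v1/scrape` without sending it. The helper name `build_scrape_request` and the `base_url` and `api_key` parameters are hypothetical, and the `Authorization: Bearer` header is an assumption (the curl example above sends no credentials, but a hosted deployment will normally require some).

```python
import json
import urllib.request


def build_scrape_request(base_url, page_url, api_key=None, formats=("markdown",)):
    """Build (but do not send) a POST request for the /v1/scrape endpoint."""
    # Only a small subset of the fields from the curl example is set here.
    payload = {
        "url": page_url,
        "formats": list(formats),
        "onlyMainContent": True,
        "timeout": 30000,  # same value as in the curl example
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the deployment expects a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To actually send the request, pass the returned object to `urllib.request.urlopen` and decode the JSON response. Keeping construction separate from sending makes the payload easy to inspect before it leaves your machine.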

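2.2 Scrape: Batch Scraping Multiple URLs

You can scrape multiple URLs in a single batch. The batch call takes the list of URLs to scrape plus optional parameters; the params argument lets you specify additional options for the batch scrape job, such as the output formats.

It works much like the /crawl endpoint: it submits a batch scrape job and returns a job ID that you use to check the status of the batch scrape.

The SDK provides two methods, synchronous and asynchronous. The synchronous method returns the results of the batch scrape job, while the asynchronous method returns a job ID that you can use to check the status of the batch scrape.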
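
The difference between the two styles can be sketched as a polling loop: an asynchronous call hands back a job ID immediately, and a synchronous helper can be thought of as submitting the job and then polling until it finishes. The code below is an illustration under stated assumptions, not the SDK's actual implementation. `wait_for_batch` and `fetch_status` are hypothetical names, and the `status` values `"completed"` and `"failed"` are assumed.

```python
import time


def wait_for_batch(job_id, fetch_status, poll_interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll a batch scrape job until it reaches a terminal state.

    fetch_status is a caller-supplied function (hypothetical) that takes a
    job ID and returns the job's status as a dict, e.g. by calling the
    status endpoint of the service.
    """
    for _ in range(max_polls):
        status = fetch_status(job_id)
        # Assumed terminal states; adjust to what your deployment returns.
        if status.get("status") in ("completed", "failed"):
            return status
        sleep(poll_interval)
    raise TimeoutError(f"batch job {job_id} did not finish after {max_polls} polls")
```

Passing `fetch_status` and `sleep` in as arguments keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.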

curl --request POST \
  --url https://host:port/v1/batch/scrape \
  --header 'Content-Type: application/json' \
  --data '{
  "urls": [
    "<string>"
  ],
  "webhook": {
    "url": "<string>",
    "headers": {},
    "metadata": {},
    "events": [
      "completed"
    ]
  },
  "ignoreInvalidURLs": false,
  "formats": [
    "markdown"
  ],
  "onlyMainContent": true,
  "includeTags": [
    "<string>"
  ],
  "excludeTags": [
    "<string>"
  ],
  "headers": {},
  "waitFor": 0,
  "mobile": false,
  "skipTlsVerification": false,
  "timeout": 30000,
  "jsonOptions": {
    "schema": {