LLMs and Firecrawl: A Detailed Guide to Firecrawl's Introduction, Installation, Usage, and Example Applications
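Introduction to Firecrawl
Released in September 2024, Firecrawl is an API service that takes a URL, crawls the entire site behind it, and converts it into LLM-friendly Markdown or structured data. It crawls every accessible subpage and returns clean data for each page, with no sitemap required. Firecrawl offers advanced scraping, crawling, and data extraction capabilities, and is designed to supply clean data to your AI applications. It integrates with a range of LLM frameworks and low-code frameworks, and provides SDKs for Python and Node.js. The project is still under development: custom modules are being integrated into the monorepo, and self-hosted deployment is not fully ready yet, although the project can be run locally.
Installing and Using Firecrawl
1. Get an API key
To use the API, you need to sign up at firecrawl.dev and obtain an API key.
2. Usage (command line)
Firecrawl offers several features, which you can access by calling the API with curl:
Crawl
Crawls a URL and all of its accessible subpages. This submits a crawl job and returns a job ID that you can use to check the crawl status.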
curl -X POST https://api.firecrawl.dev/v1/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 100,
"scrapeOptions": {
"formats": ["markdown", "html"]
}
}'
The response contains the crawl job ID and a URL for checking the crawl status.
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v1/crawl/123-456-789"
}
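For readers who prefer Python to curl, the same crawl request can be assembled with the standard library alone. This is a minimal sketch, not part of the Firecrawl SDK: the helper names `build_crawl_request` and `submit_crawl` are hypothetical, and the endpoint, headers, and payload fields are copied from the curl example above.

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v1"  # base URL taken from the curl example


def build_crawl_request(api_key, url, limit=100, formats=("markdown", "html")):
    """Build (but do not send) the POST /crawl request shown in the curl example."""
    payload = {
        "url": url,
        "limit": limit,
        "scrapeOptions": {"formats": list(formats)},
    }
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def submit_crawl(request):
    """Send the request and return the parsed JSON body (needs network access and a valid key)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the request; call submit_crawl(req) to actually send it.
    req = build_crawl_request("fc-YOUR_API_KEY", "https://docs.firecrawl.dev")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping the request construction separate from the network call makes it possible to inspect the payload before spending any credits.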
Check a crawl job
Checks the status of a crawl job and retrieves its results.
curl -X GET https://api.firecrawl.dev/v1/crawl/123-456-789 \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY'
{
"status": "completed",
"total": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
}
]
}
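Because a crawl runs as a job, a client usually has to poll the status URL until the job finishes. The sketch below shows one way to structure that loop. It assumes only that a finished job reports `"status": "completed"` as in the response above; the `"scraping"` and `"failed"` values, the `fetch_status` hook, and the `max_polls` limit are assumptions made for illustration.

```python
import time


def wait_for_crawl(fetch_status, poll_interval=30.0, max_polls=20, sleep=time.sleep):
    """Poll a crawl job until its status is "completed", then return the final body.

    fetch_status: zero-argument callable returning the parsed JSON of
                  GET /v1/crawl/<job id> (hypothetical hook; plug in any HTTP client).
    """
    for attempt in range(max_polls):
        body = fetch_status()
        status = body.get("status")
        if status == "completed":
            return body
        if status == "failed":  # assumed failure value, not shown in the text above
            raise RuntimeError(f"crawl failed: {body}")
        if attempt < max_polls - 1:
            sleep(poll_interval)  # wait before asking again
    raise TimeoutError(f"crawl not completed after {max_polls} polls")


if __name__ == "__main__":
    # Simulated responses so the sketch runs without network access.
    responses = iter([
        {"status": "scraping", "total": 36},  # "scraping" is an assumed in-progress value
        {"status": "completed", "total": 36, "creditsUsed": 36, "data": []},
    ])
    result = wait_for_crawl(lambda: next(responses), poll_interval=0)
    print(result["status"], result["total"])
```

Passing the fetch function and the sleep function in as parameters keeps the loop independent of any particular HTTP library and easy to exercise offline.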
Scrape
Scrapes a URL and returns its content in the specified formats.
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"formats" : ["markdown", "html"]
}'
{
"success": true,
"data": {
"markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
"html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
"metadata": {
"title": "Home - Firecrawl",
"description": "Firecrawl crawls and converts any website into clean markdown.",
"language": "en",
"keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "Turn any website into LLM-ready data.",
"ogUrl": "https://www.firecrawl.dev/",
"ogImage": "https://www.firecrawl.dev/og.png?123",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev",
"statusCode": 200
}
}
}
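Once the scrape response has been parsed into a dictionary, the fields most applications need sit under `data` and `data.metadata`. The following sketch pulls out a few of them. The function name `summarize_scrape` and the shortened sample response are illustrative assumptions; the key names come from the response shown above.

```python
def summarize_scrape(response):
    """Return a few commonly used fields from a parsed /v1/scrape response."""
    if not response.get("success"):
        raise ValueError("scrape did not succeed")
    data = response["data"]
    metadata = data.get("metadata", {})
    return {
        "title": metadata.get("title"),
        "source_url": metadata.get("sourceURL"),
        "status_code": metadata.get("statusCode"),
        "markdown_chars": len(data.get("markdown", "")),
        "has_html": "html" in data,  # only present when "html" was requested
    }


if __name__ == "__main__":
    # Shortened stand-in for the response above, not real API output.
    sample = {
        "success": True,
        "data": {
            "markdown": "Launch Week I is here!",
            "html": "<!DOCTYPE html><html></html>",
            "metadata": {
                "title": "Home - Firecrawl",
                "sourceURL": "https://firecrawl.dev",
                "statusCode": 200,
            },
        },
    }
    print(summarize_scrape(sample))
```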
Map
Returns all URLs of a website.
curl -X POST https://api.firecrawl.dev/v1/map \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev"
}'
Map with search
Searches for specific URLs within a website.
curl -X POST https://api.firecrawl.dev/v1/map \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"search": "docs"
}'
LLM extraction (Beta)
Extracts structured data from scraped pages. You can use either a predefined schema or a prompt.
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://www.mendable.ai/",
"formats": ["extract"],
"extract": {
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
}
}'
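LLM output does not always match the requested schema, so it can be worth checking the extracted object before using it. The sketch below performs a deliberately small check (required fields and primitive types only) against the schema from the request above. It is not a full JSON Schema validator, and `check_extract` and the sample values are hypothetical.

```python
# Schema copied from the curl request above.
MENDABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}

# Mapping from JSON Schema primitive type names to Python types.
JSON_TYPES = {
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
    "object": dict,
    "array": list,
}


def check_extract(extracted, schema):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in extracted:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in extracted:
            continue
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or unspecified type: skip the check
        value = extracted[name]
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f"{name}: expected {type_name}, got {type(value).__name__}")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only, not real extraction output.
    candidate = {"company_mission": "placeholder text", "supports_sso": True}
    for problem in check_extract(candidate, MENDABLE_SCHEMA):
        print(problem)
```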
Batch scrape
Scrapes multiple URLs at the same time.
curl -X POST https://api.firecrawl.dev/v1/batch/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"urls": ["https://docs.firecrawl.dev", "https://docs.firecrawl.dev/sdks/overview"],
"formats" : ["markdown", "html"]
}'
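When the URL list is long, it may be convenient to deduplicate it and split it into several batch requests. The sketch below builds payloads in the shape used by the curl example. The helper `build_batch_payloads` and the `batch_size` default are assumptions chosen for illustration, not a documented API limit.

```python
def build_batch_payloads(urls, formats=("markdown", "html"), batch_size=50):
    """Deduplicate URLs (keeping order) and split them into batch/scrape payloads."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    unique = list(dict.fromkeys(urls))  # dict preserves first-seen order
    return [
        {"urls": unique[start:start + batch_size], "formats": list(formats)}
        for start in range(0, len(unique), batch_size)
    ]


if __name__ == "__main__":
    payloads = build_batch_payloads(
        [
            "https://docs.firecrawl.dev",
            "https://docs.firecrawl.dev/sdks/overview",
            "https://docs.firecrawl.dev",  # duplicate, dropped
        ],
        batch_size=1,
    )
    for payload in payloads:
        print(payload)
```

Each resulting dictionary can be serialized with `json.dumps` and sent as the body of a separate POST to the batch endpoint.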
Search (Beta)
Searches the web, gets the most relevant results, scrapes each page, and returns Markdown. Set fetchPageContent to false for a fast SERP API.
curl -X POST https://api.firecrawl.dev/v0/search \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"query": "firecrawl",
"pageOptions": {
"fetchPageContent": true
}
}'
Interacting with pages using Actions (cloud only): perform actions such as clicking, scrolling, and typing before scraping the content.
3. Using the SDK (Python)
Installation
pip install firecrawl-py
Example (scrape and crawl)
from firecrawl.firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Scrape a website:
scrape_status = app.scrape_url(
'https://firecrawl.dev',
params={'formats': ['markdown', 'html']}
)
print(scrape_status)
# Crawl a website:
crawl_status = app.crawl_url(
'https://firecrawl.dev',
params={
'limit': 100,
'scrapeOptions': {'formats': ['markdown', 'html']}
},
poll_interval=30
)
print(crawl_status)
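Extracting structured data from a URL
With LLM extraction, you can easily extract structured data from any URL. Pydantic schemas are supported to make this easier. Here is how to use it:
from typing import List
from pydantic import BaseModel, Field
from firecrawl.firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Schema for a single story
class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str
# Schema for the whole result: a list of at most five stories
class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")
# Request the "extract" format and pass the JSON schema generated from the pydantic model
data = app.scrape_url('https://news.ycombinator.com', {
    'formats': ['extract'],
    'extract': {
        'schema': TopArticlesSchema.model_json_schema()
    }
})
print(data["extract"])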
4. Using the SDK (Node.js)
Installation
npm install @mendable/firecrawl-js
Example (scrape and crawl)
- Get an API key from firecrawl.dev
- Set the API key as an environment variable named FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp class.
import FirecrawlApp, { CrawlParams, CrawlStatusResponse } from '@mendable/firecrawl-js';
const app = new FirecrawlApp({apiKey: "fc-YOUR_API_KEY"});
// Scrape a website
const scrapeResponse = await app.scrapeUrl('https://firecrawl.dev', {
formats: ['markdown', 'html'],
});
if (scrapeResponse) {
console.log(scrapeResponse)
}
// Crawl a website
const crawlResponse = await app.crawlUrl('https://firecrawl.dev', {
limit: 100,
scrapeOptions: {
formats: ['markdown', 'html'],
}
} satisfies CrawlParams, true, 30) satisfies CrawlStatusResponse;
if (crawlResponse) {
console.log(crawlResponse)
}
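Extracting structured data from a URL
With LLM extraction, you can easily extract structured data from any URL. Zod schemas are supported to make this easier. Here is how to use it:
import FirecrawlApp from "@mendable/firecrawl-js";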
import { z } from "zod";
const app = new FirecrawlApp({
apiKey: "fc-YOUR_API_KEY"
});
// Define schema to extract contents into
const schema = z.object({
top: z
.array(
z.object({
title: z.string(),
points: z.number(),
by: z.string(),
commentsURL: z.string(),
})
)
.length(5)
.describe("Top 5 stories on Hacker News"),
});
const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
extractorOptions: { extractionSchema: schema },
});
console.log(scrapeResult.data["llm_extraction"]);
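Example Applications of Firecrawl
Firecrawl fits many scenarios: essentially any application that needs to get data from websites.
Firecrawl can handle a wide range of websites, including those with dynamic content, anti-bot mechanisms, and so on.
1. Build a "chat with your website" application
Use Firecrawl, Groq Llama 3, and Langchain to build a bot that can chat with your website.
2. Extract structured data from websites
Use the LLM extraction feature to extract the data you need based on a custom schema or prompt.
3. Quickly build a knowledge base
Crawl and organize website content to build a knowledge base that LLMs can use.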
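As a rough illustration of the knowledge-base use case, the Markdown returned for each crawled page can be cut into smaller chunks that keep a pointer back to their source page. This is a sketch under stated assumptions: `crawl_to_documents`, the paragraph-based splitting, and the `max_chars` limit are hypothetical choices, and the input is expected to have the same `data` / `markdown` / `metadata` shape as the crawl status response.

```python
def crawl_to_documents(crawl_result, max_chars=1000):
    """Turn crawl results into chunked documents with source metadata attached."""
    documents = []
    for page in crawl_result.get("data", []):
        metadata = page.get("metadata", {})
        source = metadata.get("sourceURL")
        title = metadata.get("title")

        def emit(text):
            documents.append({"text": text, "source": source, "title": title})

        chunk = ""
        # Split on blank lines and pack whole paragraphs into chunks.
        for paragraph in page.get("markdown", "").split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            if chunk and len(chunk) + len(paragraph) + 2 > max_chars:
                emit(chunk)  # current chunk is full, start a new one
                chunk = paragraph
            else:
                chunk = f"{chunk}\n\n{paragraph}" if chunk else paragraph
        if chunk:
            emit(chunk)
    return documents


if __name__ == "__main__":
    # Tiny stand-in for a crawl result, not real API output.
    result = {
        "data": [
            {
                "markdown": "# Intro\n\nFirst paragraph.\n\nSecond paragraph.",
                "metadata": {
                    "title": "Example page",
                    "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
                },
            }
        ]
    }
    for doc in crawl_to_documents(result, max_chars=30):
        print(doc["source"], repr(doc["text"]))
```

A paragraph longer than `max_chars` is kept whole in this sketch; a real pipeline would likely split such paragraphs further before embedding them.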
4. Automate data collection
Batch-scrape large numbers of URLs to collect the data you need automatically.