Docker Configuration for Firecrawl

While looking for web-crawling projects recently, I came across Firecrawl. The official documentation is fairly brief and I ran into quite a few problems in practice, so here I am documenting my configuration process and the fixes that worked.

Cloning the code

Simply clone the code from Firecrawl's GitHub repository to your local machine.
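For example (the URL below assumes the official mendableai repository; adjust it if the project has moved):

git clone https://github.com/mendableai/firecrawl.git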

Downloading and configuring Docker

Following the example given in the official documentation, create a .env file in the root directory.

Documentation: https://docs.firecrawl.dev/contributing/self-host

# .env

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8 
PORT=3002
HOST=0.0.0.0

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN= 
SUPABASE_URL= 
SUPABASE_SERVICE_TOKEN=

# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=

# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=

# Proxy Settings for Playwright (alternatively, you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API Key for transactional emails
RESEND_API_KEY=

# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO

If you need Supabase, set USE_DB_AUTHENTICATION to true; if you only want the crawling features, false is generally fine.
If you want the AI features, such as the official Extraction endpoint or the page-interaction features, you must also set OPENAI_API_KEY. The remaining variables can be filled in depending on which features you need; working them out takes some time.

Building and running the Docker containers

This was the most troublesome part; it took me about a day to get it working.

  1. Building the containers

cd into the Firecrawl directory:

docker compose build

At this point you may hit a permission denied error.
I found two possible fixes while researching the problem:
First (did not work for me):

# log out of Docker first
docker logout
# then log back in
docker login

Second (worked):
Run the command with administrator privileges:

sudo docker compose build

After entering your system password, the build starts.
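As a side note, a common longer-term fix for the permission denied error (a general Docker tip, not something I tried here) is adding your user to the docker group so that sudo is no longer needed:

# add the current user to the docker group
sudo usermod -aG docker $USER
# apply the new group membership (or log out and back in)
newgrp docker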
Then comes a thornier problem: many Docker registry mirrors in China have been taken down, so image pulls fail with timeout or EOF errors. You have to find a mirror that still works, switch to it, and rebuild.
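For reference, on Linux switching mirrors means editing /etc/docker/daemon.json and restarting the Docker daemon. The mirror URL below is only a placeholder, not a recommendation, and the command overwrites any existing daemon.json; substitute whichever mirror is currently reachable for you:

# write the mirror config (replace the URL with a working mirror)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://your-mirror.example.com"]
}
EOF
# restart the daemon so the new mirror takes effect
sudo systemctl restart docker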
Even after switching mirrors, the build can still take a long time; in my case each image pull took ten to twenty minutes. The bottleneck is the three base images referenced in the Dockerfiles: python:3.11-slim, node:20-slim, and golang:1.19.
So my approach was to pull the three images to the local machine first and then build the containers:

docker pull python:3.11-slim
docker pull node:20-slim
docker pull golang:1.19
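You can quickly confirm the three images are now cached locally:

docker images | grep -E 'python|node|golang'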

With these images cached locally, the build goes much faster:

sudo docker compose build

Running the containers

Once the build finishes, start the containers:

docker compose up
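docker compose up keeps the logs attached to your terminal. If you would rather run the stack in the background and inspect logs separately, the standard Compose flags apply here as well:

# start all services in detached mode
sudo docker compose up -d
# follow the combined logs
sudo docker compose logs -f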


Testing

Once all the containers are up, send requests from the terminal to test:

curl -X GET http://localhost:3002/test
# returns Hello, world!
curl -X POST http://localhost:3002/v0/crawl \
    -H 'Content-Type: application/json' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'
# returns a job id, e.g. {"jobId":"092aec62-18f3-4cc3-acc9-83ff65d36b9a"}
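With the returned jobId you can poll the job's progress. The status endpoint below is the v0 route this setup exposes (check it against your Firecrawl version), using the example job id from above:

curl -X GET http://localhost:3002/v0/crawl/status/092aec62-18f3-4cc3-acc9-83ff65d36b9a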

Calling the service from code

To call the service from your own code, just set the api_url parameter of FirecrawlApp.
Here is an example that uses the local service to scrape a page:

from firecrawl import FirecrawlApp

# point the SDK at the self-hosted instance instead of the cloud API
app = FirecrawlApp(api_key="fc-YOUR_API_KEY", api_url="http://localhost:3002/")

# Scrape a single page:
scrape_result = app.scrape_url('https://firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result['markdown'])
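One note on api_key: as far as I can tell, with USE_DB_AUTHENTICATION=false the self-hosted API does not validate the key, so any non-empty placeholder satisfies the SDK; verify this against your Firecrawl version.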