Docker Configuration for Firecrawl

While looking for web-crawling projects recently, I came across Firecrawl. The official documentation is fairly brief and I ran into quite a few problems in practice, so here I am documenting my configuration process and the fixes that worked.

Cloning the code

Simply clone the code from Firecrawl's GitHub repository to your local machine.
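For example (the URL below assumes the official mendableai repository; adjust it if the project has moved):

git clone https://github.com/mendableai/firecrawl.git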

Downloading and configuring Docker

Following the example given in the official documentation, create a .env file in the root directory.

Documentation: https://docs.firecrawl.dev/contributing/self-host

# .env

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8 
PORT=3002
HOST=0.0.0.0

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN= 
SUPABASE_URL= 
SUPABASE_SERVICE_TOKEN=

# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=

# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=

# Proxy Settings for Playwright (alternatively, you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API Key for transactional emails
RESEND_API_KEY=

# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO

If you need Supabase, set USE_DB_AUTHENTICATION to true; if you only want the crawling features, false is generally fine.
If you want the AI features, such as the official Extraction endpoint or the page-interaction features, you must also set OPENAI_API_KEY. The remaining variables can be filled in depending on which features you need; working them out takes some time.

Building and running the Docker containers

This was the most troublesome part; it took me about a day to get it working.

  1. Building the containers

cd into the Firecrawl directory:

docker compose build

At this point you may hit a permission denied error.
I found two possible fixes while researching the problem:
First (did not work for me):

# log out of Docker first
docker logout
# then log back in
docker login

Second (worked):
Run the command with administrator privileges:

sudo docker compose build

After entering your system password, the build starts.
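As a side note, a common longer-term fix for the permission denied error (a general Docker tip, not something I tried here) is adding your user to the docker group so that sudo is no longer needed:

# add the current user to the docker group
sudo usermod -aG docker $USER
# apply the new group membership (or log out and back in)
newgrp docker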
Then comes a thornier problem: many Docker registry mirrors in China have been taken down, so image pulls fail with timeout or EOF errors. You have to find a mirror that still works, switch to it, and rebuild.
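For reference, on Linux switching mirrors means editing /etc/docker/daemon.json and restarting the Docker daemon. The mirror URL below is only a placeholder, not a recommendation, and the command overwrites any existing daemon.json; substitute whichever mirror is currently reachable for you:

# write the mirror config (replace the URL with a working mirror)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://your-mirror.example.com"]
}
EOF
# restart the daemon so the new mirror takes effect
sudo systemctl restart docker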
Even after switching mirrors, the build can still take a long time; in my case each image pull took ten to twenty minutes. The bottleneck is the three base images referenced in the Dockerfiles: python:3.11-slim, node:20-slim, and golang:1.19.
So my approach was to pull the three images to the local machine first and then build the containers:

docker pull python:3.11-slim
docker pull node:20-slim
docker pull golang:1.19
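You can quickly confirm the three images are now cached locally:

docker images | grep -E 'python|node|golang'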

With these images cached locally, the build goes much faster:

sudo docker compose build

Running the containers

Once the build finishes, start the containers:

docker compose up
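docker compose up keeps the logs attached to your terminal. If you would rather run the stack in the background and inspect logs separately, the standard Compose flags apply here as well:

# start all services in detached mode
sudo docker compose up -d
# follow the combined logs
sudo docker compose logs -f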


Testing

Once all the containers are up, send requests from the terminal to test:

curl -X GET http://localhost:3002/test
# returns Hello, world!
curl -X POST http://localhost:3002/v0/crawl \
    -H 'Content-Type: application/json' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'
# returns a job id, e.g. {"jobId":"092aec62-18f3-4cc3-acc9-83ff65d36b9a"}
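With the returned jobId you can poll the job's progress. The status endpoint below is the v0 route this setup exposes (check it against your Firecrawl version), using the example job id from above:

curl -X GET http://localhost:3002/v0/crawl/status/092aec62-18f3-4cc3-acc9-83ff65d36b9a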

Calling the service from code

To call the service from your own code, just set the api_url parameter of FirecrawlApp.
Here is an example that uses the local service to scrape a page:

from firecrawl import FirecrawlApp

# point the SDK at the self-hosted instance instead of the cloud API
app = FirecrawlApp(api_key="fc-YOUR_API_KEY", api_url="http://localhost:3002/")

# Scrape a single page:
scrape_result = app.scrape_url('https://firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result['markdown'])
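One note on api_key: as far as I can tell, with USE_DB_AUTHENTICATION=false the self-hosted API does not validate the key, so any non-empty placeholder satisfies the SDK; verify this against your Firecrawl version.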