Deploying Firecrawl Locally

Latest recommended article published on 2025-04-30 15:23:10

数据Ai指北

Article link: https://blog.csdn.net/shujuelin/article/details/145022912

A reader asked me how the news app I built gets its data, and in particular how to collect news that comes from many different web pages.

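For a complex application, the data acquisition pipeline usually has to pull from multiple web sources. First, web scraping tools (such as Python's Scrapy or BeautifulSoup) extract structured data from the target news sites: title, body, publication time, author, and so on.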

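Beyond that, you have to deal with anti-scraping mechanisms, for example by simulating user behavior or using proxy IPs. Finally, the collected data is cleaned, deduplicated, and stored so that the app can display it.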
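The cleaning and deduplication step can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the record fields (`title`, `body`, `source`) and the helper names `clean_text`, `article_key`, and `dedupe` are assumptions made for this example, not part of any scraping framework, and a real pipeline would likely need fuzzier matching than an exact hash.

```python
import hashlib
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def article_key(article: dict) -> str:
    """Fingerprint an article by its cleaned, lowercased title and body."""
    basis = (
        clean_text(article["title"]).lower()
        + "\n"
        + clean_text(article["body"]).lower()
    )
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()


def dedupe(articles: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct article, preserving order."""
    seen: set[str] = set()
    unique = []
    for article in articles:
        key = article_key(article)
        if key in seen:
            continue  # same title and body already stored
        seen.add(key)
        unique.append(
            {
                **article,
                "title": clean_text(article["title"]),
                "body": clean_text(article["body"]),
            }
        )
    return unique


if __name__ == "__main__":
    # Hypothetical records, as two different sites might return them.
    raw = [
        {"title": "  Hello   World ", "body": "A\n\nB", "source": "site-a"},
        {"title": "hello world", "body": "a b", "source": "site-b"},
        {"title": "Other", "body": "C", "source": "site-c"},
    ]
    for item in dedupe(raw):
        print(item["source"], "|", item["title"])
```

Hashing the normalized title and body means the same story republished with different whitespace or capitalization is stored only once, while the first source seen is kept.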

News content is generally static, so you can simply hand the URL to a crawler product such as Firecrawl.

So today let's talk about Firecrawl. It is also built into Dify, so anyone can use it from there.

Because the hosted version has a usage quota, I deployed it on my own server and now use it for free 😊.

1. Firecrawl

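Firecrawl is a crawler tool that can crawl all accessible subpages of a website without needing a sitemap. Compared with traditional crawlers, Firecrawl is particularly good at handling sites whose content is generated dynamically with JavaScript, and it can convert the pages it fetches into LLM-ready data.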

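In the simplest case you only need to supply a URL, and Firecrawl fetches the content behind it; it can also use an LLM to extract information. Firecrawl's hosted service is paid, and the free tier only includes 500 credits, so next let's look at how to run it ourselves.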
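To make "just supply a URL" concrete, here is a minimal sketch of how a request to a self-hosted instance might be built with Python's standard library. Several things here are assumptions rather than facts from this article: the base URL `http://localhost:3002` (matching the `PORT` value in the `.env` file shown later), the `/v1/scrape` route (check the API documentation for the version you actually deploy), and the helper names `build_scrape_request` and `scrape`.

```python
import json
import urllib.request

# Assumption: a self-hosted instance reachable on the PORT from .env (3002).
BASE_URL = "http://localhost:3002"
# Assumption: the scrape route; verify it against the version you deploy.
SCRAPE_PATH = "/v1/scrape"


def build_scrape_request(
    page_url: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking the service to scrape one page."""
    payload = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + SCRAPE_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(page_url: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response unchanged."""
    request = build_scrape_request(page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the request, so this runs without a server.
    req = build_scrape_request("https://example.com/news/1")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload first; calling `scrape(...)` only makes sense once the service from the next section is running.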

2. Deploying Firecrawl locally

(1) Basic configuration

First, clone the code from GitHub to your machine, then make sure Docker is installed on your server.

Following the example in the official documentation, create a .env file in the project root and configure it:

# .env

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8 
PORT=3002
HOST=0.0.0.0

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN= 
SUPABASE_URL= 
SUPABASE_SERVICE_TOKEN=

# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=

# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=

# Proxy Settings for Playwright (Alternatively, you can use a proxy service like oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API Key for transactional emails
RESEND_API_KEY=

# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO

If you want to use Supabase, set USE_DB_AUTHENTICATION to true; if you only need the crawling features, leaving it as false is generally enough.

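If you need the AI features, such as the official Extraction feature or some of the page-interaction features, you must set OPENAI_API_KEY. Configure the remaining parameters according to the features you need; this takes some time and experimentation.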
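Before starting the containers, it can help to sanity-check the file. The sketch below is not part of Firecrawl; it is a small standalone checker written for this article, and the names `parse_env` and `check_env` are made up for illustration. It treats the keys under "Required ENVS" in the listing above as mandatory, and requires the three Supabase values only when USE_DB_AUTHENTICATION is true. The parser is deliberately simple and does not handle quoted values or inline comments.

```python
# Keys listed under "Required ENVS" in the .env above.
REQUIRED_KEYS = [
    "NUM_WORKERS_PER_QUEUE",
    "PORT",
    "HOST",
    "REDIS_URL",
    "REDIS_RATE_LIMIT_URL",
    "PLAYWRIGHT_MICROSERVICE_URL",
    "USE_DB_AUTHENTICATION",
]
# Only needed when DB authentication is switched on.
SUPABASE_KEYS = ["SUPABASE_ANON_TOKEN", "SUPABASE_URL", "SUPABASE_SERVICE_TOKEN"]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip()
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [
        f"{key} is missing or empty" for key in REQUIRED_KEYS if not values.get(key)
    ]
    if values.get("USE_DB_AUTHENTICATION", "").lower() == "true":
        problems += [
            f"{key} is needed when USE_DB_AUTHENTICATION=true"
            for key in SUPABASE_KEYS
            if not values.get(key)
        ]
    return problems


if __name__ == "__main__":
    # A shortened sample; in practice you would read your own .env file.
    sample = "PORT=3002\nHOST=0.0.0.0\nUSE_DB_AUTHENTICATION=true\nSUPABASE_URL=\n"
    for problem in check_env(parse_env(sample)):
        print(problem)
```

To check a real file, pass its contents instead of the sample string, for example `check_env(parse_env(open(".env", encoding="utf-8").read()))`.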