Starting One or More Scrapy Spiders Through the Core API

Python之战

Article tags: Python, Scrapy, web crawler

Article link: https://blog.csdn.net/weixin_41624982/article/details/88752980

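This article explains how to use Scrapy's CrawlerProcess and CrawlerRunner APIs to start one or more spiders from a Python script. Scrapy is built on Twisted. CrawlerProcess starts the Twisted reactor and handles settings for you, which makes it the simpler choice for running a spider from a standalone script. CrawlerRunner gives finer control and suits running several spiders in the same process; with it, the Twisted reactor must be shut down manually once the spiders have finished. The examples show how to run multiple Scrapy spiders in parallel.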

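Instead of the usual scrapy crawl command, Scrapy can also be run from a script through its API. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor. Two APIs are available for running one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.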

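The first utility for starting spiders is scrapy.crawler.CrawlerProcess. This class starts the Twisted reactor for you, configures logging, and sets up shutdown handlers. It is the class used by all Scrapy commands.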

Example: running a single spider with CrawlerProcess.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Settings can be passed to CrawlerProcess. Use get_project_settings to obtain a Settings instance that holds the project's settings.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
