Scrapy: with CrawlerRunner, data never reaches the database and pipelines are not enabled

風の住む街~

Category: Scrapy crawler framework

Article link: https://blog.csdn.net/weixin_38924500/article/details/105287737

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through its internal API.

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

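The same pattern applied to the spiders of my own project. Note that calling runner.crawl() several times and then runner.join() starts all the spiders concurrently; running them one after another requires chaining the deferreds instead: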

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor


from ThreatIntellgence.spiders.antiy import AntiySpider
from ThreatIntellgence.spiders.kaspersky import KasperskySpider
from ThreatIntellgence.spiders.a360safety import A360safetySpider
from ThreatIntellgence.spiders.AccentureSecurity import AccenturesecuritySpider

configure_logging()
runner = CrawlerRunner()
runner.crawl(AntiySpider)
runner.crawl(KasperskySpider)
runner.crawl(A360safetySpider)
runner.crawl(AccenturesecuritySpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()
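
The script above starts all four spiders at once. For running them strictly one after another, a minimal sketch of deferred chaining might look like the following. The helper name crawl_sequentially is hypothetical (it is not part of Scrapy); it only assumes that runner.crawl() returns a Deferred, which is how CrawlerRunner behaves.

```python
def crawl_sequentially(runner, spiders):
    """Start each spider only after the previous crawl has finished.

    runner  -- a CrawlerRunner (anything whose crawl() returns a Deferred)
    spiders -- spider classes or spider names, in the order to run them
    """
    spiders = list(spiders)
    if not spiders:
        raise ValueError("at least one spider is required")

    d = runner.crawl(spiders[0])
    for spider in spiders[1:]:
        # A callback that returns a Deferred pauses the chain until that
        # Deferred fires, so the next crawl waits for the previous one.
        # spider=spider binds the current loop value to this lambda.
        d.addCallback(lambda _, spider=spider: runner.crawl(spider))
    return d


# Usage, mirroring the script above:
#   d = crawl_sequentially(runner, [AntiySpider, KasperskySpider])
#   d.addBoth(lambda _: reactor.stop())
#   reactor.run()
```

If one crawl fails with an error, the remaining callbacks in the chain are skipped, so the later spiders would not start; whether that is the desired behaviour depends on the project.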

Note

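As written, the CrawlerRunner scripts above only run the spiders. They do not enable the project's pipelines, so the pipeline methods never run and the items cannot be stored in the database through them. The likely cause is that CrawlerRunner() is created without any settings, so the ITEM_PIPELINES entry in the project's settings.py is never loaded.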
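
One way to keep CrawlerRunner and still get the pipelines is to pass it the project settings. The following is a sketch under assumptions, not a verified fix for this project: it assumes the script is started from inside the Scrapy project (so that scrapy.cfg can be found), and the helper names pipelines_configured and run_with_project_settings are hypothetical. Depending on the Scrapy version and the TWISTED_REACTOR setting, the reactor may also need to be installed explicitly before it is imported.

```python
def pipelines_configured(settings):
    """Return True if the settings define at least one item pipeline."""
    # Works with a Scrapy Settings object or a plain dict; both expose .get().
    return bool(settings.get("ITEM_PIPELINES"))


def run_with_project_settings(spider_names):
    """Run the named spiders concurrently with the project's settings loaded."""
    # Scrapy and Twisted are imported lazily so that the helper above can be
    # used on its own without starting anything.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor

    settings = get_project_settings()  # loads settings.py, located via scrapy.cfg
    if not pipelines_configured(settings):
        raise RuntimeError("ITEM_PIPELINES is empty; check settings.py")

    configure_logging(settings)
    runner = CrawlerRunner(settings)  # settings passed in, unlike CrawlerRunner()
    for name in spider_names:
        # A string is resolved to a spider class by the project's spider loader.
        runner.crawl(name)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until every crawl has finished
```

Calling run_with_project_settings(["antiy", "kaspersky"]) from the project directory would then run both spiders with the settings from settings.py applied, including ITEM_PIPELINES.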

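To run all the spiders with their pipelines enabled, one simple approach is to launch each of them through the scrapy crawl command, which loads the project settings: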

import os


os.system("scrapy crawl antiy -s CLOSESPIDER_TIMEOUT=30")   # close the spider if it is still running after 30 seconds
os.system("scrapy crawl kaspersky -s CLOSESPIDER_TIMEOUT=30")
os.system("scrapy crawl a360safety -s CLOSESPIDER_TIMEOUT=30")
os.system("scrapy crawl Akamai -s CLOSESPIDER_TIMEOUT=30")
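
os.system() throws away most information about whether a crawl succeeded. A slightly more defensive variant of the same idea, sketched here with subprocess.run, records each command's exit code. The function names and the run parameter are hypothetical and exist only for illustration; like the os.system() version, this assumes it is started from the project directory with scrapy on the PATH.

```python
import subprocess


def build_crawl_command(spider_name, timeout=30):
    """Build the argument list for one `scrapy crawl` invocation."""
    # CLOSESPIDER_TIMEOUT closes the spider if it is still open after `timeout` seconds.
    return ["scrapy", "crawl", spider_name, "-s", f"CLOSESPIDER_TIMEOUT={timeout}"]


def run_spiders(spider_names, timeout=30, run=subprocess.run):
    """Run the spiders one after another and return {spider name: exit code}."""
    results = {}
    for name in spider_names:
        # Each call blocks until that scrapy process exits.
        completed = run(build_crawl_command(name, timeout))
        results[name] = completed.returncode
    return results
```

For example, run_spiders(["antiy", "kaspersky", "a360safety"]) returns a dictionary of exit codes, where a non-zero value suggests that the corresponding scrapy process did not finish cleanly.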
