Starting Spiders via the Scrapy API

This article explains how to use Scrapy's CrawlerProcess and CrawlerRunner APIs to start and run spiders from a script. CrawlerProcess suits simply launching spiders, while CrawlerRunner offers finer control, including explicitly running and shutting down the Twisted reactor. Examples of configuring logging and running multiple spiders are also included.

Scrapy not only provides the scrapy crawl <spider> command to start a spider; it also offers an API for launching spiders from a script.

Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor.

Two APIs are available for running spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.

scrapy.crawler.CrawlerProcess

This class internally starts the Twisted reactor, configures logging, and sets the reactor to shut down automatically; it is the class used by all Scrapy commands.

Running a single spider

import scrapy


class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        # Issue the first request explicitly instead of relying on the default
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)   # or the spider name: 'qiushiSpider'
    process.start()

The argument to process.crawl() can be either the spider class, QiushispiderSpider, or the spider name, 'qiushiSpider'. Looking a spider up by name requires project settings, so the spider loader knows where to search; see the sketch below.
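A minimal sketch of crawling by name, assuming the script lives inside a Scrapy project whose settings define SPIDER_MODULES so the spider loader can resolve 'qiushiSpider':

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Project settings are required here: the spider loader uses
# SPIDER_MODULES to map the name 'qiushiSpider' to its class.
process = CrawlerProcess(get_project_settings())
process.crawl('qiushiSpider')
process.start()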

Launched this way, the spider does not pick up the project's settings file; the log confirms that nothing was overridden:

2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}

Loading the project settings

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider)   # or the spider name: 'qiushiSpider'
process.start()
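get_project_settings() locates the project via the SCRAPY_SETTINGS_MODULE environment variable (falling back to the scrapy.cfg of the surrounding project) and returns a Settings object populated from settings.py.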

Running multiple spiders

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
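process.start() runs the Twisted reactor and blocks until every spider has finished; by default (stop_after_crawl=True) the reactor is stopped automatically afterwards.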

scrapy.crawler.CrawlerRunner

  1. Finer control over the crawling process
  2. The Twisted reactor is started and stopped explicitly
  3. A callback must be added to the Deferred returned by CrawlerRunner.crawl

Running a single spider

import scrapy


class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        # Issue the first request explicitly instead of relying on the default
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())   # stop the reactor on success or failure
    reactor.run()   # the script will block here until the crawling is finished

configure_logging sets the log output format; it accepts a dict or a Settings object with keys such as LOG_FORMAT and LOG_LEVEL.
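For full control over logging, configure_logging can skip installing Scrapy's root handler so the standard logging module takes over; a minimal sketch (the file name log.txt is just an example):

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.txt',                   # hypothetical output file
    format='%(levelname)s: %(message)s',
    level=logging.INFO,
)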

addBoth attaches a single callback that fires whether the crawl succeeds or fails; here it stops the Twisted reactor once the crawl is done.
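To handle success and failure separately, Twisted's addCallbacks can replace addBoth; a minimal sketch reusing runner and QiushispiderSpider from the example above (the handler names are illustrative):

from twisted.internet import reactor

def on_success(result):
    print('crawl finished')
    reactor.stop()

def on_failure(failure):
    print('crawl failed:', failure)
    reactor.stop()

d = runner.crawl(QiushispiderSpider)
d.addCallbacks(on_success, on_failure)   # callback for success, errback for failure
reactor.run()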

Running multiple spiders

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

The same can also be implemented asynchronously with defer.inlineCallbacks, chaining the Deferreds so the spiders run one after another:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)   # MySpider2 starts only after MySpider1 finishes
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()   # the script will block here until the last crawl call is finished
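Note the difference: the runner.join() version crawls both spiders concurrently, while the inlineCallbacks version waits for each crawl's Deferred before starting the next, so the spiders run one at a time.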
