Test environment
# Environment 1
Python 3.6.5
Scrapy==1.5.0
# Environment 2
Python 2.7.5
Scrapy==1.1.2
I. Running a spider from the command line
1. Write the spider file baidu.py
# -*- coding: utf-8 -*-
from scrapy import Spider


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")
2. Run the spider (two ways)
# Run the spider from inside a Scrapy project
$ scrapy crawl baidu
# Run a spider file without creating a project
$ scrapy runspider baidu.py
II. Running a spider from a script
1. Running a spider with cmdline
# -*- coding: utf-8 -*-
from scrapy import cmdline, Spider


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu".split())
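Splitting a command string on whitespace stops working as soon as an argument contains a space. A minimal sketch of building the argument list explicitly instead: the helper name `crawl_argv` and the spider argument `keyword` are hypothetical, while `-a` and `-s` are the usual `scrapy crawl` options for spider arguments and setting overrides.

```python
def crawl_argv(spider_name, spider_args=None, settings=None):
    """Build the argument list for `scrapy crawl` (hypothetical helper)."""
    argv = ["scrapy", "crawl", spider_name]
    # -a NAME=VALUE passes an argument to the spider
    for key, value in sorted((spider_args or {}).items()):
        argv += ["-a", "{}={}".format(key, value)]
    # -s NAME=VALUE overrides a setting for this run
    for key, value in sorted((settings or {}).items()):
        argv += ["-s", "{}={}".format(key, value)]
    return argv


# "keyword" is an illustrative spider argument, not one defined by BaiduSpider
example = crawl_argv("baidu", {"keyword": "scrapy tips"}, {"LOG_LEVEL": "INFO"})
```

The resulting list can be handed to `cmdline.execute` in place of the split string, for example `cmdline.execute(crawl_argv("baidu"))`.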
2. Running a spider with CrawlerProcess
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # get_project_settings() loads the project's settings
    process = CrawlerProcess(get_project_settings())
    process.crawl(BaiduSpider)
    # start() blocks until the crawl has finished
    process.start()
3. Running a spider with CrawlerRunner
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # CrawlerRunner does not set up logging, so without this call
    # nothing is printed to the console
    configure_logging(
        {
            'LOG_FORMAT': '%(message)s'
        }
    )
    runner = CrawlerRunner()
    d = runner.crawl(BaiduSpider)
    # Stop the reactor once the crawl finishes, whether it succeeded or failed
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
III. Running multiple spiders from a script
Add a second spider, SinaSpider, to the project:
# -*- coding: utf-8 -*-
from scrapy import Spider


class SinaSpider(Spider):
    name = 'sina'
    start_urls = ['https://www.sina.com.cn/']

    def parse(self, response):
        self.log("run sina")
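1. cmdline cannot run multiple spiders
If the two calls are placed one after the other, the program exits as soon as the first one finishes, because cmdline.execute ends by calling sys.exit(). The second call is never reached.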
# -*- coding: utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl baidu".split())
cmdline.execute("scrapy crawl sina".split())
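If the spiders really do have to go through the command-line entry point, one workaround is to give each `scrapy crawl` its own process, so that one exiting does not take the script down with it. A rough sketch using only the standard library follows; the helper `run_spiders` is hypothetical, and it assumes the `scrapy` command is on the PATH and that the script is started from the project directory.

```python
import subprocess


def run_spiders(spider_names, run=subprocess.call):
    """Run `scrapy crawl <name>` once per spider, one process at a time.

    `run` defaults to subprocess.call and can be swapped out for testing.
    Returns a dict mapping each spider name to its exit code.
    """
    exit_codes = {}
    for name in spider_names:
        # Each call blocks until that spider's process has exited
        exit_codes[name] = run(["scrapy", "crawl", name])
    return exit_codes
```

Calling `run_spiders(["baidu", "sina"])` would then run the two spiders one after the other, each with a fresh Scrapy process.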
I did once write a script that used cmdline to run multiple spiders, but the two approaches below replace it more elegantly.
2. Running multiple spiders with CrawlerProcess
Note: the spider files in the project are:
scrapy_demo/spiders/baidu.py
scrapy_demo/spiders/sina.py
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider
process = CrawlerProcess()
process.crawl(BaiduSpider)
process.crawl(SinaSpider)
process.start()
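Run this way, the log in my test showed the middleware being enabled only once, and the two spiders sent their requests at almost the same time. Both spiders run concurrently in the same process, so they are not isolated from each other and may interfere.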
3. Running multiple spiders with CrawlerRunner
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider
configure_logging()
runner = CrawlerRunner()
runner.crawl(BaiduSpider)
runner.crawl(SinaSpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
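In my test this approach also enabled the middleware only once. Note that calling runner.crawl() twice and then runner.join(), as above, starts both spiders in the same reactor at the same time. Running them strictly one after another, which reduces interference, requires chaining the Deferreds returned by runner.crawl(). The official documentation covers running multiple spiders with CrawlerRunner both simultaneously and sequentially.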
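Summary
Method | Reads settings.py | Number of spiders |
---|---|---|
$ scrapy crawl baidu | Yes | One |
$ scrapy runspider baidu.py | Yes (when run inside a project) | One |
cmdline.execute | Yes | One (recommended) |
CrawlerProcess | Only when given get_project_settings() | One or many |
CrawlerRunner | Only when given get_project_settings() | One or many (recommended) |
For a single spider file, cmdline.execute needs the least setup: configure it once and run it as often as needed.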