Launching multiple Scrapy spiders at once (two approaches: cmdline and subprocess)

菜鸟也想要高飞

Article link: https://blog.csdn.net/qq_36365528/article/details/118914363

Launching multiple Scrapy spiders at once

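Sometimes we write a fairly generic spider and pass it different arguments to crawl different sites or different page types.
In that situation there are two ways to launch several spiders in one go: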

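Subclass ScrapyCommand to define a custom crawlall command
Start each spider in its own child process with subprocess.Popen (the spiders can then run sequentially or in parallel)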

Launching multiple spiders with subprocess.Popen

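subprocess.Popen starts each spider in a separate child process, so the spiders can run either one after another or at the same time.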

Launching spiders sequentially with subprocess.Popen

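In the code below:

PYTHONPATH is set in env so that the child process can import the project's modules
the read loop prints the spider's log output as it arrives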

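import os
import subprocess
from os.path import abspath, dirname

def run_command(command):
    # Put the two parent directories of this script's folder on PYTHONPATH
    # so that the child process can import the project's modules.
    env = os.environ.copy()
    here = dirname(abspath(__file__))
    env['PYTHONPATH'] = os.pathsep.join([dirname(dirname(here)), dirname(here)])
    # Scrapy writes its log to stderr by default, so merge stderr into stdout.
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, env=env)
    # Print each line as it arrives; the loop ends when the child closes its output.
    for line in process.stdout:
        print(line.decode('utf-8', errors='replace').strip())
    return process.wait()

# Launch the spiders one after another; only one is running at any time.
# scrapy_cmd_list holds command strings such as 'scrapy crawl myspider'.
for scrapy_cmd in scrapy_cmd_list:
    run_command(scrapy_cmd)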
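
The loop above relies on a scrapy_cmd_list that the article never defines. One possible way to build it, sketched under assumptions: the helper build_scrapy_commands, the spider name myspider, and the argument names site and page_type are all hypothetical. Only the scrapy crawl NAME -a NAME=VALUE command-line form comes from Scrapy itself. The quoting done by shlex.join follows POSIX shell rules, so commands meant for cmd.exe on Windows may need different handling.

```python
import shlex

def build_scrapy_commands(spider_name, param_sets):
    """Return one 'scrapy crawl' command string per parameter set."""
    commands = []
    for params in param_sets:
        parts = ['scrapy', 'crawl', spider_name]
        for name, value in params.items():
            # Each spider argument is passed as -a NAME=VALUE.
            parts += ['-a', f'{name}={value}']
        # shlex.join quotes every part for a POSIX shell.
        commands.append(shlex.join(parts))
    return commands

# Hypothetical spider name and arguments, for illustration only.
scrapy_cmd_list = build_scrapy_commands(
    'myspider',
    [{'site': 'example.com', 'page_type': 'list'},
     {'site': 'example.org', 'page_type': 'detail'}],
)
for cmd in scrapy_cmd_list:
    print(cmd)
```

Each resulting string can be handed to run_command as is, since run_command passes it to the shell.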

Running spiders in parallel with subprocess.Popen

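import os
import subprocess
from os.path import abspath, dirname

def run_command(command):
    # Same environment setup as in the sequential version,
    # but return the process immediately instead of waiting for it.
    env = os.environ.copy()
    here = dirname(abspath(__file__))
    env['PYTHONPATH'] = os.pathsep.join([dirname(dirname(here)), dirname(here)])
    # Scrapy writes its log to stderr by default, so merge stderr into stdout.
    return subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, env=env)

# Start every spider up front so that they all run at the same time.
process_list = [run_command(scrapy_cmd) for scrapy_cmd in scrapy_cmd_list]

while process_list:
    # Iterate over a copy so that finished processes can be removed safely.
    for process in list(process_list):
        # Note: readline() blocks until this process prints a line or closes its output,
        # so a quiet spider delays the output of the others.
        output = process.stdout.readline()
        if output:
            print(output.decode('utf-8', errors='replace').strip())
        if process.poll() is not None:
            # The process has exited: print whatever is still buffered, then drop it.
            for line in process.stdout:
                print(line.decode('utf-8', errors='replace').strip())
            process_list.remove(process)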
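
Polling the processes in turn with readline() means one quiet spider can hold up the output of the others. A possible alternative, not the article's method, is to give each child its own reader thread. The sketch below uses only the standard library. The names stream_output, run_parallel and demo_commands are hypothetical, and the child commands are plain Python one-liners standing in for scrapy crawl commands so that the example runs without Scrapy installed.

```python
import subprocess
import sys
import threading

def stream_output(name, process, sink):
    """Forward every output line of one child, tagged with the child's name."""
    for raw in process.stdout:
        line = f"[{name}] {raw.decode('utf-8', errors='replace').rstrip()}"
        sink.append(line)
        print(line)
    process.wait()

def run_parallel(commands):
    """Start all commands at once and stream their output concurrently."""
    lines, threads = [], []
    for name, command in commands.items():
        process = subprocess.Popen(command, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        thread = threading.Thread(target=stream_output, args=(name, process, lines))
        thread.start()
        threads.append(thread)
    # Wait until every reader thread has seen its child finish.
    for thread in threads:
        thread.join()
    return lines

# Stand-in commands (assumption): replace them with real 'scrapy crawl ...' commands.
demo_commands = {
    'sp1': [sys.executable, '-c', "print('sp1 done')"],
    'sp2': [sys.executable, '-c', "print('sp2 done')"],
}
collected = run_parallel(demo_commands)
```

Because each thread blocks only on its own pipe, a slow spider no longer delays the log lines of the others. The order in which lines from different spiders interleave is not deterministic.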

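Why not simply use scrapy.cmdline.execute or os.system?

Because scrapy.cmdline.execute calls sys.exit() once the command finishes, the whole program exits as soon as the spider is done instead of carrying on with the code that follows.

from scrapy import cmdline

cmdline.execute('scrapy crawl myspider'.split())
print('program finished')

"program finished" is never printed, because the process exits inside cmdline.execute. os.system, for its part, ran into PYTHONPATH problems, so subprocess.Popen, which accepts an env argument for setting environment variables, is used instead.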
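
The mechanism can be reproduced without Scrapy. In the sketch below, fake_execute is a hypothetical stand-in for scrapy.cmdline.execute that does nothing except call sys.exit(), and run_and_continue is likewise an illustrative name. It only shows why the line after the call never runs. It is not a recipe for running a second crawl in the same process, since the Twisted reactor that Scrapy runs on cannot be restarted once it has stopped.

```python
import sys

def fake_execute():
    """Hypothetical stand-in for scrapy.cmdline.execute: finish, then exit."""
    sys.exit(0)

def run_and_continue():
    try:
        fake_execute()
    except SystemExit as exc:
        # sys.exit() raises SystemExit; left uncaught, it ends the interpreter,
        # which is why code placed after the call is never reached.
        return exc.code

exit_code = run_and_continue()
print('exit code:', exit_code)
```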

Launching multiple spiders with a custom command

A custom command gives full control over argument passing: it can launch several spiders and hand different arguments to different spiders.

from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        # Older, optparse-based Scrapy releases use parser.add_option here instead.
        parser.add_argument("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                            help="set spider argument (may be repeated)")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            # Turn ['arg1=x', 'arg2=y'] into {'arg1': 'x', 'arg2': 'y'}.
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        arg1 = opts.spargs['arg1']
        arg2 = opts.spargs['arg2']
        spider_list = ['sp1', 'sp2']
        # Schedule every spider first, then start the reactor once;
        # start() blocks until all scheduled spiders have finished.
        for spider_name in spider_list:
            self.crawler_process.crawl(spider_name, arg1=arg1, arg2=arg2)
        self.crawler_process.start()
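
For Scrapy to pick up the command, the class above has to live in a module named after the command, and the project settings have to point at the package that contains it through COMMANDS_MODULE. A minimal sketch, assuming a project package called myproject (a placeholder name) with the class saved as myproject/commands/crawlall.py next to an empty __init__.py:

```python
# myproject/settings.py  ("myproject" is a placeholder project name)
# Tell Scrapy which package to scan for custom commands.
COMMANDS_MODULE = 'myproject.commands'
```

With that in place, the command would be invoked from the project directory as scrapy crawlall -a arg1=foo -a arg2=bar, where foo and bar are example values.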
