Scrapy: launching multiple spiders at once
Sometimes we write fairly generic spiders and crawl different sites or different page types by passing in different arguments. In that case there are two ways to launch several spiders:
- subclass ScrapyCommand to add a custom crawlall command
- launch the spiders in separate child processes, one after another (which gives sequential execution)
Launching multiple spiders with subprocess.Popen
subprocess.Popen runs each spider in its own child process, and the spiders can be launched either one after another or in parallel.
Running spiders sequentially with subprocess.Popen
In the code below:
- PYTHONPATH is added to env so the child process can resolve the project's imports
- a while loop streams the log output in real time
import os
import subprocess
from os.path import abspath, dirname

def run_command(command):
    # Put the project directories on PYTHONPATH so the child process can import the project modules.
    env = os.environ.copy()
    if os.name == 'nt':
        env['PYTHONPATH'] = dirname(dirname(dirname(abspath(__file__)))) + ';' + dirname(dirname(abspath(__file__)))
    else:
        env['PYTHONPATH'] = dirname(dirname(dirname(abspath(__file__)))) + ':' + dirname(dirname(abspath(__file__)))
    # Scrapy writes its log to stderr by default, so merge it into stdout to be able to stream it.
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, env=env)
    # Print the spider's output line by line until the process exits.
    while True:
        output = process.stdout.readline()
        if output:
            print(output.decode('utf-8').strip())
        elif process.poll() is not None:
            break
# Launch the spiders one after another; only one is running at any time.
# scrapy_cmd_list holds the 'scrapy crawl ...' command strings (see the example below).
for scrapy_cmd in scrapy_cmd_list:
    run_command(scrapy_cmd)
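Here scrapy_cmd_list is simply a list of crawl commands, one per run. A minimal sketch of how it might be built for a generic spider that takes per-run arguments (the spider name myspider and the site argument are hypothetical):

scrapy_cmd_list = [
    'scrapy crawl myspider -a site=example.com',
    'scrapy crawl myspider -a site=example.org',
]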
Running spiders in parallel with subprocess.Popen
def run_command(command):
    # Same PYTHONPATH setup as in the sequential version.
    env = os.environ.copy()
    if os.name == 'nt':
        env['PYTHONPATH'] = dirname(dirname(dirname(abspath(__file__)))) + ';' + dirname(dirname(abspath(__file__)))
    else:
        env['PYTHONPATH'] = dirname(dirname(dirname(abspath(__file__)))) + ':' + dirname(dirname(abspath(__file__)))
    # Return the process right away instead of waiting for it, so several spiders can run at the same time.
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, env=env)
    return process
process_list = []
# Launch all the spiders; they run at the same time.
for scrapy_cmd in scrapy_cmd_list:
    process = run_command(scrapy_cmd)
    process_list.append(process)

# Poll the running processes and print their output until all of them have exited.
# Note that readline() blocks until that particular process writes a line.
while process_list:
    for process in process_list[:]:  # iterate over a copy so finished processes can be removed
        output = process.stdout.readline()
        if output:
            print(output.decode('utf-8').strip())
        if process.poll() is not None:
            # Drain whatever output is still buffered before dropping the process.
            for output in process.stdout.readlines():
                print(output.decode('utf-8').strip())
            process_list.remove(process)
Why not just call scrapy.cmdline.execute or os.system directly?
Because cmdline.execute exits the whole program as soon as the spider finishes, instead of continuing with the code that follows it.
from scrapy import cmdline

cmdline.execute('scrapy crawl myspider'.split())
print('done')
'done' is never printed, because the program exits right after cmdline.execute finishes. os.system, on the other hand, runs into PYTHONPATH problems, so subprocess.Popen, which lets us set the environment variables explicitly, is the better choice.
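For completeness, Scrapy can also run several spiders inside a single Python process through CrawlerProcess, which is the same machinery the custom command below uses via self.crawler_process. A minimal sketch, assuming spiders named sp1 and sp2 as in the next section:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('sp1')  # a spider name from the project (a spider class also works)
process.crawl('sp2')
process.start()  # blocks until all queued spiders have finished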
Launching multiple spiders with a custom Scrapy command
A custom command gives full control over argument handling: you can launch several spiders in one go and pass different arguments to each of them.
from scrapy.commands import ScrapyCommand
from scrapy.utils.conf import arglist_to_dict
from scrapy.exceptions import UsageError

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            # Turn ['arg1=foo', 'arg2=bar'] into {'arg1': 'foo', 'arg2': 'bar'}.
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        arg1 = opts.spargs['arg1']
        arg2 = opts.spargs['arg2']
        spider_list = ['sp1', 'sp2']
        # Queue every spider, passing the spider arguments as keyword arguments.
        for spider_name in spider_list:
            self.crawler_process.crawl(spider_name, arg1=arg1, arg2=arg2)
        # Start the reactor; this blocks until all queued spiders have finished.
        self.crawler_process.start()
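To make the command available, save it in a commands package inside the project (the file name becomes the command name, e.g. crawlall.py for the crawlall command, and the directory needs an empty __init__.py), then register that package via the COMMANDS_MODULE setting. The module path below is a placeholder for your own project name:

# settings.py
COMMANDS_MODULE = 'myproject.commands'

After that, something like scrapy crawlall -a arg1=foo -a arg2=bar launches both spiders with the given arguments.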