Running the spiders
# Run both spiders in the same process
import datetime as dt
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# `-o` is a command-line flag, not a spider argument, so it cannot be passed
# through crawl(). When running from a script, set the output files via the
# FEEDS setting instead (Scrapy >= 2.1): %(name)s expands to each spider's
# name, so each spider writes its own dated JSON file.
date_str = dt.datetime.now().strftime('%Y-%m-%d')
settings.set('FEEDS', {'%(name)s' + date_str + '.json': {'format': 'json'}})

crawler = CrawlerProcess(settings)
crawler.crawl('爬虫A')
crawler.crawl('爬虫B')
crawler.start()  # call start() only once; it blocks until both crawls finish
Spider run-time arguments
Configuring different pipelines for different spiders
Method 1: check the spider name inside a shared pipeline.
class CrawlersPipeline:
    def process_item(self, item, spider):
        if spider.name in ['爬虫A', '爬虫B']:
            # spider-specific handling goes here
            pass
        # Always return the item (or raise DropItem); returning None for
        # unmatched spiders would silently break the pipeline chain.
        return item
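A self-contained sketch of the name check, using a SimpleNamespace as a stand-in for a real spider object (the source-tagging logic is illustrative, not part of the original pipeline):

```python
from types import SimpleNamespace

class CrawlersPipeline:
    # Spiders this pipeline applies to (names from the example above)
    HANDLED = {'爬虫A', '爬虫B'}

    def process_item(self, item, spider):
        if spider.name in self.HANDLED:
            item['source'] = spider.name  # illustrative per-spider handling
        return item  # always return the item so later pipelines still run

pipeline = CrawlersPipeline()
print(pipeline.process_item({}, SimpleNamespace(name='爬虫A')))   # {'source': '爬虫A'}
print(pipeline.process_item({}, SimpleNamespace(name='other')))  # {}
```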
Method 2: configure via custom_settings. Register both pipelines in settings.py; the numbers set the run order (lower runs first). Each spider then overrides ITEM_PIPELINES in its own custom_settings, so it runs only its own pipeline.
ITEM_PIPELINES = {
    'medicine_crawlers.pipelines.ACrawlersPipeline': 300,
    'medicine_crawlers.pipelines.BCrawlersPipeline': 300,
}
In spider A's spider file:
class longyi_spider(scrapy.Spider):
    name = '***'
    allowed_domains = ['***.com']
    start_urls = ['***']
    # ITEM_PIPELINES must map the pipeline's import path (a string) to an
    # order number; a bare set of class references will not work.
    custom_settings = {
        'ITEM_PIPELINES': {'medicine_crawlers.pipelines.ACrawlersPipeline': 300},
    }
Spider B follows the same pattern, pointing its custom_settings at BCrawlersPipeline.