I. Scrapy: a custom command that runs all spiders at once
1. Create a folder at the same level as the spiders directory, e.g. customcommand
2. Inside that folder, create crawlall.py (a sample layout is shown below)
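For reference, the project layout after these two steps might look like this (an empty __init__.py makes the new folder an importable package; everything else is the standard Scrapy project skeleton):

spider1/
    scrapy.cfg
    spider1/
        __init__.py
        settings.py
        spiders/
            ...
        customcommand/
            __init__.py
            crawlall.py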
3. The contents of crawlall.py:
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Names of every spider registered in the project
        spider_list = self.crawler_process.spiders.list()
        print('Starting crawl')
        for name in spider_list:
            # Schedule each spider; command-line options are passed through as spider kwargs
            self.crawler_process.crawl(name, **opts.__dict__)
        # Start the reactor and block until every spider has finished
        self.crawler_process.start()
4. Add the following to settings.py:
# Custom command: register the module that holds the crawlall command
COMMANDS_MODULE = 'spider1.customcommand'
5. Run scrapy crawlall from the command line to crawl with every spider.
6. To verify that the command was registered, run scrapy --h; crawlall now appears among the available commands:
D:\pythonProject\spider1>scrapy --h
Scrapy 1.5.1 - project: spider1
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  crawlall      Runs all of the spiders
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
II. What each setting in settings.py means
# -*- coding: utf-8 -*-
# Scrapy settings for spider1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'spider1'
SPIDER_MODULES = ['spider1.spiders']
NEWSPIDER_MODULE = 'spider1.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'spider1 (+http://www.yourdomain.com)'
# Custom User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # False means the crawler does not obey robots.txt rules
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests (default 16); set it lower for sites that are quick to block crawlers
CONCURRENT_REQUESTS = 16
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay, so requests do not come so fast that the site detects the crawler and bans the account
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
# At most 16 concurrent requests per domain; only one of this and CONCURRENT_REQUESTS_PER_IP below takes effect
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum number of concurrent requests per IP
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# Whether cookies are used while crawling; needed if the spider has to log in
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'spider1.middlewares.Spider1SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Custom proxy middleware would be registered here
# DOWNLOADER_MIDDLEWARES = {
# 'spider1.middlewares.Spider1DownloaderMiddleware': 543,  # the default middleware generated with the project
# 'spider1.proxymiddleware.ProxyMiddleware': 300,  # custom proxy middleware
# }
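# The ProxyMiddleware registered above would live in spider1/proxymiddleware.py, which this
# post does not show. A minimal sketch of such a file (the proxy addresses below are
# placeholders, not real servers):
#
#   # spider1/proxymiddleware.py
#   import random
#
#   class ProxyMiddleware(object):
#       # Candidate proxies to rotate through; replace with real proxy servers
#       PROXIES = [
#           'http://127.0.0.1:8888',
#           'http://127.0.0.1:8889',
#       ]
#
#       def process_request(self, request, spider):
#           # Attach a randomly chosen proxy to each outgoing request
#           request.meta['proxy'] = random.choice(self.PROXIES)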
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
#'scrapy.extensions.telnet.TelnetConsole': None,
'spider1.extensions.MyExtend': 300,
}
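# The MyExtend extension enabled above would live in spider1/extensions.py, which this post
# does not show. A minimal skeleton (the signals hooked here are only an example):
#
#   # spider1/extensions.py
#   from scrapy import signals
#
#   class MyExtend(object):
#       def __init__(self, crawler):
#           # Subscribe to spider lifecycle signals
#           crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
#           crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)
#
#       @classmethod
#       def from_crawler(cls, crawler):
#           # Scrapy builds the extension through this factory method
#           return cls(crawler)
#
#       def spider_opened(self, spider):
#           spider.logger.info('MyExtend: spider %s opened', spider.name)
#
#       def spider_closed(self, spider):
#           spider.logger.info('MyExtend: spider %s closed', spider.name)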
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'spider1.pipelines.XianPipeline': 300,     # the number is the pipeline's priority: lower values run first,
'spider1.pipelines.XiaohuaPipeline': 200,  # so XiaohuaPipeline (200) sees each item before XianPipeline (300)
}
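# Both pipelines live in spider1/pipelines.py, which this post does not show; one of them
# might look roughly like this (the output file name is made up):
#
#   # spider1/pipelines.py
#   import json
#
#   class XiaohuaPipeline(object):
#       def open_spider(self, spider):
#           # Called once when the spider starts
#           self.f = open('xiaohua_items.json', 'a', encoding='utf-8')
#
#       def process_item(self, item, spider):
#           # Called for every scraped item; must return the item for the next pipeline
#           self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
#           return item
#
#       def close_spider(self, spider):
#           # Called once when the spider finishes
#           self.f.close()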
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AutoThrottle computes the request delay dynamically; the fixed DOWNLOAD_DELAY = 1 above makes the timing between requests too uniform
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
# Initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# Maximum download delay when latency is high
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# Average number of requests to send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# Enable the HTTP cache
#HTTPCACHE_ENABLED = True
# Cache expiration time in seconds (0 = cached pages never expire)
#HTTPCACHE_EXPIRATION_SECS = 0
# Directory where cached responses are stored
#HTTPCACHE_DIR = 'httpcache'
# HTTP status codes whose responses should not be cached, e.g. 404 or 503
#HTTPCACHE_IGNORE_HTTP_CODES = []
# Cache storage backend
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# The storage backend can be replaced or subclassed, e.g.:
# from scrapy.extensions.httpcache import FilesystemCacheStorage, DummyPolicy
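# For example, a FilesystemCacheStorage subclass that just logs every cache write (an
# illustrative sketch; the module and class names are made up) could be plugged in via
# HTTPCACHE_STORAGE = 'spider1.cachestorage.LoggingCacheStorage':
#
#   # spider1/cachestorage.py
#   import logging
#   from scrapy.extensions.httpcache import FilesystemCacheStorage
#
#   logger = logging.getLogger(__name__)
#
#   class LoggingCacheStorage(FilesystemCacheStorage):
#       def store_response(self, spider, request, response):
#           # Log the URL being cached, then fall back to the default filesystem behaviour
#           logger.debug('caching response for %s', request.url)
#           super(LoggingCacheStorage, self).store_response(spider, request, response)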
DEPTH_LIMIT = 5  # maximum crawl depth; 0 means no depth limit (the crawler keeps following links indefinitely)
# Depth-first: the crawler starts from a seed page and follows one chain of links as far as it goes,
# and only after finishing that chain does it move on to the next seed page and repeat.
# Breadth-first (also called width-first): links found in newly downloaded pages are appended to the end of the URL queue,
# so the crawler first fetches every page linked from the seed page, then picks one of those pages and fetches all of the pages it links to, and so on.
# DEPTH_PRIORITY = 0  # only 0, 1 and -1 are meaningful: 1 = breadth-first, -1 = depth-first, 0 (default) = no depth-based priority adjustment
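# According to the Scrapy docs, true breadth-first crawling is normally configured by combining
# DEPTH_PRIORITY = 1 with FIFO scheduler queues (left commented here as a reference):
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'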
# Custom duplicate-request filter
DUPEFILTER_CLASS = 'spider1.duplication.RepeatFilter'
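# RepeatFilter would live in spider1/duplication.py, which this post does not show. A minimal
# URL-based sketch of such a filter (the implementation details are an assumption, not the
# original code):
#
#   # spider1/duplication.py
#   class RepeatFilter(object):
#       def __init__(self):
#           self.visited_urls = set()
#
#       @classmethod
#       def from_settings(cls, settings):
#           # The scheduler builds the dupefilter through this factory method
#           return cls()
#
#       def request_seen(self, request):
#           # Return True to drop the request as a duplicate
#           if request.url in self.visited_urls:
#               return True
#           self.visited_urls.add(request.url)
#           return False
#
#       def open(self):                      # called when the spider opens
#           pass
#
#       def close(self, reason):             # called when the spider closes
#           pass
#
#       def log(self, request, spider):      # called for each filtered duplicate
#           pass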
# Custom command module (registers the crawlall command defined in section I)
COMMANDS_MODULE = 'spider1.customcommand'