Scrapy: a custom command to run all spiders at once, plus a detailed explanation of each parameter in settings.py

I. A custom Scrapy command to run all spiders at once

1. In the directory that contains the spiders folder, create a new Python package (a folder with an empty __init__.py), e.g. customcommand.

2. Inside that folder, create crawlall.py.

3. The contents of crawlall.py are as follows:

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Collect the names of all spiders registered in the project
        # (spider_loader replaces the deprecated crawler_process.spiders attribute)
        spider_list = self.crawler_process.spider_loader.list()
        print('Starting crawl for all spiders')
        # Schedule every spider, then start them all on the same reactor
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

4. Add the following setting to settings.py:

# Custom command: point Scrapy at the package that contains the crawl-all command
COMMANDS_MODULE = 'spider1.customcommand'

5. Run scrapy crawlall from the command line to crawl with every spider (the command name comes from the file name crawlall.py).

6. To check that the command was registered successfully, run scrapy -h; crawlall now appears in the list of available commands:

D:\pythonProject\spider1>scrapy -h
Scrapy 1.5.1 - project: spider1

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  crawlall      Runs all of the spiders
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values

II. Explanation of each configuration parameter in settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for spider1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider1'

SPIDER_MODULES = ['spider1.spiders']
NEWSPIDER_MODULE = 'spider1.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'spider1 (+http://www.yourdomain.com)'
# Custom User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # False means the crawler does not obey the site's robots.txt rules

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests; defaults to 16. Lower it for sites that are strict about blocking crawlers.
CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay; avoids requesting so fast that the crawler is detected and the account/IP gets banned
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
# At most 16 concurrent requests per domain; only one of this and the per-IP setting below takes effect
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum number of concurrent requests per IP
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Whether to enable cookies; required if the crawl needs to log in
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider1.middlewares.Spider1SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# A custom proxy-IP downloader middleware is registered here
# DOWNLOADER_MIDDLEWARES = {
#    'spider1.middlewares.Spider1DownloaderMiddleware': 543,  # the project's default middleware
#    'spider1.proxymiddleware.ProxyMiddleware': 300,  # custom proxy-IP middleware
# }
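For reference, a minimal sketch of what the proxy middleware registered above could look like; the module path spider1/proxymiddleware.py comes from the entry above, while the proxy addresses are placeholders:

# spider1/proxymiddleware.py  (sketch matching the commented-out entry above)
import random


class ProxyMiddleware(object):
    # Hypothetical proxy pool; replace with real proxies
    PROXIES = [
        'http://127.0.0.1:8888',
        'http://127.0.0.1:8889',
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy']
        request.meta['proxy'] = random.choice(self.PROXIES)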

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
   #'scrapy.extensions.telnet.TelnetConsole': None,
   'spider1.extensions.MyExtend':300,
}
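The extension 'spider1.extensions.MyExtend' enabled above is project-specific code. A minimal sketch of such an extension, hooking into Scrapy's spider lifecycle signals (the log messages are just for illustration), might look like:

# spider1/extensions.py  (sketch of the MyExtend extension referenced above)
from scrapy import signals


class MyExtend(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # Run callbacks when a spider is opened or closed
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        spider.logger.info('MyExtend: spider %s opened', spider.name)

    def spider_closed(self, spider):
        spider.logger.info('MyExtend: spider %s closed', spider.name)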

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'spider1.pipelines.XianPipeline': 300,  # the number is the pipeline's priority: lower values run first
   'spider1.pipelines.XiaohuaPipeline': 200,  # so this pipeline (200) runs before XianPipeline (300)
}
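XianPipeline and XiaohuaPipeline are this project's own pipelines; a minimal sketch of the interface they implement (the output file name is an assumption) is:

# spider1/pipelines.py  (sketch of one of the pipelines referenced above)
class XiaohuaPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts; open files / DB connections here
        self.file = open('items.jsonl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item; must return the item (or raise DropItem)
        self.file.write(str(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()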

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AutoThrottle adjusts the request delay dynamically based on an algorithm; the fixed DOWNLOAD_DELAY = 1 above makes the delay too uniform
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
# Initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# Maximum download delay
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# Target average number of requests sent in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# Enable HTTP caching
#HTTPCACHE_ENABLED = True
# Cache expiration time in seconds (0 = cached responses never expire)
#HTTPCACHE_EXPIRATION_SECS = 0
# Directory where the cache is stored
#HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that should not be cached, e.g. 404, 503
#HTTPCACHE_IGNORE_HTTP_CODES = []
# Cache storage backend (the plugin that actually stores cached responses) and cache policy
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# You can write your own class or subclass the defaults:
# from scrapy.extensions.httpcache import FilesystemCacheStorage, DummyPolicy
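As the comment above notes, the default storage backend can be subclassed; a minimal sketch (the module name and the debug logging are assumptions, added only for illustration) could be:

# spider1/cachestorage.py  (hypothetical subclass of the default storage backend)
from scrapy.extensions.httpcache import FilesystemCacheStorage


class LoggingCacheStorage(FilesystemCacheStorage):
    def store_response(self, spider, request, response):
        # Log each response before delegating to the filesystem storage
        spider.logger.debug('caching %s', request.url)
        super(LoggingCacheStorage, self).store_response(spider, request, response)

# and then point the setting at it:
# HTTPCACHE_STORAGE = 'spider1.cachestorage.LoggingCacheStorage'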


DEPTH_LIMIT = 5  # Maximum crawl depth; 0 means no depth limit (links are followed indefinitely)
# Depth-first: the crawler starts from a start page and follows one chain of links as far as it goes, then moves on to the next start page and keeps following links there
# Breadth-first (also called width-first): links discovered in newly downloaded pages are appended to the end of the URL queue,
# i.e. the crawler first fetches every page linked from the start page, then picks one of those pages and fetches all pages it links to, and so on
# DEPTH_PRIORITY = 0  # Typically 0, 1 or -1: a positive value biases towards breadth-first, a negative value towards depth-first, 0 (the default) applies no depth-based priority adjustment
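# A commonly documented combination (from the Scrapy FAQ) for switching the scheduler
# to breadth-first order; shown commented out here as a sketch:
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'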

DUPEFILTER_CLASS = 'spider1.duplication.RepeatFilter'  # custom duplicate-request filter (replaces the default RFPDupeFilter)
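DUPEFILTER_CLASS above points at a project-specific filter; a minimal sketch of what spider1/duplication.py could contain (the in-memory URL set is an assumption, the method names are the standard dupefilter interface):

# spider1/duplication.py  (sketch of the RepeatFilter referenced above)
class RepeatFilter(object):
    def __init__(self):
        self.visited_urls = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # Return True to drop the request as a duplicate
        if request.url in self.visited_urls:
            return True
        self.visited_urls.add(request.url)
        return False

    def open(self):
        # Called when the spider starts
        pass

    def close(self, reason):
        # Called when the spider finishes
        pass

    def log(self, request, spider):
        pass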

# Custom command: register the crawl-all command module (see Part I)
COMMANDS_MODULE = 'spider1.customcommand'

 
