Detailed Reference
1. Middleware
- Downloader middleware
Write the middleware (create it in a directory at the same level as settings.py):
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request before it is handed to the downloader.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request', request)
        # 1. Return a Response
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
        # 2. Return a Request
        # return Request('https://dig.chouti.com/r/tec/hot/1')
        # 3. Raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest
        # 4. Modify the request in place (*)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('md1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request() method
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
Configuration:
DOWNLOADER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware': 751,
    'xdb.md.Md1': 666,
    'xdb.md.Md2': 667,
}
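The numbers are ordering values: middlewares with lower values sit closer to the engine and those with higher values closer to the downloader, so process_request() is called in ascending order and process_response() in descending order.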
Uses:
- Add a user-agent
- Add a proxy
(both uses are sketched below)
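A minimal sketch of the two uses above in one downloader middleware, assuming it lives in xdb/md.py next to Md1 and is registered in DOWNLOADER_MIDDLEWARES the same way; the user-agent string and proxy address are placeholders:

class UserAgentProxyMiddleware(object):
    # Hypothetical example; not part of the original project.

    def process_request(self, request, spider):
        # Overwrite the user-agent header before the request is downloaded
        request.headers['User-Agent'] = (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        )
        # Route the request through an HTTP proxy; Scrapy's downloader
        # honours the 'proxy' key in request.meta (placeholder address).
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        return None  # continue processing the request normally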
- Spider middleware
Write the middleware (create it in a directory at the same level as settings.py):

class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called when the engine passes a downloaded response into the
        # spider, i.e. after the downloader middleware has finished.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the Request and Item objects the spider callback
        # yields after it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Executed only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
Configuration:
SPIDER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
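The ordering works the same way as for downloader middleware: lower values sit closer to the engine, higher values closer to the spider, so process_spider_input() runs in ascending order and process_spider_output() in descending order.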
Uses:
- Depth
- Priority
(both are sketched below)
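A minimal sketch of how depth and priority can be handled in a spider middleware, loosely modeled on Scrapy's built-in DepthMiddleware; the class name and the max_depth value are assumptions, not part of the original project:

from scrapy.http import Request


class DepthPriorityMiddleware(object):
    max_depth = 3  # assumed limit; requests nested deeper are dropped

    def process_spider_output(self, response, result, spider):
        # Depth of any requests produced by this response's callback
        depth = response.meta.get('depth', 0) + 1
        for obj in result:
            if isinstance(obj, Request):
                obj.meta['depth'] = depth
                obj.priority -= depth      # deeper requests are scheduled later
                if depth > self.max_depth:
                    continue               # discard requests beyond the limit
            yield obj

Note that Scrapy already ships this behaviour via DepthMiddleware and the DEPTH_LIMIT / DEPTH_PRIORITY settings; the sketch only shows where such logic would live.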
2. Custom commands
- Running a single spider:
Create a .py file in the same directory as scrapy.cfg:
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "--nolog"])
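The same entry point also accepts spider arguments via -a, exactly as on the command line; the argument name page below is hypothetical and depends on what the spider's __init__ accepts:

from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "-a", "page=1", "--nolog"])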
- Running all spiders:
1) Create a directory (any name) at the same level as spiders, e.g. commands
2) Inside it, create a crawlall.py file (this file name becomes the custom command name):

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Schedule every spider registered in the project, then start
        # the reactor once so they all run in a single process.
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
3) Add the setting COMMANDS_MODULE = 'project_name.directory_name' to settings.py (here: COMMANDS_MODULE = 'xdb.commands')
4) Run the command from the project directory: scrapy crawlall
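For orientation, a possible layout of the xdb project after the steps above (the tree is an assumption pieced together from the paths used in this section); the empty __init__.py is required so that xdb.commands is importable:

xdb/
    scrapy.cfg
    xdb/
        settings.py        # contains COMMANDS_MODULE = 'xdb.commands'
        md.py              # Md1, Md2 downloader middleware
        sd.py              # Sd1, Sd2 spider middleware
        commands/
            __init__.py    # empty file, makes the directory a package
            crawlall.py    # the Command class above
        spiders/
            chouti.py

Running scrapy crawlall from the directory containing scrapy.cfg then schedules every spider in the project in one process.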