Detailed Reference
1. Middleware
- Downloader middleware
Write the middleware (create it in a directory at the same level as settings.py):
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request before it is handed to the downloader.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request', request)
        # 1. Return a Response
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
        # 2. Return a Request
        # return Request('https://dig.chouti.com/r/tec/hot/1')
        # 3. Raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest
        # 4. Modify the request in place (*)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('md1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request() method
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
Configuration:
DOWNLOADER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware': 751,
    'xdb.md.Md1': 666,
    'xdb.md.Md2': 667,
}
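The numbers are ordering values: middlewares with lower values sit closer to the engine and those with higher values closer to the downloader, so process_request() is called in ascending order and process_response() in descending order.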
Uses:
- Add a user-agent
- Add a proxy
(both uses are sketched below)
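A minimal sketch of the two uses above in one downloader middleware, assuming it lives in xdb/md.py next to Md1 and is registered in DOWNLOADER_MIDDLEWARES the same way; the user-agent string and proxy address are placeholders:

class UserAgentProxyMiddleware(object):
    # Hypothetical example; not part of the original project.

    def process_request(self, request, spider):
        # Overwrite the user-agent header before the request is downloaded
        request.headers['User-Agent'] = (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        )
        # Route the request through an HTTP proxy; Scrapy's downloader
        # honours the 'proxy' key in request.meta (placeholder address).
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        return None  # continue processing the request normally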
- Spider middleware
Write the middleware (create it in a directory at the same level as settings.py):

class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called when the engine passes a downloaded response into the
        # spider, i.e. after the downloader middleware has finished.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the Request and Item objects the spider callback
        # yields after it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Executed only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
Configuration:
SPIDER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
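The ordering works the same way as for downloader middleware: lower values sit closer to the engine, higher values closer to the spider, so process_spider_input() runs in ascending order and process_spider_output() in descending order.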
Uses:
- Depth
- Priority
(both are sketched below)
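A minimal sketch of how depth and priority can be handled in a spider middleware, loosely modeled on Scrapy's built-in DepthMiddleware; the class name and the max_depth value are assumptions, not part of the original project:

from scrapy.http import Request


class DepthPriorityMiddleware(object):
    max_depth = 3  # assumed limit; requests nested deeper are dropped

    def process_spider_output(self, response, result, spider):
        # Depth of any requests produced by this response's callback
        depth = response.meta.get('depth', 0) + 1
        for obj in result:
            if isinstance(obj, Request):
                obj.meta['depth'] = depth
                obj.priority -= depth      # deeper requests are scheduled later
                if depth > self.max_depth:
                    continue               # discard requests beyond the limit
            yield obj

Note that Scrapy already ships this behaviour via DepthMiddleware and the DEPTH_LIMIT / DEPTH_PRIORITY settings; the sketch only shows where such logic would live.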
2. Custom commands
- Running a single spider:
Create a .py file in the same directory as scrapy.cfg:
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "--nolog"])
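The same entry point also accepts spider arguments via -a, exactly as on the command line; the argument name page below is hypothetical and depends on what the spider's __init__ accepts:

from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "-a", "page=1", "--nolog"])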
- Running all spiders:
1) Create a directory (any name) at the same level as spiders, e.g. commands
2) Inside it, create a crawlall.py file (this file name becomes the custom command name):

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Schedule every spider registered in the project, then start
        # the reactor once so they all run in a single process.
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
3) Add the setting COMMANDS_MODULE = 'project_name.directory_name' to settings.py (here: COMMANDS_MODULE = 'xdb.commands')
4) Run the command from the project directory: scrapy crawlall
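For orientation, a possible layout of the xdb project after the steps above (the tree is an assumption pieced together from the paths used in this section); the empty __init__.py is required so that xdb.commands is importable:

xdb/
    scrapy.cfg
    xdb/
        settings.py        # contains COMMANDS_MODULE = 'xdb.commands'
        md.py              # Md1, Md2 downloader middleware
        sd.py              # Sd1, Sd2 spider middleware
        commands/
            __init__.py    # empty file, makes the directory a package
            crawlall.py    # the Command class above
        spiders/
            chouti.py

Running scrapy crawlall from the directory containing scrapy.cfg then schedules every spider in the project in one process.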