This article covers Scrapy middleware and explains the middleware processing flow.
Downloader middleware
Downloader middleware sits between the downloader and the engine. It is where you set the User-Agent, cookies, and proxies, and where you can use Selenium.
To use a downloader middleware, first enable it in settings.py.
As with item pipelines, the lower the priority value, the earlier it runs.
DOWNLOADER_MIDDLEWARES = {
    "Mid.middlewares.MidDownloaderMiddleware": 543,
}
The process_request() method: where you set the User-Agent, cookies, and proxies. Across the middleware chain, this method runs in ascending priority order: the lower the value, the earlier it executes.
'''Called automatically before the engine hands the request to the downloader.
:param request: the current request
:param spider: the spider that issued the request
:return: restricted; you cannot return arbitrary values.
Note: the return value of process_request() follows fixed rules:
1. Return None (or no return at all): no interception; execution continues through the later middlewares. (There is a chain of middlewares between the engine and the downloader; without interception, execution proceeds through them in priority order until the chain ends and the request reaches the downloader.)
2. Return a Request: the later middlewares no longer run; the request goes back to the engine, which hands it to the scheduler again. The downloader never sees the URL.
3. Return a Response: the later middlewares no longer run; the response goes to the engine, which passes it to the spider for parsing. (That is, the request never reaches the downloader; the middleware that returned the response supplies it directly to the engine, and the spider then processes the data.)
'''
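To make cases 1 and 3 concrete, here is a minimal sketch (the cache dict and class name are hypothetical, added for illustration): it short-circuits the download for URLs whose body we already have, and lets everything else continue down the chain.

from scrapy.http import HtmlResponse

# Hypothetical in-memory cache: url -> previously fetched page body
FAKE_CACHE = {}

class ShortCircuitMiddleware:
    def process_request(self, request, spider):
        body = FAKE_CACHE.get(request.url)
        if body is not None:
            # Case 3: returning a Response skips the later middlewares and
            # the downloader; the engine hands it straight to the spider.
            return HtmlResponse(url=request.url, body=body, request=request, encoding='utf-8')
        # Case 1: returning None lets the request continue down the chain.
        return None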
The process_response() method: the point where the downloader hands the response back toward the engine. Across the middleware chain, this method runs in descending priority order: the higher the value, the earlier it executes.
return response: no interception; the response keeps moving toward the engine.
return request: the response is intercepted; the request is fed back to the scheduler (via the engine), and the remaining process_response() methods never receive the response.
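As an illustration of the return request case (a sketch, not from the original post), a middleware can re-queue a request whose response looks blocked:

class RetryBlockedMiddleware:  # hypothetical name
    def process_response(self, request, response, spider):
        if response.status in (403, 503):
            # Intercept: the request goes back to the scheduler via the
            # engine; earlier middlewares' process_response() never run.
            return request.replace(dont_filter=True)
        return response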
from scrapy import signals  # at the top of middlewares.py

class MidDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):  # set the User-Agent, cookies
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('I am ware, process_request')
        return None

    def process_response(self, request, response, spider):  # where the downloader hands back to the engine
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        '''
        :param request:
        :param response:
        :param spider:
        :return:
        return response: pass the response on (via the engine) to the next
            component or the next process_response(); no interception.
        return request: intercept the response; the request goes back to the
            scheduler (via the engine), and later process_response() calls
            never receive the response.
        '''
        print('I am ware, process_response')
        return response

    def process_exception(self, request, exception, spider):  # runs automatically when the request errors out
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):  # runs once when the spider opens
        print('I am ware, spider_opened')
        # spider.logger.info("Spider opened: %s" % spider.name)
Output:
I am ware, spider_opened
I am ware, process_request
I am ware, process_response
百度一下,你就知道
If the file holds several downloader middlewares, how do they run together?
As before, enable them in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "Mid.middlewares.MidDownloaderMiddleware1": 543,
    "Mid.middlewares.MidDownloaderMiddleware2": 544,
}
Multiple downloader middlewares:
# Downloader middleware: sits between the downloader and the engine; sets the User-Agent and cookies
class MidDownloaderMiddleware1:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):  # set the User-Agent, cookies
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('I am ware1, process_request')
        return None

    def process_response(self, request, response, spider):  # where the downloader hands back to the engine
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('I am ware1, process_response')
        return response

    def process_exception(self, request, exception, spider):  # runs automatically when the request errors out
        pass

    def spider_opened(self, spider):  # runs once when the spider opens
        print('I am ware1, spider_opened')
        # spider.logger.info("Spider opened: %s" % spider.name)
class MidDownloaderMiddleware2:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):  # set the User-Agent, cookies
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('I am ware2, process_request')
        return None

    def process_response(self, request, response, spider):  # where the downloader hands back to the engine
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('I am ware2, process_response')
        return response

    def process_exception(self, request, exception, spider):  # runs automatically when the request errors out
        pass

    def spider_opened(self, spider):  # runs once when the spider opens
        print('I am ware2, spider_opened')
        # spider.logger.info("Spider opened: %s" % spider.name)
Output:
I am ware1, spider_opened
I am ware2, spider_opened
I am ware1, process_request
I am ware2, process_request
I am ware2, process_response
I am ware1, process_response
百度一下,你就知道
Summary: process_request() runs lowest-priority-value first; process_response() runs highest-priority-value first.
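Putting the two rules together for the middlewares above (priorities 543 and 544):

engine -> MW543.process_request -> MW544.process_request -> downloader
downloader -> MW544.process_response -> MW543.process_response -> engine -> spider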
Spider middleware sits between the spider and the engine. (Not covered in detail here.)
class MidSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
Setting the User-Agent
There are two approaches.
1. Set a fixed User-Agent: define USER_AGENT in settings.py.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
2. Use a random User-Agent per request: put a list of User-Agents in settings.py and pick one in a middleware.
settings.py
USER_AGENT_list=['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36', # 2021.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36', # 2021.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36', # 2021.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36', # 2022.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36', # 2022.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36', # 2022.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36', # 2022.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36', # 2022.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36', # 2022.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36', # 2022.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36', # 2022.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.54 Safari/537.36', # 2022.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.127 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.91 Safari/537.36', # 2022.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.103 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.119 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.63 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.106 Safari/537.36', # 2022.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.107 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.122 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.72 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.95 Safari/537.36', # 2022.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.99 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.125 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5414.75 Safari/537.36', # 2023.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5414.120 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.78 Safari/537.36', # 2023.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.104 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.105 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.178 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.180 Safari/537.36', # 2023.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.64 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.65 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.111 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.112 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.147 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.50 Safari/537.36', # 2023.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.121 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.138 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.5672.64 Safari/537.36', # 2023.05
]
These User-Agents need to be attached to the request headers, so the work belongs in a downloader middleware.
Enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "douban.middlewares.DoubanDownloaderMiddleware": 543,
}
The downloader middleware; only process_request() actually does anything here.
from random import choice

from scrapy import signals

from douban.settings import USER_AGENT_list  # the list defined above

class DoubanDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Pick a random User-Agent
        ua = choice(USER_AGENT_list)
        # Put it on the request headers
        request.headers['User-Agent'] = ua
        return None  # must return None here, or the request gets intercepted

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
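A variant (a sketch, not from the original project; the class name is hypothetical) avoids importing settings.py directly and instead reads the list through the crawler's settings object, which also respects per-spider overrides:

from random import choice

class RandomUAMiddleware:  # hypothetical name
    def __init__(self, ua_list):
        self.ua_list = ua_list

    @classmethod
    def from_crawler(cls, crawler):
        # getlist() reads USER_AGENT_list from settings.py ([] if missing)
        return cls(crawler.settings.getlist('USER_AGENT_list'))

    def process_request(self, request, spider):
        if self.ua_list:
            request.headers['User-Agent'] = choice(self.ua_list)
        return None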
Proxy IPs
Free proxy IPs
These expire quickly and are slow, so they are not recommended, but they are fine for practice.
Proxy IP site: https://www.kuaidaili.com
settings.py
PROXY_IP_LIST = [
    # your list of proxy IPs
]
DOWNLOADER_MIDDLEWARES = {
    "douban.middlewares.DoubanDownloaderMiddleware": 543,
    "douban.middlewares.ProxyDoubanDownloaderMiddleware": 544,
}
The downloader middleware. Free proxies generally have a low success rate.
from random import choice

from douban.settings import PROXY_IP_LIST

class ProxyDoubanDownloaderMiddleware:
    def process_request(self, request, spider):
        # Pick a random proxy IP
        ip = choice(PROXY_IP_LIST)
        # Attach it to the request meta; the scheme should match the request's
        request.meta['proxy'] = 'https://' + ip
        return None  # must return None here, or the request gets intercepted
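Since free proxies fail often, one common follow-up (a sketch, not part of the original code) is to rotate to a fresh proxy in process_exception() of the same class and re-schedule the request:

    def process_exception(self, request, exception, spider):
        # Rotate to a new proxy; returning a Request stops the
        # process_exception() chain and sends it back to the scheduler.
        request.meta['proxy'] = 'https://' + choice(PROXY_IP_LIST)
        return request.replace(dont_filter=True)  # bypass the dupe filter on retry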
Paid proxy IPs: a proxy service.
https://www.kuaidaili.com/tps, a tunnel proxy.
The site provides documentation and code samples.
You need the tunnel host, port, username, and password from your account.
Following the provided sample, the code goes straight into a downloader middleware.
class MoneyProxyDoubanDownloaderMiddleware:
    _proxy = ('XXX.XXX.com', '15818')

    def process_request(self, request, spider):
        # Username/password authentication
        username = "username"
        password = "password"
        request.meta['proxy'] = "http://%(user)s:%(pwd)s@%(proxy)s/" % {
            "user": username,
            "pwd": password,
            "proxy": ':'.join(MoneyProxyDoubanDownloaderMiddleware._proxy),
        }
        # IP-whitelist authentication
        # request.meta['proxy'] = "http://%(proxy)s/" % {"proxy": proxy}
        request.headers["Connection"] = "close"
        return None
Using Selenium in a middleware
# Since Selenium will replace the stock downloader, the built-in downloader middlewares no longer matter to us.
# The built-in middleware with the smallest priority value sits at 100, so the Selenium middleware must be registered below 100.
DOWNLOADER_MIDDLEWARES = {
    "boss.middlewares.BossSeleniumDownloaderMiddleware": 99,
}
Steps:
# Note: the spiders folder can contain many spider .py files, so we first need
# to tell apart which requests should go through Selenium.
# That means the project must handle two kinds of requests.
# Design flow:
# 1. Create a request.py module in the project and define a SeleniumRequest class
#    that inherits from Request; functionally it is identical to Request and
#    serves only as a marker (see the sketch after this list).
# 2. Override start_requests(self) in the spider.
# 3. Check the request type in the downloader middleware's process_request().
# 4. Selenium should start as soon as the program runs: launch it in spider_opened().
# 5. Perform the actual request inside step 3.
# 6. Wrap the page source into a response object.
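The post uses SeleniumRequest without showing it; here is a minimal sketch of request.py and the spider-side override, assuming the project package is named boss (the spider name and URL are placeholders):

# request.py -- a marker subclass; it adds no behavior of its own
from scrapy import Request

class SeleniumRequest(Request):
    pass

# in the spider file: route the start URLs through Selenium
from scrapy import Spider

from boss.request import SeleniumRequest

class BossSpider(Spider):
    name = "boss"  # placeholder
    start_urls = ["https://example.com"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        pass  # parse the Selenium-rendered page here

The middleware below then only has to check isinstance(request, SeleniumRequest).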
import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

from boss.request import SeleniumRequest

class BossSeleniumDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def process_request(self, request, spider):
        # Every request arrives here. For a SeleniumRequest, drive the browser
        # and return a response built from the rendered page source
        # (see the design flow above).
        if isinstance(request, SeleniumRequest):  # isinstance checks whether request is of the given type
            # process_request() may return None, a Request, or a Response;
            # here we wrap the rendered page into a response object.
            self.browser.get(request.url)
            time.sleep(2)  # crude wait for the page to render before grabbing the source
            page_source = self.browser.page_source
            return HtmlResponse(url=request.url, status=200, body=page_source, request=request, encoding='utf-8')
        else:
            return None

    def spider_opened(self, spider):
        self.options = webdriver.ChromeOptions()
        # note: Selenium 4 renamed the chrome_options keyword to options
        self.browser = webdriver.Chrome(options=self.options)

    def spider_closed(self, spider):
        self.browser.quit()  # quit() shuts the browser down; close() would only close the current window
To see how HtmlResponse should be constructed, look at how it is defined in the Scrapy source.
The class itself defines almost nothing:
"""
This module implements the HtmlResponse class which adds encoding
discovering through HTML encoding declarations to the TextResponse class.

See documentation in docs/topics/request-response.rst
"""
from scrapy.http.response.text import TextResponse

class HtmlResponse(TextResponse):
    pass
Looking at TextResponse:
class TextResponse(Response):
    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop("encoding", None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super().__init__(*args, **kwargs)
Still not much there, apart from popping the encoding keyword. Going one level up to Response, the base class of TextResponse:
def __init__(
    self,
    url: str,
    status=200,
    headers=None,
    body=b"",
    flags=None,
    request=None,
    certificate=None,
    ip_address=None,
    protocol=None,
):
    self.headers = Headers(headers or {})
    self.status = int(status)
    self._set_body(body)
    self._set_url(url)
    self.request = request
    self.flags = [] if flags is None else list(flags)
    self.certificate = certificate
    self.ip_address = ip_address
    self.protocol = protocol
From this we can see how to construct our HtmlResponse:
HtmlResponse(url=request.url, status=200, body=page_source, request=request, encoding='utf-8')