IV. Scrapy Middleware
The Scrapy architecture diagram
In the Scrapy framework there is a middleware layer between the Engine and the Downloader: the Downloader Middlewares.
Every request the Engine sends to the Downloader passes through these middlewares (similar to pipelines, you can configure many of them).
Every response the Downloader sends back to the Engine passes through them as well.
In middleware code we can hook into this traffic to do work in between, e.g. set a proxy or add request headers.
- Random request-header middleware
When a crawler visits the same page frequently with an identical request header, the server can easily spot it and block that header. So we should randomize the request header before each visit, which helps keep the crawler from being detected and banned.
Randomizing the header can be done in a downloader middleware: before each request is sent to the server, pick a header at random so the same one is not used every time.
To implement this, let's create a new project:
scrapy startproject mw
When we create a Scrapy project, it generates a middleware file (middlewares.py) for us by default.
A downloader middleware has three key methods:
process_request: called before a request is sent
process_response: called after a response comes back
process_exception: called when an exception occurs
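The three hooks can be sketched as a minimal do-nothing middleware (an illustrative skeleton with hypothetical names, not Scrapy's generated template):

```python
class SkeletonDownloaderMiddleware:
    """Do-nothing downloader middleware showing the three key hooks."""

    def process_request(self, request, spider):
        # Called before the request is downloaded; None = carry on normally.
        return None

    def process_response(self, request, response, spider):
        # Called with the downloaded response; pass it along unchanged.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download (or process_request) raised an exception;
        # None = let other middlewares / the request's errback deal with it.
        return None
```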
To set random request headers, the code belongs in process_request, so let's look at this method in detail.
Every request passes through process_request before the actual download happens.
process_request(self, request, spider):
Parameters:
request: the Request object being sent
spider: the Spider that issued the request
Return values:
None: Scrapy keeps processing this request (remaining middlewares, then the actual download)
Response object: Scrapy skips the download and takes this response as the result; the process_response methods that follow are still executed
Request object: the current request is abandoned, and the new request is handed back to Scrapy to be scheduled
Exception raised: process_exception() is called
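How the engine reacts to each return value can be modeled in plain Python. This is a simplified sketch, not Scrapy's real internals; Request, Response, download and schedule are stand-ins invented for illustration:

```python
class Request:                    # stand-in for scrapy.Request
    def __init__(self, url):
        self.url = url

class Response:                   # stand-in for scrapy.http.Response
    def __init__(self, url, status=200):
        self.url, self.status = url, status

def handle_result(result, request, download, schedule):
    """Simplified dispatch on process_request's return value."""
    if result is None:                 # None: keep going, download normally
        return download(request)
    if isinstance(result, Response):   # Response: skip the download entirely
        return result
    if isinstance(result, Request):    # Request: reschedule the new request
        schedule(result)
        return None
    raise TypeError('process_request must return None/Request/Response')
```

In real Scrapy, the fourth case (an exception being raised) is routed to process_exception rather than being returned.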
2.2 Random request-header middleware
Create a spider:
scrapy genspider test httpbin.org/get
With the spider created,
first edit the middleware code in middlewares.py and override process_request:
import random

from scrapy import signals


class MwDownloaderMiddleware:
    USER_AGENTS = [
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
        'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; Acoo Browser; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; Avant Browser)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Maxthon; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729); Windows NT 5.1; Trident/4.0)',
        'Mozilla/4.0 (compatible; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; Acoo Browser; .NET CLR 1.1.4322; .NET CLR 2.0.50727); Windows NT 5.1; Trident/4.0; Maxthon; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB6; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; Acoo Browser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Acoo Browser; InfoPath.2; .NET CLR 2.0.50727; Alexa Toolbar)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Acoo Browser; .NET CLR 2.0.50727; .NET CLR 1.1.4322)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Acoo Browser; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; FDM; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Acoo Browser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 6.0; America Online Browser 1.1; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; America Online Browser 1.1; Windows NT 5.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; America Online Browser 1.1; Windows 98)',
        'Mozilla/4.0 (compatible; MSIE 7.0; AOL 8.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0; AOL 8.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
        'Mozilla/4.0 (compatible; MSIE 7.0; AOL 8.0; Windows NT 5.1; .NET CLR 1.0.3705)',
        'Mozilla/4.0 (compatible; MSIE 7.0; AOL 8.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; SV1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; Q312461; YComp 5.0.0.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; Q312461)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; Hotbar 4.2.8.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; Hotbar 4.1.7.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)',
        'Mozilla/5.0 (X11; U; OpenBSD ppc; en-US; rv:1.8.1.9) Gecko/20070223 BonEcho/2.0.0.9',
        'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.9) Gecko/20071103 BonEcho/2.0.0.9',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.9) Gecko/20071113 BonEcho/2.0.0.9',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.8.1.12) Gecko/20080206 Camino/1.5.5',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X Mach-O; en; rv:1.8.1.12) Gecko/20080206 Camino/1.5.5',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; pl-PL; rv:1.0.1) Gecko/20021111 Chimera/0.6',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20021111 Chimera/0.6',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20021104 Chimera/0.6',
        'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Crazy Browser 3.0.5) ; .NET CLR 3.0.04506.30; InfoPath.2; InfoPath.3; .NET CLR 1.1.4322; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; Crazy Browser 3.0.5)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; Crazy Browser 2.0.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; Crazy Browser 2.0.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Crazy Browser 2.0.1)',
        'ELinks/0.13.GIT (textmode; Linux 2.6.29 i686; 119x51-2)',
        'ELinks/0.13.GIT (textmode; Linux 2.6.27-rc6.git i686; 175x65-3)',
        'ELinks/0.13.GIT (textmode; Linux 2.6.26-rc7.1 i686; 119x68-3)',
        'ELinks/0.13.GIT (textmode; Linux 2.6.24-1-686 i686; 175x65-2)',
    ]

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Before running, edit settings.py and uncomment the relevant lines to enable this middleware.
Enable it there (and remember to save after editing, otherwise the change won't take effect).
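For a project named mw, the relevant settings.py block looks roughly like this once uncommented (543 is the template's default order value; middlewares with lower numbers run closer to the engine):

```python
# settings.py -- uncomment (or add) this block to enable the middleware.
DOWNLOADER_MIDDLEWARES = {
    'mw.middlewares.MwDownloaderMiddleware': 543,
}
```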
Now write the spider (test.py in the spiders folder); printing the request headers is enough:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['httpbin.org']
    start_urls = ['https://httpbin.org/get']

    def parse(self, response):
        print(response.text)
Run the spider:
scrapy crawl test
3. Status-code-changing middleware
To demonstrate process_response, let's invent a scenario: change the status code of every returned response to 205.
3.1 The process_response method
First look at process_response: every response the Downloader finishes downloading passes through this process_response method:
process_response(self, request, response, spider):
Parameters:
request: the Request object that was sent
response: the Response object returned by the Downloader
spider: the Spider that issued the request
Return values:
Response object: the (new) response is passed to the other middlewares and then on to the spider
Request object: the download result is discarded; instead of going to the Engine, the new request is rescheduled
Exception raised: the request's errback is called; if there is no errback, the exception propagates
3.2 Changing the response status code
Edit the middleware code in middlewares.py and override process_response:
from scrapy import signals


class MwDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_response(self, request, response, spider):
        response.status = 205  # change the response status code to 205
        return response

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Tweak the test spider (test.py in the spiders folder) to print the response status code:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['httpbin.org']
    start_urls = ['https://httpbin.org/get']

    def parse(self, response):
        print(f'The response status code is now: {response.status}')
Run the spider:
scrapy crawl test
The status code is successfully changed to 205 (it was 200 before the change).
3.3 process_exception
Now a brief explanation of process_exception. It runs when the download raises an exception, or when process_request raises one.
process_exception(self, request, exception, spider):
Parameters: as above, plus exception, the exception that was raised.
Return values:
None: hand the exception on for later code to deal with
Response: treat the returned response as the download result
Request: schedule the new request
4. Proxy-IP middleware
In the earlier used-car (ershouche) project, our IP was banned after scraping only a few dozen Toyota listings, and the site answered with 302 responses. We can solve this by setting a proxy in a downloader middleware.
4.1 Setting a proxy IP in process_request
Use process_request to intercept each request and modify it before it goes out.
Put plainly: intercept every request Scrapy sends and rewrite its request information,
so that each request gets a different proxy IP attached before it is sent.
import requests


class ErshoucheDownloaderMiddleware:
    ...

    def get_proxy(self):
        '''Fetch a proxy IP from the proxy pool.'''
        proxy_api = 'proxy pool URL'  # placeholder for your proxy pool's API address
        proxy_ip = requests.get(proxy_api).text
        print(proxy_ip)
        return proxy_ip

    def process_request(self, request, spider):
        # Build a proxy URL whose scheme matches the request's scheme
        if request.url.split(':')[0] == 'https':
            proxy_url = 'https://' + self.get_proxy()
        else:
            proxy_url = 'http://' + self.get_proxy()
        # Set the proxy IP for this request
        request.meta['proxy'] = proxy_url
        return None
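As an aside, the scheme check above (request.url.split(':')[0]) can be made a little more robust with the standard library's urllib.parse. A small sketch, where proxy_host stands for whatever host:port string the proxy pool returns:

```python
from urllib.parse import urlsplit

def proxy_url_for(request_url, proxy_host):
    """Build a proxy URL whose scheme matches the request's URL."""
    scheme = urlsplit(request_url).scheme or 'http'
    return f'{scheme}://{proxy_host}'
```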
Next, we still need to enable the downloader middleware.
4.2 Enabling the middleware in the settings file
Once it's enabled and saved, run the spider; if the proxy IPs are valid, the banning problem from before is solved.
4.3 Swapping the request's IP in process_exception
When a request raises an exception, the downloader middleware hands that request to process_exception for handling.
With the proxy IPs above, we normally won't get banned any more.
But if it does happen, we can swap in a fresh IP inside process_exception and send the request again.
So let's modify process_exception in the downloader middleware: when a request fails (for example because the current proxy has been banned), attach a new proxy IP to the request.
Look familiar?
Apart from the return value, everything is the same as process_request, because once the exception is handled, the request has to be sent again.
class ErshoucheDownloaderMiddleware:
    ...

    def process_exception(self, request, exception, spider):
        # Attach a fresh proxy with a scheme matching the request,
        # then return the request so Scrapy reschedules it.
        if request.url.split(':')[0] == 'https':
            request.meta['proxy'] = 'https://' + self.get_proxy()
        else:
            request.meta['proxy'] = 'http://' + self.get_proxy()
        return request
- Common middleware scenarios
Downloader middleware:
modify request headers
manage cookies
discard responses with non-200 status codes
modify responses
Spider middleware:
drop requests outside the allowed domains
translate content
discard responses with non-200 status codes
set the encoding
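The "discard responses with non-200 status codes" item, for instance, boils down to a status check like this (a plain-Python sketch of the rule; Scrapy's built-in HttpErrorMiddleware applies a similar filter to non-2xx responses):

```python
def should_keep(status, allowed=()):
    """Keep 2xx responses, plus any explicitly allowed status codes."""
    return 200 <= status < 300 or status in allowed
```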