In the Scrapy framework, the Downloader Middlewares sit between the Engine and the Downloader. A request issued by a spider passes through the Engine to the Downloader, which downloads it; the finished response then passes back through the Engine to the spider for parsing. A downloader middleware can therefore act on a request before it reaches the Downloader, and on a response before it reaches the Engine — for example to set a proxy or request headers. These hooks are implemented by two methods:
- process_request(self, request, spider): called before the request is sent to the Downloader
- process_response(self, request, response, spider): called before the response is sent to the Engine
process_request
Parameters:
- request: the Request object being sent
- spider: the spider that issued the request
Return values:
- None: Scrapy continues processing this request, running the process_request methods of the remaining middlewares until the appropriate Downloader handler is called
- a Response object: Scrapy calls no further process_request methods and returns this Response directly; the process_response methods of the activated middlewares are still called for it, as for every response
- a Request object: the original request is dropped, and the returned request is scheduled instead
- an exception is raised: process_exception is called
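As a sketch of this return-value contract, the middleware below short-circuits requests to hosts outside a whitelist. The class name and domain list are hypothetical, not part of the example project, and a plain ValueError stands in for the scrapy.exceptions.IgnoreRequest that real Scrapy code would raise, so that the sketch needs no Scrapy installation:

```python
from urllib.parse import urlparse


class BlockNonWhitelistedMiddleware:
    """Sketch only: illustrates the process_request return contract."""

    # Hypothetical whitelist for demonstration purposes
    ALLOWED_HOSTS = {'httpbin.org', 'www.httpbin.org'}

    def process_request(self, request, spider):
        host = urlparse(request.url).netloc
        if host not in self.ALLOWED_HOSTS:
            # In real Scrapy code, raise scrapy.exceptions.IgnoreRequest here;
            # ValueError keeps the sketch dependency-free.
            raise ValueError(f'blocked host: {host}')
        # Returning None lets the remaining middlewares and the
        # Downloader continue processing this request normally.
        return None
```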
process_response
Parameters:
- request: the Request object that produced this response
- response: the Response object returned by the Downloader
- spider: the spider that issued the request
Return values:
- a Response object: the (possibly new) response is passed to the other middlewares and then to the spider
- a Request object: delivery of the response is halted, and the returned request is re-scheduled for download
- an exception is raised: the request's errback is called; if no errback exists, the exception propagates
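A sketch of the Request-return branch: the middleware below re-schedules a request whenever the Downloader returns a 5xx status. Scrapy ships a built-in RetryMiddleware that does this properly (with retry limits); this stripped-down version, with a hypothetical class name and no Scrapy imports, only illustrates the return-value contract of process_response:

```python
class RetryServerErrorMiddleware:
    """Sketch only: illustrates the process_response return contract."""

    def process_response(self, request, response, spider):
        if 500 <= response.status < 600:
            # Returning the Request object stops further processing of
            # this response and re-schedules the request for download.
            return request
        # Returning the Response passes it on to the next middleware
        # and eventually to the spider.
        return response
```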
Setting a random request header
Using a random request header keeps the server from detecting that every request arrives with the same header. This can be implemented in the Downloader Middlewares:
settings.py
The following settings are still required:
- ROBOTSTXT_OBEY: set to False. When True, the crawler obeys the robots exclusion protocol: it fetches robots.txt first and respects the rules found there
- DEFAULT_REQUEST_HEADERS: the default request headers; a User-Agent can be added here so requests appear to come from a browser rather than a crawler
- DOWNLOAD_DELAY: the delay between downloads, to avoid requesting too fast
- DOWNLOADER_MIDDLEWARES: enables the middleware defined in middlewares.py
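Put together, the relevant portion of settings.py might look like this. The delay value, header values, and the 'myproject' module path and class name are illustrative placeholders; adjust them to your project's actual names:

```python
# settings.py (excerpt) — values are illustrative
ROBOTSTXT_OBEY = False   # do not fetch or obey robots.txt

DOWNLOAD_DELAY = 1       # wait 1 second between downloads

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable the middleware defined in middlewares.py; 'myproject' and the
# class name are placeholders. Lower numbers run closer to the Engine.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
}
```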
spider
# -*- coding: utf-8 -*-
import scrapy


class HeaderSpider(scrapy.Spider):
    name = 'header'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://www.httpbin.org/user-agent']

    def parse(self, response):
        print(response.text)
        # dont_filter=True lets the same URL be requested repeatedly
        yield scrapy.Request(url=self.start_urls[0], dont_filter=True)
middlewares.py
import random


# The class name is a placeholder following Scrapy's generated middleware
# template; it must match the entry registered in DOWNLOADER_MIDDLEWARES.
class RandomUserAgentMiddleware:
    USER_AGENTS = [
        'Opera/9.80 (X11; Linux x86_64; U; pl) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (X11; Linux i686; U; it) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.6.37 Version/11.00',
        'Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 6.1; U; ko) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 6.1; U; fi) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 6.1; U; en-GB) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 6.1 x64; U; en) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 6.0; U; en) Presto/2.7.39 Version/11.00',
        'Opera/9.80 (Windows NT 5.1; U; ru) Presto/2.7.39 Version/11.00',
        'Opera/9.80 (Windows NT 5.1; U; MRA 5.5 (build 02842); ru) Presto/2.7.62 Version/11.00',
        'Opera/9.80 (Windows NT 5.1; U; it) Presto/2.7.62 Version/11.00',
        'Mozilla/5.0 (Windows NT 6.0; U; ja; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00',
        'Mozilla/5.0 (Windows NT 5.1; U; pl; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00',
        'Mozilla/5.0 (Windows NT 5.1; U; de; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00',
        'Mozilla/4.0 (compatible; MSIE 8.0; X11; Linux x86_64; pl) Opera 11.00',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; fr) Opera 11.00',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; ja) Opera 11.00',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; pl) Opera 11.00',
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # continue normal request processing
In the file above, only the process_request method needs to be written. The output is:
2020-05-25 17:29:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/user-agent> (referer: http://www.httpbin.org/user-agent)
{
"user-agent": "Opera/9.80 (Windows NT 6.1 x64; U; en) Presto/2.7.62 Version/11.00"
}
2020-05-25 17:29:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/user-agent> (referer: http://www.httpbin.org/user-agent)
{
"user-agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; ja) Opera 11.00"
}
2020-05-25 17:29:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/user-agent> (referer: http://www.httpbin.org/user-agent)
{
"user-agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; pl) Opera 11.00"
}
2020-05-25 17:29:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/user-agent> (referer: http://www.httpbin.org/user-agent)
{
"user-agent": "Opera/9.80 (Windows NT 6.1 x64; U; en) Presto/2.7.62 Version/11.00"
}
2020-05-25 17:29:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/user-agent> (referer: http://www.httpbin.org/user-agent)
{
"user-agent": "Opera/9.80 (X11; Linux x86_64; U; pl) Presto/2.7.62 Version/11.00"
}
The output shows the User-Agent varying across requests: the request header is now chosen at random for each one.