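Using downloader middleware:
- process_request(self, request, spider):
  - Called for every request that passes through the downloader middleware.
  - Return None: the request continues through the remaining middlewares and on to the downloader.
  - Return a Response object: the download is skipped and the response is handed back to the engine.
  - Return a Request object: the new request is handed to the engine, which passes it to the scheduler to be queued again.
- process_response(self, request, response, spider):
  - Called when the downloader has finished the HTTP request and the response is on its way back to the engine.
  - Return a Response: it is passed on to the process_response of the next middleware in the chain, and finally to the engine.
  - Return a Request object: it is handed to the scheduler to be requested again.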
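The effect of the three return values of process_request can be sketched with a small toy model. This is not Scrapy's actual implementation: the Request and Response classes, the run_process_request helper, and the three example middlewares are all hypothetical stand-ins that exist only to make the control flow visible.

```python
# Toy model of how process_request return values steer a request.
# Everything here is a hypothetical stand-in, not Scrapy's real code.

class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, body=""):
        self.url = url
        self.body = body


def run_process_request(middlewares, request, spider=None):
    """Call process_request on each middleware in turn and act on the result."""
    for mw in middlewares:
        result = mw.process_request(request, spider)
        if result is None:
            continue  # None: carry on with the next middleware
        if isinstance(result, Response):
            return ("skip_download", result)  # Response: no download happens
        if isinstance(result, Request):
            return ("reschedule", result)  # Request: goes back to the scheduler
        raise TypeError("process_request must return None, Response or Request")
    return ("download", request)  # every middleware returned None


class PassThrough:
    def process_request(self, request, spider):
        return None


class CachedPage:
    def process_request(self, request, spider):
        return Response(request.url, body="cached")


class ForceHttps:
    def process_request(self, request, spider):
        if request.url.startswith("http://"):
            return Request("https://" + request.url[len("http://"):])
        return None


print(run_process_request([PassThrough()], Request("http://example.com"))[0])
print(run_process_request([PassThrough(), CachedPage()], Request("http://example.com"))[0])
print(run_process_request([ForceHttps()], Request("http://example.com"))[0])
```

Only process_request is modelled here; process_response follows the same idea, with a returned Response continuing along the chain and a returned Request being scheduled again.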
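Defining a downloader middleware that sets a random User-Agent
1. Complete the code in middlewares.py
import random
from Douban.settings import USER_AGENT_LIST  # mind the import path; PyCharm may flag it as an error, which can be ignored

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent for each outgoing request
        user_agent = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = user_agent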
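As a quick check that needs no crawl, the middleware logic can be exercised against a stub object. The sketch below is self-contained and makes some assumptions: FakeRequest is a hypothetical stand-in for scrapy.Request that only carries a headers dict, and the two User-Agent strings are placeholders rather than the real list from settings.py.

```python
import random

# Placeholder values standing in for the list kept in settings.py
USER_AGENT_LIST = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
]


class UserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        # No explicit return: None lets the request continue to the downloader.


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request with just a headers dict."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
result = UserAgentMiddleware().process_request(request, spider=None)
print(result, request.headers['User-Agent'])
```

The method returns None on purpose: the header is modified in place and the request carries on as usual.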
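2. In the spider's parse method, check that the User-Agent setting takes effect
import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']  # a URL that echoes the request headers, handy for testing the UA

    def parse(self, response):
        print('-' * 30)
        print(response.body.decode())
        print('-' * 30)
        # dont_filter=True bypasses the duplicate filter, so the same URL is requested again and again until the crawl is stopped
        yield scrapy.Request('http://httpbin.org/get', dont_filter=True)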
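Printing the whole body works, but http://httpbin.org/get answers with JSON that echoes the request headers under a "headers" key, so the User-Agent can also be pulled out directly. A minimal sketch, assuming that response shape; extract_user_agent is a hypothetical helper name and sample_body is a hand-written example, not captured output.

```python
import json


def extract_user_agent(body: bytes) -> str:
    """Return the User-Agent that httpbin echoed back in its JSON body."""
    data = json.loads(body.decode("utf-8"))
    return data["headers"]["User-Agent"]


# Hand-written sample in the shape of an httpbin /get response
sample_body = (
    b'{"args": {}, '
    b'"headers": {"Host": "httpbin.org", "User-Agent": "ExampleBrowser/1.0"}, '
    b'"url": "http://httpbin.org/get"}'
)

print(extract_user_agent(sample_body))  # prints ExampleBrowser/1.0
```

Inside parse, the same helper could be called as extract_user_agent(response.body), which makes it easier to see the value change from one request to the next.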
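3. Enable the custom downloader middleware in settings.py, in the same way as an item pipeline
DOWNLOADER_MIDDLEWARES = {
    # Replace Tencent with your own project's package name
    'Tencent.middlewares.UserAgentMiddleware': 543,
}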
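The number assigned to each middleware is its order. Scrapy calls process_request in increasing order of that number and process_response in decreasing order, and a value of None disables the entry. The sketch below only illustrates that ordering rule: the helper functions and the myproject paths are hypothetical, and this is not how Scrapy builds its middleware chain internally.

```python
# Hypothetical settings dict; the module paths are placeholders.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UserAgentMiddleware': 543,
    'myproject.middlewares.ProxyMiddleware': 544,
    'myproject.middlewares.TimingMiddleware': 100,
    'myproject.middlewares.DisabledMiddleware': None,  # None switches it off
}


def request_order(middlewares):
    """Paths in the order their process_request would run (ascending)."""
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


def response_order(middlewares):
    """Paths in the order their process_response would run (descending)."""
    return list(reversed(request_order(middlewares)))


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print("request ->", path)
for path in response_order(DOWNLOADER_MIDDLEWARES):
    print("response <-", path)
```

Scrapy's built-in UserAgentMiddleware is registered at 500 in DOWNLOADER_MIDDLEWARES_BASE, so a custom middleware at 543 sees the request afterwards and its assignment replaces the default header.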
4. Add the list of User-Agent strings to settings.py
USER_AGENT_LIST = [
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", \
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR