There are generally two ways to change the User-Agent sent with requests:
One is to modify the USER_AGENT variable in settings.py (suitable only when very few agents are involved, so it is rarely used); the other is to override it in a Downloader Middleware's process_request() method, i.e. add a RandomUserAgentMiddleware class in middlewares.py.
First, let's look at the UserAgentMiddleware that Scrapy ships with by default:
from scrapy import signals

class UserAgentMiddleware(object):
    """Set the User-Agent for outgoing requests."""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        # A spider can override the agent via its user_agent attribute
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
If no agent is configured, requests go out with the default:
"User-Agent": "Scrapy/1.5.0 (+https://scrapy.org)"
Next, three forms of the second approach, i.e. adding a custom UserAgentMiddleware class:
1. The first method:
In the useragent file, add a new