This post looks at Scrapy's downloader middleware. The corresponding code lives in middlewares.py (class GithubDownloaderMiddleware), and by modifying the middleware we can set a random request header and a random proxy IP for each request. We first walk through the downloader-middleware code, then show how to randomize the header and the IP.
1 The downloader middleware
Below is the downloader-middleware template that Scrapy generates:
from scrapy import signals


class GithubDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
We will focus mainly on process_request(self, request, spider) and process_response(self, request, response, spider).
process_request(self, request, spider):
called for every request that passes through the downloader middleware.
process_response(self, request, response, spider):
called when the downloader has finished the HTTP request and is passing the response back to the engine.
process_exception(self, request, exception, spider):
called when the download handler or another middleware's process_request() raises an exception.
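To see when each hook fires, here is a minimal logging middleware; this is just a sketch for illustration, not part of the generated template:

import logging

logger = logging.getLogger(__name__)


class LoggingMiddleware:
    def process_request(self, request, spider):
        logger.debug('request out: %s', request.url)
        return None  # let the remaining middlewares and the downloader run

    def process_response(self, request, response, spider):
        logger.debug('response in: %s %s', response.status, request.url)
        return response  # pass the response on unchanged

    def process_exception(self, request, exception, spider):
        logger.debug('download failed: %s (%r)', request.url, exception)
        return None  # fall through to other process_exception handlers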
1.1 process_request(self, request, spider)
Called for every Request object that passes through the downloader middleware; middlewares closer to the engine (those with a smaller priority number in DOWNLOADER_MIDDLEWARES) have their process_request() called first. The method must do one of the following:
return None / return a Response object / return a Request object / raise an IgnoreRequest exception
The process_request stub looks like this:
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    return None
Return None: Scrapy continues running the corresponding methods of the other middlewares and then downloads the request;
Return a Response object: Scrapy no longer calls the other middlewares' process_request() methods and does not start the download; the returned Response is used directly;
Return a Request object: Scrapy no longer calls the other middlewares' process_request() methods and instead hands the new request to the scheduler to be downloaded later;
If this method raises an exception, the process_exception() methods are called.
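A sketch of the three non-None outcomes; the BLOCKED set and CACHE dict below are made-up stand-ins for this illustration, not Scrapy APIs:

from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


class ShortCircuitMiddleware:
    BLOCKED = {'blocked.example.com'}  # hypothetical deny-list
    CACHE = {}  # hypothetical url -> bytes cache, filled elsewhere

    def process_request(self, request, spider):
        if any(host in request.url for host in self.BLOCKED):
            # hands the request over to the process_exception() chain
            raise IgnoreRequest('blocked host')
        if request.url in self.CACHE:
            # returning a Response skips the download entirely
            return HtmlResponse(url=request.url,
                                body=self.CACHE[request.url])
        return None  # continue down the middleware chain to the downloader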
1.2 process_response(self, request, response, spider)
Called for every Response that passes through the downloader middleware, in the reverse order of process_request(): middlewares closer to the engine have their process_response() called last. The method must return one of: a Response object, a Request object, or raise an IgnoreRequest exception.
Return a Response object: Scrapy continues calling the other middlewares' process_response() methods;
Return a Request object: the middleware chain stops and the request is handed back to the scheduler to be downloaded again;
Raise IgnoreRequest: the request's errback (Request.errback) is called to handle it; if nothing handles it, the request is ignored and not even logged.
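For example, process_response() can turn a bad status code into a retry by returning a fresh Request. A rough sketch (the 403/429 pair and the retry cap of 2 are arbitrary choices for illustration):

class RetryBadStatus:
    def process_response(self, request, response, spider):
        retries = request.meta.get('status_retries', 0)
        if response.status in (403, 429) and retries < 2:
            retry = request.replace(dont_filter=True)  # skip the dupe filter
            retry.meta['status_retries'] = retries + 1
            return retry  # re-queued for download; the chain stops here
        return response  # everything else goes on to the spider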
2 Setting a random header and IP
When a crawler hits the same page frequently and the request looks identical every time, the server can easily spot it and block that request header. So before each visit we change the request header at random, which keeps the crawler from being caught. This random switching is done in the downloader middleware: right before the request is sent to the server, pick a header at random so that we never keep using the same one.
The same idea applies to the IP address.
2.1 Setting a random request header
URL for testing the request header: http://httpbin.org/user-agent
A site with a large collection of User-Agent strings: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser
The steps are as follows:
(1) Add the following code to the middlewares file
# Pick a random request header
import random


class RandomUserAgent(object):
    def process_request(self, request, spider):
        # print(request)   # the Request object passing through
        # print(spider)    # the spider that issued it
        # pick a random User-Agent from the pool defined in settings
        user_agent = random.choice(spider.settings['USER_AGENTS'])
        request.headers['User-Agent'] = user_agent
        # print(user_agent)


# Check which header actually went out with the request
class CheckUserAgent(object):
    def process_response(self, request, response, spider):
        print(request.headers['User-Agent'])
        return response
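spider.settings works, but a common alternative is to read the pool once in from_crawler() instead of on every request; a sketch of the same middleware written that way:

import random


class RandomUserAgentFromSettings(object):
    # hypothetical variant of RandomUserAgent above, same behaviour
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)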
(2) The settings file
1) Add the pool of request headers for the middlewares file to choose from
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]
2) Enable the middlewares
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'ua.middlewares.UaDownloaderMiddleware': 543,
    'ua.middlewares.RandomUserAgent': 544,
    'ua.middlewares.CheckUserAgent': 545,
}
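Note that Scrapy's built-in scrapy.downloadermiddlewares.useragent.UserAgentMiddleware (priority 400 by default) also sets the User-Agent from the USER_AGENT setting. Because RandomUserAgent runs later (544), its header wins anyway, but you can also switch the built-in one off explicitly:

DOWNLOADER_MIDDLEWARES = {
    'ua.middlewares.RandomUserAgent': 544,
    'ua.middlewares.CheckUserAgent': 545,
    # disable the built-in User-Agent handling entirely
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}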
(3) The spider file
import scrapy


class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    # httpbin.org/user-agent shows which request header arrived, so visit it
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # print(response.text)
        pass
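If you want the spider itself to show the header the server saw, httpbin returns it as JSON, e.g. {"user-agent": "Mozilla/5.0 (...)"}; a sketch of a parse() that decodes it:

import json

import scrapy


class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the request's User-Agent header back as JSON
        data = json.loads(response.text)
        self.logger.info('server saw: %s', data['user-agent'])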
2.2 Setting a random IP
This works the same way as the random request header.
(1) Middleware code
# Pick a random proxy IP
import random


class RandomIp(object):
    def process_request(self, request, spider):
        # PROXY in settings holds entries like 'http://host:port'
        proxy = random.choice(spider.settings['PROXY'])
        request.meta['proxy'] = proxy
        print(proxy)


# Check which proxy the request actually used
class CheckIp(object):
    def process_response(self, request, response, spider):
        print(request.meta['proxy'])
        return response
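This pairs naturally with the process_exception() hook from section 1: if a proxy is dead or times out, pick another one and resend the request. A rough sketch (the retry cap of 2 is an arbitrary choice):

import random


class RetryWithNewProxy(object):
    # not the built-in RetryMiddleware; just an illustration
    def process_exception(self, request, exception, spider):
        retries = request.meta.get('proxy_retries', 0)
        proxies = spider.settings.getlist('PROXY')
        if proxies and retries < 2:
            retry = request.replace(dont_filter=True)  # skip the dupe filter
            retry.meta['proxy'] = random.choice(proxies)
            retry.meta['proxy_retries'] = retries + 1
            return retry  # returning a Request stops the exception chain
        return None  # otherwise let other handlers deal with it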
(2) The settings file
1) Set the proxy list
PROXY = ['http://xxxx:port1', 'http://xxxx:port2', 'http://xxxx:port3']  # fill in your own proxies
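If your proxies require a username and password, one approach is to add a Proxy-Authorization header yourself via w3lib (a library Scrapy depends on). A sketch; PROXY_USER and PROXY_PASS are made-up setting names, not Scrapy built-ins:

import random

from w3lib.http import basic_auth_header


class RandomAuthProxy(object):
    # hypothetical variant of RandomIp for authenticated proxies
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(spider.settings['PROXY'])
        request.headers['Proxy-Authorization'] = basic_auth_header(
            spider.settings['PROXY_USER'], spider.settings['PROXY_PASS'])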
2) Set the middleware priorities
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'ua.middlewares.UaDownloaderMiddleware': 543,
    'ua.middlewares.RandomUserAgent': 544,
    'ua.middlewares.CheckUserAgent': 545,
    'ua.middlewares.RandomIp': 546,
    'ua.middlewares.CheckIp': 547,
}
(3) The spider file
import scrapy


class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    # start_urls = ['http://httpbin.org/user-agent']
    # httpbin.org/ip shows the IP address the request came from
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # print(response.text)
        pass
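As with the header check, httpbin returns the result as JSON, e.g. {"origin": "203.0.113.5"}, so the spider can log the exit address and compare it against the proxy that CheckIp printed; a sketch (CheckIpSpider is a made-up companion spider, separate from the one above):

import json

import scrapy


class CheckIpSpider(scrapy.Spider):
    name = 'checkip'
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # httpbin.org/ip reports the address the request arrived from;
        # it should match the proxy chosen by RandomIp
        self.logger.info('exit IP: %s', json.loads(response.text)['origin'])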