1. Define a dynamic User-Agent middleware
1.1 Write the middleware
In middlewares.py, write a UserAgentMiddlerware middleware that sets a randomly chosen User-Agent on each request.
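from fake_useragent import UserAgent


# Custom middleware that sets a random User-Agent on each request
class UserAgentMiddlerware:
    def process_request(self, request, spider):
        # setdefault leaves an existing User-Agent header untouched
        request.headers.setdefault(b'User-Agent', UserAgent().random)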
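Because `setdefault` only fills in the header when none is present, a request that already carries a User-Agent keeps it. The sketch below illustrates that behaviour without Scrapy or fake_useragent installed. `FakeRequest`, `USER_AGENT_POOL`, and `RandomPoolUserAgentMiddleware` are hypothetical names made up for illustration, and a plain dict stands in for Scrapy's headers object.

```python
import random

# Hypothetical pool of User-Agent strings (assumption: in practice these
# could come from fake_useragent or a maintained list).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/28.0.1467.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only carries headers."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


class RandomPoolUserAgentMiddleware:
    """Sets a random User-Agent unless the request already has one."""

    def process_request(self, request, spider):
        request.headers.setdefault(b"User-Agent", random.choice(USER_AGENT_POOL))
        # Returning None tells Scrapy to keep processing the request.
        return None


middleware = RandomPoolUserAgentMiddleware()

# A request without a User-Agent receives one from the pool.
plain = FakeRequest()
middleware.process_request(plain, spider=None)

# A request that already has a User-Agent is left unchanged.
preset = FakeRequest({b"User-Agent": "my-crawler/1.0"})
middleware.process_request(preset, spider=None)
```

To force a fresh User-Agent on every request regardless of what is already set, assign to `request.headers[b"User-Agent"]` directly instead of calling `setdefault`.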
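1.2 Configure the middleware
In settings.py, enable DOWNLOADER_MIDDLEWARES. It can list multiple middlewares; each number sets the execution order, and for process_request a smaller value runs earlier.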
DOWNLOADER_MIDDLEWARES = {
    'csrapy02.middlewares.UserAgentMiddlerware': 300,
}
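The ordering rule can be sketched in plain Python: sorting the entries by value gives the order in which `process_request` is called. The two built-in priorities shown here (500 for Scrapy's own UserAgentMiddleware, 750 for HttpProxyMiddleware) are the defaults in Scrapy's DOWNLOADER_MIDDLEWARES_BASE as far as I know, so check them against your Scrapy version. `request_order` is a hypothetical helper, not part of Scrapy.

```python
# Custom entry from settings.py plus two built-in defaults
# (assumed from Scrapy's DOWNLOADER_MIDDLEWARES_BASE).
middlewares = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    "csrapy02.middlewares.UserAgentMiddlerware": 300,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
}


def request_order(mapping):
    """Return middleware paths in the order process_request is called.

    Entries whose value is None are treated as disabled and skipped.
    """
    enabled = {path: prio for path, prio in mapping.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


order = request_order(middlewares)
for path in order:
    print(path)
```

With the custom middleware at 300, it sets the header before the built-in one at 500 runs, so the built-in middleware should find a User-Agent already present and leave it alone.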
1.3 Result
Here I request http://httpbin.org/get, which returns the client's request headers. The response shows that the User-Agent was randomly set to Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1467.0 Safari/537.36
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1467.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5ef69772-ba3abf0097859a2a24c14a4e"
  },
  "origin": "27.38.242.248",
  "url": "http://httpbin.org/get"
}
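To check the header programmatically rather than by eye, the JSON body can be parsed. This is a minimal sketch, assuming the response has the shape shown above. `extract_user_agent` is a hypothetical helper; inside a spider callback the input would typically be `response.text`.

```python
import json


def extract_user_agent(body):
    """Return the User-Agent that httpbin echoed back, or None if absent."""
    data = json.loads(body)
    return data.get("headers", {}).get("User-Agent")


# Trimmed copy of the response shown above.
sample_body = """
{
  "args": {},
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1467.0 Safari/537.36"
  },
  "url": "http://httpbin.org/get"
}
"""

user_agent = extract_user_agent(sample_body)
print(user_agent)
```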
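2. Define a dynamic Proxy middleware
2.1 Write the middleware
In middlewares.py, write a ProxyMiddlerware middleware and define a helper class ProxyIp whose get_proxy_ip method returns a proxy address (the address can be picked at random from a database or from a proxy provider).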
# Custom middleware that assigns a proxy to each request
class ProxyMiddlerware:
    def process_request(self, request, spider):
        # Format: request.meta['proxy'] = 'type://uname:password@ip:port'
        request.meta['proxy'] = 'http://' + ProxyIp().get_proxy_ip()


# Helper that returns a proxy address as 'ip:port'
class ProxyIp:
    def get_proxy_ip(self):
        ip = '223.241.2.139:4216'
        return ip
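The comment in process_request gives the proxy URL format `type://uname:password@ip:port`. The sketch below builds such a URL from a randomly chosen address. `PROXY_POOL`, `build_proxy_url`, and `pick_proxy` are hypothetical names, and the second address is a placeholder from a documentation IP range, not a real proxy.

```python
import random
from urllib.parse import quote

# Hypothetical pool; in practice the addresses might come from a database
# or a proxy provider, as mentioned above.
PROXY_POOL = ["223.241.2.139:4216", "203.0.113.10:8080"]


def build_proxy_url(address, scheme="http", username=None, password=None):
    """Format a proxy as scheme://[user:password@]ip:port."""
    if username is None:
        return f"{scheme}://{address}"
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    credentials = f"{quote(username, safe='')}:{quote(password or '', safe='')}"
    return f"{scheme}://{credentials}@{address}"


def pick_proxy():
    """Pick a random address from the pool and return it as a proxy URL."""
    return build_proxy_url(random.choice(PROXY_POOL))


proxy_url = pick_proxy()
print(proxy_url)
```

In a middleware, the returned string would be assigned to `request.meta['proxy']`. Picking a new address per request spreads traffic across the pool, at the cost of not knowing in advance which proxy served a given response.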
2.2 Configure the middleware
In settings.py, add the ProxyMiddlerware middleware to DOWNLOADER_MIDDLEWARES and set its execution priority.
DOWNLOADER_MIDDLEWARES = {
    'csrapy02.middlewares.UserAgentMiddlerware': 300,
    'csrapy02.middlewares.ProxyMiddlerware': 301,
}
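2.3 Result
The origin field now shows 223.241.2.139, the proxy's IP, which indicates that the request went through the proxy.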
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1467.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5ef69772-ba3abf0097859a2a24c14a4e"
  },
  "origin": "223.241.2.139",
  "url": "http://httpbin.org/get"
}
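One way to confirm that the proxy took effect is to compare the origin field with the proxy's IP. This is a minimal sketch; `went_through_proxy` is a hypothetical helper, and the bodies are trimmed copies of the two responses shown in this article.

```python
import json


def went_through_proxy(body, proxy_address):
    """Return True if httpbin saw the request coming from the proxy's IP.

    proxy_address is 'ip:port'; only the IP part is compared.
    """
    proxy_ip = proxy_address.rsplit(":", 1)[0]
    origin = json.loads(body).get("origin", "")
    # Assumption: origin may hold several comma-separated addresses when the
    # request passed through more than one hop, so each one is checked.
    seen = [part.strip() for part in origin.split(",")]
    return proxy_ip in seen


with_proxy = '{"origin": "223.241.2.139", "url": "http://httpbin.org/get"}'
without_proxy = '{"origin": "27.38.242.248", "url": "http://httpbin.org/get"}'

print(went_through_proxy(with_proxy, "223.241.2.139:4216"))
print(went_through_proxy(without_proxy, "223.241.2.139:4216"))
```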