To change the IP and User-Agent that a Scrapy crawler uses, you modify the crawler's middleware. Scrapy's two commonly used middleware types are downloader middleware and spider middleware.
1. User-Agent
(1) Install fake-useragent: pip install fake-useragent
Usage:
from fake_useragent import UserAgent
ua = UserAgent()
print(ua.ie)
Sample output:
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)
The above is the usage for IE; the other browsers work the same way:
ua.ie
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)
ua.msie
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25
# and the best one, random via real world browser usage statistic
ua.random
ua.update()
(2) Set the User-Agent by adding the following middleware to middlewares.py:
from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def __init__(self, user_agent=''):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        if self.ua:
            ua = self.ua.random  # pick one fresh random UA; calling .random twice would yield two different values
            # print(ua)  # uncomment to see which UA is being used
            request.headers.setdefault('User-Agent', ua)
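Note that `setdefault` only fills in the header when it is absent, so a User-Agent set explicitly elsewhere (for example by the spider itself) is not overwritten. Scrapy's Headers object behaves like a dict here; with a plain dict the semantics look like this:

```python
# setdefault only writes the key when it is missing
headers = {}
headers.setdefault('User-Agent', 'random-ua-1')
headers.setdefault('User-Agent', 'random-ua-2')  # key exists, so this has no effect
print(headers['User-Agent'])  # random-ua-1
```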
Then modify DOWNLOADER_MIDDLEWARES in the settings file:
DOWNLOADER_MIDDLEWARES = {
    # 'xie.middlewares.XieSpiderMiddleware': 543,  # the project's original default entry, disabled
    'xie.middlewares.UserAgentMiddleware': 200,
}
The middleware above is adapted from Scrapy's built-in user-agent middleware; see the source at site-packages/scrapy/downloadermiddlewares/useragent.py.
2. Proxy IPs
Setting a proxy IP works much like setting the User-Agent: define a proxy middleware in middlewares.py;
import base64
import random


class RandomProxyMiddleware(object):
    def __init__(self):
        self.PROXIES = [
            {'ip_port': '111.8.60.9:8123', 'user_passwd': 'user1:pass1'},
            {'ip_port': '101.71.27.120:80', 'user_passwd': 'user2:pass2'},
            {'ip_port': '122.96.59.104:80', 'user_passwd': 'user3:pass3'},
            {'ip_port': '122.224.249.122:8088', 'user_passwd': 'user4:pass4'},
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        if proxy['user_passwd'] is None:
            # proxy that requires no authentication
            request.meta['proxy'] = 'http://' + proxy['ip_port']
        else:
            # base64-encode the credentials; in Python 3 b64encode takes and
            # returns bytes, so encode first and decode back for the header
            base64_userpasswd = base64.b64encode(
                proxy['user_passwd'].encode('utf-8')).decode('ascii')
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = 'http://' + proxy['ip_port']
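The credential encoding above can be checked in isolation. A minimal sketch (the helper name is illustrative, not part of Scrapy):

```python
import base64

def proxy_auth_header(user_passwd):
    # b64encode operates on bytes in Python 3, so encode the "user:pass"
    # string first and decode the result back to str for the header value
    token = base64.b64encode(user_passwd.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

print(proxy_auth_header('user1:pass1'))  # Basic dXNlcjE6cGFzczE=
```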
Then modify DOWNLOADER_MIDDLEWARES in the settings file:
DOWNLOADER_MIDDLEWARES = {
    # 'xie.middlewares.XieSpiderMiddleware': 543,
    'xie.middlewares.UserAgentMiddleware': 200,
    'xie.middlewares.RandomProxyMiddleware': 300,
}
3. Fetching dynamic pages
from selenium import webdriver
from scrapy import signals
from scrapy.http import HtmlResponse


class WebDriverMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Load the browser driver (PhantomJS is deprecated in newer Selenium
        # releases; headless Chrome or Firefox is the usual replacement)
        browser = webdriver.PhantomJS()
        browser.get(request.url)       # load the page
        data = browser.page_source     # grab the rendered HTML
        browser.quit()
        # Returning a Response here short-circuits Scrapy's normal download
        return HtmlResponse(request.url, body=data.encode('utf-8'),
                            encoding='utf-8', request=request)

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Then modify DOWNLOADER_MIDDLEWARES in the settings file:
DOWNLOADER_MIDDLEWARES = {
    # 'xie.middlewares.XieSpiderMiddleware': 543,
    'xie.middlewares.UserAgentMiddleware': 200,
    'xie.middlewares.RandomProxyMiddleware': 300,
    'xie.middlewares.WebDriverMiddleware': 400,
}
The Sina blog is one such dynamic page: to get all of the page's content you have to scroll the page down……
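One way to do that from the middleware is to scroll in a loop until the page height stops growing, meaning no more content is being lazily loaded. The sketch below assumes a Selenium driver object and uses a hypothetical helper name:

```python
import time

def scroll_to_bottom(browser, pause=1.0, max_rounds=20):
    # Scroll repeatedly until document.body.scrollHeight stops changing,
    # i.e. the page has no more lazily loaded content to reveal.
    last_height = browser.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give the page time to load the next batch
        new_height = browser.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height
```

In the middleware above it would be called right after browser.get(request.url) and before reading browser.page_source.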
4. Disabling cookies
In special cases this prevents sites from blocking the crawler based on its cookies; set in settings.py:
COOKIES_ENABLED = False