scrapy-splash (IP pool / User-Agent pool / request args / referer header)
Problems hit in my graduation project: crawling and analyzing high-quality video uploaders on Bilibili
- Problem:
- While polishing the code, the more test runs I made, the more often I hit 502 errors and errno 111 (connection refused). After reading several articles, my guess was that my IP had been banned.
- Solution idea:
- Build an IP pool and a User-Agent pool. Since I can't write Lua, the only places left to hook in are the middlewares in the settings file or the splash_headers parameter of SplashRequest.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'BL.middlewares.BlDownloaderMiddleware': 812,  # the project's own downloader middleware
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_URL = 'http://127.0.0.1:8050'
- The above is the boilerplate scrapy-splash configuration.
- After many tests: my own downloader middleware cannot run at a priority below 810. The error says that when scrapy-splash's helper code goes to call HttpCompressionMiddleware, it is not defined, so that route is a dead end; and if I assign my middleware a priority above 810, it is simply never invoked.
- A bolder guess: what about replacing the contents of this project's middlewares with the contents of SplashMiddleware itself? Tested; the same problem appears: the helper cannot find HttpCompressionMiddleware (not defined). Patching further would mean one change dragging in everything else, so this route is not feasible either. (A sketch of the kind of rotation middleware that was attempted follows below.)
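- For reference, the attempted middleware looked roughly like this: a minimal sketch of the conventional Scrapy rotation pattern, with the module and class names taken from the commented-out entry in DOWNLOADER_MIDDLEWARES above (the exact body is my reconstruction, not the original code). This pattern works with plain scrapy.Request; the issue was only that it cannot be ordered around scrapy-splash's fixed priorities:

# BL/middlewares.py -- a sketch of the conventional rotation middleware
import base64
import random

from BL.settings import PROXY_LIST, USER_AGENT_LIST


class BlDownloaderMiddleware:
    """Pick a random User-Agent and proxy for every outgoing request."""

    def process_request(self, request, spider):
        # Rotate the User-Agent header.
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)

        # Rotate the proxy; authenticated proxies also need a Basic auth header.
        proxy = random.choice(PROXY_LIST)
        request.meta['proxy'] = proxy['ip_port']
        if 'user_passwd' in proxy:
            b64_up = base64.b64encode(proxy['user_passwd'].encode())
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_up.decode()
        # Returning None lets the request continue down the middleware chain.
        return None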
- Here are my IP pool and User-Agent pool:
USER_AGENT_LIST = [ "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10", "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko ) Version/5.1 Mobile/9B176 Safari/7534.48.3", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; da-dk) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1", "Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27", "Mozilla/5.0 (Windows; U; Windows NT 6.1; ko-KR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27", "Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27", "Mozilla/5.0 (Windows; U; Windows NT 6.1; cs-CZ) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27", ] PROXY_LIST = [ {'ip_port': 'http://116.8.108.53:16816', 'user_passwd': '***:***'}, {'ip_port': 'http://39.99.158.153:16817', 'user_passwd': '***:***'}, # {'ip_port': 'http://114.117.164.198:16817', 'user_passwd': ***:***'}, # {'ip_port':'http://220.249.149. 52:9999'}, # {'ip_port':'http://175.42.129.2 15:9999'}, # {'ip_port':'http://175.42.129.1 20:9999'}, # {'ip_port':'http://36.250.156.135:9999'}, # {'ip_port':'http://163.204.242.238:9999'}, # {'ip_port':'http://60.174.190.15:9999'}, # {'ip_port':'http://117.95.198.132:9999'}, # {'ip_port':'http://114.239.151.229:9999'}, # {'ip_port':'http://36.249.48.47:9999'}, # {'ip_port':'http://175.44.109.205:9999'}, # {'ip_port':'http://58.22.177.224:9999'}, # {'ip_port':'http://60.169.133.225:9999'}, # {'ip_port':'http://1.198.42.177:9999'}, # {'ip_port':'http://49.70.94.154:9999'}, # {'ip_port':'http://42.238.87.240:9999'}, ]
- Below is how the proxy IP and User-Agent are applied when issuing the request:
import base64
import random

import scrapy
from scrapy_splash import SplashRequest

from BL.settings import USER_AGENT_LIST, PROXY_LIST


class BlSpider(scrapy.Spider):
    name = 'bl'
    allowed_domains = ['target-site.com']   # placeholder for the real domain
    # Entry point: the site's home page (placeholder URL).
    start_urls = ['https://target-site.com']

    def start_requests(self):
        # for url in self.start_urls:
        proxy = random.choice(PROXY_LIST)
        print('Using proxy:', proxy)
        b64_up = base64.b64encode(proxy['user_passwd'].encode())
        # Build the Basic auth value (note the single space after 'Basic').
        ProxyAuthorization = 'Basic ' + b64_up.decode()
        # Hand the proxy credentials to Splash via splash_headers.
        yield SplashRequest(
            self.start_urls[0],
            callback=self.parse_splash,
            args={'wait': 5},           # let the page render for up to 5 seconds
            endpoint='render.html',     # the fixed Splash rendering endpoint
            dont_process_response=False,
            splash_headers={
                'referer': 'https://www.bilibili.com/v/technology/science#/',
                # Added line: rotate the UA too; the original imported the pool but never used it.
                'User-Agent': random.choice(USER_AGENT_LIST),
                'Proxy-Authorization': ProxyAuthorization,
            },
            dont_send_headers=False,
            magic_response=True,
            session_id='default',
            http_status_from_error_code=True,
            cache_args=None,
            meta=None,
        )
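- The parse_splash callback referenced above was not shown; a minimal sketch of what it could look like inside the same class (the CSS selector is a placeholder, not Bilibili's real markup):

    def parse_splash(self, response):
        # response contains the HTML as rendered by Splash after the wait.
        self.logger.info('rendered %s (%d bytes)', response.url, len(response.body))
        # Placeholder extraction; the selector below is hypothetical.
        for title in response.css('a.video-title::text').getall():
            yield {'title': title.strip()}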
- With this in place the program runs normally. The principle, roughly: each time a request is issued, switch to a different IP, which gets around the site's anti-crawling measures. (Splash itself runs in Docker as the scrapinghub/splash image.)
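- Before running the spider it can help to confirm the Splash container is actually up. One way is a quick probe of Splash's documented render.html endpoint, using the same SPLASH_URL as in settings.py; a minimal sketch:

import requests

# Ask the local Splash service to render a page; a 200 response with HTML
# means the container from the scrapinghub/splash image is reachable.
resp = requests.get(
    'http://127.0.0.1:8050/render.html',
    params={'url': 'https://www.bilibili.com', 'wait': 5},
    timeout=60,
)
print(resp.status_code, len(resp.text))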