Web Scraping

scrapy-splash (IP pool / User-Agent pool / request parameters / referer parameter)

Problems encountered in my graduation project: scraping and analyzing high-quality video creators on Bilibili

  1. Problem:
    1. As I refined the code and the number of test runs grew, requests started failing with 502 errors and "connection refused" (errno 111). After reading several articles on the topic, my guess was that my IP had been banned.
  2. Solution idea:
    1. Build an IP (proxy) pool and a User-Agent pool. Since I cannot write Lua, the only places to hook them in are the middlewares registered in the settings file or the splash_headers parameter of SplashRequest.

 

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'BL.middlewares.BlDownloaderMiddleware': 812,    # the project's own downloader middleware
}
# dedup filter and HTTP cache storage that understand Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# address of the Splash rendering service (running in Docker, see the end of this post)
SPLASH_URL = 'http://127.0.0.1:8050'
  • The above is the standard scrapy-splash configuration.
  • After repeated tests, my own downloader middleware cannot be enabled at a priority lower than 810: scrapy-splash's helper code raises an error that HttpCompressionMiddleware is undefined when it tries to call it, so that route does not work. And if I give my middleware a priority higher than 810, it is simply never invoked.
  • A bolder guess: what if I replaced the contents of this project's middlewares module with the contents of SplashMiddleware itself? Tested; the same problem appears (the helper cannot find HttpCompressionMiddleware, undefined). Patching beyond that would mean pulling one thread and unravelling the whole thing, so this approach is not feasible either.
  1. Here are my IP pool and User-Agent pool:
  2. USER_AGENT_LIST = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10",
        "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko ) Version/5.1 Mobile/9B176 Safari/7534.48.3",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; da-dk) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; ko-KR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; cs-CZ) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27",
    ]
    # each proxy entry carries the endpoint and its credentials (redacted here)
    PROXY_LIST = [
        {'ip_port': 'http://116.8.108.53:16816', 'user_passwd': '***:***'},
        {'ip_port': 'http://39.99.158.153:16817', 'user_passwd': '***:***'},
        # {'ip_port': 'http://114.117.164.198:16817', 'user_passwd': '***:***'},
        # free proxies without authentication:
        # {'ip_port': 'http://220.249.149.52:9999'},
        # {'ip_port': 'http://175.42.129.215:9999'},
        # {'ip_port': 'http://175.42.129.120:9999'},
        # {'ip_port':'http://36.250.156.135:9999'},
        # {'ip_port':'http://163.204.242.238:9999'},
        # {'ip_port':'http://60.174.190.15:9999'},
        # {'ip_port':'http://117.95.198.132:9999'},
        # {'ip_port':'http://114.239.151.229:9999'},
        # {'ip_port':'http://36.249.48.47:9999'},
        # {'ip_port':'http://175.44.109.205:9999'},
        # {'ip_port':'http://58.22.177.224:9999'},
        # {'ip_port':'http://60.169.133.225:9999'},
        # {'ip_port':'http://1.198.42.177:9999'},
        # {'ip_port':'http://49.70.94.154:9999'},
        # {'ip_port':'http://42.238.87.240:9999'},
    ]
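As an aside, it helps to see what a proxy entry's credentials turn into on the wire before wiring the pools into the spider. A minimal standalone sketch, using a placeholder user:pass pair since the real values are redacted above:

    import base64

    # placeholder credentials; the real '***:***' values are redacted above
    user_passwd = 'user:pass'
    token = base64.b64encode(user_passwd.encode()).decode()
    # note the single space after 'Basic' -- the header is malformed without it
    print('Proxy-Authorization:', 'Basic ' + token)
    # -> Proxy-Authorization: Basic dXNlcjpwYXNz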
  3. Below is how the IP and User-Agent pools are consumed:
  4. import base64
    import random
    import scrapy
    from scrapy_splash import SplashRequest
    from BL.settings import USER_AGENT_LIST, PROXY_LIST

    class BlSpider(scrapy.Spider):
        name = 'bl'
        allowed_domains = ['target-site.com']   # actual domain redacted
        # entry point: the home page
        start_urls = ['https://target-site.com/']

        def start_requests(self):
            # for url in self.start_urls:
            # pick a random proxy for this request (USER_AGENT_LIST is imported
            # but not used here -- see the sketch after this block)
            proxy = random.choice(PROXY_LIST)
            print('Using proxy:', proxy)
            b64_up = base64.b64encode(proxy['user_passwd'].encode())
            # build the auth credential; note the single space after 'Basic'
            proxy_authorization = 'Basic ' + b64_up.decode()

            # hand the request to Splash, authenticating against the proxy
            yield SplashRequest(self.start_urls[0],
                                callback=self.parse_splash,
                                args={'wait': 5},  # seconds Splash waits for the page to render
                                endpoint='render.html',  # Splash endpoint returning the rendered HTML
                                dont_process_response=False,
                                splash_headers={
                                    "referer": 'https://www.bilibili.com/v/technology/science#/',
                                    "Proxy-Authorization": proxy_authorization
                                },
                                dont_send_headers=False,
                                magic_response=True,
                                session_id='default',
                                http_status_from_error_code=True,
                                cache_args=None,
                                meta=None,
                                )
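One loose end: USER_AGENT_LIST is imported above but never applied. Splash's render.html endpoint accepts a headers argument that it applies to the outgoing request, so the pool can be rotated without writing any Lua. A minimal sketch under that assumption (the helper name make_splash_request is mine, not from the project):

    import random
    from scrapy_splash import SplashRequest
    from BL.settings import USER_AGENT_LIST

    # hypothetical helper -- not part of the original project code
    def make_splash_request(url, callback, proxy_authorization):
        user_agent = random.choice(USER_AGENT_LIST)
        return SplashRequest(url,
                             callback=callback,
                             endpoint='render.html',
                             args={'wait': 5,
                                   # render.html's 'headers' argument sets the HTTP
                                   # headers of the request Splash sends to the site,
                                   # so each call can carry a different User-Agent
                                   'headers': {'User-Agent': user_agent}},
                             splash_headers={'Proxy-Authorization': proxy_authorization})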
    
  5. The program now runs normally. The principle is roughly this: each time a request is issued, switch to a different IP, which gets around the site's anti-scraping measures. (The Splash service runs in Docker from the scrapinghub/splash image.)
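For completeness, that container is started the standard way per the Splash docs (port 8050 matches SPLASH_URL in the settings above):

    docker pull scrapinghub/splash
    docker run -p 8050:8050 scrapinghub/splash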