Scraping Lianjia Data with Scrapy (Part 4): Using Dynamic Proxies to Bypass Anti-Crawling Measures

Preface

        In the previous article, we scraped the full details of every community that Lianjia lists as built within the last twenty years. During the crawl, however, we trip Lianjia's anti-crawling mechanism, which redirects the spider's entry page to a human-verification (CAPTCHA) page. When the verification is triggered, the console logs the redirect to the verification URL, and opening that link in a browser shows Lianjia's CAPTCHA page (screenshots omitted).

This article walks through how to solve the problem above with dynamic proxy IPs.

1. Setting the User-Agent

        The User-Agent is part of the HTTP request headers. It carries browser and operating-system information and identifies the client making the request, so a server can use it to detect and block crawlers. To see your own browser's User-Agent, open the developer tools, switch to the Network tab, and look at the User-Agent field under Headers. It generally follows the pattern Mozilla/5.0 (platform) engine/version browser/version; Chrome's User-Agent, for example, records its WebKit lineage and version details.
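For reference, a typical desktop Chrome User-Agent looks roughly like this (the exact version numbers vary from machine to machine):

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36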

        For a crawler, overriding the User-Agent in the request headers is common practice. Looking at the Scrapy source code, we can see that the default User-Agent is simply "Scrapy":

class UserAgentMiddleware:
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent="Scrapy"):
        self.user_agent = user_agent

This default is a far cry from the User-Agent a real browser produces, so the server can easily recognize and reject such requests, and we need to change the default header. In Scrapy there are two common ways to do this. (1) Configure it in settings.py, which is the approach taken in this example; add the following anywhere in settings.py:

# User-Agent pool
import random
USER_AGENT_LIST = [
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
    'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
]
USER_AGENT = random.choice(USER_AGENT_LIST)

The logic here is easy to follow: import the random package, collect User-Agent strings of mainstream browsers from the web, and let random.choice pick one of them. Note that because USER_AGENT is an ordinary setting, the value is chosen once when Scrapy loads settings.py and then stays fixed for the whole crawl; true per-request rotation needs the middleware approach described below. There are many variations on this idea, such as replacing the hand-maintained list with a third-party package that generates random User-Agents.
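As a minimal sketch of that variation, assuming the third-party fake-useragent package is installed (pip install fake-useragent), settings.py could generate the value instead of maintaining a list:

# User-Agent generated by the fake-useragent package instead of a hand-written list
from fake_useragent import UserAgent
USER_AGENT = UserAgent().random  # still chosen only once, when the settings are loaded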

(2) The second approach is to move the User-Agent logic into a downloader middleware. If you choose this route, add the following to middlewares.py:

import random
class UserAgentMiddleware(object):
    def __init__(self):
        self.user_agent_list = [
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
    'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        ]

    def process_request(self, request, spider):
        # Pick a fresh User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

Then enable the middleware by adding the following to settings.py:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.UserAgentMiddleware': 543,
}
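To confirm the rotation is actually applied, one simple check (illustrative only, not part of the project code) is to log the header Scrapy sent from inside any spider callback:

    def parse(self, response):
        # The User-Agent Scrapy actually attached to the request for this response
        sent_ua = response.request.headers.get('User-Agent')
        self.logger.info("User-Agent used: %s", sent_ua)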

2. Adding a Dynamic Proxy IP Mechanism

        Fixing the request headers only makes our requests look like ordinary HTTP traffic; it does nothing to hide our real identity, namely the client's IP address. When the server receives a large number of requests from the same client in a short time, it refuses to serve them and redirects us to the human-verification page. To keep "changing" the client IP, we introduce a dynamic proxy mechanism, placing a layer of forward proxies between the machine running the crawler and Lianjia's servers. First, add the following class to middlewares.py:

import random
import threading
import time

import requests
from apscheduler.schedulers.blocking import BlockingScheduler

# lian_utils is a project-level helper that collects URLs of bad responses;
# adjust the import path to wherever it lives in your project.
from lianjia import lian_utils


class LianjiaProxyMiddleware(object):

    def __init__(self):
        # Fetch an initial proxy pool, then keep refreshing it from a background thread
        self.proxies = self.get_proxy()

        # daemon=True so the refresh thread does not keep the process alive after the crawl ends
        threading.Thread(target=self.cron_update_proxies, daemon=True).start()

    def process_request(self, request, spider):
        print("entering LianjiaProxyMiddleware")
        if len(self.proxies) == 0:
            # Pool is empty (exhausted or invalidated); fetch a new batch before proceeding
            self.update_proxies()
            time.sleep(1)
        proxy = random.choice(self.proxies).strip()
        request.meta['proxy'] = "https://" + proxy

        # Do not follow redirects automatically: a 302 to the verification page
        # is handled in process_response instead
        request.meta['dont_redirect'] = True
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either return a Response object, return a Request object,
        # or raise IgnoreRequest.
        print(response.status, '*****middleware-return*****')
        if not response.text:
            print("wrong response")
            lian_utils.lian_exception_urls.append(response.url)
        if response.status == 302:
            # Redirected to the human-verification page: the current proxies are burned,
            # so drop the pool and retry this request through a fresh proxy
            self.proxies = []
            return request
        return response

    def process_exception(self, request, exception, spider):
        if str(exception).find('407') != -1:
            # 407 Proxy Authentication Required: the current pool is no longer usable, drop it
            self.proxies = []
        # In either case retry the request; process_request will pick a new proxy
        return request

    def get_proxy(self):
        # Ask the (paid) proxy provider's API for a batch of proxy servers
        response = requests.get('https://xxxx.xxxx.xxxx.xxxx/get?key=DAB0748D&num=1')
        if response.status_code == 200:
            res_data = response.json()
            if res_data['code'] == "SUCCESS":
                return [p['server'] for p in res_data['data']]
        # Any failure yields an empty pool so that len(self.proxies) remains valid
        return []

    def update_proxies(self):
        # Replace the whole pool with a freshly fetched batch
        self.proxies = self.get_proxy()

    def cron_update_proxies(self):
        # Runs in its own thread: refresh the proxy pool every 30 seconds
        sched = BlockingScheduler()
        sched.add_job(self.update_proxies, 'interval', seconds=30)
        sched.start()

The core idea of this class is: (1) obtain proxy IPs, (2) refresh them periodically, and (3) refresh them immediately if they go bad within a cycle. Concretely: cron_update_proxies is the periodic job that runs update_proxies every 30 seconds; update_proxies replaces the whole proxy pool; get_proxy fetches proxies from the provider; and the remaining logic goes into the standard methods a downloader middleware must implement. Many free proxies are unreliable these days, so this example uses a paid proxy service; adapt get_proxy to whatever provider you actually use.
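For clarity, get_proxy assumes the provider returns JSON shaped roughly like the sketch below (field names taken from the code above; all values are made up), and a proxy pulled from it can be smoke-tested with requests before wiring up Scrapy:

# Assumed shape of the provider's response (illustrative values only):
# {"code": "SUCCESS", "data": [{"server": "1.2.3.4:8080"}, {"server": "5.6.7.8:8080"}]}
import requests

proxy = "1.2.3.4:8080"  # one hypothetical 'server' entry returned by get_proxy()
resp = requests.get("https://httpbin.org/ip",
                    proxies={"https": "https://" + proxy},
                    timeout=10)
print(resp.status_code, resp.text)  # the echoed origin IP should be the proxy's, not yours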

Next, enable the middleware in settings.py:

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.LianjiaProxyMiddleware': 100,
    'lianjia.middlewares.UserAgentMiddleware': 543,
}
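Once both middlewares are enabled, an end-to-end way to watch the rotation work is to point a throwaway spider (illustrative only, not part of the project) at an IP-echo endpoint such as httpbin.org/ip and confirm the reported address changes as the pool refreshes:

import scrapy

class IpCheckSpider(scrapy.Spider):
    # Throwaway spider: each response should report the proxy's exit IP, not yours
    name = "ip_check"

    def start_requests(self):
        # dont_filter=True so the duplicate-URL filter does not collapse these requests
        for _ in range(5):
            yield scrapy.Request("https://httpbin.org/ip", dont_filter=True)

    def parse(self, response):
        self.logger.info("exit IP seen by the server: %s", response.text)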

3. Summary

        With the two steps above we (1) construct "normal-looking" HTTP requests and (2) change our "disguise" every 30 seconds, which is enough to get around Lianjia's anti-crawling mechanism and fetch the complete community data. In the next article, we will write the scraped data into a MySQL database.

