Solved: scraping Lagou returns {'status': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '117.136.41.XX', 'state': 2402}

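When I was scraping job listings from Lagou (拉勾网), every request kept coming back with this (the message means "You are operating too frequently, please try again later"):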

{'status': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '117.136.41.41', 'state': 2402}


At the time I was passing the header information through headers:

headers = {
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    # agent is a list of User-Agent strings; pick one string at random
    'User-Agent': random.choice(agent),
}

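In my experience, roughly 80% of sites can be scraped once Referer and User-Agent are sent in headers, but not Lagou. Its anti-scraping measures are the strongest I have run into so far. I also tried sending the full set of browser headers, and that failed too. I then tried a plain GET request with a hand-built URL: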


import requests
response = requests.get("https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=%e5%b9%bf%e5%b7%9e&kd=python&pn=1")
print(response.content.decode("utf-8"))

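The response showed why my scraping had been failing (the page says that my IP address shows abnormal behaviour and that I must log in to continue, followed by a "Log in now" link):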

<div class="tip">您的IP地址: <span id="ip"></span> 存在异常行为,需要登录后才能继续访问</div>
    <p class="qq"><a href="https://passport.lagou.com/login/login.html">立即登录</a></p>
 


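The cause was that I was not sending the Cookie of a logged-in session; sending a Cookie from a session that is not logged in fails as well. The underlying problem is that I had tried scraping Lagou before, probably without the Referer header at that time, so my IP had already been flagged as a crawler.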
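The two failure modes described so far (the "too frequent" JSON payload and the HTML page asking for a login) can be told apart before calling response.json(). The following is only a minimal sketch: the function name classify_response and its return labels are made up for illustration, and it assumes the raw body is JSON with a false status plus a msg field, or HTML containing the passport.lagou.com login link, as in the outputs shown above.

```python
import json


def classify_response(body: str) -> str:
    """Roughly classify a response body from the job-list endpoint.

    Returns 'rate_limited', 'login_required', 'ok' or 'unknown'.
    The labels are illustrative, not anything defined by Lagou.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all: most likely the HTML block page.
        if "passport.lagou.com/login" in body:
            return "login_required"
        return "unknown"
    # The blocked payload shown in this post has status False and a msg field.
    if isinstance(payload, dict) and payload.get("status") is False and "msg" in payload:
        return "rate_limited"
    return "ok"
```

With requests this would be called as classify_response(response.text). Note that the dict printed earlier in this post is the Python form of the payload; the raw body uses JSON's lowercase false.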

Solution:

I logged in to my account on Lagou, copied the Cookie of the logged-in session into headers, and after that the request returned all the data.
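Instead of pasting the whole Cookie string into headers, the copied value can also be split into a dict and handed to requests through its cookies= argument. This is a small sketch of that idea; the helper name cookie_header_to_dict is hypothetical, and it assumes the string was copied verbatim from the browser's Cookie request header (name=value pairs separated by semicolons).

```python
def cookie_header_to_dict(cookie_header: str) -> dict:
    """Split a copied 'Cookie:' header value into a {name: value} dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        # partition splits at the first '=', so values containing '=' survive intact
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The result can be passed as requests.post(url, headers=headers, cookies=cookies, data=data), which keeps the headers dict short and makes it easy to inspect single values such as login.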

Full code:

import requests
from urllib import parse
import random


def request_list_page():
    url = "https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false"
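    # requests picks the proxy by URL scheme; the target URL is https,
    # so the proxy has to be listed under "https" as well as "http".
    proxies = {
        "http": "http://117.172.147.171:38292",
        "https": "http://117.172.147.171:38292",
    }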
    agent = [
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    ]
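    # random.choice returns a single string; str(random.sample(agent, 1))
    # would send the list's repr, "['Mozilla/...']", as the User-Agent.
    agents = random.choice(agent)
    headers = {
        'User-Agent': agents,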
        'Cookie':"JSESSIONID=ABAAABAAAFCAAEG459D425D48B79B35BB1CE934F54FEB8E; _ga=GA1.2.1431205688.1552005232; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1552005232; _gid=GA1.2.414650395.1552005232; user_trace_token=20190308083352-d8f70a7a-4139-11e9-8b95-5254005c3644; LGUID=20190308083352-d8f70d62-4139-11e9-8b95-5254005c3644; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=search_code; SEARCH_ID=f056e89dacd24d4388205dc253e20463; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221695aefe80a331-020e4f519c2e5b-3e740b5c-1049088-1695aefe80b244%22%2C%22%24device_id%22%3A%221695aefe80a331-020e4f519c2e5b-3e740b5c-1049088-1695aefe80b244%22%7D; sajssdk_2015_cross_new_user=1; LGSID=20190308093425-4eb9bed7-4142-11e9-8b95-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%25E5%25AE%259E%25E4%25B9%25A0%3Foquery%3Dpython%25E5%2590%258E%25E7%25AB%25AF%26fromSearch%3Dtrue%26labelWords%3Drelative%26city%3D%25E5%25B9%25BF%25E5%25B7%259E; PRE_LAND=https%3A%2F%2Fpassport.lagou.com%2Flogin%2Flogin.html%3Fsignature%3DA2AB0B5330C0EE3DAA9F01CB50A1C810%26service%3Dhttps%25253A%25252F%25252Fwww.lagou.com%25252F%26action%3Dlogin%26serviceId%3Dlagou%26ts%3D1552008864183; _putrc=6C1E90D537EEF5A5123F89F2B170EADC; login=true; unick=%E9%82%B5%E6%97%AD%E8%BE%89; gate_login_token=358559283927fe9f6f17f4b930fc6d18757a8c56796bc0aba3baab9bbcd6bc15; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1552008921; LGRID=20190308093520-6f75a36e-4142-11e9-a922-525400f775ce"
    }
    data = parse.urlencode([
        ('first', 'true'),
        ('pn', 1),
        ('kd', 'python')
    ])
    response = requests.post(url, headers=headers, data=data, proxies=proxies)
    print(response.json())


def main():
    request_list_page()


if __name__ == "__main__":
    main()
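The percent-encoded city value in the URL (%E5%B9%BF%E5%B7%9E) is simply 广州 (Guangzhou) encoded as UTF-8, so the query string and the form body do not have to be written by hand. The sketch below rebuilds both with urllib.parse.urlencode; the helper names build_list_url and build_form_data are invented for illustration, and the parameter names are the ones used in the code above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.lagou.com/jobs/positionAjax.json"


def build_list_url(city: str) -> str:
    """Build the job-list URL; urlencode percent-encodes the city as UTF-8."""
    query = urlencode({"city": city, "needAddtionalResult": "false"})
    return f"{BASE_URL}?{query}"


def build_form_data(keyword: str, page: int) -> str:
    """Build the POST body with the same three fields as the full code.

    'first' is kept at 'true' exactly as in the post; what other values
    do is not covered here.
    """
    return urlencode([("first", "true"), ("pn", page), ("kd", keyword)])
```

Calling build_list_url("广州") reproduces the URL used in request_list_page, and changing the page argument of build_form_data is the natural place to start when fetching further result pages.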

Thoughts:

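Lagou's anti-scraping really is strong. Even with a proxy IP configured, which worked when I checked it against httpbin.org/ip, the response from this site still reported my original address, 'clientIp': '117.136.41.41', and that was with a high-anonymity proxy. I have to admire Lagou for that. I am not sure whether my proxy setup is at fault; if anyone knows, please tell me. I spent a long time on this site and finally got it working today, which made me rather happy.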
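One possible explanation for the unchanged clientIp, offered as a guess rather than a confirmed diagnosis: requests chooses a proxy by the scheme of the URL being fetched, so a proxy registered only under "http" is not used for an https:// address, and the request goes out directly. The sketch below imitates that lookup in a simplified form (scheme key first, then "all"); pick_proxy and proxies_for_both_schemes are hypothetical helpers, not part of requests.

```python
from urllib.parse import urlparse


def pick_proxy(url: str, proxies: dict):
    """Simplified imitation of scheme-based proxy lookup.

    Returns the proxy registered for the URL's scheme, falling back to
    an 'all' entry, or None when the request would bypass the proxy.
    """
    scheme = urlparse(url).scheme
    return proxies.get(scheme, proxies.get("all"))


def proxies_for_both_schemes(address: str) -> dict:
    """Register one proxy address for both http and https targets."""
    proxy_url = address if "://" in address else "http://" + address
    return {"http": proxy_url, "https": proxy_url}
```

With only an "http" entry, pick_proxy returns None for the https://www.lagou.com URL but returns the proxy for a plain http:// test address, which would match a proxy that appears to work in a quick check and yet is skipped for the real request.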


 

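Finally, for a site like Lagou whose content is loaded via Ajax, working out the request pattern behind the JSON data takes some effort. An alternative is to scrape it with selenium + chromedriver, which is somewhat simpler.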
