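Earlier, whenever I tried to scrape job listings from Lagou (拉勾网), the response was always this (the msg reads "You are operating too frequently, please try again later"):
{'status': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '117.136.41.41', 'state': 2402}
At the time I was sending the header information through headers: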
headers = {
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'User-Agent': str(agents),
}
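For roughly 80% of sites, passing Referer and User-Agent in headers is enough to scrape them, but not Lagou. Its anti-scraping measures are the strongest I have seen so far. I also tried sending the browser's complete set of headers, and that failed too. I then tried a plain GET request with the query built into the URL: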
import requests
response = requests.get("https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=%e5%b9%bf%e5%b7%9e&kd=python&pn=1")
print(response.content.decode("utf-8"))
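The query string in that URL was percent-encoded by hand. A minimal sketch of building the same URL with the standard library, so the city can be written as readable text. `build_list_url` is a hypothetical helper name, and the parameter names are simply the ones that appear in the URL above:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.lagou.com/jobs/positionAjax.json"


def build_list_url(city, keyword, page):
    """Return the Ajax URL with city, keyword and page percent-encoded."""
    params = {
        "needAddtionalResult": "false",  # spelled as in the URL used in this post
        "city": city,
        "kd": keyword,
        "pn": page,
    }
    # urlencode encodes non-ASCII text as UTF-8 percent escapes.
    return BASE_URL + "?" + urlencode(params)


url = build_list_url("广州", "python", 1)
print(url)
```

Passing the same dict as `params=` to `requests.get(BASE_URL, params=params)` should produce an equivalent request without assembling the string yourself.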
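The returned page showed why my scraping had been failing (the tip says "Your IP address shows abnormal behavior; you need to log in before continuing", followed by a "Log in now" link):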
<div class="tip">您的IP地址: <span id="ip"></span> 存在异常行为,需要登录后才能继续访问</div>
<p class="qq"><a href="https://passport.lagou.com/login/login.html">立即登录</a></p>
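The cause was that I was not sending the Cookie of a logged-in session; sending the Cookie of a session that is not logged in fails as well. The underlying problem is that I had tried to scrape Lagou before, probably without a Referer, so my IP had been flagged as a crawler.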
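Both failure modes seen so far can be recognised in code before the data is used. A rough sketch, assuming the two response shapes shown in this post are the ones that matter. `classify_response` is a hypothetical helper, and the sample bodies are shortened copies of the responses above:

```python
import json


def classify_response(text):
    """Label a response body as 'ok', 'rate_limited', 'login_required' or 'unknown'."""
    try:
        payload = json.loads(text)
    except ValueError:
        # Not JSON, so the site answered with an HTML page instead.
        if "passport.lagou.com/login" in text:
            return "login_required"
        return "unknown"
    # The refusal seen earlier is JSON with 'status' set to false.
    if isinstance(payload, dict) and payload.get("status") is False:
        return "rate_limited"
    return "ok"


blocked_json = '{"status": false, "msg": "您操作太频繁,请稍后再访问", "state": 2402}'
login_html = '<p class="qq"><a href="https://passport.lagou.com/login/login.html">立即登录</a></p>'

print(classify_response(blocked_json))  # rate_limited
print(classify_response(login_html))    # login_required
```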
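Solution:
I logged in to my Lagou account in the browser, took the Cookie of the logged-in session, and copied it into headers. After that, the request finally returned all the data.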
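Instead of pasting the whole Cookie string into headers, it can also be split into a dict and handed to requests through its `cookies=` argument. A minimal sketch: `parse_cookie_header` is a hypothetical helper, and the cookie values here are shortened placeholders, not real credentials:

```python
def parse_cookie_header(raw):
    """Split a browser 'Cookie:' header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


raw_cookie = "JSESSIONID=PLACEHOLDER; login=true; PRE_UTM=; hasDeliver=0"
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

The resulting dict can then be passed along with the request, for example `requests.post(url, cookies=cookies, ...)`.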
Full code:
import random
from urllib import parse

import requests


def request_list_page():
    url = "https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false"
    # Note: requests applies this entry to http:// URLs only.
    proxies = {
        "http": "117.172.147.171:38292"
    }
    # Pool of User-Agent strings; one is picked at random for each run.
    agent = [
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    ]
    # random.choice returns a single string; random.sample(agent, 1) returns a
    # one-element list, and str() of that list is not a valid User-Agent value.
    agents = random.choice(agent)
    headers = {
        'User-Agent': agents,
        # Cookie copied from the browser after logging in.
        'Cookie': "JSESSIONID=ABAAABAAAFCAAEG459D425D48B79B35BB1CE934F54FEB8E; _ga=GA1.2.1431205688.1552005232; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1552005232; _gid=GA1.2.414650395.1552005232; user_trace_token=20190308083352-d8f70a7a-4139-11e9-8b95-5254005c3644; LGUID=20190308083352-d8f70d62-4139-11e9-8b95-5254005c3644; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=search_code; SEARCH_ID=f056e89dacd24d4388205dc253e20463; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221695aefe80a331-020e4f519c2e5b-3e740b5c-1049088-1695aefe80b244%22%2C%22%24device_id%22%3A%221695aefe80a331-020e4f519c2e5b-3e740b5c-1049088-1695aefe80b244%22%7D; sajssdk_2015_cross_new_user=1; LGSID=20190308093425-4eb9bed7-4142-11e9-8b95-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%25E5%25AE%259E%25E4%25B9%25A0%3Foquery%3Dpython%25E5%2590%258E%25E7%25AB%25AF%26fromSearch%3Dtrue%26labelWords%3Drelative%26city%3D%25E5%25B9%25BF%25E5%25B7%259E; PRE_LAND=https%3A%2F%2Fpassport.lagou.com%2Flogin%2Flogin.html%3Fsignature%3DA2AB0B5330C0EE3DAA9F01CB50A1C810%26service%3Dhttps%25253A%25252F%25252Fwww.lagou.com%25252F%26action%3Dlogin%26serviceId%3Dlagou%26ts%3D1552008864183; _putrc=6C1E90D537EEF5A5123F89F2B170EADC; login=true; unick=%E9%82%B5%E6%97%AD%E8%BE%89; gate_login_token=358559283927fe9f6f17f4b930fc6d18757a8c56796bc0aba3baab9bbcd6bc15; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1552008921; LGRID=20190308093520-6f75a36e-4142-11e9-a922-525400f775ce"
    }
    # Form fields sent in the POST body: first page of results for "python".
    data = parse.urlencode([
        ('first', 'true'),
        ('pn', 1),
        ('kd', 'python')
    ])
    response = requests.post(url, headers=headers, data=data, proxies=proxies)
    print(response.json())


def main():
    request_list_page()


if __name__ == "__main__":
    main()
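To fetch more than the first page, the same POST can be repeated with a different `pn`. The sketch below only builds the form payloads. It assumes that `first` should be 'true' on page 1 and 'false' afterwards, which is a guess about the endpoint rather than something verified in this post, and `page_payloads` is a hypothetical helper:

```python
def page_payloads(keyword, pages):
    """Yield one form dict per result page, numbered from 1."""
    for page in range(1, pages + 1):
        yield {
            "first": "true" if page == 1 else "false",  # assumption, see the note above
            "pn": page,
            "kd": keyword,
        }


payloads = list(page_payloads("python", 3))
for payload in payloads:
    print(payload)
```

Each dict can be passed as `data=` to `requests.post`. Since the "operating too frequently" message suggests the site throttles rapid requests, a pause between pages, for example `time.sleep(random.uniform(2, 5))`, is probably wise.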
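Thoughts:
Lagou's anti-scraping really is strong. Even with a proxy IP configured, one that worked when I tested it against httpbin.org/ip, the clientIp returned when scraping this site was still my original address, 'clientIp': '117.136.41.41', and this was with a high-anonymity proxy. That impressed me, although I am not sure whether my proxy configuration was at fault. One likely explanation: requests picks the proxy by the scheme of the target URL, so a proxies dict that only maps "http" is not used for an https:// URL such as Lagou's, and the request leaves from the real IP. If anyone knows more, please tell me. I spent a long time on this site and finally got it working today, which made me rather happy.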
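A simplified stand-in for the proxy lookup that requests performs, written with the standard library only, to show which entry of a `proxies` dict applies to which URL. `pick_proxy` is a hypothetical name, not a requests function, and the proxy address is the one from this post:

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Return the proxy that would apply to url, or None for a direct connection."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Keys are tried from most specific to least specific.
    candidates = [
        parts.scheme + "://" + host,  # per-host entry
        parts.scheme,                 # per-scheme entry, e.g. "http" or "https"
        "all://" + host,
        "all",
    ]
    for key in candidates:
        if key in proxies:
            return proxies[key]
    return None


http_only = {"http": "117.172.147.171:38292"}
both = {
    "http": "http://117.172.147.171:38292",
    "https": "http://117.172.147.171:38292",
}
lagou = "https://www.lagou.com/jobs/positionAjax.json"

print(pick_proxy("http://httpbin.org/ip", http_only))  # 117.172.147.171:38292
print(pick_proxy(lagou, http_only))                     # None
print(pick_proxy(lagou, both))                          # http://117.172.147.171:38292
```

With a dict like `both` passed as `proxies=`, the HTTPS request should also go through the proxy, provided the proxy itself supports HTTPS tunnelling. Testing against `https://httpbin.org/ip` rather than the `http://` variant would confirm it.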
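Finally, for a site like Lagou, where the listings are loaded through Ajax, working out the pattern of the Ajax requests and the JSON they return takes some effort. Selenium with chromedriver is another way to scrape it, and a somewhat simpler one.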
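A rough sketch of the Selenium route, assuming Selenium 4 with Chrome and a matching chromedriver installed. The list-page URL pattern is taken from the Referer used earlier in this post. `list_page_url` and `fetch_rendered_html` are hypothetical helpers, and extracting the job fields from the returned HTML is left out because the page's markup is not covered here:

```python
from urllib.parse import quote


def list_page_url(keyword):
    """Build the search-results page URL for a keyword."""
    return "https://www.lagou.com/jobs/list_" + quote(keyword)


def fetch_rendered_html(url, wait_seconds=5):
    """Open url in Chrome and return the HTML after scripts have had time to run."""
    # Imported here so the URL helper above works without Selenium installed.
    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait for the Ajax content to load
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


print(list_page_url("python"))
```

Calling `fetch_rendered_html(list_page_url("python"))` would open a browser window and return the rendered page, which can then be parsed with lxml. Whether the site also challenges automated browsers is not something this sketch addresses.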