A Simple Lagou Crawler in Practice (Part 1)

Latest recommended article published 2021-11-13 11:01:10

moke黎明


Tags: web crawler, experience sharing

Article link: https://blog.csdn.net/moke926/article/details/104738210


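Many people who teach themselves web scraping start by practicing on Lagou. When I followed a tutorial, though, I ran into a problem that I suspect many others hit as well.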

{"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"112.65.12.1","state":2402}

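This message will look familiar to many of you (it says "You are operating too frequently, please try again later"). The code in the tutorial I followed was written like this: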

from urllib import request,parse
url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false'

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
}

data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}
req = request.Request(url, data=parse.urlencode(data).encode('utf-8'), headers=headers, method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))

{"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"112.65.12.1","state":2402}

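The response is still "operating too frequently". Why? Were the request headers incomplete? I went ahead and added every request header the browser sends, but it made no difference.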
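Since the server answers with a JSON body even when it refuses the request, it can help to check the body explicitly instead of eyeballing the printed text. Below is a minimal sketch of such a check. The helper name `is_rate_limited` is my own, and it assumes that a refusal always carries `"status": false` as in the sample response above, which I have not verified for every case.

```python
import json


def is_rate_limited(body: str) -> bool:
    """Return True if the body looks like the "operating too frequently" refusal.

    Assumption: a refusal is a JSON object with "status" set to false,
    as in the sample response shown in this post.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all (for example an HTML page), so not this refusal
        return False
    return isinstance(payload, dict) and payload.get("status") is False


if __name__ == "__main__":
    sample = '{"status":false,"msg":"...","clientIp":"112.65.12.1","state":2402}'
    print(is_rate_limited(sample))
```

With the urllib code above, this would be called as `is_rate_limited(resp.read().decode('utf-8'))`.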

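I then kept refreshing the Lagou page and noticed that the only request header that changed between requests was the Cookie: the Cookie value is single-use. After a good while searching online, my conclusion is that Lagou has probably added a new anti-crawler mechanism. My guess is that when the page requests data from the server through the JSON URL, the server checks the current page's Cookie. Since the Cookie differs on every request, reusing a Cookie that has already been used once gets the request flagged as a crawler, and the server returns "operating too frequently".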

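In that case, we can first have the code simulate opening the listing page to obtain a Cookie, and then use that Cookie to request the JSON data.
There are two ways to obtain the Cookie.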

Using urllib.request

from urllib import request, parse

url_data = 'https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false'
url_start = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
}
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}
# Step 1: request the listing page so that the server hands out fresh cookies
req = request.Request(url_start, headers=headers, method='POST')
resp = request.urlopen(req)
CookieStr = ""
# The response headers contain several entries whose key is "Set-Cookie"; join them all together
for key, value in resp.headers.items():
    if key == 'Set-Cookie':
        CookieStr = CookieStr + value + ";"
headers['Cookie'] = CookieStr  # add the Cookie to the request headers
# Step 2: request the JSON data with the Cookie obtained in step 1
req_data = request.Request(url_data, data=parse.urlencode(data).encode('utf-8'), headers=headers, method='POST')
resp_data = request.urlopen(req_data)
print(resp_data.read().decode('utf-8'))
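One detail about the loop above: each `Set-Cookie` value also carries attributes such as `Path` or `Expires`, and concatenating the raw values sends those attributes back as well. A slightly tidier variant keeps only the leading `name=value` pair of each entry. The sketch below is just one way to do it; the helper name `build_cookie_header` and the sample cookie names are made up for illustration and are not Lagou's real cookies.

```python
from typing import Iterable


def build_cookie_header(set_cookie_values: Iterable[str]) -> str:
    """Join the name=value part of each Set-Cookie value into one Cookie header."""
    pairs = []
    for value in set_cookie_values:
        # Everything after the first ';' is an attribute (Path, Expires, ...)
        name_value = value.split(";", 1)[0].strip()
        if "=" in name_value:
            pairs.append(name_value)
    return "; ".join(pairs)


if __name__ == "__main__":
    # Hypothetical sample values, only to show the shape of the input
    sample = ["a=1; Path=/; HttpOnly", "b=2; Max-Age=3600"]
    print(build_cookie_header(sample))
```

In the urllib code, the input could come from `resp.headers.get_all('Set-Cookie')`, which returns all values for that header as a list (or `None` if there are none).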

Using Session persistence in the requests library

A Session object in the requests library persists certain parameters across requests, and the most important of these here is the Cookie.

import requests, json

url_data = 'https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false'
url_start = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
}
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}

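# A Session stores the cookies set by each response and sends them on later requests
s = requests.Session()
# Open the listing page first so the session receives fresh cookies
s.get(url_start, headers=headers, timeout=3)
cookie = s.cookies
# Request the JSON data with the cookies obtained above
response = s.post(url_data, data=data, headers=headers, cookies=cookie, timeout=5)
response.encoding = response.apparent_encoding
json_data = json.loads(response.text)
print(json_data)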
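If the Cookie really is single-use, as I guessed above, then a request that still gets refused should be retried with a brand-new Cookie instead of the old one. Here is a rough sketch of that idea. The function `fetch_with_retries` and its parameters are hypothetical names of my own, and it again assumes that a refusal is a JSON object with `"status": false`. The caller passes in `fetch_once`, a function that does the whole "open the page, then request the JSON" sequence and returns the parsed dict.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch_once: Callable[[], dict],
    max_attempts: int = 3,
    delay_seconds: float = 2.0,
) -> Optional[dict]:
    """Call fetch_once until it returns a payload that is not a refusal.

    Returns the first accepted payload, or None if every attempt was refused.
    """
    for attempt in range(1, max_attempts + 1):
        payload = fetch_once()
        # Assumption: a refusal carries "status": false, as in the sample response
        if payload.get("status") is not False:
            return payload
        if attempt < max_attempts:
            # Wait a little before obtaining a new Cookie and trying again
            time.sleep(delay_seconds)
    return None
```

With the requests code above, `fetch_once` could create a new `requests.Session()`, call `get` on `url_start`, `post` to `url_data`, and return `response.json()`, so that every attempt starts with fresh cookies.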