First, define the goal: I am scraping the first 10 pages of python job listings.
Next, locate the data we need in the page and find the request URL behind it:
Then, start by scraping a single page (set up the request headers and fetch the data):
import json
import requests
import pprint
import time

url = 'https://www.lagou.com/jobs/positionAjax.json'
params = 'px=default&needAddtionalResult=false'
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}
headers = {
    'authority': 'www.lagou.com',
    'origin': 'https://www.lagou.com',
    'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'user-agent': 'your user-agent'
}

def get_cookie():
    # Visit the search page first: Lagou only accepts the Ajax request
    # with cookies from a fresh session.
    u = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
    session = requests.session()
    rst = session.get(url=u, headers=headers)
    cookies = rst.cookies.get_dict()
    return cookies

response = requests.post(url=url,
                         params=params,
                         data=data,
                         headers=headers,
                         cookies=get_cookie())
mes_dict = json.loads(response.text)
result = mes_dict['content']['positionResult']['result']
# pprint.pprint(result)
with open("拉勾职位信息.csv", "a+", encoding='utf-8-sig') as f:
    for one in result:
        one_msg = f"{one['city']}, {one['companyFullName']}, {one['companySize']}, {one['education']}, {one['positionName']}, {one['salary']}, {one['workYear']}"
        # print(one_msg)
        f.write(one_msg + '\n')
Got it! The next step is to fetch all ten pages:
Comparing the URLs of pages 1, 2, and 3 shows they do not change, so the pagination must happen elsewhere. Inspecting the request reveals a pn field in the form data, and its value is exactly the page number.
With that, we can modify the code:
import json
import requests
import pprint
import time

url = 'https://www.lagou.com/jobs/positionAjax.json'
params = 'px=default&needAddtionalResult=false'
headers = {
    'authority': 'www.lagou.com',
    'origin': 'https://www.lagou.com',
    'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'user-agent': 'your user-agent'
}

def get_data(page):
    if page == 1:
        return {
            'first': 'true',
            'pn': f'{page}',
            'kd': 'python',
        }
    else:
        return {
            'first': 'true',
            'pn': f'{page}',
            'kd': 'python',
            'sid': 'the sid you copied'  # each request returns a new token; it can be fetched from the response, but I had not realized that at the time, so just copy it from the request payload (to be fixed in a later revision)
        }

def get_cookie():
    u = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
    session = requests.session()
    rst = session.get(url=u, headers=headers)
    cookies = rst.cookies.get_dict()
    return cookies

def get_mes(page):
    response = requests.post(url=url,
                             params=params,
                             data=get_data(page),
                             headers=headers,
                             cookies=get_cookie())
    mes_dict = json.loads(response.text)
    result = mes_dict['content']['positionResult']['result']
    # pprint.pprint(result)
    with open("拉勾职位信息.csv", "a", encoding='utf-8-sig') as f:
        for one in result:
            one_page_msg = f"{one['city']}, {one['companyFullName']}, {one['companySize']}, {one['education']}, {one['positionName']}, {one['salary']}, {one['workYear']}"
            f.write(one_page_msg + '\n')
            print(one_page_msg)

if __name__ == '__main__':
    for i in range(1, 11):
        get_mes(i)
        time.sleep(5)
Done!
If you have any questions, leave a comment below.
Oops — this code actually has a problem. I stopped it before it finished running and assumed it worked, which is awkward. The cookie has to be re-fetched after every 10 requests, so we need a check: if i % 10 == 0, get a new cookie. As for sid, it can be read from content -> showId in the JSON returned by each post request. With those two fixes the code is truly complete!
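A sketch of that fix, with the network calls left as comments since they are unchanged from the script above (the helper names needs_cookie_refresh and extract_sid are mine, not from the original post):

```python
# Sketch of the two fixes described above: re-fetch the cookie whenever
# the page counter hits a multiple of 10, and take sid from
# content -> showId of the previous response instead of hard-coding it.

def needs_cookie_refresh(page):
    # The rule from the post: refresh when i % 10 == 0.
    return page % 10 == 0

def extract_sid(mes_dict):
    # sid comes back in the response JSON under content -> showId.
    return mes_dict.get('content', {}).get('showId', '')

def crawl(pages=10):
    sid = ''
    # cookies = get_cookie()
    for i in range(1, pages + 1):
        if needs_cookie_refresh(i):
            pass  # cookies = get_cookie()
        # response = requests.post(url, params=params,
        #                          data={**get_data(i), 'sid': sid},
        #                          headers=headers, cookies=cookies)
        # sid = extract_sid(response.json())
        # time.sleep(5)
```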