When job hunting, I often use Lagou (lagou.com). Searching for a keyword such as "自动化测试工程师" (automation test engineer) returns a large number of job postings. So how can we fetch these postings in bulk and analyze them as a whole? With bulk data in hand, we could analyze, for example, the highest and lowest salaries on offer, and the skills that automation-testing roles require candidates to have. Data like that, once analyzed, is valuable reference material. Tonight let's implement the first part: search Lagou for the keyword and retrieve the job listings for "自动化测试工程师", including the postings on each page and the total number of pages. When you page through the search results, the URL in Lagou's address bar does not change; instead, the site fetches data through Ajax requests, which is how this kind of dynamic site delivers its data. Watching the request sent on each page turn reveals the following:
Request URL:
https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false
Request headers:
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: _ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAADEAAFI097BA2BE39D3B0D0BEA1C82AE832AF02; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539785285,1540794503,1540819902,1540905505; _gat=1; LGSID=20181030211826-48e29064-dc46-11e8-8467-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; SEARCH_ID=389112e1ab2640b098233a552d502745; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540905515; LGRID=20181030211836-4eaa43a5-dc46-11e8-b7be-525400f775ce
DNT: 1
Host: www.lagou.com
Origin: https://www.lagou.com
Referer: https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput=
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
Request parameters:
Request method: POST
From the information above we can see that the request method is POST, and that among the request parameters pn is the page number and kd is the search keyword. Let's first fetch the job-listing data for a single page. The source code is:
def getHeaders():
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Cookie': '_ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAAGFABEF93F47251563A52306423D37E945D2C54; _gat=1; LGSID=20181029213144-fa3c8e13-db7e-11e8-b51c-525400f775ce; PRE_UTM=; PRE_HOST=www.bing.com; PRE_SITE=https%3A%2F%2Fwww.bing.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539529521,1539785285,1540794503,1540819902; SEARCH_ID=ae3ae41a58d94802a68e848d36c30711; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540819909; LGRID=20181029213151-fe7324dc-db7e-11e8-b51c-525400f775ce',
        'Referer': 'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95'}
    return headers

def laGou(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false', page=2):
    positions = []
    r = requests.post(
        url=url,
        headers=getHeaders(),
        data={'first': False, 'pn': page, 'kd': '自动化测试工程师'})
    # Parse the JSON body once instead of calling r.json() for every field
    results = r.json()['content']['positionResult']['result']
    # Iterate over the actual results rather than a hard-coded 15,
    # since the last page may contain fewer postings
    for item in results:
        position = {
            '公司名称': item['companyFullName'],
            '城市': item['city'],
            '学历': item['education'],
            '工作年限': item['workYear'],
            '薪资': item['salary'],
            '工作标签': item['positionLables'],  # 'positionLables' is the field name the API returns
            '福利': item['positionAdvantage']
        }
        positions.append(position)
    for item in positions:
        print(item)
Note: in the code above, the page parameter is the page number and can be set to whatever value we like. Calling laGou() prints the scraped job information (company, salary, and so on); see the screenshot of the data printed after calling laGou():
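One caveat before looping over pages: when Lagou decides requests are coming too fast, the Ajax endpoint returns an error payload instead of job data, and indexing into 'content' then raises a KeyError. A small defensive helper can guard against that; the helper and the assumed error shape are my own addition, not part of the original code:

```python
def extract_results(payload):
    """Return the job list from a Lagou-style Ajax response dict.

    Falls back to [] when the response lacks the expected 'content'
    key -- e.g. a throttling/error reply (shape assumed here).
    """
    content = payload.get('content') or {}
    return content.get('positionResult', {}).get('result', [])
```

With this, laGou() could call extract_results(r.json()) and simply skip a page when the list comes back empty instead of crashing mid-crawl.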
The above fetches the job data for a single page. Next, let's fetch the data for every results page of the keyword search: searching for "自动化测试工程师" yields 30 pages, as shown below:
To do that, we call the laGou() function once per page, passing a different value to its page parameter each time. See the code:
for item in range(1, 31):
    laGou(page=item)
The code above is quite straightforward. Here is the complete source:
#!/usr/bin/env python
#coding:utf-8
#Author:WuYa
import csv
import requests
def getHeaders():
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Cookie': '_ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAAGFABEF93F47251563A52306423D37E945D2C54; _gat=1; LGSID=20181029213144-fa3c8e13-db7e-11e8-b51c-525400f775ce; PRE_UTM=; PRE_HOST=www.bing.com; PRE_SITE=https%3A%2F%2Fwww.bing.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539529521,1539785285,1540794503,1540819902; SEARCH_ID=ae3ae41a58d94802a68e848d36c30711; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540819909; LGRID=20181029213151-fe7324dc-db7e-11e8-b51c-525400f775ce',
        'Referer': 'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95'}
    return headers

def laGou(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false', page=2):
    positions = []
    r = requests.post(
        url=url,
        headers=getHeaders(),
        data={'first': False, 'pn': page, 'kd': '自动化测试工程师'})
    # Parse the JSON body once instead of calling r.json() for every field
    results = r.json()['content']['positionResult']['result']
    # Iterate over the actual results rather than a hard-coded 15,
    # since the last page may contain fewer postings
    for item in results:
        position = {
            '公司名称': item['companyFullName'],
            '城市': item['city'],
            '学历': item['education'],
            '工作年限': item['workYear'],
            '薪资': item['salary'],
            '工作标签': item['positionLables'],  # 'positionLables' is the field name the API returns
            '福利': item['positionAdvantage']
        }
        positions.append(position)
    for item in positions:
        print(item)

if __name__ == '__main__':
    for item in range(1, 31):
        laGou(page=item)
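One refinement worth considering for the loop above: firing 30 POST requests back to back is exactly the burst pattern that tends to trigger a site's anti-scraping throttling. A common precaution (my own suggestion, not part of the original script) is to pause a random interval between pages. The sketch below takes the page fetcher as a callable so it stays generic; in this post that would be the laGou function:

```python
import random
import time

def crawl_pages(fetch, total_pages=30, min_delay=1.0, max_delay=3.0):
    # Call fetch(page) for every page, sleeping a random interval in
    # between so the requests look less like a burst from a script.
    for page in range(1, total_pages + 1):
        fetch(page)
        if page < total_pages:
            time.sleep(random.uniform(min_delay, max_delay))
```

Usage with the function from this post: crawl_pages(lambda p: laGou(page=p)).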
As shown above, with the Requests library we easily fetched the job postings for a given search keyword on Lagou. Of course, there is still much more to do.
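The script imports csv but never actually uses it. A natural piece of that remaining work is persisting the scraped dictionaries to a CSV file so they can be analyzed later. A minimal sketch, where the helper name and file name are my own and the field names follow the dictionaries built above:

```python
import csv

def save_positions(positions, path='lagou_positions.csv'):
    # Persist a list of job dicts to CSV; the header row is taken
    # from the keys of the first dict.
    if not positions:
        return
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        # utf-8-sig writes a BOM so Excel renders the Chinese headers correctly
        writer = csv.DictWriter(f, fieldnames=list(positions[0].keys()))
        writer.writeheader()
        writer.writerows(positions)
```

laGou() would then return its positions list instead of printing it, and the main loop would collect every page's results and call save_positions() once at the end.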