Fetching Job Listing Data from Lagou

When job hunting I often use Lagou: search a keyword such as "自动化测试工程师" (automation test engineer) and a long list of postings comes back. So how do we fetch these postings in bulk and analyse them as a whole? With the full data set we could look at things like the highest and lowest salaries on offer, or the skills an automation test position is expected to cover, so the data is well worth collecting. The first step, implemented in this post, is to run a keyword search on Lagou and capture the list of automation-test-engineer postings: the entries on each page plus the total number of pages. When you turn a page in the search results, the URL shown on Lagou does not change; instead the page sends an Ajax request, which is the typical way such dynamic sites load their data. Inspecting the request fired when turning a page yields the following information:

Request URL:

https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false

Request headers:

Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: _ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAADEAAFI097BA2BE39D3B0D0BEA1C82AE832AF02; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539785285,1540794503,1540819902,1540905505; _gat=1; LGSID=20181030211826-48e29064-dc46-11e8-8467-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; SEARCH_ID=389112e1ab2640b098233a552d502745; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540905515; LGRID=20181030211836-4eaa43a5-dc46-11e8-b7be-525400f775ce
DNT: 1
Host: www.lagou.com
Origin: https://www.lagou.com
Referer: https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput=
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36

Request parameters (form data): first, pn, kd — for example first=false, pn=2, kd=自动化测试工程师 when turning to page 2

Request method: POST
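One thing worth noting before the code: the Cookie and Referer headers above are copied straight from a browser session, and in my experience Lagou rejects the Ajax request without them; a copied cookie also goes stale after a while. A common workaround is to open a requests.Session, GET the normal search page first so the session picks up fresh cookies, and only then POST the Ajax endpoint with the same session. The sketch below is hypothetical (the fetch_page helper and the LIST_URL/AJAX_URL names are mine, not part of the original script) and assumes the site still behaves this way:

import requests

LIST_URL = 'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput='
AJAX_URL = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'

def fetch_page(page, keyword='自动化测试工程师'):
   headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
      'Referer': LIST_URL}
   session = requests.Session()
   # visit the ordinary search page first so the session receives fresh cookies
   session.get(LIST_URL, headers=headers, timeout=10)
   # then POST the Ajax endpoint with the same session; the cookies are reused automatically
   r = session.post(AJAX_URL, headers=headers,
                    data={'first': False, 'pn': page, 'kd': keyword},
                    timeout=10)
   return r.json()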

From the information above we can see that the request method is POST; among the request parameters, pn is the page number and kd is the search keyword. Let's first fetch the listing data for one page. The source code is:

import requests


def getHeaders():
   headers = {
      'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
      'Cookie': '_ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAAGFABEF93F47251563A52306423D37E945D2C54; _gat=1; LGSID=20181029213144-fa3c8e13-db7e-11e8-b51c-525400f775ce; PRE_UTM=; PRE_HOST=www.bing.com; PRE_SITE=https%3A%2F%2Fwww.bing.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539529521,1539785285,1540794503,1540819902; SEARCH_ID=ae3ae41a58d94802a68e848d36c30711; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540819909; LGRID=20181029213151-fe7324dc-db7e-11e8-b51c-525400f775ce',
      'Referer': 'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95'}
   return headers


def laGou(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false', page=2):
   positions = []
   # POST the Ajax endpoint: pn is the page number, kd is the search keyword
   r = requests.post(
      url=url,
      headers=getHeaders(),
      data={'first': False, 'pn': page, 'kd': '自动化测试工程师'})
   # parse the JSON body once, then walk the 15 postings returned per page
   results = r.json()['content']['positionResult']['result']
   for i in range(15):
      city = results[i]['city']
      education = results[i]['education']
      workYear = results[i]['workYear']
      positionAdvantage = results[i]['positionAdvantage']
      salary = results[i]['salary']
      companyFullName = results[i]['companyFullName']
      positionLables = results[i]['positionLables']
      position = {
         '公司名称': companyFullName,
         '城市': city,
         '学历': education,
         '工作年限': workYear,
         '薪资': salary,
         '工作标签': positionLables,
         '福利': positionAdvantage
      }
      positions.append(position)
   for item in positions:
      print(item)

Note: in the source above, the page parameter is the page number and can be set to whatever we like. Calling laGou() prints the job information fetched for that page, such as the company and salary (see the screenshot of the printed output).
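One caveat about the function above: it hard-codes 15 entries per page, and when Lagou decides the requests are coming too fast the endpoint returns a JSON error message with no 'content' key, so the indexing raises a KeyError. Below is a hedged sketch of a more defensive variant that reuses getHeaders() from above; the laGouSafe name and the assumption about the error response shape are mine, not from the original post:

import time

import requests


def laGouSafe(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false', page=1):
   r = requests.post(
      url=url,
      headers=getHeaders(),
      data={'first': False, 'pn': page, 'kd': '自动化测试工程师'})
   body = r.json()
   if 'content' not in body:
      # likely rate-limited or blocked: no result payload in the response
      print('page %d skipped, response was: %s' % (page, body))
      return []
   positions = []
   for item in body['content']['positionResult']['result']:
      # iterate over however many postings actually came back
      positions.append({
         '公司名称': item['companyFullName'],
         '城市': item['city'],
         '学历': item['education'],
         '工作年限': item['workYear'],
         '薪资': item['salary'],
         '工作标签': item['positionLables'],
         '福利': item['positionAdvantage']})
   time.sleep(2)   # pause briefly between pages to stay under the rate limit
   return positions

Returning the list instead of printing it also makes it easier to collect every page for later analysis.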

The code above fetches the postings for a single page; next we fetch the postings for every page returned by the keyword search. Searching for "自动化测试工程师" yields 30 pages of results (see the screenshot of the search page).

So we simply call laGou() once per page, passing a different value to its page parameter each time; see the source code (and, right after it, a sketch that derives the page count instead of hard-coding it):

for item in range(1, 31):
   laGou(page=item)
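Hard-coding range(1, 31) only works while the search happens to return exactly 30 pages. When this was written the response JSON also carried a totalCount field under content.positionResult, so the page count can be derived instead; a minimal sketch, reusing getHeaders() and laGou() from above and assuming that field is still present:

import math

import requests

first_page = requests.post(
   url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false',
   headers=getHeaders(),
   data={'first': True, 'pn': 1, 'kd': '自动化测试工程师'}).json()

total = first_page['content']['positionResult']['totalCount']   # total number of postings
pages = math.ceil(total / 15)                                    # 15 postings per page

for item in range(1, pages + 1):
   laGou(page=item)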

The paging loop itself is very simple. The complete source code is shown below:

#!/usr/bin/env python
# coding:utf-8

# Author:WuYa

import csv        # imported but not used yet; the results are only printed below
import requests


def getHeaders():
   headers = {
      'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
      'Cookie': '_ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAAGFABEF93F47251563A52306423D37E945D2C54; _gat=1; LGSID=20181029213144-fa3c8e13-db7e-11e8-b51c-525400f775ce; PRE_UTM=; PRE_HOST=www.bing.com; PRE_SITE=https%3A%2F%2Fwww.bing.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539529521,1539785285,1540794503,1540819902; SEARCH_ID=ae3ae41a58d94802a68e848d36c30711; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540819909; LGRID=20181029213151-fe7324dc-db7e-11e8-b51c-525400f775ce',
      'Referer': 'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95'}
   return headers


def laGou(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false', page=2):
   positions = []
   # POST the Ajax endpoint: pn is the page number, kd is the search keyword
   r = requests.post(
      url=url,
      headers=getHeaders(),
      data={'first': False, 'pn': page, 'kd': '自动化测试工程师'})
   # parse the JSON body once, then walk the 15 postings returned per page
   results = r.json()['content']['positionResult']['result']
   for i in range(15):
      city = results[i]['city']
      education = results[i]['education']
      workYear = results[i]['workYear']
      positionAdvantage = results[i]['positionAdvantage']
      salary = results[i]['salary']
      companyFullName = results[i]['companyFullName']
      positionLables = results[i]['positionLables']
      position = {
         '公司名称': companyFullName,
         '城市': city,
         '学历': education,
         '工作年限': workYear,
         '薪资': salary,
         '工作标签': positionLables,
         '福利': positionAdvantage
      }
      positions.append(position)
   for item in positions:
      print(item)


if __name__ == '__main__':
   # fetch all 30 pages returned by the keyword search
   for item in range(1, 31):
      laGou(page=item)

As shown above, with nothing more than the Requests library we can easily pull the postings for a given search keyword from Lagou. Of course, there is still plenty left to do.
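One obvious next step, and presumably why csv is imported in the full source, is to save the collected postings instead of only printing them. Here is a minimal sketch with csv.DictWriter, assuming laGou() is changed to return its positions list; the save_positions helper and the lagou.csv filename are my own choices, not part of the original script:

import csv


def save_positions(positions, filename='lagou.csv'):
   # the dictionary keys built in laGou() double as the CSV column headers
   fieldnames = ['公司名称', '城市', '学历', '工作年限', '薪资', '工作标签', '福利']
   with open(filename, 'w', newline='', encoding='utf-8-sig') as f:   # utf-8-sig so Excel displays the Chinese headers correctly
      writer = csv.DictWriter(f, fieldnames=fieldnames)
      writer.writeheader()
      writer.writerows(positions)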
