Getting cookies in Python: batch-scraping Lagou job listings with Python

Latest recommended article published on 2020-12-04 11:34:27

weixin_39537397


Today we are going to scrape job listings from Lagou (拉勾网).

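Requirement 1:

Collect the following fields:

City
Company name
Company size
Education requirement
Position title
Salary
Work experience

Requirement 2:

Join the fields with commas (,) and write them to a CSV file.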

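URL analysis:

URL: https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=

View the page source

Inspecting the source shows that Lagou loads its data dynamically, so the listings are not in the HTML itself and we have to capture the network requests to find them. Press F12, or right-click and choose Inspect, to open the developer tools and look at the captured requests. (If nothing shows up, refresh the page.)

With so many requests, which one do we need?

Search for a keyword.

Once we have confirmed that a response contains the information we need, we can start writing the crawler.

Take the URL from that request's headers: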

import requests

api_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
}
# Send the POST request with the User-Agent header and decode the JSON body.
response = requests.post(api_url, headers=header)
result = response.json()
print(result)

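Let's print the result:

The response says we are making requests too frequently. But we only sent a single request, and we did set a request header, so why does this happen?

Try adding the following to the request:

Host
Origin
Referer
Cookie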

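import requests

api_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    # Placeholder: paste the Cookie value copied from the browser's request headers.
    'Cookie': '...',
}
response = requests.post(api_url, headers=header)
result = response.json()
print(result)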

Print the result again:

It still says we are requesting too frequently. We sent both the User-Agent and the cookie, so why does it still fail?

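Let's analyze this:

First, our IP has definitely not been banned.
According to the HTTP protocol, cookies are set by the server and then stored and sent back by the client.
JavaScript can also modify cookies.
So let's clear all of the cookies.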

Refresh the page again.

The server sends us new cookies, which contain session information, IDs, and so on.
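To make the cookie round trip concrete, here is a minimal sketch using only the standard library. It parses `Set-Cookie` header values the way a client would and builds the `Cookie` header that goes back to the server. The cookie names and values are made up for illustration; the ones Lagou actually issues will differ.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie values; real names and values will differ.
set_cookie_headers = [
    "JSESSIONID=abc123; Path=/; HttpOnly",
    "user_trace_token=xyz789; Max-Age=31536000; Path=/",
]


def parse_set_cookie(headers):
    """Collect cookie name/value pairs from a list of Set-Cookie header values."""
    jar = {}
    for raw in headers:
        cookie = SimpleCookie()
        cookie.load(raw)
        for name, morsel in cookie.items():
            jar[name] = morsel.value
    return jar


def to_cookie_header(jar):
    """Build the Cookie request header the client sends back to the server."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


cookies = parse_set_cookie(set_cookie_headers)
print(to_cookie_header(cookies))
```

Attributes such as `Path` and `Max-Age` tell the client how to store the cookie; only the name/value pairs are sent back.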

Let's try them to find out which one we need:

Print again:

This time the data we need comes back.

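However, there is a problem: this cookie is not stable. After it has been used a number of times, the "too frequent" warning appears again. How do we solve this once and for all?

Send a request to https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput= first, so that the server issues a fresh cookie.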

import requests

url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
# Request the listing page so the server sets fresh cookies on the response.
responses = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'})
cookie = responses.cookies
print(cookie)

Print it:

The cookies we need are there, which removes the instability we just ran into.

Next, add the cookies to the request.

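url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
responses = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'})
cookie = responses.cookies
# Pass the fresh cookies along with the API request (api_url and header as defined earlier).
response = requests.post(api_url, headers=header, cookies=cookie)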
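The same refresh-then-query idea can be wrapped in a helper. This is only a sketch under stated assumptions: `fetch_page` and `build_form` are hypothetical names, `session` is assumed to behave like `requests.Session` (which remembers cookies from earlier responses and sends them on later requests), and setting `first` to `false` after page 1 is a guess, since the article only requests page 1.

```python
LIST_URL = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
API_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)


def build_form(page, keyword="python"):
    """Form data for one result page.

    Setting 'first' to 'false' after page 1 is an assumption; the article
    only ever requests page 1.
    """
    return {"first": "true" if page == 1 else "false", "pn": str(page), "kd": keyword}


def fetch_page(session, page=1):
    """Refresh the cookies on the listing page, then query the JSON API.

    `session` is expected to behave like requests.Session: it keeps the
    cookies set by the GET and sends them with the POST.
    """
    headers = {"User-Agent": USER_AGENT, "Referer": LIST_URL}
    session.get(LIST_URL, headers=headers)  # the server issues fresh cookies here
    response = session.post(API_URL, headers=headers, data=build_form(page))
    return response.json()
```

With the requests library installed, a call would look like `fetch_page(requests.Session(), page=1)`. Whether the site still accepts this sequence is not something the sketch can guarantee.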

Now let's process the data.

From the returned data we can see that what we need sits under content → positionResult → result. Let's write the code.

import requests

url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
responses = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'})
cookie = responses.cookies

api_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
}
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python',
}
response = requests.post(api_url, headers=header, data=data, cookies=cookie)  # send the request
result = response.json()  # decode the JSON body
results = result['content']['positionResult']['result']  # drill down to the list of positions
for i in results:  # iterate over the positions
    d = {  # keep only the fields we need, as a dict
        'city': i['city'],
        'companyFullName': i['companyFullName'],
        'companySize': i['companySize'],
        'education': i['education'],
        'positionName': i['positionName'],
        'salary': i['salary'],
        'workYear': i['workYear'],
    }
    print(d)
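The chained lookup `result['content']['positionResult']['result']` raises an error whenever the server answers with the "too frequent" message instead of results. A defensive sketch of the same extraction follows; `extract_positions` is a hypothetical helper and the sample payloads are made up to mirror the layers named in the article.

```python
FIELDS = (
    "city", "companyFullName", "companySize", "education",
    "positionName", "salary", "workYear",
)


def extract_positions(payload):
    """Return a list of dicts with the wanted fields, or [] if the layers are missing."""
    try:
        rows = payload["content"]["positionResult"]["result"]
    except (KeyError, TypeError):
        return []
    return [{field: row.get(field, "") for field in FIELDS} for row in rows]


# Made-up payload shaped like the layers described in the article.
sample = {
    "content": {
        "positionResult": {
            "result": [
                {
                    "city": "Beijing",
                    "companyFullName": "Example Co., Ltd.",
                    "companySize": "50-150",
                    "education": "Bachelor",
                    "positionName": "Python Engineer",
                    "salary": "15k-25k",
                    "workYear": "3-5 years",
                    "otherField": "ignored",
                }
            ]
        }
    }
}

print(extract_positions(sample))
print(extract_positions({"msg": "too frequent"}))
```

Returning an empty list lets the caller decide whether to refresh the cookies and retry or to stop.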

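Printed result:

That completes the first requirement. Now for the second:

Join the fields with commas (,) and write them to a CSV file.

Without further ado, here is the code.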

import requests

url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
responses = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'})
cookie = responses.cookies

api_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
}
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python',
}
response = requests.post(api_url, headers=header, data=data, cookies=cookie)  # send the request
result = response.json()  # decode the JSON body
results = result['content']['positionResult']['result']  # drill down to the list of positions

# Open the file once in append mode and write one comma-separated line per position.
with open('拉钩职位信息.csv', 'a', encoding='utf-8') as f:
    for i in results:
        d = {
            'city': i['city'],
            'companyFullName': i['companyFullName'],
            'companySize': i['companySize'],
            'education': i['education'],
            'positionName': i['positionName'],
            'salary': i['salary'],
            'workYear': i['workYear'],
        }
        f.write(','.join(str(v) for v in d.values()))
        f.write('\n')
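Joining values with commas breaks as soon as a field itself contains a comma, for example a company name. As an alternative sketch, not the article's method, the standard `csv` module quotes such fields automatically. The row below is made-up sample data, and `write_rows` is a hypothetical helper name.

```python
import csv
import io

FIELDS = [
    "city", "companyFullName", "companySize", "education",
    "positionName", "salary", "workYear",
]


def write_rows(rows, stream):
    """Write a header line plus one CSV line per position dict to `stream`."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row; note the comma inside the company name.
rows = [
    {
        "city": "Beijing",
        "companyFullName": "Example Co., Ltd.",
        "companySize": "50-150",
        "education": "Bachelor",
        "positionName": "Python Engineer",
        "salary": "15k-25k",
        "workYear": "3-5 years",
    }
]

buffer = io.StringIO(newline="")
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to a real file instead of an in-memory buffer, pass a file opened with `open(path, "a", newline="", encoding="utf-8")`.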

Now run it.