Scraping Lagou and Visualizing the Results

Scraping a dynamic web page: Lagou

  • Fetching the data:

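Lagou loads its job listings dynamically: the page fetches them in the background through Ajax requests instead of embedding them in the HTML. The script below searches for web-crawler positions (the search keyword used in the code is 'Python爬虫'), stores each result in a local MongoDB database, and the stored records are then read with pandas for visualization.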
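Because the listings arrive as JSON, the core of the job is walking a nested dictionary. The following is a minimal sketch of that step, not the site's documented API: the key path content → positionResult → result is the one the script in this post reads, while `extract_positions`, `FIELDS`, and the sample payload values are hypothetical names and made-up data used only for illustration.

```python
# Field names below are a subset of the keys the script in this post copies.
FIELDS = ('positionName', 'companyShortName', 'salary', 'district')


def extract_positions(payload, fields=FIELDS):
    """Return one flat dict per job found in a decoded Ajax response.

    Returns an empty list when the expected keys are missing, for example
    when the server answered with an error message instead of job data.
    """
    content = payload.get('content') or {}
    position_result = content.get('positionResult') or {}
    results = position_result.get('result') or []
    return [{name: item.get(name) for name in fields} for item in results]


# Made-up payload that mimics the assumed response shape.
sample = {
    'content': {
        'positionResult': {
            'result': [
                {
                    'positionName': 'Python爬虫工程师',
                    'companyShortName': 'ExampleCo',
                    'salary': '10k-20k',
                    'district': '西湖区',
                    'extraField': 'ignored',
                },
            ],
        },
    },
}

rows = extract_positions(sample)
```

Using `.get()` at every level means a rejected request yields an empty list rather than a `KeyError`, which makes it easier to tell "no jobs on this page" apart from a crash.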

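The site uses cookies as an anti-scraping measure: it only returns the requested data when the request carries valid cookies. You therefore need to log in, obtain the cookies, and send them along with the POST request.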
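One way to carry the cookies is to copy the Cookie header from a logged-in browser session and turn it into a dictionary for the `cookies=` argument of the POST call. The sketch below assumes that approach; `parse_cookie_header` and `post_with_cookies` are hypothetical helper names, and `session` is expected to be a `requests.Session` (or any object with a compatible `post` method).

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in raw.split(';'):
        # partition() splits on the first '=' only, so values that
        # themselves contain '=' are kept intact.
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def post_with_cookies(session, url, form, raw_cookie, headers=None):
    """POST the search form with cookies copied from a logged-in browser.

    `session` is assumed to be a requests.Session; the cookies are passed
    through the standard `cookies=` keyword of its post() method.
    """
    return session.post(
        url,
        data=form,
        headers=headers or {},
        cookies=parse_cookie_header(raw_cookie),
        timeout=10,
    )
```

With the requests library this would be called as `post_with_cookies(requests.Session(), url, params, raw_cookie, headers=header)`. Cookies copied this way expire, so the raw string has to be refreshed from the browser when the server starts rejecting requests again.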

import json
import time

import pymongo
import requests

# MongoDB connection settings: host, database name, and collection name.
MONGO_URL='localhost'
MONGO_DB='lagou'
MONGO_Table='爬虫职位'
client=pymongo.MongoClient(MONGO_URL)
db=client[MONGO_DB]

header={
    # "Accept": 'application/json, text/javascript, */*; q=0.01',
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'Content-Length': '25',
    #'Cookie':'user_trace_token=20180104184119-ccd43637-f13b-11e7-a015-5254005c3644; LGUID=20180104184119-ccd439d1-f13b-11e7-a015-5254005c3644; LG_LOGIN_USER_ID=a643439eedb073e2f644f66224bb681e4fe7f6f95511a0d3bcd37bf967d17e5a; index_location_city=%E5%8C%97%E4%BA%AC; TG-TRACK-CODE=index_navigation; JSESSIONID=ABAAABAAADEAAFI7C9BBF1C591A3763E668C3E7CA66E092; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_Python%3Fpx%3Ddefault%26city%3D%25E5%258C%2597%25E4%25BA%25AC; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F97805.html; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1532158783,1532173689,1532333600,1532661280; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1532671197; _gid=GA1.2.1231654137.1532670794; _ga=GA1.2.698874613.1515062479; LGSID=20180727135313-597a0c87-9161-11e8-9f5b-5254005c3644; LGRID=20180727135956-49fd2498-9162-11e8-9f5b-5254005c3644; SEARCH_ID=ad4fa500888b4659b4d1e24c422039f1',
    #'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_Python?px=default&city=%E5%8C%97%E4%BA%AC',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Mobile Safari/537.36',
    # 'X-Anit-Forge-Code': '0',
    # 'X-Anit-Forge-Token': None,
    #'X-Requested-With': 'XMLHttpRequest'
}

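def get_info(url):
    # Request result pages 1 to 29; the form field 'first' is 'true' only for page 1.
    for x in range(1, 30):
        y = 'true' if x == 1 else 'false'
        params = {'first': y, 'pn': x, 'kd': 'Python爬虫'}
        try:
            html = requests.post(url, data=params, headers=header, timeout=10)
            json_data = json.loads(html.text)
            results = json_data['content']['positionResult']['result']
        except requests.exceptions.RequestException as e:
            # Network problem or timeout: report it and move on to the next page.
            print('page', x, 'request failed:', e)
            continue
        except (ValueError, KeyError, TypeError):
            # The body was not the expected JSON, e.g. because the request was rejected.
            print('page', x, 'returned no job data:', html.text[:200])
            continue
        for result in results:
            # Keep only the fields needed for the later analysis.
            infos = {
                'positionName': result['positionName'],
                'companyShortName': result['companyShortName'],
                'financeStage': result['financeStage'],
                'companySize': result['companySize'],
                'firstType': result['firstType'],
                'district': result['district'],
                'positionAdvantage': result['positionAdvantage'],
                'industryField': result['industryField'],
                'positionLables': result['positionLables'],
                'education': result['education'],
                'salary': result['salary'],
                'secondType': result['secondType'],
                'workYear': result['workYear'],
            }
            print(infos)
            db[MONGO_Table].insert_one(infos)
            # Pause one second after each stored record to slow the crawl down.
            time.sleep(1)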


if __name__ == '__main__':
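    # The city is URL-encoded in the query string: %E6%9D%AD%E5%B7%9E is 杭州 (Hangzhou).
    url='https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E6%9D%AD%E5%B7%9E&needAddtionalResult=false'
    # Alternative endpoint for 北京 (Beijing):
    #url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false'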
    get_info(url)
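
For the pandas step mentioned at the top, the stored records can be loaded into a DataFrame once the salary strings are made numeric. This is a rough sketch under stated assumptions: salary values are assumed to look like '10k-20k', `parse_salary`, `load_positions`, and the `salary_mid_k` column are hypothetical names, and the database defaults mirror the settings in the script above. The pandas and pymongo imports are deferred into `load_positions` so that the parsing helper can be tried without a database.

```python
import re


def parse_salary(text):
    """Return the midpoint of a salary string such as '10k-20k', in k.

    Returns None when the string contains no '<number>k' part.
    """
    numbers = [int(n) for n in re.findall(r'(\d+)\s*[kK]', text or '')]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


def load_positions(mongo_url='localhost', db_name='lagou', collection='爬虫职位'):
    """Read the scraped records from MongoDB into a pandas DataFrame."""
    import pandas as pd
    import pymongo

    client = pymongo.MongoClient(mongo_url)
    # Exclude MongoDB's internal _id field from the returned documents.
    records = list(client[db_name][collection].find({}, {'_id': 0}))
    frame = pd.DataFrame(records)
    if frame.empty:
        return frame
    # Add a numeric column that can be grouped and plotted.
    frame['salary_mid_k'] = frame['salary'].map(parse_salary)
    return frame
```

From there, something like `load_positions().groupby('district')['salary_mid_k'].mean()` gives an average per district, and calling `.plot(kind='barh')` on that result draws a bar chart when matplotlib is installed.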
