Python machine learning: scraping Python-related job postings from Lagou

easterding

Published 2021-04-10 20:56:21

Tags: visualization, python, data analysis

Article link: https://blog.csdn.net/easterding/article/details/115583925

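This article walks through using Python to scrape and analyze 20 pages of Python-related job postings from Lagou, covering data scraping, cleaning, and visualization. It finds that Beijing, Shanghai, and Shenzhen lead in both average Python salary and number of positions, and that company benefits cluster around performance bonuses and stock options.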

(Summary generated automatically by CSDN.)

1. Data scraping

1.1 Analyzing the structure of the target page

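I chose Lagou, where internet-industry jobs are relatively concentrated, as the data source. Opening the site, right-clicking, and choosing Inspect shows the page source. To get past Lagou's anti-scraping measures, the scraper first has to imitate a browser so that the Lagou server treats it like an ordinary visitor: it uses requests to send the target url a request carrying request headers and a request body, and reads the required data from the response.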
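To make "request headers plus request body" concrete, here is a minimal offline sketch. It uses the standard library's urllib instead of requests so that the request can be built and inspected without being sent. The endpoint `PLACEHOLDER_URL` and the helper `build_request` are hypothetical; only the form fields `first`, `pn`, and `kd` come from the article's code.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only so the request can be built offline.
PLACEHOLDER_URL = "https://example.com/positions"


def build_request(page_number, keyword="python"):
    """Build (but do not send) a POST request that resembles a browser's."""
    headers = {
        # A browser-like User-Agent is the usual first step against simple bot checks.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "X-Requested-With": "XMLHttpRequest",
    }
    # Form fields as used in the article: first page flag, page number, keyword.
    form = {"first": "true", "pn": page_number, "kd": keyword}
    body = urlencode(form).encode("utf-8")
    return Request(PLACEHOLDER_URL, data=body, headers=headers)


req = build_request(1)
print(req.get_method())    # POST, because a body is attached
print(req.data)            # the URL-encoded form body
print(req.header_items())  # the headers that would be sent
```

Nothing leaves the machine here; the point is only to show which parts of the request the server sees.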

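Here I scraped 20 pages of results. Each record has the following fields: company full name (公司全名), company short name (公司简称), company size (公司规模), position name (职位名称), work experience (工作经验), education requirement (学历要求), salary (薪资), position benefits (职位福利), position type (职位类型), company benefits (公司福利), and city (城市). The results are saved to the file Python_engineer_second.csv.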
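As a rough illustration of that file layout, the sketch below writes a header row with the eleven field names followed by one record. It uses the standard library's csv module and an in-memory buffer to stay dependency-free; how the article itself writes the file is not shown in this part. The helper `rows_to_csv` and the sample row are made up for illustration and are not scraped data.

```python
import csv
import io

# Column names as listed in the text; they form the CSV header row.
COLUMNS = ['公司全名', '公司简称', '公司规模', '职位名称', '工作经验',
           '学历要求', '薪资', '职位福利', '职位类型', '公司福利', '城市']


def rows_to_csv(rows, out):
    """Write the header and then each row to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError("each row needs exactly %d values" % len(COLUMNS))
        writer.writerow(row)


# Invented sample row, for illustration only.
sample_rows = [[
    'Example Co. Ltd.', 'Example', '50-150', 'Python engineer', '1-3 years',
    'Bachelor', '10k-20k', 'flexible hours', 'backend', 'bonus', 'Beijing',
]]

buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of a buffer, one could pass `open('Python_engineer_second.csv', 'w', newline='', encoding='utf-8-sig')`; the `utf-8-sig` encoding is a suggestion so that spreadsheet programs display the Chinese header correctly.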

1.2 Scraping code

import requests
import math
import time
import pandas as pd
from bs4 import BeautifulSoup
from threading import Thread
import threading
import numpy as np
import matplotlib.pyplot as plt
import re
 
# Fetch one page of results: send a request to the given url, carrying request headers and a request body
def get_json(url, num):
    # Lagou search page, visited first so that its cookies can be reused below
    url1 = 'https://www.lagou.com/jobs/list_python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': 'None',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # The data endpoint (url) is found by watching the page's activity in the browser's Network panel
    data = {
        'first': 'true',
        'pn': num,        # page number
        'kd': 'python'}   # search keyword
    # Load the search page in a session, then reuse its cookies for the POST to the data endpoint
    s = requests.Session()
    s.get(url=url1, headers=headers, timeout=3)
    cookie = s.cookies
    res = requests.post(url, headers=headers, data=data, cookies=cookie, timeout=3)
    res.raise_for_status()
    res.encoding = 'utf-8'
    page_data = res.json()
    return page_data

def get_page_info(jobs_list):
    # Define an empty list to collect the results
    page_info_list = []
    for i in jobs_list:
        job_info = []
        job_info.append(i['companyFullName'])
        job_info.append(i['companyShortName'])
        job_info.append(i['companySize'])
        job_info.append(i['positionName'])
        job_info.append(i['workYear']