Salary Analysis of Data Analyst Positions in Major Chinese Cities
1. Project Background
Since I am considering a career switch into data analysis, I analyzed job-posting data to understand the market demand, industry distribution, and salary levels for the role, in order to clarify my job-search direction.
2. Data Acquisition
The data comes from the Boss直聘 (zhipin.com) job site and was collected with a web scraper.
The cities covered are mainly first-tier and new first-tier (relatively developed) cities.
The scraper code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup

# Path to the local chromedriver executable
driver = webdriver.Chrome(r'D:\PycharmProjects\python_present\boss直聘爬取\chromedriver.exe')

# City codes used by zhipin.com
cities = [{"name": "北京", "code": 101010100, "url": "/beijing/"},
          {"name": "上海", "code": 101020100, "url": "/shanghai/"},
          {"name": "广州", "code": 101280100, "url": "/guangzhou/"},
          {"name": "深圳", "code": 101280600, "url": "/shenzhen/"},
          {"name": "杭州", "code": 101210100, "url": "/hangzhou/"},
          {"name": "天津", "code": 101030100, "url": "/tianjin/"},
          {"name": "苏州", "code": 101190400, "url": "/suzhou/"},
          {"name": "武汉", "code": 101200100, "url": "/wuhan/"},
          {"name": "厦门", "code": 101230200, "url": "/xiamen/"},
          {"name": "长沙", "code": 101250100, "url": "/changsha/"},
          {"name": "成都", "code": 101270100, "url": "/chengdu/"},
          {"name": "郑州", "code": 101180100, "url": "/zhengzhou/"},
          {"name": "重庆", "code": 101040100, "url": "/chongqing/"},
          {"name": "青岛", "code": 101120200, "url": "/qingdao/"},
          {"name": "南京", "code": 101190100, "url": "/nanjing/"}]

for city in cities:
    # The first 7 result pages for the query "数据分析" in this city
    urls = ['https://www.zhipin.com/c{}/?query=数据分析&page={}&ka=page-{}'.format(city['code'], i, i)
            for i in range(1, 8)]
    for url in urls:
        driver.get(url)
        html = driver.page_source
        bs = BeautifulSoup(html, 'html.parser')
        job_all = bs.find_all('div', {"class": "job-primary"})
        for job in job_all:
            position = job.find('span', {"class": "job-name"}).get_text()
            address = job.find('span', {'class': "job-area"}).get_text()
            company = job.find('div', {'class': 'company-text'}).find('h3', {'class': "name"}).get_text()
            salary = job.find('span', {'class': 'red'}).get_text()
            # The <p> inside job-limit holds experience + diploma, e.g. "3-5年本科"
            diploma = job.find('div', {'class': 'job-limit'}).find('p').get_text()[-2:]
            experience = job.find('div', {'class': 'job-limit'}).find('p').get_text()[:-2]
            labels = job.find('a', {'class': 'false-link'}).get_text()
            # Append one comma-separated row; commas inside the position name
            # are replaced so they do not break the CSV format
            with open('position.csv', 'a+', encoding='UTF-8-SIG') as f_obj:
                f_obj.write(position.replace(',', '、') + "," + address + "," + company + ","
                            + salary + "," + diploma + "," + experience + ',' + labels + "\n")
driver.quit()
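The scraped salary field is raw text such as "25-40K·15薪", so for the salary analysis later it has to be parsed into numbers. A minimal parsing sketch (the function name, and the assumption that salaries follow the "low-highK" pattern, are mine):

```python
import re

def parse_salary(s):
    """Parse a salary string like '25-40K·15薪' into the average of
    the monthly range, in thousands of RMB (assumed format)."""
    m = re.match(r'(\d+)-(\d+)K', s)
    if m is None:
        # String does not match the expected pattern (e.g. negotiable pay)
        return None
    low, high = int(m.group(1)), int(m.group(2))
    return (low + high) / 2

print(parse_salary('25-40K·15薪'))  # → 32.5
```

Strings without a "low-highK" range (for example daily-paid internship rows, which are dropped below anyway) simply return None.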
3. Data Cleaning
In [59]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from scipy.stats import norm,mode
import re
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False
The raw data has no column names, so set them as follows:
position: job title
address: company location
company: company name
salary: salary
diploma: education requirement
experience: work-experience requirement
lables: industry label
In [60]:
df = pd.read_csv('job.csv',header=None,names=['position','address','company','salary','diploma','experience','lables'])
Take an overall look at the data:
In [61]:
df.head()
Out[61]:
|   | position | address | company | salary | diploma | experience | lables |
|---|----------|---------|---------|--------|---------|------------|--------|
| 0 | 数据分析 | 北京·朝阳区·亚运村 | 中信百信银行 | 25-40K·15薪 | 本科 | 5-10年 | 银行 |
| 1 | 数据分析 | 北京·朝阳区·太阳宫 | BOSS直聘 | 25-40K·16薪 | 博士 | 1-3年 | 人力资源服务 |
| 2 | 数据分析 | 北京·朝阳区·鸟巢 | 京东集团 | 50-80K·14薪 | 本科 | 3-5年 | 电子商务 |
| 3 | 数据分析 | 北京·海淀区·清河 | 一亩田 | 15-25K | 本科 | 3-5年 | O2O |
| 4 | 数据分析岗 | 北京·海淀区·西北旺 | 建信金科 | 20-40K·14薪 | 硕士 | 5-10年 | 银行 |
In [62]:
df.shape
Out[62]:
(3045, 7)
In [63]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3045 entries, 0 to 3044
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 position 3045 non-null object
1 address 3045 non-null object
2 company 3045 non-null object
3 salary 3045 non-null object
4 diploma 3045 non-null object
5 experience 3045 non-null object
6 lables 3045 non-null object
dtypes: object(7)
memory usage: 83.3+ KB
The data turns out to contain 45 duplicate rows; remove them:
In [64]:
df.duplicated().sum()
Out[64]:
45
In [65]:
df.drop_duplicates(keep='first',inplace=True)
In [66]:
df.duplicated().sum()
Out[66]:
0
In [67]:
df.shape
Out[67]:
(3000, 7)
In [68]:
df.isnull().sum()
Out[68]:
position 0
address 0
company 0
salary 0
diploma 0
experience 0
lables 0
dtype: int64
The data also includes internship positions. Internship pay is quoted per day and is of little reference value here, so rows whose position contains "实习" (intern) are removed.
In [70]:
x=df['position'].str.contains('实习')
df=df[~x]
df.reset_index(drop=True,inplace=True)
The values in the address column are not uniform; process them so that each contains only the city name.
In [71]:
df['address']=df['add
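The preview above shows addresses in the form 北京·朝阳区·亚运村, so one straightforward approach (assuming the "·" separator holds throughout the data) is to keep only the part before the first "·". A mini-example on hypothetical values mirroring that format:

```python
import pandas as pd

# Hypothetical addresses in the same format as the scraped data:
# some have district/area parts, some are already just a city name
s = pd.Series(['北京·朝阳区·亚运村', '上海·浦东新区', '深圳'])

# Keep only the city name before the first '·' separator;
# values with no '·' are left unchanged by split()[0]
cities = s.str.split('·').str[0]
print(cities.tolist())  # ['北京', '上海', '深圳']
```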