import pandas as pd
import numpy as np
from pandas import DataFrame,Series
from matplotlib import pyplot as plt
# Add a Chinese font so matplotlib can render Chinese labels
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
tests = pd.read_excel("/******/copy.xlsx")
# Check data completeness: only the labels column has missing values (446 of 449 non-null), so the impact is small
tests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449 entries, 0 to 448
Data columns (total 18 columns):
_clueid 449 non-null int64
scale 449 non-null object
salary 449 non-null object
average_salary 449 non-null float64
low 449 non-null int64
top 449 non-null int64
avg_salary 449 non-null float64
experience 449 non-null object
education 449 non-null object
campany 449 non-null object
industry 449 non-null object
scale2 449 non-null object
phase 449 non-null object
temptation 449 non-null object
description 449 non-null object
district 449 non-null object
labels 446 non-null object
city 449 non-null object
dtypes: float64(2), int64(3), object(13)
memory usage: 63.2+ KB
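The `info()` output above shows non-null counts, so the number of gaps has to be worked out by subtraction (449 - 446 = 3 for `labels`). A minimal sketch of counting missing values per column directly is shown here. The frame `demo` and its values are hypothetical stand-ins for `tests`, used only so the snippet runs without the workbook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `tests`;
# only `labels` contains a gap, mirroring the info() output.
demo = pd.DataFrame({
    "city": ["北京", "上海", "深圳"],
    "labels": ["SQL", np.nan, "Python"],
})

# Count missing values in every column, then keep only columns that have any.
missing = demo.isnull().sum()
print(missing[missing > 0])
```

Running the same two lines against `tests` should list `labels` as the only column with missing values, assuming the data matches the output above.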
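Data cleaning:
1. Delete all duplicated columns and unused columns directly in Excel.
2. Salary is given as a range, so its average can be computed for analysis. Lowest salary: =LEFT(C2,FIND("k",C2,1)-1); highest salary: =MID(C2,FIND("-",C2)+1,LEN(C2)-FIND("-",C2)-1)
3. Company size appears in two columns, scale and scale2; these need to be merged and the redundant one removed.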
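The salary split in step 2 was done with Excel formulas. A rough pandas equivalent is sketched below for readers who prefer to keep the cleaning in code. It assumes salary strings look like "10k-20k", which is what the formulas imply. The sample rows are hypothetical, and the column names `low`, `top`, and `avg_salary` are borrowed from the `info()` output.

```python
import re

import pandas as pd

# Hypothetical sample rows; the real values live in the `salary` column.
df = pd.DataFrame({"salary": ["10k-20k", "8k-15k", "25K-40K"]})

# Capture the number before the first "k" and the number after the dash,
# which is what the LEFT/FIND and MID/FIND formulas extract.
bounds = df["salary"].str.extract(r"(\d+)k\s*-\s*(\d+)k", flags=re.IGNORECASE)

df["low"] = bounds[0].astype(int)
df["top"] = bounds[1].astype(int)

# Midpoint of the range, used as the average salary.
df["avg_salary"] = (df["low"] + df["top"]) / 2
print(df)
```

Rows that do not match the assumed pattern would produce NaN in `bounds` and make `astype(int)` fail, so real data may need `pd.to_numeric(..., errors="coerce")` instead.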
# Check the demand for data analyst positions by city
tests.city.value_counts()
# Cities with few postings can be grouped into an '其他' (Other) category
tests['city']= tests['city'].replace(["郑州","厦门","重庆","苏州","佛山","天津","贵阳","昆明","南京","东莞","福州","常州","珠海",
"乌鲁木齐","石家庄"]</