Predict the likelihood that an employee will leave by analyzing the data.
First, check whether the dataset contains any dirty data:
import pandas as pd
import numpy as np
df = pd.read_csv('HR_comma_sep.csv')
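# print(df.isnull().any())  # check whether any column contains null values
# print(np.count_nonzero(df != df))  # count NaN values (NaN is never equal to itself)
print(df.info())  # the dataset is clean: no missing values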
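Output: the data is fairly clean. There are no missing values, and only two features have the object dtype. (The trailing None is the return value of df.info(), which print() echoes after info() has written its own report.)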
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level 14999 non-null float64
last_evaluation 14999 non-null float64
number_project 14999 non-null int64
average_montly_hours 14999 non-null int64
time_spend_company 14999 non-null int64
Work_accident 14999 non-null int64
left 14999 non-null int64
promotion_last_5years 14999 non-null int64
sales 14999 non-null object
salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None
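The missing-value check above can also be summarized per column. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are assumptions for illustration, not rows from HR_comma_sep.csv. `isnull().sum()` counts the missing entries in each column, which is easier to read than a single yes/no answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the HR dataset (values are made up).
toy = pd.DataFrame({
    'satisfaction_level': [0.38, np.nan, 0.11],
    'number_project': [2, 5, 7],
    'salary': ['low', 'medium', None],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if at least one cell anywhere in the frame is missing.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real dataset the same two calls should report zero missing values for every column, matching the df.info() report above.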
Fix the column names:
df.rename(columns = {'average_montly_hours':'average_monthly_hours',
'sales':'department'},inplace = True)
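As a quick sanity check of the rename, here is a minimal sketch on a hypothetical one-row frame that only mimics the two misnamed columns (the values are made up). It uses the non-inplace form of `rename`, which returns a new DataFrame and leaves the original untouched; either form should give the same column names.

```python
import pandas as pd

# Hypothetical one-row frame with only the two columns being renamed.
toy = pd.DataFrame({'average_montly_hours': [157], 'sales': ['sales']})

# Without inplace=True, rename returns a renamed copy.
renamed = toy.rename(columns={'average_montly_hours': 'average_monthly_hours',
                              'sales': 'department'})

print(renamed.columns.tolist())
print(toy.columns.tolist())  # the original frame keeps its old names
```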
Analyze the distribution of the numeric features
# Print summary statistics for the numeric columns
print(df.describe())
Output: count, mean, standard deviation (std), minimum, lower quartile, median, upper quartile, and maximum
satisfaction_level last_evaluation number_project \
count 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054
std 0.248631 0.171169 1.232592
min 0.090000 0.360000 2.000000
25% 0.440000 0.560000 3.000000
50% 0.640000 0.720000 4.000000
75% 0.820000 0.870000 5.000000
max 1.000000 1.000000 7.000000
average_monthly_hours time_spend_company Work_accident left \
count 14999.000000 14999.000000 14999.000000 14999.000000
mean 201.050337 3.498233 0.144610 0.238083
std 49.943099 1.460136 0.351719 0.425924
min 96.000000 2.000000 0.000000 0.000000
25% 156.000000 3.000000 0.000000 0.000000
50% 200.000000 3.000000 0.000000 0.000000
75% 245.000000 4.000000 0.000000 0.000000
max 310.000000 10.000000 1.000000 1.000000
promotion_last_5years
count 14999.000000
mean 0.021268
std 0.144281
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
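Two details of `describe()` are easy to miss: by default it summarizes only the numeric columns, and its `std` row is the sample standard deviation, not the variance. The sketch below illustrates both on a made-up frame; the `toy` data is an assumption for illustration and is not taken from the HR dataset.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column.
toy = pd.DataFrame({
    'number_project': [2, 3, 4, 5, 7],
    'salary': ['low', 'low', 'medium', 'high', 'low'],
})

summary = toy.describe()
print(summary)  # only number_project appears; salary is skipped

# The std row is the square root of the sample variance (ddof=1).
std = summary.loc['std', 'number_project']
var = toy['number_project'].var()
print(std ** 2, var)
```

This is why the text columns (department and salary) are missing from the table above and are inspected separately.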
Inspect how the values of a categorical feature are distributed
print('Departments:')
print(df['department'].value_counts())
print('\nSalary:')
print(df['salary'].value_counts())
Output:
Departments:
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting