Personal Writeup: DataCastle Employee Attrition Prediction Competition
Competition link
Task:
Given factors that influence employee attrition and records of whether each employee has left, build a model that predicts which employees are likely to leave.
Data fields:
(1) Age: employee age;
(2) Label: whether the employee has left; 1 means the employee has left and 0 means they have not. This is the prediction target;
(3) BusinessTravel: business travel frequency; Non-Travel means no travel, Travel_Rarely means infrequent travel, Travel_Frequently means frequent travel;
(4) Department: the employee's department; Sales, Research & Development, or Human Resources;
(5) DistanceFromHome: distance from home to the company, from 1 (closest) to 29 (farthest);
(6) Education: education level, from 1 to 5, with 5 the highest;
(7) EducationField: field of study; Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other;
(8) EmployeeNumber: employee ID number;
(9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 (lowest) to 4 (highest);
(10) Gender: employee gender, Male or Female;
(11) JobInvolvement: job involvement, from 1 (lowest) to 4 (highest);
(12) JobLevel: job level, from 1 (lowest) to 5 (highest);
(13) JobRole: job role; Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources;
(14) JobSatisfaction: job satisfaction, from 1 (lowest) to 4 (highest);
(15) MaritalStatus: marital status; Single, Married, or Divorced;
(16) MonthlyIncome: monthly income, ranging from 1009 to 19999;
(17) NumCompaniesWorked: number of companies the employee has previously worked for;
(18) Over18: whether the employee is over 18;
(19) OverTime: whether the employee works overtime, Yes or No;
(20) PercentSalaryHike: percentage salary increase;
(21) PerformanceRating: performance rating;
(22) RelationshipSatisfaction: relationship satisfaction, from 1 (lowest) to 4 (highest);
(23) StandardHours: standard working hours;
(24) StockOptionLevel: stock option level;
(25) TotalWorkingYears: total years of working experience;
(26) TrainingTimesLastYear: training time last year, from 0 (no training) to 6 (the most training);
(27) WorkLifeBalance: work-life balance, from 1 (worst) to 4 (best);
(28) YearsAtCompany: years at the current company;
(29) YearsInCurrentRole: years in the current role;
(30) YearsSinceLastPromotion: years since the last promotion;
(31) YearsWithCurrManager: years working with the current manager;
Scoring:
The metric is accuracy: the higher the accuracy, the better the model separates employees who leave from those who stay.
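The accuracy metric can be reproduced locally with scikit-learn's accuracy_score; a minimal sketch on made-up labels (illustrative only, not competition data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = left, 0 = stayed (illustrative only)
y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0])

# Accuracy = fraction of predictions matching the true labels
acc = accuracy_score(y_true, y_pred)
print(acc)  # 4 of 5 correct -> 0.8
```

Evaluating against a held-out split with this metric mirrors how the leaderboard scores a submission.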
Load the usual libraries and the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test_noLabel.csv')
pd.set_option('display.max_columns', None)
Quick EDA
train.head()
test.head()
train.info()
print('-------------------')
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 1100 non-null int64
1 Age 1100 non-null int64
2 BusinessTravel 1100 non-null object
3 Department 1100 non-null object
4 DistanceFromHome 1100 non-null int64
5 Education 1100 non-null int64
6 EducationField 1100 non-null object
7 EmployeeNumber 1100 non-null int64
8 EnvironmentSatisfaction 1100 non-null int64
9 Gender 1100 non-null object
10 JobInvolvement 1100 non-null int64
11 JobLevel 1100 non-null int64
12 JobRole 1100 non-null object
13 JobSatisfaction 1100 non-null int64
14 MaritalStatus 1100 non-null object
15 MonthlyIncome 1100 non-null int64
16 NumCompaniesWorked 1100 non-null int64
17 Over18 1100 non-null object
18 OverTime 1100 non-null object
19 PercentSalaryHike 1100 non-null int64
20 PerformanceRating 1100 non-null int64
21 RelationshipSatisfaction 1100 non-null int64
22 StandardHours 1100 non-null int64
23 StockOptionLevel 1100 non-null int64
24 TotalWorkingYears 1100 non-null int64
25 TrainingTimesLastYear 1100 non-null int64
26 WorkLifeBalance 1100 non-null int64
27 YearsAtCompany 1100 non-null int64
28 YearsInCurrentRole 1100 non-null int64
29 YearsSinceLastPromotion 1100 non-null int64
30 YearsWithCurrManager 1100 non-null int64
31 Label 1100 non-null int64
dtypes: int64(24), object(8)
memory usage: 275.1+ KB
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 350 non-null int64
1 Age 350 non-null int64
2 BusinessTravel 350 non-null object
3 Department 350 non-null object
4 DistanceFromHome 350 non-null int64
5 Education 350 non-null int64
6 EducationField 350 non-null object
7 EmployeeNumber 350 non-null int64
8 EnvironmentSatisfaction 350 non-null int64
9 Gender 350 non-null object
10 JobInvolvement 350 non-null int64
11 JobLevel 350 non-null int64
12 JobRole 350 non-null object
13 JobSatisfaction 350 non-null int64
14 MaritalStatus 350 non-null object
15 MonthlyIncome 350 non-null int64
16 NumCompaniesWorked 350 non-null int64
17 Over18 350 non-null object
18 OverTime 350 non-null object
19 PercentSalaryHike 350 non-null int64
20 PerformanceRating 350 non-null int64
21 RelationshipSatisfaction 350 non-null int64
22 StandardHours 350 non-null int64
23 StockOptionLevel 350 non-null int64
24 TotalWorkingYears 350 non-null int64
25 TrainingTimesLastYear 350 non-null int64
26 WorkLifeBalance 350 non-null int64
27 YearsAtCompany 350 non-null int64
28 YearsInCurrentRole 350 non-null int64
29 YearsSinceLastPromotion 350 non-null int64
30 YearsWithCurrManager 350 non-null int64
dtypes: int64(23), object(8)
memory usage: 84.9+ KB
The dataset is quite simple and contains no missing values, so no missing-value handling is needed.
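The no-missing-values observation can also be confirmed explicitly rather than read off the info() dump; a small sketch on a toy frame (with the real data, the same check would run on train and test):

```python
import pandas as pd

# Toy frame standing in for the competition data
df = pd.DataFrame({'Age': [30, 41, 25], 'OverTime': ['Yes', 'No', 'No']})

# Per-column count of missing values; an all-zero result means
# no imputation step is required
missing = df.isnull().sum()
print(int(missing.sum()))  # 0 -> no missing values anywhere
```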
train.columns
Index(['ID', 'Age', 'BusinessTravel', 'Department', 'DistanceFromHome',
'Education', 'EducationField', 'EmployeeNumber',
'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel',
'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome',
'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Label'],
dtype='object')
The dataset contains many categorical variables, so let's look at how they are distributed.
catfeatures = ['BusinessTravel', 'Department', 'Education', 'EducationField',
'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel',
'JobRole', 'JobSatisfaction', 'MaritalStatus',
'NumCompaniesWorked', 'Over18', 'OverTime',
'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
'StockOptionLevel','TrainingTimesLastYear','WorkLifeBalance']
for feature in catfeatures:
    print(train[feature].value_counts())
    print('----------------')
Travel_Rarely 787
Travel_Frequently 205
Non-Travel 108
Name: BusinessTravel, dtype: int64
----------------
Research & Development 727
Sales 331
Human Resources 42
Name: Department, dtype: int64
----------------
3 431
4 301
2 206
1 126
5 36
Name: Education, dtype: int64
----------------
Life Sciences 462
Medical 337
Marketing 127
Technical Degree 92
Other 63
Human Resources 19
Name: EducationField, dtype: int64
----------------
4 338
3 337
1 215
2 210
Name: EnvironmentSatisfaction, dtype: int64
----------------
Male 653
Female 447
Name: Gender, dtype: int64
----------------
3 661
2 273
4 103
1 63
Name: JobInvolvement, dtype: int64
----------------
1 412
2 399
3 157
4 81
5 51
Name: JobLevel, dtype: int64
----------------
Sales Executive 247
Research Scientist 221
Laboratory Technician 205
Manufacturing Director 101
Healthcare Representative 100
Manager 80
Sales Representative 57
Research Director 56
Human Resources 33
Name: JobRole, dtype: int64
----------------
4 350
3 325
1 219
2 206
Name: JobSatisfaction, dtype: int64
----------------
Married 500
Single 362
Divorced 238
Name: MaritalStatus, dtype: int64
----------------
1 390
0 151
3 114
2 113
4 101
7 56
6 52
5 45
8 41
9 37
Name: NumCompaniesWorked, dtype: int64
----------------
Y 1100
Name: Over18, dtype: int64
----------------
No 794
Yes 306
Name: OverTime, dtype: int64
----------------
3 932
4 168
Name: PerformanceRating, dtype: int64
----------------
3 340
4 323
1 220
2 217
Name: RelationshipSatisfaction, dtype: int64
----------------
80 1100
Name: StandardHours, dtype: int64
----------------
0 473
1 446
2 122
3 59
Name: StockOptionLevel, dtype: int64
----------------
2 396
3 379
4 94
5 89
1 50
6 48
0 44
Name: TrainingTimesLastYear, dtype: int64
----------------
3 678
2 256
4 103
1 63
Name: WorkLifeBalance, dtype: int64
----------------
sns.set_style('whitegrid')
for feature in catfeatures:
    train[[feature,'Label']].groupby([feature]).mean().plot.bar()
The value counts show that 'StandardHours' and 'Over18' each contain only a single value, so both columns are dropped.
train.drop(['StandardHours','Over18'],axis=1,inplace=True)
test.drop(['StandardHours','Over18'],axis=1,inplace=True)
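Constant columns like these two can also be detected programmatically instead of by scanning the value counts; a sketch on a toy frame (column names chosen to mirror the ones dropped above):

```python
import pandas as pd

# Toy frame with one constant and one informative column
df = pd.DataFrame({'StandardHours': [80, 80, 80],
                   'Age': [30, 41, 25]})

# A column with a single unique value carries no information
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # ['StandardHours']
```

Running the same list comprehension over the real train frame would flag both 'StandardHours' and 'Over18'.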
Examine correlations
columns = train.columns.drop('ID')
correlation = train[columns].corr()
plt.figure(figsize=(15, 15))
sns.heatmap(correlation,square = True, annot=True, fmt='0.2f',vmax=0.8)
The correlation matrix reveals strong collinearity between 'MonthlyIncome' and 'JobLevel', so 'MonthlyIncome', the one with the weaker correlation to the label, is dropped.
train.drop(['MonthlyIncome'],axis=1,inplace=True)
test.drop(['MonthlyIncome'],axis=1,inplace=True)
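The collinearity that motivates this drop can be checked with a single pairwise Pearson correlation; a sketch on synthetic numbers where income grows roughly with job level (not the competition values):

```python
import pandas as pd

# Synthetic stand-in: MonthlyIncome roughly proportional to JobLevel
df = pd.DataFrame({'JobLevel':      [1, 1, 2, 2, 3, 4, 5],
                   'MonthlyIncome': [2000, 2500, 5000, 5500, 9000, 13000, 18000]})

# A correlation near 1 signals that one of the two columns is redundant
r = df['MonthlyIncome'].corr(df['JobLevel'])
print(round(r, 2))
```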
Feature engineering
A new feature, total satisfaction, can be constructed as the sum of the three satisfaction scores.
def fea_creat(df):
    df['Satisfaction'] = df['JobSatisfaction'] + df['EnvironmentSatisfaction'] + df['RelationshipSatisfaction']
fea_creat(train)
fea_creat(test)
One-hot encode the categorical variables with get_dummies to ease the later analysis.
train = pd.get_dummies(train)
test = pd.get_dummies(test)
EmployeeNumber is just an identifier, so it is dropped.
train.drop(['EmployeeNumber'],axis=1,inplace=True)
test.drop(['EmployeeNumber'],axis=1,inplace=True)
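One caveat with the separate get_dummies calls above: if a category shows up in only one of the two splits, train and test end up with different dummy columns, and the scaler and model inputs no longer line up. A sketch of aligning test to train's columns (toy column names, not the competition schema):

```python
import pandas as pd

# Toy split where the 'HR' category appears only in train
train_df = pd.get_dummies(pd.DataFrame({'Dept': ['Sales', 'HR', 'Sales']}))
test_df = pd.get_dummies(pd.DataFrame({'Dept': ['Sales', 'Sales']}))

# Reindex test to train's columns; dummies never seen in test become 0
test_df = test_df.reindex(columns=train_df.columns, fill_value=0)
print(list(test_df.columns) == list(train_df.columns))  # True
```

In this competition the categories happen to match across splits, but the alignment step makes the pipeline robust to new data.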
Min-max scale the features to [0, 1]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_train = train.drop(['ID','Label'],axis=1)
df_test = test.drop('ID',axis=1)
train_scaled = scaler.fit_transform(df_train)
test_scaled = scaler.transform(df_test)
df_train.iloc[:,:] = train_scaled[:,:]
df_test.iloc[:,:] = test_scaled[:,:]
df_train = pd.concat([df_train,train['Label']],axis=1)
Analysis and modeling
X_data = df_train.drop('Label',axis=1)
Y_data = df_train['Label']
X_test = df_test
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (1100, 48)
X test shape: (350, 48)
# Cross-validate multiple models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
models = {
    'LR': LogisticRegression(solver='liblinear', penalty='l2', C=1),
    'SVM': SVC(C=1, gamma='auto'),
    'DT': DecisionTreeClassifier(),
    'RF': RandomForestClassifier(n_estimators=100),
    'AdaBoost': AdaBoostClassifier(n_estimators=100),
    'GBDT': GradientBoostingClassifier(n_estimators=100),
    'XGB': xgb.XGBClassifier(max_depth=10, subsample=0.7, colsample_bytree=0.75, n_estimators=100),
    'LGB': lgb.LGBMClassifier(num_leaves=120, n_estimators=100)
}
for k, clf in models.items():
    print("the model is {}".format(k))
    scores = cross_val_score(clf, X_data, Y_data, cv=10)
    print(scores)
    print("Mean accuracy is {}".format(np.mean(scores)))
    print("-" * 100)
the model is LR
[0.91891892 0.85585586 0.90909091 0.87272727 0.89090909 0.88181818
0.85454545 0.88181818 0.8440367 0.86238532]
Mean accuracy is 0.877210588403249
----------------------------------------------------------------------------------------------------
the model is SVM
[0.83783784 0.83783784 0.83636364 0.83636364 0.83636364 0.83636364
0.83636364 0.83636364 0.8440367 0.8440367 ]
Mean accuracy is 0.8381930888352906
----------------------------------------------------------------------------------------------------
the model is DT
[0.81081081 0.82882883 0.80909091 0.79090909 0.82727273 0.79090909
0.76363636 0.74545455 0.78899083 0.74311927]
Mean accuracy is 0.7899022458655487
----------------------------------------------------------------------------------------------------
the model is RF
[0.88288288 0.84684685 0.87272727 0.87272727 0.86363636 0.85454545
0.86363636 0.85454545 0.87155963 0.87155963]
Mean accuracy is 0.8654667177602958
----------------------------------------------------------------------------------------------------
the model is AdaBoost
[0.90990991 0.81981982 0.83636364 0.83636364 0.89090909 0.86363636
0.86363636 0.82727273 0.86238532 0.89908257]
Mean accuracy is 0.8609379437819804
----------------------------------------------------------------------------------------------------
the model is GBDT
[0.88288288 0.87387387 0.85454545 0.85454545 0.9 0.86363636
0.84545455 0.85454545 0.86238532 0.83486239]
Mean accuracy is 0.8626731735906048
----------------------------------------------------------------------------------------------------
the model is XGB
[0.89189189 0.85585586 0.86363636 0.86363636 0.9 0.84545455
0.84545455 0.86363636 0.83486239 0.86238532]
Mean accuracy is 0.8626813635987949
----------------------------------------------------------------------------------------------------
the model is LGB
[0.88288288 0.85585586 0.86363636 0.80909091 0.88181818 0.84545455
0.84545455 0.86363636 0.8440367 0.85321101]
Mean accuracy is 0.8545077354251667
----------------------------------------------------------------------------------------------------
Simple logistic regression turns out to perform best, so it is chosen directly for the final predictions, and the results are written out.
clf = LogisticRegression(solver='liblinear', penalty='l2', C=1)
clf.fit(X_data, Y_data)
result = clf.predict(X_test)
file = pd.DataFrame()
file['ID'] = test.ID
file['Label'] = result
file.to_csv('sub.csv',index=False)