DataCastle Employee Attrition Prediction Competition: A Personal Summary

Competition link

Task:
Given records of the factors that influence attrition and whether each employee left, build a model that predicts which employees are likely to leave.

Data fields:
(1) Age: employee age;
(2) Label: whether the employee has left, where 1 means left and 0 means stayed; this is the prediction target;
(3) BusinessTravel: business travel frequency; Non-Travel means none, Travel_Rarely means infrequent, Travel_Frequently means frequent;
(4) Department: the employee's department; Sales, Research & Development, or Human Resources;
(5) DistanceFromHome: distance from home to the workplace, from 1 (nearest) to 29 (farthest);
(6) Education: education level, from 1 to 5, with 5 the highest;
(7) EducationField: field of study; Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other;
(8) EmployeeNumber: employee ID number;
(9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 (lowest) to 4 (highest);
(10) Gender: Male or Female;
(11) JobInvolvement: job involvement, from 1 (lowest) to 4 (highest);
(12) JobLevel: job level, from 1 (lowest) to 5 (highest);
(13) JobRole: job role; Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources;
(14) JobSatisfaction: job satisfaction, from 1 (lowest) to 4 (highest);
(15) MaritalStatus: marital status; Single, Married, or Divorced;
(16) MonthlyIncome: monthly income, ranging from 1009 to 19999;
(17) NumCompaniesWorked: number of companies the employee has worked for;
(18) Over18: whether the employee is over 18;
(19) OverTime: whether the employee works overtime, Yes or No;
(20) PercentSalaryHike: percentage salary increase;
(21) PerformanceRating: performance rating;
(22) RelationshipSatisfaction: relationship satisfaction, from 1 (lowest) to 4 (highest);
(23) StandardHours: standard working hours;
(24) StockOptionLevel: stock option level;
(25) TotalWorkingYears: total years of work experience;
(26) TrainingTimesLastYear: training time last year, from 0 (no training) to 6 (the most);
(27) WorkLifeBalance: work-life balance, from 1 (worst) to 4 (best);
(28) YearsAtCompany: years at the current company;
(29) YearsInCurrentRole: years in the current role;
(30) YearsSinceLastPromotion: years since the last promotion;
(31) YearsWithCurrManager: years working with the current manager;
Scoring:
The metric is accuracy: the higher the accuracy, the better the model separates employees who left from those who stayed.
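As a concrete illustration, accuracy is simply the fraction of predictions that match the true labels (the labels below are made up for the example):

```python
import numpy as np

# Toy illustration of the accuracy metric (hypothetical labels)
y_true = np.array([1, 0, 0, 1, 0])    # actual attrition labels
y_pred = np.array([1, 0, 1, 1, 0])    # model predictions
accuracy = (y_true == y_pred).mean()  # fraction of correct predictions
print(accuracy)  # 0.8
```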

Loading Libraries and Data

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test_noLabel.csv')
pd.set_option('display.max_columns', None)

Quick EDA

train.head()

(screenshot: train.head() output)

test.head()

(screenshot: test.head() output)

train.info()
print('-------------------')
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ID                        1100 non-null   int64 
 1   Age                       1100 non-null   int64 
 2   BusinessTravel            1100 non-null   object
 3   Department                1100 non-null   object
 4   DistanceFromHome          1100 non-null   int64 
 5   Education                 1100 non-null   int64 
 6   EducationField            1100 non-null   object
 7   EmployeeNumber            1100 non-null   int64 
 8   EnvironmentSatisfaction   1100 non-null   int64 
 9   Gender                    1100 non-null   object
 10  JobInvolvement            1100 non-null   int64 
 11  JobLevel                  1100 non-null   int64 
 12  JobRole                   1100 non-null   object
 13  JobSatisfaction           1100 non-null   int64 
 14  MaritalStatus             1100 non-null   object
 15  MonthlyIncome             1100 non-null   int64 
 16  NumCompaniesWorked        1100 non-null   int64 
 17  Over18                    1100 non-null   object
 18  OverTime                  1100 non-null   object
 19  PercentSalaryHike         1100 non-null   int64 
 20  PerformanceRating         1100 non-null   int64 
 21  RelationshipSatisfaction  1100 non-null   int64 
 22  StandardHours             1100 non-null   int64 
 23  StockOptionLevel          1100 non-null   int64 
 24  TotalWorkingYears         1100 non-null   int64 
 25  TrainingTimesLastYear     1100 non-null   int64 
 26  WorkLifeBalance           1100 non-null   int64 
 27  YearsAtCompany            1100 non-null   int64 
 28  YearsInCurrentRole        1100 non-null   int64 
 29  YearsSinceLastPromotion   1100 non-null   int64 
 30  YearsWithCurrManager      1100 non-null   int64 
 31  Label                     1100 non-null   int64 
dtypes: int64(24), object(8)
memory usage: 275.1+ KB
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ID                        350 non-null    int64 
 1   Age                       350 non-null    int64 
 2   BusinessTravel            350 non-null    object
 3   Department                350 non-null    object
 4   DistanceFromHome          350 non-null    int64 
 5   Education                 350 non-null    int64 
 6   EducationField            350 non-null    object
 7   EmployeeNumber            350 non-null    int64 
 8   EnvironmentSatisfaction   350 non-null    int64 
 9   Gender                    350 non-null    object
 10  JobInvolvement            350 non-null    int64 
 11  JobLevel                  350 non-null    int64 
 12  JobRole                   350 non-null    object
 13  JobSatisfaction           350 non-null    int64 
 14  MaritalStatus             350 non-null    object
 15  MonthlyIncome             350 non-null    int64 
 16  NumCompaniesWorked        350 non-null    int64 
 17  Over18                    350 non-null    object
 18  OverTime                  350 non-null    object
 19  PercentSalaryHike         350 non-null    int64 
 20  PerformanceRating         350 non-null    int64 
 21  RelationshipSatisfaction  350 non-null    int64 
 22  StandardHours             350 non-null    int64 
 23  StockOptionLevel          350 non-null    int64 
 24  TotalWorkingYears         350 non-null    int64 
 25  TrainingTimesLastYear     350 non-null    int64 
 26  WorkLifeBalance           350 non-null    int64 
 27  YearsAtCompany            350 non-null    int64 
 28  YearsInCurrentRole        350 non-null    int64 
 29  YearsSinceLastPromotion   350 non-null    int64 
 30  YearsWithCurrManager      350 non-null    int64 
dtypes: int64(23), object(8)
memory usage: 84.9+ KB

The dataset is quite simple and has no missing values, so no missing-value handling is needed.

train.columns
Index(['ID', 'Age', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel',
       'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome',
       'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Label'],
      dtype='object')

The data contain many categorical variables, so let's look at their distributions.

catfeatures = ['BusinessTravel', 'Department', 'Education', 'EducationField', 
       'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel',
       'JobRole', 'JobSatisfaction', 'MaritalStatus', 
       'NumCompaniesWorked', 'Over18', 'OverTime',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel','TrainingTimesLastYear','WorkLifeBalance']
for feature in catfeatures:
    print(train[feature].value_counts())
    print('----------------')
Travel_Rarely        787
Travel_Frequently    205
Non-Travel           108
Name: BusinessTravel, dtype: int64
----------------
Research & Development    727
Sales                     331
Human Resources            42
Name: Department, dtype: int64
----------------
3    431
4    301
2    206
1    126
5     36
Name: Education, dtype: int64
----------------
Life Sciences       462
Medical             337
Marketing           127
Technical Degree     92
Other                63
Human Resources      19
Name: EducationField, dtype: int64
----------------
4    338
3    337
1    215
2    210
Name: EnvironmentSatisfaction, dtype: int64
----------------
Male      653
Female    447
Name: Gender, dtype: int64
----------------
3    661
2    273
4    103
1     63
Name: JobInvolvement, dtype: int64
----------------
1    412
2    399
3    157
4     81
5     51
Name: JobLevel, dtype: int64
----------------
Sales Executive              247
Research Scientist           221
Laboratory Technician        205
Manufacturing Director       101
Healthcare Representative    100
Manager                       80
Sales Representative          57
Research Director             56
Human Resources               33
Name: JobRole, dtype: int64
----------------
4    350
3    325
1    219
2    206
Name: JobSatisfaction, dtype: int64
----------------
Married     500
Single      362
Divorced    238
Name: MaritalStatus, dtype: int64
----------------
1    390
0    151
3    114
2    113
4    101
7     56
6     52
5     45
8     41
9     37
Name: NumCompaniesWorked, dtype: int64
----------------
Y    1100
Name: Over18, dtype: int64
----------------
No     794
Yes    306
Name: OverTime, dtype: int64
----------------
3    932
4    168
Name: PerformanceRating, dtype: int64
----------------
3    340
4    323
1    220
2    217
Name: RelationshipSatisfaction, dtype: int64
----------------
80    1100
Name: StandardHours, dtype: int64
----------------
0    473
1    446
2    122
3     59
Name: StockOptionLevel, dtype: int64
----------------
2    396
3    379
4     94
5     89
1     50
6     48
0     44
Name: TrainingTimesLastYear, dtype: int64
----------------
3    678
2    256
4    103
1     63
Name: WorkLifeBalance, dtype: int64
----------------
sns.set_style('whitegrid')
for feature in catfeatures:
    train[[feature,'Label']].groupby([feature]).mean().plot.bar()

(figures: mean Label by category for each categorical feature)
The plots show that the 'StandardHours' and 'Over18' columns each take only a single value, so we drop them.

train.drop(['StandardHours','Over18'],axis=1,inplace=True)
test.drop(['StandardHours','Over18'],axis=1,inplace=True)
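Rather than spotting single-valued columns by eye, they can be detected programmatically; a small sketch (the helper name is mine, not from the notebook):

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that hold a single unique value."""
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant_cols)

demo = pd.DataFrame({
    'StandardHours': [80, 80, 80],  # constant -> dropped
    'Over18':        ['Y', 'Y', 'Y'],  # constant -> dropped
    'Age':           [25, 31, 40],  # varies -> kept
})
print(list(drop_constant_columns(demo).columns))  # ['Age']
```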

Checking Correlations

columns = train.columns.drop('ID')
correlation = train[columns].corr()
plt.figure(figsize=(15, 15)) 
sns.heatmap(correlation,square = True, annot=True, fmt='0.2f',vmax=0.8)

(figure: correlation heatmap)
The correlation matrix reveals strong collinearity between 'MonthlyIncome' and 'JobLevel', so we drop 'MonthlyIncome', the one less correlated with the label.

train.drop(['MonthlyIncome'],axis=1,inplace=True)
test.drop(['MonthlyIncome'],axis=1,inplace=True)
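Highly correlated pairs can also be extracted programmatically instead of read off the heatmap; a sketch with a hypothetical `correlated_pairs` helper on toy data:

```python
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes('number').corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

demo = pd.DataFrame({
    'JobLevel':      [1, 2, 3, 4, 5],
    'MonthlyIncome': [2000, 4500, 5500, 8200, 9000],  # rises with JobLevel
    'Age':           [40, 22, 35, 28, 31],
})
print(correlated_pairs(demo, threshold=0.9))  # flags the JobLevel/MonthlyIncome pair
```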

Feature Engineering

We can build a new feature, total satisfaction, by summing the three satisfaction scores.

def fea_creat(df):
    df['Satisfaction'] = df['JobSatisfaction'] + df['EnvironmentSatisfaction'] + df['RelationshipSatisfaction']
fea_creat(train)
fea_creat(test)

One-hot encode the categorical columns with get_dummies to prepare the data for modeling.

train = pd.get_dummies(train)
test = pd.get_dummies(test)
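One caveat with calling get_dummies separately on train and test: a category that appears in only one split produces mismatched columns. (Here the splits happen to agree, as the later shape check confirms.) `DataFrame.align` guards against this; a sketch with hypothetical data:

```python
import pandas as pd

train_demo = pd.DataFrame({'Dept': ['Sales', 'R&D', 'HR']})
test_demo = pd.DataFrame({'Dept': ['Sales', 'Sales']})  # 'R&D' and 'HR' absent

train_d = pd.get_dummies(train_demo)
test_d = pd.get_dummies(test_demo)

# Align on columns: test gains the missing dummy columns, filled with 0
train_d, test_d = train_d.align(test_d, join='left', axis=1, fill_value=0)
print(list(train_d.columns) == list(test_d.columns))  # True
```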

EmployeeNumber is just an identifier, so we drop it.

train.drop(['EmployeeNumber'],axis=1,inplace=True)
test.drop(['EmployeeNumber'],axis=1,inplace=True)

Apply min-max normalization to the features:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

df_train = train.drop(['ID','Label'],axis=1)
df_test = test.drop('ID',axis=1)
train_scaled = scaler.fit_transform(df_train)
test_scaled = scaler.transform(df_test)
df_train.iloc[:,:] = train_scaled[:,:]
df_test.iloc[:,:] = test_scaled[:,:]
df_train = pd.concat([df_train,train['Label']],axis=1)

Modeling

X_data = df_train.drop('Label',axis=1)
Y_data = df_train['Label']
X_test  = df_test

print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (1100, 48)
X test shape: (350, 48)
# Cross-validate multiple models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

models = {
    'LR': LogisticRegression(solver='liblinear', penalty='l2', C=1),
    'SVM': SVC(C=1, gamma='auto'),
    'DT': DecisionTreeClassifier(),
    'RF' : RandomForestClassifier(n_estimators=100),
    'AdaBoost': AdaBoostClassifier(n_estimators=100),
    'GBDT': GradientBoostingClassifier(n_estimators=100),
    'XGB': xgb.XGBClassifier(max_depth=10,subsample=0.7,colsample_bytree=0.75,n_estimators=100),
    'LGB': lgb.LGBMClassifier(num_leaves=120,n_estimators = 100)
}

for k, clf in models.items():
    print("the model is {}".format(k))
    scores = cross_val_score(clf, X_data, Y_data, cv=10)
    print(scores)
    print("Mean accuracy is {}".format(np.mean(scores)))
    print("-" * 100)
the model is LR
[0.91891892 0.85585586 0.90909091 0.87272727 0.89090909 0.88181818
 0.85454545 0.88181818 0.8440367  0.86238532]
Mean accuracy is 0.877210588403249
----------------------------------------------------------------------------------------------------
the model is SVM
[0.83783784 0.83783784 0.83636364 0.83636364 0.83636364 0.83636364
 0.83636364 0.83636364 0.8440367  0.8440367 ]
Mean accuracy is 0.8381930888352906
----------------------------------------------------------------------------------------------------
the model is DT
[0.81081081 0.82882883 0.80909091 0.79090909 0.82727273 0.79090909
 0.76363636 0.74545455 0.78899083 0.74311927]
Mean accuracy is 0.7899022458655487
----------------------------------------------------------------------------------------------------
the model is RF
[0.88288288 0.84684685 0.87272727 0.87272727 0.86363636 0.85454545
 0.86363636 0.85454545 0.87155963 0.87155963]
Mean accuracy is 0.8654667177602958
----------------------------------------------------------------------------------------------------
the model is AdaBoost
[0.90990991 0.81981982 0.83636364 0.83636364 0.89090909 0.86363636
 0.86363636 0.82727273 0.86238532 0.89908257]
Mean accuracy is 0.8609379437819804
----------------------------------------------------------------------------------------------------
the model is GBDT
[0.88288288 0.87387387 0.85454545 0.85454545 0.9        0.86363636
 0.84545455 0.85454545 0.86238532 0.83486239]
Mean accuracy is 0.8626731735906048
----------------------------------------------------------------------------------------------------
the model is XGB
[0.89189189 0.85585586 0.86363636 0.86363636 0.9        0.84545455
 0.84545455 0.86363636 0.83486239 0.86238532]
Mean accuracy is 0.8626813635987949
----------------------------------------------------------------------------------------------------
the model is LGB
[0.88288288 0.85585586 0.86363636 0.80909091 0.88181818 0.84545455
 0.84545455 0.86363636 0.8440367  0.85321101]
Mean accuracy is 0.8545077354251667
----------------------------------------------------------------------------------------------------

The plain logistic regression model performs best, so we use it directly for the final predictions and write out the submission.

clf = LogisticRegression(solver='liblinear', penalty='l2', C=1)
clf.fit(X_data, Y_data)
result = clf.predict(X_test)
file = pd.DataFrame()
file['ID'] = test.ID
file['Label'] = result
file.to_csv('sub.csv',index=False)
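Since logistic regression won with the default C=1, a natural follow-up would be a small grid search over its regularization strength; a sketch on synthetic data (make_classification stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the competition features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver='liblinear', penalty='l2'),
    param_grid={'C': [0.01, 0.1, 1, 10]},  # regularization strengths to try
    cv=5, scoring='accuracy',
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```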