【阿旭机器学习实战】【35】员工离职率预测---决策树与随机森林预测

【阿旭机器学习实战】系列文章主要介绍机器学习的各种算法模型及其实战案例,欢迎点赞,关注共同学习交流。

本文的主要任务是通过决策树与随机森林模型预测一个员工离职的可能性并帮助人事部门理解员工为何离职。

1.获取数据

关注GZH:阿旭算法与机器学习,回复:“ML35”即可获取本文数据集、源码与项目文档

# 引入工具包
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline
# 读入数据到Pandas Dataframe "df"
df = pd.read_csv('HR_comma_sep.csv', index_col=None)

2.数据预处理

# 检测是否有缺失数据
df.isnull().any()
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool
# 数据的样例
df.head()
satisfaction_levellast_evaluationnumber_projectaverage_montly_hourstime_spend_companyWork_accidentleftpromotion_last_5yearssalessalary
00.380.5321573010saleslow
10.800.8652626010salesmedium
20.110.8872724010salesmedium
30.720.8752235010saleslow
40.370.5221593010saleslow

注:“turnover”列为标签:1表示离职,0表示不离职,其他列均为特征值

# 重命名
df = df.rename(columns={'satisfaction_level': 'satisfaction', 
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales' : 'department',
                        'left' : 'turnover'
                        })
# 将预测标签‘是否离职’放在第一列
front = df['turnover']
df.drop(labels=['turnover'], axis=1, inplace = True)
df.insert(0, 'turnover', front)
df.head()
turnoversatisfactionevaluationprojectCountaverageMonthlyHoursyearsAtCompanyworkAccidentpromotiondepartmentsalary
010.380.532157300saleslow
110.800.865262600salesmedium
210.110.887272400salesmedium
310.720.875223500saleslow
410.370.522159300saleslow

3.分析数据

  • 14999 条数据, 每一条数据包含 10 个特征
  • 总的离职率: 24%
  • 平均满意度为 0.61
df.shape
(14999, 10)
# 特征数据类型. 
df.dtypes
turnover                 int64
satisfaction           float64
evaluation             float64
projectCount             int64
averageMonthlyHours      int64
yearsAtCompany           int64
workAccident             int64
promotion                int64
department              object
salary                  object
dtype: object
turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate
0    0.761917
1    0.238083
Name: turnover, dtype: float64
# 显示统计数据
df.describe()
turnoversatisfactionevaluationprojectCountaverageMonthlyHoursyearsAtCompanyworkAccidentpromotion
count14999.00000014999.00000014999.00000014999.00000014999.00000014999.00000014999.00000014999.000000
mean0.2380830.6128340.7161023.803054201.0503373.4982330.1446100.021268
std0.4259240.2486310.1711691.23259249.9430991.4601360.3517190.144281
min0.0000000.0900000.3600002.00000096.0000002.0000000.0000000.000000
25%0.0000000.4400000.5600003.000000156.0000003.0000000.0000000.000000
50%0.0000000.6400000.7200004.000000200.0000003.0000000.0000000.000000
75%0.0000000.8200000.8700005.000000245.0000004.0000000.0000000.000000
max1.0000001.0000001.0000007.000000310.00000010.0000001.0000001.000000
# 分组的平均数据统计
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean()
satisfactionevaluationprojectCountaverageMonthlyHoursyearsAtCompanyworkAccidentpromotion
turnover
00.6668100.7154733.786664199.0602033.3800320.1750090.026251
10.4400980.7181133.855503207.4192103.8765050.0473260.005321

3.1 相关性分析

# 相关性矩阵
corr = df.corr()
#corr = (corr)
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

corr
turnoversatisfactionevaluationprojectCountaverageMonthlyHoursyearsAtCompanyworkAccidentpromotion
turnover1.000000-0.3883750.0065670.0237870.0712870.144822-0.154622-0.061788
satisfaction-0.3883751.0000000.105021-0.142970-0.020048-0.1008660.0586970.025605
evaluation0.0065670.1050211.0000000.3493330.3397420.131591-0.007104-0.008684
projectCount0.023787-0.1429700.3493331.0000000.4172110.196786-0.004741-0.006064
averageMonthlyHours0.071287-0.0200480.3397420.4172111.0000000.127755-0.010143-0.003544
yearsAtCompany0.144822-0.1008660.1315910.1967860.1277551.0000000.0021200.067433
workAccident-0.1546220.058697-0.007104-0.004741-0.0101430.0021201.0000000.039245
promotion-0.0617880.025605-0.008684-0.006064-0.0035440.0674330.0392451.000000

请添加图片描述


正相关的特征:

  • projectCount VS evaluation: 0.349333
  • projectCount VS averageMonthlyHours: 0.417211
  • averageMonthlyHours VS evaluation: 0.339742

负相关的特征:

  • satisfaction VS turnover: -0.388375
# 比较离职和未离职员工的满意度
emp_population = df['satisfaction'][df['turnover'] == 0].mean()
emp_turnover_satisfaction = df[df['turnover']==1]['satisfaction'].mean()

print( '未离职员工满意度: ' + str(emp_population))
print( '离职员工满意度: ' + str(emp_turnover_satisfaction) )
未离职员工满意度: 0.666809590479516
离职员工满意度: 0.44009801176140917

3.2 进行 T-Test


进行一个 t-test, 看离职员工的满意度是不是和未离职员工的满意度明显不同

import scipy.stats as stats
stats.ttest_1samp(a = df[df['turnover']==1]['satisfaction'], # 离职员工的满意度样本
                  popmean = emp_population)  # 未离职员工的满意度均值
Ttest_1sampResult(statistic=-51.3303486754725, pvalue=0.0)

T-Test 显示pvalue (0) 非常小, 所以他们之间是显著不同的

degree_freedom = len(df[df['turnover']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # 95%致信区间的左边界

RQ = stats.t.ppf(0.975,degree_freedom)  # 95%致信区间的右边界

print ('The t-分布 左边界: ' + str(LQ))
print ('The t-分布 右边界: ' + str(RQ))

The t-分布 左边界: -1.9606285215955626
The t-分布 右边界: 1.9606285215955621
# 概率密度函数估计
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'evaluation'] , color='b',shade=True,label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'evaluation'] , color='r',shade=True, label='turnover')
ax.set(xlabel='Employee Evaluation', ylabel='Frequency')
ax.legend()
plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')
Text(0.5, 1.0, 'Employee Evaluation Distribution - Turnover V.S. No Turnover')

在这里插入图片描述

# 概率密度函数估计
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'averageMonthlyHours'] , color='b',shade=True, label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'averageMonthlyHours'] , color='r',shade=True, label='turnover')
ax.legend()
ax.set(xlabel='Employee Average Monthly Hours', ylabel='Frequency')
plt.title('Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')
Text(0.5, 1.0, 'Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')

在这里插入图片描述

# 概率密度函数估计
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'satisfaction'] , color='b',shade=True, label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'satisfaction'] , color='r',shade=True, label='turnover')
plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')
ax.legend()
<matplotlib.legend.Legend at 0x281a5a6b820>

在这里插入图片描述

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve

# 将string类型转换为整数类型
df["department"] = df["department"].astype('category').cat.codes
df["salary"] = df["salary"].astype('category').cat.codes

# 产生X, y
target_name = 'turnover'
X = df.drop('turnover', axis=1)
y = df[target_name]

# 将数据分为训练和测试数据集
# 注意参数 stratify = y 意味着在产生训练和测试数据中, 离职的员工的百分比等于原来总的数据中的离职的员工的百分比
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)

df.head()
turnoversatisfactionevaluationprojectCountaverageMonthlyHoursyearsAtCompanyworkAccidentpromotiondepartmentsalary
010.380.53215730071
110.800.86526260072
210.110.88727240072
310.720.87522350071
410.370.52215930071

4. 建立预测模型:Decision Tree V.S. Random Forest

from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier


# 决策树
dtree = tree.DecisionTreeClassifier(
    criterion='entropy',
    #max_depth=3, # 定义树的深度, 可以用来防止过拟合
    min_weight_fraction_leaf=0.01 # 定义叶子节点最少需要包含多少个样本(使用百分比表达), 防止过拟合
    )
dtree = dtree.fit(X_train,y_train)
print ("\n\n ---决策树---")
dt_roc_auc = roc_auc_score(y_test, dtree.predict(X_test))
print ("决策树 AUC = %2.2f" % dt_roc_auc)
print(classification_report(y_test, dtree.predict(X_test)))

# 随机森林
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1000, 
    max_depth=None, # 定义树的深度, 可以用来防止过拟合
    min_samples_split=10, # 定义至少多少个样本的情况下才继续分叉
    #min_weight_fraction_leaf=0.02 # 定义叶子节点最少需要包含多少个样本(使用百分比表达), 防止过拟合
    )
rf.fit(X_train, y_train)
print ("\n\n ---随机森林---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print ("随机森林 AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
 ---决策树---
决策树 AUC = 0.93
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      1714
           1       0.93      0.89      0.91       536

    accuracy                           0.96      2250
   macro avg       0.95      0.93      0.94      2250
weighted avg       0.96      0.96      0.96      2250



 ---随机森林---
随机森林 AUC = 0.97
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1714
           1       0.99      0.94      0.97       536

    accuracy                           0.98      2250
   macro avg       0.99      0.97      0.98      2250
weighted avg       0.98      0.98      0.98      2250

5. 模型评估

5.1ROC 图


# ROC 图
from sklearn.metrics import roc_curve
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, dtree.predict_proba(X_test)[:,1])

plt.figure()

# 随机森林 ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)

# 决策树 ROC
plt.plot(dt_fpr, dt_tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

在这里插入图片描述

5.2通过决策树分析不同的特征的重要性

## 画出决策树特征的重要性 ##
importances = rf.feature_importances_
feat_names = df.drop(['turnover'],axis=1).columns


indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,6))
plt.title("Feature importances by RandomForest")
plt.bar(range(len(indices)), importances[indices], color='lightblue',  align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()

请添加图片描述

## 画出决策树的特征的重要性 ##
importances = dtree.feature_importances_
feat_names = df.drop(['turnover'],axis=1).columns


indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,6))
plt.title("Feature importances by Decision Tree")
plt.bar(range(len(indices)), importances[indices], color='lightblue',  align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()

请添加图片描述

如果文章对你有帮助,感谢点赞+关注!

关注下方GZH:阿旭算法与机器学习,回复:“ML35”即可获取本文数据集、源码与项目文档,欢迎共同学习交流

  • 2
    点赞
  • 62
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

阿_旭

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值