Predicting Employee Attrition with Machine Learning

Latest recommended article published on 2025-05-03 08:38:15

放学-别走

Tags: machine learning, artificial intelligence, python, course project, graduation project

Article link: https://blog.csdn.net/lhyandlwl/article/details/136991954

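Background

Employee attrition is a serious problem for any organization: it can reduce productivity and raise costs. By analyzing attrition data and training machine learning models to predict whether an employee will leave, we can take preventive measures and keep attrition as low as possible.

Exploratory Data Analysis (EDA)
First, let's load the data and run some basic exploratory analysis.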

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
data = pd.read_csv('employee.csv')

# Exploratory data analysis
data.info()  # info() prints its summary directly and returns None
print(data.describe())

# Visualize the overall attrition distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='Attrition', data=data)
plt.title('Attrition Distribution')
plt.show()

# Visualize attrition across departments
plt.figure(figsize=(10, 6))
sns.countplot(x='Department', hue='Attrition', data=data)
plt.title('Attrition in Different Departments')
plt.show()

# Visualize the relationship between age and attrition
plt.figure(figsize=(10, 6))
sns.boxplot(x='Attrition', y='Age', data=data)
plt.title('Age vs Attrition')
plt.show()
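
The count plots show raw counts, which can be hard to compare when departments differ in size. A minimal sketch of the same comparison as numeric rates follows. The toy DataFrame is hypothetical and only stands in for `employee.csv`; the column names `Department` and `Attrition` are taken from the plotting code above.

```python
import pandas as pd

# Hypothetical toy data standing in for employee.csv.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "R&D", "R&D", "R&D", "R&D", "HR"],
    "Attrition":  ["Yes",   "No",    "Yes",   "No",  "No",  "Yes", "No",  "No"],
})

# Share of each class overall (shows class imbalance at a glance).
class_share = toy["Attrition"].value_counts(normalize=True)

# Attrition rate per department: the mean of a boolean "left" flag.
rate_by_dept = (
    toy.assign(left=toy["Attrition"].eq("Yes"))
       .groupby("Department")["left"]
       .mean()
       .sort_values(ascending=False)
)

print(class_share)
print(rate_by_dept)
```

On the real data, the same two statements can be run on `data` directly in place of `toy`.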

Data Preprocessing
Next, we preprocess the data: drop irrelevant features, encode the categorical variables, and scale the numeric features.

from sklearn.preprocessing import StandardScaler

# Drop irrelevant features, then encode the target and the categorical columns
data.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis=1, inplace=True)
data['Attrition'] = data['Attrition'].map({'No': 0, 'Yes': 1})
data_encoded = pd.get_dummies(data, drop_first=True)

# Feature scaling: write the scaled values back into data_encoded
# so that the models trained later actually see them
features_to_scale = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate',
                     'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears', 'YearsAtCompany',
                     'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
scaler = StandardScaler()
data_encoded[features_to_scale] = scaler.fit_transform(data_encoded[features_to_scale])
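
To make the encoding and scaling steps concrete, here is a small sketch on a hypothetical three-row frame (the values are invented for illustration, not taken from the dataset). It shows that `drop_first=True` keeps one indicator column fewer than the number of categories, and that `StandardScaler` leaves a column with zero mean and unit variance.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature version of the employee table.
toy = pd.DataFrame({
    "Age": [25, 35, 45],
    "Department": ["Sales", "HR", "Sales"],
    "Attrition": ["No", "Yes", "No"],
})

# Encode the target as 0/1 first so get_dummies leaves it alone.
toy["Attrition"] = toy["Attrition"].map({"No": 0, "Yes": 1})

# Department has two levels (HR, Sales); drop_first drops the first one,
# leaving a single indicator column, Department_Sales.
encoded = pd.get_dummies(toy, drop_first=True)

# Standardize Age: subtract the mean, divide by the population std.
encoded[["Age"]] = StandardScaler().fit_transform(encoded[["Age"]])

print(encoded)
```

Dropping the first level avoids a redundant column, since the dropped category is implied when all remaining indicators are zero.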

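Model Building and Evaluation
Now we split the dataset into a training set and a test set, then build a logistic regression model and a random forest model to predict employee attrition.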

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split the dataset into features/target and train/test sets
X = data_encoded.drop('Attrition', axis=1)
y = data_encoded['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the logistic regression model
# (a higher max_iter gives the solver more room to converge)
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)

# Build the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Print the results
print("Logistic Regression Accuracy:", accuracy_logistic)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logistic))

print("\nRandom Forest Accuracy:", accuracy_rf)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
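
Accuracy alone can be misleading when one class is much rarer than the other, and fitting the scaler before the split lets test-set statistics influence training. The sketch below is one possible variation, not the method used above: it puts the scaler inside a scikit-learn pipeline so it is fitted on the training data only, stratifies the split, weights the classes, and reports a confusion matrix and ROC AUC. The data is synthetic, and the 85/15 class ratio is an assumption chosen only to mimic an imbalanced target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded employee data (assumed 85/15 imbalance).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.85, 0.15], random_state=42,
)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler is fitted on the training set only, inside the pipeline.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
auc = roc_auc_score(y_test, y_score)
print(cm)
print("ROC AUC:", round(auc, 3))
```

With the real data, `X` and `y` would come from the unscaled `data_encoded` frame, and the same pipeline pattern could wrap `RandomForestClassifier` as well.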