Logistic Regression - IBM Employee Attrition Prediction

Article link: https://blog.csdn.net/weixin_43937759/article/details/106952602

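This article shows how to use logistic regression to predict IBM employee attrition, covering data loading, preprocessing, visual analysis, and model training and evaluation. After preprocessing steps such as filling missing values and encoding features as numbers, the final model reaches a prediction accuracy of 83.79%.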


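Every step from recruiting to training an employee costs a company considerable resources, so each departure brings some loss. To understand why employees leave and to identify those likely to leave, IBM published real employee data together with the following problem statement:
"Predict employee attrition, that is, whether an employee will leave, given the employee's details, that is, the factors that lead to attrition."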

This article uses logistic regression to explore the problem.
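As a reminder of what the model estimates, the standard logistic regression formulation is given below. The notation is introduced here for illustration and does not come from the original post: x is an employee's feature vector, w and b are the learned weights and bias, and y = 1 means the employee leaves.

```latex
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

An employee is then classified as likely to leave when this probability exceeds a chosen threshold, commonly 0.5.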

1. Preparation

import matplotlib.pyplot as plt
import pylab as pl
import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import display  # public path; importing display from IPython.core.display is deprecated
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# In a Jupyter notebook, run `%matplotlib inline` here to render plots inline.
sns.set()  # apply seaborn's default plot styling

2. Loading the data

# Load the dataset with pandas
data = pd.read_csv('/.../logistic_regression_data.csv', sep=",")
data.head()  # output shown below

Only part of the output is shown here.
[Image: first rows returned by data.head()]
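Beyond looking at the first rows, a quick check of the table's size, column types, and class balance is often useful before preprocessing. The sketch below runs on a small hypothetical frame rather than the real file; the `Attrition` and `Department` columns and all values are assumptions made for illustration, and only `Age` is a column named in this article.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are made up.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],  # assumed name of the target column
    "Department": ["Sales", "Research", None, "Sales"],  # assumed column
})

n_rows, n_cols = sample.shape                # table dimensions
column_types = sample.dtypes                 # numeric vs. text columns
attrition_share = sample["Attrition"].value_counts(normalize=True)  # class balance

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(attrition_share)
```

On the real data the same three checks can be applied to `data` directly. The class balance matters for reading an accuracy figure later, because a model that always predicts the majority class already scores the majority share.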

3. Data processing

Fill in the missing values:

# Data preprocessing
data.fillna(0, inplace=True)
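Filling every column with 0 also writes a numeric 0 into text columns, which may not be what is wanted. The following is a minimal sketch of one alternative, not the author's method: count the gaps first, then fill numeric and text columns separately. The toy frame, the `Department` column, and the `"Unknown"` placeholder are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "Age": [29.0, np.nan, 45.0],
    "Department": ["Sales", None, "HR"],  # assumed column
})

# Count missing entries per column before changing anything.
missing_per_column = toy.isna().sum()

filled = toy.copy()
numeric_cols = filled.select_dtypes(include="number").columns
text_cols = filled.columns.difference(numeric_cols)

# Numeric gaps become 0 (as in the article); text gaps get a placeholder label.
filled[numeric_cols] = filled[numeric_cols].fillna(0)
filled[text_cols] = filled[text_cols].fillna("Unknown")

print(missing_per_column)
print(filled)
```

Whether 0 is a sensible fill value depends on the column. For a feature such as age, a median or a separate "missing" indicator may distort the model less.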

The Age column spans a wide range of values, so we group this feature into bins:

# function to create group of ages, this helps because we have 78 different values here
def Age(dataframe):
    dataframe.loc[dataframe['Age'] <= 30, 'Age'] = 1
    dataframe.loc[(dataframe['Age'] > 30) & (dataframe['Age'] <= 40), 'Age'] = 2
    dataframe.loc[(dataframe['