Kaggle dataset HR Analytics: Job Change of Data Scientists (XGBoost)

Latest recommended article published 2023-08-07 21:44:35

小番薯emmm

Tags: machine learning

Article link: https://blog.csdn.net/weixin_47308674/article/details/114051214


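This article analyzes the HR Analytics dataset on Kaggle, with the goal of predicting whether data scientists are looking for a job change. Data exploration suggests that men and employees of private companies are more inclined to change jobs. In preprocessing, unnecessary variables are dropped, missing values are filled, and categorical variables are encoded. SMOTE is then used to handle class imbalance, and XGBoost is used for modeling. Despite slight overfitting, the model shows good classification accuracy and AUC on both the training and test sets.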


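Dataset introduction
A company working in big data and data science wants to hire data scientists from among the people who successfully pass the courses it runs. Many people sign up, and the company wants to know which of them are actually looking for a job after training. Demographic, education, and experience information is collected when candidates register. The task is to use this information to build a model that predicts whether a candidate is looking for a new job or will stay with their current company.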
Features:
enrollee_id: Unique ID for candidate
city: City code
city_development_index: Development index of the city (scaled)
gender: Gender of candidate
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Education major discipline of candidate
experience: Candidate total experience in years
company_size: Number of employees in current employer's company
company_type: Type of current employer
last_new_job: Difference in years between previous job and current job
training_hours: Training hours completed
target: 0 – Not looking for job change, 1 – Looking for a job change

Data exploration and visualization
First, import the required modules.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
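# Note: plot_roc_curve was removed in scikit-learn 1.2; on newer versions, drop it here and use RocCurveDisplay.from_estimator instead
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, confusion_matrix, log_loss, plot_roc_curve, auc, precision_recall_curve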

from sklearn.preprocessing import StandardScaler

df=pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
df.head()

print('shape of train is {}'.format(df.shape))

The dataset has 19,158 rows and 14 columns. Next, look at each variable.

Missing values

df.isnull().sum()
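
Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not the Kaggle data) to show one way of tabulating the missing count and percentage per column. The same construction could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration only.
toy = pd.DataFrame({
    'gender': ['Male', np.nan, 'Female', np.nan],
    'company_size': ['50-99', np.nan, np.nan, np.nan],
    'training_hours': [36, 47, 83, 52],
})

# Count missing values per column and express them as a percentage of rows.
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
}).sort_values('pct_missing', ascending=False)

print(missing)
```

Sorting by percentage puts the columns that need the most attention during imputation at the top.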

Next, look at which kinds of people are more inclined to look for a new job.
gender

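# Pass the column by keyword; recent seaborn versions no longer accept it as the first positional argument
sns.countplot(x='gender', data=df)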

Men far outnumber women in the dataset.
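
To put a number on that imbalance, category shares can be computed directly. This is a minimal sketch on invented values (the `toy_gender` series is hypothetical, not taken from the Kaggle file). Passing `dropna=False` keeps missing entries visible as their own row.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; not taken from the Kaggle file.
toy_gender = pd.Series(
    ['Male', 'Male', 'Male', 'Female', np.nan, 'Male', 'Other', 'Male']
)

# Share of each category in percent, with missing values counted separately.
shares = toy_gender.value_counts(normalize=True, dropna=False) * 100
print(shares.round(1))
```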

# Gender breakdown among candidates who are looking for a new job (target == 1)
gender = df[df['target']==1]['gender']
temp = gender.value_counts()
labels = temp.keys()
fig, ax = plt.subplots(figsize=(8,8))
plt.pie(x=temp, labels=labels, colors=['green','yellow','red'], autopct='%.2f%%', pctdistance=0.6)
plt.title('Gender split among candidates looking for a new job', fontsize=20)

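# Job seekers (target == 1) within each gender
male_newjob = df[(df['gender']=='Male') & (df['target']==1)]
female_newjob = df[(df['gender']=='Female') & (df['target']==1)]
# Divide by the number of men/women; len() of a boolean mask would return the total row count instead
n_male = (df['gender']=='Male').sum()
n_female = (df['gender']=='Female').sum()
print('{:.2f} % of males are looking for a new job'.format(len(male_newjob)/n_male*100))
print('{:.2f} % of females are looking for a new job'.format(len(female_newjob)/n_female*100))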
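
The pie chart shows how job seekers split by gender, which largely mirrors how many people of each gender are in the data. The rate within each gender is a different quantity: job seekers of that gender divided by everyone of that gender. A minimal sketch of both calculations on invented rows (the `toy` frame is hypothetical) follows. Because `target` is 0/1, the group mean is the rate.

```python
import pandas as pd

# Invented rows for illustration only: 4 men (1 job seeker), 2 women (1 job seeker).
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'target': [1, 0, 0, 0, 1, 0],
})

# Share of all job seekers by gender (the quantity a pie chart of target == 1 shows).
share_of_seekers = (
    toy.loc[toy['target'] == 1, 'gender'].value_counts(normalize=True) * 100
)

# Rate within each gender: the mean of a 0/1 target is the fraction of 1s.
# Rows with a missing gender are excluded by groupby by default.
rate_within_gender = toy.groupby('gender')['target'].mean() * 100

print(share_of_seekers)
print(rate_within_gender)
```

In this toy frame men and women each account for half of the job seekers, yet the rate among women is twice the rate among men, which is why the two views should not be read interchangeably.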