Kaggle Dataset HR Analytics: Job Change of Data Scientists (XGBoost)

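This article analyzes the HR Analytics dataset on Kaggle, with the goal of predicting whether data scientists are looking for a job change. Data exploration suggests that men and private-sector employees are more inclined to change jobs. In preprocessing, unnecessary variables are dropped, missing values are filled, and categorical variables are encoded. SMOTE is then used to address class imbalance, and XGBoost is used for modeling. Despite slight overfitting, the model shows good classification accuracy and AUC on both the training and test sets.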
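  1. Dataset Introduction
    A company working in big data and data science wants to hire data scientists from among the people who successfully complete the courses it runs. Many people sign up, and the company wants to know which of them are really looking for job opportunities after training. Once candidates register, information about their demographics, education, experience, and so on becomes available. This information is used to build a model that predicts whether a candidate is looking for a new job or will stay at their current company.
    Features: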
    enrollee_id: Unique ID for candidate
    city: City code
    city_development_index: Development index of the city (scaled)
    gender: Gender of candidate
    relevent_experience: Relevant experience of candidate
    enrolled_university: Type of University course enrolled if any
    education_level: Education level of candidate
    major_discipline: Education major discipline of candidate
    experience: Candidate total experience in years
    company_size: No of employees in current employer's company
    company_type: Type of current employer
    last_new_job: Difference in years between previous job and current job
    training_hours: training hours completed
    target: 0 – Not looking for job change, 1 – Looking for a job change

  2. Data Exploration and Visualization
    First, import the required modules.

    import pandas as pd
    import numpy as np

    # Plotting
    import matplotlib.pyplot as plt
    import seaborn as sns
    from collections import Counter

    # Resampling, preprocessing, model selection, and the model itself
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.model_selection import train_test_split, RandomizedSearchCV
    from xgboost import XGBClassifier

    # Evaluation metrics
    # Note: plot_roc_curve was removed in scikit-learn 1.2; on newer versions
    # drop it from this import and use RocCurveDisplay.from_estimator instead.
    from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, confusion_matrix, log_loss, plot_roc_curve, auc, precision_recall_curve

    # Load the training data (Kaggle notebook path)
    df = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
    df.head()


     

    print('shape of train is {}'.format(df.shape))


    The dataset has 19,158 rows and 14 columns. Next, look at each variable.
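One quick way to look at each variable is to tabulate its dtype and its number of distinct values. The sketch below is only an illustration: `toy` is a made-up four-row frame standing in for `df`, and `variable_overview` is a hypothetical helper that does not appear in the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "city": ["city_103", "city_40", "city_103", "city_21"],
    "city_development_index": [0.92, 0.776, 0.92, 0.624],
    "training_hours": [36, 47, 83, 52],
    "target": [1, 0, 0, 1],
})

def variable_overview(frame):
    """Return one row per column with its dtype and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
    })

overview = variable_overview(toy)
print(overview)
```

On the real data, `variable_overview(df)` would apply the same summary to all 14 columns.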

    Missing values
     

    df.isnull().sum()
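Raw missing counts are often easier to judge as a share of all rows. As a rough sketch, the hypothetical helper `missing_summary` below adds a percentage column and sorts the result; `toy` is an invented frame used in place of `df`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "company_size": [None, None, "50-99", "<10"],
    "training_hours": [36, 47, 83, 52],
})

def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = counts / len(frame) * 100
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Put the columns with the most missing values first.
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```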


    Next, examine which groups of people are more inclined to look for a new job.
    gender
     

    sns.countplot(x='gender', data=df)


    Men far outnumber women in the dataset.
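The imbalance seen in the count plot can also be put into numbers with `value_counts(normalize=True)`. This is a minimal sketch on an invented `toy` frame, not the real data; the label `"Missing"` is an assumption introduced here so that rows without a recorded gender stay visible in the result.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({"gender": ["Male"] * 6 + ["Female", None]})

# Share of each gender in percent, with missing entries as their own category.
shares = toy["gender"].fillna("Missing").value_counts(normalize=True) * 100
print(shares)
```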
     

    # Gender breakdown among candidates who are looking for a new job
    gender = df[df['target'] == 1]['gender']
    temp = gender.value_counts()
    labels = temp.keys()
    fig, ax = plt.subplots(figsize=(8, 8))
    plt.pie(x=temp, labels=labels, colors=['green', 'yellow', 'red'], autopct='%.2f%%', pctdistance=0.6)
    plt.title('Gender % looking for new job', fontsize=20)


     

    male_newjob = df[(df['gender'] == 'Male') & (df['target'] == 1)]
    female_newjob = df[(df['gender'] == 'Female') & (df['target'] == 1)]
    # Divide by the number of men/women; len() of a boolean mask would return the total row count
    n_male = (df['gender'] == 'Male').sum()
    n_female = (df['gender'] == 'Female').sum()
    print('{:.2f}% of men are looking for a new job'.format(len(male_newjob) / n_male * 100))
    print('{:.2f}% of women are looking for a new job'.format(len(female_newjob) / n_female * 100))
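Because `target` is coded 0/1, the share of job seekers within a group equals the group mean of `target`, so the rate for every category can be computed in one `groupby`. The sketch below assumes this framing; `job_seeking_rate` is a hypothetical helper, `toy` is invented data, and `"Missing"` is a label introduced here for rows with no recorded value.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Other", None],
    "target": [1, 0, 0, 0, 1, 0, 1, 1],
})

def job_seeking_rate(frame, column):
    """Percentage of rows with target == 1, and group size, per category of `column`."""
    keys = frame[column].fillna("Missing")
    stats = frame.groupby(keys)["target"].agg(rate="mean", n="count")
    stats["rate"] = stats["rate"] * 100
    return stats.sort_values("rate", ascending=False)

rates = job_seeking_rate(toy, "gender")
print(rates)
```

The same helper could be pointed at other categorical columns, for example `job_seeking_rate(df, 'company_type')`. Keeping the group size `n` alongside the rate helps avoid over-reading percentages from very small groups.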

