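Dataset overview
A company working in big data and data science wants to hire data scientists from among the people who successfully complete the courses it runs. Many people sign up, and the company wants to know which of them are genuinely looking for a job after training. When candidates register, the company collects their demographic, education, and experience information. That information is used to build a model that predicts whether a candidate is looking for a new job or will stay with their current company.
Features: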
enrollee_id: Unique ID for candidate
city: City code
city_development_index: Development index of the city (scaled)
gender: Gender of candidate
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Major discipline of candidate's education
experience: Candidate's total experience in years
company_size: Number of employees in current employer's company
company_type: Type of current employer
last_new_job: Difference in years between previous job and current job
training_hours: Training hours completed
target: 0 = not looking for a job change, 1 = looking for a job change
Data exploration and visualization
First, import the required modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier
# Note: plot_roc_curve was removed in scikit-learn 1.2; on newer versions,
# drop it from this import and use RocCurveDisplay.from_estimator instead.
from sklearn.metrics import (roc_auc_score, roc_curve, accuracy_score, confusion_matrix,
                             log_loss, plot_roc_curve, auc, precision_recall_curve)

# Load the training data and preview the first rows
df = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
df.head()

print('shape of train is {}'.format(df.shape))
The dataset has 19,158 rows and 14 columns. Next, look at each variable.
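One way to look at each variable is a per-column summary of dtype, non-null count, and number of distinct values. The following is only a sketch of what such an overview could look like: the helper name `summarize_columns` and the small `sample` frame are hypothetical and stand in for the real `df`.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Hypothetical stand-in for the real data; pass `df` instead in the notebook.
sample = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male"],
    "training_hours": [36, 47, 83, 52],
    "target": [1.0, 0.0, 0.0, 1.0],
})

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(df)` on the loaded data gives a quick view of which columns are categorical and which have gaps.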
Missing values
df.isnull().sum()
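Raw counts are easier to judge next to the share of rows they represent. As a minimal sketch, the helper below (`missing_report`, a hypothetical name) adds a percentage column and keeps only columns that actually have gaps; the `sample` frame is made-up data standing in for `df`.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain missing values.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in for the real data; pass `df` instead in the notebook.
sample = pd.DataFrame({
    "gender": ["Male", None, None, "Female"],
    "company_size": ["50-99", None, "<10", "10000+"],
    "training_hours": [36, 47, 83, 52],
})

report = missing_report(sample)
print(report)
```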
Next, look at which kinds of candidates are more inclined to look for a new job.
gender
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='gender', data=df)
Men far outnumber women in the dataset.
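To put a number on that imbalance, the share of each gender category can be computed directly, including rows where gender is missing. This is a sketch on a made-up `sample` frame (an assumption for illustration, not the real data); in the notebook the same call would be applied to `df['gender']`.

```python
import pandas as pd

# Hypothetical stand-in for df; the real column also contains missing values.
sample = pd.DataFrame({"gender": ["Male", "Male", "Male", "Female", None]})

# normalize=True returns fractions; dropna=False keeps missing values as a category.
shares = sample["gender"].value_counts(normalize=True, dropna=False) * 100
print(shares.round(2))
```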
# Gender breakdown among candidates who are looking for a new job (target == 1)
gender = df[df['target'] == 1]['gender']
temp = gender.value_counts()
labels = temp.keys()
bar, ax = plt.subplots(figsize=(8, 8))
plt.pie(x=temp, labels=labels, colors=['green', 'yellow', 'red'], autopct='%.2f%%', pctdistance=0.6)
plt.title('Gender % looking for new job', fontsize=20)
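male_newjob = df[(df['gender'] == 'Male') & (df['target'] == 1)]
female_newjob = df[(df['gender'] == 'Female') & (df['target'] == 1)]
# Divide by the number of candidates of each gender. len(df['gender'] == 'Male')
# would return the total row count, because it measures the boolean mask itself.
n_male = (df['gender'] == 'Male').sum()
n_female = (df['gender'] == 'Female').sum()
print('{:.2f} % of male candidates are looking for a new job'.format(len(male_newjob) / n_male * 100))
print('{:.2f} % of female candidates are looking for a new job'.format(len(female_newjob) / n_female * 100))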
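A per-group rate like the one printed above can also be computed for any categorical column in one step. Because `target` is 0 or 1, its mean within a group is the share of that group looking for a job change. The helper name `job_change_rate` and the `sample` data below are hypothetical, a sketch rather than the method used in the original notebook.

```python
import pandas as pd


def job_change_rate(frame: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of rows with target == 1 within each category of `column`."""
    # target is 0/1, so the group mean equals the share with target == 1.
    # groupby drops rows whose key is missing by default.
    return frame.groupby(column)["target"].mean().mul(100).round(2)


# Hypothetical stand-in for the real data; pass `df` instead in the notebook.
sample = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female", None],
    "target": [1, 0, 0, 0, 1, 0, 1],
})

rates = job_change_rate(sample, "gender")
print(rates)
```

The same call works for other columns, for example `job_change_rate(df, 'education_level')`.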
Kaggle dataset: HR Analytics: Job Change of Data Scientists (XGBoost)