Analysis of Factors Influencing Campus Recruitment

楠楠szl

Article link: https://blog.csdn.net/qq_24206673/article/details/107124833

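Data source: the data comes from the Kaggle dataset campus_recuritment.csv (available at https://www.kaggle.com/benroshan/notebook). It contains placement records for MBA students of a business school on a campus in India. It includes secondary and higher secondary school percentages and specializations, as well as degree specialization, degree type, work experience, and the salary offered to the students.
Analysis goals:
1. Which factors influenced a candidate in getting placed?
2. Does percentage matter for getting placed?
3. Which degree specialization is most in demand by corporates?
4. Play with the data by conducting all statistical tests.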
Field descriptions:
sl_no: serial number
gender: gender
ssc_p: secondary education percentage (10th grade), the average score in secondary school
ssc_b: board of education for secondary education (central/others)
hsc_p: higher secondary education percentage (12th grade), the average score in higher secondary school
hsc_b: board of education for higher secondary education (central/others)
hsc_s: specialization in higher secondary education (commerce/science/other)
degree_p: degree percentage, the final average score of the undergraduate program
degree_t: undergraduate degree type, i.e. field of degree education (Comm&Mgmt/Sci&Tech/Others)
workex: work experience
etest_p: employability test percentage (test conducted by the college)
specialisation: postgraduate (MBA) specialization
mba_p: MBA percentage, the average score in the MBA program
status: placement status (Placed/Not Placed)
salary: salary offered by the corporate to the candidate
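As a side note, I looked it up: in India's school system, grades 1-10 are secondary school (SSC), roughly comparable to our primary and junior high school (the nine-year compulsory education stage), and grades 11-12 are higher secondary school (HSC), roughly comparable to our senior high school. degree and mba correspond roughly to undergraduate and postgraduate study. This may make the fields easier to understand.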

Loading the data

import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['font.sans-serif']=['SimHei']
df1=pd.read_csv(r'D:\kaggle\campus recuritment.csv',index_col=0)

df1.info()

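The dataset has 215 records, each holding one student's information. The _p fields are numeric; gender, the _b fields and the remaining categorical fields are strings; status is the target variable. salary is only present for records with status="Placed", so it has many missing values. Since salary is not the focus of this study, the field is ignored.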
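The claim that salary is missing exactly for the students who were not placed can be checked directly. The following is a minimal sketch on a made-up four-row frame (the values in toy are illustrative assumptions, not records from the dataset); the same two expressions can be applied to df1.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the structure of the dataset (not real records)
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
    "mba_p": [58.8, 55.5, 66.3, 51.6],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# True if salary is missing for every "Not Placed" row and present for every other row
salary_missing_only_if_not_placed = (
    toy["salary"].isna() == (toy["status"] == "Not Placed")
).all()

print(missing_per_column)
print(salary_missing_only_if_not_placed)
```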

Data exploration

# Numeric variables
quantitives=[f for f in df1.columns if df1.dtypes[f]!='object']
# String (categorical) variables
qulitives=[f for f in df1.columns if df1.dtypes[f]=='object']
quantitives.remove('salary') # drop salary, which is not studied here
qulitives.remove('status') # status is the target variable, not a predictor, so drop it
print (quantitives,qulitives)
output:['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p'] ['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation']
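An alternative way to split the columns by type is pandas' select_dtypes, which avoids comparing dtypes against the string 'object'. This is only a sketch on a made-up frame; toy, numeric_cols and categorical_cols are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Made-up rows that only mimic the structure of the dataset (not real records)
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "ssc_p": [67.0, 79.3, 65.0],
    "salary": [270000.0, 200000.0, None],
    "status": ["Placed", "Placed", "Not Placed"],
})

# Numeric predictors: every numeric column except salary
numeric_cols = toy.select_dtypes(include="number").columns.drop("salary").tolist()

# Categorical predictors: every non-numeric column except the target
categorical_cols = toy.select_dtypes(exclude="number").columns.drop("status").tolist()

print(numeric_cols, categorical_cols)
```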

Data analysis

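# Goal 1: which factors influence whether a candidate is placed?
# Cross-tabulate each categorical factor against status
fig=plt.figure(figsize=(12.0, 18.0)) # set the size when the figure is created so that it takes effect
for i in range(len(qulitives)):
    # sl_no is the index (index_col=0), not a column, so count rows with size()
    gb1=df1.groupby(['status',qulitives[i]]).size().reset_index(name='count')
    ax1=fig.add_subplot(4,2,i+1)
    sns.barplot(x='count',y=qulitives[i],hue='status',data=gb1,ax=ax1)
    #plt.subplots_adjust()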
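Raw counts are hard to compare when the groups differ in size, so it can help to look at the placement rate within each level of a factor. A minimal sketch with pd.crosstab, on made-up rows (toy is an illustrative assumption, not data from the article):

```python
import pandas as pd

# Made-up rows that only mimic the structure of the dataset (not real records)
toy = pd.DataFrame({
    "workex": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "status": ["Placed", "Placed", "Not Placed", "Placed", "Not Placed", "Placed"],
})

# normalize="index" turns each row into proportions that sum to 1,
# i.e. the share of Placed / Not Placed within each workex level
rates = pd.crosstab(toy["workex"], toy["status"], normalize="index")
print(rates)
```

On the real data the same call would be pd.crosstab(df1[c], df1['status'], normalize='index') for each c in qulitives.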

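# Goal 2: does percentage matter?
## Relationship between each numeric factor and placement status
fig2=plt.figure(figsize=(10.0, 10.0)) # set the size when the figure is created so that it takes effect
for j in range(len(quantitives)): # loop over all five score columns, including mba_p
    ax2=fig2.add_subplot(3,2,j+1)
    sns.boxplot(x=quantitives[j],y='status',data=df1,ax=ax2)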
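The box plots can be complemented with a numeric summary per status group. A minimal sketch on made-up scores (toy and summary are illustrative names; the values are not from the dataset):

```python
import pandas as pd

# Made-up rows that only mimic the structure of the dataset (not real records)
toy = pd.DataFrame({
    "status": ["Placed", "Placed", "Not Placed", "Placed", "Not Placed"],
    "ssc_p": [67.0, 79.0, 56.0, 65.0, 52.0],
})

# Mean and median of the score within each status group
summary = toy.groupby("status")[["ssc_p"]].agg(["mean", "median"])
print(summary)
```

With the real data, replacing ["ssc_p"] by quantitives would summarise all score columns at once.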

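After the descriptive analysis above, we run a correlation analysis to quantify and verify how strongly each predictor is related to the target variable status. Because status is categorical, it has to be encoded numerically first, for example with binary encoding or with pd.get_dummies, which performs one-hot encoding. The caveat with integer codes is that a correlation analysis treats them as magnitudes: if categories are mapped to 0, 1, 2, ..., the codes imply an ordering and distances that the categories do not have. For that reason one-hot encoding is generally the recommended choice for categorical variables. For a two-level variable such as status a single 0/1 indicator is harmless, since its Pearson correlation with a numeric variable is the point-biserial correlation.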
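The difference between the encodings can be sketched on a small made-up frame (toy and all values in it are illustrative assumptions). Label encoding assigns integer codes that imply an order, one-hot encoding creates one 0/1 column per level, and for a two-level variable the Pearson correlation with a 0/1 indicator coincides with the point-biserial correlation from scipy.stats.pointbiserialr.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up rows that only mimic the structure of the dataset (not real records)
toy = pd.DataFrame({
    "degree_t": ["Sci&Tech", "Comm&Mgmt", "Others", "Comm&Mgmt", "Sci&Tech", "Others"],
    "mba_p": [58.8, 66.3, 57.8, 59.4, 55.5, 62.1],
    "status": ["Placed", "Placed", "Not Placed", "Placed", "Not Placed", "Placed"],
})

# Label encoding: integer codes in alphabetical order of the levels,
# which implies Comm&Mgmt < Others < Sci&Tech although no such order exists
label_codes = toy["degree_t"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per level, no order implied
one_hot = pd.get_dummies(toy["degree_t"], prefix="degree_t", dtype=int)

# Two-level target as a single 0/1 indicator
placed = (toy["status"] == "Placed").astype(int)

# Pearson correlation with the indicator equals the point-biserial correlation
pearson_r = np.corrcoef(toy["mba_p"], placed)[0, 1]
point_biserial_r = stats.pointbiserialr(placed, toy["mba_p"])[0]

print(label_codes.tolist())
print(one_hot)
print(pearson_r, point_biserial_r)
```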

# Convert the string columns to the pandas category dtype
for c in qulitives:
    df1[c]=df1[c].astype('category')

# 1. Alternative: binary encoding with category_encoders. Left commented out because
#    MSZoning, SalePrice and train are not part of this dataset.
#import category_encoders as ce
#encoder = ce.BinaryEncoder(cols='MSZoning').fit(train[['MSZoning','SalePrice']])
# Transform the data
#numeric_dataset = encoder.transform(train[['MSZoning','SalePrice']])

# 2. One-hot encode the target; dtype=int gives 0/1 columns instead of booleans
dumies=pd.get_dummies(df1,columns=['status'],prefix=['status'],prefix_sep="_",dummy_na=False,drop_first=False,dtype=int)
print(dumies.columns)
#output:Index(['gender', 'ssc_p', 'ssc_b', 'hsc_p', 'hsc_b', 'hsc_s','degree_p','degree_t', 'workex', 'etest_p', 'specialisation', 'mba_p', 'salary','status_Not Placed', 'status_Placed'], dtype='object')
# Label-encode the categorical predictors into integer codes
from sklearn.preprocessing import LabelEncoder
lenc=LabelEncoder()
for c in qulitives:
    # Assign the raw array. Wrapping it in a DataFrame would align on a new 0-based index,
    # which does not match the sl_no index and would shift the values by one row.
    dumies[c+'2']=lenc.fit_transform(df1[c])
qual_encoded=[i+'2' for i in qulitives ]

plt.rcParams['figure.figsize'] = (14.0, 5.0)
fig=plt.figure()
# Left: correlations of all predictors with the Placed indicator
fig.add_subplot(1,2,1)
corr1 = dumies[quantitives+qual_encoded+['status_Placed']].corr()
sns.heatmap(corr1)
# Right: the same for the Not Placed indicator (its correlations are the negatives of the Placed ones)
fig.add_subplot(1,2,2)
corr2 = dumies[quantitives+qual_encoded+['status_Not Placed']].corr()
sns.heatmap(corr2)
print (corr1['status_Placed'],'\n',corr2['status_Not Placed'])

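# Goal 4: run statistical tests on the data
from scipy import stats
index_all=quantitives+qual_encoded # combine the numeric and the encoded categorical variables
placed=dumies['status_Placed']==1 # boolean mask for the placed students
for i in index_all:
    # Compare each variable between the placed and the not-placed group,
    # rather than testing the variable against the 0/1 indicator itself
    res=stats.ttest_ind(dumies.loc[placed,i],dumies.loc[~placed,i])
    #print(res.pvalue)
    print ('{} t-test result: {}'.format(i,res))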
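For the categorical predictors, a chi-square test of independence on the contingency table is a common alternative to running a t-test on integer codes. The sketch below uses scipy.stats.chi2_contingency on made-up rows (toy is an illustrative assumption; with so few rows the expected counts are too small for a meaningful p-value, so it only shows the mechanics).

```python
import pandas as pd
from scipy import stats

# Made-up rows that only mimic the structure of the dataset (not real records)
toy = pd.DataFrame({
    "workex": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"],
    "status": ["Placed", "Placed", "Not Placed", "Placed",
               "Not Placed", "Placed", "Not Placed", "Not Placed"],
})

# Contingency table of counts: rows are workex levels, columns are status levels
table = pd.crosstab(toy["workex"], toy["status"])

# Chi-square test of independence between workex and status
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(chi2, p_value, dof)
```

On the real data the table would be pd.crosstab(df1[c], df1['status']) for each c in qulitives.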