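1. First, define the goal of the analysis: which factors affect salary?
Define the variables:
Dependent variable: salary
Independent variables: (qualitative) -- company type, company size, region, industry, education requirement, software requirement
(quantitative) -- experience requirement (numeric)
Analysis goal: build a multiple linear regression model of the dependent variable on the independent variables, estimate the model coefficients, and test their significance to determine which independent variables affect the dependent variable. Then plug new values of the independent variables into the fitted model to make predictions.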
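The modelling goal can be sketched in a few lines of base R. The data here are simulated, and the column names (`salary`, `experience`, `region`, `education`) are hypothetical stand-ins, not the actual columns of jobinfo.xlsx. The sketch only shows where the coefficient estimates, significance tests, and predictions come from.

```r
# Simulated data: hypothetical stand-ins for the real job-posting columns
set.seed(1)
n <- 200
demo <- data.frame(
  experience = sample(0:10, n, replace = TRUE),  # quantitative predictor
  region     = factor(sample(c("tier1", "other"), n, replace = TRUE)),
  education  = factor(sample(c("college", "bachelor", "master"), n, replace = TRUE))
)
demo$salary <- 5000 + 800 * demo$experience +
  3000 * (demo$region == "tier1") + rnorm(n, sd = 1500)

# Multiple linear regression; lm() dummy-codes factor predictors automatically
fit <- lm(salary ~ experience + region + education, data = demo)

# Coefficient estimates with t-tests for significance
coef_table <- summary(fit)$coefficients
print(coef_table)

# Predict salary for new values of the independent variables
new_jobs <- data.frame(
  experience = c(2, 8),
  region     = factor(c("other", "tier1"), levels = levels(demo$region)),
  education  = factor(c("bachelor", "master"), levels = levels(demo$education))
)
pred <- predict(fit, newdata = new_jobs)
print(pred)
```

Each qualitative variable enters the model as a set of 0/1 dummy columns measured against a baseline level, so a coefficient on a factor level is read as the salary difference relative to that baseline.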
2. Data preprocessing.
(Clean the data into a format that can be modeled directly.) Start by looking at the data structure.
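1) Read the data. The xlsx package is slow, so it is not recommended for large files.
    library(xlsx)
    jobInfo2 = read.xlsx('jobinfo.xlsx', 1, encoding = 'UTF-8')
    str(jobInfo2)   # inspect the data structure
    head(jobInfo2)  # first 6 rows
2) Alternatively, read the file with readxl:
    library(readxl)
    options(scipen = 200)  # avoid scientific notation when printing
    jobInfo = read_excel('jobinfo.xlsx')
    str(jobInfo)    # inspect the data structure
    head(jobInfo)   # head() also works on the tibble returned by read_excel(); it shows the first 6 rows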
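1) Convert the minimum and maximum salary to numeric, then derive the dependent variable (average salary)
    jobInfo$最低薪资 = as.numeric(jobInfo$最低薪资)
    jobInfo$最高薪资 = as.numeric(jobInfo$最高薪资)
    jobInfo$平均薪资 = (jobInfo$最低薪资 + jobInfo$最高薪资) / 2  # midpoint of the salary range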
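One thing to watch in this conversion is that `as.numeric()` turns any text it cannot parse into `NA` and only emits a warning. A minimal sketch, using hypothetical column names (`min_salary`, `max_salary`) and made-up values rather than the real data:

```r
# Hypothetical salary columns stored as text, as they often are after import
pay <- data.frame(
  min_salary = c("8000", "10000", "negotiable"),
  max_salary = c("12000", "15000", "negotiable"),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA (with a warning) for text it cannot parse
pay$min_salary <- suppressWarnings(as.numeric(pay$min_salary))
pay$max_salary <- suppressWarnings(as.numeric(pay$max_salary))

# Midpoint of the range, used as the dependent variable
pay$avg_salary <- (pay$min_salary + pay$max_salary) / 2

# Count rows that could not be converted before modelling
n_missing <- sum(is.na(pay$avg_salary))
print(pay)
print(n_missing)
```

Counting the `NA` rows right after the conversion shows how many postings will drop out of the regression, since `lm()` omits rows with missing values by default.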
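2) Region: split into Beijing/Shanghai/Shenzhen (1) versus everywhere else (0)
    loc = which(jobInfo$地区 %in% c("北京","上海","深圳"))         # rows in Beijing, Shanghai or Shenzhen
    loc_other = which(!jobInfo$地区 %in% c("北京","上海","深圳"))  # all other rows
    jobInfo$地区[loc] = 1
    jobInfo$地区[loc_other] = 0
    jobInfo$地区 = as.numeric(jobInfo$地区)  # the column is still character at this point, so convert it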
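The same 0/1 indicator can be built in a single vectorized step. This is a sketch of an equivalent shortcut on a small hypothetical vector, not a replacement for the author's code:

```r
# Hypothetical region values; the three tier-one cities follow the text
region <- c("北京", "杭州", "深圳", "成都", "上海")
tier1  <- c("北京", "上海", "深圳")

# %in% yields TRUE/FALSE; as.numeric() maps that to 1/0 in one step
region_flag <- as.numeric(region %in% tier1)
print(region_flag)
```

Note that `%in%` returns `FALSE` for a missing value, so a posting with no region recorded would be coded 0 under either approach.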
3) Convert company size and education to ordered-level factors, which makes plotting easier
    jobInfo$公司规模 = factor(jobInfo$公司规模,
                          levels = c("少于50人", "50-150人", "150-500人", "500-1000人",
                                     "1000-5000人", "5000-10000人", "10000人以上"))
    # merge the levels "50-150人" and "150-500人" into a single level "50-500人"
    levels(jobInfo$公司规模)[c(2, 3)] = c("50-500人", "50-500人")
    jobInfo$学历 = factor(jobInfo$学历,
                        levels = c("中专", "高中", "大专", "无", "本科", "硕士", "博士"))
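Two behaviours of `factor()` matter in this step: assigning the same name to two levels merges them, and any value that is not listed in `levels` becomes `NA`. A small sketch on a hypothetical vector (the value "未知" is made up to show the second point):

```r
# Hypothetical company-size values; "未知" is deliberately not a listed level
size <- c("少于50人", "50-150人", "150-500人", "500-1000人", "50-150人", "未知")

f <- factor(size, levels = c("少于50人", "50-150人", "150-500人", "500-1000人"))

# Giving levels 2 and 3 the same name merges them into one level
levels(f)[c(2, 3)] <- c("50-500人", "50-500人")

counts <- table(f)        # counts per level; NA values are excluded
n_unmatched <- sum(is.na(f))
print(counts)
print(n_unmatched)
```

Checking `sum(is.na(...))` after the conversion is a quick way to catch category labels that were misspelled or missing from the `levels` vector.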
4) Match the tools each employer requires
The analysis tools considered are: "R", "SPSS", "Excel", "Python", "MATLAB", "Java", "SQL", "SAS", "Stata", "EViews", "Spark", "Hadoop"
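Matching tool names against the words of a segmented job description is a case-insensitive membership test. A compact sketch, assuming the description has already been split into a character vector of tokens; the `tokens` vector here is hand-written and hypothetical, standing in for real segmenter output:

```r
tools <- c("R", "SPSS", "Excel", "Python", "MATLAB", "Java",
           "SQL", "SAS", "Stata", "EViews", "Spark", "Hadoop")

# Hypothetical tokens standing in for one segmented job description
tokens <- c("proficient", "in", "python", "SQL", "and", "r", "Excel")

# Lower-case both sides so "SQL", "Sql" and "sql" all match
flags <- as.numeric(tolower(tools) %in% tolower(tokens))
names(flags) <- tools
print(flags)
```

Because the comparison is on whole tokens, the tool "R" only matches a standalone token "r" or "R", not every word that happens to contain the letter.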
    library(jiebaR)  # provides worker() for Chinese word segmentation
    # one row per job posting, one 0/1 column per tool
    software = as.data.frame(matrix(0, nrow = length(jobInfo$描述), ncol = 12))
    colnames(software) = c("R", "SPSS", "Excel", "Python", "MATLAB", "Java", "SQL", "SAS", "Stata", "EViews", "Spark", "Hadoop")
    mixseg = worker()
    for (i in 1:length(jobInfo$描述)) {
      subData = as.character(jobInfo$描述[i])
      fenci = mixseg[subData]  # segment the job description into words
      # check each tool name under its common capitalizations
      R.identify = ("R" %in% fenci) | ("r" %in% fenci)
      SPSS.identify = ("spss" %in% fenci) | ("Spss" %in% fenci) | ("SPSS" %in% fenci)
      Excel.identify = ("excel" %in% fenci) | ("EXCEL" %in% fenci) | ("Excel" %in% fenci)
      Python.identify = ("Python" %in% fenci) | ("python" %in% fenci) | ("PYTHON" %in% fenci)
      MATLAB.identify = ("matlab" %in% fenci) | ("Matlab" %in% fenci) | ("MATLAB" %in% fenci)
      Java.identify = ("java" %in% fenci) | ("JAVA" %in% fenci) | ("Java" %in% fenci)
      SQL.identify = ("SQL" %in% fenci) | ("Sql" %in% fenci) | ("sql" %in% fenci)
      SAS.identify = ("SAS" %in% fenci) | ("Sas" %in% fenci) | ("sas" %in% fenci)
      Stata.identify = ("STATA" %in% fenci) | ("Stata" %in% fenci) | ("stata" %in% fenci)
      EViews.identify = ("EViews" %in% fenci) | ("EVIEWS" %in% fenci) | ("Eviews" %in% fenci) | ("eviews&