The dataset used in this case study is bschool0.txt.
1. Reading the data
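We usually read a file directly with read.table(). For a CSV file you can use read.csv() instead; it takes the same arguments but defaults to header = TRUE and sep = ",". In the path, use forward slashes / rather than single backslashes. header takes TRUE/FALSE and says whether the first line holds the column names; sep is the field separator, a comma here. So we have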
x <- read.table("E:/R/bschool0.txt", header = TRUE, sep = ",")
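As a self-contained sketch of the call above, the snippet below writes six rows (copied from the head(x) output shown later in this post) to a temporary file and reads them back. The file layout is an assumption: a header line with nine names and data rows with ten fields, which is what makes read.table() use the first field as row names. The real bschool0.txt may be laid out differently, and the names rows_txt, path, x_tab and x_csv are only illustrative.

```r
# Six rows taken from the head(x) output in this post; the real file has 60.
# Assumed layout: the header line has one field fewer than the data rows.
rows_txt <- c(
  "FiveYearGain,FiveYearGp,YearsToPayback,SalaPreMBA,SalaPostMBA,Tuition,GMAT,Location,State",
  "Dartmouth (Tuck),134,84,2.9,54,165,75,710,Hanover,NH",
  "Pennsylvania (Wharton),129,75,3.2,64,177,75,716,Philadelphia,PA",
  "Chicago,121,71,3.3,60,164,75,710,Chicago,IL",
  "Columbia,120,74,3.1,57,160,75,707,New York,NY",
  "Yale,119,84,2.8,46,134,72,700,New Haven,CT",
  "Stanford,115,70,3.2,58,160,78,710,Stanford,CA"
)
path <- tempfile(fileext = ".txt")
writeLines(rows_txt, path)

# header = TRUE: the first line holds the column names; sep = ",": comma-separated
x_tab <- read.table(path, header = TRUE, sep = ",")

# read.csv() already defaults to header = TRUE and sep = ","
x_csv <- read.csv(path)

dim(x_tab)       # rows and columns
rownames(x_tab)  # school names, taken from the first field of each row
```

Because the header is one field short, both calls treat the school name as a row name rather than as a tenth column, which matches the 9 variables reported by dim(x).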
2. A first look at the data
Check the dimensions
dim(x) # 60 observations, 9 variables
> [1] 60  9
View the first six rows (the default for head())
head(x)
> FiveYearGain FiveYearGp
Dartmouth (Tuck) 134 84
Pennsylvania (Wharton) 129 75
Chicago 121 71
Columbia 120 74
Yale 119 84
Stanford 115 70
YearsToPayback SalaPreMBA
Dartmouth (Tuck) 2.9 54
Pennsylvania (Wharton) 3.2 64
Chicago 3.3 60
Columbia 3.1 57
Yale 2.8 46
Stanford 3.2 58
SalaPostMBA Tuition GMAT
Dartmouth (Tuck) 165 75 710
Pennsylvania (Wharton) 177 75 716
Chicago 164 75 710
Columbia 160 75 707
Yale 134 72 700
Stanford 160 78 710
Location State
Dartmouth (Tuck) Hanover NH
Pennsylvania (Wharton) Philadelphia PA
Chicago Chicago IL
Columbia New York NY
Yale New Haven CT
Stanford Stanford CA
The summary() function gives descriptive statistics for the quantitative variables, and level counts for factor columns.
summary(x)
>FiveYearGain FiveYearGp YearsToPayback
Min. : 2.00 Min. : 2.00 Min. :2.200
1st Qu.: 62.00 1st Qu.: 57.25 1st Qu.:2.900
Median : 79.00 Median : 67.00 Median :3.100
Mean : 78.83 Mean : 65.93 Mean :3.190
3rd Qu.: 99.25 3rd Qu.: 76.00 3rd Qu.:3.325
Max. :134.00 Max. :126.00 Max. :4.900
SalaPreMBA SalaPostMBA Tuition
Min. :24.00 Min. : 70.0 Min. :22.00
1st Qu.:38.00 1st Qu.: 94.0 1st Qu.:47.75
Median :42.50 Median :105.5 Median :61.50
Mean :43.75 Mean :112.4 Mean :58.07
3rd Qu.:48.75 3rd Qu.:128.5 3rd Qu.:69.25
Max. :64.00 Max. :180.0 Max. :78.00
GMAT Location State
Min. :560.0 Los Angeles: 3 CA : 7
1st Qu.:640.0 Atlanta : 2 NY : 5
Median :660.0 Boston : 2 MA : 4
Mean :659.9 New York : 2 TX : 4
3rd Qu.:690.0 Ann Arbor : 1 FL : 3
Max. :716.0 Athens : 1 GA : 3
(Other) :49 (Other):34
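The Location and State counts in the summary above appear because those columns were read as factors. Since R 4.0.0, read.table() keeps strings as character by default, and summary() then reports only length, class and mode for them. A minimal sketch, using a hypothetical two-column data frame x6 built from the six rows printed by head(x):

```r
# Hypothetical reduced data frame: two columns, six rows from head(x)
x6 <- data.frame(
  GMAT  = c(710, 716, 710, 707, 700, 710),
  State = c("NH", "PA", "IL", "NY", "CT", "CA"),
  stringsAsFactors = FALSE
)

summary(x6$State)             # character column: Length / Class / Mode only
x6$State <- factor(x6$State)  # convert to a factor
summary(x6$State)             # factor column: one count per level
summary(x6$GMAT)              # numeric column: min, quartiles, mean, max
```

When reading the file, passing stringsAsFactors = TRUE to read.table() has the same effect for every character column at once.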
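3. Measuring correlation
Here SalaPostMBA is taken as the response variable, so we look for variables that are strongly correlated with it to build the regression equation. The cor() function computes the correlation; $ extracts a column from the data frame by name. For example, the Pearson correlation coefficient between SalaPreMBA and SalaPostMBA: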
cor(x$SalaPostMBA,x$SalaPreMBA,method = "pearson")
>0.9242337
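For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that formula against cor() on the six SalaPreMBA/SalaPostMBA pairs printed by head(x). With only six rows the value will not equal the 0.9242337 obtained from all 60; the vector names pre and post are illustrative.

```r
# Six value pairs from the head(x) output (not the full 60 rows)
pre  <- c(54, 64, 60, 57, 46, 58)        # SalaPreMBA
post <- c(165, 177, 164, 160, 134, 160)  # SalaPostMBA

# Pearson r: sum of cross-products of deviations over the
# square root of the product of the sums of squared deviations
r_manual <- sum((pre - mean(pre)) * (post - mean(post))) /
  sqrt(sum((pre - mean(pre))^2) * sum((post - mean(post))^2))

r_builtin <- cor(post, pre, method = "pearson")

all.equal(r_manual, r_builtin)  # the two computations agree
```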
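This variable is highly correlated with the response, so it can serve as an explanatory variable in the regression. The same check can be run on all numeric columns at once:
cor(x[,1:7]) # the last two columns are text (not numeric), so they are left out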
FiveYearGain FiveYearGp YearsToPayback SalaPreMBA SalaPostMBA Tuition
FiveYearGain 1.0000000 0.69466576 -0.70818368 0.5310755 0.69842102 0.4540424
FiveYearGp 0.6946658 1.00000000 -0.95323083 -0.1491163 0.03342518 -0.2422810
YearsToPayback -0.7081837 -0.95323083 1.00000000 0.1222730 -0.01691694 0.1623029
SalaPreMBA 0.5310755 -0.14911629 0.12227301 1.0000000 0.92423368 0.7836609
SalaPostMBA 0.6984210 0.03342518 -0.01691694 0.9242337 1.00000000 0.7811176
Tuition 0.4540424 -0.24228099 0.16230293 0.7836609 0.78111759 1.0000000
GMAT 0.6672689 0.13096436 -0.20274377 0.8250422 0.77723442 0.6621838
GMAT
FiveYearGain 0.6672689
FiveYearGp 0.1309644
YearsToPayback -0.2027438
SalaPreMBA 0.8250422
SalaPostMBA 0.7772344
Tuition 0.6621838
GMAT 1.0000000
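Rather than hard-coding columns 1:7, the numeric columns can be selected programmatically, and the SalaPostMBA column of the matrix can be sorted to rank candidate predictors. A sketch on a hypothetical reduced data frame x6 (five columns, six rows from head(x)), so the numbers differ from the full matrix above:

```r
# Hypothetical reduced data frame built from the head(x) output
x6 <- data.frame(
  SalaPreMBA  = c(54, 64, 60, 57, 46, 58),
  SalaPostMBA = c(165, 177, 164, 160, 134, 160),
  Tuition     = c(75, 75, 75, 75, 72, 78),
  GMAT        = c(710, 716, 710, 707, 700, 710),
  State       = c("NH", "PA", "IL", "NY", "CT", "CA"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns instead of relying on column positions
num_cols <- sapply(x6, is.numeric)
cm <- cor(x6[, num_cols])

# Correlations with the response, strongest first (the response itself is 1)
sort(cm[, "SalaPostMBA"], decreasing = TRUE)
```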
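In addition, scatterplotMatrix() from the car package quickly produces a scatterplot matrix that shows these relationships visually.
install.packages("car") # only needed once
library(car)
# spread and lty.smooth are arguments from older car releases;
# newer releases configure the smoother through the smooth argument instead
scatterplotMatrix(x,spread=F,lty.smooth=2)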
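4. Building the regression equation
Row 5 of the correlation matrix gives the correlation between the response SalaPostMBA and every other variable. Variables 1, 4, 6 and 7 (FiveYearGain, SalaPreMBA, Tuition and GMAT) are strongly correlated with it. We select these four and fit the regression equation by ordinary least squares with lm().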
lm.sol <- lm(SalaPostMBA ~ FiveYearGain + SalaPreMBA + Tuition + GMAT,x)
# Detailed summary of the fitted model
summary(lm.sol)
> Call:
lm(formula = SalaPostMBA ~ FiveYearGain + SalaPreMBA + Tuition +
GMAT, data = x)
Residuals:
Min 1Q Median 3Q Max
-12.3369 -6.0117 0.7479 4.4076 15.2953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.72676 26.70138 2.462 0.01700 *
FiveYearGain 0.32540 0.04442 7.325 1.11e-09 ***
SalaPreMBA 2.50162 0.23279 10.746 3.98e-15 ***
Tuition 0.21477 0.10131 2.120 0.03853 *
GMAT -0.15296 0.05185 -2.950 0.00466 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.15 on 55 degrees of freedom
Multiple R-squared: 0.9307, Adjusted R-squared: 0.9256
F-statistic: 184.6 on 4 and 55 DF, p-value: < 2.2e-16
The fitted equation is given by the Coefficients component, which can be viewed with summary(lm.sol)$coef.
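To make the fitted equation concrete, the coefficients printed above can be applied by hand to one school. The sketch uses the Dartmouth (Tuck) row from head(x) and the rounded estimates from the summary; the names b, tuck, fitted_tuck and residual_tuck are illustrative. On the fitted object itself, predict() with a newdata data frame would do the same job.

```r
# Estimates copied from the Coefficients table of summary(lm.sol)
b <- c("(Intercept)" = 65.72676,
       FiveYearGain  = 0.32540,
       SalaPreMBA    = 2.50162,
       Tuition       = 0.21477,
       GMAT          = -0.15296)

# Dartmouth (Tuck) row from head(x); the leading 1 multiplies the intercept
tuck <- c("(Intercept)" = 1,
          FiveYearGain  = 134,
          SalaPreMBA    = 54,
          Tuition       = 75,
          GMAT          = 710)

# Fitted value = sum of coefficient * predictor value
fitted_tuck <- sum(b * tuck[names(b)])

# Residual = observed SalaPostMBA (165) minus the fitted value
residual_tuck <- 165 - fitted_tuck

c(fitted = fitted_tuck, residual = residual_tuck)
```

This gives a fitted value of about 151.9 against an observed 165, a residual of about 13.1, which lies inside the residual range reported by summary(lm.sol).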
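5. Testing the coefficients
Once the regression equation is built, both the equation as a whole and its individual coefficients need to be tested for significance.
For the equation as a whole, the last line of the summary,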
F-statistic: 184.6 on 4 and 55 DF, p-value: < 2.2e-16
gives a p-value below 0.05, so the regression as a whole is significant. For the individual coefficients, from
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.72676 26.70138 2.462 0.01700 *
FiveYearGain 0.32540 0.04442 7.325 1.11e-09 ***
SalaPreMBA 2.50162 0.23279 10.746 3.98e-15 ***
Tuition 0.21477 0.10131 2.120 0.03853 *
GMAT -0.15296 0.05185 -2.950 0.00466 **
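we see that every coefficient has a p-value below 0.05, so each is significant at the 5% level and all four variables contribute to the model.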
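The t values and p-values in this table follow from the estimates and standard errors: t is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom (55 here). The sketch below recomputes them from the rounded numbers printed above, so the results agree with the table only to about the printed precision. The overall p-value is recomputed the same way from the F statistic.

```r
# Estimates and standard errors copied from the Coefficients table
est <- c("(Intercept)" = 65.72676, FiveYearGain = 0.32540,
         SalaPreMBA = 2.50162, Tuition = 0.21477, GMAT = -0.15296)
se  <- c("(Intercept)" = 26.70138, FiveYearGain = 0.04442,
         SalaPreMBA = 0.23279, Tuition = 0.10131, GMAT = 0.05185)
df_resid <- 55  # residual degrees of freedom from the summary

t_val <- est / se                            # t statistic per coefficient
p_val <- 2 * pt(-abs(t_val), df = df_resid)  # two-sided p-value

cbind(est, se, t_val, p_val)
all(p_val < 0.05)  # every coefficient significant at the 5% level

# Overall test: upper tail of the F distribution with 4 and 55 df
p_overall <- pf(184.6, df1 = 4, df2 = 55, lower.tail = FALSE)
p_overall
```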
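The equation can therefore be put to use. The adjusted coefficient of determination is R^2 = 0.9256 (multiple R^2 = 0.9307), so the model explains the data very well.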
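The two R^2 values and the F statistic in the summary are tied together by simple formulas, with n = 60 observations and p = 4 predictors. The sketch below recomputes the adjusted R^2 and the F statistic from the printed multiple R^2. Because 0.9307 is itself rounded, the results match the summary only to about three significant digits.

```r
n  <- 60      # observations, from dim(x)
p  <- 4       # predictors in the model
r2 <- 0.9307  # multiple R-squared printed by summary(lm.sol)

# Adjusted R^2 penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

c(adj_r2 = adj_r2, f_stat = f_stat)
```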
The complete code:
x <- read.table("E:/R/bschool0.txt",header = T,sep=",")
dim(x)
head(x)
cor( x$SalaPostMBA, x$SalaPreMBA, method = "pearson")
library(car)
scatterplotMatrix(x, spread = F, lty.smooth = 2)
cor(x[, 1:7])
lm.sol <- lm(SalaPostMBA ~ FiveYearGain + SalaPreMBA + Tuition + GMAT,x)
# Coefficient tests
summary(lm.sol)
# Plot for a single predictor
lm.sol2 <- lm(SalaPostMBA ~ Tuition, data = x)
plot(x$Tuition, x$SalaPostMBA)
abline(lm.sol2)