6. Variable Selection Methods, Part 2
6.1 Ridge Regression
For the Hitters data set, first remove the observations with missing values.
Note: an earlier post (Variable Selection Methods, Part 1) defined a custom predict function for regsubsets objects; the predict() calls in this post are the standard glmnet/pls methods.
#After removing missing values, Salary is the response and the other 19 columns are the predictors
library(ISLR)   #provides the Hitters data set
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary~.,Hitters)[,-1]
y <- Hitters$Salary
library(glmnet)
#define the grid of lambda values
grid <- 10^seq(10,-2,length = 100)
#alpha = 0 gives ridge regression; alpha = 1 gives the lasso
ridge.mod <- glmnet(x,y,alpha =0,lambda = grid)
dim(coef(ridge.mod))
#20 x 100: each row tracks one coefficient (intercept included) across the lambda grid; each column is the full coefficient vector for one lambda value
The output below shows the coefficient estimates for different values of lambda, together with the L2 norm of the coefficient vector: the smaller lambda is, the larger the L2 norm.
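As a sanity check on this relationship, here is a minimal base-R sketch (on hypothetical simulated data, not the Hitters set) using the closed-form ridge estimate beta = (X'X + lambda*I)^{-1} X'y:

```r
# Hypothetical simulated data, only to illustrate the lambda/L2-norm relationship
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
y <- X %*% rnorm(p) + rnorm(n)

# Closed-form ridge estimate: (X'X + lambda*I)^{-1} X'y
ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# L2 norm of the coefficient vector shrinks as lambda grows
norms <- sapply(c(0.01, 1, 100), function(l) sqrt(sum(ridge_beta(X, y, l)^2)))
norms
```

Larger penalties pull the whole coefficient vector toward zero, so `norms` is strictly decreasing, matching the glmnet output above.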
ridge.mod$lambda[50]
> coef(ridge.mod)[,50]
(Intercept) AtBat Hits HmRun Runs
407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
RBI Walks Years CAtBat CHits
0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
CHmRun CRuns CRBI CWalks LeagueN
0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
DivisionW PutOuts Assists Errors NewLeagueN
-6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
> sqrt(sum(coef(ridge.mod)[-1,50]^2))
[1] 6.360612
> ridge.mod$lambda[60]
[1] 705.4802
> coef(ridge.mod)[,60]
(Intercept) AtBat Hits HmRun Runs
54.32519950 0.11211115 0.65622409 1.17980910 0.93769713
RBI Walks Years CAtBat CHits
0.84718546 1.31987948 2.59640425 0.01083413 0.04674557
CHmRun CRuns CRBI CWalks LeagueN
0.33777318 0.09355528 0.09780402 0.07189612 13.68370191
DivisionW PutOuts Assists Errors NewLeagueN
-54.65877750 0.11852289 0.01606037 -0.70358655 8.61181213
> sqrt(sum(coef(ridge.mod)[-1,60]^2))
[1] 57.11001
Using the fitted model we can also obtain coefficient estimates at new lambda values; for example, at lambda = 50 the estimates are:
> predict(ridge.mod,s =50,type="coefficients")[1:20,]
(Intercept) AtBat Hits HmRun Runs
4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
RBI Walks Years CAtBat CHits
8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
CHmRun CRuns CRBI CWalks LeagueN
6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
DivisionW PutOuts Assists Errors NewLeagueN
-1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
Next we use cross-validation to choose the parameter lambda.
How to split the data into a training set and a test set (the original ISLR text explains this well, so it is quoted directly):
We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. There are two common ways to randomly split a data set. The first is to produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data. The second is to randomly choose a subset of numbers between 1 and n; these can then be used as the indices for the training observations. The two approaches work equally well.
Here we use the second method, drawing a training set of half the observations; the remaining half forms the test set.
set.seed(1)
train <- sample(1:nrow(x),nrow(x)/2)
test <- (-train)
y.test <- y[test]
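The two splitting schemes described in the quoted passage can be sketched side by side on a toy example (n_obs = 10, names hypothetical):

```r
# Two equivalent ways to split observations into training and test sets
n_obs <- 10
set.seed(1)

# Method 1: a random logical mask; TRUE rows form the training set
train_mask <- sample(c(TRUE, FALSE), n_obs, replace = TRUE)

# Method 2 (used in this post): a random subset of the indices 1..n
train_idx <- sample(1:n_obs, n_obs / 2)
test_idx  <- setdiff(1:n_obs, train_idx)   # same as indexing with -train_idx

# The two index sets are disjoint and cover all observations
length(intersect(train_idx, test_idx))
```

With method 2, `x[-train_idx, ]` is the idiomatic way to get the test rows, which is exactly what `test <- (-train)` does above.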
Fit the model on the training set, use it to predict the test set, and compute the test error:
ridge.mod <- glmnet(x[train,],y[train],alpha =0,lambda = grid,thresh = 1e-12)
ridge.pred <- predict(ridge.mod,s =4,newx = x[test,])
mean((ridge.pred-y.test)^2)
[1] 142199.2
Because the intercept is not penalized, consider the intercept-only baseline separately (its prediction is simply the training-sample mean):
> mean((mean(y[train])-y.test)^2)
[1] 224669.9
Fitting ridge regression with a very large lambda gives essentially the same result:
> ridge.pred <- predict(ridge.mod,s = 1e10,newx = x[test,])
> mean((ridge.pred-y.test)^2)
[1] 224669.8
Compare least squares (lambda = 0) with ridge regression (note: in newer versions of glmnet, exact = TRUE also requires resupplying the original x and y):
> ridge.pred <- predict(ridge.mod, s = 0, newx = x[test,], exact = TRUE, x = x[train,], y = y[train])
> mean((ridge.pred - y.test)^2)
[1] 114783
> lm(y ~ x, subset = train)
> predict(ridge.mod, s = 0, exact = TRUE, type = "coefficients", x = x[train,], y = y[train])[1:20,]
Use cross-validation on the training set to find the best lambda for ridge regression:
set.seed(1)
cv.out <- cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
Compute the test MSE at the selected lambda:
ridge.pred <- predict(ridge.mod,s=bestlam,newx = x[test,])
mean((ridge.pred-y.test)^2)
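cv.glmnet() hides the fold machinery; the following hypothetical base-R sketch (simulated data, closed-form ridge rather than glmnet's solver) shows what k-fold selection of lambda looks like underneath:

```r
# Manual k-fold cross-validation over a lambda grid (illustrative only)
set.seed(1)
n <- 100; p <- 5; k <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X %*% rnorm(p) + rnorm(n))

ridge_beta <- function(X, y, lambda)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

folds   <- sample(rep(1:k, length.out = n))  # random fold assignment
lambdas <- 10^seq(2, -2, length.out = 20)

# For each lambda: fit on k-1 folds, average the MSE on the held-out fold
cv_mse <- sapply(lambdas, function(l) {
  mean(sapply(1:k, function(f) {
    beta <- ridge_beta(X[folds != f, ], y[folds != f], l)
    mean((X[folds == f, ] %*% beta - y[folds == f])^2)
  }))
})
bestlam_manual <- lambdas[which.min(cv_mse)]
```

This is the same idea as `cv.out$lambda.min` above: pick the grid value minimizing the averaged held-out error.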
Finally, refit ridge regression on the full data set with the CV-selected lambda; the resulting coefficient estimates are below ([1:20,] selects the intercept plus the 19 predictors):
> out <- glmnet(x,y,alpha = 0)
> predict(out,type="coefficients",s = bestlam)[1:20,]
(Intercept) AtBat Hits HmRun Runs
15.44383135 0.07715547 0.85911581 0.60103107 1.06369007
RBI Walks Years CAtBat CHits
0.87936105 1.62444616 1.35254780 0.01134999 0.05746654
CHmRun CRuns CRBI CWalks LeagueN
0.40680157 0.11456224 0.12116504 0.05299202 22.09143189
DivisionW PutOuts Assists Errors NewLeagueN
-79.04032637 0.16619903 0.02941950 -1.36092945 9.12487767
6.2 Lasso
The lasso is implemented almost exactly like ridge regression; simply change alpha = 0 to alpha = 1.
Here we again pass our own grid of lambda values to glmnet():
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)
cv.glmnet() performs cross-validation, automatically computing the model error over a sequence of lambda values:
set.seed(1)
cv.out <- cv.glmnet(x[train,],y[train],alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod,s = bestlam,newx = x[test,])
mean((lasso.pred-y.test)^2)
In the cv.glmnet plot, the first dashed vertical line marks the lambda with the smallest cross-validation error, and the second marks the largest lambda within one standard error of that minimum; the numbers along the top axis give the count of nonzero coefficients at each lambda.
out <- glmnet(x,y,alpha = 1,lambda = grid)
lasso.coef <- predict(out,type = "coefficients",s=bestlam)[1:20,]
> lasso.coef[lasso.coef!=0]
(Intercept) AtBat Hits Walks
1.27479059 -0.05497143 2.18034583 2.29192406
Years CHmRun CRuns CRBI
-0.33806109 0.02825013 0.21628385 0.41712537
LeagueN DivisionW PutOuts Errors
20.28615023 -116.16755870 0.23752385 -0.85629148
The results show that the lasso shrinks some coefficients exactly to zero, here selecting 11 variables (ridge regression cannot perform this kind of variable selection).
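A quick intuition for why the lasso zeros out coefficients while ridge does not: in the orthonormal-design special case, the lasso solution is the soft-thresholding of the least-squares estimate, whereas ridge merely rescales it. A minimal sketch (hypothetical least-squares estimates, lambda = 0.5):

```r
# Soft-thresholding operator: shrink toward zero, then truncate at zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

ols <- c(-0.3, 0.05, 1.2, 2.5)       # hypothetical OLS estimates
soft_threshold(ols, 0.5)             # lasso (orthonormal case): small ones become exactly 0
ols / (1 + 0.5)                      # ridge (orthonormal case): all shrink but stay nonzero
```

The two estimates with magnitude below lambda are set exactly to zero by the lasso, which is the mechanism behind the 11-variable model above.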
6.3 PCR (Principal Components Regression)
library(pls)
Hitters =na.omit(Hitters)
set.seed(2)
pcr.fit <- pcr(Salary~.,data = Hitters,scale = TRUE,validation = "CV")
summary(pcr.fit)
validationplot(pcr.fit,val.type = "RMSEP")
#estimate the test error: fit PCR on the training set with cross-validation
set.seed(1)
pcr.fit <- pcr(Salary~.,data=Hitters,subset=train,scale=TRUE,validation="CV")
validationplot(pcr.fit,val.type = "MSEP")
#the plot shows the smallest cross-validation error occurs with 7 components
pcr.pred <- predict(pcr.fit,x[test,],ncomp = 7)
mean((pcr.pred-y.test)^2)
#refit with 7 components on the full data set
pcr.fit <- pcr(y~x,scale=TRUE,ncomp=7)
summary(pcr.fit)
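Conceptually, pcr() is just PCA on the predictors followed by least squares on the leading scores. A minimal base-R sketch on hypothetical simulated data (not Hitters), keeping 2 components:

```r
# PCR by hand: PCA, then OLS of y on the first ncomp principal-component scores
set.seed(1)
n <- 100; p <- 5; ncomp <- 2
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% rnorm(p) + rnorm(n))

pc  <- prcomp(X, scale. = TRUE)      # principal components of the predictors
fit <- lm(y ~ pc$x[, 1:ncomp])       # regress y on the first 2 score vectors
coef(fit)                            # intercept plus one coefficient per component
```

The pls package adds the cross-validation and back-transformation to the original predictor scale, but the core fit is this two-step procedure.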
6.4 PLS (Partial Least Squares)
set.seed(1)
pls.fit <- plsr(Salary~.,data = Hitters,subset = train,scale = TRUE,validation = "CV")
summary(pls.fit)
validationplot(pls.fit,val.type = "MSEP")
pls.pred <- predict(pls.fit,x[test,],ncomp = 2)
mean((pls.pred - y.test)^2)
pls.fit <- plsr(Salary~.,data=Hitters,scale=TRUE,ncomp=2)
summary(pls.fit)