In lasso regression the amount of regularization is controlled by the parameter lambda: the larger lambda is, the more heavily model complexity is penalized and the fewer variables end up in the model. Lasso regression and ridge regression are both special cases of the elastic net family of generalized linear models. Besides lambda there is a second parameter, alpha, which controls how the penalty behaves when predictors are highly correlated.
Lasso regression: alpha = 1 (the default in R's glmnet); ridge regression: alpha = 0; a general elastic net: 0 < alpha < 1.
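For reference, the objective glmnet minimizes in the Gaussian case makes the roles of the two parameters explicit (this is the documented elastic-net penalty, written out here as a reminder):

\min_{\beta_0,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2+\lambda\Bigl[\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\Bigr]

With alpha = 1 only the L1 term remains (lasso), with alpha = 0 only the squared L2 term remains (ridge), and lambda scales the whole penalty.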
Variable selection for the following five types of models can be handled with R's glmnet package. The five types are:
binary logistic regression
multinomial logistic regression
Poisson regression
Cox proportional hazards model
SVM
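As a minimal sketch of how such a cross-validated fit is produced (df and y are placeholders for a predictor data frame and a binary response, not objects from this post), cv.fit below is the object whose pieces are described next:

library(glmnet)
x <- model.matrix(~ ., df)                      # predictors must be supplied as a matrix
cv.fit <- cv.glmnet(x, y, family = "binomial")  # k-fold cross-validation over a lambda path
plot(cv.fit)                                    # CV error as a function of log(lambda)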
x holds the input features and must be a matrix; if it is not, convert it with model.matrix().
The family argument specifies the type of regression model:
-family="gaussian" for a single continuous response
-family="mgaussian" for a multivariate continuous response
-family="poisson" for non-negative count responses
-family="binomial" for binary responses
-family="multinomial" for multi-category responses
cv.fit$lambda.min  # the best lambda: the value that minimizes the cross-validated error
cv.fit$lambda.1se  # the lambda of the simplest model whose CV error is within one standard error of the minimum;
                   # past a certain point, shrinking lambda further (and adding more variables) barely improves performance,
                   # so lambda.1se gives a model that performs well with the fewest variables
coefficients <- coef(cv.fit, s = cv.fit$lambda.min)   # coefficients at the best lambda
Active.Index <- which(coefficients != 0)              # indices of the features with non-zero coefficients
Active.coefficients <- coefficients[Active.Index]     # values of the non-zero coefficients
The problem:
The heat Y (calories per gram) released when a certain cement sets is related to its chemical components X1, X2, X3 and X4. Thirteen observations were measured (shown below); we want to pick out the main influential variables and build a regression equation of Y on the X's.
dat1 <- data.frame(X1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
X2 = c(26,29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
X3 = c(6, 15, 8, 8, 6,9, 17, 22, 18, 4, 23, 9, 8),
X4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26,34, 12, 12),
Y = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4))
dat1
# X1 X2 X3 X4 Y
# 1 7 26 6 60 78.5
# 2 1 29 15 52 74.3
# 3 11 56 8 20 104.3
# 4 11 31 8 47 87.6
# 5 7 52 6 33 95.9
# 6 11 55 9 22 109.2
# 7 3 71 17 6 102.7
# 8 1 31 22 44 72.5
# 9 2 54 18 22 93.1
# 10 21 47 4 26 115.9
# 11 1 40 23 34 83.8
# 12 11 66 9 12 113.3
# 13 10 68 8 12 109.4
Approach and steps:
Simple linear regression
Since this is a regression problem, the first instinct after getting the data is to fit a plain linear regression and see how it does. Let's go.
lm1 <- lm(Y ~ ., data = dat1)
summary(lm1)
First, the overall significance of the model: the p-value is 4.756e-07, so the model passes the F test.
Next, Adjusted R-squared is 0.9736: the model explains 97.36% of the variation in Y, which is excellent.
Finally, the part we actually care about, the main influential variables: the p-values for X1 through X4 are all above the 0.05 threshold. Awkward.
When this happens there are usually two lines of thinking:
1. The chosen variables are not sufficient and terms such as X^2 should be added; but this model already explains 97.36% of the variation, so adding variables would not improve it much.
2. The existing variables are collinear, which is what we explore below.
Checking the predictors for collinearity
The variance inflation factor (VIF) is a relative measure of how much the variance of an estimated regression coefficient is inflated by collinearity among the predictors. A common rule of thumb is that VIF > 10 indicates serious collinearity in the model.
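As a reminder of where VIF comes from: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing X_j on the remaining predictors. A minimal hand check for X1, which should agree (up to rounding) with what car::vif() reports below:

r2.x1 <- summary(lm(X1 ~ X2 + X3 + X4, data = dat1))$r.squared  # how well the other X's explain X1
1 / (1 - r2.x1)                                                 # VIF of X1, computed by hand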
library(car)
vif(lm1)
# X1 X2 X3 X4
# 38.49621 254.42317 46.86839 282.51286
All four predictors have VIF values above 10, and X2 and X4 even exceed 200, so the model clearly suffers from severe collinearity. To see this visually, let's plot the data.
plot(dat1[, 1:4])
Use cor() to show how strongly the variables are correlated:
corrplot::corrplot(cor(dat1[, 1:4]), diag = FALSE)
From the analysis above it is clear that X2 and X4, and likewise X1 and X3, are strongly correlated.
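If you want the numbers behind the corrplot, printing the rounded correlation matrix makes the two strong pairs easy to spot:

round(cor(dat1[, 1:4]), 2)   # pairwise correlations among the four predictors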
There are three possible remedies:
step(lm)
lm1 <- lm(Y ~ ., data = dat1)
summary(lm1)
Call:
lm(formula = Y ~ ., data = dat1)
Residuals:
Min 1Q Median 3Q Max
-3.1750 -1.6709 0.2508 1.3783 3.9254
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.4054 70.0710 0.891 0.3991
X1 1.5511 0.7448 2.083 0.0708 .
X2 0.5102 0.7238 0.705 0.5009
X3 0.1019 0.7547 0.135 0.8959
X4 -0.1441 0.7091 -0.203 0.8441
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared: 0.9824,Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
lm2 <- step(lm1)   # stepwise selection by AIC
summary(lm2)
Call:
lm(formula = Y ~ X1 + X2 + X4, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-3.0919 -1.8016 0.2562 1.2818 3.8982
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.6483 14.1424 5.066 0.000675 ***
X1 1.4519 0.1170 12.410 5.78e-07 ***
X2 0.4161 0.1856 2.242 0.051687 .
X4 -0.2365 0.1733 -1.365 0.205395
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared: 0.9823,Adjusted R-squared: 0.9764
F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
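As a quick sanity check on what step() did, one can compare the AIC of the full and reduced models; the fit without X3 should come out lower, which is why step() keeps it:

AIC(lm1, lm2)   # lm2 is the model selected by step(); lower AIC is better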
Ridge regression
library(glmnet)
ridg1 <- cv.glmnet(x = model.matrix(~ ., dat1[, -ncol(dat1)]),
y=dat1$Y,
family = 'gaussian',
alpha = 0,
# nfolds = 10,
nlambda = 50)
plot(ridg1)
bestlam1 = ridg1$lambda.min
[1] 1.432556
bestlam2=ridg1$lambda.1se
[1] 3.03838
pred.ridg <- predict(ridg1, s = c(bestlam1, bestlam2),
newx=model.matrix(~.,dat1[,-ncol(dat1)]))
mean((pred.ridg[,1]-dat1$Y)^2)  # when lambda = lambda.min
[1] 4.525766
mean((pred.ridg[,2]-dat1$Y)^2)  # when lambda = lambda.1se
[1] 6.223506
predict(ridg1,s=c(bestlam1,bestlam2),
newx=model.matrix(~.,dat1[,-ncol(dat1)]),
type="coefficients")
6 x 2 sparse Matrix of class "dgCMatrix"
1 2
(Intercept) 86.7090450 87.8027815
(Intercept) . .
X1 1.1010431 0.9692898
X2 0.2904642 0.2891403
X3 -0.2708346 -0.3301220
X4 -0.3433633 -0.3216678
Lasso regression
library(glmnet)
laso2 <- cv.glmnet(x = model.matrix(~ ., dat1[, -ncol(dat1)]),
y=dat1$Y,
family = 'gaussian',
alpha = 1,
# nfolds = 10,
nlambda = 50)
plot(laso2)
bestlam1 = laso2$lambda.min
[1] 0.08954262
bestlam2=laso2$lambda.1se
[1] 0.7079278
pred.laso <- predict(laso2, s = c(bestlam1, bestlam2),
newx=model.matrix(~.,dat1[,-ncol(dat1)]))
mean((pred.laso[,1]-dat1$Y)^2)  # when lambda = lambda.min
[1] 3.703116
mean((pred.laso[,2]-dat1$Y)^2)  # when lambda = lambda.1se
[1] 4.503159
# lambda.min gives the smallest prediction error; lambda.1se gives good performance with the fewest predictors
# extract the coefficients corresponding to the two lambda values
predict(laso2,s=c(bestlam1,bestlam2),
newx=model.matrix(~.,dat1[,-ncol(dat1)]),
type="coefficients")
6 x 2 sparse Matrix of class "dgCMatrix"
1 2
(Intercept) 71.7384683 73.3234255
(Intercept) . .
X1 1.4391733 1.3506153
X2 0.4143973 0.3898440
X3 . .
X4 -0.2336221 -0.2250169
Lasso regression tips
# cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
# library(doParallel)
# # Windows System
# cl <- makeCluster(detectCores())
# registerDoParallel(cl)
# cvfit = cv.glmnet(x, y, family = "binomial",
# type.measure = "class", parallel=TRUE)
# stopCluster(cl)
# # CV for 11 alpha values
# for (i in 0:10) {
# assign(paste("cvfit", i, sep=""),
# cv.glmnet(x, y, family="binomial", type.measure="class", alpha=i/10))
# }
# # Plot Solution Paths
# par(mfrow=c(3,1))
# plot(cvfit10, main="LASSO")
# plot(cvfit0, main="Ridge")
# plot(cvfit5, main="Elastic Net")
# select lambda
# cvfit = cv.glmnet(x, y, type.measure = "mse", nfolds = 20)
# select alpha
# foldid=sample(1:10,size=length(y),replace=TRUE)
# cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)
# cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)
# cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)
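Since the three fits above share the same foldid, their CV curves are directly comparable; a small hypothetical follow-up (kept commented out like the rest of this tips block) that pulls out each fit's error at its own lambda.1se:

# sapply(list(lasso = cv1, enet = cv.5, ridge = cv0),
#        function(f) f$cvm[f$lambda == f$lambda.1se])   # mean CV error at lambda.1se for each alpha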
Model comparison
rmse.laso <- c(sqrt(mean((pred.laso[,1]-dat1$Y)^2)),
               sqrt(mean((pred.laso[,2]-dat1$Y)^2)))
rmse.ridg <- c(sqrt(mean((pred.ridg[,1]-dat1$Y)^2)),
               sqrt(mean((pred.ridg[,2]-dat1$Y)^2)))
rmse.lm <- sqrt(mean(lm1$residuals^2))   # RMSE of the ordinary least-squares fit
# pl.dt
# pl.dt
#
# library(ggplot2)
# ggplot(pl.dt,aes(x=var1,y=coef1,fill=model1))+
# geom_bar(stat = 'identity',position = 'dodge')+
# geom_text(aes(x=seq(0.7,by=0.33,length.out = nrow(pl.dt)),
# y=ifelse(coef1>0,coef1+0.1,coef1-0.1),
# label=round(coef1,2)),size=3,col='darkgray')+
# labs(title='RMSE&Coef of linear/Lasso/Ridge',y='',x='')+
# theme(legend.position = 'top',
# legend.justification = 'left',
# legend.key.height = unit(0.1,'cm'),
# panel.background = element_blank(),
# axis.ticks.y = element_blank(),
# axis.text.y = element_blank(),
# axis.ticks.x = element_blank())+
# scale_fill_grey(start = 0.6,end=0)
#
#
# plot(laso2,s='lambda.min')
# update(laso2,s='lambda.min')