变量选择——lasso、SCAD、MCP的实现（R语言）

最新推荐文章于 2025-04-11 17:42:38 发布

统计学小王子

最新推荐文章于 2025-04-11 17:42:38 发布

阅读量1.6w

点赞数 18

分类专栏： R语言入门文章标签： R语言变量选择 SCAD、MCP，LASSO

本文链接：https://blog.csdn.net/weixin_46111814/article/details/119674705

版权

R语言入门专栏收录该内容

61 篇文章

订阅专栏

本文介绍了R语言中的glmnet、msaenet和ncvreg包，通过实例展示了如何使用cv.glmnet进行LASSO选择，以及aenet、ncvreg分别实现Adaptivelasso和MCP。对比了不同包在模型选择上的应用和结果。适合初学者快速入门和进阶者深入理解。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

0引言

自1996年lasso被提出以来，很多学者针对不同的模型提出有效的算法进行计算，例如多元线性线性模型、cox回归模型、广义线性模型等。也给出了很多选择惩罚参数的方式，比如cv、aic、bic。还有很多惩罚类型：lasso、适应性lasso、弹性网、SCAD、MCP。
本文主要介绍下面三个包：glmnet、ncvreg、msaenet。先汇总每个包的主要函数、方法。如下表：

函数	模型	惩罚参数	惩罚类型	包	是否弹性网
glmnet	“gaussian”, “binomial”, “poisson”, “multinomial”, “cox”, “mgaussian”	—	lasso	glmnet	是
cv.glmnet	“gaussian”, “binomial”, “poisson”, “multinomial”, “cox”, “mgaussian”	CV	lasso	glmnet	是
aenet	“gaussian”, “binomial”, “poisson”, “cox”	“cv”, “ebic”, “bic”, “aic”	Adaptive lasso	msaenet	是
asnet	“gaussian”, “binomial”, “poisson”, “cox”	“cv”, “ebic”, “bic”, “aic”	SCAD	msaenet	是
amnet	“gaussian”, “binomial”, “poisson”, “cox”	“cv”, “ebic”, “bic”, “aic”	MCP	msaenet	是
ncvreg	“gaussian”, “binomial”, “poisson”	—	“MCP”, “SCAD”, “lasso”	ncvreg	否
cv.ncvreg	“gaussian”, “binomial”, “poisson”	CV	“MCP”, “SCAD”, “lasso”	ncvreg	否

做lasso类变量选择的包有很多，上表只列出了三个我常用的。有知道更多包的大佬欢迎评论区留言或私信给我们安利，大家一块扩充知识。
上表给出了包和函数主要的参数和模型，大部分熟悉R语言的可以很容易的安装实现自己想要的模型。但是对于初学者可能有些困难，为了大部分人快速入手上述包。下面生成合适的模型数据，以高斯的误差和lasso（或者Adaptive lasso）为例，使用CV选择作为选择惩罚因子的方法给出实现例子。

1、glmnet

1.1生成数据

本文使用的数据都是本节生成的，下面是生成代码：

> # R版本 3.6.3
> Beta <- cbind(c(1,1,1,1,0,0)) # 真是的beta
> n <- 500  # 样本量
> set.seed(1234)  # 为了本文的复现，设置随机种子
> x1 <- rnorm(n)
> x2 <- rnorm(n)
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rnorm(n)
> x6 <- rnorm(n)
> x <- cbind(x1, x2, x3, x4, x5, x6)
> y <- x %*% Beta + rnorm(n, 0, 0.5)
> 
> # 数据展示
> head(x, 10)
              x1          x2         x3         x4         x5         x6
 [1,] -1.2070657  0.98477997 -1.2053334  1.7940745 -0.9738186 -1.3662376
 [2,]  0.2774292 -1.22473788  0.3014667 -1.3645489 -0.0996312  0.5392093
 [3,]  1.0844412  0.70972622 -1.5391452 -0.7074400 -0.1107350 -1.3219320
 [4,] -2.3456977 -0.10921999  0.6353707 -0.5562843  1.1921946 -0.2812887
 [5,]  0.4291247  1.78260790  0.7029518 -0.3100811 -1.6558859 -2.1049469
 [6,]  0.5060559 -0.24344468 -1.9058829 -0.3761793 -1.0456433 -1.6176047
 [7,] -0.5747400 -1.52710702  0.9389214  0.4585260 -1.7402391 -0.7237319
 [8,] -0.5466319  0.49183437 -0.2244921 -1.2611491  0.5131208  0.3067410
 [9,] -0.5644520  0.35450366 -0.6738168 -0.5274652 -0.4459568  0.2255962
[10,] -0.8900378 -0.01762635  0.4457874 -0.5568142 -1.8391938  0.9357160
> head(y, 10)
            [,1]
 [1,]  0.1739191
 [2,] -1.7533630
 [3,] -0.2984157
 [4,] -1.4562544
 [5,]  3.4013056
 [6,] -2.1985494
 [7,] -1.2608381
 [8,] -1.2924795
 [9,] -2.0985074
[10,] -0.7405859

1.2 glmnet::cv.glmnet

> (fit <- glmnet::cv.glmnet(x, y))

Call:  glmnet::cv.glmnet(x = x, y = y) 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.01019    52  0.2512 0.009655       5
1se 0.05439    34  0.2605 0.010606       4
> summary(fit)
           Length Class  Mode     
lambda     57     -none- numeric  
cvm        57     -none- numeric  
cvsd       57     -none- numeric  
cvup       57     -none- numeric  
cvlo       57     -none- numeric  
nzero      57     -none- numeric  
call        3     -none- call     
name        1     -none- character
glmnet.fit 12     elnet  list     
lambda.min  1     -none- numeric  
lambda.1se  1     -none- numeric  
index       2     -none- numeric  
> str(fit)
List of 12
 $ lambda    : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
 $ cvm       : num [1:57] 4.46 4.23 3.79 3.23 2.73 ...
 $ cvsd      : num [1:57] 0.369 0.376 0.364 0.323 0.272 ...
 $ cvup      : num [1:57] 4.83 4.6 4.16 3.56 3 ...
 $ cvlo      : num [1:57] 4.09 3.85 3.43 2.91 2.45 ...
 $ nzero     : Named int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
  ..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
 $ call      : language glmnet::cv.glmnet(x = x, y = y)
 $ name      : Named chr "Mean-Squared Error"
  ..- attr(*, "names")= chr "mse"
 $ glmnet.fit:List of 12
  ..$ a0       : Named num [1:57] -0.000854 -0.000831 0.004037 0.005898 0.007593 ...
  .. ..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
  ..$ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:236] 0 1 0 1 2 3 0 1 2 3 ...
  .. .. ..@ p       : int [1:58] 0 0 2 6 10 14 18 22 26 30 ...
  .. .. ..@ Dim     : int [1:2] 6 57
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. .. ..$ : chr [1:57] "s0" "s1" "s2" "s3" ...
  .. .. ..@ x       : num [1:236] 0.10041 0.00378 0.18545 0.09365 0.0017 ...
  .. .. ..@ factors : list()
  ..$ df       : int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
  ..$ dim      : int [1:2] 6 57
  ..$ lambda   : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
  ..$ dev.ratio: num [1:57] 0 0.0537 0.1566 0.2905 0.4016 ...
  ..$ nulldev  : num 2235
  ..$ npasses  : int 206
  ..$ jerr     : int 0
  ..$ offset   : logi FALSE
  ..$ call     : language glmnet(x = x, y = y)
  ..$ nobs     : int 500
  ..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
 $ lambda.min: num 0.0102
 $ lambda.1se: num 0.0544
 $ index     : int [1:2, 1] 52 34
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "min" "1se"
  .. ..$ : chr "Lambda"
 - attr(*, "class")= chr "cv.glmnet"
> coef(fit)
7 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 0.02380925
x1          0.94151352
x2          0.98011080
x3          0.94789578
x4          0.93311113
x5          .         
x6          .         
>

2、msaenet

> fit <- msaenet::aenet(x, y)
> summary(fit)
                  Length Class     Mode   
beta               6     dgCMatrix S4     
model             12     elnet     list   
beta.first         6     dgCMatrix S4     
model.first       12     elnet     list   
best.alpha.enet    1     -none-    numeric
best.alpha.aenet   1     -none-    numeric
best.lambda.enet   1     -none-    numeric
best.lambda.aenet  1     -none-    numeric
step.criterion     2     -none-    numeric
adpen              6     -none-    numeric
seed               1     -none-    numeric
call               3     -none-    call   
> str(fit)
List of 12
 $ beta             :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:4] 0 1 2 3
  .. ..@ p       : int [1:2] 0 4
  .. ..@ Dim     : int [1:2] 6 1
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. ..$ : chr "s0"
  .. ..@ x       : num [1:4] 0.98 1.026 0.997 0.978
  .. ..@ factors : list()
 $ model            :List of 12
  ..$ a0       : Named num 0.0248
  .. ..- attr(*, "names")= chr "s0"
  ..$ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:4] 0 1 2 3
  .. .. ..@ p       : int [1:2] 0 4
  .. .. ..@ Dim     : int [1:2] 6 1
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. .. ..$ : chr "s0"
  .. .. ..@ x       : num [1:4] 0.98 1.026 0.997 0.978
  .. .. ..@ factors : list()
  ..$ df       : int 4
  ..$ dim      : int [1:2] 6 1
  ..$ lambda   : num 1.11e+13
  ..$ dev.ratio: num 0.945
  ..$ nulldev  : num 2235
  ..$ npasses  : int 5
  ..$ jerr     : int 0
  ..$ offset   : logi FALSE
  ..$ call     : language glmnet(x = x, y = y, family = family, alpha = best.alpha.aenet, lambda = best.lambda.aenet,      penalty.factor =| __truncated__
  ..$ nobs     : int 500
  ..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
 $ beta.first       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:5] 0 1 2 3 5
  .. ..@ p       : int [1:2] 0 5
  .. ..@ Dim     : int [1:2] 6 1
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. ..$ : chr "s0"
  .. ..@ x       : num [1:5] 0.98 1.027 0.999 0.978 -0.017
  .. ..@ factors : list()
 $ model.first      :List of 12
  ..$ a0       : Named num 0.025
  .. ..- attr(*, "names")= chr "s0"
  ..$ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:5] 0 1 2 3 5
  .. .. ..@ p       : int [1:2] 0 5
  .. .. ..@ Dim     : int [1:2] 6 1
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. .. ..$ : chr "s0"
  .. .. ..@ x       : num [1:5] 0.98 1.027 0.999 0.978 -0.017
  .. .. ..@ factors : list()
  ..$ df       : int 5
  ..$ dim      : int [1:2] 6 1
  ..$ lambda   : num 0.00686
  ..$ dev.ratio: num 0.945
  ..$ nulldev  : num 2235
  ..$ npasses  : int 5
  ..$ jerr     : int 0
  ..$ offset   : logi FALSE
  ..$ call     : language glmnet(x = x, y = y, family = family, alpha = best.alpha.enet, lambda = best.lambda.enet,      penalty.factor = p| __truncated__ ...
  ..$ nobs     : int 500
  ..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
 $ best.alpha.enet  : num 0.85
 $ best.alpha.aenet : num 0.05
 $ best.lambda.enet : num 0.00686
 $ best.lambda.aenet: num 1.11e+13
 $ step.criterion   : num [1:2] 0.501 0.502
 $ adpen            : num [1:6, 1] 1.02 9.73e-01 1.00 1.02 4.50e+15 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. ..$ : chr "s0"
 $ seed             : num 1001
 $ call             : language msaenet::aenet(x = x, y = y)
 - attr(*, "class")= chr [1:2] "msaenet" "msaenet.aenet"
> coef(fit)
[1] 0.9799217 1.0258625 0.9968319 0.9781661 0.0000000 0.0000000

3、ncvreg

> fit <- ncvreg::cv.ncvreg(x, y)
> summary(fit)
MCP-penalized linear regression with n=500, p=6
At minimum cross-validation error (lambda=0.1781):
-------------------------------------------------
  Nonzero coefficients: 4
  Cross-validation error (deviance): 0.25
  R-squared: 0.94
  Signal-to-noise ratio: 16.69
  Scale estimate (sigma): 0.503
> str(fit)
List of 9
 $ cve       : num [1:100] 4.45 4.25 3.91 3.36 2.76 ...
 $ cvse      : num [1:100] 0.278 0.267 0.246 0.211 0.174 ...
 $ fold      : num [1:500] 2 2 5 1 1 2 8 9 7 4 ...
 $ lambda    : num [1:100] 1.172 1.093 1.019 0.95 0.886 ...
 $ fit       :List of 13
  ..$ beta          : num [1:7, 1:100] -0.000854 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:7] "(Intercept)" "x1" "x2" "x3" ...
  .. .. ..$ : chr [1:100] "1.17175" "1.09278" "1.01913" "0.95044" ...
  ..$ iter          : int [1:100] 0 2 4 6 4 4 4 4 4 4 ...
  ..$ lambda        : num [1:100] 1.172 1.093 1.019 0.95 0.886 ...
  ..$ penalty       : chr "MCP"
  ..$ family        : chr "gaussian"
  ..$ gamma         : num 3
  ..$ alpha         : num 1
  ..$ convex.min    : NULL
  ..$ loss          : num [1:100] 2235 2103 1926 1642 1345 ...
  ..$ penalty.factor: num [1:6] 1 1 1 1 1 1
  ..$ n             : int 500
  ..$ X             : num [1:500, 1:6] -1.169 0.267 1.047 -2.271 0.413 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. ..- attr(*, "center")= num [1:6] 0.00184 -0.05503 0.03168 -0.00266 0.05087 ...
  .. ..- attr(*, "scale")= num [1:6] 1.034 0.958 0.937 1.023 1.041 ...
  .. ..- attr(*, "nonsingular")= int [1:6] 1 2 3 4 5 6
  ..$ y             : num [1:500, 1] 0.175 -1.753 -0.298 -1.455 3.402 ...
  ..- attr(*, "class")= chr "ncvreg"
 $ min       : int 28
 $ lambda.min: num 0.178
 $ null.dev  : num 4.47
 $ Bias      : num 0.00686
 - attr(*, "class")= chr "cv.ncvreg"
> coef(fit)
(Intercept)          x1          x2          x3          x4          x5 
 0.02498015  0.98628670  1.03260613  1.00392827  0.98542619  0.00000000 
         x6 
 0.00000000

4、结果对比

> coef(fit1) # cv.glmnet
7 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 0.0235698
x1          0.9323572
x2          0.9693753
x3          0.9364369
x4          0.9224125
x5          .        
x6          .        
> coef(fit2) # cv.ncvreg
(Intercept)          x1          x2          x3          x4          x5 
 0.02483197  0.97635269  1.02273049  0.99368238  0.97404453  0.00000000 
         x6 
-0.01177071 
> coef(fit3) # aenet
[1] 0.9804661 1.0261746 0.9968471 0.9786486 0.0000000 0.0000000

5、总结

希望可以帮助大家提高R水平。
水平有限发现错误还望及时评论区指正，您的意见和批评是我不断前进的动力。

6、纠错代码

(fit1 <- glmnet::cv.glmnet(x, y, alpha = 1, family = "gaussian"))
summary(fit1)
str(fit1)
coef(fit1) # cv.glmnet
?glmnet

fit2 <- ncvreg::cv.ncvreg(x, y, family = "gaussian", penalty = "lasso")
summary(fit2)
str(fit2)
coef(fit2) # cv.ncvreg



fit3 <- msaenet::aenet(x, y, alphas = 1)
summary(fit3)
str(fit3)
coef(fit3) # aenet