ISLR (6.2) – Regularization
Ridge & Lasso: model comparison and optimal parameter selection
Key points:
1. Geometric interpretation of Ridge & Lasso
2. Ridge vs. Lasso comparison
3. Special case: number of observations equals number of features (n == p)
4. Bayesian interpretation
5. Choosing the tuning parameter
6. References
1. Geometric Interpretation of Ridge & Lasso
The Lasso and Ridge regression coefficient estimates solve the constrained problems:

$$\min_{\beta} \text{RSS} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s \quad \text{(Lasso)}$$

$$\min_{\beta} \text{RSS} \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s \quad \text{(Ridge)}$$

- For every value of $\lambda$ there is a budget $s$ such that the penalized and constrained formulations give the same coefficient estimates.
- 「$s$ controls the budget」: we look for the coefficients that minimize the RSS subject to the constraint.
- 「When $s$ is large」, the constraint is loose and the coefficients can be large.
  - The least squares coefficients may then fall inside the constraint region.
  - In that case the Lasso & Ridge estimates coincide with the least squares estimates (see the figure).
- 「When $s$ is small」, $\beta_1^2 + \beta_2^2$ (Ridge) or $|\beta_1| + |\beta_2|$ (Lasso) must not exceed $s$, and we look for the coefficients that make the 「RSS as small as possible」 within that region.
- The ellipses are centered at the least squares coefficient estimate $\hat\beta$:
  - all of the points on a given ellipse share a common value of the RSS;
  - as the ellipses expand away from the least squares coefficient estimates, the RSS increases.
- Regularization trades a small increase in bias for a large reduction in variance.
The Lasso and Ridge 「coefficient estimates」 are determined by the first point at which the constraint region touches an ellipse ==> (that point minimizes the RSS subject to the constraint).
- Since Ridge regression has a circular constraint 「with no sharp points」,
  - this intersection will NOT generally occur on an axis.
- The Lasso constraint has 「corners」 at each of the axes,
  - so the ellipse (a contour of the RSS) will often intersect the constraint region at an axis, where one of the coefficients equals 0.
- In higher dimensions (p ≥ 3), the constraint region for the Lasso becomes a polytope, and many coefficient estimates may equal zero simultaneously at the intersection point.
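The corner effect can be seen numerically. Below is a minimal sketch on synthetic data (generated with scikit-learn's `make_regression`; the penalty strength `alpha=1.0` is an arbitrary illustrative choice, not from the book): at the same penalty strength, the Lasso produces exact zeros while Ridge only shrinks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 truly informative (illustrative setup)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The L1 ball's corners produce exact zeros; the L2 ball has no corners
print("lasso exact zeros:", np.sum(lasso.coef_ == 0.0))
print("ridge exact zeros:", np.sum(ridge.coef_ == 0.0))
```

With a strictly convex circular constraint, Ridge coefficients are shrunk but remain (numerically) non-zero, so the zero counts differ sharply.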
2. Ridge vs. Lasso Comparison
Lasso advantage: it yields a simpler, more interpretable model involving only a subset of the variables. Ridge advantage: when every predictor is related to the response, Ridge has slightly lower variance, so its minimum MSE is slightly smaller than the Lasso's.
- The Lasso implicitly assumes that some of the true coefficients are exactly 0.
- When the response is a function of ONLY 2 out of 45 predictors,
  - the Lasso tends to outperform Ridge regression in terms of 「bias, variance, and MSE」.
- When only a small subset of the predictors are truly related to the response and the remaining coefficients are very small or equal to 0, Lasso wins!
- When the response is a function of many (or even all) of the predictors, Ridge wins!
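A small simulation in the spirit of this comparison (every parameter here — the sample sizes, the two signal coefficients, the noise level — is an illustrative assumption, not the book's exact setup): with a sparse truth, the Lasso's held-out MSE tends to beat Ridge's.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

# Sparse truth: only 2 of 45 predictors matter (assumed values)
rng = np.random.default_rng(1)
n, p = 200, 45
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]            # the two signal predictors

X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_tr = X_tr @ beta + rng.normal(size=n)
y_te = X_te @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X_tr, y_tr)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 30), cv=5).fit(X_tr, y_tr)

print("lasso test MSE:", mean_squared_error(y_te, lasso.predict(X_te)))
print("ridge test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
```

Flipping the truth to a dense coefficient vector (all 45 entries non-zero) tends to reverse the ranking, matching the "Ridge wins" case above.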
3. Special Case: Number of Observations Equals Number of Features (n == p)

Let $n = p$ and let $\mathbf{X}$ be the identity matrix, fitting without an intercept. Least squares then amounts to finding the coefficients minimizing $\sum_{j=1}^{p}(y_j - \beta_j)^2$, which gives $\hat\beta_j = y_j$.

「Two types of shrinkage」

Ridge & Lasso amount to finding the coefficients that solve the penalized versions of this problem, yielding

$$\hat\beta_j^{R} = \frac{y_j}{1 + \lambda} \qquad\qquad \hat\beta_j^{L} = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \le \lambda/2 \end{cases}$$

「Main Idea」:
- Ridge regression shrinks every dimension of the data by the same proportion $1/(1+\lambda)$.
- Lasso shrinks all coefficients toward zero by a similar amount $\lambda/2$;
  - sufficiently small coefficients (less than $\lambda/2$ in absolute value) are shrunken all the way to zero.
  - This is known as 「soft-thresholding」.
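The two shrinkage formulas above are easy to implement directly. A sketch with NumPy (the input vector `y` is arbitrary illustrative data):

```python
import numpy as np

def ridge_shrink(y, lam):
    # Ridge with an identity design: every entry shrunk by the same proportion
    return y / (1.0 + lam)

def lasso_shrink(y, lam):
    # Lasso with an identity design: soft-thresholding; entries within
    # lam/2 of zero are set exactly to zero
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([3.0, 0.4, -2.0, -0.1])
print(ridge_shrink(y, 1.0))   # proportional: [1.5, 0.2, -1.0, -0.05]
print(lasso_shrink(y, 1.0))   # threshold at 0.5: [2.5, 0.0, -1.5, 0.0]
```

Note how the Ridge output preserves the sign and relative ordering of every entry, while the Lasso output zeroes out the two entries below the threshold.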
4. Bayesian Interpretation
For regression, the Bayesian viewpoint assumes a prior distribution $p(\beta)$ on the coefficient vector $\beta$.
- The likelihood of the data can be written as $f(Y \mid X, \beta)$.
- Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the 「posterior distribution」:
$$p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\, p(\beta)$$
- The proportionality follows from Bayes' theorem.

「Density function $g$」: assume the prior factorizes as $p(\beta) = \prod_{j=1}^{p} g(\beta_j)$ for some density function $g$.
- If 「$g$ is a Gaussian distribution」 with mean 0 and standard deviation a function of $\lambda$,
  - it follows that the 「posterior mode」 for $\beta$ — the most likely value of $\beta$ given the data — is the Ridge regression solution.
  - In fact, the Ridge regression solution is also the 「posterior mean」.
- If 「$g$ is a Laplace (double-exponential) distribution」 with mean 0 and scale parameter a function of $\lambda$,
  - it follows that the 「posterior mode」 for $\beta$ is the Lasso solution.
  - However, the Lasso solution is 「NOT the posterior mean」;
  - in fact, the posterior mean DOES NOT yield a sparse coefficient vector.
From the Bayesian viewpoint, Ridge & Lasso assume the same linear model with normal errors as ordinary regression; the difference lies in the prior:
- The Lasso prior is 「steeply peaked」 at zero,
  - so it expects a priori that many of the coefficients are exactly zero.
- The Ridge (Gaussian) prior is 「flatter」 at zero,
  - so it assumes the coefficients are randomly distributed about zero.

【Simple explanation】
- L1 regularization is equivalent to placing a Laplace prior on the model parameters $\beta$;
- L2 regularization is equivalent to placing a Gaussian prior;
- and the Laplace prior makes it more likely that a parameter is exactly 0.
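This equivalence can be checked numerically: the negative log of a Gaussian prior equals the L2 penalty plus a constant, and the negative log of a Laplace prior equals the L1 penalty plus a constant. A sketch with SciPy (the vector `beta` and the scales `sigma`, `b` are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import norm, laplace

beta = np.array([0.8, -1.5, 0.0, 2.3])   # hypothetical coefficients
sigma, b = 1.3, 0.7                       # prior scales (functions of lambda)

# Negative log Gaussian prior = sum(beta^2) / (2*sigma^2) + const -> L2 penalty
neg_log_gauss = -norm.logpdf(beta, scale=sigma).sum()
l2 = beta @ beta / (2 * sigma**2)

# Negative log Laplace prior = sum(|beta|) / b + const -> L1 penalty
neg_log_lap = -laplace.logpdf(beta, scale=b).sum()
l1 = np.abs(beta).sum() / b

# Both differences are constants that do not depend on beta,
# so maximizing the posterior = minimizing RSS + penalty
print(neg_log_gauss - l2)   # equals (n/2) * log(2*pi*sigma^2), here n = 4
print(neg_log_lap - l1)     # equals n * log(2*b)
```

Since adding a constant does not change the minimizer, maximizing the log-posterior (log-likelihood plus log-prior) is exactly minimizing the RSS plus the corresponding penalty, which is why the posterior mode recovers Ridge or Lasso.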
5. Choosing the Tuning Parameter (6.2.3)
To select the best tuning parameter $\lambda$:
- choose a grid of $\lambda$ values, and compute the CV error for each value of $\lambda$;
- select the $\lambda$ for which the CV error is smallest;
- finally, re-fit the model using all of the available observations and the selected value of the tuning parameter.
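The three steps above can be sketched with scikit-learn (synthetic data standing in for a real dataset; note that scikit-learn calls the tuning parameter `alpha` rather than `lambda`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (not the ISLR Credit dataset)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Step 1-2: grid of lambda values; compute CV error for each and pick the best
lambdas = np.logspace(-3, 3, 50)
model = RidgeCV(alphas=lambdas, cv=5)
model.fit(X, y)   # step 3: the final model is refit on all observations

print("selected lambda:", model.alpha_)
```

`LassoCV` follows the same pattern and can even build the `alphas` grid automatically along the regularization path.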
「LOOCV on Ridge from the Credit data:」
In the Credit dataset, the CV error is minimized at a relatively small value of $\lambda$, so the optimal fit involves only a small amount of shrinkage relative to least squares.
- The dip at the beginning is not very pronounced, so there is rather a wide range of $\lambda$ values that would give very similar error.
- In this case we might simply use the 「least squares solution」.
「10-fold CV on Lasso (simulated sparse data):」
- Not only has the Lasso correctly given much larger coefficient estimates to the two signal predictors,
- but the minimum CV error also corresponds to a set of coefficient estimates for which only the signal variables are non-zero,
- even though this is a challenging setting, with p = 45 and n = 50.
- In the least squares results, by contrast, one of the signal variables' coefficients is close to 0.
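A sketch of this setting (the signal coefficients and noise level are assumptions made for illustration; this is not the book's exact simulation):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# n = 50, p = 45, and only 2 predictors truly related to the response
rng = np.random.default_rng(0)
n, p = 50, 45
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [4.0, -3.0]            # the two signal predictors (assumed values)
y = X @ beta + rng.normal(size=n)

# 10-fold CV chooses lambda at the minimum CV error, then refits on all data
model = LassoCV(cv=10, random_state=0).fit(X, y)
nonzero = np.flatnonzero(model.coef_)
print("non-zero coefficients at min-CV lambda:", nonzero)
```

Even with p approaching n, the two signal predictors receive large non-zero estimates, while most of the 43 noise predictors are zeroed out.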
6. References:
- 《Introduction to Statistical Learning》
- 《老董聊卡》
- 《百面机器学习》
TOGO: (7.1) Moving Beyond Linearity!