ISLR (6.2) – Regularization
Ridge & Lasso: model comparison and optimal parameter selection
Key points:
1. Geometric interpretation of Ridge & Lasso
2. Ridge vs. Lasso comparison
3. Special case: number of observations equals number of features (n == p)
4. Bayesian interpretation
5. Choosing the tuning parameter
6. References
1. Geometric Interpretation of Ridge & Lasso
The Lasso and Ridge regression coefficient estimates solve the constrained problems:

$$\min_{\beta} \text{RSS} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s \quad \text{(Lasso)}$$

$$\min_{\beta} \text{RSS} \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s \quad \text{(Ridge)}$$

- For every value of $\lambda$ there is a budget $s$ such that the penalized and constrained formulations give the same coefficient estimates.
- 「$s$ controls the budget」: we look for the coefficients that minimize the RSS subject to the constraint.
- 「When $s$ is large」, the constraint is loose and the coefficients can be large.
  - The least squares coefficients may then fall inside the constraint region.
  - In that case the Lasso & Ridge estimates coincide with the least squares estimates (see the figure).
- 「When $s$ is small」, $\beta_1^2 + \beta_2^2$ (Ridge) or $|\beta_1| + |\beta_2|$ (Lasso) must not exceed $s$, and we look for the coefficients that make the 「RSS as small as possible」 within that region.
- The ellipses are centered at the least squares coefficient estimate $\hat\beta$:
  - all of the points on a given ellipse share a common value of the RSS;
  - as the ellipses expand away from the least squares coefficient estimates, the RSS increases.
- Regularization trades a small increase in bias for a large reduction in variance.
The Lasso and Ridge 「coefficient estimates」 are determined by the first point at which the constraint region touches an ellipse ==> (that point minimizes the RSS subject to the constraint).
- Since Ridge regression has a circular constraint 「with no sharp points」,
  - this intersection will NOT generally occur on an axis.
- The Lasso constraint has 「corners」 at each of the axes,
  - so the ellipse (a contour of the RSS) will often intersect the constraint region at an axis, where one of the coefficients equals 0.
- In higher dimensions (p ≥ 3), the constraint region for the Lasso becomes a polytope, and many coefficient estimates may equal zero simultaneously at the intersection point.
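The corner effect can be seen numerically. Below is a minimal sketch on synthetic data (generated with scikit-learn's `make_regression`; the penalty strength `alpha=1.0` is an arbitrary illustrative choice, not from the book): at the same penalty strength, the Lasso produces exact zeros while Ridge only shrinks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 truly informative (illustrative setup)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The L1 ball's corners produce exact zeros; the L2 ball has no corners
print("lasso exact zeros:", np.sum(lasso.coef_ == 0.0))
print("ridge exact zeros:", np.sum(ridge.coef_ == 0.0))
```

With a strictly convex circular constraint, Ridge coefficients are shrunk but remain (numerically) non-zero, so the zero counts differ sharply.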
2. Ridge vs. Lasso Comparison
Lasso advantage: it yields a simpler, more interpretable model involving only a subset of the variables. Ridge advantage: when every predictor is related to the response, Ridge has slightly lower variance, so its minimum MSE is slightly smaller than the Lasso's.
- The Lasso implicitly assumes that some of the true coefficients are exactly 0.
- When the response is a function of ONLY 2 out of 45 predictors,
  - the Lasso tends to outperform Ridge regression in terms of 「bias, variance, and MSE」.
- When only a small subset of the predictors are truly related to the response and the remaining coefficients are very small or equal to 0, Lasso wins!
- When the response is a function of many (or even all) of the predictors, Ridge wins!
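A small simulation in the spirit of this comparison (every parameter here — the sample sizes, the two signal coefficients, the noise level — is an illustrative assumption, not the book's exact setup): with a sparse truth, the Lasso's held-out MSE tends to beat Ridge's.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

# Sparse truth: only 2 of 45 predictors matter (assumed values)
rng = np.random.default_rng(1)
n, p = 200, 45
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]            # the two signal predictors

X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_tr = X_tr @ beta + rng.normal(size=n)
y_te = X_te @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X_tr, y_tr)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 30), cv=5).fit(X_tr, y_tr)

print("lasso test MSE:", mean_squared_error(y_te, lasso.predict(X_te)))
print("ridge test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
```

Flipping the truth to a dense coefficient vector (all 45 entries non-zero) tends to reverse the ranking, matching the "Ridge wins" case above.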
3. Special Case: Number of Observations Equals Number of Features (n == p)

Let $n = p$ and let $\mathbf{X}$ be the identity matrix, fitting without an intercept. Least squares then amounts to finding the coefficients minimizing $\sum_{j=1}^{p}(y_j - \beta_j)^2$, which gives $\hat\beta_j = y_j$.

「Two types of shrinkage」

Ridge & Lasso amount to finding the coefficients that solve the penalized versions of this problem, yielding

$$\hat\beta_j^{R} = \frac{y_j}{1 + \lambda} \qquad\qquad \hat\beta_j^{L} = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \le \lambda/2 \end{cases}$$

「Main Idea」:
- Ridge regression shrinks every dimension of the data by the same proportion $1/(1+\lambda)$.
- Lasso shrinks all coefficients toward zero by a similar amount $\lambda/2$;
  - sufficiently small coefficients (less than $\lambda/2$ in absolute value) are shrunken all the way to zero.
  - This is known as 「soft-thresholding」.
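The two shrinkage formulas above are easy to implement directly. A sketch with NumPy (the input vector `y` is arbitrary illustrative data):

```python
import numpy as np

def ridge_shrink(y, lam):
    # Ridge with an identity design: every entry shrunk by the same proportion
    return y / (1.0 + lam)

def lasso_shrink(y, lam):
    # Lasso with an identity design: soft-thresholding; entries within
    # lam/2 of zero are set exactly to zero
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([3.0, 0.4, -2.0, -0.1])
print(ridge_shrink(y, 1.0))   # proportional: [1.5, 0.2, -1.0, -0.05]
print(lasso_shrink(y, 1.0))   # threshold at 0.5: [2.5, 0.0, -1.5, 0.0]
```

Note how the Ridge output preserves the sign and relative ordering of every entry, while the Lasso output zeroes out the two entries below the threshold.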
4. Bayesian Interpretation
For regression, the Bayesian viewpoint assumes a prior distribution $p(\beta)$ on the coefficient vector $\beta$.
- The likelihood of the data can be written as $f(Y \mid X, \beta)$.
- Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the 「posterior distribution」:
$$p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\, p(\beta)$$
- The proportionality follows from Bayes' theorem.

「Density function $g$」: assume the prior factorizes as $p(\beta) = \prod_{j=1}^{p} g(\beta_j)$ for some density function $g$.
- If 「$g$ is a Gaussian distribution」 with mean 0 and standard deviation a function of $\lambda$,
  - it follows that the 「posterior mode」 for $\beta$ — the most likely value of $\beta$ given the data — is the Ridge regression solution.
  - In fact, the Ridge regression solution is also the 「posterior mean」.
- If 「$g$ is a Laplace (double-exponential) distribution」 with mean 0 and scale parameter a function of $\lambda$,
  - it follows that the 「posterior mode」 for $\beta$ is the Lasso solution.
  - However, the Lasso solution is 「NOT the posterior mean」;
  - in fact, the posterior mean DOES NOT yield a sparse coefficient vector.
From the Bayesian viewpoint, Ridge & Lasso assume the same linear model with normal errors as ordinary regression; the difference lies in the prior:
- The Lasso prior is 「steeply peaked」 at zero,
  - so it expects a priori that many of the coefficients are exactly zero.
- The Ridge (Gaussian) prior is 「flatter」 at zero,
  - so it assumes the coefficients are randomly distributed about zero.

【Simple explanation】
- L1 regularization is equivalent to placing a Laplace prior on the model parameters $\beta$;
- L2 regularization is equivalent to placing a Gaussian prior;
- and the Laplace prior makes it more likely that a parameter is exactly 0.
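This equivalence can be checked numerically: the negative log of a Gaussian prior equals the L2 penalty plus a constant, and the negative log of a Laplace prior equals the L1 penalty plus a constant. A sketch with SciPy (the vector `beta` and the scales `sigma`, `b` are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import norm, laplace

beta = np.array([0.8, -1.5, 0.0, 2.3])   # hypothetical coefficients
sigma, b = 1.3, 0.7                       # prior scales (functions of lambda)

# Negative log Gaussian prior = sum(beta^2) / (2*sigma^2) + const -> L2 penalty
neg_log_gauss = -norm.logpdf(beta, scale=sigma).sum()
l2 = beta @ beta / (2 * sigma**2)

# Negative log Laplace prior = sum(|beta|) / b + const -> L1 penalty
neg_log_lap = -laplace.logpdf(beta, scale=b).sum()
l1 = np.abs(beta).sum() / b

# Both differences are constants that do not depend on beta,
# so maximizing the posterior = minimizing RSS + penalty
print(neg_log_gauss - l2)   # equals (n/2) * log(2*pi*sigma^2), here n = 4
print(neg_log_lap - l1)     # equals n * log(2*b)
```

Since adding a constant does not change the minimizer, maximizing the log-posterior (log-likelihood plus log-prior) is exactly minimizing the RSS plus the corresponding penalty, which is why the posterior mode recovers Ridge or Lasso.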
5. Choosing the Tuning Parameter (6.2.3)
To select the best tuning parameter $\lambda$:
- choose a grid of $\lambda$ values, and compute the CV error for each value of $\lambda$;
- select the $\lambda$ for which the CV error is smallest;
- finally, re-fit the model using all of the available observations and the selected value of the tuning parameter.
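The three steps above can be sketched with scikit-learn (synthetic data standing in for a real dataset; note that scikit-learn calls the tuning parameter `alpha` rather than `lambda`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (not the ISLR Credit dataset)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Step 1-2: grid of lambda values; compute CV error for each and pick the best
lambdas = np.logspace(-3, 3, 50)
model = RidgeCV(alphas=lambdas, cv=5)
model.fit(X, y)   # step 3: the final model is refit on all observations

print("selected lambda:", model.alpha_)
```

`LassoCV` follows the same pattern and can even build the `alphas` grid automatically along the regularization path.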
「LOOCV on Ridge from the Credit data:」
In the Credit dataset, the CV error is minimized at a relatively small value of $\lambda$, so the optimal fit involves only a small amount of shrinkage relative to least squares.
- The dip at the beginning is not very pronounced, so there is rather a wide range of $\lambda$ values that would give very similar error.
- In this case we might simply use the 「least squares solution」.
「10-fold CV on Lasso (simulated sparse data):」
- Not only has the Lasso correctly given much larger coefficient estimates to the two signal predictors,
- but the minimum CV error also corresponds to a set of coefficient estimates for which only the signal variables are non-zero,
- even though this is a challenging setting, with p = 45 and n = 50.
- In the least squares results, by contrast, one of the signal variables' coefficients is close to 0.
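A sketch of this setting (the signal coefficients and noise level are assumptions made for illustration; this is not the book's exact simulation):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# n = 50, p = 45, and only 2 predictors truly related to the response
rng = np.random.default_rng(0)
n, p = 50, 45
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [4.0, -3.0]            # the two signal predictors (assumed values)
y = X @ beta + rng.normal(size=n)

# 10-fold CV chooses lambda at the minimum CV error, then refits on all data
model = LassoCV(cv=10, random_state=0).fit(X, y)
nonzero = np.flatnonzero(model.coef_)
print("non-zero coefficients at min-CV lambda:", nonzero)
```

Even with p approaching n, the two signal predictors receive large non-zero estimates, while most of the 43 noise predictors are zeroed out.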
6. References:
- 《Introduction to Statistical Learning》
- 《老董聊卡》
- 《百面机器学习》
TOGO: (7.1) Moving Beyond Linearity!