An Introduction to Statistical Learning | Reading Notes 10 | Ridge & Lasso (2)

ISLR (6.2) - Regularization

Ridge & Lasso: model comparison and choosing the optimal tuning parameter

Key points:
1. Geometric interpretation of Ridge & Lasso
2. Ridge & Lasso comparison
3. Special case: number of observations equals number of predictors (n == p)
4. Bayesian interpretation
5. Choosing the tuning parameter
6. References

1. Geometric Interpretation of Ridge & Lasso

The Lasso and Ridge Regression coefficient estimates solve the constrained problems:

$$\underset{\beta}{\text{minimize}}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s \qquad \text{(Lasso)}$$

$$\underset{\beta}{\text{minimize}}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s \qquad \text{(Ridge)}$$

  • For every value of λ, there is some s such that the equations above give the lasso & ridge coefficient estimates
  • We 「control s」 to find the coefficients that minimize the RSS under the constraint
  • 「When s is large」, the constraint is loose and the coefficients can be large
    • the least squares coefficients may then fall inside the constraint region
    • in that case the Lasso & Ridge estimates coincide with the least squares estimates (figure below)
  • 「When s is small」, the coefficients must be kept small enough (the budget s is not exceeded) while making the 「RSS as small as possible」

[Figure: contours of the RSS and the constraint regions for the lasso (left) and ridge regression (right)]

The ellipses centered at the least squares coefficient estimate $\hat{\beta}$ represent contours of constant RSS
  • all of the points on a given ellipse share a common value of the RSS
  • as the ellipses expand away from the least squares coefficient estimates, the RSS increases
    • we trade a small increase in bias for a large reduction in variance

The Lasso and ridge regression 「coefficient estimates」 are given by the first point at which an ellipse contacts the constraint region ==> (the smallest RSS achievable under the constraint)

  • Since ridge regression has a circular constraint 「with no sharp points」
    • this intersection will NOT generally occur on an axis
  • While the Lasso constraint has 「corners」 at each of the axes
    • the ellipse (contour of the RSS) will often intersect the constraint region at an axis, so one of the coefficients equals 0
    • In higher dimensions (p > 3), the lasso constraint becomes a polytope, and many of the coefficient estimates may equal zero simultaneously at the intersection point (the small simulation after this list illustrates this)
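
As a rough check of this geometric picture, the sketch below fits both estimators on simulated data and counts how many coefficients end up exactly at zero. The data, the regularization strengths, and the use of scikit-learn's Lasso/Ridge are my own assumptions for illustration, not from the book.

```python
# Minimal sketch: on simulated data the lasso's corners produce exact zeros,
# while ridge's circular constraint only shrinks coefficients toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))  # only 2 signal predictors
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of lambda
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso coefficients equal to 0:", int(np.sum(lasso.coef_ == 0)))  # several
print("ridge coefficients equal to 0:", int(np.sum(ridge.coef_ == 0)))  # usually 0
```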

2. Ridge & Lasso Comparison

Lasso advantage: it produces a simpler, more interpretable model involving only a subset of the predictors.
Ridge advantage: when every predictor is related to the response, ridge regression has slightly lower variance, so its minimum MSE is slightly smaller than the Lasso's.

  • The Lasso implicitly assumes that a number of the true coefficients are exactly 0

[Figure: bias, variance, and MSE for the lasso and ridge regression on simulated data]

When the response is a function of ONLY 2 out of 45 predictors

  • the LASSO tends to outperform ridge regression in terms of 「bias, variance, and MSE」

[Figure: bias, variance, and MSE when the response is a function of only 2 of the 45 predictors]
  • When only a small number of the predictors are truly relevant and the remaining coefficients are very small or exactly zero, LASSO wins!
  • When the response is a function of many (or even all) of the predictors, Ridge wins! (the sketch below compares the two regimes)
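
A small simulation in the same spirit: the data-generating settings, the CV folds, and the use of LassoCV/RidgeCV are assumptions for illustration, not the book's experiment. On average the lasso should do better under the sparse coefficient vector and ridge under the dense one.

```python
# Minimal sketch: test MSE of lasso vs ridge under sparse and dense true coefficients.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n, p = 100, 45

def test_mse(beta_true):
    X, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(size=n)
    y_test = X_test @ beta_true + rng.normal(size=n)
    lasso = LassoCV(cv=5).fit(X, y)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X, y)
    return (mean_squared_error(y_test, lasso.predict(X_test)),
            mean_squared_error(y_test, ridge.predict(X_test)))

sparse_beta = np.concatenate([[5.0, -5.0], np.zeros(p - 2)])  # only 2 of 45 matter
dense_beta = rng.normal(size=p)                               # all 45 matter

print("sparse beta (lasso MSE, ridge MSE):", test_mse(sparse_beta))
print("dense  beta (lasso MSE, ridge MSE):", test_mse(dense_beta))
```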

3. Special Case: Number of Observations Equals Number of Predictors (n == p)

Let $\mathbf{X}$ be an $n \times n$ diagonal matrix with 1's on the diagonal and 0's in all off-diagonal elements, and consider regression without an intercept. Least squares then finds $\beta_1, \dots, \beta_p$ that minimize

$$\sum_{j=1}^{p}(y_j - \beta_j)^2$$

「Two types of shrinkage」
Ridge & LASSO amount to finding the coefficients that minimize, respectively:

$$\sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \quad \text{(Ridge)} \qquad\qquad \sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \quad \text{(LASSO)}$$

「Main Idea」

  • Ridge Regression shrinks every dimension of the data by the same proportion
  • LASSO shrinks all coefficients toward zero by a similar amount -
    • sufficiently small coefficients (absolute value less than λ/2) are shrunken all the way to zero
      • 「Soft-thresholding」 (see the sketch after this list)
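
In this diagonal-design special case the solutions have closed forms (ISLR eqs. 6.14-6.15): the ridge estimate is $\hat\beta_j = y_j/(1+\lambda)$ and the lasso estimate is $\hat\beta_j = \operatorname{sign}(y_j)\,(|y_j| - \lambda/2)_+$. A minimal sketch of the two shrinkage rules (the example numbers are made up):

```python
# Minimal sketch of the two shrinkage rules in the n == p diagonal-design case.
import numpy as np

def ridge_shrink(y, lam):
    # proportional shrinkage: beta_j = y_j / (1 + lambda)
    return y / (1.0 + lam)

def lasso_soft_threshold(y, lam):
    # soft-thresholding: beta_j = sign(y_j) * max(|y_j| - lambda/2, 0)
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-3.0, -0.4, 0.2, 1.0, 4.0])
lam = 2.0
print(ridge_shrink(y, lam))          # every entry shrunk by the same factor 1/3
print(lasso_soft_threshold(y, lam))  # entries with |y_j| <= 1 become exactly 0
```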

4. Bayesian Interpretation

For regression, the Bayesian viewpoint assumes that the coefficient vector $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ has a prior distribution, say $p(\beta)$.
  • The likelihood of the data can be written as $f(Y \mid X, \beta)$, where $X = (X_1, \dots, X_p)$
  • Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the 「posterior distribution」:

$$p(\beta \mid X, Y) \;\propto\; f(Y \mid X, \beta)\, p(\beta \mid X) \;=\; f(Y \mid X, \beta)\, p(\beta)$$

  • the proportionality follows from Bayes' theorem

「Density function g」
Assume the prior takes the form $p(\beta) = \prod_{j=1}^{p} g(\beta_j)$ for some density function $g$. Ridge regression and the LASSO correspond to the following two choices of $g$:
  1. If 「$g$ is a Gaussian distribution」 with mean 0 and standard deviation a function of $\lambda$
    • it follows that the 「posterior mode」 for $\beta$ is the ridge regression solution, i.e. the most likely value of $\beta$ given the data
    • in fact, the ridge regression solution is also the 「posterior mean」
  2. If 「$g$ is a Laplace (double-exponential) distribution」 with mean 0 and scale parameter a function of $\lambda$
    • it follows that the 「posterior mode」 for $\beta$ is the LASSO solution
    • however, the LASSO solution is 「NOT the posterior mean」
    • in fact, the posterior mean does NOT yield a sparse coefficient vector

[Figure: the Gaussian (ridge) and Laplace (lasso) prior densities for a coefficient β_j]

From the Bayesian viewpoint, Ridge & LASSO follow the same linear model as ordinary least squares, with normal errors, and additionally assume a simple prior distribution for $\beta$:
  • the LASSO prior is 「steeply peaked」 at zero
    • it expects a priori that many of the coefficients are exactly zero
  • the Ridge (Gaussian) prior is 「flatter and fatter」 at zero
    • it assumes the coefficients are randomly distributed about zero

【Simple explanation】

  • L1 regularization is equivalent to placing a Laplace prior on the model parameters $\beta$
  • L2 regularization is equivalent to placing a Gaussian prior
  • the Laplace prior makes it more likely that a parameter is exactly 0 (a short derivation follows below)
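
A compact derivation of this correspondence (standard material, not spelled out in the original notes), assuming Gaussian errors $Y \mid X, \beta \sim N(X\beta, \sigma^2 I)$: taking the negative log of the posterior gives

$$-\log p(\beta \mid X, Y) \;=\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \;-\; \sum_{j=1}^{p}\log g(\beta_j) \;+\; \text{const}$$

With a Gaussian prior $g(\beta_j) \propto \exp\!\big(-\beta_j^2/(2c)\big)$ the penalty term becomes $\frac{1}{2c}\sum_j \beta_j^2$, so minimizing the negative log posterior is ridge regression with $\lambda = \sigma^2/c$; with a Laplace prior $g(\beta_j) \propto \exp\!\big(-|\beta_j|/b\big)$ it becomes $\frac{1}{b}\sum_j |\beta_j|$, i.e. the LASSO with $\lambda = 2\sigma^2/b$.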

5. Choosing the Tuning Parameter (6.2.3)

To select the best tuning parameter $\lambda$, cross-validation provides a simple way to tackle this problem (see the sketch after this list):
  • choose a grid of $\lambda$ values, and compute the CV error for each value of $\lambda$
  • select the $\lambda$ value with the smallest CV error
  • finally, re-fit the model using all of the available observations and the selected value of the tuning parameter
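
A minimal sketch of this procedure with scikit-learn; the grid of alpha values, the p = 45 / n = 50 simulated setup, and the 10 folds are my choices for illustration, not the book's exact experiment.

```python
# Minimal sketch: grid of tuning-parameter values, 10-fold CV error for each,
# pick the value with the smallest CV error, then re-fit on all observations.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 50, 45
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)   # only 2 signal predictors

alphas = np.logspace(-3, 1, 30)                      # grid of lambda (alpha) values
cv_errors = [
    -cross_val_score(Lasso(alpha=a, max_iter=10_000), X, y,
                     cv=10, scoring="neg_mean_squared_error").mean()
    for a in alphas
]

best_alpha = alphas[int(np.argmin(cv_errors))]       # smallest CV error
final_model = Lasso(alpha=best_alpha, max_iter=10_000).fit(X, y)  # re-fit on all data
print(best_alpha, int(np.sum(final_model.coef_ != 0)))
```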

「LOOCV on Ridge from Credit Data:」

[Figure: LOOCV error and ridge coefficient estimates as a function of λ for the Credit data]

For the Credit dataset, the cross-validation-selected $\lambda$ is relatively small,
so the optimal fit involves only a small amount of shrinkage relative to the least squares solution.

The dip in the CV error curve is not very pronounced, so there is a rather wide range of $\lambda$ values that would give very similar error

  • in a case like this we might simply use the 「least squares solution」

「10-fold CV on LASSO:」

[Figure: 10-fold CV error and lasso coefficient estimates for simulated data with p = 45, n = 50]

Not only has the LASSO correctly given much larger coefficient estimates to the two signal predictors,

  • but also the minimum CV Error corresponds to a set of coefficient estimates for which only the signal variables are non-zero
  • even though this is a challenging setting, with p = 45 and n = 50
    • in the least squares result, by contrast, one of the signal variables is given a coefficient close to 0

6. References:

  • 《Introduction to Statistical Learning》
  • 《老董聊卡》
  • 《百面机器学习》

TOGO: (7.1) Moving Beyond Linearity!
