The Difference Between L1 and L2 Regularization

L1 Norms vs. L2 Norms

An explanation of the L1 and L2 norms and their major differences.

This blog post is largely sourced from HERE.

L1 Norm Versus L2 Norm

The L1 norm, also known as the Manhattan distance or taxicab norm, is the sum of the magnitudes of a vector's components. It is a natural way to measure the distance between vectors: the sum of the absolute differences of their components. In this norm, all components of the vector are weighted equally.

The L2 norm, also known as the Euclidean norm, measures the shortest straight-line distance from one point to another.
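To make the two distances concrete, here is a minimal sketch (the vectors a and b are just illustrative values, not from the source) that computes both norms of the difference between two vectors with NumPy:

```python
# A minimal sketch, assuming NumPy is available: compute the L1 and L2
# norms of the difference between two illustrative vectors a and b.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

diff = a - b
l1 = np.sum(np.abs(diff))          # taxicab / Manhattan distance
l2 = np.sqrt(np.sum(diff ** 2))    # Euclidean distance

print(l1)  # 5.0   (|-3| + |2| + |0|)
print(l2)  # ~3.606 (sqrt(9 + 4 + 0))
```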

Ridge regression and lasso regression are two different techniques for increasing the robustness of ordinary least squares regression against collinearity. Both of these algorithms attempt to minimize a cost function.

The cost is a function of two terms: one, the residual sum of squares (RSS), taken from ordinary least squares; the other, an additional regularizer penalty. The second term is an L2 norm in ridge regression, and an L1 norm in lasso regression.

In ordinary least squares, we solve to minimize the following cost function:

$Cost = (y - x\beta)^T(y - x\beta)$

This term is the RSS, residual sum of squares. In ridge regression we instead solve:

$Cost = (y - x\beta)^T(y - x\beta) + \lambda\beta^T\beta$

The $\lambda\beta^T\beta$ term is an L2 norm.

In lasso regression we instead solve:

$Cost = (y - x\beta)^T(y - x\beta) + \lambda|\beta|$

The $\lambda|\beta|$ term is an L1 norm.

At a higher level, the chief difference between the L1 and the L2 terms is that the L2 term is proportional to the square of the β values, while the L1 term is proportional to the absolute value of the values in β. This fundamental difference accounts for all of the differences between how lasso regression and ridge regression "work".
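As a concrete illustration, the three costs can be written out directly in NumPy. Everything below (the data, the candidate β, and the value of λ) is made up for illustration; this is a sketch of the formulas themselves, not of any particular solver:

```python
# A hedged sketch of the OLS, ridge, and lasso cost functions above.
# x, y, beta, and lambda_ are all illustrative values, not a fitted model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))             # 20 observations, 3 features
beta_true = np.array([1.5, -2.0, 0.0])
y = x @ beta_true + rng.normal(scale=0.1, size=20)

beta = np.array([1.0, -1.0, 0.5])        # some candidate coefficients
lambda_ = 0.1

residual = y - x @ beta
rss = residual @ residual                            # (y - x beta)^T (y - x beta)

ols_cost   = rss
ridge_cost = rss + lambda_ * (beta @ beta)           # + lambda * beta^T beta
lasso_cost = rss + lambda_ * np.sum(np.abs(beta))    # + lambda * |beta|

print(ols_cost, ridge_cost, lasso_cost)
```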

Definition of a Norm

A norm is a mathematical function applied to a vector (like the vector β above). The norm of a vector maps it to a value in [0, ∞). In machine learning, norms are useful because they are used to express distances: this vector and that vector are so-and-so far apart, according to this-or-that norm.

Going a bit further, we define $||x||_p$ as a "p-norm". Given x, a vector with i components, a p-norm is defined as:

$||x||_p = \left(\sum_i |x_i|^p\right)^{1/p}$
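The definition translates almost directly into code. A small sketch (the test vector is arbitrary), cross-checked against NumPy's built-in norm:

```python
# A sketch of the p-norm definition, checked against numpy.linalg.norm
# (which takes the order p via its `ord` argument).
import numpy as np

def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p)"""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
print(p_norm(x, 1), np.linalg.norm(x, ord=1))  # 7.0, 7.0  (taxicab)
print(p_norm(x, 2), np.linalg.norm(x, ord=2))  # 5.0, 5.0  (Euclidean)
```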

The simplest norm conceptually is Euclidean distance. This is what we typically think of as distance between two points in space:

$||x||_2 = \sqrt{\sum_i x_i^2} = \sqrt{x_1^2 + x_2^2 + \dots + x_i^2}$

Another common norm is taxicab distance, which is the 1-norm:

$||x||_1 = \sum_i |x_i| = |x_1| + |x_2| + \dots + |x_i|$

Taxicab distance is so-called because it emulates moving between two points as though you are moving through the streets of Manhattan in a taxi cab. Instead of measuring the distance “as the crow flies” it measures the right-angle distance between two points:

[Figure: Manhattan distance versus Euclidean distance — https://upload.wikimedia.org/wikipedia/commons/0/08/Manhattan_distance.svg]

p-norms and regularization

Taxicab distance is the 1-norm, also known as the L1 norm. The L2 norm is actually the 2-norm, Euclidean distance, squared. Hence, we can rewrite our cost equations as:

$\text{Ridge Cost} = (y - X\beta)^T(y - X\beta) + \lambda||\beta||_2^2$

$\text{Lasso Cost} = (y - X\beta)^T(y - X\beta) + \lambda||\beta||_1$

This process of adding a norm to our cost function is known as regularization. We can regularize the data for different underlying reasons and with different effects. In the case of ridge and lasso regression, both of these regularizers are built to address collinearity and model complexity; but as we saw in earlier notebooks, the way in which they go about doing so is fundamentally different.
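In practice one rarely minimizes these cost functions by hand. A hedged sketch using scikit-learn's Ridge and Lasso estimators (where the `alpha` parameter plays the role of λ, and the data is synthetic) looks like this:

```python
# A sketch of fitting the two regularized models with scikit-learn.
# The synthetic data is illustrative only; `alpha` corresponds to lambda.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
coef = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ coef + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2-penalized cost
lasso = Lasso(alpha=0.1).fit(X, y)   # L1-penalized cost

print(ridge.coef_)
print(lasso.coef_)
```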

L1-L2 norm comparisons

Robustness: L1 > L2

Robustness is defined as resistance to outliers in a dataset. The more able a model is to ignore extreme values in the data, the more robust it is.

The L1 norm is more robust than the L2 norm, for fairly obvious reasons: the L2 norm squares values, so it increases the cost of outliers quadratically; the L1 norm only takes the absolute value, so it weights them linearly.
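A tiny worked example (the residual values are made up) shows the effect of squaring: a single outlying value of 10 contributes 100 to an L2-style penalty but only 10 to an L1-style one.

```python
# Illustrative residuals only; the last value plays the role of an outlier.
import numpy as np

residuals = np.array([1.0, -2.0, 1.5, 10.0])

print(np.abs(residuals))   # [ 1.    2.    1.5  10.  ]  -> L1 contributions
print(residuals ** 2)      # [ 1.    4.    2.25 100. ]  -> L2 contributions
```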

Stability: L2 > L1

Stability is defined as resistance to horizontal adjustments in the data; it is the perpendicular counterpart of robustness, which concerns vertical (outlier) adjustments.

The L2 norm is more stable than the L1 norm. A later notebook will explore why.

Number of solutions: L2 one, L1 many

Because L2 is Euclidean distance, there is always one right answer as to how to get between two points fastest. Because L1 is taxicab distance, there are as many solutions to getting between two points as there are ways of driving between two points in Manhattan! This is best illustrated by the Manhattan distance figure shown earlier.

L1-L2 regularizer comparisons

Computational difficulty: L1 > L2

L2 has a closed-form solution because it's a square of a thing. L1 does not have a closed-form solution because it is a non-differentiable piecewise function, as it involves an absolute value. For this reason, L1 is computationally more expensive: we can't solve it in terms of matrix math, and must rely on approximations (in the lasso case, coordinate descent).
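For instance, ridge's closed-form solution is a single line of matrix algebra, $\beta = (X^TX + \lambda I)^{-1}X^Ty$. A sketch with synthetic data and an arbitrary λ:

```python
# A sketch of the closed-form ridge solution; lasso has no analogous one-liner.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=50)

lambda_ = 1.0
n_features = X.shape[1]

# Solve (X^T X + lambda I) beta = X^T y
beta_ridge = np.linalg.solve(X.T @ X + lambda_ * np.eye(n_features), X.T @ y)
print(beta_ridge)
```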

Sparsity: L1 > L2

Sparsity is the property of having coefficients which are highly significant: either very near 0 or clearly far from 0. In theory, the coefficients very near 0 can later be eliminated.

Feature selection is a further-involved form of sparsity: instead of shrinking coefficients near to 0, feature selection takes them to exactly 0, and hence excludes certain features from the model entirely. Feature selection is a technique more so than a property: you can do feature selection as an additional step after running a highly sparse model. But lasso regression is interesting in that it features inbuilt feature selection, driving some coefficients to exactly 0, while ridge regression only shrinks coefficients toward 0 without eliminating them.
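A hedged sketch of this difference, reusing the synthetic-data setup from before: on the same data, the L1 penalty typically drives the irrelevant coefficients to exactly 0, while the L2 penalty leaves them small but nonzero.

```python
# Compare fitted coefficients of Lasso (L1) and Ridge (L2) on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
true_coef = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

print(Lasso(alpha=0.5).fit(X, y).coef_)   # irrelevant entries typically exactly 0.0
print(Ridge(alpha=0.5).fit(X, y).coef_)   # irrelevant entries small but nonzero
```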

That about covers the high-level properties of L2 and L1 norms and regularizers. Hopefully you can see how these properties are exactly the ones exhibited by ridge and lasso regression!

L-Infinity Norm

  • Gives the largest magnitude among the elements of a vector.
    For the vector X = [-6, 4, 2], the L-infinity norm is 6.
    In the L-infinity norm, only the largest element has any effect. So, for example, if your vector represents the cost of constructing a building, then by minimizing the L-infinity norm we are reducing the cost of the most expensive building. A quick check with NumPy follows below.
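A one-line check of the example above (assuming NumPy):

```python
# The infinity norm of [-6, 4, 2] is 6, the largest absolute element.
import numpy as np

x = np.array([-6.0, 4.0, 2.0])
print(np.max(np.abs(x)))              # 6.0
print(np.linalg.norm(x, ord=np.inf))  # 6.0
```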