L1 Norms vs. L2 Norms
Explaination of L1 and L2 Norms and their major differences.
This blog is major sourced from HERE
L1 Norm Versus L2 Norm
Also known as Manhattan Distance
or Taxicab norm
. L1 Norm is the sum of the magnitudes of the vectors in a space. It is the most natural way of measure distance between vectors, that is the sum of absolute difference of the components of the vectors. In this norm, all the components of the vector are weighted equally.
L2 nrom Also known as the Euclidean norm
. It is the shortest distance to go from one point to another.
Ridge regression
and lasso regression
are two different techniques for increasing the robustness against colinearity of ordinary least squares regression. Both of these algorithms attempt to minimize a cost function.
The cost is a function of two terms: one, the residual sum of squares (RSS), taken from ordinary least squares; the other, an additional regularizer penalty. The second term is an L2 norm in ridge regression, and an L1 norm in lasso regression.
In ordinary least squares, we solve to minimize the following cost function:
C o s t = ( y − x β ) T ( y − x β ) Cost = (y-x\beta)^T(y-x\beta) Cost=(y−xβ)T(y−xβ)
This term is the RSS, residual sum of squares. In ridge regression
we instead solve:
C o s t = ( y − x β ) T ( y − x β ) + λ β T β Cost = (y-x\beta)^T(y-x\beta)+\lambda\beta^T\beta Cost=(y−xβ)T(y−xβ)+λβTβ
The λ β T β \lambda\beta^T\beta λβTβ term is an L2 norm.
In lasso regression
we instead solve:
C o s t = ( y − x β ) T ( y − x β ) + λ ∣ β ∣ Cost = (y-x\beta)^T(y-x\beta)+\lambda |\beta| Cost=(y−xβ)T(y−xβ)+λ∣β∣
The λ ∣ β ∣ \lambda {|\beta|} λ∣β∣ term is an L1 norm.
At a higher level, the chief difference between the L1 and the L2 terms is that the L2 term is proportional to the square of the β values, while the L1 norm is proportional the absolute value of the values in β . This fundamental difference accounts for all of the difference between how lasso regression and ridge regression “work”.
Definiation of Norm
A norm is a mathematical thing that is applied to a vector (like the vector β above). The norm of a vector maps vector values to values in [0,∞) . In machine learning, norms are useful because they are used to express distances: this vector and this vector are so-and-so far apart, according to this-or-that norm.
Going a bit further, we define ||x||p as a “p-norm”. Given x , a vector with i components, a p-norm is defined as:
∣ ∣ x ∣ ∣ p = ( ∑ i ∣ x i ∣ p ) 1 / p ||x||_p=(∑_i|x_i|^p)^{1/p} ∣∣x∣∣p=(i∑∣xi∣p)1/p
The simplest norm conceptually is Euclidean distance. This is what we typically think of as distance between two points in space:
∣ ∣ x ∣ ∣ 2 = ∑ i x i 2 = x 1 2 + x 2 2 + . . . + x i 2 ||x||_2=\sqrt{∑_ix^2_i}=\sqrt{x_1^2+x_2^2+...+x_i^2} ∣∣x∣∣2=i∑xi2=x12+x22+...+xi2
Another common norm is taxicab distance, which is the 1-norm:
∣ ∣ x ∣ ∣ 1 = ∑ i ∣ x i ∣ = ∣ x 1 ∣ + ∣ x 2 ∣ + … + ∣ x i ∣ ||x||_1=∑_i|x_i|=|x_1|+|x_2|+…+|x_i| ∣∣x∣∣1=i∑∣xi∣=∣x1∣+∣x2∣+…+∣xi∣
Taxicab distance is so-called because it emulates moving between two points as though you are moving through the streets of Manhattan in a taxi cab. Instead of measuring the distance “as the crow flies” it measures the right-angle distance between two points:
[外链图片转存失败(img-127xV9Sm-1564986625386)(https://upload.wikimedia.org/wikipedia/commons/0/08/Manhattan_distance.svg)]
p-norms and regularization
Taxicab distance is the 1-norm, also known as the L1 norm. The L2 norm is actually the 2-norm, Euclidian distance, squared. Hence, we can rewrite our cost equations as:
R i d g e C o s t = ( y − X β ) T ( y − X β ) + ∣ ∣ β ∣ ∣ 2 2 Ridge Cost=(y−Xβ)T(y−Xβ)+||β||_2^2 RidgeCost=(y−Xβ)T(y−Xβ)+∣∣β∣∣22
L a s s o C o s t = ( y − X β ) T ( y − X β ) + ∣ ∣ β ∣ ∣ 1 Lasso Cost=(y−Xβ)T(y−Xβ)+||β||_1 LassoCost=(y−Xβ)T(y−Xβ)+∣∣β∣∣1
This process of adding a norm to our cost function is known as regularization
. We can regularize the data for different underlying reasons and with different effects. In the case of ridge and lasso regression, both of these regularizers are built to problem-solve colinearity and model complexity; but as we saw in earlier notebooks. the way in which they go about doing so is fundamentally different.
L1-L2 norm comparisons
Robustness: L1 > L2
Robustness is defined as resistance to outliers in a dataset. The more able a model is to ignore extreme values in the data, the more robust it is.
The L1 norm is more robust than the L2 norm, for fairly obvious reasons: the L2 norm squares values, so it increases the cost of outliers exponentially; the L1 norm only takes the absolute value, so it considers them linearly.
Stability: L2 > L1
Stability is defined as resistance to horizontal adjustments. This is the perpendicular opposite of robustness.
The L2 norm is more stable than the L1 norm. A later notebook will explore why.
Solution numeracy: L2 one, L1 many
Because L2 is Euclidean distance, there is always one right answer as to how to get between two points fastest. Because L1 is taxicab distance, there are as many solutions to getting between two points as there are ways of driving between two points in Manhattan! This is best illustrated by the same graphic from earlier:
L1-L2 regularizer comparisons
Computational difficulty: L2 > L1
L2 has a closed form solution because it’s a square of a thing. L1 does not have a closed form solution because it is a non-differenciable piecewise function, as it involves an absolute value. For this reason, L1 is computationally more expensive, as we can’t solve it in terms of matrix math, and most rely on approximations (in the lasso case, coordinate descent).
Sparsity: L1 > L2
Sparsity is the property of having coefficients which are highly significant: very near 0 or very not near 0. In theory, the coefficients very near 0 can later be eliminated.
Feature selection is a further-involved form of sparsity: instead of shrinking coefficients near to 0, feature selection is taking them to exactly 0, and hence excluding certain features from the model entirely. Feature selection is a technique moreso than a property: you can do feature selection as an additional step after running a highly sparse model. But lasso regression is interesting in that it features inbuilt feature selection, while ridge regression is just very sparse.
That about covers the high-level properties of L2 and L1 norms and regularizers. Hopefully you can see how these properties are exactly the same ones exposed in ridge and lasso regression!
L-Infinity Norm
- Gives the largest magnitude among each element of a vector.
Having the vector X= [-6, 4, 2], the L-infinity norm is 6.
In L-infinity norm, only the largest element has any effect. So, for example, if your vector represents the cost of constructing a building, by minimizing L-infinity norm we are reducing the cost of the most expensive building.