Notes on Machine Learning By Andrew Ng (4)
Click here to check former notes.
Regularization
The problem of overfitting
Addressing overfitting
Options:
- Reduce the number of features
  - Manually select which features to keep.
  - Model selection algorithm (later in course).
  - Downside: may abandon some useful information.
- Regularization
  - Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
  - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Cost Function
Regularization
Small values for the parameters $\theta_j$, $1 \leq j \leq n$:
- “Simpler” hypothesis
- Less prone to overfitting
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^n \theta_j^2\right]$$

Regularization term: $\lambda \sum_{j=1}^n \theta_j^2$

Regularization parameter: $\lambda$, which controls a trade-off between two different goals.

- The first goal, captured by the first term of the objective, $\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, is to fit the training data well.
- The second goal is to keep the parameters small, which helps avoid overfitting.

If $\lambda$ is set too large, then $\theta_j \approx 0$ for $1 \leq j \leq n$, which leads to underfitting.
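The cost function above can be written as a short NumPy sketch. This is a minimal illustration, not code from the course; the function name `regularized_cost` and the array shapes are my own assumptions ($X$ is the $m \times (n+1)$ design matrix whose first column is all ones, and $\theta_0$ is not penalized, following the convention above).

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta).

    theta: (n+1,) parameters, X: (m, n+1) design matrix with a
    leading column of ones, y: (m,) targets, lam: lambda.
    theta[0] is excluded from the penalty.
    """
    m = len(y)
    residual = X @ theta - y
    fit_term = residual @ residual           # sum of squared errors
    reg_term = lam * np.sum(theta[1:] ** 2)  # skip theta_0
    return (fit_term + reg_term) / (2 * m)
```

Setting `lam = 0` recovers the ordinary least-squares cost, which is a quick sanity check on the implementation.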
Regularized linear regression
Gradient descent
Modify the update rule as follows.
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[\frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \cdots, n)$$

}

$$\rightarrow \theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Since $1 - \alpha\frac{\lambda}{m} < 1$ (and in practice close to 1), every update first shrinks $\theta_j$ slightly toward zero, then applies the usual gradient step.
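One iteration of the rewritten update can be sketched in vectorized NumPy. This is my own illustration of the rule above, not course code; note how the shrink factor $1 - \alpha\lambda/m$ is applied to every parameter except $\theta_0$.

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One step of regularized gradient descent for linear regression.

    Implements theta_j := theta_j * (1 - alpha*lam/m)
                          - (alpha/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i),
    with theta_0 left unpenalized (shrink factor 1).
    """
    m = len(y)
    grad = X.T @ (X @ theta - y) / m        # unregularized gradient
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                         # do not shrink theta_0
    return theta * shrink - alpha * grad
```

With `lam = 0` this reduces to the plain gradient-descent step, matching the unregularized update.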
Normal equation
Modify it.
$$\theta = \left(X^T X + \lambda \begin{bmatrix} 0 \\ & 1 \\ && 1 \\ &&& \ddots \\ &&&& 1 \end{bmatrix}\right)^{-1} X^T y$$

where the matrix is $(n+1) \times (n+1)$, with a $0$ in the top-left entry so that $\theta_0$ is not penalized.
Non-invertibility
If $m \leq n$, where $m$ is the number of examples and $n$ is the number of features, then $X^T X$ is a singular matrix. But once the $\lambda$ term is added (with $\lambda > 0$), the matrix becomes invertible.
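The closed-form solution above can be sketched with NumPy. This is my own illustration (the name `normal_equation` is not from the course); it uses `np.linalg.solve` rather than forming an explicit inverse, which is the standard numerically safer choice.

```python
import numpy as np

def normal_equation(X, y, lam):
    """Closed-form regularized linear regression.

    Solves (X^T X + lam * L) theta = X^T y, where L is the
    (n+1)x(n+1) identity with its (0, 0) entry zeroed so that
    theta_0 is not penalized.
    """
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

With `lam = 0` and well-conditioned data this reproduces ordinary least squares; with `lam > 0` the added matrix keeps the system solvable even when $m \leq n$.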
Regularized logistic regression
Just like linear regression, we add a term to the cost function to penalize the parameters $\theta_j$.
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
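The regularized logistic cost can also be sketched in NumPy. This is my own illustration, following the common convention of scaling the penalty by $\frac{1}{2m}$ and excluding $\theta_0$; the names `sigmoid` and `logistic_cost` are assumptions, not course code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, the hypothesis of logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic-regression cost.

    Cross-entropy averaged over m examples, plus
    (lam / (2m)) * sum_j theta_j^2 with theta_0 excluded.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)
    return cross_entropy + penalty
```

At `theta = 0` the hypothesis outputs $0.5$ for every example, so the unregularized cost is exactly $\log 2$ — a handy sanity check.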
Annotation
It has to be clear that although logistic and linear regression end up with the same-looking update equation in gradient descent, they are two very different algorithms: each has its own hypothesis ($h_\theta(x) = \theta^T x$ for linear regression, $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ for logistic regression). They look alike only because each cost function is matched to its hypothesis, and differentiating each cost happens to yield the same form of gradient update.
Click here to see the following note.