Chapter 3: Regularization
1 Addressing Overfitting
| Reduce number of features | Regularization |
| --- | --- |
| Manually select which features to keep | Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$ |
| Model selection algorithm (automatically select) | Works well when we have a lot of features, each of which contributes a bit to predicting $y$ |
| May discard useful information | |
2 Regularization
2.1 Theory
Small values for parameters $\theta_0,\theta_1,\cdots,\theta_n$:
- “Simpler” hypothesis
- Less prone to overfitting
2.2 Model
- $\mathop{\text{minimize}}\limits_{\theta}\ J(\theta)$
$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2\right]$$
- $\lambda$: the regularization parameter
- $\lambda\sum_{j=1}^n\theta_j^2$: shrinks every single parameter $\theta_j$
[Note] The regularization sum starts at $\theta_1$ (in practice, also penalizing $\theta_0$ makes little difference).
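As a quick sanity check on the formula above, here is a minimal NumPy sketch of this cost (the function name is illustrative, and it assumes $X$ already carries a leading bias column of ones):

```python
import numpy as np

def cost_reg_linear(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    Assumes X has shape (m, n+1) with a leading column of ones,
    theta has length n+1, and theta[0] is not regularized.
    """
    m = len(y)
    errors = X @ theta - y                          # h_theta(x^(i)) - y^(i) for all i
    cost = (errors @ errors) / (2 * m)              # (1/2m) * sum of squared errors
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # penalty skips theta_0
    return cost + reg
```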
2.3 Problems caused by setting $\lambda$ too large
- All parameters $\theta_1,\cdots,\theta_n$ are penalized so heavily that they end up close to zero, leaving $h_\theta(x)\approx\theta_0$ (a flat line).
- The algorithm results in underfitting: it fails to fit even the training data well.
3 Regularized Linear Regression
3.1 Gradient Descent
- repeat until convergence {

$$\begin{aligned} \theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\ \theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \end{aligned}$$

Equivalently: $\theta_j:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$

($j=1,2,3,\cdots,n$)

}
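A minimal NumPy sketch of this loop, using the "equivalent" shrink-then-step form above (the fixed iteration count in place of a convergence test, and all names, are illustrative assumptions):

```python
import numpy as np

def gradient_descent_reg(X, y, alpha, lam, num_iters=1000):
    """Batch gradient descent for regularized linear regression.

    theta_0 is updated without the regularization term; each other
    theta_j is first shrunk by (1 - alpha * lam / m), then takes the
    usual gradient step. X is assumed to have a leading ones column.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y      # h_theta(x^(i)) - y^(i), all examples at once
        grad = (X.T @ errors) / m   # unregularized gradient for every j
        theta[0] -= alpha * grad[0]                                   # no penalty on theta_0
        theta[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
    return theta
```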
3.2 Normal Equation
$$\theta=\left(X^TX+\lambda\left[\begin{matrix} 0\\ &1\\ &&1\\ &&&\ddots\\ &&&&1 \end{matrix}\right]_{(n+1)\times(n+1)}\right)^{-1}X^Ty$$
- As long as $\lambda>0$, the matrix being inverted is guaranteed to be invertible, so regularization also resolves the non-invertibility problem (e.g. when $m\le n$).
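A corresponding NumPy sketch (the helper name is illustrative; `np.linalg.solve` is used rather than an explicit inverse, which is the standard numerical choice):

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form regularized linear regression.

    L is the (n+1) x (n+1) identity with its top-left entry zeroed,
    so theta_0 goes unpenalized; for lam > 0 the system is solvable.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0                                   # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```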
4 Regularized Logistic Regression
4.1 Gradient Descent
- $J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
- repeat until convergence {

$$\begin{aligned} \theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\ \theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \end{aligned}$$

($j=1,2,3,\cdots,n$; simultaneously update all $\theta_j$)

}
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def costReg(theta, X, y, learningRate):
    # Regularized logistic cost; here "learningRate" plays the role of lambda.
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg
```
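For use with a gradient-based optimizer, a matching gradient sketch under the same matrix conventions (the name `gradientReg` is an illustrative addition, not part of the original notes):

```python
def gradientReg(theta, X, y, learningRate):
    # Gradient of the regularized cost above; theta_0's component carries no penalty.
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = sigmoid(X * theta.T) - y            # (m, 1) column of h - y
    grad = (X.T * error).T / len(X)             # (1, n+1) unregularized gradient
    grad[0, 1:] = grad[0, 1:] + (learningRate / len(X)) * theta[0, 1:]  # add (lambda/m)*theta_j, j >= 1
    return np.array(grad).ravel()
```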
4.2 Advanced optimization
Octave code:

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
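The same pattern in Python, as a hedged sketch using `scipy.optimize.minimize` together with the `costReg`/`gradientReg` helpers above (the toy dataset and the choice of the BFGS solver are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: m = 3 examples, a bias column plus one feature.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])
y = np.array([[0.0], [0.0], [1.0]])
initialTheta = np.zeros(X.shape[1])

# 'maxiter' mirrors MaxIter above; jac supplies the analytic gradient.
result = minimize(fun=costReg, x0=initialTheta, args=(X, y, 1.0),
                  jac=gradientReg, method='BFGS', options={'maxiter': 100})
print(result.x)  # optimized theta
```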