Andrew Ng · Machine Learning || Chap 7 Regularization Notes

7 Regularization

7-1 The problem of overfitting

underfitting——high bias

Just right

overfitting——high variance

Overfitting: If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples

Addressing overfitting

Options:

  1. Reduce number of features
  2. Regularization
  • Keep all the features, but reduce magnitude/values of parameters $\theta_j$
  • Works well when we have a lot of features, each of which contributes a bit to predicting y

7-2 Cost function

Intuition

If $\theta_0 + \theta_1 x + \theta_2 x^2$ fits the data just right, and $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ overfits,

$\longrightarrow$ suppose we penalize and make $\theta_3, \theta_4$ really small, close to 0.
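For instance, the intuition can be made concrete by adding large penalty terms on $\theta_3$ and $\theta_4$ to the squared-error objective (the constant 1000 below is just some arbitrarily large number):

$$\min_\theta \; \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$$

Minimizing this forces $\theta_3 \approx 0$ and $\theta_4 \approx 0$, so the quartic hypothesis behaves almost like the quadratic one.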

Regularization

Small values for parameters $\theta_0, \theta_1, \cdots, \theta_n$

  • "Simpler" hypothesis
  • Less prone to overfitting

Housing:

  • Features: $x_1, x_2, \cdots, x_{100}$
  • Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

($\lambda$: the regularization parameter)

In regularized linear regression, we choose $\theta$ to minimize $J(\theta)$.

What if $\lambda$ is set to an extremely large value (perhaps too large for our problem, say $\lambda = 10^{10}$)? Then all of $\theta_1, \cdots, \theta_n$ are penalized toward 0, leaving roughly $h_\theta(x) \approx \theta_0$, and the model underfits (high bias).
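As a minimal Octave sketch of this cost function (hypothetical names: X is the m x (n+1) design matrix with a leading column of ones, y the targets, theta the parameter vector, lambda the regularization parameter):

function J = regularizedCost(theta, X, y, lambda)
  m = length(y);
  err = X * theta - y;                          % h_theta(x^(i)) - y^(i) for every example
  J = (1 / (2 * m)) * (sum(err .^ 2) ...
      + lambda * sum(theta(2:end) .^ 2));       % theta_0 (stored as theta(1)) is not penalized
end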

7-3 Regularized linear regression

Regularized linear regression

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

(for $j = 1, 2, \cdots, n$; $\theta_0$ is updated without the regularization term)
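Vectorized in Octave, one pass of this update (for all $j$ simultaneously, leaving $\theta_0$ unpenalized) might look like the following sketch; alpha, lambda, and num_iters are assumed to be set by the caller:

m = length(y);
for iter = 1:num_iters
    grad = (1 / m) * (X' * (X * theta - y));                   % plain squared-error gradient
    grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % add (lambda/m)*theta_j for j >= 1
    theta = theta - alpha * grad;                              % same as theta_j*(1 - alpha*lambda/m) - ...
end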

Normal equation

$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Non-invertibility (optional/advanced)

Suppose $m \le n$ ($m$: number of examples, $n$: number of features). Then in

$$\theta = (X^TX)^{-1}X^Ty$$

the matrix $X^TX$ may be non-invertible (singular). However, if $\lambda > 0$,

$$\theta = \left( X^TX + \lambda \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix} \right)^{-1} X^Ty$$
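With $\lambda > 0$, adding the bracketed diagonal matrix makes the quantity being inverted non-singular. A minimal Octave sketch of this regularized normal equation (variable names X, y, lambda assumed):

L = eye(size(X, 2));                       % (n+1) x (n+1) identity
L(1, 1) = 0;                               % do not regularize theta_0
theta = (X' * X + lambda * L) \ (X' * y);  % backslash solves the linear system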

7-4 Regularized logistic regression

Gradient descent

Repeat{

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad (j = 1, 2, \cdots, n)$$

}
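Here $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ (the sigmoid), which is what distinguishes this from the linear-regression case. One vectorized Octave iteration of the update above could be sketched as (X, y, alpha, lambda assumed in scope):

m = length(y);
h = 1 ./ (1 + exp(-X * theta));                              % sigmoid hypothesis
grad = (1 / m) * (X' * (h - y));
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);     % theta_0 is not regularized
theta = theta - alpha * grad;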

Advanced optimization

function [jVal,gradient] = costFunction(theta)
	jVal = [code to compute J(θ)];
	gradient(1)= [code to compute ∂J(θ)/∂(θ_0) ] 
	gradient(2)= [code to compute ∂J(θ)/∂(θ_1) ]
	gradient(3)= [code to compute ∂J(θ)/∂(θ_2) ]
	...
	gradient(n+1)= [code to compute ∂J(θ)/∂(θ_n) ]
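
A concrete (hedged) version of this skeleton for regularized logistic regression, returning the gradient as one vector; costFunctionReg is a hypothetical name, and X, y, lambda are passed in through an anonymous function handed to fminunc:

function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                            % sigmoid hypothesis
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
         + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);      % theta_0 excluded from the penalty
  gradient = (1 / m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end

% Usage sketch with fminunc:
% options = optimset('GradObj', 'on', 'MaxIter', 400);
% initialTheta = zeros(size(X, 2), 1);
% [optTheta, cost] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);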
	