10 Advice for applying machine learning
10-1 Deciding what to try next
Debugging a learning algorithm
Suppose you have implemented regularized linear regression to predict housing prices.
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
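As a sketch, the regularized cost above can be computed with NumPy (the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np

def cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X is the (m, n+1) design matrix whose first column is all ones.
    theta[0] (the bias term) is not regularized, matching the
    sum over j = 1..n in the formula above.
    """
    m = len(y)
    residuals = X @ theta - y  # h_theta(x^(i)) - y^(i) for each example
    return (residuals @ residuals + lam * (theta[1:] @ theta[1:])) / (2 * m)
```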
However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features
- Try decreasing $\lambda$
- Try increasing $\lambda$
Machine learning diagnostic:
Diagnostic: A test that you can run to gain insight into what is/isn't working with a learning algorithm, and to gain guidance as to how best to improve its performance.
Diagnostics can take time to implement, but doing so can be a very good use of your time.
10-2 Evaluating a hypothesis
Evaluating your hypothesis
A hypothesis with low training error may still fail to generalize to new examples not in the training set.
Training/testing procedure for linear regression
- Learn parameter $\theta$ from training data (minimizing training error $J(\theta)$)
- Compute test set error: $J_{test}(\theta)$
Classification problem:
- Learn parameter θ \theta θ from training data
- Compute test set error:
$$J_{test}(\theta) = -\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}\left[y_{test}^{(i)}\log h_\theta(x_{test}^{(i)}) + \left(1 - y_{test}^{(i)}\right)\log\left(1 - h_\theta(x_{test}^{(i)})\right)\right]$$
- Misclassification error (0/1 misclassification error):
$$\mathrm{err}(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5,\ y = 0 \text{ or } h_\theta(x) < 0.5,\ y = 1 \\ 0 & \text{otherwise} \end{cases}$$
$$\text{Test error} = \frac{1}{m_{test}}\sum_{i=1}^{m_{test}}\mathrm{err}\left(h_\theta(x_{test}^{(i)}),\ y_{test}^{(i)}\right)$$
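Both test metrics can be sketched in NumPy (assuming a logistic-regression hypothesis $h_\theta(x) = \sigma(\theta^T x)$; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_test_cost(theta, X_test, y_test):
    """Logistic-loss test error J_test(theta) from the formula above."""
    h = sigmoid(X_test @ theta)
    return -np.mean(y_test * np.log(h) + (1 - y_test) * np.log(1 - h))

def misclassification_error(theta, X_test, y_test):
    """0/1 misclassification error: fraction of wrong predictions
    when thresholding h_theta(x) at 0.5."""
    preds = (sigmoid(X_test @ theta) >= 0.5).astype(float)
    return np.mean(preds != y_test)
```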
10-3 Model selection and training/validation/test sets
Overfitting example
Once parameters $\theta_0, \theta_1, \cdots, \theta_n$ have been fit to some set of data (the training set), the error of the parameters as measured on that data (the training error $J(\theta)$) is likely to be lower than the actual generalization error.
Model selection
Parameter: $d$ = degree of polynomial.
How well does the model generalize? Report test set error $J_{test}(\theta^{(d)})$.
Problem: $J_{test}(\theta^{(d)})$ is likely to be an optimistic estimate of the generalization error, i.e. our extra parameter ($d$, the degree of the polynomial) is fit to the test set.
Evaluating your hypothesis
- Training set: 60%
- Cross validation set: 20%
- Test set: 20%
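One way to produce such a split (a sketch; shuffling first matters when the data arrive in some meaningful order):

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Shuffle and split into 60% train / 20% cross validation / 20% test."""
    m = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)          # random order of example indices
    n_train = int(0.6 * m)
    n_cv = int(0.2 * m)
    tr, cv, te = np.split(idx, [n_train, n_train + n_cv])
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])
```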
Training/validation/test error
Training error:
$$J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Cross validation error:
$$J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}\left(h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)}\right)^2$$
Test error:
$$J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)}\right)^2$$
Pick the degree $d$ with the smallest cross validation error $J_{cv}(\theta^{(d)})$, then estimate the generalization error with the test set, e.g. $J_{test}(\theta^{(4)})$ if $d = 4$ was chosen.
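The whole selection loop can be sketched for one-dimensional inputs (using a least-squares fit; `max_d` and the helper names are my assumptions):

```python
import numpy as np

def poly_design(x, d):
    """Design matrix [1, x, x^2, ..., x^d] for a 1-D input vector x."""
    return np.vander(x, d + 1, increasing=True)

def squared_error(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def select_degree(x_train, y_train, x_cv, y_cv, max_d=10):
    """Fit theta^(d) on the training set for each degree d and
    return the degree with the smallest cross validation error."""
    best_d, best_err = None, np.inf
    for d in range(1, max_d + 1):
        theta, *_ = np.linalg.lstsq(poly_design(x_train, d), y_train, rcond=None)
        err = squared_error(theta, poly_design(x_cv, d), y_cv)
        if err < best_err:
            best_d, best_err = d, err
    return best_d, best_err
```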
10-4 Diagnosing bias vs. variance
Bias/variance
Training error: $J_{train}(\theta)$ (as defined above)
Cross validation error: $J_{cv}(\theta)$ (as defined above)
Diagnosing bias vs variance
Suppose your learning algorithm is performing less well than you were hoping.
($J_{cv}(\theta)$ or $J_{test}(\theta)$ is high.) Is it a bias problem or a variance problem?
Bias (underfit): $J_{train}(\theta)$ will be high; $J_{cv}(\theta) \approx J_{train}(\theta)$
Variance (overfit): $J_{train}(\theta)$ will be low; $J_{cv}(\theta) \gg J_{train}(\theta)$
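As a rough heuristic for these two rules (the factor-of-two thresholds are my assumption, not from the lecture):

```python
def diagnose(j_train, j_cv, target_error):
    """Sketch of the diagnosis rules above:
    - bias: training error itself is high and cv error is close to it
    - variance: training error is low but cv error is much higher"""
    if j_train > target_error and j_cv < 2 * j_train:
        return "bias"
    if j_train <= target_error and j_cv > 2 * j_train:
        return "variance"
    return "unclear"
```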
10-5 Regularization and bias/variance
Linear regression with regularization
Model:
$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
- High bias: large $\lambda$
- Just right: intermediate $\lambda$
- High variance: small $\lambda$
Choosing the regularization parameter λ \lambda λ
Try a range of values of $\lambda$; for each, fit $\theta^{(i)}$ on the training set, and pick the $\lambda$ with the smallest cross validation error $J_{cv}(\theta^{(i)})$.
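A sketch of that procedure (the doubling grid $0, 0.01, 0.02, \ldots, 10.24$ follows the lecture; the closed-form ridge fit is just one convenient way to obtain each $\theta^{(i)}$):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form regularized linear regression (normal equation).
    The bias column (first column of X) is not regularized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def choose_lambda(X_train, y_train, X_cv, y_cv):
    """Try each lambda on the grid; pick the one whose theta gives
    the smallest *unregularized* cross validation error."""
    lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]  # 0, 0.01, ..., 10.24
    best = None
    for lam in lambdas:
        theta = fit_ridge(X_train, y_train, lam)
        r = X_cv @ theta - y_cv
        j_cv = (r @ r) / (2 * len(y_cv))
        if best is None or j_cv < best[1]:
            best = (lam, j_cv, theta)
    return best
```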
Bias/variance as a function of the regularization parameter λ \lambda λ
10-6 Learning curves
Learning curves
Plot $J_{train}(\theta)$ and $J_{cv}(\theta)$ as functions of the training set size $m$.
High bias: the two curves get closer and closer, and both end up high.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
High variance: there is a large gap between $J_{train}(\theta)$ and $J_{cv}(\theta)$.
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
10-7 Deciding what to try next (revisited)
Debugging a learning algorithm:
- Get more training examples $\longrightarrow$ fixes high variance
- Try smaller sets of features $\longrightarrow$ fixes high variance
- Try getting additional features $\longrightarrow$ fixes high bias
- Try adding polynomial features $\longrightarrow$ fixes high bias
- Try decreasing $\lambda$ $\longrightarrow$ fixes high bias
- Try increasing $\lambda$ $\longrightarrow$ fixes high variance
Neural networks and overfitting
- A small neural network has fewer parameters and is more prone to underfitting; it is also computationally cheaper.
- A large neural network has more parameters and is more prone to overfitting (and is computationally more expensive); use regularization ($\lambda$) to address the overfitting. A larger network with regularization is usually more effective than a smaller network.