Chapter 5: Advice for Applying Machine Learning
1 Evaluating the Hypothesis
1.1 Training / Testing Procedure
| Training set | Test set |
| --- | --- |
| 70% | 30% |
| $(x^{(1)},y^{(1)})\cdots(x^{(m)},y^{(m)})$ | $(x_{test}^{(1)},y_{test}^{(1)})\cdots(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$ |
The data should all be randomly shuffled before the split (a minimal splitting sketch follows).
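A sketch of the shuffled split, assuming `X` and `y` are NumPy arrays with one example per row (the function name and signature are illustrative, not from the lecture):

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the examples, then split them 70% / 30%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # random ordering, as required above
    n_test = int(len(y) * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```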
1.1.1 For Linear Regression
- Learn parameter $\theta$ from the training data
- Compute the test set error:

$$J_{test}(\theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta(x_{test}^{(i)})-y_{test}^{(i)}\right)^2$$
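A direct NumPy transcription of this formula, assuming the linear hypothesis $h_\theta(x)=\theta^Tx$ and an `X_test` whose first column is already all ones:

```python
import numpy as np

def j_test_linear(theta, X_test, y_test):
    """J_test(theta) = 1/(2*m_test) * sum_i (h_theta(x_i) - y_i)^2."""
    m_test = len(y_test)
    h = X_test @ theta                 # linear hypothesis h_theta(x) = theta^T x
    return np.sum((h - y_test) ** 2) / (2 * m_test)
```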
1.1.2 For Logistic Regression
- Learn parameter $\theta$ from the training data
- Compute the test set error, in either of two ways:
Way 1: the logistic cost function (note the $\log\left(1-h_\theta(\cdot)\right)$ in the second term):

$$J_{test}(\theta)=-\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}\left(y_{test}^{(i)}\log h_\theta(x_{test}^{(i)})+(1-y_{test}^{(i)})\log\left(1-h_\theta(x_{test}^{(i)})\right)\right)$$
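The same cost in NumPy, a sketch assuming a sigmoid hypothesis; the `eps` clipping is an added numerical guard, not part of the formula:

```python
import numpy as np

def j_test_logistic(theta, X_test, y_test, eps=1e-12):
    """Logistic test cost; note the log(1 - h) in the second term."""
    m_test = len(y_test)
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))   # sigmoid hypothesis
    h = np.clip(h, eps, 1.0 - eps)                # keep both logs finite
    return -np.sum(y_test * np.log(h)
                   + (1 - y_test) * np.log(1 - h)) / m_test
```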
Way 2: the 0/1 misclassification error:
$$err(h_\theta(x),y)=\begin{cases}1,&\text{if }h_\theta(x)\ge 0.5\text{ and }y=0\text{, or }h_\theta(x)<0.5\text{ and }y=1\\0,&\text{otherwise}\end{cases}$$
$$\text{Test error}=\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}err\left(h_\theta(x_{test}^{(i)}),y_{test}^{(i)}\right)$$
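Way 2 in NumPy, a sketch under the same sigmoid-hypothesis assumption: averaging the per-example `err` values is just the fraction of misclassified test examples.

```python
import numpy as np

def zero_one_test_error(theta, X_test, y_test):
    """Fraction of test examples the 0.5-thresholded hypothesis gets wrong."""
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))   # sigmoid hypothesis
    predictions = (h >= 0.5).astype(int)          # err(...) = 1 exactly on a mismatch
    return np.mean(predictions != y_test)
```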
2 Model Selection and Training / Validation / Test Sets
| Training set | Cross-validation set | Test set |
| --- | --- | --- |
| 60% | 20% | 20% |
| $(x^{(1)},y^{(1)})\cdots(x^{(m)},y^{(m)})$ | $(x_{cv}^{(1)},y_{cv}^{(1)})\cdots(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$ | $(x_{test}^{(1)},y_{test}^{(1)})\cdots(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$ |
- Training error:

$$J_{train}(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

- Cross-validation error:

$$J_{cv}(\theta)=\frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}\left(h_\theta(x_{cv}^{(i)})-y_{cv}^{(i)}\right)^2$$

- Test error:

$$J_{test}(\theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta(x_{test}^{(i)})-y_{test}^{(i)}\right)^2$$

- Model selection:
1° Train the $n$ candidate models on the training set.
2° Compute each model's cross-validation error (the value of the cost function) on the cross-validation set.
3° Select the model with the smallest cross-validation error.
4° Estimate the generalization error of the model chosen in step 3° by computing its cost on the test set (see the sketch below).
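A minimal sketch of steps 1°–4°, using `np.polyfit` over polynomial degrees as a stand-in family of $n$ candidate models (an illustrative choice, not the lecture's):

```python
import numpy as np

def j(p, x, y):
    """Squared-error cost of a fitted polynomial p on (x, y)."""
    return np.sum((np.polyval(p, x) - y) ** 2) / (2 * len(y))

def select_model(x_tr, y_tr, x_cv, y_cv, x_te, y_te, max_degree=10):
    degrees = list(range(1, max_degree + 1))
    # 1°: train one model per candidate degree on the training set
    models = [np.polyfit(x_tr, y_tr, d) for d in degrees]
    # 2°: compute each model's cross-validation error
    cv_err = [j(p, x_cv, y_cv) for p in models]
    # 3°: pick the model with the smallest cross-validation error
    best = int(np.argmin(cv_err))
    # 4°: estimate its generalization error on the test set
    return degrees[best], j(models[best], x_te, y_te)
```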
3 Diagnosing Bias (Underfitting) vs. Variance (Overfitting)
| Bias (underfit) | Variance (overfit) |
| --- | --- |
| $J_{train}(\theta)$ will be high | $J_{train}(\theta)$ will be low |
| $J_{cv}(\theta)\approx J_{train}(\theta)$ | $J_{cv}(\theta)\gg J_{train}(\theta)$ |
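The table reads directly as a crude two-branch check; both thresholds below are hypothetical placeholders for "high" and "$\gg$", and would be tuned per problem:

```python
def diagnose(j_train, j_cv, high):
    """Read off the table above; `high` is a hypothetical, problem-dependent level."""
    if j_train >= high:
        return "high bias (underfit): J_train high, J_cv close to J_train"
    if j_cv > 2.0 * j_train:                # ">>" made concrete, arbitrarily, as 2x
        return "high variance (overfit): J_train low, J_cv >> J_train"
    return "neither symptom is pronounced"
```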
3.1 Regularization and Bias / Variance
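With a small $\lambda$ the fit has high variance (overfits); with a large $\lambda$ it has high bias (underfits); so $\lambda$ is itself selected by minimizing $J_{cv}$, which is evaluated without the regularization term. A minimal sketch using the regularized normal equation; penalizing the bias term $\theta_0$ here is a simplification of the course convention, and the roughly doubling grid of $\lambda$ values follows the lecture:

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Regularized normal equation: (X'X + lam*I) theta = X'y.
    Simplification: this also penalizes theta_0; assumes X'X + lam*I is invertible."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def select_lambda(X_tr, y_tr, X_cv, y_cv):
    lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32,
               0.64, 1.28, 2.56, 5.12, 10.24]
    best_lam, best_err = lambdas[0], np.inf
    for lam in lambdas:
        theta = fit_ridge(X_tr, y_tr, lam)
        # J_cv is evaluated WITHOUT the regularization term
        err = np.sum((X_cv @ theta - y_cv) ** 2) / (2 * len(y_cv))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```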
3.2 Learning Curves
- Plot the training error and the cross-validation error as functions of the number of training examples $m$ (a sketch follows).
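A minimal end-to-end sketch on synthetic data (the quadratic ground truth and `np.polyfit` are illustrative choices): refit on the first $m$ training examples, evaluate on the fixed cross-validation set, and plot both errors against $m$:

```python
import numpy as np
import matplotlib.pyplot as plt

def j(p, x, y):
    """Squared-error cost of a fitted polynomial p on (x, y)."""
    return np.sum((np.polyval(p, x) - y) ** 2) / (2 * len(y))

rng = np.random.default_rng(0)                     # synthetic data, illustrative only
x = rng.uniform(-3, 3, 120)
y = x ** 2 + rng.normal(0, 1, 120)
x_tr, y_tr, x_cv, y_cv = x[:80], y[:80], x[80:], y[80:]

ms, j_train, j_cv = range(4, 81), [], []
for m in ms:
    p = np.polyfit(x_tr[:m], y_tr[:m], 2)          # refit on the first m examples
    j_train.append(j(p, x_tr[:m], y_tr[:m]))
    j_cv.append(j(p, x_cv, y_cv))

plt.plot(ms, j_train, label="J_train")
plt.plot(ms, j_cv, label="J_cv")
plt.xlabel("m (number of training examples)")
plt.ylabel("error")
plt.legend()
plt.show()
```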
3.2.1 High Bias
With high bias, $J_{train}(\theta)$ and $J_{cv}(\theta)$ both flatten out at a high value and end up close together as $m$ grows; getting more training data will not, by itself, help much.
3.2.2 High Variance
With high variance, $J_{train}(\theta)$ stays low while $J_{cv}(\theta)$ remains much higher, leaving a persistent gap; getting more training data is likely to help.
3.2.3 Solutions
| To solve high bias | To solve high variance |
| --- | --- |
| Try getting additional features | Get more training examples |
| Try adding polynomial features | Try smaller sets of features |
| Try decreasing $\lambda$ | Try increasing $\lambda$ |
4 Neural Networks and Overfitting
| "small" neural network | "large" neural network |
| --- | --- |
| fewer parameters | more parameters |
| more prone to underfitting | more prone to overfitting |
| computationally cheaper | computationally more expensive |

A large network with regularization ($\lambda$) to address overfitting usually performs better than a small one.
5 References
- 吴恩达 (Andrew Ng), Machine Learning, Coursera
- 黄海广, 机器学习笔记 (Machine Learning Notes)