1.1 This Week's Topic: Diagnostics (Diagnostic)
Diagnostic: A test that you run to gain insight into what is/isn’t working with a learning algorithm, to gain guidance into improving its performance.
Diagnostics can take time to implement but doing so can be a very good use of your time.
1.2 Evaluating a Model
For linear regression:
Split the dataset into two parts: a training set and a test set;
Train the parameters by minimizing the cost function:
$$J(\vec w,b)=\frac{1}{2m_{train}}\sum_{i=1}^{m_{train}}\left(f_{\vec w,b}\left(\vec x^{(i)}\right)-y^{(i)}\right)^2+\frac{\lambda}{2m_{train}}\sum_{j=1}^{n}w_j^2$$
Compute the test error:
$$J_{test}(\vec w,b)=\frac{1}{2m_{test}}\left[\sum_{i=1}^{m_{test}}\left(f_{\vec w,b}\left(\vec x_{test}^{(i)}\right)-y_{test}^{(i)}\right)^2\right]$$
Compute the training error:
$$J_{train}(\vec w,b)=\frac{1}{2m_{train}}\left[\sum_{i=1}^{m_{train}}\left(f_{\vec w,b}\left(\vec x_{train}^{(i)}\right)-y_{train}^{(i)}\right)^2\right]$$
Compare the errors on the training and test sets; if they differ greatly, the model has a problem (it is likely overfitting).
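The split-train-evaluate procedure above can be sketched in NumPy; the synthetic data, the 70/30 split, and the omission of the regularization term from the reported errors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: y = 2x + 1 + noise
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

# 70/30 train/test split
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:70], idx[70:]

# Fit w, b on the training set only (least-squares line)
w, b = np.polyfit(x[train_idx], y[train_idx], deg=1)

def mse_cost(w, b, x, y):
    """Squared-error cost J = (1/2m) * sum((f(x) - y)^2), no regularization term."""
    m = len(x)
    return np.sum((w * x + b - y) ** 2) / (2 * m)

J_train = mse_cost(w, b, x[train_idx], y[train_idx])
J_test = mse_cost(w, b, x[test_idx], y[test_idx])
print(J_train, J_test)
```

If the model generalizes well, the two numbers should be close; a J_test far above J_train is the overfitting signature described above.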
For logistic regression:
Cost function:
$$J(\vec w,b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec w,b}\left(\vec x^{(i)}\right)\right)+\left(1-y^{(i)}\right)\log\left(1-f_{\vec w,b}\left(\vec x^{(i)}\right)\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2$$
Compute the test error:
$$J_{test}(\vec w,b)=-\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}\left[y_{test}^{(i)}\log\left(f_{\vec w,b}\left(\vec x_{test}^{(i)}\right)\right)+\left(1-y_{test}^{(i)}\right)\log\left(1-f_{\vec w,b}\left(\vec x_{test}^{(i)}\right)\right)\right]$$
Compute the training error:
$$J_{train}(\vec w,b)=-\frac{1}{m_{train}}\sum_{i=1}^{m_{train}}\left[y_{train}^{(i)}\log\left(f_{\vec w,b}\left(\vec x_{train}^{(i)}\right)\right)+\left(1-y_{train}^{(i)}\right)\log\left(1-f_{\vec w,b}\left(\vec x_{train}^{(i)}\right)\right)\right]$$
1.3 Model Selection and Training/Cross-Validation/Test Sets
Problem: in a regression setting, you want to choose the model automatically, e.g. decide what degree polynomial to use.
Method 1: treat the polynomial degree d as an extra parameter; train a model for each d = 1, 2, 3, …, 10, obtaining its w and b as well as its J_test; the model with the smallest J_test is then the best one under this scheme.
How well does the model perform? Report test error J_test(w^{<5>}, b^{<5>})?
The problem is that J_test(w^{<5>}, b^{<5>}) is likely to be an optimistic estimate of the generalization error, i.e. an extra parameter d (degree of polynomial) was chosen using the test set.
Method 2: split the dataset into three parts: training / cross-validation / test sets.
The cross-validation set is also called the validation set, development set, or dev set.
Training error (60% of the data):
$$J_{train}(\vec w,b)=\frac{1}{2m_{train}}\left[\sum_{i=1}^{m_{train}}\left(f_{\vec w,b}\left(\vec x_{train}^{(i)}\right)-y_{train}^{(i)}\right)^2\right]$$
Cross-validation error (20% of the data):
$$J_{cv}(\vec w,b)=\frac{1}{2m_{cv}}\left[\sum_{i=1}^{m_{cv}}\left(f_{\vec w,b}\left(\vec x_{cv}^{(i)}\right)-y_{cv}^{(i)}\right)^2\right]$$
Test error (20% of the data):
$$J_{test}(\vec w,b)=\frac{1}{2m_{test}}\left[\sum_{i=1}^{m_{test}}\left(f_{\vec w,b}\left(\vec x_{test}^{(i)}\right)-y_{test}^{(i)}\right)^2\right]$$
Armed with these measures of learning algorithm performance, this is how you can then go about carrying out model selection.
Summary: train each candidate model on the training set, then evaluate the trained models on the cross-validation set (the regularization term is not included in J_cv); choose the model with the smallest J_cv as the best model, and finally evaluate that one model on the test set. Because the test set played no part in choosing the model, its error is not an overly optimistic estimate of the generalization error.
This ensures that your test set is a fair and not overly optimistic estimate of how well your model will generalize to new data.
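Method 2 can be sketched end to end; the synthetic quadratic data, the 60/20/20 split, and the use of `np.polyfit` as the candidate model family are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a quadratic, so a sensible best degree is around 2
x = rng.uniform(-3, 3, size=150)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1.0, size=150)

# 60/20/20 train / cross-validation / test split
idx = rng.permutation(len(x))
tr, cv, te = idx[:90], idx[90:120], idx[120:]

def j(coeffs, x, y):
    """Squared-error cost (1/2m) * sum of squared residuals."""
    m = len(x)
    return np.sum((np.polyval(coeffs, x) - y) ** 2) / (2 * m)

# Fit each candidate degree on the training set, score it on the cv set
j_cv, fits = {}, {}
for d in range(1, 11):
    coeffs = np.polyfit(x[tr], y[tr], deg=d)
    fits[d] = coeffs
    j_cv[d] = j(coeffs, x[cv], y[cv])

best_d = min(j_cv, key=j_cv.get)        # degree with the smallest J_cv
j_test = j(fits[best_d], x[te], y[te])  # generalization error from the untouched test set
print(best_d, j_test)
```

Only `j_test` of the single chosen model is reported, which is what keeps it a fair estimate.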
2.1 Diagnosing Bias and Variance
Recall the concepts of overfitting/underfitting from P1W3:
Underfit — high bias
A good fit — generalization
Overfit — high variance
Here, bias refers to the gap between an underfit model and the actual data; variance refers to an overfit model of high degree whose curve swings wildly.
When the features are high-dimensional, we cannot simply plot the model to judge whether it underfits or overfits, so we need a diagnostic.
Comparing J_train and J_cv for underfit and overfit models shows:
When J_train is high and J_cv is also high, the model has high bias;
When J_train is low and J_cv is also low, the model performs well;
When J_train is low but J_cv is high, the model has high variance.
Summary:
High bias / underfit — J_train is high
High variance / overfit — J_cv is much greater than J_train
2.2 Regularization / Bias / Variance
Problem: suppose we are fitting a fourth-degree polynomial with regularization. How do we choose a suitable λ by analyzing bias and variance?
How J_train and J_cv vary with λ:
λ trades fit against smoothness. When λ = 0 the model overfits; as λ grows, the model fits the training data less tightly, which smooths the curve and lowers variance, but J_train rises as the fit loosens, so J_train increases with λ. When λ is too large, the weights w become very small and the model approaches a flat line (y ≈ b), i.e. it underfits.
Use the cross-validation set to find the best λ, just as for the degree d.
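Sweeping λ and picking the value with the smallest J_cv can be sketched with a closed-form regularized least-squares fit; the synthetic data, the candidate λ grid, and the degree-4 feature map are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic data, but we fit a degree-4 polynomial, so regularization matters
x = rng.uniform(-2, 2, size=80)
y = x**2 + rng.normal(0, 0.3, size=80)

idx = rng.permutation(len(x))
tr, cv = idx[:60], idx[60:]

def poly_features(x, d=4):
    return np.column_stack([x**p for p in range(1, d + 1)])

def ridge_fit(X, y, lam):
    """Closed-form regularized least squares; b (the intercept) is not penalized."""
    Xb = np.column_stack([np.ones(len(X)), X])
    reg = lam * np.eye(Xb.shape[1])
    reg[0, 0] = 0.0  # do not regularize the intercept
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def j(theta, X, y):
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.sum((Xb @ theta - y) ** 2) / (2 * len(y))

X_tr, X_cv = poly_features(x[tr]), poly_features(x[cv])
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
j_cv = {lam: j(ridge_fit(X_tr, y[tr], lam), X_cv, y[cv]) for lam in lambdas}
best_lam = min(j_cv, key=j_cv.get)  # λ with the smallest cross-validation error
print(best_lam, j_cv[best_lam])
```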
2.3 Establishing a Baseline Level of Performance
Example: suppose we built a speech-recognition model whose training error J_train is 10.8% and whose cross-validation error is 14.8%. Both look high, as if the model were poor; but human error on the same task is 10.6%, so the training error is actually acceptable by comparison. The real problem is that the cross-validation error is much higher than the training error.
This algorithm has more of a variance problem than a bias problem.
Choosing a baseline level of performance to measure against is therefore important. Common baselines:
- Human-level performance
- Competing algorithms' performance
- A guess based on experience
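The comparison logic can be sketched as a small helper; the 0.02 gap threshold is an illustrative assumption, not a rule from the course:

```python
def diagnose(baseline, j_train, j_cv):
    """Compare errors (as fractions) against a baseline, as in the speech example:
    a large train-minus-baseline gap suggests high bias;
    a large cv-minus-train gap suggests high variance."""
    gap_bias = j_train - baseline
    gap_var = j_cv - j_train
    problems = []
    if gap_bias > 0.02:  # illustrative threshold
        problems.append("high bias")
    if gap_var > 0.02:
        problems.append("high variance")
    return problems or ["looks fine"]

# The speech-recognition numbers from the notes: baseline 10.6%, J_train 10.8%, J_cv 14.8%
print(diagnose(0.106, 0.108, 0.148))  # -> ['high variance']
```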
2.4 Learning Curves
A learning curve shows how the performance of a trained model varies with the size of the training set.
How training error and cross-validation error change as the training set grows:
When the model underfits (has high bias): both J_train and J_cv flatten out at a high error, and collecting more data does not help much.
When the model overfits (has high variance): J_train stays low while J_cv is much higher, and collecting more data tends to narrow the gap.
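A learning curve for a high-bias model can be sketched by fitting a straight line to quadratic data on growing training subsets; the data and the subset sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Quadratic data fit with a straight line -> a deliberately high-bias model
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(0, 0.5, size=200)

cv_idx = np.arange(150, 200)  # fixed cross-validation set; training subsets grow

def j(w, b, x, y):
    return np.sum((w * x + b - y) ** 2) / (2 * len(x))

sizes = [10, 25, 50, 100, 150]
j_train_curve, j_cv_curve = [], []
for m in sizes:
    w, b = np.polyfit(x[:m], y[:m], deg=1)
    j_train_curve.append(j(w, b, x[:m], y[:m]))
    j_cv_curve.append(j(w, b, x[cv_idx], y[cv_idx]))

# With high bias, both curves flatten at a high error: more data does not help much
print(list(zip(sizes, j_train_curve, j_cv_curve)))
```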
2.5 Deciding What to Try Next
You've implemented regularized linear regression on housing prices, but it makes unacceptably large errors in predictions.
$$J(\vec w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec w,b}\left(\vec x^{(i)}\right)-y^{(i)}\right)^2+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2$$
Common remedies:

| Remedy | What it addresses |
|---|---|
| Get more training examples | fixes high variance |
| Try smaller sets of features | fixes high variance |
| Try getting additional features | fixes high bias |
| Try adding polynomial features | fixes high bias |
| Try decreasing λ | fixes high bias |
| Try increasing λ | fixes high variance |
2.6 Bias/Variance and Neural Networks
The bias-variance tradeoff: a model that is too simple yields high bias (underfitting), while one that is too complex yields high variance (overfitting), so the two must be balanced.
Observation 1: large neural networks are low-bias machines. A big enough network can almost always fit the training set well (unless the training set is enormous). This suggests a recipe that works well in the right circumstances: if J_train is too high, use a bigger network; if J_cv is too high, get more data; repeat.
Observation 2: a larger neural network will usually do as well as or better than a smaller one, so long as regularization is chosen appropriately, though it does increase the cost in time and money.
Regularizing a neural network in code:
$$J(\vec w,b)=\frac{1}{m}\sum_{i=1}^{m}L\left(f\left(\vec x^{(i)}\right),y^{(i)}\right)+\frac{\lambda}{2m}\sum_{\text{all weights }w}w^2$$
```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import L2

# Unregularized MNIST model (binary 0-vs-1 classification, hence one sigmoid unit)
layer_1 = Dense(units=25, activation='relu')
layer_2 = Dense(units=15, activation='relu')
layer_3 = Dense(units=1, activation='sigmoid')
model = Sequential([layer_1, layer_2, layer_3])

# Regularized MNIST model: L2 (weight decay) on every layer's weights
layer_1 = Dense(units=25, activation='relu', kernel_regularizer=L2(0.01))
layer_2 = Dense(units=15, activation='relu', kernel_regularizer=L2(0.01))
layer_3 = Dense(units=1, activation='sigmoid', kernel_regularizer=L2(0.01))
model = Sequential([layer_1, layer_2, layer_3])
```