Lec2.1 Regularization
Most machine learning tasks are estimation of a function $\hat{f}(x)$ parameterized by a vector of parameters $\theta$.
A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs.
In machine learning or deep learning we assume a model $\hat{f}(x)$ with a set of parameters $\theta$, and learn the parameters from the training data. We want this model to perform well not only on the training data but also on new data.
True function: $f(x)$; estimated function: $\hat{f}(x)$.
Given a training set $T = \{(x_i, y_i)\}_{i=1}^{n}$ with $y_i = f(x_i) + \epsilon_i$ and $\epsilon_i \sim N(0, \sigma^2)$: each observation is the true value plus a noise term.
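As a concrete instance of this setup, here is a minimal sketch; the true function `f`, the sample size, and the noise level are all assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true function (an assumption for illustration)."""
    return np.sin(2 * np.pi * x)

n, sigma = 50, 0.3
x = rng.uniform(0.0, 1.0, n)
eps = rng.normal(0.0, sigma, n)   # epsilon_i ~ N(0, sigma^2)
y = f(x) + eps                    # observation = truth + noise
T = list(zip(x, y))               # the training set T = {(x_i, y_i)}
```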
For a given point, we relate the prediction error to the true error as follows, distinguishing two cases: the point is not in the training set, or it is.
For a point $(x_0, y_0)$,

$$E[(\hat{y}_0-y_0)^2] = E[(\hat{f}_0-f_0-\epsilon_0)^2] = E[(\hat{f}_0-f_0)^2] - 2E[\epsilon_0(\hat{f}_0-f_0)] + \sigma^2$$
Case 1: assume $(x_0, y_0) \notin T$.
$$E[\epsilon_0(\hat{f}_0-f_0)] = E[(y_0-f_0)(\hat{f}_0-f_0)] = \mathrm{cov}(y_0, \hat{f}_0) = 0,$$

since $\hat{f}$ was fit without $(x_0, y_0)$, so $\hat{f}_0$ is independent of $y_0$.
Summing over all $m$ points that are not in $T$,

$$\sum_{i=1}^m(\hat{y}_i-y_i)^2 = \sum_{i=1}^m(\hat{f}_i-f_i)^2 + m\sigma^2$$
The left side is the estimated error (err); the first term on the right is the true error (Err); the second term is a constant.
So $\mathrm{Err} = \mathrm{err} - m\sigma^2$: when the points are not in the training set, the prediction error (up to a known constant) faithfully tracks the true error. This is the principle behind cross-validation.
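A small simulation makes this relation concrete; the linear true function, noise level, and use of a plain least-squares line are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Hypothetical true function (an assumption for illustration)."""
    return 2.0 * x

sigma, m = 0.5, 100
x_tr = rng.uniform(0.0, 1.0, m); y_tr = f(x_tr) + rng.normal(0.0, sigma, m)
x_te = rng.uniform(0.0, 1.0, m); y_te = f(x_te) + rng.normal(0.0, sigma, m)

# Fit on T, then evaluate on held-out points that are NOT in T.
w = np.polyfit(x_tr, y_tr, 1)
f_hat = np.polyval(w, x_te)

err = np.sum((f_hat - y_te) ** 2)      # observable prediction error
Err = np.sum((f_hat - f(x_te)) ** 2)   # true error (unknown in practice)
# Err ≈ err - m * sigma^2 for points outside the training set
```

In practice $\sigma^2$ is unknown, but since $m\sigma^2$ does not depend on the model, comparing held-out `err` across models still ranks them by `Err`.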
Case 2: assume $(x_0, y_0) \in T$.
We need a theorem here, Stein's Lemma: if $x \sim N(\theta, \sigma^2)$ and $g(x)$ is differentiable, then $E[g(x)(x-\theta)] = \sigma^2 E\!\left[\frac{\partial g(x)}{\partial x}\right]$.
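A quick Monte Carlo sanity check of the lemma, using $g(x) = x^2$ (so $g'(x) = 2x$) with illustrative values $\theta = 1$, $\sigma = 2$; for these values both sides equal $E[x^3] - \theta E[x^2] = 13 - 5 = 8$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 1.0, 2.0
x = rng.normal(theta, sigma, 1_000_000)

# Stein's lemma with g(x) = x^2: E[x^2 (x - theta)] = sigma^2 * E[2x]
lhs = np.mean(x**2 * (x - theta))
rhs = sigma**2 * np.mean(2 * x)
# both sides should be close to the exact value 8
```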
$$E[\epsilon_0(\hat{f}_0-f_0)] = \sigma^2 E\!\left[\frac{\partial(\hat{f}_0-f_0)}{\partial \epsilon_0}\right] = \sigma^2 E\!\left[\frac{\partial \hat{f}_0}{\partial y_0}\frac{\partial y_0}{\partial \epsilon_0}\right] = \sigma^2 E\!\left[\frac{\partial \hat{f}_0}{\partial y_0}\right] = \sigma^2 D_0$$
$D_0$ measures the complexity of the model: how sensitive the fit $\hat{f}_0$ is to its own observation $y_0$.
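For a linear smoother $\hat{y} = Hy$, such as least squares where $H = X(X^\top X)^{-1}X^\top$, we have $D_i = \partial\hat{f}_i/\partial y_i = H_{ii}$, and $\sum_i D_i = \mathrm{tr}(H)$ equals the number of parameters. A sketch (the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 30, 4                 # m training points, p parameters
X = rng.normal(size=(m, p))

# Least squares is a linear smoother: y_hat = H y with H = X (X^T X)^{-1} X^T,
# so D_i = d y_hat_i / d y_i is the i-th diagonal entry of H.
H = X @ np.linalg.solve(X.T @ X, X.T)
D = np.diag(H)

total_complexity = D.sum()   # = trace(H) = p for least squares
```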
Summing over all $m$ points that are in $T$,

$$\sum_{i=1}^m(\hat{y}_i-y_i)^2 = \sum_{i=1}^m(\hat{f}_i-f_i)^2 - 2\sigma^2\sum_{i=1}^m D_i + m\sigma^2$$
Now the training error (err) cannot represent the true error (Err): the true error is the training error plus a bias that grows with model complexity.
$$\mathrm{Err} = \mathrm{err} + 2\sigma^2\sum_{i=1}^m D_i - m\sigma^2$$
This is the motivation for regularization: minimize $J(\theta; x, y) + \Omega(\theta)$, i.e. add to the training loss a penalty term related to the model's complexity.
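As one concrete instance, ridge regression takes $\Omega(\theta) = \lambda\|\theta\|^2$; a sketch, where the function name `ridge_fit` and the synthetic data are assumptions for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize J(w) = ||Xw - y||^2 + lam * ||w||^2 via the closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.ones(5) + rng.normal(0.0, 0.1, 50)

w0 = ridge_fit(X, y, 0.0)    # lam = 0: ordinary least squares
w1 = ridge_fit(X, y, 10.0)   # the penalty shrinks the weights
```

Larger $\lambda$ shrinks the weights toward zero, trading a little training error for lower effective complexity $\sum_i D_i$.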
The figure above illustrates this: as model complexity increases, the training error decreases monotonically, while the true error first decreases and then increases.
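These curves can be reproduced with a small simulation, using polynomial degree as the complexity axis; the true function, noise level, and degree range are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true function (an assumption for illustration)."""
    return np.sin(2 * np.pi * x)

n, sigma = 30, 0.3
x = np.linspace(0.0, 1.0, n)
y = f(x) + rng.normal(0.0, sigma, n)

train_err, true_err = [], []
for degree in range(1, 11):                        # complexity axis
    coeffs = np.polyfit(x, y, degree)
    f_hat = np.polyval(coeffs, x)
    train_err.append(np.sum((f_hat - y) ** 2))     # err: observable
    true_err.append(np.sum((f_hat - f(x)) ** 2))   # Err: needs the truth
```

Since the polynomial families are nested, the training error can only go down as the degree grows, while the true error bottoms out once the model starts fitting noise.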