Model:
\begin{aligned} &y = F(\mathbf{x}) + v\\ &\text{$F(\mathbf{x})$ can be viewed as the oracle model here; it does not change with the training data.} \end{aligned}
where $v$ is additive white noise with variance $\sigma^2_v$. (Note: the noise does not have to be Gaussian, but it does have to be white.)
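As a quick illustration, here is a minimal data-generation sketch. The oracle $F(x) = \sin(2\pi x)$, the noise level, and all names below are hypothetical choices for the toy example, not anything fixed by the text; the noise is drawn from a uniform distribution, which is zero-mean and white but deliberately non-Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    # The oracle model: fixed, does not depend on any training data.
    return np.sin(2 * np.pi * x)

sigma_v = 0.3                        # noise standard deviation
half_width = sigma_v * np.sqrt(3)    # Uniform(-a, a) has variance a^2 / 3

x = rng.uniform(0, 1, 1000)
v = rng.uniform(-half_width, half_width, x.shape)  # white, zero-mean, non-Gaussian
y = F(x) + v

print(np.var(v))   # approximately sigma_v**2 = 0.09
```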
That means, for any $\mathbf{x}_0$, we have
\begin{aligned} & E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0) \\ & \text{Here $(\mathbf{x}_0,y_0)$ can be viewed as a test data point.} \end{aligned}
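A one-line numeric check of this conditional mean, under the same hypothetical oracle as in the sketch above:

```python
import numpy as np

rng = np.random.default_rng(5)
F = lambda x: np.sin(2 * np.pi * x)          # same hypothetical oracle as above

x0 = 0.25
y0 = F(x0) + rng.normal(0, 0.3, 1_000_000)   # many noisy labels at the same x0
print(y0.mean(), F(x0))                      # the two are approximately equal
```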
The expected loss with a predictor $\hat{f}$ is taken w.r.t. $\mathbf{x}_0$ and $y_0$ (it can be interpreted as an expectation w.r.t. the test data):
\begin{aligned} E_{\mathbf{x},y}[(y_0-\hat{f}(\mathbf{x}_0))^2] &=E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0) + F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \\ &=E_{\mathbf{x}, y}[(y_0 - F(\mathbf{x}_0))^2] \qquad (1)\ (=\sigma^2_v)\\ &\quad+E_{\mathbf{x},y}[(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \qquad (2)\ (\text{important})\\ &\quad+2E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \qquad (3)\ (=0) \end{aligned}
The cross term (3) can be written as:
\begin{aligned} & E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\ &=\int\int(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))p(y_0|\mathbf{x}_0)p(\mathbf{x}_0)dy_0d\mathbf{x}_0\\ &=\int\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))]\}(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))p(\mathbf{x}_0)d\mathbf{x}_0 \\ &=0 \end{aligned}
The question that puzzled me here: once $\mathbf{x}_0$ is fixed, why is $\hat{f}(\mathbf{x}_0)$ a constant with respect to $E_{y|\mathbf{x}}$? Because $\hat{f}$ is the model obtained from the training data, it depends only on the training data $X$; once $\hat{f}$ has been obtained, $\hat{f}(\mathbf{x}_0)$ depends only on the input $\mathbf{x}_0$ you plug in, so it is independent of $y_0$. Once this relationship is understood, the formulas that follow come naturally. The derivation below makes this point more intuitive.
Another way to think about the above equation:
\begin{aligned} & E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\ &=E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))|\mathbf{x}_0]\} \\ &=E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))|\mathbf{x}_0](F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\ &= E_{\mathbf{x}}\{(E_{y|\mathbf{x}}[y_0|\mathbf{x}_0]-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\ &\qquad (\text{Note: because } E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0))\\ &=0 \end{aligned}
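The vanishing of the cross term is easy to check numerically. The sketch below (same assumed oracle $F(x)=\sin(2\pi x)$ as above; the degree-3 polynomial fit is likewise a hypothetical choice) trains $\hat{f}$ once, then averages the cross product over fresh test draws. Since $\hat{f}(\mathbf{x}_0)$ never sees $y_0$, the average is close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
F = lambda x: np.sin(2 * np.pi * x)
sigma_v = 0.3

# Train f_hat ONCE; from here on it is a fixed function.
x_tr = rng.uniform(0, 1, 50)
y_tr = F(x_tr) + rng.normal(0, sigma_v, 50)
coef = np.polyfit(x_tr, y_tr, 3)
f_hat = lambda x: np.polyval(coef, x)

# Fresh test data (x0, y0); f_hat never sees y0.
x0 = rng.uniform(0, 1, 200_000)
y0 = F(x0) + rng.normal(0, sigma_v, x0.shape)

cross = np.mean((y0 - F(x0)) * (F(x0) - f_hat(x0)))
print(cross)   # close to 0
```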
We will analyze (2). More precisely, $\hat{f}(\mathbf{x}_0)=\hat{f}(\mathbf{x}_0,X)$, which $\mathbf{depends}$ on the training set $X$. Let's define $\bar{f}(\mathbf{x}_0)=E_X[\hat{f}(\mathbf{x}_0)]$, which does $\mathbf{not \ depend}$ on $X$. Then the term inside (2) can be rewritten as:
\begin{aligned} & (F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2 \qquad (******)\\ &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0) + \bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \\ &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 \qquad (4)\\ &\quad+(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \qquad (5)\\ &\quad+2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)) \qquad (6) \end{aligned}
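Note that $\bar{f}(\mathbf{x}_0)$ is an average over training sets, not over test points. A minimal Monte Carlo sketch of it, under the same hypothetical setup as the sketches above:

```python
import numpy as np

rng = np.random.default_rng(2)
F = lambda x: np.sin(2 * np.pi * x)
sigma_v, n_train, n_datasets = 0.3, 50, 2000
x0 = 0.25                                 # a fixed query point

preds = np.empty(n_datasets)
for i in range(n_datasets):
    x_tr = rng.uniform(0, 1, n_train)     # each round draws a new training set X
    y_tr = F(x_tr) + rng.normal(0, sigma_v, n_train)
    coef = np.polyfit(x_tr, y_tr, 3)      # f_hat depends on this particular X
    preds[i] = np.polyval(coef, x0)       # f_hat(x0)

f_bar_x0 = preds.mean()                   # estimate of f_bar(x0) = E_X[f_hat(x0)]
print(f_bar_x0, F(x0))
```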
The $\mathbf{expectation}$ will now be taken w.r.t. the random $\mathbf{training}$ data set $X$; the cross term (6) can be written as:
\begin{aligned} E_X[2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))] &= 2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))E_X[\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)] \\ &=0 \end{aligned}
since $E_X[\hat{f}(\mathbf{x}_0)] = \bar{f}(\mathbf{x}_0)$ by definition.
Then we take the expectation of $(******)$ w.r.t. $X$. (Note: this is very similar to the familiar $E(\hat{\theta}-\theta)^2$, and the analysis below helps us understand that formula better. Remember, $\hat{\theta}$ is a random variable that depends on the training data, so the expectation in $E(\hat{\theta}-\theta)^2$ is taken w.r.t. the training data.)
\begin{aligned} E_X[(F(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0))^2]&=(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 \\ &\quad+E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]\\ &=\text{Bias}^2 + \text{Variance} \end{aligned}
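This identity can also be verified by simulation. The sketch below (same hypothetical setup as above) estimates both sides at a single fixed $\mathbf{x}_0$; with `ddof=0` variance, the match is exact up to floating point.

```python
import numpy as np

rng = np.random.default_rng(3)
F = lambda x: np.sin(2 * np.pi * x)
sigma_v, n_train, n_datasets = 0.3, 50, 5000
x0 = 0.25                                    # one fixed query point

preds = np.empty(n_datasets)
for i in range(n_datasets):                  # f_hat retrained on each fresh X
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = F(x_tr) + rng.normal(0, sigma_v, n_train)
    preds[i] = np.polyval(np.polyfit(x_tr, y_tr, 3), x0)

lhs = np.mean((F(x0) - preds) ** 2)          # E_X[(F(x0) - f_hat(x0))^2]
bias_sq = (F(x0) - preds.mean()) ** 2        # (F(x0) - f_bar(x0))^2
variance = preds.var()                       # E_X[(f_bar(x0) - f_hat(x0))^2]
print(lhs, bias_sq + variance)               # identical up to floating point
```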
Putting (1), (2), and (3) together, we have the decomposition of the expected error:
\begin{aligned} E_{X, \mathbf{x}_0, y_0}[(y_0 - \hat{f}(\mathbf{x}_0,X))^2] &= \sigma^2_v \qquad (\text{noise variance}) \\ &\quad+\int(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2p(\mathbf{x}_0)d\mathbf{x}_0 \qquad (\text{expected squared bias}) \\ &\quad+ \int E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]p(\mathbf{x}_0)d\mathbf{x}_0 \qquad (\text{expected variance}) \end{aligned}
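As a final end-to-end Monte Carlo check, again under the assumed toy setup (oracle $\sin$, Gaussian noise, degree-3 polynomial fit, all hypothetical choices): the total expected squared error on the left should approximately equal the sum of the three terms on the right.

```python
import numpy as np

rng = np.random.default_rng(4)
F = lambda x: np.sin(2 * np.pi * x)
sigma_v, n_train, n_datasets, n_test = 0.3, 50, 500, 2000

x0 = rng.uniform(0, 1, n_test)                    # test inputs drawn from p(x0)
preds = np.empty((n_datasets, n_test))
total = 0.0
for i in range(n_datasets):                       # a fresh training set X each round
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = F(x_tr) + rng.normal(0, sigma_v, n_train)
    preds[i] = np.polyval(np.polyfit(x_tr, y_tr, 3), x0)
    y0 = F(x0) + rng.normal(0, sigma_v, n_test)   # fresh test labels each round
    total += np.mean((y0 - preds[i]) ** 2)

total /= n_datasets                               # E_{X,x0,y0}[(y0 - f_hat(x0,X))^2]
f_bar = preds.mean(axis=0)                        # f_bar(x0) = E_X[f_hat(x0)]
bias_sq = np.mean((F(x0) - f_bar) ** 2)           # expected squared bias
variance = np.mean(preds.var(axis=0))             # expected variance
print(total, sigma_v**2 + bias_sq + variance)     # the two should be close
```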
Short summary: the data is divided into two parts, training and testing. The expected squared error can be viewed as the true (prediction) error; its randomness comes from both the training data and the test data, which is why the expectation is taken over $X$ as well as $(\mathbf{x}_0, y_0)$.