Markov's inequality
For a non-negative random variable $X$ and any $a > 0$:
$$P(X \geq a) \leq \frac{E[X]}{a}$$
Proof:
$$E[X] = \int_{0}^{+\infty} x f(x) \, dx \geq \int_{a}^{+\infty} x f(x) \, dx \geq \int_{a}^{+\infty} a f(x) \, dx = a \cdot P(X \geq a)$$
Here $f$ is the probability density of $X$. For the discrete case, replace the integrals with sums.
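As a quick sanity check, the inequality can be compared against an empirical tail frequency. The sketch below is only illustrative: the helper `markov_check`, the exponential distribution, the seed, and the thresholds are assumptions chosen for the example, not part of the original text.

```python
import random

def markov_check(samples, a):
    """Return (empirical P(X >= a), Markov bound E[X] / a) for non-negative samples."""
    if a <= 0:
        raise ValueError("a must be positive")
    if any(x < 0 for x in samples):
        raise ValueError("Markov's inequality requires non-negative values")
    n = len(samples)
    tail = sum(1 for x in samples if x >= a) / n
    bound = (sum(samples) / n) / a
    return tail, bound

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
samples = [rng.expovariate(1.0) for _ in range(10_000)]  # Exp(1), so E[X] = 1
for a in (0.5, 1.0, 2.0, 4.0):
    tail, bound = markov_check(samples, a)
    print(f"a={a}: empirical tail={tail:.4f}, Markov bound={bound:.4f}")
```

The bound is usually loose, which is expected: it uses only the mean of $X$ and no other information about its distribution.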
Hoeffding's inequality
Let $X_{i} \in [a_{i}, b_{i}] \, (i = 1, 2, \ldots, n)$ be independent random variables with sum $S_{n} = \sum_{i=1}^{n} X_{i}$. Then for any $t > 0$:
$$P(|S_{n} - E[S_{n}]| \geq t) \leq 2 e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$$
Each one-sided tail, $P(S_{n} - E[S_{n}] \geq t)$ and $P(E[S_{n}] - S_{n} \geq t)$, satisfies the same bound without the factor 2. The two-sided bound follows from the two one-sided bounds by the union bound.
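To get a feel for the bound, the upper-tail form $P(S_{n} - E[S_{n}] \geq t) \leq e^{-2t^2 / \sum_{i}(b_{i} - a_{i})^2}$ can be compared with a simulation. This is a minimal sketch under assumptions of my own: the function names, the choice of Uniform$[0, 1]$ variables, and the values of `n`, `t`, and `trials` are illustrative only.

```python
import math
import random

def hoeffding_one_sided_bound(t, ranges):
    """Upper bound on P(S_n - E[S_n] >= t) for independent X_i in [a_i, b_i]."""
    if t <= 0:
        raise ValueError("t must be positive")
    denom = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2.0 * t * t / denom)

def simulate_upper_tail(n, t, trials, rng):
    """Empirical P(S_n - E[S_n] >= t) for X_i ~ Uniform[0, 1] (illustrative choice)."""
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        if s - n * 0.5 >= t:  # E[S_n] = n / 2 for Uniform[0, 1]
            hits += 1
    return hits / trials

rng = random.Random(0)
n, t = 50, 5.0
bound = hoeffding_one_sided_bound(t, [(0.0, 1.0)] * n)
empirical = simulate_upper_tail(n, t, trials=5_000, rng=rng)
print(f"bound={bound:.4f}, empirical={empirical:.4f}")
```

The simulated frequency should fall well below the bound, since Hoeffding's inequality uses only the ranges $[a_{i}, b_{i}]$ and not the variances.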
Proof:
Hoeffding's lemma
Let $X$ be a random variable with $X \in [a, b]$ and $E[X] = 0$. Then for any $s > 0$:
$$E[e^{sX}] \leq e^{\frac{1}{8} s^2 (b-a)^2}$$
Proof:
Since $e^{sx}$ is a convex function of $x$, for every $X \in [a, b]$ we have:
$$e^{sX} \leq \frac{b-X}{b-a} e^{sa} + \frac{X-a}{b-a} e^{sb}$$
Taking expectations on both sides and using $E[X] = 0$:
$$E[e^{sX}] \leq \frac{b-E[X]}{b-a} e^{sa} + \frac{E[X]-a}{b-a} e^{sb} = \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb}$$
Let $\theta = -\frac{a}{b-a}$ and $u = s(b-a)$. Since $E[X] = 0$, we have $a \leq 0 \leq b$, so $0 \leq \theta \leq 1$ and $u \geq 0$. Assuming $a < 0$ (if $a = 0$, then $X = 0$ almost surely and the lemma is trivial), we obtain:
$$\begin{aligned}
\frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb}
&= -\frac{a}{b-a} e^{sa} \left( e^{s(b-a)} - \frac{b}{a} \right) \\
&= -\frac{a}{b-a} e^{-s(b-a) \frac{-a}{b-a}} \left( e^{s(b-a)} - \frac{b-a}{a} - 1 \right) \\
&= \theta e^{-\theta u} \left( e^{u} + \frac{1}{\theta} - 1 \right) \\
&= e^{-\theta u} (\theta e^{u} + 1 - \theta)
\end{aligned}$$
Let $\psi(u) = e^{-\theta u} (\theta e^{u} + 1 - \theta)$. Since $\theta \geq 0$ and $u \geq 0$, we have $\theta e^{u} + 1 - \theta = 1 + \theta (e^{u} - 1) \geq 1$, so $\psi(u) > 0$.
Let $\phi(u) = \ln[\psi(u)] = -\theta u + \ln(\theta e^{u} + 1 - \theta)$. By Taylor's theorem, there exists $v \in [0, u]$ such that:
$$\phi(u) = \phi(0) + u \cdot \phi'(0) + \frac{u^2}{2!} \cdot \phi''(v) \quad (1)$$
where:
$$\phi'(u) = -\theta + \frac{\theta e^{u}}{\theta e^{u} + 1 - \theta}$$
$$\phi''(u) = \frac{\theta e^{u} (1 - \theta)}{(\theta e^{u} + 1 - \theta)^2}$$
It follows that:
$$\phi(0) = 0$$
$$\phi'(0) = 0$$
$$\phi''(u) = \frac{1}{2 + \left( \frac{\theta}{1 - \theta} e^{u} + \frac{1 - \theta}{\theta} e^{-u} \right)} \leq \frac{1}{4}$$
The last inequality holds for $0 < \theta < 1$ because $x + \frac{1}{x} \geq 2$ for all $x > 0$, applied with $x = \frac{\theta}{1 - \theta} e^{u}$. If $\theta \in \{0, 1\}$, then $\phi''(u) = 0$.
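The bound $\phi''(u) \leq \frac{1}{4}$ can also be sanity-checked numerically. The following is a rough sketch rather than a proof: the function name `phi_second` and the grid of $\theta$ and $u$ values are arbitrary choices made for this illustration.

```python
import math

def phi_second(u, theta):
    """phi''(u) = theta e^u (1 - theta) / (theta e^u + 1 - theta)^2."""
    e_u = math.exp(u)
    return theta * e_u * (1.0 - theta) / (theta * e_u + 1.0 - theta) ** 2

# Grid of theta in (0, 1) and u in [0, 10]; the grid itself is an arbitrary choice.
thetas = [k / 100.0 for k in range(1, 100)]
us = [k / 10.0 for k in range(0, 101)]
worst = max(phi_second(u, th) for th in thetas for u in us)
print(f"largest phi'' on the grid: {worst:.6f}")
```

The maximum is attained where $\theta e^{u} = 1 - \theta$, which is exactly the equality case of $x + \frac{1}{x} \geq 2$.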
From (1):
$$\phi(u) \leq \frac{1}{8} u^2 = \frac{1}{8} s^2 (b-a)^2$$
Hence:
$$E[e^{sX}] \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} = \psi(u) = e^{\phi(u)} \leq e^{\frac{1}{8} s^2 (b-a)^2}$$
This proves Hoeffding's lemma.
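The lemma can be checked numerically on the two-point distribution that turns the convexity step into an equality: $X = a$ with probability $\frac{b}{b-a}$ and $X = b$ with probability $\frac{-a}{b-a}$, which has mean zero. The sketch below assumes $a < 0 < b$; the function names and the test values of $a$, $b$, and $s$ are made up for illustration.

```python
import math

def two_point_mgf(s, a, b):
    """E[e^{sX}] for X = a w.p. b/(b-a) and X = b w.p. -a/(b-a), so that E[X] = 0."""
    if not (a < 0.0 < b):
        raise ValueError("need a < 0 < b for a non-degenerate zero-mean variable")
    p_a = b / (b - a)
    p_b = -a / (b - a)
    return p_a * math.exp(s * a) + p_b * math.exp(s * b)

def lemma_bound(s, a, b):
    """Right-hand side of Hoeffding's lemma: exp(s^2 (b - a)^2 / 8)."""
    return math.exp(s * s * (b - a) ** 2 / 8.0)

for a, b in [(-1.0, 1.0), (-0.2, 3.0), (-5.0, 0.5)]:
    for s in (0.1, 0.5, 1.0, 2.0):
        mgf = two_point_mgf(s, a, b)
        print(f"[a,b]=[{a},{b}] s={s}: E[e^(sX)]={mgf:.4f} <= {lemma_bound(s, a, b):.4f}")
```

For this distribution $E[e^{sX}]$ equals $\psi(u)$ exactly, so the check exercises only the final step $\psi(u) \leq e^{\frac{1}{8} u^2}$.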
We now prove Hoeffding's inequality.
We first prove the one-sided bound $P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$.
For any $s > 0$:
$$P(S_{n} - E[S_{n}] \geq t) = P(e^{s(S_{n} - E[S_{n}])} \geq e^{st})$$
Since the random variable $e^{s(S_{n} - E[S_{n}])}$ is positive, Markov's inequality gives:
$$P(e^{s(S_{n} - E[S_{n}])} \geq e^{st}) \leq \frac{E[e^{s(S_{n} - E[S_{n}])}]}{e^{st}} = e^{-st} \cdot \prod_{i=1}^{n} E[e^{s(X_{i} - E[X_{i}])}]$$
where the last equality uses the independence of the $X_{i}$.
The random variable $X_{i} - E[X_{i}]$ satisfies $E[X_{i} - E[X_{i}]] = 0$ and $X_{i} - E[X_{i}] \in [a_{i} - E[X_{i}], b_{i} - E[X_{i}]]$, an interval of length $b_{i} - a_{i}$.
Then by Hoeffding's lemma:
$$e^{-st} \cdot \prod_{i=1}^{n} E[e^{s(X_{i} - E[X_{i}])}] \leq e^{-st} \cdot \prod_{i=1}^{n} e^{\frac{s^2 (b_{i} - a_{i})^2}{8}} = e^{-st + \frac{1}{8} s^2 \sum_{i=1}^{n} (b_{i} - a_{i})^2}$$
Let $g(s) = -st + \frac{1}{8} s^2 \sum_{i=1}^{n} (b_{i} - a_{i})^2$. Then:
$$g'(s) = -t + \frac{1}{4} s \sum_{i=1}^{n} (b_{i} - a_{i})^2$$
Solving $g'(s) = 0$ gives:
$$s = \frac{4t}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}$$
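Substituting this value of $s$ back into $g$ makes the resulting exponent explicit. The shorthand $D = \sum_{i=1}^{n} (b_{i} - a_{i})^2$ is introduced here only for readability and does not appear in the original:
$$g\left(\frac{4t}{D}\right) = -\frac{4t^2}{D} + \frac{1}{8} \cdot \frac{16 t^2}{D^2} \cdot D = -\frac{4t^2}{D} + \frac{2t^2}{D} = -\frac{2t^2}{D}$$
Because $g$ is a quadratic in $s$ with a positive leading coefficient, this stationary point is its minimum, and it is a valid choice since $s > 0$ whenever $t > 0$.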
Hence:
$$P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$$
The bound $P(E[S_{n}] - S_{n} \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$ is proved in the same way.
This completes the proof of Hoeffding's inequality.
Generalization error bound for binary classification
Theorem:
For binary classification, when the hypothesis space is a finite set of functions $F = \{f_{1}, f_{2}, \ldots, f_{d}\}$, for any function $f \in F$ the following inequality holds with probability at least $1 - \delta$:
$$R(f) \leq \hat R(f) + \varepsilon(d, N, \delta) \quad (2)$$
Here the left-hand side $R(f)$ is the generalization error:
$$R(f) = E[L(Y, f(X))]$$
The right-hand side is the upper bound on the generalization error, with:
$$\hat R(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_{i}, f(x_{i}))$$
$$\varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N} \left( \ln d + \ln \frac{1}{\delta} \right)} \quad (3)$$
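In the bound, the first term is the training error. The second term $\varepsilon(d, N, \delta)$ is a monotonically decreasing function of $N$ and tends to $0$ as $N \to +\infty$. It is also of order $\sqrt{\ln d}$: the more functions the hypothesis space $F$ contains, the larger its value.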
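The dependence of $\varepsilon(d, N, \delta)$ on $N$ and $d$ is easy to see by evaluating formula (3) directly. This is a small sketch: the function name `epsilon` and every numeric value of $d$, $N$, and $\delta$ below are illustrative assumptions, not values from the text.

```python
import math

def epsilon(d, N, delta):
    """epsilon(d, N, delta) = sqrt((ln d + ln(1/delta)) / (2N)) from formula (3)."""
    if d < 1 or N < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need d >= 1, N >= 1 and 0 < delta < 1")
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))

# Illustrative values only: d, N and delta below are not taken from the text.
for N in (100, 1_000, 10_000):
    print(f"d=100, N={N}: eps={epsilon(100, N, 0.05):.4f}")
for d in (10, 1_000, 100_000):
    print(f"d={d}, N=1000: eps={epsilon(d, 1_000, 0.05):.4f}")
```

Multiplying $N$ by 100 shrinks $\varepsilon$ by a factor of 10, while multiplying $d$ by 100 only adds $\ln 100$ under the square root.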
Proof:
For any function $f \in F$, $\hat R(f)$ is the sample mean of $N$ independent observations of the random variable $L(Y, f(X))$, and $R(f)$ is the expectation of $L(Y, f(X))$. Then:
$$\begin{aligned}
P(R(f) - \hat R(f) \geq \varepsilon) &= P(N \cdot R(f) - N \cdot \hat R(f) \geq N \cdot \varepsilon) \\
&= P\left( N \cdot E[L(Y, f(X))] - \sum_{i=1}^{N} L(y_{i}, f(x_{i})) \geq N \cdot \varepsilon \right)
\end{aligned}$$
Let $S_{N} = \sum_{i=1}^{N} L(y_{i}, f(x_{i}))$. Then:
$$E[S_{N}] = E\left[ \sum_{i=1}^{N} L(y_{i}, f(x_{i})) \right] = N \cdot E[L(Y, f(X))]$$
If the loss function takes values in $[0, 1]$, that is, $[a_{i}, b_{i}] = [0, 1]$ for all $i$, then by Hoeffding's inequality with $t = N \varepsilon$, for any $\varepsilon > 0$:
$$\begin{aligned}
P\left( N \cdot E[L(Y, f(X))] - \sum_{i=1}^{N} L(y_{i}, f(x_{i})) \geq N \cdot \varepsilon \right)
&= P(E[S_{N}] - S_{N} \geq N \cdot \varepsilon) \\
&\leq e^{\frac{-2(N \varepsilon)^2}{N(1-0)^2}} = e^{-2N \varepsilon^2}
\end{aligned}$$
Since $F = \{f_{1}, f_{2}, \ldots, f_{d}\}$ is a finite set, the union bound gives:
$$\begin{aligned}
P(\exists f \in F: \, R(f) - \hat R(f) \geq \varepsilon) &= P\left( \bigcup_{f \in F} \{ R(f) - \hat R(f) \geq \varepsilon \} \right) \\
&\leq \sum_{f \in F} P(R(f) - \hat R(f) \geq \varepsilon) \leq d \cdot e^{-2N \varepsilon^2}
\end{aligned}$$
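The union bound above can be compared with a small simulation. This is a sketch under loud assumptions: the classifiers are hypothetical, their true error rates in `true_risks` are made up, their 0-1 losses are drawn independently across classifiers for simplicity, and `N`, `eps`, and `trials` are arbitrary.

```python
import math
import random

def union_bound(d, N, eps):
    """Right-hand side d * exp(-2 N eps^2) of the union bound."""
    return d * math.exp(-2.0 * N * eps * eps)

def simulate_any_large_gap(true_risks, N, eps, trials, rng):
    """Empirical P(exists f: R(f) - R_hat(f) >= eps) with 0-1 losses.

    Each hypothetical classifier errs with probability true_risks[j] on every
    sample; independence across classifiers is a simplifying assumption.
    """
    hits = 0
    for _ in range(trials):
        for r in true_risks:
            r_hat = sum(1 for _ in range(N) if rng.random() < r) / N
            if r - r_hat >= eps:
                hits += 1
                break  # one violating classifier is enough for the event
    return hits / trials

rng = random.Random(0)
true_risks = [0.5, 0.4, 0.3, 0.2, 0.1]  # made-up generalization errors
N, eps = 100, 0.15
bound = union_bound(len(true_risks), N, eps)
freq = simulate_any_large_gap(true_risks, N, eps, trials=2_000, rng=rng)
print(f"union bound={bound:.4f}, simulated frequency={freq:.4f}")
```

The simulated frequency is expected to sit well below the bound, because both Hoeffding's inequality and the union bound are worst-case estimates.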
Equivalently, for every $f \in F$:
$$P(R(f) - \hat R(f) < \varepsilon) \geq 1 - d \cdot e^{-2N \varepsilon^2}$$
Let:
$$\delta = d \cdot e^{-2N \varepsilon^2} \quad (4)$$
This gives:
$$P(R(f) < \hat R(f) + \varepsilon) \geq 1 - \delta$$
That is, with probability at least $1 - \delta$ we have $R(f) < \hat R(f) + \varepsilon$, where $\varepsilon$ is obtained by solving (4), which gives exactly (3).
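For completeness, the step from (4) to (3) can be written out by taking logarithms and solving for $\varepsilon$:
$$\delta = d \cdot e^{-2N \varepsilon^2} \iff \ln \frac{\delta}{d} = -2N \varepsilon^2 \iff \varepsilon^2 = \frac{1}{2N} \left( \ln d + \ln \frac{1}{\delta} \right)$$
Taking the positive square root recovers $\varepsilon(d, N, \delta)$ as defined in (3).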
This proves inequality (2), and hence the theorem.
References
Li Hang, Statistical Learning Methods (统计学习方法), 1st edition, Section 1.6.2