Lecture 6: Theory of Generalization
Restriction of Break Point
$$
\begin{aligned}
m_{\mathcal{H}}(N) &\leq \text{maximum possible } m_{\mathcal{H}}(N) \text{ given } k \\
&\leq \mathrm{poly}(N)
\end{aligned}
$$
Fun Time
When the minimum break point is $k = 1$, what is the maximum possible $m_{\mathcal{H}}(N)$ when $N = 3$?

1. 1 ✓
2. 2
3. 3
4. 4
Explanation
Since $k = 1$, not even a single point can be shattered, so $m_{\mathcal{H}}(N) = 1$.
Bounding Function: Basic Cases
Bounding Function
bounding function $B(N, k)$: the maximum possible $m_{\mathcal{H}}(N)$ when the break point is $k$

$$B(N, k) \leq \mathrm{poly}(N)$$
In other words, $B(N, k)$ is an upper bound on $m_{\mathcal{H}}(N)$.
Table of Bounding Function
Fun Time
For 2D perceptrons, which of the following claims is true?
1 minimum break point $k = 2$
2 $m_{\mathcal{H}}(4) = 15$
3 $m_{\mathcal{H}}(N) < B(N, k)$ when $N = k =$ minimum break point ✓
4 $m_{\mathcal{H}}(N) > B(N, k)$ when $N = k =$ minimum break point
Explanation
minimum break point $k = 4$ (three points in general position can be shattered, four cannot)
$m_{\mathcal{H}}(4) = 14 < B(4, 4) = 15$
$B(N, k)$ is an upper bound on $m_{\mathcal{H}}(N)$
If you don't remember the 2D perceptron, review the Effective Number of Hypotheses section of Lecture 5: Training versus Testing.
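As an aside, the value $m_{\mathcal{H}}(4) = 14$ can be cross-checked against Cover's function counting theorem (not part of this lecture): for $N$ points in general position in $\mathbb{R}^d$, a perceptron with a bias term realizes exactly $2\sum_{i=0}^{d}\binom{N-1}{i}$ dichotomies. A minimal sketch:

```python
from math import comb

def perceptron_dichotomies(n, d=2):
    """Cover's function counting theorem: number of dichotomies of n
    points in general position in R^d realizable by a perceptron
    (affine separator, i.e. with a bias term)."""
    return 2 * sum(comb(n - 1, i) for i in range(d + 1))

print(perceptron_dichotomies(3))  # 8  -> 3 points are shattered
print(perceptron_dichotomies(4))  # 14 -> matches m_H(4) = 14 above
```

For $N = 3$ the count is $2^3 = 8$, i.e. three points in general position are shattered, which is exactly why the minimum break point is 4.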
Bounding Function: Inductive Cases
$$B(4, 3) = 11 = 2\alpha + \beta$$
$$
\begin{aligned}
B(N, k) &= 2\alpha + \beta \\
\alpha + \beta &\leq B(N-1, k) \\
\alpha &\leq B(N-1, k-1) \\
\Rightarrow B(N, k) &\leq B(N-1, k) + B(N-1, k-1)
\end{aligned}
$$

$$B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$$
The $\leq$ actually holds with equality, i.e.

$$B(N, k) = B(N-1, k) + B(N-1, k-1)$$

$$B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i} = C_N^0 + C_N^1 + \cdots + C_N^{k-1}$$
2D perceptrons have break point 4, so

$$m_{\mathcal{H}}(N) \leq B(N, 4) = \frac{1}{6} N^{3} + \frac{5}{6} N + 1 = O(N^3)$$
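The recursion, the closed form, and the $k = 4$ polynomial above can all be cross-checked numerically. A minimal sketch (the base cases $B(N, 1) = 1$ and $B(1, k) = 2$ for $k \geq 2$ follow directly from the definition of the bounding function):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    """Bounding function via the recursion
    B(N, k) = B(N-1, k) + B(N-1, k-1),
    with base cases B(N, 1) = 1 and B(1, k) = 2 for k >= 2."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B(N - 1, k) + B(N - 1, k - 1)

def B_closed(N, k):
    """Closed form: B(N, k) = sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

# The recursion agrees with the closed form everywhere tested ...
assert all(B(n, k) == B_closed(n, k)
           for n in range(1, 15) for k in range(1, 8))
# ... and for k = 4 it equals (N^3 + 5N + 6) / 6, i.e. N^3/6 + 5N/6 + 1.
assert all(B(n, 4) == (n**3 + 5 * n + 6) // 6 for n in range(1, 15))
print(B(4, 3))  # 11, matching B(4,3) = 11 above
```

Memoization via `lru_cache` keeps the recursion linear in the number of distinct $(N, k)$ pairs rather than exponential.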
Fun Time
For 1D perceptrons (positive and negative rays), we know that $m_{\mathcal{H}}(N) = 2N$. Let $k$ be the minimum break point. Which of the following is not true?
1 $k = 3$
2 for some integers $N > 0$, $m_{\mathcal{H}}(N) = \sum_{i=0}^{k-1} \binom{N}{i}$
3 for all integers $N > 0$, $m_{\mathcal{H}}(N) = \sum_{i=0}^{k-1} \binom{N}{i}$ ✓
4 for all integers $N > 2$, $m_{\mathcal{H}}(N) < \sum_{i=0}^{k-1} \binom{N}{i}$
Explanation
minimum break point $k = 3$

$$B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i}$$

$B(N, k)$ is an upper bound on $m_{\mathcal{H}}(N)$: when $N \geq k$, $m_{\mathcal{H}}(N) < B(N, k)$; when $N < k$, $m_{\mathcal{H}}(N) = B(N, k)$.
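The growth function $m_{\mathcal{H}}(N) = 2N$ and its relation to the bound $\sum_{i=0}^{k-1}\binom{N}{i}$ with $k = 3$ can be verified by enumerating the dichotomies directly. A minimal sketch, assuming $N$ distinct points on the line:

```python
from math import comb

def ray_dichotomies(N):
    """Count the distinct dichotomies on N ordered points on a line
    produced by positive rays (+1 to the right of a threshold) and
    negative rays (their sign flips), with the threshold placed in
    each of the N + 1 gaps."""
    dichos = set()
    for cut in range(N + 1):  # threshold sits before point index `cut`
        pos_ray = tuple(+1 if i >= cut else -1 for i in range(N))
        dichos.add(pos_ray)                     # positive ray
        dichos.add(tuple(-s for s in pos_ray))  # negative ray
    return len(dichos)

for N in range(1, 7):
    bound = sum(comb(N, i) for i in range(3))  # sum_{i=0}^{2} C(N, i)
    print(N, ray_dichotomies(N), bound)
```

For $N = 1, 2$ the count equals the bound ($2 = 2$, $4 = 4$), and for every $N > 2$ it is strictly smaller ($6 < 7$, $8 < 11$, ...), matching options 2 and 4 and refuting option 3 above.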
Extension: recall the Effective Number of Hypotheses Fun Time in Lecture 5: Training versus Testing, which asks for the effective number of dichotomies of 5 points under the 2D perceptron ($k = 4$, $N = 5$, $m_{\mathcal{H}}(N) = \,? \leq \frac{1}{6} N^{3} + \frac{5}{6} N + 1$; since $N > k$, equality is not attained). The correct answer there is $22 < \frac{125}{6} + \frac{25}{6} + 1 = 26$, so the bound checks out; revisiting that question is quite interesting.
A Pictorial Proof
Replace $E_{\mathrm{out}}$ (infinite) with $E_{\mathrm{in}}'$ (finite, measured on a second "ghost" sample); I have not fully worked out where this inequality and its coefficient $\frac{1}{2}$ come from.

The upper bound is then expressed in terms of $m_{\mathcal{H}}(2N)$.

Using Hoeffding's inequality without replacement gives a similar result, only with $\nu = E_{\mathrm{in}}$ and $\mu = \frac{E_{\mathrm{in}} + E_{\mathrm{in}}'}{2}$.
Vapnik-Chervonenkis (VC) bound
$$
\mathbb{P}\left[\exists h \in \mathcal{H} \text{ s.t. } \left|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)\right| > \epsilon\right] \leq 4\, m_{\mathcal{H}}(2N) \exp\left(-\frac{1}{8} \epsilon^{2} N\right)
$$
$m_{\mathcal{H}}(N)$ can replace $M$ with a few changes
Fun Time
For positive rays, $m_{\mathcal{H}}(N) = N + 1$. Plug it into the VC bound for $\epsilon = 0.1$ and $N = 10000$. What is the VC bound on BAD events?
$$
\mathbb{P}\left[\exists h \in \mathcal{H} \text{ s.t. } \left|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)\right| > \epsilon\right] \leq 4\, m_{\mathcal{H}}(2N) \exp\left(-\frac{1}{8} \epsilon^{2} N\right)
$$
1 $2.77 \times 10^{-87}$
2 $5.54 \times 10^{-83}$
3 $2.98 \times 10^{-1}$ ✓
4 $2.29 \times 10^{-2}$
Explanation
Just substitute into the formula: $4\, m_{\mathcal{H}}(2N) \exp(-\frac{1}{8}\epsilon^2 N) = 4 \times 20001 \times e^{-12.5} \approx 0.2981471603789822$.
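The same number can be reproduced in a few lines. A minimal sketch, assuming the positive-ray growth function $m_{\mathcal{H}}(N) = N + 1$ from the question:

```python
from math import exp

def vc_bound(eps, N, growth):
    """VC bound on the probability of a BAD event:
    4 * m_H(2N) * exp(-eps^2 * N / 8)."""
    return 4 * growth(2 * N) * exp(-(eps ** 2) * N / 8)

# positive rays: m_H(N) = N + 1, so m_H(2N) = 2N + 1
p = vc_bound(0.1, 10000, lambda n: n + 1)
print(p)  # ~0.2981
```

Note that even with $N = 10000$ samples the bound is only about $0.3$; the VC bound is loose, which the next lecture's discussion of the VC dimension takes as a starting point for interpretation rather than exact calculation.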
Summary
This lecture mainly covered the bounding function $B(N, k)$ and the meaning and derivation of the VC bound.
Lecture summary
If $m_{\mathcal{H}}(N)$ has a break point and $N$ is large enough, then $E_{\mathrm{out}} \approx E_{\mathrm{in}}$.
Restriction of Break Point
break point ‘breaks’ consequent points
Bounding Function: Basic Cases
$B(N, k)$ bounds $m_{\mathcal{H}}(N)$ with break point $k$
Bounding Function: Inductive Cases
$B(N, k)$ is $\mathrm{poly}(N)$
A Pictorial Proof
$m_{\mathcal{H}}(N)$ can replace $M$ with a few changes
References
《Machine Learning Foundations》(机器学习基石)—— Hsuan-Tien Lin (林轩田)