Definition of VC-dimension
In machine learning, we would like a quantitative measure of the maximum expressive power of a learner. This measure is the VC-dimension.
Intuitive definition
Think of multi-dimensional data as points in space, and the classifier (learner) as a surface.
Fix the number of points $m$. Two players, A and B, play an adversarial game:
- A: chooses the positions of the $m$ points
- B: picks any subset of $0$ to $m$ of these points as one class (in other words, assigns a 0/1 label to each of the $m$ points)
- A: produces a parameter setting under which the learner classifies all $m$ points correctly
B tries to make things as hard for A as possible.
If A can always succeed, the learner can cope with every labeling of some arrangement of $m$ points, and the VC-dimension is $\ge m$; otherwise the VC-dimension is $< m$.
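To make the game concrete, here is a minimal Python sketch (my own illustration, not from the original notes) for a hypothetical 1-D threshold learner $f(x;\theta) = \mathbb{1}[x \ge \theta]$: B proposes every labeling, and A searches for a $\theta$ that realizes it.

```python
from itertools import product

def threshold_classify(x, theta):
    """Hypothetical learner: label 1 iff x >= theta."""
    return 1 if x >= theta else 0

def can_shatter(points):
    """Check whether some theta realizes every 0/1 labeling of `points`.
    For thresholds it suffices to try thetas below, between, and above
    the sorted points."""
    xs = sorted(points)
    thetas = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for labels in product([0, 1], repeat=len(points)):      # B: every labeling
        if not any(all(threshold_classify(x, t) == l for x, l in zip(points, labels))
                   for t in thetas):                        # A: find a theta
            return False
    return True

# Any single point can be shattered; no arrangement of two points can,
# so the VC-dimension of this threshold family is 1.
print(can_shatter([0.0]))        # True
print(can_shatter([0.0, 1.0]))   # False
```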
Logical definition
For a given $m$, if
$$\exists x_1,\cdots,x_m,\ \forall l_1,\cdots,l_m,\ \exists \theta,\ f(x_i;\theta)=l_i,\quad i = 1,\cdots,m$$
then $\mathrm{VC}(f) \ge m$; otherwise $\mathrm{VC}(f) < m$.
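The quantifier string translates almost literally into code. Below is a sketch (again my illustration, assuming a hypothetical interval family $f(x;\theta)$ with $\theta=(a,b)$, labeling 1 iff $a \le x \le b$); the outer $\exists x_1,\cdots,x_m$ is witnessed by the chosen points, and a coarse parameter grid stands in for $\exists \theta$.

```python
from itertools import product

def interval_classify(x, theta):
    """Hypothetical learner: theta = (a, b), label 1 iff a <= x <= b."""
    a, b = theta
    return 1 if a <= x <= b else 0

def labelings_all_realizable(points, thetas):
    """forall l_1..l_m, exists theta, f(x_i; theta) = l_i for all i."""
    return all(
        any(all(interval_classify(x, t) == l for x, l in zip(points, labels))
            for t in thetas)
        for labels in product([0, 1], repeat=len(points))
    )

# A coarse parameter grid stands in for "exists theta".
grid = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]
print(labelings_all_realizable([0, 1], grid))     # True:  VC(f) >= 2
print(labelings_all_realizable([0, 1, 2], grid))  # False: (1, 0, 1) unrealizable
```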
A more general mathematical definition
- A set system $(X, \mathcal{H})$ consists of a set $X$ and a class $\mathcal{H}$ of subsets of $X$, i.e. $\mathcal{H} \subseteq P(X)$ ($X$ is an instance space, $\mathcal{H}$ is a class of classifiers)
- A set system $(X, \mathcal{H})$ shatters a set $A \subseteq X$ iff $\forall A' \subseteq A,\ \exists h \in \mathcal{H},\ A' = A \cap h$
- The VC-dimension of $\mathcal{H}$ is $\mathrm{VC}(\mathcal{H}) = \max_{A \text{ is shattered by } \mathcal{H}} |A|$
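For a finite set system this definition can be checked by brute force. The sketch below is an illustration under that assumption (`shatters` and `vc_dimension` are hypothetical helper names): it enumerates candidate sets $A \subseteq X$ and the achievable intersections $A \cap h$.

```python
from itertools import combinations

def shatters(A, H):
    """(X, H) shatters A iff every A' ⊆ A arises as A ∩ h for some h in H."""
    A = frozenset(A)
    realized = {A & h for h in H}           # all achievable A ∩ h
    return len(realized) == 2 ** len(A)     # need all 2^|A| subsets

def vc_dimension(X, H):
    """VC(H) = size of the largest A ⊆ X shattered by H (brute force)."""
    for m in range(len(X), -1, -1):
        if any(shatters(A, H) for A in combinations(X, m)):
            return m

# Hypothetical example: X = {1..4}, H = the "prefix" sets {1..k}.
X = [1, 2, 3, 4]
H = [frozenset(range(1, k + 1)) for k in range(5)]
print(vc_dimension(X, H))  # 1: prefixes behave like 1-D thresholds
```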
Applications of VC-dimension
With a fixed amount of data, as the number of model parameters grows, the expressive power increases and the training error keeps shrinking, but the VC term in the generalization bound grows, so the test error first decreases and then increases. The increasing phase corresponds to what we usually call overfitting.
Main theorems
Definition
For a set system $(X, \mathcal{H})$, the shatter function $\pi_{\mathcal H}(n)$ is the maximum number of subsets of any set $A$ of size $n$ that can be expressed as $A \cap h$ for some $h \in \mathcal{H}$, i.e.
$$\pi_{\mathcal H}(n) = \max_{|A|=n}\big|\{A\cap h \mid h\in \mathcal H \}\big|$$
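The shatter function is also computable by brute force for a finite set system. A sketch continuing the hypothetical prefix-family illustration from above (same assumptions):

```python
from itertools import combinations

def shatter_function(X, H, n):
    """pi_H(n) = max over A ⊆ X with |A| = n of |{A ∩ h : h in H}|."""
    return max(
        len({frozenset(A) & h for h in H})
        for A in combinations(X, n)
    )

# Prefix family from above: pi_H(n) = n + 1 (the n + 1 prefixes of A).
X = [1, 2, 3, 4]
H = [frozenset(range(1, k + 1)) for k in range(5)]
print([shatter_function(X, H, n) for n in range(1, 5)])  # [2, 3, 4, 5]
```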
Lemma (Sauer)
For a set system $(X, \mathcal{H})$ whose VC-dimension equals $d$,
$$\pi_{\mathcal H}(n) \begin{cases} = 2^n & , n \le d\\ \le \dbinom{n}{\le d} & , n > d \end{cases}$$
where
$$\dbinom{n}{\le d}=\dbinom n 0 + \dbinom n 1 + \cdots + \dbinom n d \le n^d+1$$
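A quick numerical sanity check of the lemma (my addition, reusing the prefix family above, which has $d = 1$ and $\pi_{\mathcal H}(n) = n + 1$):

```python
from math import comb

def binom_le(n, d):
    """binom(n, <= d) = C(n,0) + C(n,1) + ... + C(n,d)."""
    return sum(comb(n, i) for i in range(d + 1))

d = 1  # VC-dimension of the prefix family computed earlier
for n in range(1, 6):
    pi = n + 1                    # pi_H(n) for the prefix family
    if n <= d:
        assert pi == 2 ** n
    else:
        assert pi <= binom_le(n, d) <= n ** d + 1
print("Sauer's lemma checks out on this example")
```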
The Key Theorem
For sufficiently large $n$, namely
$$n \ge \dfrac 4\epsilon \quad\text{and}\quad n \ge \dfrac 1\epsilon \Big( \log_2\pi_{\mathcal H}(2n)+\log_2\dfrac 2 \delta \Big),$$
the following holds: given a training set $T$ with $|T|=n$,
$$\mathrm{Prob}\Big[\exists h,\ \mathrm{TrueErr}(h)\ge\epsilon,\ \mathrm{TrainErr}(h)=0\Big] < \delta$$
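To see what the theorem demands in practice, here is a hypothetical sample-size calculator (my addition, not from the original notes): it bounds $\pi_{\mathcal H}(2n) \le (2n)^d + 1$ via Sauer's lemma and searches for the smallest $n$ meeting both conditions.

```python
from math import log2

def sample_size(epsilon, delta, d):
    """Smallest n with n >= 4/epsilon and
    n >= (1/epsilon) * (log2(pi_H(2n)) + log2(2/delta)),
    bounding pi_H(2n) <= (2n)**d + 1 via Sauer's lemma."""
    n = 1
    while True:
        cond1 = n >= 4 / epsilon
        cond2 = n >= (log2((2 * n) ** d + 1) + log2(2 / delta)) / epsilon
        if cond1 and cond2:
            return n
        n += 1

# e.g. true error below 10% with probability 95% for a class with d = 3
print(sample_size(epsilon=0.1, delta=0.05, d=3))
```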