Bias / Variance Tradeoff
In the diagram below, we modeled the training set with 2, 3, and 7 parameters.

As we can see, $h_1$ is an underfit of the training set. No matter how large the training set grows, the model cannot capture the structure of the data. This model is said to have high bias.

On the contrary, $h_6$ is an overfit of the training set. The model is too sensitive to random factors that we don't want to include in our model. This model is said to have high variance.
ERM
For the purpose of simplicity, consider binary classification with
- labels $y \in \{0,1\}$
- training set $S = \{(x^{(i)}, y^{(i)}) \mid i = 1,\dots,m\}$
- new sample $(x, y)$
As one of the PAC (probably approximately correct) assumptions, assume that there exists some distribution $D$ such that

$$(x,y),\ (x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \overset{i.i.d.}{\sim} D$$

For a hypothesis $h$ define
$$\begin{array}{rcl} Z &=& I\{h(x) \neq y\}\\ Z_i &=& I\{h(x^{(i)}) \neq y^{(i)}\} \end{array}$$

Denote the training error (empirical risk) as
$$\hat\epsilon(h) = \frac{1}{m} \sum\limits_{i=1}^m Z_i$$

and the generalization error as
$$\epsilon(h) = P(Z)$$

Then it is obvious that
$$Z, Z_1, Z_2, \dots, Z_m \overset{i.i.d.}{\sim} \mathrm{Bern}(\epsilon(h))$$

Think of the problem as picking $h$ from a hypothesis class, for instance
$$H = \{h_\theta \mid \theta\in\mathbb{R}^{n+1}\}$$

and our goal is empirical risk minimization
$$\hat h = \arg\min_{h\in H}\hat\epsilon(h)$$

Define the theoretically best hypothesis in $H$:
$$h^* = \arg\min_{h\in H} \epsilon(h)$$
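As a concrete toy illustration of ERM, the sketch below minimizes training error over a small finite class of threshold classifiers; the data, the 10% label noise, and the threshold grid are all made-up choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: y = 1 when x > 0.3, plus some label noise.
m = 200
x = rng.uniform(0, 1, m)
y = (x > 0.3).astype(int)
flip = rng.random(m) < 0.1          # flip 10% of labels
y[flip] = 1 - y[flip]

# Finite hypothesis class: h_t(x) = 1{x > t} for a grid of thresholds t.
thresholds = np.linspace(0, 1, 21)

def empirical_risk(t):
    """hat-epsilon(h_t): fraction of training mistakes of h_t."""
    return np.mean((x > t).astype(int) != y)

# ERM: pick the hypothesis with the smallest training error.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(t_hat, risks.min())
```

With enough data the ERM threshold lands near the true cutoff 0.3, and the minimal empirical risk approaches the 10% noise rate.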
Uniform Convergence
Suppose $H = \{h_1,\dots,h_k\}$ is a finite set. For any $h_i \in H$, denote
$$A_i = I\{|\epsilon(h_i) - \hat\epsilon(h_i)|>\gamma\}$$

The Hoeffding inequality gives that
$$P(A_i) \le 2\exp(-2\gamma^2 m)$$

Using the union bound, we have that
$$\begin{array}{rcl} P(\exists h_i \in H,\ |\epsilon(h_i) - \hat\epsilon(h_i)|>\gamma) &=& P\Big(\bigcup\limits_{i=1}^k A_i\Big)\\ &\le& \sum\limits_{i=1}^k P(A_i)\\ &\le& 2k\exp(-2\gamma^2 m) \end{array}$$

Therefore
$$P(\forall h_i\in H,\ |\epsilon(h_i) - \hat\epsilon(h_i)|\le\gamma) \ge 1-2k\exp(-2\gamma^2 m)$$
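A quick Monte Carlo sanity check of this bound. The hypothesis class here is abstract: each $h_i$ is represented only by its true error rate, so each empirical error is a scaled Binomial draw; all the constants ($k$, $m$, $\gamma$, the error rates) are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# k hypotheses, each with a fixed true error eps_i; Z ~ Bern(eps_i).
k, m, gamma = 10, 3000, 0.05
true_errs = rng.uniform(0.1, 0.4, k)

trials = 1000
bad = 0                                   # runs where some |eps - hat_eps| > gamma
for _ in range(trials):
    # Empirical error of each hypothesis on a fresh sample of size m.
    hat_errs = rng.binomial(m, true_errs) / m
    if np.any(np.abs(hat_errs - true_errs) > gamma):
        bad += 1

empirical_prob = bad / trials
bound = 2 * k * np.exp(-2 * gamma**2 * m)  # union + Hoeffding bound
print(empirical_prob, bound)
```

At this sample size the bound is already tiny, and the simulated frequency of a large deviation is on the same tiny scale (typically zero observed events).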
Error bound
Given $m$ and $\delta>0$, with probability at least $1-\delta$ we have that
$$\forall h\in H,\ |\epsilon(h)-\hat\epsilon(h)| \le \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

Since
$$\hat\epsilon(\hat h) \le \hat\epsilon(h^*)$$

we have that
$$\epsilon(\hat h) \le \Big(\min_{h\in H}\epsilon(h)\Big) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

When we expand $H$ to some superset $H' \supset H$, the first term
$\min\limits_{h\in H}\epsilon(h)$ can only decrease, while the second term $\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$ can only increase (since $k$ grows). This loosely corresponds to the bias-variance tradeoff.
Sample Complexity Bound
Given $\gamma$ and $\delta > 0$, in order for

$$\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$$

to be true with probability at least $1-\delta$, it suffices that
$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big)$$
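The bound translates directly into code; the example numbers ($k$, $\gamma$, $\delta$) below are arbitrary:

```python
import math

def sample_complexity(gamma, delta, k):
    """Smallest integer m with m >= 1/(2 gamma^2) * log(2k / delta)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma**2))

# e.g. k = 1000 hypotheses, accuracy gap 2*gamma = 0.1, confidence 95%
m = sample_complexity(gamma=0.05, delta=0.05, k=1000)
print(m)
```

Note the pleasant logarithmic dependence on $k$: squaring the number of hypotheses only doubles the $\log$ term.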
VC Dimension
Given a set $S = \{x^{(1)}, \dots, x^{(d)}\}$, we say that $H$ shatters $S$ if
$$\forall (y^{(1)},\dots,y^{(d)}) \in \{0,1\}^d,\ \exists h\in H,\ s.t.\ \forall i \in \{1,\dots,d\},\ h(x^{(i)}) = y^{(i)}$$

Define the Vapnik-Chervonenkis dimension
$VC(H)$ to be the size of the largest set that is shattered by $H$. It can be shown that if $H$ contains all linear classifiers in $n$ dimensions, then $VC(H) = n+1$.
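A heuristic way to see this for $n = 2$: search for a separating hyperplane (perceptron-style, with a bias feature) under every labeling of a point set. Treating non-convergence within a fixed budget as "not separable" is only a heuristic, though it is reliable for these tiny examples:

```python
import itertools
import numpy as np

def linearly_separable(X, y, passes=2000):
    """Heuristic separability check via the perceptron with a bias feature.
    The perceptron provably converges when the labeling is separable; we
    treat failure to converge within `passes` sweeps as 'not separable'."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(passes):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (theta @ xi) <= 0:          # mistake (0 counts as wrong)
                theta += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shatters(X):
    """Check every +/-1 labeling of the points in X."""
    return all(linearly_separable(X, np.array(lbl))
               for lbl in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position
xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
xor_labels = np.array([1, 1, -1, -1])           # the XOR labeling
print(shatters(three))                          # 3 points: shattered
print(linearly_separable(xor, xor_labels))      # XOR: not separable
```

Three points in general position are shattered, while the XOR labeling of four points defeats every linear classifier, consistent with $VC(H) = n + 1 = 3$ in two dimensions.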
For SVMs using kernels, the VC dimension is usually small. If

$$\exists R,\ s.t.\ \forall i \in \{1,\dots,m\},\ ||x^{(i)}|| \le R$$

then, for the class of linear classifiers with margin at least $\gamma$,

$$VC(H) \le \Big\lceil\frac{R^2}{4\gamma^2}\Big\rceil + 1$$
Let $H$ be given and let $d = VC(H) \neq +\infty$. With probability at least $1-\delta$, we have that
$$\forall h \in H,\ |\hat\epsilon(h) - \epsilon(h)| \le O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$

Thus
$$\epsilon(\hat h) \le \Big(\min_{h\in H}\epsilon(h)\Big) + O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$

Moreover
For $\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$ to hold with probability at least $1-\delta$, it suffices that $m = O_{\gamma, \delta}(d)$.
Model Selection
Suppose we have a finite set of models $M = \{M_1, \dots, M_d\}$ that we are trying to select among. There are several techniques to deal with the bias-variance tradeoff automatically.
Cross Validation
In hold-out cross validation or simple cross validation, we
$$\begin{aligned} & \text{randomly split } S \text{ into } S_{\text{train}} \text{ and } S_{\text{cv}}\ (\text{say } 70\% \text{ to } 30\%)\\ & \text{for i in 1...d}\\ & \qquad \text{train } M_i \text{ on } S_\text{train}\\ & \qquad \text{test } M_i \text{ on } S_\text{cv} \text{ to get } \epsilon_i\\ & \text{choose } M = \arg\min_{M_i} \epsilon_i\\ & (\text{optional})\text{ retrain } M \text{ on } S \end{aligned}$$

A main disadvantage of this algorithm is the waste of data. To hold out less data each time, we can use k-fold cross validation.
$$\begin{aligned} & \text{randomly split } S \text{ into } k \text{ disjoint subsets } S_1,\dots, S_k\\ & \text{for i in 1...d}\\ & \qquad\text{for j in 1...k}\\ & \qquad\qquad\text{train } M_i \text{ on } S-S_j\\ & \qquad\qquad\text{test } M_i \text{ on } S_j \text{ to get } \epsilon_{ij}\\ & \qquad\epsilon_i = \frac{1}{k}\sum\limits_{j=1}^k\epsilon_{ij}\\ & \text{choose } M = M_{\arg\min\limits_{i} \epsilon_i}\\ & (\text{optional})\text{ retrain } M \text{ on } S \end{aligned}$$

A typical choice for $k$ is $10$. But for problems with really scarce data, we can use $k=m$, which leads to leave-one-out cross validation.
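The procedure above can be sketched in a few lines of numpy, here selecting a polynomial degree; the toy cubic data and the candidate degrees are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: cubic signal plus noise; candidate model M_i = degree-i polynomial.
m = 60
x = np.sort(rng.uniform(-1, 1, m))
y = x**3 - x + 0.1 * rng.normal(size=m)
degrees = [1, 2, 3, 8]

def kfold_error(degree, k=10):
    """epsilon_i: average squared validation error over k folds."""
    idx = rng.permutation(m)
    folds = np.array_split(idx, k)            # k disjoint subsets S_1..S_k
    errs = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[l] for l in range(k) if l != j])
        coef = np.polyfit(x[train], y[train], degree)   # train M_i on S - S_j
        pred = np.polyval(coef, x[val])                 # test M_i on S_j
        errs.append(np.mean((pred - y[val])**2))
    return np.mean(errs)

errors = {d: kfold_error(d) for d in degrees}
best = min(errors, key=errors.get)            # choose the model with min eps_i
print(errors, best)
```

The underfit degrees score clearly worse than the cubic, while the degree-8 model pays a (smaller) variance penalty.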
Feature Selection
Suppose you have a supervised learning problem where $n$ is very large, but you suspect that only a small number of features are relevant to the task. In such a setting, you can apply some heuristic algorithms to rank every feature and choose the top-$k$.
Wrapper Model Feature Selection
In a forward search, use $F$ to record the most relevant features:

$$\begin{aligned} & F := \emptyset\\ & \text{repeat }\{\\ & \qquad\text{for i in 1...n }\{\\ & \qquad\qquad\text{if }(i \in F)\ \text{continue}\\ & \qquad\qquad F_i := F \cup\{i\}\\ & \qquad\}\\ & \qquad\text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\ & \qquad F := \text{result}\\ & \} \end{aligned}$$

Similarly, in a backward search:
$$\begin{aligned} & F := \{1,\dots,n\}\\ & \text{repeat }\{\\ & \qquad\text{for i in 1...n }\{\\ & \qquad\qquad\text{if }(i \notin F)\ \text{continue}\\ & \qquad\qquad F_i := F -\{i\}\\ & \qquad\}\\ & \qquad\text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\ & \qquad F := \text{result}\\ & \} \end{aligned}$$

Wrapper feature selection algorithms often work quite well, but can be computationally expensive.
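A sketch of forward search, assuming a hypothetical least-squares model and a simple hold-out score standing in for full cross validation; the synthetic data (only features 0 and 2 matter) is made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: only features 0 and 2 (of n = 6) actually influence y.
m, n = 200, 6
X = rng.normal(size=(m, n))
y = 2 * X[:, 0] - 3 * X[:, 2] + 0.1 * rng.normal(size=m)
tr, cv = np.arange(140), np.arange(140, 200)   # simple hold-out split

def cv_error(feats):
    """Hold-out error of a least-squares model restricted to `feats`."""
    A = X[np.ix_(tr, feats)]
    w, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    pred = X[np.ix_(cv, feats)] @ w
    return np.mean((pred - y[cv])**2)

F, target = [], 2                     # grow F until it holds `target` features
while len(F) < target:
    candidates = [i for i in range(n) if i not in F]
    scores = {i: cv_error(F + [i]) for i in candidates}   # try each F ∪ {i}
    F.append(min(scores, key=scores.get))                 # keep the best
print(sorted(F))                      # recovers the truly relevant features
```

The search first picks feature 2 (the largest coefficient), then feature 0, illustrating how the greedy loop builds $F$ one feature at a time.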
Filter Feature Selection
Compute some simple score $S(i)$ to measure how informative each feature $x_i$ is. Then, simply pick the $k$ features with the largest scores $S(i)$.
It is common to choose mutual information as $S$:

$$S(i) = MI(x_i,y) = \sum\limits_{x_i} \sum\limits_{y} p(x_i,y)\log\frac{p(x_i,y)}{p(x_i)p(y)}$$

Note that
$$MI(x_i,y) = KL\big(p(x_i,y)\,||\,p(x_i)p(y)\big)$$
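The score is straightforward to compute from empirical counts; a sketch for binary features, on synthetic data invented for the example:

```python
import numpy as np

def mutual_information(x, y):
    """MI(x, y) for two discrete (here binary) arrays, from empirical counts."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:                       # 0 * log(0) contributes nothing
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 5000)
x_informative = np.where(rng.random(5000) < 0.9, y, 1 - y)  # agrees with y 90%
x_noise = rng.integers(0, 2, 5000)                          # independent of y
mi_good = mutual_information(x_informative, y)
mi_noise = mutual_information(x_noise, y)
print(mi_good, mi_noise)
```

The informative feature scores well above the independent one (whose MI is near zero), so ranking by $S(i)$ keeps the right features.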
Online Learning
So far, we have been considering batch learning in which we are first given a training set to learn. Online learning, on the contrary, requires the algorithm to make predictions continuously even while it is learning.
In this setting, the algorithm is first given $x^{(i)}$ and is asked to predict $h(x^{(i)})$. Then the true $y^{(i)}$ is revealed to the model, and this process repeats. The total online error is the total number of errors made by the algorithm during this process.
Assume that the class labels $y \in \{-1,1\}$. In the perceptron algorithm with parameters $\theta\in\mathbb{R}^{n+1}$,
$$h(x) = g(\theta^Tx)$$

where
$$g(z) = \left\{\begin{array}{ll} 1 & \text{if } z \ge 0\\ -1 & \text{otherwise} \end{array}\right.$$

Given a training example
$(x,y)$, the parameters are updated if $h(x) \neq y$:
$$\theta := \theta + yx$$

Suppose that
$$\begin{aligned} &\exists D,\ s.t.\ \forall i,\ ||x^{(i)}|| \le D\\ &\exists u\ (||u|| = 1),\ s.t.\ \forall i,\ y^{(i)}\cdot(u^Tx^{(i)}) \ge \gamma \end{aligned}$$

Block and Novikoff showed that the total number of mistakes the perceptron algorithm makes is at most $(D/\gamma)^2$.
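The online protocol and the mistake bound can be checked on a synthetic separable stream; the direction $u$ and the margin are arbitrary choices, and the intercept term is dropped for brevity:

```python
import numpy as np

rng = np.random.default_rng(5)

# Separable stream: u is the (unknown) unit vector realizing margin gamma.
u = np.array([0.6, 0.8])                    # ||u|| = 1
X = rng.uniform(-1, 1, size=(500, 2))
margin = 0.1
X = X[np.abs(X @ u) >= margin]              # keep examples with margin >= gamma
y = np.sign(X @ u)

theta = np.zeros(2)
mistakes = 0
for xi, yi in zip(X, y):                    # online: predict first, then learn
    pred = 1.0 if theta @ xi >= 0 else -1.0
    if pred != yi:
        theta += yi * xi                    # perceptron update on a mistake
        mistakes += 1

D = np.max(np.linalg.norm(X, axis=1))       # bound on ||x||
print(mistakes, (D / margin)**2)            # total online error vs (D/gamma)^2
```

However long the stream runs, the total number of mistakes stays below $(D/\gamma)^2$, while correct predictions leave $\theta$ untouched.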