ML Note 2 - Learning Theory

Bias / Variance Tradeoff

In the diagram below, we modeled the training set with 2, 3, and 7 parameters.

[figure: bias-variance tradeoff]

As we can see, $h_1$ is an underfit of the training set. No matter how large the training set grows, the model cannot capture the structure of the data. This model is said to have high bias.

On the contrary, $h_6$ is an overfit of the training set. The model is too sensitive to random factors that we don't want to include in our model. This model is said to have high variance.

ERM

For simplicity, consider binary classification with

  • labels $y \in \{0, 1\}$
  • a training set $S = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$
  • a new sample $(x, y)$

As one of the PAC (probably approximately correct) assumptions, assume that there exists some distribution $D$ such that
$$(x, y), (x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \overset{\text{i.i.d.}}{\sim} D$$
For a hypothesis $h$, define
$$Z = I\{h(x) \neq y\}, \qquad Z_i = I\{h(x^{(i)}) \neq y^{(i)}\}$$
Denote the training error, or empirical risk, as
$$\hat\epsilon(h) = \frac{1}{m} \sum_{i=1}^m Z_i$$
and the generalization error as
$$\epsilon(h) = P(Z = 1)$$
Then clearly
$$Z, Z_1, Z_2, \dots, Z_m \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\epsilon(h))$$
Think of the problem as picking $h$ from a hypothesis class, for instance
$$H = \{h_\theta \mid \theta \in \mathbb{R}^{n+1}\}$$
Our goal is empirical risk minimization:
$$\hat h = \arg\min_{h \in H} \hat\epsilon(h)$$
Define the theoretically best hypothesis in $H$ as
$$h^* = \arg\min_{h \in H} \epsilon(h)$$
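To make ERM concrete, here is a minimal sketch (the toy data and the threshold-classifier hypothesis class are my own illustration, not from the notes): it simply scans a small finite $H$ for the hypothesis with the lowest training error.

```python
# ERM over a small finite hypothesis class: each hypothesis is a
# threshold classifier h_t(x) = 1{x >= t}.

def empirical_risk(h, S):
    """Training error: fraction of examples the hypothesis misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Return the hypothesis in H with minimal empirical risk."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Toy training set of (x, y) pairs with y in {0, 1}
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]

h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # threshold 0.5 separates the data: 0.0
```

Note the `t=t` default-argument trick, which freezes each threshold at lambda-creation time.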

Uniform Convergence

Suppose $H = \{h_1, \dots, h_k\}$ is a finite set. For any $h_i \in H$, denote by $A_i$ the event
$$A_i = \{|\epsilon(h_i) - \hat\epsilon(h_i)| > \gamma\}$$
The Hoeffding inequality gives
$$P(A_i) \le 2\exp(-2\gamma^2 m)$$
Using the union bound, we have
$$\begin{aligned} P(\exists h_i \in H,\ |\epsilon(h_i) - \hat\epsilon(h_i)| > \gamma) &= P\Big(\bigcup_{i=1}^k A_i\Big)\\ &\le \sum_{i=1}^k P(A_i)\\ &\le 2k\exp(-2\gamma^2 m) \end{aligned}$$
Therefore
$$P(\forall h \in H,\ |\epsilon(h) - \hat\epsilon(h)| \le \gamma) \ge 1 - 2k\exp(-2\gamma^2 m)$$

Error Bound

Given $m$ and $\delta > 0$, with probability at least $1 - \delta$ we have that
$$\forall h \in H,\ |\epsilon(h) - \hat\epsilon(h)| \le \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$
Write $\gamma = \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$. Since
$$\hat\epsilon(\hat h) \le \hat\epsilon(h^*)$$
we have that
$$\epsilon(\hat h) \le \hat\epsilon(\hat h) + \gamma \le \hat\epsilon(h^*) + \gamma \le \epsilon(h^*) + 2\gamma = \Big(\min_{h \in H}\epsilon(h)\Big) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$
When we expand $H$ to some superset $H' \supset H$, the first term $\min_{h \in H}\epsilon(h)$ can only decrease, while the second term $2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$ can only increase (since $k$ grows). This loosely corresponds to the bias-variance tradeoff.

Sample Complexity Bound

Given $\gamma$ and $\delta > 0$, in order for
$$\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$$
to be true with probability at least $1 - \delta$, it suffices that
$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big)$$
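Plugging hypothetical numbers into this bound ($k = 10^6$ hypotheses, $\gamma = 0.05$, $\delta = 0.01$ are arbitrary choices for illustration) shows how mild the logarithmic dependence on $k$ is:

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest integer m satisfying m >= (1 / (2*gamma^2)) * log(2k / delta)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# Growing the hypothesis class from 10^6 to 10^12 hypotheses only
# roughly doubles the log term, so m grows very slowly in k.
print(sample_complexity(10 ** 6, 0.05, 0.01))   # a few thousand samples
print(sample_complexity(10 ** 12, 0.05, 0.01))  # still the same order
```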

VC Dimension

Given a set $S = \{x^{(1)}, \dots, x^{(d)}\}$, we say that $H$ shatters $S$ if
$$\forall (y^{(1)}, \dots, y^{(d)}) \in \{0, 1\}^d,\ \exists h \in H\ \text{s.t.}\ \forall i \in \{1, \dots, d\},\ h(x^{(i)}) = y^{(i)}$$
Define the Vapnik-Chervonenkis dimension $VC(H)$ to be the size of the largest set that is shattered by $H$. It can be shown that if $H$ contains all linear classifiers in $n$ dimensions, then $VC(H) = n + 1$.

For SVMs with kernels, the VC dimension is usually small. If
$$\exists R\ \text{s.t.}\ \forall i \in \{1, \dots, m\},\ \|x^{(i)}\| \le R$$
then the class of linear classifiers with margin at least $\gamma$ satisfies
$$VC(H) \le \Big\lceil \frac{R^2}{4\gamma^2} \Big\rceil + 1$$


Let $H$ be given and let $d = VC(H) < +\infty$. With probability at least $1 - \delta$, we have
$$\forall h \in H,\ |\hat\epsilon(h) - \epsilon(h)| \le O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$
Thus
$$\epsilon(\hat h) \le \Big(\min_{h \in H}\epsilon(h)\Big) + O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$
Moreover, for $\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that $m = O_{\gamma,\delta}(d)$.

Model Selection

Suppose we have a finite set of models $M = \{M_1, \dots, M_d\}$ that we are trying to select among. There are several techniques to automatically manage the bias-variance tradeoff.

Cross Validation

In hold-out cross validation, or simple cross validation, we

$$\begin{aligned}
& \text{randomly split } S \text{ into } S_{\text{train}} \text{ and } S_{\text{cv}}\ (\text{say } 70\% \text{ to } 30\%)\\
& \text{for } i \text{ in } 1 \dots d\\
& \qquad \text{train } M_i \text{ on } S_{\text{train}}\\
& \qquad \text{test } M_i \text{ on } S_{\text{cv}} \text{ to get } \epsilon_i\\
& \text{choose } M = \arg\min_{M_i} \epsilon_i\\
& (\text{optional})\ \text{retrain } M \text{ on } S
\end{aligned}$$

A main disadvantage of this algorithm is that it wastes data. To hold out less data each time, we can use k-fold cross validation.

$$\begin{aligned}
& \text{randomly split } S \text{ into } k \text{ disjoint subsets } S_1, \dots, S_k\\
& \text{for } i \text{ in } 1 \dots d\\
& \qquad \text{for } j \text{ in } 1 \dots k\\
& \qquad\qquad \text{train } M_i \text{ on } S \setminus S_j\\
& \qquad\qquad \text{test } M_i \text{ on } S_j \text{ to get } \epsilon_{ij}\\
& \qquad \epsilon_i = \frac{1}{k} \sum_{j=1}^k \epsilon_{ij}\\
& \text{choose } M = M_{\arg\min_i \epsilon_i}\\
& (\text{optional})\ \text{retrain } M \text{ on } S
\end{aligned}$$

A typical choice is $k = 10$. For problems with really scarce data, we can use $k = m$, which leads to leave-one-out cross validation.
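The k-fold procedure can be sketched in a few lines of Python. The `fit`/`error` callbacks and the toy threshold "models" below are hypothetical stand-ins for real training and evaluation:

```python
import random

def k_fold_cv(S, models, k, fit, error):
    """Return the model with the lowest average k-fold CV error.
    `fit(M, data)` trains model M; `error(h, data)` is the error of a
    trained hypothesis h on held-out data."""
    S = S[:]
    random.Random(0).shuffle(S)          # fixed seed so runs are repeatable
    folds = [S[j::k] for j in range(k)]  # k disjoint subsets S_1..S_k

    def cv_error(M):
        total = 0.0
        for j in range(k):
            train = [ex for i, fold in enumerate(folds) if i != j for ex in fold]
            total += error(fit(M, train), folds[j])
        return total / k

    return min(models, key=cv_error)

# Toy usage: "models" are fixed thresholds, so fitting is a no-op.
S = [(x / 10, int(x >= 5)) for x in range(10)]
fit = lambda t, data: t
error = lambda t, data: sum(1 for x, y in data if int(x >= t) != y) / len(data)
print(k_fold_cv(S, [0.2, 0.5, 0.8], k=5, fit=fit, error=error))  # -> 0.5
```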

Feature Selection

Suppose you have a supervised learning problem where $n$ is very large, but you suspect that only a small number of features are relevant to the task. In such a setting, you can apply some heuristic algorithms to rank every feature and choose the top-$k$.

Wrapper Model Feature Selection

In a forward search, use $F$ to record the currently selected features:

$$\begin{aligned}
& F := \emptyset\\
& \text{repeat } \{\\
& \qquad \text{for } i \text{ in } 1 \dots n\ \{\\
& \qquad\qquad \text{if } (i \in F)\ \text{continue}\\
& \qquad\qquad F_i := F \cup \{i\}\\
& \qquad \}\\
& \qquad \text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\
& \qquad F := \text{result}\\
& \}
\end{aligned}$$

Similarly, backward search:

$$\begin{aligned}
& F := \{1, \dots, n\}\\
& \text{repeat } \{\\
& \qquad \text{for } i \text{ in } 1 \dots n\ \{\\
& \qquad\qquad \text{if } (i \notin F)\ \text{continue}\\
& \qquad\qquad F_i := F \setminus \{i\}\\
& \qquad \}\\
& \qquad \text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\
& \qquad F := \text{result}\\
& \}
\end{aligned}$$

Wrapper feature selection algorithms often work quite well, but can be computationally expensive.
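A runnable sketch of forward search follows. The `cv_error` callback is a hypothetical stand-in for the cross-validation step, and the stopping rule (stop once no single added feature improves the score) is one common choice:

```python
def forward_search(n, cv_error):
    """Greedy forward feature selection over features {0, ..., n-1}.
    `cv_error(F)` returns the cross-validation error of the model
    restricted to the feature subset F."""
    F = set()
    best_err = cv_error(F)
    while True:
        candidates = [F | {i} for i in range(n) if i not in F]
        if not candidates:
            return F
        best = min(candidates, key=cv_error)
        err = cv_error(best)
        if err >= best_err:  # no single feature improves the CV error: stop
            return F
        F, best_err = best, err

# Toy usage: only features 0 and 2 are relevant to the label.
relevant = {0, 2}
err = lambda F: len(relevant - F) / len(relevant)
print(sorted(forward_search(4, err)))  # -> [0, 2]
```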

Filter Feature Selection

Compute some simple score $S(i)$ that measures how informative each feature $x_i$ is. Then, simply pick the $k$ features with the largest scores $S(i)$.

It is common to choose mutual information as $S$:
$$S(i) = MI(x_i, y) = \sum_{x_i} \sum_{y} p(x_i, y) \log\frac{p(x_i, y)}{p(x_i)\,p(y)}$$
Note that
$$MI(x_i, y) = KL\big(p(x_i, y)\,\|\,p(x_i)\,p(y)\big)$$
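The MI score is straightforward to compute for discrete features from empirical frequencies; here is a small sketch with made-up toy columns:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(x, y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) p(y))),
    with all probabilities estimated as empirical frequencies."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )

# A feature identical to the label has maximal score;
# an independent feature scores zero.
ys    = [0, 0, 1, 1]
copy  = [0, 0, 1, 1]
indep = [0, 1, 0, 1]
print(mutual_information(copy, ys))   # log(2) ≈ 0.693
print(mutual_information(indep, ys))  # 0.0
```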

Online Learning

So far, we have been considering batch learning, in which we are first given a full training set to learn from. Online learning, on the contrary, requires the algorithm to make predictions continuously even while it is learning.

In this setting, the algorithm is first given $x^{(i)}$ and is asked to predict $h(x^{(i)})$. Then the true label $y^{(i)}$ is revealed to the model, and the process repeats. The total online error is the total number of mistakes the algorithm makes during this process.

Assume that the class labels $y \in \{-1, 1\}$. In the perceptron algorithm with parameters $\theta \in \mathbb{R}^{n+1}$,
$$h(x) = g(\theta^T x)$$
where
$$g(z) = \begin{cases} 1 & \text{if } z \ge 0\\ -1 & \text{otherwise} \end{cases}$$
Given a training example $(x, y)$, the parameters are updated only if $h(x) \neq y$:
$$\theta := \theta + yx$$
Suppose that
$$\begin{aligned} &\exists D\ \text{s.t.}\ \forall i,\ \|x^{(i)}\| \le D\\ &\exists u\ (\|u\| = 1)\ \text{s.t.}\ \forall i,\ y^{(i)} \cdot (u^T x^{(i)}) \ge \gamma \end{aligned}$$
Block and Novikoff showed that the total number of mistakes the perceptron algorithm makes is at most $(D/\gamma)^2$.
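The update rule fits in a few lines. The linearly separable toy stream below is hypothetical, chosen so the mistake count is easy to check against the $(D/\gamma)^2$ bound:

```python
def perceptron_online(stream):
    """Run the perceptron on a stream of (x, y) pairs with y in {-1, +1},
    where x includes a leading 1 for the intercept term.
    Returns (theta, total number of mistakes)."""
    theta = None
    mistakes = 0
    for x, y in stream:
        if theta is None:
            theta = [0.0] * len(x)
        # g(theta^T x): predict +1 when the dot product is >= 0
        pred = 1 if sum(t * xi for t, xi in zip(theta, x)) >= 0 else -1
        if pred != y:
            mistakes += 1
            theta = [t + y * xi for t, xi in zip(theta, x)]  # theta := theta + y x
    return theta, mistakes

# Separable stream: y = sign(x1 - x2), each x = (1, x1, x2).
stream = [((1, 2, 0), 1), ((1, 0, 2), -1), ((1, 3, 1), 1), ((1, 1, 3), -1)] * 5
theta, mistakes = perceptron_online(stream)
print(mistakes)  # -> 3, within the (D/gamma)^2 = 11/2 bound for this stream
```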
