Machine Learning Foundations Notes (2): Learning to Answer Yes/No

Lecture 2: Learning to Answer Yes/No

Perceptron Hypothesis Set

For $x = (x_1, x_2, \cdots, x_d)$, the 'features of sample', compute a weighted 'score' ($\sum_{i=1}^{d} w_i x_i$),
and approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$, deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$, and ignore the equals case.
$\mathcal{Y}: \{+1, -1\}$:
$$h(x) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right) \overset{w_0 = -\text{threshold},\ x_0 = +1}{=} \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + w_0 x_0\right) = \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right) = \operatorname{sign}(w^T x)$$
called ‘perceptron’ hypothesis historically
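
To make the hypothesis concrete, here is a minimal sketch in Python; the weight values, the sample, and the tie-breaking choice at score 0 are my own illustrative assumptions, with the bias trick ($x_0 = +1$, $w_0 = -\text{threshold}$) folded into $w$:

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x); x[0] == 1 is the bias coordinate."""
    return 1 if np.dot(w, x) > 0 else -1  # arbitrarily map the ignored score-0 case to -1

# Made-up example: a threshold of 0.5 becomes w_0 = -0.5 under the bias trick.
w = np.array([-0.5, 0.3, 0.7])  # [w_0, w_1, w_2]
x = np.array([1.0, 0.4, 0.9])   # [x_0 = 1, feature 1, feature 2]
print(perceptron_h(w, x))       # score = -0.5 + 0.12 + 0.63 = 0.25 > 0, so +1
```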

Perceptrons in $R^2$

(Figure: perceptrons in 2D)

perceptrons ⇔ linear (binary) classifiers


Fun Time

Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and an output of +1 indicates spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal   ✓
3. machine, learning, statistics, textbook
4. national, Taiwan, university, coursera

Explanation
Spam messages often contain words such as "free", "discount", "good news", and so on.

Perceptron Learning Algorithm (PLA)

The PLA algorithm searches the perceptron hypothesis set for a suitable perceptron. Since a perceptron is determined by $w$, this amounts to finding a suitable $w$ that separates the samples perfectly.

Cyclic PLA

The Cyclic PLA algorithm is as follows:
(1) Initialize $w$ as $w_0$
(2) Update $w$:
  For t = 0, 1, …
    find a mistake of $w_t$, called $(x_{n(t)}, y_{n(t)})$:
       $\operatorname{sign}(w_t^T x_{n(t)}) \ne y_{n(t)}$

    correct the mistake by
       $w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$

  … until no more mistakes
  return last $w$ (called $w_{PLA}$) as $g$

The essence of PLA: find a point that the current perceptron misclassifies, then use that point to correct the perceptron, repeating until all sample points are classified correctly (a runnable sketch follows the illustration below).
Correction rule: $w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$
Illustration of the correction:
(Figure: PLA weight update)
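
A minimal runnable sketch of cyclic PLA, assuming NumPy and a made-up function name `pla`; inputs are samples with the bias coordinate $x_0 = 1$ already prepended:

```python
import numpy as np

def pla(X, y, max_updates=10_000):
    """Cyclic PLA: scan samples in order, fix the first mistake found,
    and stop once a full pass over the data makes no mistakes.
    X: (N, d+1) array with X[:, 0] == 1; y: (N,) array of +1/-1 labels."""
    w = np.zeros(X.shape[1])                 # w_0 initialized to the zero vector
    for _ in range(max_updates):
        for x_n, y_n in zip(X, y):
            if np.sign(w @ x_n) != y_n:      # found a mistake of w_t (score 0 counts too)
                w = w + y_n * x_n            # correct it: w_{t+1} <- w_t + y_n x_n
                break
        else:
            return w                         # a clean pass: no more mistakes
    raise RuntimeError("no convergence: D may not be linearly separable")
```

On linearly separable data this halts (the guarantee proved below); on non-separable data it would cycle forever, hence the `max_updates` guard.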

Fun Time

Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
$\operatorname{sign}(w_t^T x_n) \ne y_n, \quad w_{t+1} \leftarrow w_t + y_n x_n$
1. $w_{t+1}^T x_n = y_n$
2. $\operatorname{sign}(w_{t+1}^T x_n) = y_n$
3. $y_n w_{t+1}^T x_n \ge y_n w_t^T x_n$   ✓
4. $y_n w_{t+1}^T x_n < y_n w_t^T x_n$

Explanation
After updating with $(x_n, y_n)$, how does $w_t$ change to $w_{t+1}$?
Since the perceptron's correction is only relative ($\Delta w = y_n x_n$), the update does not necessarily classify $(x_n, y_n)$ correctly afterwards.

$$y_n w_{t+1}^T x_n - y_n w_t^T x_n = y_n (w_{t+1} - w_t)^T x_n = y_n (y_n x_n)^T x_n \overset{y_n^2 = +1}{=} x_n^T x_n \ge 0$$
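
A made-up numeric check of this inequality: take $w_t = (-3, 0)$, $x_n = (1, 1)$, $y_n = +1$. Then $w_t^T x_n = -3$ is a mistake, and the update gives $w_{t+1} = (-2, 1)$, so
$$y_n w_{t+1}^T x_n = -1 > -3 = y_n w_t^T x_n,$$
an increase of exactly $x_n^T x_n = 2$. Note that $(x_n, y_n)$ is still misclassified after the update, which is why option 2 is not guaranteed.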


Guarantee of PLA


Linear Separability

If PLA halts (i.e. makes no more mistakes), then, as a necessary condition, $D$ allows some $w$ to make no mistakes; call such a $D$ linearly separable.

linearly separable $D$ ⇔ exists a perfect $w_f$ such that $\operatorname{sign}(w_f^T x_n) = y_n$ for all $n$
(Figure: linear separability)
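
As a classic illustration of the non-separable case (my own addition, not from the lecture): the XOR data set $\{((0,0),-1), ((0,1),+1), ((1,0),+1), ((1,1),-1)\}$ admits no perfect $w = (w_0, w_1, w_2)$. The four constraints require $w_0 < 0$, $w_0 + w_2 > 0$, $w_0 + w_1 > 0$, and $w_0 + w_1 + w_2 < 0$; adding the middle two gives $2w_0 + w_1 + w_2 > 0$, while adding the first and last gives $2w_0 + w_1 + w_2 < 0$, a contradiction.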

Correcting the Perceptron

In the Fun Time above we derived that each correction satisfies $y_n w_{t+1}^T x_n \ge y_n w_t^T x_n$, suggesting that $w_t$ moves toward $w_f$. Let us compute $w_f^T w_{t+1}$ to confirm:
$$\because w_f \text{ is perfect for every } x_n, \quad \therefore y_{n(t)} w_f^T x_{n(t)} \ge \min_n y_n w_f^T x_n > 0$$
so $w_f^T w_t$ increases when updating with any $(x_{n(t)}, y_{n(t)})$:
$$\begin{aligned} w_f^T w_{t+1} &= w_f^T\left(w_t + y_{n(t)} x_{n(t)}\right) \\ &\ge w_f^T w_t + \min_n y_n w_f^T x_n \\ &> w_f^T w_t + 0 \end{aligned}$$

After each correction, $w_f^T w_{t+1} > w_f^T w_t$: $w_t$ is getting closer to $w_f$.

However, the closeness we want is a shrinking angle between $w_t$ and $w_f$, not just a growing inner product, which could also come from $\|w_t\|$ blowing up. So compute the change in $\|w_t\|^2$:

$$\begin{aligned} \|w_{t+1}\|^2 &= \|w_t + y_{n(t)} x_{n(t)}\|^2 \\ &= \|w_t\|^2 + 2 y_{n(t)} w_t^T x_{n(t)} + \|y_{n(t)} x_{n(t)}\|^2 \end{aligned}$$
Since updates happen only on mistakes, $\operatorname{sign}(w_t^T x_{n(t)}) \ne y_{n(t)} \Leftrightarrow y_{n(t)} w_t^T x_{n(t)} \le 0$, hence
$$\|w_{t+1}\|^2 \le \|w_t\|^2 + 0 + \|y_{n(t)} x_{n(t)}\|^2 \le \|w_t\|^2 + \max_n \|y_n x_n\|^2 = \|w_t\|^2 + \max_n \|x_n\|^2$$
So each update grows $\|w_t\|^2$ by at most $\max\limits_n \|x_n\|^2$.

In fact, with $w_0 = 0$, after $T$ corrections
$$\cos\theta = \frac{w_f^T}{\|w_f\|} \frac{w_T}{\|w_T\|} \ge \sqrt{T} \cdot \text{constant}$$
Proof:
$$R^2 = \max_n \|x_n\|^2 \qquad \rho = \frac{\min\limits_n y_n w_f^T x_n}{\|w_f\|}$$
Since $w_0 = 0$, iterating the two per-update bounds above $T$ times gives
$$w_f^T w_T \ge T \min_n y_n w_f^T x_n \qquad \|w_T\|^2 \le T \max_n \|x_n\|^2$$
Therefore
$$\cos\theta = \frac{w_f^T}{\|w_f\|} \frac{w_T}{\|w_T\|} = \frac{1}{\|w_f\|} \frac{w_f^T w_T}{\|w_T\|} \ge \frac{1}{\|w_f\|} \frac{T \min\limits_n y_n w_f^T x_n}{\sqrt{T} \sqrt{\max\limits_n \|x_n\|^2}} = \sqrt{T} \frac{\rho}{R}$$
so $\cos\theta \ge \sqrt{T} \cdot \text{constant}$, where $\text{constant} = \frac{\rho}{R}$.


Fun Time

Let's upper-bound T, the number of mistakes that PLA 'corrects'.
$$\text{Define } R^2 = \max_n \|x_n\|^2, \quad \rho = \frac{\min\limits_n y_n w_f^T x_n}{\|w_f\|}$$

We want to show that $T \le \square$. Express the upper bound $\square$ by the two terms above.
1. $\frac{R}{\rho}$
2. $\frac{R^2}{\rho^2}$   ✓
3. $\frac{R}{\rho^2}$
4. $\frac{\rho^2}{R^2}$


Explanation
$$\sqrt{T} \frac{\rho}{R} \le \frac{w_f^T}{\|w_f\|} \frac{w_T}{\|w_T\|} = \cos\theta \le 1 \quad \therefore T \le \frac{R^2}{\rho^2}$$
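
To make the bound concrete, here is a small sketch (my own toy example, not from the lecture) that computes $R^2/\rho^2$ for a made-up separable data set with a known perfect $w_f$, using the same conventions as the `pla` sketch earlier:

```python
import numpy as np

# Made-up separable data: bias coordinate x0 = 1 prepended, labeled by a chosen w_f.
w_f = np.array([-0.5, 1.0, 1.0])
X = np.array([[1, a, b] for a in (0, 1) for b in (0, 1)], dtype=float)
y = np.sign(X @ w_f)                               # all scores are nonzero here

R2 = max(x @ x for x in X)                         # R^2 = max_n ||x_n||^2 = 3
rho = min(y * (X @ w_f)) / np.linalg.norm(w_f)     # rho = min_n y_n w_f^T x_n / ||w_f|| = 1/3
print("PLA update bound R^2 / rho^2 =", R2 / rho**2)   # 27.0
```

The bound is a worst case; in practice PLA usually needs far fewer updates on data like this.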


Non-Separable Data

More about PLA

Guarantee:
If the data is linearly separable, there exists a perceptron that separates it perfectly.

Process:
Update the perceptron's parameters through one correction after another.

Result:
Perfect separation is eventually reached, but we do not know in advance how many updates it will take.
Learning from real data

Pocket PLA

When the data contains noise and is not linearly separable, we instead want a perceptron that classifies the data roughly as well as possible.
Pocket PLA: modify the PLA algorithm (black lines) by keeping the best weights in a pocket.
(1) Initialize $w$ as $w_0$ and the pocket weights $\hat{w}$
(2) Update $w$:
  For t = 0, 1, …
    find a (random) mistake of $w_t$, called $(x_{n(t)}, y_{n(t)})$:
       $\operatorname{sign}(w_t^T x_{n(t)}) \ne y_{n(t)}$

    (try to) correct the mistake by
       $w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$

    if $w_{t+1}$ makes fewer mistakes than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$

  … until enough iterations
  return $\hat{w}$ (called $w_{Pocket}$) as $g$

The Pocket PLA algorithm:
  Update $w_t$ exactly as in PLA, but keep a pocket ($\hat{w}$) storing the perceptron that makes the fewest mistakes so far.
  After a fixed amount of time or number of updates, return the current $\hat{w}$ (a runnable sketch follows).
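
A minimal runnable sketch of Pocket PLA under the same assumptions as the `pla` sketch above (NumPy, bias coordinate prepended, made-up function name):

```python
import numpy as np

def pocket_pla(X, y, max_updates=1000, seed=0):
    """Pocket PLA: run PLA-style updates on random mistakes, but return the
    pocket weights w_hat that made the fewest mistakes over the whole data set."""
    rng = np.random.default_rng(seed)

    def num_mistakes(w):
        return int(np.sum(np.sign(X @ w) != y))   # count errors of w on all of D

    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), num_mistakes(w)
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:                       # perfectly separated: done early
            return w
        n = rng.choice(wrong)                     # pick a random mistake
        w = w + y[n] * X[n]                       # PLA-style correction
        if (m := num_mistakes(w)) < best:         # keep the best weights in pocket
            w_hat, best = w.copy(), m
    return w_hat
```

The per-round `num_mistakes` scan over all of $D$ is exactly why pocket runs slower than plain PLA on separable data, as the Fun Time below discusses.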


Fun Time

Should we use pocket or PLA?
Since we do not know whether D is linearly separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linearly separable, what's the difference between the two?
1 pocket on D is slower than PLA   ✓
2 pocket on D is faster than PLA
3 pocket on D returns a better g in approximating f than PLA
4 pocket on D returns a worse g in approximating f than PLA

Explanation
If the data is linearly separable, PLA and Pocket PLA both end up with a perceptron that classifies the data perfectly (no mistakes).
But Pocket PLA is slower: the main cost is that each round it must count the mistakes of the new weights over the entire data set, and comparing against (and replacing) the current best perceptron takes additional time.

Summary

Lecture Summary

Perceptron Hypothesis Set
  The perceptron hypothesis set consists of hyperplanes in a multi-dimensional space.

Perceptron Learning Algorithm (PLA)
   $\operatorname{sign}(w_t^T x_{n(t)}) \ne y_{n(t)}, \quad w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$
   $g \leftarrow w_T$

Guarantee of PLA
  If the data set is linearly separable, the PLA algorithm finds a perceptron that classifies it perfectly.

Non-Separable Data
  If the data set is not linearly separable, the Pocket PLA algorithm finds a reasonably good perceptron.
  Pocket PLA updates $w_t$ exactly as PLA does, but keeps a pocket ($\hat{w}$) storing the perceptron with the fewest mistakes so far.
   $g \leftarrow \hat{w}$

PLA $\mathcal{A}$ takes a linearly separable $D$ and perceptrons $\mathcal{H}$ to get hypothesis $g$.

References

Machine Learning Foundations (机器学习基石), Hsuan-Tien Lin (林轩田)
