Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
For $x = (x_1, x_2, \cdots, x_d)$, the 'features of sample', compute a weighted 'score' $\sum_{i=1}^{d} w_i x_i$:
approve credit if the score $> threshold$, deny credit if the score $< threshold$, and ignore the equals case.
With $Y: \{+1, -1\}$:
$$h(x) = \mathrm{sign}\Big(\Big(\sum_{i=1}^{d} w_i x_i\Big) - threshold\Big) \xlongequal{w_0 = -threshold,\; x_0 = +1} \mathrm{sign}\Big(\Big(\sum_{i=1}^{d} w_i x_i\Big) + w_0 x_0\Big) = \mathrm{sign}\Big(\sum_{i=0}^{d} w_i x_i\Big) = \mathrm{sign}(w^T x)$$
called ‘perceptron’ hypothesis historically
Perceptrons in $\mathbb{R}^2$
perceptrons ⇔ linear (binary) classifiers
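As a concrete illustration (a minimal Python sketch, not part of the lecture): once the threshold is absorbed as $w_0 = -threshold$ with $x_0 = +1$, the hypothesis is just the sign of an inner product.

```python
import numpy as np

def perceptron_predict(w, x):
    """h(x) = sign(w^T x); x already has x_0 = 1 prepended.
    Scores of exactly 0 are treated as -1 ('ignore the equals')."""
    return 1 if np.dot(w, x) > 0 else -1

# w = (w_0, w_1, w_2) with w_0 = -threshold; here threshold = 2
w = np.array([-2.0, 1.0, 1.0])
print(perceptron_predict(w, np.array([1.0, 3.0, 1.0])))  # score 2 > 0 -> +1
print(perceptron_predict(w, np.array([1.0, 0.5, 0.5])))  # score -1 < 0 -> -1
```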
Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates a spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal ✓
3. machine, learning, statistics, textbook
4. national, Taiwan, university, coursera
Explanation
Spam messages often contain words like "free", "discount", and "good news".
Perceptron Learning Algorithm (PLA)
PLA searches the perceptron hypothesis set for a suitable perceptron. Since a perceptron is determined by $w$, this amounts to finding a $w$ that separates the samples perfectly.
Cyclic PLA
The Cyclic PLA algorithm is as follows:
(1) initialize $w \leftarrow w_0$
(2) update $w$:
For t = 0, 1, ...
find a mistake of $w_t$, called $(x_{n(t)}, y_{n(t)})$: $\mathrm{sign}(w_t^T x_{n(t)}) \ne y_{n(t)}$
correct the mistake by $w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$
... until no more mistakes
return last $w$ (called $w_{PLA}$) as $g$
The essence of PLA is to find a point the perceptron misclassifies, then use that point to correct the perceptron, repeating until every sample is classified correctly.
Correction rule:
$$w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$$
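The cyclic scan-and-correct loop above can be sketched in Python (an illustrative sketch; the function and variable names are my own):

```python
import numpy as np

def cyclic_pla(X, y):
    """Cyclic PLA: scan samples in order, correct every mistake found,
    and stop after a full pass with no mistakes. Assumes X already has
    x_0 = 1 prepended and that the data are linearly separable."""
    w = np.zeros(X.shape[1])               # w <- w_0 = 0
    while True:
        mistake = False
        for x_n, y_n in zip(X, y):
            if np.sign(w @ x_n) != y_n:    # sign(w_t^T x_n) != y_n
                w = w + y_n * x_n          # w_{t+1} <- w_t + y_n x_n
                mistake = True
        if not mistake:                    # a clean pass: no more mistakes
            return w                       # w_PLA

# toy linearly separable data in R^2 (x_0 = 1 prepended)
X = np.array([[1., 2., 2.], [1., 3., 3.], [1., -2., -1.], [1., -1., -3.]])
y = np.array([1, 1, -1, -1])
w = cyclic_pla(X, y)
print(all(np.sign(w @ x_n) == y_n for x_n, y_n in zip(X, y)))  # True
```

Note that `np.sign` returns 0 on a score of exactly 0, which never equals $y_n \in \{+1, -1\}$, so a zero score counts as a mistake, consistent with "ignore the equals" above.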
Correction illustrated (figure omitted).
Fun Time
Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
$$\mathrm{sign}(w_t^T x_n) \ne y_n, \quad w_{t+1} \leftarrow w_t + y_n x_n$$
1. $w_{t+1}^T x_n = y_n$
2. $\mathrm{sign}(w_{t+1}^T x_n) = y_n$
3. $y_n w_{t+1}^T x_n \ge y_n w_t^T x_n$ ✓
4. $y_n w_{t+1}^T x_n < y_n w_t^T x_n$
Explanation
Consider how $w_t$ changes to $w_{t+1}$ after updating with $(x_n, y_n)$.
Because the perceptron update is a relative correction ($\Delta w = y_n x_n$), the update does not necessarily classify $(x_n, y_n)$ correctly afterwards. But the score does move in the right direction:
$$y_n w_{t+1}^T x_n - y_n w_t^T x_n = y_n (w_{t+1} - w_t)^T x_n = y_n (y_n x_n)^T x_n \xlongequal{y_n^2 = +1} x_n^T x_n \ge 0$$
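This inequality can also be checked numerically on an arbitrary forced mistake (a quick sanity check, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
w_t = rng.normal(size=3)
x_n = rng.normal(size=3)
y_n = -np.sign(w_t @ x_n)       # force (x_n, y_n) to be a mistake for w_t

w_next = w_t + y_n * x_n        # the PLA update
before = y_n * (w_t @ x_n)      # <= 0, since it is a mistake
after = y_n * (w_next @ x_n)    # = before + x_n^T x_n
print(after >= before)           # True: the score never decreases
```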
Guarantee of PLA
Linear Separability
If PLA halts (i.e. no more mistakes), then (necessary condition) $D$ allows some $w$ to make no mistakes; call such a $D$ linearly separable.
linearly separable $D$ ⇔ there exists a perfect $w_f$ such that $\mathrm{sign}(w_f^T x_n) = y_n$ for all $n$
Perceptron corrections
In the Fun Time above we derived that each correction satisfies $y_n w_{t+1}^T x_n \ge y_n w_t^T x_n$, so $w_t$ moves toward $w_f$.
To confirm, we compute $w_f^T w_{t+1}$:
$$\because w_f \text{ is perfect for every } x_n, \quad \therefore y_{n(t)} w_f^T x_{n(t)} \ge \min_n y_n w_f^T x_n > 0$$
$w_f^T w_t \uparrow$ by updating with any $(x_{n(t)}, y_{n(t)})$:
$$\begin{aligned} w_f^T w_{t+1} &= w_f^T \left(w_t + y_{n(t)} x_{n(t)}\right) \\ &\ge w_f^T w_t + \min_n y_n w_f^T x_n \\ &> w_f^T w_t + 0 \end{aligned}$$
Each correction gives $w_f^T w_{t+1} > w_f^T w_t$: $w_t$ is approaching $w_f$. But what we want is for the angle between them to shrink, not merely for the inner product to grow (it could grow simply because $\|w_t\|$ grows). So compute $\|w_{t+1}\|^2$:
$$\begin{aligned} \left\|w_{t+1}\right\|^2 &= \left\|w_t + y_{n(t)} x_{n(t)}\right\|^2 \\ &= \left\|w_t\right\|^2 + 2 y_{n(t)} w_t^T x_{n(t)} + \left\|y_{n(t)} x_{n(t)}\right\|^2 \end{aligned}$$
And since $\mathrm{sign}(w_t^T x_{n(t)}) \ne y_{n(t)} \Leftrightarrow y_{n(t)} w_t^T x_{n(t)} \le 0$:
$$\left\|w_{t+1}\right\|^2 \le \left\|w_t\right\|^2 + 0 + \left\|y_{n(t)} x_{n(t)}\right\|^2 \le \left\|w_t\right\|^2 + \max_n \left\|y_n x_n\right\|^2 = \left\|w_t\right\|^2 + \max_n \left\|x_n\right\|^2$$
So each update grows $\|w_t\|^2$ by at most $\max\limits_n \left\|x_n\right\|^2$.
In fact, starting from $w_0 = 0$, after $T$ corrections:
$$\cos\theta = \frac{w_f^T}{\left\|w_f\right\|} \frac{w_T}{\left\|w_T\right\|} \ge \sqrt{T} \cdot constant$$
Proof: define
$$R^2 = \max_n \|x_n\|^2, \qquad \rho = \frac{\min\limits_n y_n w_f^T x_n}{\|w_f\|}$$
Since $w_0 = 0$, the two bounds above give $w_f^T w_T \ge T \min\limits_n y_n w_f^T x_n$ and $\left\|w_T\right\|^2 \le T \max\limits_n \left\|x_n\right\|^2$, so
$$\cos\theta = \frac{w_f^T}{\left\|w_f\right\|} \frac{w_T}{\left\|w_T\right\|} = \frac{1}{\left\|w_f\right\|} \frac{w_f^T w_T}{\left\|w_T\right\|} \ge \frac{1}{\left\|w_f\right\|} \frac{T \min\limits_n y_n w_f^T x_n}{\sqrt{T}\sqrt{\max\limits_n \left\|x_n\right\|^2}} = \sqrt{T} \frac{\rho}{R}$$
$$\therefore \cos\theta = \frac{w_f^T}{\left\|w_f\right\|} \frac{w_T}{\left\|w_T\right\|} \ge \sqrt{T} \cdot constant, \quad \text{where } constant = \frac{\rho}{R}$$
Fun Time
Let’s upper-bound T, the number of mistakes that PLA ‘corrects’.
Define
$$R^2 = \max_n \|x_n\|^2, \qquad \rho = \frac{\min\limits_n y_n w_f^T x_n}{\|w_f\|}$$
We want to show that $T \le \square$. Express the upper bound $\square$ by the two terms above.
1. $\frac{R}{\rho}$
2. $\frac{R^2}{\rho^2}$ ✓
3. $\frac{R}{\rho^2}$
4. $\frac{\rho^2}{R^2}$
Explanation
$$\sqrt{T} \frac{\rho}{R} \le \frac{w_f^T}{\left\|w_f\right\|} \frac{w_T}{\left\|w_T\right\|} = \cos\theta \le 1 \quad \therefore T \le \frac{R^2}{\rho^2}$$
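The bound $T \le R^2/\rho^2$ can be checked empirically (an illustrative sketch; the target $w_f$ and random data are made up for the experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
w_f = np.array([0.0, 1.0, -1.0])    # a hypothetical perfect perceptron
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, (50, 2))])
y = np.sign(X @ w_f)                # label the data with w_f: linearly separable
y[y == 0] = 1

# run cyclic PLA, counting the number of corrections T
w, T = np.zeros(3), 0
changed = True
while changed:
    changed = False
    for x_n, y_n in zip(X, y):
        if np.sign(w @ x_n) != y_n:
            w, T, changed = w + y_n * x_n, T + 1, True

R2 = max(x_n @ x_n for x_n in X)                                       # R^2
rho = min(y_n * (w_f @ x_n) for x_n, y_n in zip(X, y)) / np.linalg.norm(w_f)
print(T <= R2 / rho**2)   # True: T never exceeds R^2 / rho^2
```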
Non-Separable Data
More about PLA
Guarantee
If the data are linearly separable, there exists a perceptron that separates them perfectly.
Process
Correction after correction, the perceptron's weights are updated.
Result
Perfect separation is eventually reached, but the number of updates needed is not known in advance.
Pocket PLA
Suppose the data contain noise and are not linearly separable; we then want a perceptron that classifies the data as well as possible.
Pocket PLA: modify the PLA algorithm by keeping the best weights seen so far in a 'pocket'.
(1) initialize $w \leftarrow w_0$ and pocket weights $\hat{w} \leftarrow w_0$
(2) update $w$:
For t = 0, 1, ...
find a (random) mistake of $w_t$ called $(x_{n(t)}, y_{n(t)})$: $\mathrm{sign}(w_t^T x_{n(t)}) \ne y_{n(t)}$
(try to) correct the mistake by $w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$
if $w_{t+1}$ makes fewer mistakes than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$
... until enough iterations
return $\hat{w}$ (called $w_{Pocket}$) as $g$
The Pocket PLA algorithm:
Update $w_t$ exactly as PLA does, but use a pocket ($\hat{w}$) to store the perceptron with the fewest mistakes seen so far.
After a fixed amount of time or number of updates, return the current $\hat{w}$.
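A Pocket PLA sketch following the listing above (illustrative; the noisy toy data and helper names are my own):

```python
import numpy as np

def num_mistakes(w, X, y):
    """Count samples where sign(w^T x_n) != y_n."""
    return int(np.sum(np.sign(X @ w) != y))

def pocket_pla(X, y, max_iter=1000, seed=0):
    """Pocket PLA: update w on a random mistake exactly as PLA does,
    but keep the best weights seen so far in the pocket w_hat."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), num_mistakes(w, X, y)
    for _ in range(max_iter):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            return w                 # linearly separable: w is already perfect
        n = rng.choice(wrong)        # pick a random mistake
        w = w + y[n] * X[n]          # same update rule as PLA
        m = num_mistakes(w, X, y)
        if m < best:                 # fewer mistakes: put w in the pocket
            w_hat, best = w.copy(), m
    return w_hat                     # w_Pocket

# noisy toy data: the last label is flipped, so D is not linearly separable
X = np.array([[1., 2., 2.], [1., 3., 3.], [1., -2., -1.],
              [1., -1., -3.], [1., 2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])
w = pocket_pla(X, y)
print(num_mistakes(w, X, y))   # the pocket tolerates the noisy point
```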
Fun Time
Should we use pocket or PLA?
Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what’s the difference between the two?
1 pocket on D is slower than PLA ✓
2 pocket on D is faster than PLA
3 pocket on D returns a better g in approximating f than PLA
4 pocket on D returns a worse g in approximating f than PLA
Explanation
If the data are linearly separable, PLA and Pocket PLA give the same result: a perceptron that classifies the data perfectly (no mistakes).
But Pocket PLA is slower, mainly because in each round it must count the mistakes over the whole dataset; comparing against and replacing the current best perceptron also takes some time.
Summary
Lecture summary
Perceptron Hypothesis Set
The perceptron hypothesis set consists of hyperplanes in a multidimensional space.
Perceptron Learning Algorithm (PLA)
$\mathrm{sign}(w_t^T x_{n(t)}) \ne y_{n(t)}$, $\quad w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$
$g \leftarrow w_T$
Guarantee of PLA
If the dataset is linearly separable, PLA finds a perceptron that classifies it perfectly.
Non-Separable Data
If the dataset is not linearly separable, Pocket PLA finds a reasonably good perceptron.
Pocket PLA updates $w_t$ exactly as PLA does, but keeps a pocket ($\hat{w}$) storing the perceptron with the fewest mistakes seen so far.
$g \leftarrow \hat{w}$
PLA: an algorithm $A$ takes linearly separable $D$ and perceptrons $H$ to get hypothesis $g$.
References
《Machine Learning Foundations》(机器学习基石)—— Hsuan-Tien Lin (林轩田)