2.1 The Perceptron Model
Model
The perceptron is a linear model for binary classification: its input is the feature vector of an instance, and its output is the instance's class, taking the value +1 or −1.
Assume the input space is $X \subseteq R^{n}$ with input variable $x \in X$, and the output space is $Y=\{+1,-1\}$ with output variable $y \in \{+1,-1\}$.
The function mapping the input space to the output space is:
$$f(x)=\operatorname{sign}(w \cdot x+b)$$
where $w$ is the weight vector, $b$ is the bias term, and $\operatorname{sign}$ is the sign function, i.e.
$$\operatorname{sign}(x)=\left\{\begin{aligned} 1, & \quad x \geq 0 \\ -1, & \quad x<0 \end{aligned}\right.$$
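A minimal sketch of the model in Python (NumPy assumed), following the convention $\operatorname{sign}(0)=+1$ from the definition above:

```python
import numpy as np

def sign(z):
    """Sign function as defined above: +1 for z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

def perceptron_predict(x, w, b):
    """f(x) = sign(w . x + b) for a single sample or a batch of samples."""
    return sign(x @ w + b)

# Example: a 2-D instance classified against the hyperplane w . x + b = 0
w = np.array([1.0, -1.0])
b = 0.5
print(perceptron_predict(np.array([2.0, 1.0]), w, b))  # +1: positive side
```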
The perceptron is a linear discriminative model; its goal is to find a separating hyperplane that linearly divides the training data.
The equation $w \cdot x+b=0$ defines a hyperplane $S$ in $n$-dimensional space, where $w$ is the normal vector of the hyperplane and $b$ is its intercept. This hyperplane divides the feature space into two parts, and the points in the two parts are assigned to the positive and negative classes respectively; hence $S$ is called the separating hyperplane. The feature space is the whole $n$-dimensional space: each attribute of a sample is a feature, and every possible combination of attribute values corresponds to a point in this space.
Perceptron Learning Strategy
Functional Margin and Geometric Margin
Consider the distance from an arbitrary point $x_0$ in the space to the hyperplane $S$.
Functional margin:
$$\left|w \cdot x_{0}+b\right|$$
Geometric margin:
$$\frac{1}{\|w\|}\left|w \cdot x_{0}+b\right|, \qquad \|w\|_{2}=\sqrt{\sum_{i=1}^{n} w_{i}^{2}}$$
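A small sketch contrasting the two margins (NumPy assumed; the values are made up):

```python
import numpy as np

def functional_margin(x0, w, b):
    """|w . x0 + b| -- depends on the scale of (w, b)."""
    return abs(w @ x0 + b)

def geometric_margin(x0, w, b):
    """|w . x0 + b| / ||w||_2 -- the actual Euclidean distance to the hyperplane."""
    return functional_margin(x0, w, b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # ||w||_2 = 5
b = -5.0
x0 = np.array([2.0, 1.0])  # w . x0 + b = 6 + 4 - 5 = 5
print(functional_margin(x0, w, b))  # 5.0
print(geometric_margin(x0, w, b))   # 1.0
```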
For misclassified data,
$$-y_{i}\left(w \cdot x_{i}+b\right)>0$$
The distance from a misclassified point $x_i$ to the hyperplane $S$ is
$$-\frac{1}{\|w\|} y_{i}\left(w \cdot x_{i}+b\right)$$
Therefore, the total distance from all misclassified points to the hyperplane $S$ is:
$$-\frac{1}{\|w\|} \sum_{x_{i} \in M} y_{i}\left(w \cdot x_{i}+b\right)$$
Loss function: the total distance from the misclassified points to the hyperplane (with the factor $\frac{1}{\|w\|}$ dropped)
$$L(w, b)=-\sum_{x_{i} \in M} y_{i}\left(w \cdot x_{i}+b\right)$$
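A sketch of evaluating this loss over the misclassified set $M$ (NumPy assumed; labels are taken to be $\pm 1$):

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """L(w, b) = -sum over misclassified points of y_i (w . x_i + b).

    A point counts as misclassified when y_i (w . x_i + b) <= 0, matching
    the update condition used later; the loss is therefore always >= 0.
    """
    margins = y * (X @ w + b)        # y_i (w . x_i + b) for every point
    misclassified = margins <= 0     # boolean mask: the set M
    return -margins[misclassified].sum()
```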
Any point $x_0$ in the input space $R^{n}$ lies at distance $\dfrac{|w \cdot x_0+b|}{\|w\|}$ from the hyperplane $S$, where $\|w\|$ is the $L_2$ norm of $w$. Next, for a misclassified point $(x_i, y_i)$, the inequality $-y_i(w\cdot x_i+b)>0$ holds: when $w \cdot x_{i}+b>0$, the true label is $y_i=-1$, and when $w\cdot x_i+b<0$, the true label is $y_i=+1$. Therefore, the distance from a misclassified point $x_i$ to the hyperplane $S$ is
$$-\frac{1}{\|w\|} y_{i}\left(w \cdot x_{i}+b\right)$$
Let $M$ be the set of points misclassified by the hyperplane $S$. Then the total distance from all misclassified points to $S$ is
$$-\frac{1}{\|w\|} \sum_{x_{i} \in M} y_{i}\left(w \cdot x_{i}+b\right)$$
Dropping the factor $\frac{1}{\|w\|}$ gives the perceptron learning loss function
$$L(w, b)=-\sum_{x_{i} \in M} y_{i}\left(w \cdot x_{i}+b\right)$$
where $M$ is the set of misclassified points.
Perceptron Learning Algorithm
Algorithm 2.1 (stochastic gradient descent)
Input: training data set $T=\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$
- Choose initial values $w_0, b_0$ for the hyperplane.
- Select a data point $(x_i,y_i)$ from the training set; if $y_{i}\left(w \cdot x_{i}+b\right) \leq 0$, use gradient descent to minimize the objective function:
$$\begin{aligned} &L(w, b)=-\sum_{x_{i} \in M} y_{i}\left(w \cdot x_{i}+b\right) \\ &\nabla_{w} L(w, b)=-\sum_{x_{i} \in M} y_{i} x_{i} \\ &\nabla_{b} L(w, b)=-\sum_{x_{i} \in M} y_{i} \end{aligned}$$
- Update $w, b$:
$$\begin{aligned} &w \leftarrow w+\eta y_{i} x_{i} \\ &b \leftarrow b+\eta y_{i} \end{aligned}$$
- Go back to step 2 until there are no misclassified points in the training set.

Output: $w, b$
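A minimal Python sketch of Algorithm 2.1 (NumPy assumed; the toy data at the bottom is a made-up linearly separable example):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Primal-form perceptron: cycle through the data and update on each
    misclassified point until a full pass makes no mistakes."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # initial values w_0 = 0, b_0 = 0
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified
                w += eta * yi * xi       # w <- w + eta * y_i * x_i
                b += eta * yi            # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:                # no misclassified points: done
            break
    return w, b  # may be non-separating if max_epochs was reached

# Toy data: two classes in R^2
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(w, b)  # one valid separating hyperplane: w = [1. 1.], b = -3.0
```

Note that `max_epochs` is a safety cap: by the convergence result in Section 2.3, the loop terminates on its own when the data are linearly separable, but it would otherwise cycle forever.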
2.2 The Dual Form
The dual form of the perceptron model:
$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i} \cdot x+b\right), \quad \alpha = (\alpha_1, \alpha_2, \cdots, \alpha_N)^T$$
Algorithm 2.2 (primal form)
Input: training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i} \in X=R^{n}$, $y_{i} \in Y=\{-1,+1\}$, $i=1,2, \ldots, N$; learning rate $\eta$ $(0<\eta \leq 1)$;
Output: $w$, $b$; the perceptron model $f(x)=\operatorname{sign}(w \cdot x+b)$
- Choose initial values $w_{0}, b_{0}$
- Select a data point $(x_i,y_i)$ from the training set
- If $y_{i}\left(w \cdot x_{i}+b\right) \leq 0$:
$$\begin{gathered} w \leftarrow w+\eta y_{i} x_{i} \\ b \leftarrow b+\eta y_{i} \end{gathered}$$
- Go back to step 2 until there are no misclassified points in the training set
Derivation:
- Each parameter update takes the form:
$$\begin{gathered} w \leftarrow w+\eta y_{i} x_{i} \\ b \leftarrow b+\eta y_{i} \end{gathered}$$
- Updating this way, suppose the parameters are modified $n_i$ times on the sample point $(x_i,y_i)$. Then the total increments to $w$ and $b$ contributed by that point are $\alpha_{i} y_{i} x_{i}$ and $\alpha_{i} y_{i}$, where $\alpha_{i}=n_{i} \eta$:
$$\begin{aligned} &w=\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i} \\ &b=\sum_{i=1}^{N} \alpha_{i} y_{i} \end{aligned}$$
- The primal perceptron model is:
$$f(x)=\operatorname{sign}(w \cdot x+b)$$
- Substituting these expressions for $w$ and $b$ into the primal model gives:
$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i} \cdot x+b\right)$$
The Dual Form of the Perceptron
Input: a linearly separable data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i} \in R^{n}$, $y_{i} \in\{-1,+1\}$, $i=1,2, \ldots, N$; learning rate $\eta$ $(0<\eta \leq 1)$;
Output: $\alpha, b$; the perceptron model $f(x)=\operatorname{sign}\left(\sum_{j=1}^{N} \alpha_{j} y_{j} x_{j} \cdot x+b\right)$, where $\alpha=\left(\alpha_{1}, \alpha_{2}, \ldots, \alpha_{N}\right)^{T}$.
- $\alpha \leftarrow 0, b \leftarrow 0$
- Select a data point $(x_i,y_i)$ from the training set
- If $y_{i}\left(\sum_{j=1}^{N} \alpha_{j} y_{j} x_{j} \cdot x_{i}+b\right) \leq 0$:
$$\begin{aligned} &\alpha_{i} \leftarrow \alpha_{i}+\eta \\ &b \leftarrow b+\eta y_{i} \end{aligned}$$
- Go back to step 2 until there are no misclassified points in the training set
In the dual form, the training instances appear only through inner products. For convenience, the inner products between all pairs of training instances can be precomputed and stored as a matrix, the so-called Gram matrix:
$$G=\left[x_{i} \cdot x_{j}\right]_{N \times N}$$
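A sketch of the dual-form algorithm using a precomputed Gram matrix (NumPy assumed; `train_perceptron_dual` is an illustrative name, and the same toy data as in the primal sketch would work here):

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual-form perceptron: learn alpha and b, touching the instances
    only through inner products via the precomputed Gram matrix."""
    N = X.shape[0]
    G = X @ X.T                      # Gram matrix G = [x_i . x_j]_{N x N}
    alpha = np.zeros(N)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            # y_i (sum_j alpha_j y_j x_j . x_i + b) <= 0  =>  misclassified
            if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:
                alpha[i] += eta      # alpha_i <- alpha_i + eta
                b += eta * y[i]      # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:
            break
    w = (alpha * y) @ X              # recover w = sum_i alpha_i y_i x_i
    return alpha, b, w
```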
2.3 Convergence of the Algorithm
We need to show that the primal form of the perceptron learning algorithm converges on a linearly separable data set.
To simplify the derivation, absorb the bias $b$ into the weight vector, writing $\widehat{w}=\left(w^{T}, b\right)^{T}$, and likewise augment the input vector with a constant 1, writing $\widehat{x}=\left(x^{T}, 1\right)^{T}$. Clearly, after this transformation, $\widehat{w} \cdot \widehat{x}=w \cdot x+b$.
Let the training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$ be linearly separable, where $x_{i} \in X=R^{n}$, $y_{i} \in Y=\{-1,+1\}$, $i=1,2, \ldots, N$. Then:
- There exists a hyperplane $\widehat{w}_{\text{opt}} \cdot \widehat{x}=w_{\text{opt}} \cdot x+b_{\text{opt}}=0$ satisfying $\left\|\widehat{w}_{\text{opt}}\right\|=1$ that separates the training data completely and correctly; moreover, there exists $\gamma>0$ such that for all $i=1,2, \ldots, N$:
$$y_{i}\left(\widehat{w}_{\text{opt}} \cdot \widehat{x}_{i}\right)=y_{i}\left(w_{\text{opt}} \cdot x_{i}+b_{\text{opt}}\right) \geq \gamma$$
- Let $R=\max _{1 \leq i \leq N}\left\|\widehat{x}_{i}\right\|$. Then the number of misclassifications $k$ made by the perceptron algorithm on the training data satisfies:
$$k \leq\left(\frac{R}{\gamma}\right)^{2}$$
Proof of (1)
Since the training data set is linearly separable, by Definition 2.2 there exists a hyperplane that separates it completely and correctly. Take this hyperplane to be $\widehat{w}_{\text{opt}} \cdot \widehat{x}=w_{\text{opt}} \cdot x+b_{\text{opt}}=0$, scaled so that $\left\|\widehat{w}_{\text{opt}}\right\|=1$. Since for the finitely many $i=1,2, \ldots, N$ we have
$$y_{i}\left(\widehat{w}_{\text{opt}} \cdot \widehat{x}_{i}\right)=y_{i}\left(w_{\text{opt}} \cdot x_{i}+b_{\text{opt}}\right)>0$$
there exists
$$\gamma=\min _{i}\left\{y_{i}\left(w_{\text{opt}} \cdot x_{i}+b_{\text{opt}}\right)\right\}$$
such that
$$y_{i}\left(\widehat{w}_{\text{opt}} \cdot \widehat{x}_{i}\right)=y_{i}\left(w_{\text{opt}} \cdot x_{i}+b_{\text{opt}}\right) \geq \gamma$$
Proof of (2)
The perceptron algorithm starts from $\widehat{w}_{0}=0$ and updates the weights whenever an instance is misclassified. Let $\widehat{w}_{k-1}$ be the augmented weight vector before the $k$-th misclassified instance, i.e.
$$\widehat{w}_{k-1}=\left(w_{k-1}^{T}, b_{k-1}\right)^{T}$$
Then the condition for the $k$-th misclassified instance is
$$y_{i}\left(\widehat{w}_{k-1} \cdot \widehat{x}_{i}\right)=y_{i}\left(w_{k-1} \cdot x_{i}+b_{k-1}\right) \leq 0$$
If $\left(x_{i}, y_{i}\right)$ is a data point misclassified by $\widehat{w}_{k-1}=\left(w_{k-1}^{T}, b_{k-1}\right)^{T}$, then $w$ and $b$ are updated as
$$\begin{gathered} w_{k} \leftarrow w_{k-1}+\eta y_{i} x_{i} \\ b_{k} \leftarrow b_{k-1}+\eta y_{i} \end{gathered}$$
that is,
$$\widehat{w}_{k}=\widehat{w}_{k-1}+\eta y_{i} \widehat{x}_{i}$$
We now derive two inequalities. First:
$$\widehat{w}_{k} \cdot \widehat{w}_{\text{opt}} \geq k \eta \gamma$$
From the update rule (Eq. (2.11) in the book) and the margin bound (Eq. (2.8) in the book),
$$\begin{aligned} \widehat{w}_{k} \cdot \widehat{w}_{\text{opt}} &=\widehat{w}_{k-1} \cdot \widehat{w}_{\text{opt}}+\eta y_{i} \widehat{w}_{\text{opt}} \cdot \widehat{x}_{i} \\ & \geq \widehat{w}_{k-1} \cdot \widehat{w}_{\text{opt}}+\eta \gamma \end{aligned}$$
Recursing on this inequality gives (Eq. (2.12) in the book)
$$\widehat{w}_{k} \cdot \widehat{w}_{\text{opt}} \geq \widehat{w}_{k-1} \cdot \widehat{w}_{\text{opt}}+\eta \gamma \geq \widehat{w}_{k-2} \cdot \widehat{w}_{\text{opt}}+2 \eta \gamma \geq \cdots \geq k \eta \gamma$$
Second:
$$\left\|\widehat{w}_{k}\right\|^{2} \leq k \eta^{2} R^{2}$$
From the update rule (Eq. (2.11) in the book) and the misclassification condition (Eq. (2.10) in the book),
$$\begin{aligned} \left\|\widehat{w}_{k}\right\|^{2} &=\left\|\widehat{w}_{k-1}\right\|^{2}+2 \eta y_{i} \widehat{w}_{k-1} \cdot \widehat{x}_{i}+\eta^{2}\left\|\widehat{x}_{i}\right\|^{2} \\ & \leq\left\|\widehat{w}_{k-1}\right\|^{2}+\eta^{2}\left\|\widehat{x}_{i}\right\|^{2} \\ & \leq\left\|\widehat{w}_{k-1}\right\|^{2}+\eta^{2} R^{2} \\ & \leq\left\|\widehat{w}_{k-2}\right\|^{2}+2 \eta^{2} R^{2} \leq \cdots \\ & \leq k \eta^{2} R^{2} \end{aligned}$$
Combining the two inequalities above:
$$\begin{gathered} \widehat{w}_{k} \cdot \widehat{w}_{\text{opt}} \geq k \eta \gamma \\ \left\|\widehat{w}_{k}\right\|^{2} \leq k \eta^{2} R^{2} \\ k \eta \gamma \leq \widehat{w}_{k} \cdot \widehat{w}_{\text{opt}} \leq\left\|\widehat{w}_{k}\right\|\left\|\widehat{w}_{\text{opt}}\right\| \leq \sqrt{k} \eta R \\ k^{2} \gamma^{2} \leq k R^{2} \\ k \leq\left(\frac{R}{\gamma}\right)^{2} \end{gathered}$$
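The bound can be checked numerically on a toy separable data set: compute $R$ and $\gamma$ from a known separating hyperplane and compare with $(R/\gamma)^{2}$. The data and the chosen $\widehat{w}_{\text{opt}}$ below are made-up examples:

```python
import numpy as np

# Toy linearly separable data, augmented with a constant 1 for the bias
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
X_hat = np.hstack([X, np.ones((len(X), 1))])     # x_hat_i = (x_i^T, 1)^T

# A known separating hyperplane x1 + x2 - 3 = 0, normalized to ||w_hat|| = 1
w_hat_opt = np.array([1.0, 1.0, -3.0])
w_hat_opt /= np.linalg.norm(w_hat_opt)

gamma = np.min(y * (X_hat @ w_hat_opt))          # margin: min_i y_i (w_hat . x_hat_i)
R = np.max(np.linalg.norm(X_hat, axis=1))        # R = max_i ||x_hat_i||
print((R / gamma) ** 2)                          # upper bound on k; ~286 here
```

The bound is loose: the primal training loop sketched earlier makes only a handful of updates on this data, well under $(R/\gamma)^{2}$.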
Summary
- Because the number of perceptron misclassifications has a finite upper bound, a separating hyperplane that classifies the training set completely and correctly can be found after finitely many searches.
- When the data set is not linearly separable, the perceptron learning algorithm does not converge, and the iterates oscillate.