This note first introduces the perceptron model, then describes the perceptron learning strategy, in particular the loss function, and finally presents the perceptron learning algorithm in both its primal and dual forms, together with a proof of the algorithm's convergence.
Perceptron Model
Input space (feature space): $\mathcal{X} \subseteq R^n$, with instances $x \in \mathcal{X}$.
Output space: $\mathcal{Y} = \{+1,-1\}$, with labels $y \in \mathcal{Y}$.
The following function from the input space to the output space is called a perceptron:
$$f(x) = \mathrm{sign}(\omega \cdot x + b)$$
where $\omega \in R^n$ is called the weight vector, $b \in R$ is called the bias, $\omega \cdot x$ denotes the inner product, and $\mathrm{sign}$ is the sign function:
$$\mathrm{sign}(x) = \begin{cases} +1 & x \geq 0 \\ -1 & x \lt 0 \end{cases}$$
- The perceptron is a linear classification model and a discriminative model; it corresponds to a separating hyperplane in the input space (feature space) that divides instances into positive and negative classes.
- The hypothesis space of the perceptron model is the set of all linear classification functions (linear classifiers) defined on the feature space, i.e. the function set $\{f \mid f(x) = \omega \cdot x + b\}$.
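As a minimal sketch of the model definition above (the function names, weight vector, and bias are my own illustrative choices, not from the original text):

```python
import numpy as np

def sign(z):
    # sign function as defined above: +1 for z >= 0, -1 for z < 0
    return 1 if z >= 0 else -1

def perceptron_predict(w, b, x):
    # f(x) = sign(w . x + b)
    return sign(np.dot(w, x) + b)

# Illustrative parameters (assumed, not from the text)
w = np.array([1.0, -1.0])
b = 0.5

print(perceptron_predict(w, b, np.array([2.0, 1.0])))   # 1.5 >= 0, prints 1
print(perceptron_predict(w, b, np.array([-2.0, 1.0])))  # -2.5 < 0, prints -1
```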
Perceptron Learning Strategy
Goal: determine the model parameters $\omega, b$.
Loss function: the total distance from the misclassified points to the separating hyperplane $S$. The distance from a point $x_i$ to $S$ is
$$\frac{1}{||\omega||}|\omega \cdot x_i + b|$$
where $||\omega||$ is the $L_2$ norm of $\omega$.
Since a misclassified point satisfies
$$-y_i(\omega \cdot x_i + b) \gt 0$$
the distance from a misclassified point to the hyperplane is
$$-\frac{1}{||\omega||}y_i(\omega \cdot x_i + b)$$
Let $M$ be the set of misclassified points; then the total distance from all misclassified points to the hyperplane is
$$-\frac{1}{||\omega||}\sum_{x_i \in M}y_i(\omega \cdot x_i + b)$$
Dropping the constant factor $\frac{1}{||\omega||}$ gives the loss function:
$$L(\omega,b) = -\sum_{x_i \in M}y_i(\omega \cdot x_i + b)$$
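The loss can be evaluated directly from this formula. A small sketch (function names and the toy data are my own; here a point exactly on the hyperplane is counted as misclassified, using a margin-$\leq 0$ condition):

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    # L(w, b) = - sum_{x_i in M} y_i (w . x_i + b),
    # where M is the set of misclassified points (margin <= 0 here)
    margins = y * (X @ w + b)
    return -np.sum(margins[margins <= 0])

# Toy data: only the third point is misclassified by w = (1, 0), b = 0,
# with margin -0.5, so the loss is 0.5
X = np.array([[1.0, 0.0], [-1.0, 0.0], [-0.5, 0.0]])
y = np.array([1, -1, 1])
print(perceptron_loss(np.array([1.0, 0.0]), 0.0, X, y))  # prints 0.5
```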
Perceptron Learning Algorithm
Perceptron learning is cast as the optimization problem of minimizing this loss function; the optimization method is stochastic gradient descent.
Primal Form of the Perceptron Learning Algorithm
Input: training data set $T = \{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$, where $x_i \in \mathcal{X} = R^n$, $y_i \in \mathcal{Y} = \{-1,+1\}$, $i=1,2,\cdots,N$; learning rate $\eta$ $(0 \lt \eta \leq 1)$.
Output: $\omega, b$; perceptron model $f(x) = \mathrm{sign}(\omega \cdot x + b)$.
- Choose initial values $\omega_0, b_0$;
- Select a data point $(x_i,y_i)$ from the training set;
- If $y_i(\omega \cdot x_i + b) \leq 0$, update
$$\omega \leftarrow \omega + \eta y_ix_i \\ b \leftarrow b + \eta y_i$$
- Return to step 2, until there are no misclassified points.
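The steps above can be sketched in Python (a minimal implementation; function and variable names are my own, and the initial values are fixed to zero). On a standard toy data set $x_1=(3,3)$, $x_2=(4,3)$, $x_3=(1,1)$ with labels $(+1,+1,-1)$ and $\eta = 1$, scanning the points in order yields $\omega = (1,1)^T$, $b = -3$:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Primal-form perceptron: scan the data repeatedly, updating on each
    misclassified point, until one full pass makes no mistakes."""
    w = np.zeros(X.shape[1])      # step 1: initial values w_0 = 0, b_0 = 0
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):  # step 2: pick a training point
            if yi * (np.dot(w, xi) + b) <= 0:  # step 3: misclassified?
                w += eta * yi * xi             # w <- w + eta * y_i * x_i
                b += eta * yi                  # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:         # step 4: stop once a pass is error-free
            break
    return w, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = perceptron_train(X, y)
print(w, b)   # [1. 1.] -3.0
```

Note that the learned hyperplane depends on the order in which misclassified points are visited; a different scan order can end at a different separator.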
Convergence of the Algorithm
Convergence result: the number of misclassifications $k$ is bounded above, so after finitely many searches the algorithm finds a hyperplane that separates the training data completely and correctly.
Let $\hat{\omega} = (\omega^T,b)^T$ and $\hat{x} = (x^T,1)^T$, with $\hat{\omega} \in R^{n+1}$, $\hat{x} \in R^{n+1}$; then
$$\hat{\omega} \cdot \hat{x} = \omega \cdot x + b$$
Theorem:
Suppose the training data set is linearly separable. Then:
- There exists a hyperplane $\hat{\omega}_{opt} \cdot \hat{x} = \omega_{opt} \cdot x + b_{opt} = 0$ satisfying $||\hat{\omega}_{opt}|| = 1$ that separates the training data completely and correctly; moreover, there exists $\gamma \gt 0$ such that for all $i = 1,2,\cdots,N$,
$$y_i(\hat{\omega}_{opt} \cdot \hat{x}_i) = y_i(\omega_{opt} \cdot x_i + b_{opt}) \geq \gamma$$
- Let $R = \mathop{max}\limits_{1 \leq i \leq N}||\hat{x}_i||$; then the number of misclassifications $k$ of the perceptron algorithm on the training data satisfies
$$k \leq \left( \frac{R}{\gamma} \right)^2$$
Proof:
- Since the training data set is linearly separable, there exists a hyperplane that separates it completely and correctly; take this hyperplane to be $\hat{\omega}_{opt} \cdot \hat{x} = \omega_{opt} \cdot x + b_{opt} = 0$ with $||\hat{\omega}_{opt}|| = 1$. Since for the finitely many indices $i = 1,2,\cdots,N$ we have
$$y_i(\hat{\omega}_{opt} \cdot \hat{x}_i) = y_i(\omega_{opt} \cdot x_i + b_{opt}) \gt 0$$
the quantity
$$\gamma = \mathop{min}\limits_{i}\{y_i(\omega_{opt} \cdot x_i + b_{opt})\}$$
is positive and satisfies
$$y_i(\hat{\omega}_{opt} \cdot \hat{x}_i) = y_i(\omega_{opt} \cdot x_i + b_{opt}) \geq \gamma$$
- The perceptron algorithm starts from $\hat{\omega}_0 = 0$ and updates the weights whenever an instance is misclassified. Let $\hat{\omega}_{k-1}$ be the augmented weight vector before the $k$-th misclassified instance, i.e.
$$\hat{\omega}_{k-1} = (\omega_{k-1}^T,b_{k-1})^T$$
Then the condition for the $k$-th misclassified instance is
$$y_i(\hat{\omega}_{k-1} \cdot \hat{x}_i) = y_i(\omega_{k-1} \cdot x_i + b_{k-1}) \leq 0$$
If $(x_i,y_i)$ is misclassified by $\hat{\omega}_{k-1} = (\omega_{k-1}^T,b_{k-1})^T$, then $\omega$ and $b$ are updated as
$$\omega_k \leftarrow \omega_{k-1} + \eta y_ix_i \\ b_k \leftarrow b_{k-1} + \eta y_i$$
that is,
$$\hat{\omega}_k = \hat{\omega}_{k-1} + \eta y_i \hat{x}_i$$
Recursion 1:
$$\begin{aligned} \hat{\omega}_k \cdot \hat{\omega}_{opt} &= \hat{\omega}_{k-1} \cdot \hat{\omega}_{opt} + \eta y_i \hat{\omega}_{opt} \cdot \hat{x}_i \\ &\geq \hat{\omega}_{k-1} \cdot \hat{\omega}_{opt} + \eta\gamma \\ &\geq \hat{\omega}_{k-2} \cdot \hat{\omega}_{opt} + 2\eta\gamma \\ &\geq \cdots \\ &\geq k\eta\gamma \end{aligned}$$
Recursion 2:
$$\begin{aligned} ||\hat{\omega}_k||^2 &= ||\hat{\omega}_{k-1}||^2 + 2\eta y_i\hat{\omega}_{k-1} \cdot \hat{x}_i + \eta^2||\hat{x}_i||^2 \\ &\leq ||\hat{\omega}_{k-1}||^2 + \eta^2||\hat{x}_i||^2 \\ &\leq ||\hat{\omega}_{k-1}||^2 + \eta^2R^2 \\ &\leq ||\hat{\omega}_{k-2}||^2 + 2\eta^2R^2 \\ &\leq \cdots \\ &\leq k\eta^2 R^2 \end{aligned}$$
Combining the two recursions:
$$k\eta\gamma \leq \hat{\omega}_k \cdot \hat{\omega}_{opt} \leq ||\hat{\omega}_k||\ ||\hat{\omega}_{opt}|| \leq \sqrt{k}\eta R$$
Squaring and dividing by $\eta^2$ gives $k^2\gamma^2 \leq kR^2$, hence
$$k \leq \left( \frac{R}{\gamma} \right)^2$$
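The bound can be checked numerically on a small separable data set (the data, the chosen separating hyperplane, and all names below are my own illustrations; any unit-norm separating hyperplane yields a valid $\gamma$, since the proof only needs some separator with margin at least $\gamma$):

```python
import numpy as np

# Toy separable data (illustrative)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# Augmented instances x_hat = (x^T, 1)^T and R = max_i ||x_hat_i||
X_hat = np.hstack([X, np.ones((len(X), 1))])
R = np.linalg.norm(X_hat, axis=1).max()

# gamma from one known separating hyperplane w = (1, 1), b = -3, normalized
w_sep = np.array([1.0, 1.0, -3.0])
w_sep /= np.linalg.norm(w_sep)
gamma = (y * (X_hat @ w_sep)).min()
assert gamma > 0          # this hyperplane does separate the data

# Run the perceptron in augmented form, counting the mistakes k
w_hat, k, eta = np.zeros(3), 0, 1.0
changed = True
while changed:
    changed = False
    for xi, yi in zip(X_hat, y):
        if yi * np.dot(w_hat, xi) <= 0:
            w_hat += eta * yi * xi
            k += 1
            changed = True

print(k, (R / gamma) ** 2)   # k = 7, well below the bound (R/gamma)^2 = 286
```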
Dual Form of the Perceptron Algorithm
Without loss of generality, assume $\omega_0, b_0$ are both zero. Recall the update rule
$$\omega \leftarrow \omega + \eta y_ix_i \\ b \leftarrow b + \eta y_i$$
As $\omega, b$ are modified step by step, suppose instance $(x_i,y_i)$ triggers $n_i$ updates in total; then the cumulative increments of $\omega$ and $b$ contributed by $(x_i,y_i)$ are $\alpha_iy_ix_i$ and $\alpha_iy_i$ respectively, where $\alpha_i = n_i\eta$. This gives:
$$\omega = \sum_{i = 1}^N\alpha_iy_ix_i \\ b = \sum_{i = 1}^N\alpha_iy_i$$
Input: training data set $T = \{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$, where $x_i \in \mathcal{X} = R^n$, $y_i \in \mathcal{Y} = \{-1,+1\}$, $i=1,2,\cdots,N$; learning rate $\eta$ $(0 \lt \eta \leq 1)$.
Output: $\alpha, b$.
Perceptron model:
$$f(x) = \mathrm{sign}\left(\sum_{j = 1}^N\alpha_jy_jx_j \cdot x + b \right),\quad \alpha = (\alpha_1,\alpha_2,\cdots,\alpha_N)^T$$
- $\alpha \leftarrow 0, b \leftarrow 0$;
- Select a data point $(x_i,y_i)$ from the training set;
- If $y_i\left(\sum_{j = 1}^N\alpha_jy_jx_j \cdot x_i + b \right) \leq 0$, update
$$\alpha_i \leftarrow \alpha_i + \eta \\ b \leftarrow b + \eta y_i$$
- Return to step 2, until there are no misclassified data.
Note: during the computation the training instances appear only through inner products, which can therefore be precomputed and stored. The resulting matrix is called the Gram matrix: $G = [x_i \cdot x_j]_{N \times N}$.
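A minimal sketch of the dual form using a precomputed Gram matrix (function and variable names are my own). On the toy data $x_1=(3,3)$, $x_2=(4,3)$, $x_3=(1,1)$ with labels $(+1,+1,-1)$ and $\eta = 1$, scanning the points in order yields $\alpha = (2,0,5)^T$, $b=-3$, which recovers $\omega = (1,1)^T$:

```python
import numpy as np

def perceptron_train_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual-form perceptron: track per-instance update counts through alpha,
    evaluating the decision rule via the precomputed Gram matrix."""
    N = len(y)
    G = X @ X.T              # Gram matrix G[i, j] = x_i . x_j, computed once
    alpha = np.zeros(N)      # alpha_i = n_i * eta
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            # misclassified if y_i (sum_j alpha_j y_j x_j . x_i + b) <= 0
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += eta          # alpha_i <- alpha_i + eta
                b += eta * y[i]          # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
alpha, b = perceptron_train_dual(X, y)
w = (alpha * y) @ X          # recover w = sum_i alpha_i y_i x_i
print(alpha, b, w)           # [2. 0. 5.] -3.0 [1. 1.]
```

Since the update condition is equivalent to the primal one, the same scan order produces the same sequence of updates as the primal algorithm; the saving is that each check reads a row of $G$ instead of recomputing inner products.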