Perceptron Model
- Definition: Suppose the input space (feature space) is $\mathcal{X} \subseteq R^n$ and the output space is $\mathcal{Y}=\{+1,-1\}$. The input $x \in \mathcal{X}$ denotes the feature vector of an instance and corresponds to a point of the input space (feature space); the output $y \in \mathcal{Y}$ denotes the class of the instance. The function from the input space to the output space is:
$f(x)=\mathrm{sign}(w\cdot x+b)$ is called the perceptron model, where $w$ and $b$ are the perceptron parameters: $w \in R^n$ is called the weight (or weight vector), $b \in R$ is called the bias, and $w\cdot x$ denotes the inner product of $w$ and $x$. $\mathrm{sign}$ is the sign function:
$$\mathrm{sign}(x)=\begin{cases} +1 & x\ge 0\\ -1 & x<0 \end{cases}$$
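The model can be sketched directly in code (a minimal NumPy sketch; the name `perceptron_predict` and the example weights are illustrative, not from the source):

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Perceptron decision function f(x) = sign(w·x + b).

    Note: np.sign(0) returns 0, while the model defines sign(0) = +1,
    so the boundary case is mapped to +1 explicitly.
    """
    return 1 if np.dot(w, x) + b >= 0 else -1

# Illustrative example: w = (1, 1), b = -2 classifies points
# relative to the line x1 + x2 = 2.
w = np.array([1.0, 1.0])
b = -2.0
print(perceptron_predict(w, b, np.array([3.0, 0.0])))  # above the line -> +1
print(perceptron_predict(w, b, np.array([0.0, 0.0])))  # below the line -> -1
```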
Perceptron Learning Strategy
- If $(x_i,y_i)$ is a correctly classified point, then $y_i(w\cdot x_i+b)>0$; if $(x_i,y_i)$ is a misclassified point, then $y_i(w\cdot x_i+b)\le 0$.
Define the loss function
$$L(w,b)=-\sum\limits_{x_i \in M} y_i(w\cdot x_i+b)$$
where $M$ is the set of misclassified points. Up to the constant factor $\frac{1}{\lVert w\rVert}$, this is the total distance of the misclassified points to the separating hyperplane.
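The loss can be evaluated by summing $-y_i(w\cdot x_i+b)$ over the misclassified points (a hedged sketch; the function name `perceptron_loss` and the sample data are my own):

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """L(w, b) = -sum over misclassified points of y_i * (w·x_i + b).

    X: (N, n) array of instances; y: (N,) array of labels in {+1, -1}.
    A point is misclassified when y_i * (w·x_i + b) <= 0.
    """
    margins = y * (X @ w + b)        # functional margins y_i(w·x_i + b)
    misclassified = margins <= 0     # the set M
    return -np.sum(margins[misclassified])

# Illustrative data: two positive points and one negative point.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_loss(np.array([-1.0, -1.0]), 1.0, X, y))  # both positives misclassified
```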
Perceptron Learning Algorithm
- Here we use (stochastic) gradient descent. The gradients of the loss are
$$\nabla_w L(w,b)=-\sum\limits_{x_i \in M} y_i x_i$$
$$\nabla_b L(w,b)=-\sum\limits_{x_i \in M} y_i$$
Picking one misclassified point $(x_i,y_i)$ at a time gives the updates
$$w \leftarrow w+\eta\, y_i x_i$$
$$b \leftarrow b+\eta\, y_i$$
where $\eta$ is the learning rate.
- Algorithm:
(1) Input: training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i \in \mathcal{X}=R^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$, and learning rate $\eta\ (0<\eta\le 1)$. Output: $w,b$ and $f(x)=\mathrm{sign}(w\cdot x+b)$. Initialize $w=0$, $b=0$.
(2) Select a data point $(x_i,y_i)$ from the training set.
(3) If $y_i(w\cdot x_i+b)\le 0$, update
$$w \leftarrow w+\eta\, y_i x_i$$
$$b \leftarrow b+\eta\, y_i$$
(4) Go to (2); the algorithm stops when no misclassified points remain.
- Proof of convergence (Novikoff theorem):
Let the training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ be linearly separable, where $x_i \in \mathcal{X}=R^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$. Write $\hat{w}=(w^T,b)^T$ and $\hat{x}=(x^T,1)^T$ for the augmented weight and instance vectors, so that $w\cdot x+b=\hat{w}\cdot\hat{x}$. Then:
There exists a hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot x+b_{opt}=0$ with $\lVert\hat{w}_{opt}\rVert=1$ that separates the training data completely correctly, and there exists $\gamma>0$ such that for all $i=1,2,\dots,N$
$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})\ge\gamma$$
Let $R=\max\limits_{1\le i\le N}\lVert\hat{x}_i\rVert$. Then the number of misclassifications $k$ that the perceptron algorithm makes on the training set satisfies
$$k \le \left(\frac{R}{\gamma}\right)^2$$
- Proof:
(1) Take a hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot x+b_{opt}=0$ that separates the data, scaled so that $\lVert\hat{w}_{opt}\rVert=1$. Since all of the finitely many points $i=1,2,\dots,N$ are correctly classified,
$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})>0$$
so taking $\gamma=\min\limits_i\{y_i(w_{opt}\cdot x_i+b_{opt})\}$ gives
$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})\ge\gamma$$
(2) The perceptron starts from $\hat{w}_0=0$ and updates the weights whenever a point is misclassified. Let $\hat{w}_{k-1}=(w_{k-1}^T,b_{k-1})^T$ be the augmented weight vector before the $k$-th misclassification. The condition for the $k$-th misclassified instance $(x_i,y_i)$ is
$$y_i(\hat{w}_{k-1}\cdot\hat{x}_i)=y_i(w_{k-1}\cdot x_i+b_{k-1})\le 0$$
and the corresponding update is $\hat{w}_k=\hat{w}_{k-1}+\eta\, y_i\hat{x}_i$.
We prove two inequalities:
1) $\hat{w}_k\cdot\hat{w}_{opt}\ge k\eta\gamma$
$$\hat{w}_k\cdot\hat{w}_{opt}=\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\, y_i\,\hat{w}_{opt}\cdot\hat{x}_i\ge\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\gamma$$
Applying this recursively,
$$\hat{w}_k\cdot\hat{w}_{opt}\ge\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\gamma\ge\hat{w}_{k-2}\cdot\hat{w}_{opt}+2\eta\gamma\ge\dots\ge k\eta\gamma$$
2) $\lVert\hat{w}_k\rVert^2\le k\eta^2 R^2$
$$\lVert\hat{w}_k\rVert^2=\lVert\hat{w}_{k-1}\rVert^2+2\eta\, y_i\,\hat{w}_{k-1}\cdot\hat{x}_i+\eta^2\lVert\hat{x}_i\rVert^2\le\lVert\hat{w}_{k-1}\rVert^2+\eta^2\lVert\hat{x}_i\rVert^2\le\lVert\hat{w}_{k-1}\rVert^2+\eta^2 R^2\le\lVert\hat{w}_{k-2}\rVert^2+2\eta^2 R^2\le\dots\le k\eta^2 R^2$$
Combining the two inequalities,
$$k\eta\gamma\le\hat{w}_k\cdot\hat{w}_{opt}\le\lVert\hat{w}_k\rVert\,\lVert\hat{w}_{opt}\rVert\le\sqrt{k}\,\eta R$$
so $k^2\gamma^2\le kR^2$, and therefore
$$k\le\left(\frac{R}{\gamma}\right)^2$$
which completes the proof.
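The bound can be checked numerically on a small linearly separable set (an illustrative sketch; the data and variable names are my own, and the learned hyperplane, normalized to unit norm, stands in for $\hat{w}_{opt}$):

```python
import numpy as np

# A small linearly separable training set.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
eta = 1.0

# Run the primal perceptron, counting the number of updates k.
w, b, k = np.zeros(2), 0.0, 0
while True:
    mistakes = [i for i in range(len(y)) if y[i] * (X[i] @ w + b) <= 0]
    if not mistakes:
        break
    i = mistakes[0]
    w, b, k = w + eta * y[i] * X[i], b + eta * y[i], k + 1

# Check k <= (R / gamma)^2 using augmented vectors x_hat = (x, 1)
# and the learned separating hyperplane scaled to norm 1.
X_hat = np.hstack([X, np.ones((len(y), 1))])
w_hat = np.append(w, b) / np.linalg.norm(np.append(w, b))
gamma = np.min(y * (X_hat @ w_hat))            # margin of this hyperplane
R = np.max(np.linalg.norm(X_hat, axis=1))      # R = max ||x_hat_i||
print(k, (R / gamma) ** 2)                     # k stays below the bound
```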
- Dual form
Each update of the primal algorithm has the form
$$w \leftarrow w+\eta\, y_i x_i$$
$$b \leftarrow b+\eta\, y_i$$
so the final parameters can be written as
$$w=\sum\limits_{i=1}^N a_i y_i x_i$$
$$b=\sum\limits_{i=1}^N a_i y_i$$
where $N$ is the number of training samples and $a_i=n_i\eta\ge 0$, with $n_i$ the number of updates triggered by instance $i$.
- Algorithm:
(1) Input: training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i \in \mathcal{X}=R^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$, and learning rate $\eta\ (0<\eta\le 1)$. Output: $a,b$ and $f(x)=\mathrm{sign}\left(\sum\limits_{j=1}^N a_j y_j x_j\cdot x+b\right)$, where $a=(a_1,a_2,\dots,a_N)$. Initialize $a=0$, $b=0$.
(2) Select a data point $(x_i,y_i)$ from the training set.
(3) If $y_i\left(\sum\limits_{j=1}^N a_j y_j x_j\cdot x_i+b\right)\le 0$, update
$$a_i \leftarrow a_i+\eta$$
$$b \leftarrow b+\eta\, y_i$$
(4) Go to (2); the algorithm stops when no misclassified points remain.
- Gram matrix: the dual algorithm uses the training instances only through inner products, which can be precomputed and stored as the Gram matrix $G=[x_i\cdot x_j]_{N×N}$.
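The dual algorithm with a precomputed Gram matrix can be sketched as follows (a minimal sketch; `train_dual` and the sample data are illustrative names of my own):

```python
import numpy as np

def train_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual-form perceptron: track a_i and b instead of w.

    Uses the precomputed Gram matrix G[i, j] = x_i · x_j, so training
    touches the instances only through inner products.
    """
    N = len(y)
    G = X @ X.T                      # Gram matrix
    a, b = np.zeros(N), 0.0
    for _ in range(max_epochs):      # cap epochs in case of non-separable data
        errors = 0
        for i in range(N):
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:
                a[i] += eta          # a_i <- a_i + eta
                b += eta * y[i]      # b   <- b + eta * y_i
                errors += 1
        if errors == 0:
            break
    return a, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
a, b = train_dual(X, y)
w = (a * y) @ X                      # recover w = sum_i a_i y_i x_i
print(a, w, b)
```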
Implementation code to be added later.
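In the meantime, the primal algorithm above can be sketched as follows (a minimal NumPy implementation; the function name `train_perceptron` and the sample data are my own):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Primal-form perceptron: repeat steps (2)-(4) until no misclassification.

    X: (N, n) instances; y: (N,) labels in {+1, -1}; eta: learning rate in (0, 1].
    Assumes linearly separable data; otherwise the loop stops after
    max_epochs without converging.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(len(y)):
            if y[i] * (X[i] @ w + b) <= 0:   # misclassified point
                w += eta * y[i] * X[i]       # w <- w + eta * y_i * x_i
                b += eta * y[i]              # b <- b + eta * y_i
                errors += 1
        if errors == 0:
            break
    return w, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(w, b)
```

After convergence every training point satisfies $y_i(w\cdot x_i+b)>0$.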