《统计学习方法》 (Statistical Learning Methods), Chapter 2: The Perceptron

The Perceptron Model

  • Definition: Suppose the input space (feature space) is $\chi \subseteq R^n$ and the output space is $\gamma=\{+1,-1\}$. The input $x \in \chi$ is the feature vector of an instance, corresponding to a point in the input space (feature space); the output $y \in \gamma$ is the class of the instance. The function from the input space to the output space
    $f(x)=\mathrm{sign}(w\cdot{x}+b)$
    is called the perceptron, where $w$ and $b$ are the model parameters: $w \in R^n$ is called the weight or weight vector, $b \in R$ is called the bias, and $w\cdot{x}$ denotes the inner product of $w$ and $x$. $\mathrm{sign}$ is the sign function, i.e.
    $\mathrm{sign}(x)=\begin{cases} +1 & x\ge0\\ -1 & x<0 \end{cases}$
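
A quick illustration of the model (a minimal sketch of my own, not from the book); the weight vector `w`, bias `b`, and sample `x` below are made-up values:

```python
import numpy as np

def sign(z):
    """Sign function used by the perceptron: +1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

def perceptron_predict(w, b, x):
    """Decision function f(x) = sign(w . x + b)."""
    return sign(np.dot(w, x) + b)

# made-up example
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.5])
print(perceptron_predict(w, b, x))  # -> 1, since 2*1 - 1*0.5 - 0.5 = 1 >= 0
```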

Learning Strategy of the Perceptron

  • If $(x_i,y_i)$ is correctly classified, then $y_i(w\cdot{x_i}+b)>0$; if $(x_i,y_i)$ is misclassified, then $y_i(w\cdot{x_i}+b)\le0$.
    Define the loss function $L(w,b)=-\sum\limits_{x_i \in M} y_i(w\cdot{x_i}+b)$, where $M$ is the set of misclassified points; up to a factor of $1/\lVert w\rVert$, this is the total distance from the misclassified points to the separating hyperplane.
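
A small sketch (my own illustration, with made-up data) of computing $L(w,b)$ by summing $-y_i(w\cdot{x_i}+b)$ over the misclassified points:

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """L(w, b) = -sum over misclassified points of y_i * (w . x_i + b)."""
    margins = y * (X @ w + b)          # y_i * (w . x_i + b) for every point
    misclassified = margins <= 0       # points in the set M
    return -np.sum(margins[misclassified])

# made-up toy data: two points, one of them misclassified by (w, b)
X = np.array([[1.0, 1.0], [2.0, -1.0]])
y = np.array([1, -1])
w = np.array([0.5, 0.5])
b = 0.0
print(perceptron_loss(w, b, X, y))  # 0.5: only the second point is misclassified
```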

Perceptron Learning Algorithm

  • We use stochastic gradient descent. The gradients of the loss are
    $\nabla_wL(w,b)=-\sum\limits_{x_i \in M}y_ix_i$
    $\nabla_bL(w,b)=-\sum\limits_{x_i \in M}y_i$
    so for a single misclassified point $(x_i,y_i)$ the update is
    $w \leftarrow w+\eta y_ix_i$
    $b \leftarrow b+\eta y_i$, where $\eta$ is the learning rate.
  • Algorithm (primal form):
    (1) Input: training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i \in \chi=R^n$, $y_i \in\gamma=\{-1,+1\}$,
    $i=1,2,\dots,N$, and learning rate $\eta\ (0<\eta\le1)$. Output: $w,b$ and $f(x)=\mathrm{sign}(w\cdot{x}+b)$. Initialize $w$ and $b$ (e.g. to zero).
    (2) Select a data point $(x_i,y_i)$ from the training set.
    (3) If $y_i(w\cdot{x_i}+b)\le0$, update
          $w \leftarrow w+\eta y_ix_i$
          $b \leftarrow b+\eta y_i$
    (4) Go back to (2) until no point in the training set is misclassified.
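
A minimal NumPy sketch of this primal algorithm (my own illustration; the toy data, the `max_epochs` cap, and the function name are assumptions, not from the book):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Primal perceptron: pick misclassified points and apply
    w <- w + eta*y_i*x_i, b <- b + eta*y_i until none are misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):                # cap the passes in case the data is not separable
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassification condition
                w += eta * yi * xi
                b += eta * yi
                errors += 1
        if errors == 0:                        # no misclassified points left: stop
            break
    return w, b

# made-up linearly separable toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(w, b)  # one separating hyperplane; the result depends on eta and the visiting order
```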
  • Convergence of the algorithm:
      Let the training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ be linearly separable, where $x_i \in \chi=R^n$, $y_i \in\gamma=\{-1,+1\}$, $i=1,2,\dots,N$.
      Write $\hat{w}=(w^T,b)^T$ and $\hat{x}=(x^T,1)^T$ for the augmented vectors, so that $\hat{w}\cdot\hat{x}=w\cdot{x}+b$. Then there exists a hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot{x}+b_{opt}=0$ with $\lVert\hat{w}_{opt}\rVert=1$ that separates the training set completely and correctly, and there exists $\gamma>0$ such that for all
       $i=1,2,\dots,N$
       $y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot{x_i}+b_{opt})\ge\gamma$
      Let $R=\max\limits_{1\le i\le N}\lVert\hat{x}_i\rVert$. Then the number of misclassifications $k$ made by the perceptron algorithm on the training set satisfies
       $k \le\left(\frac{R}{\gamma}\right)^2$
    • Proof:
      (1)
        Since the training set is linearly separable, there exists a hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot{x}+b_{opt}=0$ that separates it completely and correctly; scale it so that $\lVert\hat{w}_{opt}\rVert=1$. Because for the finitely many $i=1,2,\dots,N$ we have
           $y_{i}(\hat{w}_{opt}\cdot\hat{x}_i)=y_{i}(w_{opt}\cdot{x_i}+b_{opt})>0$
          there exists $\gamma=\min\limits_i\{y_i(w_{opt}\cdot{x_i}+b_{opt})\}>0$ such that
           $y_{i}(\hat{w}_{opt}\cdot\hat{x}_i)=y_{i}(w_{opt}\cdot{x_i}+b_{opt})\ge\gamma$
      (2) The perceptron starts from $\hat{w}_0=0$ and updates the weights whenever a point is misclassified. Let $\hat{w}_{k-1}$ be the augmented weight vector before the $k$-th misclassification,
           $\hat{w}_{k-1}=(w_{k-1}^T,b_{k-1})^T$
          The $k$-th misclassified instance $(x_i,y_i)$ satisfies $y_i(\hat{w}_{k-1}\cdot\hat{x}_i)=y_i(w_{k-1}\cdot{x_i}+b_{k-1})\le0$, and the corresponding update is $\hat{w}_{k}=\hat{w}_{k-1}+\eta y_i\hat{x}_i$.
          We prove two inequalities:
          1) $\hat{w}_{k}\cdot\hat{w}_{opt}\ge k\eta\gamma$
           $\hat{w}_{k}\cdot\hat{w}_{opt}=\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta y_i\hat{x}_{i}\cdot\hat{w}_{opt}\ge\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\gamma$, using $y_i(\hat{w}_{opt}\cdot\hat{x}_i)\ge\gamma$. Recursing,
           $\hat{w}_{k}\cdot\hat{w}_{opt}\ge\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\gamma\ge\hat{w}_{k-2}\cdot\hat{w}_{opt}+2\eta\gamma\ge\dots\ge k\eta\gamma$
          2) $\lVert\hat{w}_{k}\rVert^2\le k\eta^2R^2$
           $\lVert\hat{w}_{k}\rVert^2=\lVert\hat{w}_{k-1}\rVert^2+2\eta y_i\hat{w}_{k-1}\cdot\hat{x}_{i}+\eta^2\lVert\hat{x}_{i}\rVert^2\le\lVert\hat{w}_{k-1}\rVert^2+\eta^2\lVert\hat{x}_{i}\rVert^2\le\lVert\hat{w}_{k-1}\rVert^2+\eta^2R^2\le\lVert\hat{w}_{k-2}\rVert^2+2\eta^2R^2\le\dots\le k\eta^2R^2$
          (the cross term is dropped because $y_i\hat{w}_{k-1}\cdot\hat{x}_{i}\le0$ for a misclassified point).
          Combining the two inequalities,
           $k\eta\gamma\le\hat{w}_k\cdot\hat{w}_{opt}\le\lVert\hat{w}_k\rVert\lVert\hat{w}_{opt}\rVert\le\sqrt{k}\eta R$
           so $k^2\gamma^2\le kR^2$, i.e.
           $k\le\left(\frac{R}{\gamma}\right)^2$, which completes the proof.
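
As a sanity check (my own sketch, not from the book), the bound can be verified numerically on made-up separable data: count the actual number of updates $k$ and compare it with $(R/\gamma)^2$ computed from a hand-picked unit-norm separating hyperplane.

```python
import numpy as np

# made-up linearly separable data in augmented form x_hat = (x, 1)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
X_hat = np.hstack([X, np.ones((len(X), 1))])

# count the mistakes k made by the perceptron (eta = 1, w_hat_0 = 0)
w_hat, k = np.zeros(3), 0
while True:
    mistakes = [i for i in range(len(X)) if y[i] * (w_hat @ X_hat[i]) <= 0]
    if not mistakes:
        break
    i = mistakes[0]
    w_hat += y[i] * X_hat[i]
    k += 1

# bound (R / gamma)^2 for a hand-picked unit-norm separating hyperplane
w_opt = np.array([1.0, 1.0, -4.0])        # x1 + x2 - 4 = 0 separates this toy data
w_opt /= np.linalg.norm(w_opt)
gamma = np.min(y * (X_hat @ w_opt))
R = np.max(np.linalg.norm(X_hat, axis=1))
print(k, "<=", (R / gamma) ** 2)          # the number of updates never exceeds the bound
```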
  • Dual form
          Each primal update is
          $w \leftarrow w+\eta y_ix_i$
          $b \leftarrow b+\eta y_i$
          so if, starting from zero, example $i$ has been used for an update $n_i$ times, the learned parameters can be written as
          $w=\sum\limits_{i=1}^Na_{i}y_ix_i$
          $b=\sum\limits_{i=1}^Na_{i}y_i$
         where $N$ is the number of training examples and $a_i=n_{i}\eta\ge0$.
    • Algorithm (dual form):
          (1) Input: training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i \in \chi=R^n$,
          $y_i \in\gamma=\{-1,+1\}$, $i=1,2,\dots,N$, and learning rate $\eta\ (0<\eta\le1)$. Output: $a,b$ and $f(x)=\mathrm{sign}(\sum\limits_{j=1}^Na_jy_jx_j\cdot{x}+b)$, where $a=(a_1,a_2,\dots,a_N)$. Initialize $a=0$, $b=0$.
          (2) Select a data point $(x_i,y_i)$ from the training set.
          (3) If $y_i(\sum\limits_{j=1}^Na_jy_jx_j\cdot{x_i}+b)\le0$, update
                $a_i \leftarrow a_i+\eta$
                $b \leftarrow b+\eta y_i$
          (4) Go back to (2) until no point in the training set is misclassified.
  • Gram matrix: $G=[x_i\cdot{x_j}]_{N\times N}$. In the dual form the training instances appear only through inner products, so the Gram matrix can be precomputed and looked up during training.
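
A minimal sketch of the dual algorithm using the Gram matrix (my own illustration; the toy data and the `max_epochs` cap are assumptions):

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual perceptron: keep a coefficient a_i per example and apply
    a_i <- a_i + eta, b <- b + eta*y_i whenever example i is misclassified."""
    N = len(X)
    G = X @ X.T                                            # Gram matrix G[i, j] = x_i . x_j
    a = np.zeros(N)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:  # misclassification test
                a[i] += eta
                b += eta * y[i]
                errors += 1
        if errors == 0:
            break
    return a, b

# made-up toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
a, b = train_perceptron_dual(X, y)
w = (a * y) @ X                                            # recover w = sum_i a_i * y_i * x_i
print(a, b, w)
```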
Full implementation code to be added later.