Support Vector Machines

Support Vector Machines (Review)

Linearly Separable Case

Training data with labels: $(X, y)$, $y \in \{+1, -1\}$.
Goal: find a hyperplane $W^TX+b=0$ that separates the $+1$ and $-1$ classes.

Training set: $\lbrace (x_i,y_i)\rbrace_{i=1,...,N}$.
$\exists (W,b)$ such that $\forall i=1,...,N$:

  1. $y_i=+1 \Rightarrow W^Tx_i+b \ge 0$
  2. $y_i=-1 \Rightarrow W^Tx_i+b < 0$

Both cases combine into $y_i(W^Tx_i+b)\ge0$.

Optimization problem: maximize the margin, which amounts to minimizing $\Vert W \Vert$.
Subject to: $y_i(W^Tx_i+b)\ge0, \quad i=1,...,N$

Fact 1: $W^TX+b=0$ and $aW^TX+ab=0$ describe the same hyperplane, for any nonzero real $a$.
Fact 2: the distance from a point $(x_0,y_0)$ to the line $w_1x+w_2y+b=0$ is
$$d=\frac {\vert w_1x_0+w_2y_0+b\vert}{\sqrt{w_1^2+w_2^2}}$$
Generalization: the distance from a vector $X_0$ to the hyperplane $W^TX+b=0$ is
$$d=\frac{\vert W^TX_0+b \vert}{\Vert W \Vert}$$
where $\Vert W \Vert=\sqrt{w_1^2+w_2^2+...+w_n^2}$ in $n$ dimensions.
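A quick numerical check of the distance formula (a minimal sketch; the hyperplane and point below are made up for illustration):

import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and a query point x0
w = np.array([3.0, 4.0])   # normal vector of the hyperplane
b = -5.0
x0 = np.array([1.0, 2.0])

# d = |w^T x0 + b| / ||w||
d = abs(w @ x0 + b) / np.linalg.norm(w)
print(d)  # |3*1 + 4*2 - 5| / 5 = 6/5 = 1.2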

We can rescale with such an $a$:
$$(W,b)\rightarrow (aW=W',ab=b')$$
so that on the support vectors $X_0$ (the training points closest to the hyperplane)
$$\vert W'^TX_0+b' \vert=1$$
The distance from a support vector to the hyperplane is then
$$d=\frac 1{\Vert W \Vert}$$

That is, under the linear model:

$$X\in \mathbb{R}^D,\quad y\in\{-1,+1\}, \qquad f(x;W,b)=W^Tx+b, \quad \Vert W \Vert \ \text{minimized}$$

Linearly Non-separable Case

Direct approach: the soft margin

Modified optimization problem:
Minimize: $\frac 1 2 \Vert W \Vert^2+C \sum_{i=1}^{N}{\xi_i}$

Subject to:
$$y_i(W^Tx_i+b)\ge1-\xi_i, \qquad \xi_i\ge0 \ (\text{slack variables}), \quad i=1,...,N$$
The $N$ slack variables $\xi_i$ relax each constraint, widening "$\ge 0$" to "$\ge a$" for an $a$ that can be made small enough; the $C$-weighted penalty keeps the $\xi_i$ from growing arbitrarily large. $C$ is set in advance and acts as a regularization term.
Parameters to solve for: $W, b, \xi_i$, where each $\xi_i$ corresponds to one training sample.

We are still looking for a single separating line or plane.

Intuition: the algorithm finds a line that classifies most points correctly; the $\xi_i$ exist only so that such a line can be found at all.
At test time the rule is still $y_j(W^Tx_j+b)\ge0$ (a test sample has no associated $\xi_j$).
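A minimal sketch of the role of $C$, using sklearn on synthetic data (the dataset and values are made up for illustration):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1}

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Small C tolerates larger slacks (wide margin, many violations);
    # large C penalizes slack heavily (narrow margin, few violations).
    print(C, clf.n_support_, clf.score(X, y))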

Reducing to the linear case

Idea: transform the linearly non-separable problem into a linearly separable one.

  1. Raise the dimensionality of the data X: the higher the dimension, the more likely linear separability becomes (that is, with enough features the data can always be separated).

$$X_i \rightarrow \Phi(X_i) \rightarrow y_i(W^T \Phi(X_i)+b) \ge 0$$

Example: the XOR problem (verified in code below)
$$\begin{aligned} \text{data:} \quad &x_1=(0,0),\ y_1=-1 \qquad x_4=(1,1),\ y_4=-1\\ &x_2=(0,1),\ y_2=+1 \qquad x_3=(1,0),\ y_3=+1\\ &\text{not separable in the 2-D plane: no } (W,b) \text{ satisfies } y_i(W^Tx_i+b)\ge0\\ \text{lifting to a higher dimension:} \quad &\text{for } x=(a,b), \text{ take } \Phi(x)=(a^2,b^2,a,b,ab)\\ &\text{one workable choice: } w=(1,1,1,1,-6),\ b=-1 \text{ makes } y_i(w^T\Phi(x_i)+b)\ge0 \text{ hold} \end{aligned}$$
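Checking the lifted XOR solution numerically (a short sketch; phi and the weights follow the text above):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, +1, +1, -1])

def phi(x):
    a, b = x
    return np.array([a**2, b**2, a, b, a*b])

w = np.array([1.0, 1.0, 1.0, 1.0, -6.0])
b = -1.0

for xi, yi in zip(X, y):
    print(xi, yi, yi * (w @ phi(xi) + b) >= 0)  # True for all four points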

Applying the direct (soft-margin) approach from above:

$$\begin{aligned} \text{Minimize:} \quad &\frac 1 2 \Vert W \Vert^2+C \sum_{i=1}^{N}{\xi_i}\\ \text{Subject to:} \quad &y_i(W^T\Phi(x_i)+b)\ge1-\xi_i ,\quad \xi_i\ge0 ,\quad i=1,...,N \end{aligned}$$
where $\Phi(x)$ may be infinite-dimensional.

An infinite-dimensional $\Phi$ makes the problem unsolvable as written.
Introduce a kernel function:
as long as we know $K(x_1,x_2)=\Phi(x_1)^T\Phi(x_2)$, the problem above remains solvable.

$K(X_1,X_2)$ can be written as $\Phi(X_1)^T\Phi(X_2)$ for some $\Phi$ if and only if:
$$\begin{aligned} &1.\ K(X_1,X_2)=K(X_2,X_1) \quad \text{(symmetry)}\\ &2.\ \forall C_i,X_i\ (i=1,...,N): \ \sum_{i=1}^{N}{\sum_{j=1}^{N}{C_iC_jK(X_i,X_j)}}\ge0 \quad \text{(positive semi-definiteness)} \end{aligned}$$

Common kernels: the Gaussian (RBF) kernel and the polynomial kernel.
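A minimal numerical check of the two conditions above for the Gaussian kernel $K(x_1,x_2)=e^{-\gamma\Vert x_1-x_2\Vert^2}$ (random data; gamma chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
gamma = 0.5

# Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

print(np.allclose(K, K.T))                   # condition 1: symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # condition 2: positive semi-definite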

Primal and Dual Problems
Goal: solve the optimization problem using only the kernel function, never $\Phi(X)$ itself¹.

Primal Problem
Minimize: $f(w)$
Subject to:
$$g_i(w)\le0\quad(i=1,..,k), \qquad h_i(w)=0\quad(i=1,..,m)$$

Dual Problem
Define:
$$\begin{aligned} L(w,\alpha,\beta)&=f(w)+\sum_{i=1}^{k}{\alpha_ig_i(w)}+\sum_{i=1}^{m}{\beta_ih_i(w)}\\ &=f(w)+\alpha^Tg(w)+\beta^Th(w) \\ \text{Maximize:} \quad &\theta(\alpha,\beta)=\inf_{\text{all } w}{L(w,\alpha,\beta)}\\ \text{Subject to:} \quad & \alpha_i \ge 0, \quad i=1,...,k \end{aligned}$$

Relation between the primal and the dual
If $w^*$ solves the primal problem and $\alpha^*,\beta^*$ solves the dual problem, then
$$f(w^*)\ge \theta(\alpha^*,\beta^*)$$

Proof:
$$\begin{aligned} \theta(\alpha^*,\beta^*)&=\inf_{\text{all } w}{L(w,\alpha^*,\beta^*)}\\ &\le L(w^*,\alpha^*,\beta^*)\\ &=f(w^*)+\sum_{i=1}^{k}{\alpha_i^*g_i(w^*)} + \sum_{i=1}^{m}{\beta_i^*h_i(w^*)} && \big(g_i(w^*)\le0,\ h_i(w^*)=0,\ \alpha^*\ge0\big)\\ &\le f(w^*) \end{aligned}$$
Equality holds throughout only if $w^*$ minimizes $L(w,\alpha^*,\beta^*)$ over $w$, and if for every $i=1,..,k$ either $\alpha_i^*=0$ or $g_i(w^*)=0$; the latter is known as the KKT (complementary slackness) condition.

Definition:
Let $w^*$ solve the primal problem and $\alpha^*,\beta^*$ the dual problem. Then
$$G=f(w^*)-\theta(\alpha^*,\beta^*)\ge0$$
$G$ is called the duality gap between the primal and the dual problem.
For certain problems (strong duality), $G=0$.

Strong duality (the case of practical interest):
if $f(w)$ is convex and $g(w)=aw+b$ and $h(w)=cw+d$ are linear (affine) functions,
then the duality gap is $G=0$, i.e. $f(w^*)=\theta(\alpha^*,\beta^*)$.
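A one-variable worked example of the gap closing (not from the original; chosen for simplicity):
$$\begin{aligned} &\text{Minimize } f(w)=w^2 \text{ subject to } g(w)=1-w\le 0. \quad \text{Primal optimum: } w^*=1,\ f(w^*)=1.\\ &L(w,\alpha)=w^2+\alpha(1-w); \quad \frac{\partial L}{\partial w}=2w-\alpha=0 \implies w=\tfrac{\alpha}{2}\\ &\theta(\alpha)=\inf_w L(w,\alpha)=\alpha-\tfrac{\alpha^2}{4}, \quad \text{maximized at } \alpha^*=2: \ \theta(\alpha^*)=1=f(w^*)\\ &f \text{ is convex and } g \text{ is affine, so } G=0; \text{ note } \alpha^*>0 \text{ and } g(w^*)=0 \text{ (KKT).} \end{aligned}$$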

Converting the non-separable (kernelized) convex SVM problem into its dual

$$\begin{aligned} \text{Minimize:} \quad &f(W,\xi)=\frac 1 2 \Vert W \Vert^2+C \sum_{i=1}^{N}{\xi_i} \qquad (C \text{ is a preset regularization constant})\\ \text{Subject to:} \quad &y_i(W^T\Phi(x_i)+b)\ge1-\xi_i ,\\ &\xi_i\ge0 ,\quad i=1,...,N \end{aligned}$$
where $\Phi(x_i)$ may be infinite-dimensional.

Conversion

Rewritten primal problem (in the standard form $g_i(w)\le0$):
$$\begin{aligned} \text{Minimize:} \quad &f(w)=\frac 1 2 \Vert W \Vert^2-C \sum_{i=1}^{N}{\xi'_i}\\ \text{Subject to:} \quad &\xi'_i\le0\\ &1+\xi'_i-y_i(W^T\Phi(x_i)+b)\le0 \end{aligned}$$
Here $\xi'_i=-\xi_i$; below we simply write $\xi_i$ again.
Dual problem:
$$\begin{aligned} \text{Maximize:} \quad \theta(\alpha,\beta)&=\inf_{\text{all } W,\xi,b}{L(W,\xi,b)}\\ L(W,\xi,b)&=\frac 1 2\Vert W\Vert^2-C\sum_{i=1}^{N}{\xi_i}\\ &\quad+ \sum_{i=1}^{N}{\alpha_i\big(1+\xi_i-y_i(W^T\Phi(x_i)+b) \big)}\\ &\quad+ \sum_{i=1}^{N}{\beta_i\xi_i}\\ \text{Subject to:} \quad &\forall i=1,...,N: \ \alpha_i\ge0, \ \beta_i\ge0 \end{aligned}$$

Conditions from strong duality:
$L(W^*,\xi^*,b^*)$ attains the minimum over $(W,\xi,b)$, i.e. the partial derivatives vanish.

KKT conditions:
$$\begin{aligned} & \alpha_i=0 \ \text{ or } \ 1+\xi_i-y_i(W^T\Phi(x_i)+b)=0,\\ & \text{and} \quad \beta_i=0 \ \text{ or } \ \xi_i=0 \end{aligned}$$

Computation:
$$\begin{aligned} \frac {\partial L}{\partial W} =0 &\implies W=\sum_{i=1}^{N}{\alpha_iy_i\Phi(x_i)}\\ \frac {\partial L}{\partial \xi_i}=0 &\implies \alpha_i+\beta_i=C\\ \frac {\partial L}{\partial b}=0 &\implies \sum_{i=1}^{N}{\alpha_iy_i}=0 \end{aligned}$$

Substituting these back into $L$:
$$\begin{aligned} L(W,\xi,b)_{\min}&=\frac 1 2\Vert W\Vert^2-C\sum_{i=1}^{N}{\xi_i} + \sum_{i=1}^{N}{\alpha_i\big(1+\xi_i-y_i(W^T\Phi(x_i)+b) \big)} + \sum_{i=1}^{N}{\beta_i\xi_i}\\ &=\frac 1 2 \Vert W\Vert^2-C\sum_{i=1}^{N}{\xi_i}+\sum_{i=1}^{N}{\xi_i(\alpha_i+\beta_i)}+\sum_{i=1}^{N}{\alpha_i}-\sum_{i=1}^{N}{\alpha_iy_iW^T\Phi(x_i)}-\sum_{i=1}^{N}{\alpha_iy_i b}\\ &=\frac 1 2 \Vert W\Vert^2+\sum_{i=1}^{N}{\alpha_i}-\sum_{i=1}^{N}{\alpha_iy_iW^T\Phi(x_i)} \qquad \big(\text{using } \alpha_i+\beta_i=C \text{ and } \textstyle\sum_i\alpha_iy_i=0\big) \end{aligned}$$

Evaluating the two remaining quadratic pieces separately:
$$\begin{aligned} \frac 1 2 \Vert W\Vert^2&=\frac 1 2 W^TW =\frac 1 2\Big(\sum_{i=1}^{N}{\alpha_iy_i\Phi(x_i)}\Big)^T \Big(\sum_{j=1}^{N}{\alpha_jy_j\Phi(x_j)}\Big)\\ &=\frac 1 2\sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,\Phi(x_i)^T\Phi(x_j)}} \qquad \big(\Phi(x_i)^T\Phi(x_j)=K(x_i,x_j)\big)\\ -\sum_{i=1}^{N}{\alpha_iy_iW^T\Phi(x_i)}&=-\sum_{i=1}^{N}{\alpha_iy_i\Big(\sum_{j=1}^{N}{\alpha_jy_j \Phi(x_j)^T}\Big)\Phi(x_i)}\\ &=-\sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}} \end{aligned}$$

Hence:
$$L(W,\xi,b)_{\min}=\sum_{i=1}^{N}{\alpha_i}-\frac 12 \sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}}$$
in which only the $\alpha_i, \alpha_j$ are unknown.

Final objective ($\beta$ has been eliminated). Maximize:
$$\theta(\alpha)=\sum_{i=1}^{N}{\alpha_i}-\frac 12 \sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}}$$
Subject to (original constraints plus the derivative results):
$$0\le \alpha_i\le C \quad \forall i=1,...,N, \qquad \sum_{i=1}^{N}{\alpha_iy_i}=0$$

This is a convex optimization problem; the SMO algorithm solves it for $\alpha$.
Solving the dual yields $\alpha$; back in the primal we still need $W$ and $b$.
$$W=\sum_{i=1}^{N}{\alpha_iy_i\Phi(x_i)} \quad \text{cannot be evaluated, since } \Phi(x_i) \text{ is not available.}$$

But for classification we never need $W$ explicitly: at test time, for a sample $(x,y)$, we only compute $\hat y=W^T\Phi(x)+b$ and compare its sign against $y$.

$$W^T\Phi(x)=\sum_{i=1}^{N}{\alpha_iy_i\,\Phi(x_i)^T\Phi(x)}=\sum_{i=1}^{N}{\alpha_iy_iK(x_i,x)}$$

What about $b$? Use the KKT conditions:
$$\alpha_i=0 \ \text{ or } \ 1+\xi_i-y_i(W^T\Phi(x_i)+b)=0, \qquad \text{and} \quad \beta_i=0 \ \text{ or } \ \xi_i=0$$

Solving for $b$:
$$\begin{aligned} &\text{From the } \alpha \text{ found by SMO, pick one } \alpha_i \text{ with } 0 < \alpha_i < C.\\ &\text{Then } \beta_i=C-\alpha_i \not =0, \text{ so } \xi_i=0;\\ &\text{and since } \alpha_i \not =0, \text{ also } 1+\xi_i-y_i(W^T\Phi(x_i)+b)=0, \text{ i.e. } 1-y_i(W^T\Phi(x_i)+b)=0;\\ & b=\frac {1-y_iW^T\Phi(x_i)}{y_i}=\frac{1-y_i\sum_{j=1}^{N}{\alpha_jy_jK(x_j,x_i)}}{y_i} \end{aligned}$$
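These formulas can be checked against sklearn, which exposes $\alpha_i y_i$ (support vectors only) as dual_coef_ and $b$ as intercept_; a minimal sketch on synthetic data:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = 2 * y - 1
clf = SVC(kernel='rbf', gamma=0.1, C=1.0).fit(X, y)

def rbf(A, B, gamma=0.1):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

# W^T Phi(x) + b = sum_i alpha_i y_i K(x_i, x) + b, summed over support vectors
K = rbf(clf.support_vectors_, X)
manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(X)))  # True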

the end!

Summary

Training set: $\lbrace x_i,y_i \rbrace, \ i=1,...,N$

The optimization problem:

Choose a feature map $\Phi(x)$ and solve for $W,\xi_i,b$:
$$\begin{aligned} \text{Minimize:} \quad &f(w)=\frac 1 2 \Vert W \Vert^2-C \sum_{i=1}^{N}{\xi_i}\\ \text{Subject to:} \quad &\xi_i\le0\\ &1+\xi_i-y_i(W^T\Phi(x_i)+b)\le0 \end{aligned}$$
After conversion:
solve for $\alpha_i$ (a convex optimization problem; use the SMO algorithm).
$$\begin{aligned} \text{Maximize:} \quad &\theta(\alpha)=\sum_{i=1}^{N}{\alpha_i}-\frac 12 \sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}}\\ \text{Subject to:} \quad &0 \le \alpha_i \le C, \quad i=1,...,N; \qquad \sum_{i=1}^{N}{\alpha_iy_i}=0 \end{aligned}$$

At test time, for a test sample $(x,y)$,
compute $W^T\Phi(x)$ and $b$:
$$W^T\Phi(x)=\sum_{i=1}^{N}{\alpha_iy_iK(x_i,x)}$$
From the training set, take one $\alpha_i$ with $0<\alpha_i<C$ and compute
$$b=\frac{1-y_i\sum_{j=1}^{N}{\alpha_jy_jK(x_j,x_i)}}{y_i}$$

If $(W^T\Phi(x)+b)\,y\ge0$ the prediction is correct; otherwise it is an error.
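The whole recipe fits in a few lines for a toy problem. A self-contained sketch solving the dual above with scipy's SLSQP on four points and a linear kernel (illustrative only; real implementations use SMO):

import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [1., 1.], [2., 0.], [3., 1.]])
y = np.array([-1., -1., 1., 1.])
C = 10.0
K = X @ X.T  # linear kernel: K(x_i, x_j) = x_i^T x_j

def neg_theta(a):
    # minimize -theta(alpha) = 1/2 sum_ij a_i a_j y_i y_j K_ij - sum_i a_i
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)  # sum_i alpha_i y_i = 0
bounds = [(0, C)] * len(y)                        # 0 <= alpha_i <= C
res = minimize(neg_theta, np.zeros(len(y)), bounds=bounds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                                  # W = sum_i alpha_i y_i x_i (explicit here, Phi = identity)
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))  # a support vector with 0 < alpha < C
b = (1 - y[sv] * (w @ X[sv])) / y[sv]                # b from the KKT condition
print(w, b, np.sign(X @ w + b) == y)                 # predictions match all labels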

Code: the chess-endgame (krkopt) example

import pandas as pd
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,roc_curve,auc
import seaborn as sns
import matplotlib.pyplot as plt
"""
前期数据整理
"""

# Load the dataset
data = pd.read_csv('krkopt.data', header=None)
data.dropna(inplace=True)  # drop rows with missing values
# Encode the samples numerically
for i in [0, 2, 4]:
    """
    Replace the board-file letters a,b,c,d,e,f,g,h with the numbers 1-8
    """
    data.loc[data[i] == 'a', i] = 1
    data.loc[data[i] == 'b', i] = 2
    data.loc[data[i] == 'c', i] = 3
    data.loc[data[i] == 'd', i] = 4
    data.loc[data[i] == 'e', i] = 5
    data.loc[data[i] == 'f', i] = 6
    data.loc[data[i] == 'g', i] = 7
    data.loc[data[i] == 'h', i] = 8

# Binarize the label: 'draw' -> +1, every other outcome -> -1
data.loc[data[6] != 'draw', 6] = -1
data.loc[data[6] == 'draw', 6] = 1

for i in range(6):
    data[i] = (data[i] - data[i].mean()) / data[i].std()  # standardize each feature to zero mean, unit variance

"""
模型建立
"""

# Split into training and test sets.
# The first six columns are features, the seventh is the label.
# Note: test_size ≈ 0.82, so only ~18% of the data (about 5000 samples) is used for training.
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :6], data[6].astype("int").values,
                                                    test_size=0.82178500142572)


# In SVC, larger C drives the training error down but overfits easily;
# smaller C allows more misclassified training points (a softer margin).
# Larger gamma shrinks each sample's radius of influence and makes the model more complex,
# so increase gamma when underfitting; smaller gamma is more tolerant and suits the overfitting case.

# Coarse search ranges for C and gamma
CScale = [i for i in range(100, 201, 10)]
gammaScale = [i / 10 for i in range(1, 11)]
cv_scores = 0
savei = 0
savej = 0


# Coarse grid search over the parameters
for i in CScale:
    for j in gammaScale:
        model = SVC(kernel='rbf', C=i, gamma=j)  # RBF-kernel SVM: regularization C, kernel width gamma
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')  # 5-fold cross-validation
        # `scores` holds the estimator's score for each run of the cross validation.
        if scores.mean() > cv_scores:
            cv_scores = scores.mean()
            savei = i
            savej = j * 100  # store gamma*100 so the fine grid below can use integer offsets

# Refine C and gamma around the best coarse values
CScale = [i for i in range(savei - 5, savei + 5)]
gammaScale = [i / 100 + 0.01 for i in range(int(savej) - 5, int(savej) + 5)]
cv_scores = 0
for i in CScale:
    for j in gammaScale:
        model = SVC(kernel='rbf', C=i, gamma=j)
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        if scores.mean() > cv_scores:
            cv_scores = scores.mean()
            savei = i
            savej = j

# Rebuild the SVM with the selected parameters and evaluate it
model = SVC(kernel='rbf', C=savei, gamma=savej)
model.fit(X_train, y_train)
pre = model.predict(X_test)
print(model.score(X_test, y_test))  # test-set accuracy



"""
绘图:简单评估
"""
# 绘制AUC和EER图形
cm = confusion_matrix(y_test, pre, labels=[-1, 1], sample_weight=None)
sns.set()
f, ax = plt.subplots()
sns.heatmap(cm, annot=True, ax=ax)  # confusion-matrix heat map
ax.set_title('confusion matrix')
ax.set_xlabel('predict')
ax.set_ylabel('true')
# Use the continuous decision scores rather than the hard labels,
# so the ROC curve has more than a single interior point.
fpr, tpr, threshold = roc_curve(y_test, model.decision_function(X_test))  # false/true positive rates
roc_auc = auc(fpr, tpr)  # AUC: area under the ROC curve; larger is better
lw = 2
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on x, TPR on y
plt.plot([0, 1], [1, 0], color='navy', lw=lw, linestyle='--')  # anti-diagonal, where FPR = 1 - TPR (the EER line)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
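The two hand-rolled parameter loops above do exactly what sklearn's GridSearchCV does; a sketch of the equivalent call (same coarse grid, parallelized with n_jobs):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': list(range(100, 201, 10)), 'gamma': [i / 10 for i in range(1, 11)]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
model = search.best_estimator_  # automatically refit on the whole training set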

Brief analysis

The support vector machine classifiers

class SVC(BaseSVC):
    """C-Support Vector Classification.  The implementation is based on libsvm.
    The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.

    For large datasets
    consider using :class:`~sklearn.svm.LinearSVC` or
    :class:`~sklearn.linear_model.SGDClassifier` instead, possibly after a
    :class:`~sklearn.kernel_approximation.Nystroem` transformer.

    Parameters
    ----------
    C : float, default=1.0.  Regularization parameter; must be positive.
        Larger C fits the training set more tightly (watch for overfitting).
    kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
        Linear, polynomial, Gaussian radial-basis-function, or sigmoid kernel.

    degree : int, default=3.  Degree of the polynomial kernel function ('poly').

    gamma : Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
    """
    pass

class LinearSVC(LinearClassifierMixin,SparseCoefMixin,BaseEstimator):
    """Linear Support Vector Classification.

    Similar to SVC with parameter kernel='linear', ... , so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

    Parameters
    ----------
    penalty : {'l1', 'l2'}, default='l2' Specifies the norm used in the penalization.

    loss : {'hinge', 'squared_hinge'}, default='squared_hinge'.  The (hinge) loss function.
    The combination of ``penalty='l1'`` and ``loss='hinge'`` is not supported.   
    tol : float, default=1e-4  Tolerance for stopping criteria.
    ...

    Examples
    --------
    >>> from sklearn.svm import LinearSVC
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.datasets import make_classification

    # Generate a random n-class classification problem.
      # n_samples : int, default=100  The number of samples.
      # n_features : int, default=20  The total number of features.
    >>> X, y = make_classification(n_features=4, random_state=0)

    # Construct a Pipeline from the given estimators.
    # A pipeline chains several transform steps with a final estimator (here: scaler -> classifier).
      # This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
      # return Pipeline(_name_estimators(steps), memory=memory, verbose=verbose)
    >>> clf = make_pipeline(StandardScaler(),
    ...                     LinearSVC(random_state=0, tol=1e-5))
    >>> clf.fit(X, y)
    Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('linearsvc', LinearSVC(random_state=0, tol=1e-05))])

    >>> print(clf.named_steps['linearsvc'].coef_)
    [[0.141...   0.526... 0.679... 0.493...]]

    >>> print(clf.named_steps['linearsvc'].intercept_)
    [0.1693...]
    >>> print(clf.predict([[0, 0, 0, 0]]))
    [1]
    """
    pass

class Nystroem(TransformerMixin, BaseEstimator):
    """Approximate a kernel map using a subset of the training data.

    Examples
    --------
    >>> from sklearn import datasets, svm
    >>> from sklearn.kernel_approximation import Nystroem
    >>> X, y = datasets.load_digits(n_class=9, return_X_y=True)
    >>> data = X / 16.
    >>> clf = svm.LinearSVC()
    >>> feature_map_nystroem = Nystroem(gamma=.2,
    ...                                 random_state=1,
    ...                                 n_components=300)
    >>> data_transformed = feature_map_nystroem.fit_transform(data)
    >>> clf.fit(data_transformed, y)
    LinearSVC()
    >>> clf.score(data_transformed, y)
    0.9987...
    """

Details of specific parameters

def train_test_split(*arrays, test_size=None, train_size=None, random_state=None,shuffle=True,stratify=None):
    """Split arrays or matrices into random train and test subsets

    random_state: Controls the shuffling applied to the data before applying the split.
    shuffle : Whether or not to shuffle the data before splitting.
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    >>> y_test
    [1, 4]
    # features X, labels y; the test set takes 33% of the data
    """
    # .....
    pass

def cross_val_score(estimator, X, y=None, *, groups=None, scoring=None,cv=None, n_jobs=None, verbose=0, fit_params=None,pre_dispatch='2*n_jobs', error_score=np.nan):
    """Evaluate a score by cross-validation  交叉验证
    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    # Linear Model trained with L1 prior as regularizer
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]
    """

Practical advice for using SVM

We propose that beginners try the following procedure first:

  1. Transform data to the format of an SVM package
  2. Conduct simple scaling on the data
  3. Consider the RBF kernel $K(x,y)=e^{-\gamma\Vert x-y \Vert^2}$
  4. Use cross-validation to find the best parameters C and γ
  5. Use the best C and γ to train the whole training set
  6. Test

Data handling:

  • Multi-class problems: use a one-hot encoding of the labels.
  • Scaling (to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges, and to reduce numerical difficulty) → sample normalization; a sketch follows below.
    We recommend linearly scaling each attribute to the range [−1, +1] or [0, 1].
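A minimal sketch of that linear scaling with sklearn's MinMaxScaler (toy data; fit the ranges on training data only, then reuse them on the test set):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., 200.], [3., 400.], [2., 300.]])  # toy features with very different ranges
X_test = np.array([[2., 250.]])

scaler = MinMaxScaler(feature_range=(-1, 1))  # the [-1, +1] range recommended above
X_train_s = scaler.fit_transform(X_train)     # fit the per-attribute ranges on training data
X_test_s = scaler.transform(X_test)           # apply the same ranges to the test data
print(X_train_s, X_test_s)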

Model selection (kernels)

  • The RBF kernel is a reasonable first choice:
    • it is a nonlinear mapping, so it can handle nonlinear relations between attributes and labels;
    • the linear kernel is a special case of the RBF kernel (a linear kernel with penalty C has the same capability as the RBF kernel for some (C, γ));
    • the sigmoid kernel behaves like the RBF kernel for certain parameters;
    • it has fewer hyperparameters than the polynomial kernel;
    • the RBF kernel has fewer numerical difficulties.

There are some situations where the RBF kernel is not suitable.
In particular, when the number of features is very large, one may just use the linear kernel.

  • Cross-validation and grid search:
    find a good (C, γ) — the goal is to identify a (C, γ) with which the
    classifier can accurately predict unknown data — while being careful about overfitting, hence cross-validation (splitting the dataset).

We recommend a "grid-search" on C and γ using cross-validation (exhaustive enumeration).

The grid-search is straightforward but seems naive.
In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate.

But grid search is still recommended, because:

  • we may not feel safe using methods that avoid an exhaustive parameter search by approximations or heuristics;
  • there are only two parameters (C, γ), so exhaustive search does not add much computational cost;
  • it is easy to parallelize, whereas many of the advanced methods are iterative processes, e.g. walking along a path, which can be hard to parallelize.

Since doing a complete grid-search may still be time-consuming, we recommend using a coarse grid first (as in the krkopt code: a coarse step of 0.1, then a fine step of 0.01).

When the linear kernel is recommended

When the number of features is large:
a. #samples ≪ #features (e.g. bioinformatics data; a code sketch follows after this list)

b. both #samples and #features are large (e.g. text data)
Such data often occur in document classification.
LIBSVM is not particularly good for this type of problem.
Fortunately, we have another software package, LIBLINEAR, which is very suitable for such data.

c. #samples ≫ #features: a nonlinear kernel is recommended here.
As the number of features is small, one often maps data to higher dimensional spaces.
However, if you really would like to use the linear kernel, you may use LIBLINEAR with the option -s 2.
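A small sketch of regime (a): many features, few samples, where a linear model usually suffices (synthetic data; sklearn's LinearSVC is backed by LIBLINEAR):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# 200 samples, 2000 features: #samples << #features
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000)  # linear kernel via LIBLINEAR
print(cross_val_score(clf, X, y, cv=3).mean())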


  1. When $\Phi(X)$ is infinite-dimensional it cannot be computed explicitly, so the optimization problem is not solvable in that form. ↩︎
