Support Vector Machine (支持向量机)

Vapnik (of the former Soviet Union)
Well suited to prediction when the number of samples is small
Compiled from notes on a Zhejiang University graduate course; link

0. No Free Lunch Theorem

(Just dropping this here.)
If we make no prior assumptions about the feature space, all algorithms perform the same on average.
We assume that samples whose features are close together are more likely to belong to the same class, so machine learning is not learned for nothing!


A Brief Introduction to SVM


(Figure: image from Baidu Baike)

It can be shown that for a linearly separable sample space there are infinitely many lines that can separate the two classes, so which one counts as the best? The line that maximizes the margin $d$, with $d_1 = d_2 = \frac{d}{2}$, is unique and optimal. The sample points lying on the dashed lines in the figure are called support vectors.

Definitions:
① Training data and labels $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where $x_i$ is a vector holding the feature values of sample $i$, and $y_i$ is the label, which in SVM is usually taken as $+1$ or $-1$ (why not 1 and 0? this is explained later).
② A linear model $(\boldsymbol\omega, b)$ represents the hyperplane $\boldsymbol\omega^T\boldsymbol{x}+b=0$.

What do we want to solve for?

Naturally we want to maximize the margin $d$, but that is hard to optimize directly, so we transform it into the following (result first, explanation after):
$$\begin{cases} \min\ \frac{1}{2}\|\boldsymbol{\omega}\|^2 \\ \text{s.t. } y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1 \end{cases}$$
(This is a quadratic programming problem from convex optimization theory.)
Consider the following two facts:
Fact 1: $\boldsymbol\omega^T\boldsymbol{x}+b=0$ and $a\boldsymbol\omega^T\boldsymbol{x}+ab=0\ (a\in \mathbb{R}^+)$ are the same hyperplane, i.e. $(\boldsymbol\omega, b)$ and $(a\boldsymbol\omega, ab)$ are equivalent.
Fact 2: the distance from a point $\boldsymbol{x_0}$ to the hyperplane $(\boldsymbol\omega, b)$ (by analogy with the point-to-line distance) is $d=\frac{|\boldsymbol\omega^T\boldsymbol{x_0}+b|}{\|\boldsymbol\omega\|}$.
Now for the analysis:
Rescale the hyperplane, $(\boldsymbol\omega_0, b_0)\stackrel{a}{\longrightarrow}(\boldsymbol\omega, b)$: there always exists an $a$ such that $|\boldsymbol\omega^T\boldsymbol{x}+b|=1$ on the support vectors, and then $d=\frac{1}{\|\boldsymbol\omega\|}$. Maximizing $d$ is therefore equivalent to $\min\frac{1}{2}\|\boldsymbol{\omega}\|^2$ (the $\frac{1}{2}$ is only there to make differentiation cleaner).
For a non-support vector, $\frac{|\boldsymbol\omega^T\boldsymbol{x}+b|}{\|\boldsymbol\omega\|}=d>\frac{1}{\|\boldsymbol\omega\|}$, so $|\boldsymbol\omega^T\boldsymbol{x}+b|>1$. To drop the absolute value sign, notice that $y_i$ has the same sign as $\boldsymbol\omega^T\boldsymbol{x}+b$, so $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]>1$. (I don't know why the teacher uses [] instead of ().)
For a support vector, $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]=1$.
Combining the two cases, we get the constraint $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]\geq1$.

PS: When the set is not linearly separable, the inequality $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]\geq1$ has no solution.
PPS: In fact, you could require $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]$ to be no less than any positive number; 1 is simply the most convenient choice.
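To make the result above concrete, here is a minimal numerical sketch (assuming NumPy and scikit-learn are available; the toy data is made up) that approximates the hard-margin SVM by using a very large $C$, then checks that the support vectors sit at functional margin 1 and that the geometric margin equals $1/\|\boldsymbol\omega\|$:

```python
# Sketch only: approximate the hard-margin SVM with a huge C on separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, size=(20, 2)),     # class +1 cloud
               rng.normal([-2, -2], 0.3, size=(20, 2))])  # class -1 cloud
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # C -> infinity approximates the hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors should satisfy y_i [w^T x_i + b] = 1 (up to numerical error) ...
print(y[clf.support_] * (X[clf.support_] @ w + b))
# ... and the geometric margin d equals 1 / ||w||.
print("d =", 1.0 / np.linalg.norm(w))
```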

How do we use SVM to deal with non-separable data?

$$\begin{cases} \min\ \frac{1}{2}\|\boldsymbol{\omega}\|^2+C\sum\limits_{i=1}^N\xi_i \\ \text{s.t. } y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1-\xi_i, \quad \xi_i\geq0 \end{cases}$$
We introduce the slack variables $\xi_i$ to make the inequality satisfiable. At the same time, we have to keep $\xi_i$ from getting too big, so we add a regularization term to the objective (regularization terms are very common in machine learning). $C$ is a hyperparameter that we fix in advance. In practice, we usually set an upper bound, a lower bound, and a step, and try every value in that grid to find the best $C$; a sketch of such a search follows below.
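A hedged illustration of that grid search (the bounds, the step, and the synthetic data below are my own arbitrary choices, not values from the course):

```python
# Sketch only: scan C over a log-spaced grid with 5-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": np.logspace(-2, 3, 6)}          # lower bound 1e-2, upper bound 1e3, 6 steps
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best C:", search.best_params_["C"])
```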
One of the most important differences between SVM and other algorithms is how they deal with non-linearly separable data. Other algorithms try to use circles, rectangles, etc. to draw the boundary; SVM tries to find a linear boundary in a higher-dimensional space.
$$\boldsymbol{X}\stackrel{\phi}{\longrightarrow}\phi(\boldsymbol{X})$$
Example:
$$\boldsymbol{X} = \begin{bmatrix} a \\ b \end{bmatrix}\to\phi(\boldsymbol{X}) = \begin{bmatrix} a^2 \\ b^2 \\ a \\ b \\ ab \end{bmatrix}$$
Warning: the dimension of $\boldsymbol{\omega}$ changes too.
It can be shown that the probability of a data set being linearly separable grows as the dimension grows; in an infinite-dimensional space, every data set is linearly separable.
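A tiny sketch of that explicit feature map (my own toy code, not from the course notes):

```python
import numpy as np

def phi(x):
    """Explicit feature map [a, b] -> [a^2, b^2, a, b, ab] from the example above."""
    a, b = x
    return np.array([a**2, b**2, a, b, a * b])

print(phi(np.array([2.0, 3.0])))   # [4. 9. 2. 3. 6.]
# omega now lives in the 5-dimensional feature space as well.
```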

Kernel Function

We do not need to know the explicit expression of $\phi(x)$ if we have a kernel function.
$$K(x_1, x_2) = \phi(x_1)^T\phi(x_2)$$
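As a quick sanity check of this identity (a standard textbook example, not something from the course notes): the degree-2 polynomial kernel $(x_1^Tx_2+1)^2$ in 2D corresponds to an explicit 6-dimensional feature map, and evaluating either side gives the same number:

```python
# Sketch only: verify K(x1, x2) = phi(x1)^T phi(x2) for the degree-2 polynomial kernel.
import numpy as np

def phi(x):
    a, b = x
    return np.array([a**2, b**2, np.sqrt(2)*a*b, np.sqrt(2)*a, np.sqrt(2)*b, 1.0])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((np.dot(x1, x2) + 1) ** 2)      # kernel computed in the input space -> 4.0
print(np.dot(phi(x1), phi(x2)))       # inner product in the feature space -> 4.0
```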
Here are some common kernel functions:

  1. Gaussian kernel: $K(x_1, x_2)=\exp\left(-\frac{\|x_1-x_2\|^2}{2\sigma^2}\right)$
  2. Polynomial kernel: $K(x_1, x_2)=(\gamma x_1^Tx_2+c)^n$
  3. Linear kernel (i.e. no kernel at all): $K(x_1, x_2)=x_1^Tx_2$
  4. Sigmoid kernel: $K(x_1, x_2)=\tanh(\gamma x_1^Tx_2+c)$
  5. Laplace kernel: $K(x_1, x_2) = \exp\left(-\frac{\|x_1-x_2\|}{\sigma}\right)$
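The same kernels written out as plain functions (a sketch; the hyperparameters $\sigma$, $\gamma$, $c$, $n$ are placeholder defaults you would tune):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x1, x2, gamma=1.0, c=1.0, n=2):
    return (gamma * np.dot(x1, x2) + c) ** n

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def sigmoid_kernel(x1, x2, gamma=1.0, c=0.0):
    return np.tanh(gamma * np.dot(x1, x2) + c)

def laplace_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) / sigma)
```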

A valid kernel function must satisfy the following conditions:
① Symmetry: $K(x_1, x_2) = K(x_2, x_1)$
② Positive semi-definiteness: for any coefficients $C_i$ and vectors $\boldsymbol{x_i}$, $\sum\limits_{i=1}^N\sum\limits_{j=1}^NC_iC_jK(\boldsymbol{x_i}, \boldsymbol{x_j})\geq0$
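A quick numerical check of both conditions on random points (a sketch: the Gram matrix of a valid kernel should be symmetric with non-negative eigenvalues):

```python
# Sketch only: check symmetry and positive semi-definiteness of a Gaussian Gram matrix.
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

points = np.random.default_rng(0).normal(size=(10, 3))
K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

print("symmetric:", np.allclose(K, K.T))
print("PSD:", np.all(np.linalg.eigvalsh(K) >= -1e-10))   # eigenvalues >= 0 up to rounding
```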

Use the primal problem and the dual problem to avoid $\phi(x)$

What are the primal problem and the dual problem? Check this
Primal problem:
$$\begin{cases} \text{minimize } f(\boldsymbol\omega) \\ \text{s.t. } g_i(\boldsymbol\omega)\leq0\ (i=1\sim K), \quad h_i(\boldsymbol\omega) = 0\ (i=1\sim N) \end{cases}$$
Dual problem:
$$\begin{cases} \Theta(\boldsymbol\alpha, \boldsymbol\beta) = \min\limits_{\text{all } \boldsymbol\omega}\{L(\boldsymbol\omega, \boldsymbol\alpha, \boldsymbol\beta)\} \\ \text{s.t. } \alpha_i \geq0 \end{cases}$$
where $L(\boldsymbol\omega, \boldsymbol\alpha, \boldsymbol\beta)=f(\boldsymbol\omega)+\sum\limits_{i=1}^K\alpha_ig_i(\boldsymbol\omega)+\sum\limits_{i=1}^N\beta_ih_i(\boldsymbol\omega)$ is the Lagrangian.
We need to find:
$$\begin{cases} \min\ \frac{1}{2}\|\boldsymbol{\omega}\|^2+C\sum\limits_{i=1}^N\xi_i \\ \text{s.t. } y_i[\boldsymbol\omega^T\phi(\boldsymbol{x_i})+b]\geq1-\xi_i, \quad \xi_i\geq0 \end{cases}$$
Treating this as the primal problem, we can write down its dual problem:
$$\begin{cases} \Theta(\boldsymbol\alpha, \boldsymbol\beta) = \min\limits_{\text{all } \boldsymbol\omega,\ \xi_i,\ b} \left\{ \frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum\limits_{i=1}^N\xi_i - \sum\limits_{i=1}^N\beta_i\xi_i + \sum\limits_{i=1}^N\alpha_i\left[1 - \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib\right] \right\} \\ \text{s.t. } \alpha_i \geq0,\ \beta_i \geq 0\end{cases}$$
(Mapping onto the generic primal/dual notation above: the primal variable $\boldsymbol\omega$ corresponds to $(\boldsymbol\omega, \xi_i, b)$, the multipliers $\boldsymbol\alpha$ correspond to $(\alpha_i, \beta_i)$, and the multipliers $\boldsymbol\beta$ correspond to nothing, since there are no equality constraints here.)
(In the generic form, $\alpha$ multiplies the inequality constraints and $\beta$ multiplies the equality constraints.)

Let’s start!
$$L=\frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum\limits_{i=1}^N\xi_i - \sum\limits_{i=1}^N\beta_i\xi_i + \sum\limits_{i=1}^N\alpha_i\left[1 - \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib\right]$$
$$\begin{cases} \frac{\partial L}{\partial\boldsymbol\omega}=0 \\ \frac{\partial L}{\partial\xi_i}=0 \\ \frac{\partial L}{\partial b}=0 \end{cases}$$
Setting these three derivatives to zero, we get:
$$\begin{cases} \boldsymbol\omega=\sum\limits_{i=1}^N\alpha_iy_i\phi(x_i) \\ \alpha_i + \beta_i = C \\ \sum\limits_{i=1}^N\alpha_iy_i=0 \end{cases}$$
Substituting these three equations back in lets us work out $\Theta(\boldsymbol\alpha)$. Remember what we want: to replace $\phi(x_i)$ with $K(x_i,x_j)$.
$$L_{min}=\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + \sum\limits_{i=1}^N\alpha_i -\sum\limits_{i=1}^N\alpha_iy_i\boldsymbol\omega^T \phi(x_i)$$
$$L_{min}= \sum\limits_{i=1}^N\alpha_i + \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_j\phi(x_j)^T\phi(x_i)$$
Attention: $K(x_i,x_j) = \phi(x_i)^T\phi(x_j)$, so
$$L_{min} = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j)$$
Finally, we get our optimize target:
$$\begin{cases} \max\Theta(\boldsymbol\alpha) = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j) \\ \text{s.t. } 0\leq\alpha_i \leq C, \quad \sum\limits_{i=1}^N\alpha_i y_i = 0 \end{cases}$$
This is still a convex optimization problem, but now it is one we can actually solve, because we know what $K$ is. There are different algorithms for solving convex optimization problems; we won't go into them here. SMO is one of the most common.
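For intuition only, here is a brute-force sketch that hands this dual to a generic constrained solver (SciPy's SLSQP) rather than SMO; it is much slower than SMO but shows exactly what is being optimized. The helper name solve_dual is my own:

```python
# Sketch only: maximize Theta(alpha) subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y, kernel, C):
    N = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    YKY = y[:, None] * y[None, :] * K                      # y_i y_j K(x_i, x_j)

    def neg_theta(alpha):                                  # minimize -Theta(alpha)
        return 0.5 * alpha @ YKY @ alpha - alpha.sum()

    res = minimize(neg_theta, np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,                            # 0 <= alpha_i <= C
                   constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0
    return res.x
```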

Evaluating $\boldsymbol\omega$ and $b$

$\boldsymbol\omega$:
Remember how we predict $y$ for a new sample $\boldsymbol{x}$:
if $\boldsymbol\omega^T\phi(\boldsymbol{x})+b\geq0$, then $y = 1$
if $\boldsymbol\omega^T\phi(\boldsymbol{x})+b<0$, then $y = -1$
So we never need the explicit value of $\boldsymbol\omega$; we only need $\boldsymbol\omega^T\phi(\boldsymbol{x})$, and since $\boldsymbol\omega=\sum_i\alpha_iy_i\phi(x_i)$ this can be computed with the kernel alone:
$$\boldsymbol\omega^T\phi(\boldsymbol{x}) = \sum\limits_{i=1}^N\alpha_iy_iK(x_i,\boldsymbol{x})$$

$b$:
We need the KKT condition (complementary slackness): $\forall i=1\sim K$, $\alpha_i=0$ or $g_i(\boldsymbol\omega) = 0$. For our problem this means: either $\beta_i = 0$ or $\xi_i = 0$, and either $\alpha_i = 0$ or $1 - \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib = 0$.
Pick any $i$ with $0<\alpha_i<C$ (so $\alpha_i\neq0$ and $\beta_i=C-\alpha_i\neq0$); then:
$$\begin{cases} \xi_i = 0\\ 1-\xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib = 0 \end{cases} \quad\Rightarrow\quad b = \frac{1 - y_i\sum\limits_{j=1}^N\alpha_jy_jK(x_j,x_i)}{y_i}$$
In practice, we often pick several such $\alpha_i$ and use the mean of the resulting $b$ values as the final parameter, as sketched below.
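A sketch of that averaging step (solve_b is my own helper name; alpha is assumed to come from a dual solver such as the one sketched earlier):

```python
# Sketch only: recover b by averaging over support vectors with 0 < alpha_i < C.
import numpy as np

def solve_b(alpha, X, y, kernel, C, tol=1e-6):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    idx = np.where((alpha > tol) & (alpha < C - tol))[0]          # 0 < alpha_i < C
    bs = [(1 - y[i] * np.sum(alpha * y * K[:, i])) / y[i] for i in idx]
    return float(np.mean(bs))
```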


Summary

① Train the model

  • input $\{(x_i,y_i)\}_{i=1\sim N}$
  • solve the optimization problem (with SMO, etc.): $$\begin{cases} \max\Theta(\boldsymbol\alpha) = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j) \\ \text{s.t. } 0\leq\alpha_i \leq C, \quad \sum\limits_{i=1}^N\alpha_i y_i = 0 \end{cases}$$
  • evaluate b

② Test the model

  • input x

$$\begin{cases} \text{if } \sum\limits_{i=1}^N\alpha_iy_iK(x_i,\boldsymbol{x})+b\geq0, \text{ then } y = 1\\ \text{if } \sum\limits_{i=1}^N\alpha_iy_iK(x_i,\boldsymbol{x})+b<0, \text{ then } y = -1 \end{cases}$$

  • output y (a code sketch of this decision rule follows below)
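Putting the test step into code (a sketch consistent with the training sketches above; predict is my own name):

```python
# Sketch only: classify a new point x using the kernel, the dual coefficients alpha, and b.
import numpy as np

def predict(x, X_train, y_train, alpha, b, kernel):
    score = sum(alpha[i] * y_train[i] * kernel(X_train[i], x)
                for i in range(len(y_train))) + b
    return 1 if score >= 0 else -1
```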