SVM (Support Vector Machine Derivation)

Background
1. Hyperplane Definition

A hyperplane can be defined by two vectors $\bold{w}=\begin{pmatrix} -b \\ -a \\ 1 \end{pmatrix},\; \bold{x}=\begin{pmatrix} 1 \\ x \\ y \end{pmatrix}$, because

  • A hyperplane (a line in the 2D case) can be written as $y=ax+b \implies y-ax-b=0$
  • And we have $\bold{w}^T\bold{x}=-b \cdot 1+(-a)\cdot x+1\cdot y=y-ax-b$
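
A quick numeric check of this identity (a minimal sketch; the line parameters $a$, $b$ and the test point below are arbitrary):

```python
import numpy as np

# Hypothetical line y = 2x + 1, i.e. a = 2, b = 1, and a test point (x, y).
a, b = 2.0, 1.0
x, y = 3.0, 5.0

w = np.array([-b, -a, 1.0])    # (-b, -a, 1)
x_aug = np.array([1.0, x, y])  # (1, x, y)

# w^T x equals y - a*x - b; it is 0 exactly when (x, y) lies on the line.
print(w @ x_aug, y - a * x - b)  # -2.0 -2.0
```
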
2. Vector subtraction

  (figure: vector subtraction)

3. Dot product to calculate projection length
  • Vector dot product: $\bold{a} \cdot \bold{b}=|\bold{a}| \times |\bold{b}|\times\cos(\theta)$
  • Length of $\bold{a}$'s projection onto $\bold{b}$: $|\bold{a}| \times\cos(\theta)=\frac{\bold{a} \cdot \bold{b}}{|\bold{b}|}$
    (figure: dot product projection)
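
A small numeric example of computing a projection length with the dot product (the vectors are arbitrary):

```python
import numpy as np

# Arbitrary example vectors.
a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# Length of a's projection onto b: (a . b) / |b|
proj_len = a @ b / np.linalg.norm(b)
print(proj_len)  # 3.0, since a's component along the x-axis is 3
```
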
4. The distance from a point to a line
  • $d=\frac{|Ax_0+By_0+C|}{\sqrt{A^2+B^2}}$ is the distance from the point $(x_0, y_0)$ to the line $Ax+By+C=0$
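
A quick check of this formula (the line and the point are arbitrary examples):

```python
import numpy as np

# Line Ax + By + C = 0, here x + y - 2 = 0, and a point (x0, y0).
A, B, C = 1.0, 1.0, -2.0
x0, y0 = 2.0, 2.0

d = abs(A * x0 + B * y0 + C) / np.sqrt(A**2 + B**2)
print(d)  # ~1.414, i.e. sqrt(2)
```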

What are Support Vector Machines
  • Intuition: The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N - the number of features) that distinctly classifies the data points (the hyperplane with maximum margin).
  • Decision boundaries: hyperplanes that help classify the data points.
  • Support vectors
    • Data points that lie closest to the hyperplane and influence its position and orientation.
    • Using these support vectors, we maximize the margin of the classifier.
    • Deleting the support vectors will change the position of the hyperplane.

Objective function

$$\min_{\bold{w},b}\frac{||\bold{w}||^2}{2}, \quad \text{s.t. }\; y_i(\bold{w}^T\bold{x}_i+b)\ge1,\quad i=1,2,\dots, m$$

  • With the above objective, we can find the hyperplane $(\bold{w}, b)$ with the largest margin, where $\bold{w}$ is the weight vector and $b$ is the bias of the hyperplane.
  • Assume the hyperplane $(\bold{w}, b)$ can classify all the data correctly, i.e. for any $(\bold{x}_i,y_i)\in D$, $\bold{w}^T\bold{x}_i+b\gt 0$ if $y_i=+1$ and $\bold{w}^T\bold{x}_i+b\lt 0$ if $y_i=-1$. By rescaling $\bold{w}$ and $b$, we can let
    $$\begin{cases} \bold{w}^T\bold{x}_i+b\ge 1, &y_i=+1 \\ \bold{w}^T\bold{x}_i+b\le -1, &y_i=-1 \end{cases}$$
  • For the two support vectors (the closest data points, which lie on the margin boundaries), we have $\bold{w}^T\bold{x}^++b=+1,\ \bold{w}^T\bold{x}^-+b=-1$ (eq. 1)
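
As a concrete illustration of the constraint $y_i(\bold{w}^T\bold{x}_i+b)\ge 1$, here is a minimal sketch on a toy 2D dataset (both the data and the hyperplane $(\bold{w}, b)$ below are hand-picked for illustration, not learned):

```python
import numpy as np

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0],    # positive class (y = +1)
              [0.0, 0.0], [-1.0, 0.0]])  # negative class (y = -1)
y = np.array([1, 1, -1, -1])

# Hand-picked hyperplane w^T x + b = 0 (for illustration only).
w = np.array([0.5, 0.5])
b = -1.0

# Functional margins y_i (w^T x_i + b): all >= 1 means the constraints hold,
# and the points where it equals exactly 1 lie on the margin boundaries.
margins = y * (X @ w + b)
print(margins)               # [1.  2.  1.  1.5]
print(np.all(margins >= 1))  # True
```
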
1. Proof with dot product

Calculate margin

  • $\hat{\bold{w}}=\frac{\bold{w}}{||\bold{w}||}$ is the unit vector that is orthogonal to the hyperplane.

    • Why is $\hat{\bold{w}}$ orthogonal to $\bold{w}^T\bold{x}+b=0$? For any two points $\bold{x}_1, \bold{x}_2$ on the hyperplane, $\bold{w}^T(\bold{x}_1-\bold{x}_2)=0$, so $\bold{w}$ is orthogonal to every direction lying in the hyperplane. (This can also be checked directly for a line $y=kx+b$ in the 2D plane.)
  • Then use the dot product to calculate the projected length of $\bold{c}$ along $\hat{\bold{w}}$: $\text{margin}=||\bold{c}||\times \cos\theta=\frac{\bold{c}\cdot \bold{w}}{||\bold{w}||}=\bold{c}\cdot \hat{\bold{w}}=(\bold{x}^+-\bold{x}^-)\cdot \hat{\bold{w}}=(\bold{x}^+-\bold{x}^-)\cdot \frac{\bold{w}}{||\bold{w}||}=\bold{x}^+\cdot \frac{\bold{w}}{||\bold{w}||} -\bold{x}^-\cdot \frac{\bold{w}}{||\bold{w}||}$

    • $\bold{c}$ is the vector starting at $\bold{x}^-$ and ending at $\bold{x}^+$
  • With (eq. 1), we have $\text{margin}=\bold{x}^+\cdot \frac{\bold{w}}{||\bold{w}||} -\bold{x}^-\cdot \frac{\bold{w}}{||\bold{w}||}=\frac{1-b}{||\bold{w}||} -\frac{-1-b}{||\bold{w}||}=\frac{2}{||\bold{w}||}$

  • All in one
    (figure: the full margin derivation in one picture)
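
Continuing the toy example above (same hand-picked $\bold{w}$, $b$; $\bold{x}^+=(2,2)$ and $\bold{x}^-=(0,0)$ are the two points lying exactly on the margin boundaries), the projection of $\bold{c}=\bold{x}^+-\bold{x}^-$ onto $\hat{\bold{w}}$ indeed equals $\frac{2}{||\bold{w}||}$:

```python
import numpy as np

w = np.array([0.5, 0.5])
b = -1.0
x_pos = np.array([2.0, 2.0])  # support vector with w^T x + b = +1
x_neg = np.array([0.0, 0.0])  # support vector with w^T x + b = -1

# Margin as the projection of c = x+ - x- onto the unit normal w/||w||.
c = x_pos - x_neg
margin_proj = c @ (w / np.linalg.norm(w))

print(margin_proj)            # ~2.828
print(2 / np.linalg.norm(w))  # ~2.828, i.e. 2/||w||
```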

2. Proof with distance formula
  • Using the formula for the distance from a point to a line:
    $\text{margin}=r^++r^-=\frac{|\bold{w}^T\bold{x}^++b|}{||\bold{w}||}+\frac{|\bold{w}^T\bold{x}^-+b|}{||\bold{w}||}$
  • With (eq. 1), we have $\text{margin}=r^++r^-=\frac{|\bold{w}^T\bold{x}^++b|}{||\bold{w}||}+\frac{|\bold{w}^T\bold{x}^-+b|}{||\bold{w}||}=\frac{|1|}{||\bold{w}||}+\frac{|-1|}{||\bold{w}||}=\frac{2}{||\bold{w}||}$
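
The same margin can be recovered from the point-to-line distances $r^+$ and $r^-$ (same toy $\bold{w}$, $b$ and support vectors as in the sketch above):

```python
import numpy as np

w = np.array([0.5, 0.5])
b = -1.0
x_pos = np.array([2.0, 2.0])
x_neg = np.array([0.0, 0.0])

# r+ and r-: distances from the two support vectors to the hyperplane.
r_pos = abs(w @ x_pos + b) / np.linalg.norm(w)
r_neg = abs(w @ x_neg + b) / np.linalg.norm(w)

print(r_pos + r_neg)          # ~2.828
print(2 / np.linalg.norm(w))  # ~2.828, matching the dot-product derivation
```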

Our objective function is then: $\max\frac{2}{||\bold{w}||}\implies\max\frac{1}{||\bold{w}||}\implies\min||\bold{w}||\implies\min\frac{||\bold{w}||^2}{2}$. Squaring the norm and adding the factor $\frac{1}{2}$ do not change the minimizer; they just make the objective smooth and keep its gradient simple.


Soft Margin SVM
  • Margin: the distance of the closest vectors (the support vectors) from the hyperplane.
    • Hard margin: all data are separated correctly
    • Soft margin: allow some margin violation to occur
      (figure: hard margin vs. soft margin)
  • It is not always possible to classify all the data points correctly with a hyperplane, so we need to tolerate some misclassified points that do not satisfy the constraint $y_i(\bold{w}^T\bold{x}_i+b)\ge 1$ (eq. 2). We call this soft margin SVM.
Objective function of soft margin SVM

$$\min_{\bold{w},b}\frac{||\bold{w}||^2}{2}+C\sum_{i=1}^m\ell_{0/1}\big(y_i(\bold{w}^T\bold{x}_i+b)-1\big),\quad \ell_{0/1}(z)=\begin{cases} 1, &\text{if } z\lt 0; \\ 0, &\text{otherwise.} \end{cases}$$

  • When $C=+\infty$, the objective forces all the data points to satisfy constraint (eq. 2). With a finite $C$, the objective tolerates some misclassification.
  • Because $\ell_{0/1}$ is neither convex nor continuous, we usually replace $\ell_{0/1}$ with another function, called a "surrogate loss":
    • hinge loss: $\ell_{hinge}(z)=\max(0,1-z)$
    • exponential loss: $\ell_{exp}(z)=\exp(-z)$
    • logistic loss: $\ell_{log}(z)=\log (1+\exp{(-z)})$
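
A small numeric sketch of the 0/1 loss and the three surrogate losses (the sample values of $z = y_i(\bold{w}^T\bold{x}_i+b)$ are made up for illustration):

```python
import numpy as np

def loss_01(z):
    """0/1 loss: 1 if z < 0 else 0."""
    return (z < 0).astype(float)

def loss_hinge(z):
    """Hinge loss: max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def loss_exp(z):
    """Exponential loss: exp(-z)."""
    return np.exp(-z)

def loss_log(z):
    """Logistic loss: log(1 + exp(-z))."""
    return np.log1p(np.exp(-z))

# Arbitrary sample values of z = y_i (w^T x_i + b).
z = np.array([-2.0, 0.0, 0.5, 1.0, 2.0])

print(loss_01(z - 1))  # [1. 1. 1. 0. 0.]  (the form used in the objective above)
print(loss_hinge(z))   # [3.  1.  0.5 0.  0. ]
print(loss_exp(z))     # [7.389 1.    0.607 0.368 0.135]
print(loss_log(z))     # [2.127 0.693 0.474 0.313 0.127]
```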

Kernel Tricks - TBD

If data points are not separable in a low-dimensional space, use a kernel function to (implicitly) map them to a higher-dimensional space where they become linearly separable.
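A minimal sketch of the idea with an explicit feature map (the 1-D data and the map $\varphi(x)=(x, x^2)$ are made up for illustration; a real kernel method avoids computing $\varphi$ explicitly and works with inner products $k(x, x')=\varphi(x)\cdot\varphi(x')$ instead):

```python
import numpy as np

# 1-D data that is NOT linearly separable on the real line:
# the negative class sits between the two positive clusters.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Explicit feature map to 2-D: phi(x) = (x, x^2).
phi = np.stack([x, x**2], axis=1)

# In the lifted space the classes are separated by the horizontal line
# x2 = 2, e.g. w = (0, 1), b = -2 gives y_i (w^T phi_i + b) > 0 for all i.
w, b = np.array([0.0, 1.0]), -2.0
print(y * (phi @ w + b))  # all positive -> linearly separable after the map
```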


Solve objective function - TBD
  • Lagrange Multiplier

Reference: A Top Machine Learning Algorithm Explained: Support Vector Machines (SVMs)
