SVM (Support Vector Machine Derivation)
Background
1. Hyperplane Definition
A hyperplane can be defined by two vectors $\bold{w}=\begin{pmatrix} -b \\ -a \\ 1 \end{pmatrix},\; \bold{x}=\begin{pmatrix} 1 \\ x \\ y \end{pmatrix}$, because
- A hyperplane can be defined as $y=ax+b \implies y-ax-b=0$
- And we have $\bold{w}^T\bold{x}=-b\cdot 1+(-a)\cdot x+1\cdot y=y-ax-b$
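A quick numeric check of this identity (a minimal sketch using NumPy; the line $y=2x+3$ and the sample point are arbitrary made-up values):

```python
import numpy as np

# Hypothetical line y = 2x + 3, i.e. a = 2, b = 3.
a, b = 2.0, 3.0
w = np.array([-b, -a, 1.0])   # w = (-b, -a, 1)

# A sample point (x, y), written in homogeneous form x = (1, x, y).
x, y = 1.5, 2.0
xh = np.array([1.0, x, y])

# w^T x equals y - a*x - b, so its sign tells which side of the line we are on.
assert np.isclose(w @ xh, y - a * x - b)
print(w @ xh)   # -4.0 -> the point lies below the line
```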
2. Vector subtraction: $\bold{a}-\bold{b}$ is the vector pointing from the tip of $\bold{b}$ to the tip of $\bold{a}$ (used later for the vector $\bold{c}=\bold{x}^+-\bold{x}^-$ between the two support vectors).
3. Dot product to calculate projection length
- Vector dot product: $\bold{a} \cdot \bold{b}=|\bold{a}| \times |\bold{b}|\times\cos(\theta)$
- $\text{Length }(\bold{a}\text{'s projection on }\bold{b})=|\bold{a}| \times\cos(\theta)=\frac{\bold{a} \cdot \bold{b}}{|\bold{b}|}$
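As a sanity check, a short NumPy sketch (the vectors are arbitrary choices) computing the projection length both ways:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])   # project a onto b

# |a| * cos(theta), computed via the angle ...
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
length_via_angle = np.linalg.norm(a) * cos_theta

# ... and directly via (a . b) / |b|.
length_via_dot = (a @ b) / np.linalg.norm(b)

assert np.isclose(length_via_angle, length_via_dot)   # both give 3.0
```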
4. The distance from a point to a line
- $d=\frac{|Ax_0+By_0+C|}{\sqrt{A^2+B^2}}$, for the point $(x_0, y_0)$ and the line $Ax+By+C=0$.
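A direct translation into Python (a minimal sketch; the line $2x-y+1=0$ and the point are made-up examples):

```python
import math

def point_line_distance(x0: float, y0: float, A: float, B: float, C: float) -> float:
    """Distance from the point (x0, y0) to the line Ax + By + C = 0."""
    return abs(A * x0 + B * y0 + C) / math.sqrt(A**2 + B**2)

# Distance from (1, 1) to 2x - y + 1 = 0: |2 - 1 + 1| / sqrt(5) ≈ 0.894
print(point_line_distance(1.0, 1.0, A=2.0, B=-1.0, C=1.0))
```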
What is Support Vector Machines
- Intuition: The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N - the number of features) that distinctly classifies the data points (the hyperplane with maximum margin).
- Decision boundaries: hyperplanes that help classify the data points.
- Support vectors
- Data points that are closest to the hyperplane; they influence the position and orientation of the hyperplane.
- Using these support vectors, we maximize the margin of the classifier.
- Deleting the support vectors would change the position of the hyperplane (deleting other points would not), as the sketch below demonstrates.
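To make this concrete, a short sketch using scikit-learn (assuming scikit-learn is available; the toy data is made up). It fits a linear SVM, drops one non-support point, refits, and checks that the hyperplane is unchanged:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made-up example).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 0.5],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ≈ hard margin
print("support vector indices:", clf.support_)

# Remove one point that is NOT a support vector and refit.
non_sv = next(i for i in range(len(X)) if i not in clf.support_)
mask = np.arange(len(X)) != non_sv
clf2 = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])

# The hyperplane (w, b) is essentially unchanged.
print(np.allclose(clf.coef_, clf2.coef_), np.allclose(clf.intercept_, clf2.intercept_))
```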
Objective function
$$\min_{\bold{w},b}\frac{||\bold{w}||^2}{2}, \quad \text{s.t. } y_i(\bold{w}^T\bold{x_i}+b)\ge1,\quad i=1,2,\dots, m$$
- With the above function, we can find the hyperplane $(\bold{w}, b)$ with the largest margin, where $\bold{w}$ is the weight vector and $b$ is the bias of the hyperplane.
- Assume the hyperplane $(\bold{w}, b)$ can classify all the data correctly, i.e. for any $(\bold{x_i},y_i)\in D$, $\bold{w}^T\bold{x_i}+b\gt 0$ if $y_i=+1$ and $\bold{w}^T\bold{x_i}+b\lt 0$ if $y_i=-1$. Since rescaling $\bold{w}$ and $b$ does not change the hyperplane, we can let
$$\begin{cases} \bold{w}^T\bold{x_i}+b\ge 1, &y_i=+1 \\ \bold{w}^T\bold{x_i}+b\le -1, &y_i=-1 \end{cases}$$
- For the two support vectors (the closest data points, lying on the margin lines), we have $\bold{w}^T\bold{x^+}+b=+1,\; \bold{w}^T\bold{x^-}+b=-1$ (eq. 1)
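A quick numeric illustration of these constraints (the hyperplane $\bold{w}=(1,1)^T$, $b=-3$ and the points are made-up values):

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 with w = (1, 1), b = -3.
w, b = np.array([1.0, 1.0]), -3.0

X = np.array([[0.0, 1.0], [1.0, 1.0],    # labeled y = -1
              [2.0, 3.0], [3.0, 3.0]])   # labeled y = +1
y = np.array([-1, -1, 1, 1])

margins = y * (X @ w + b)    # y_i (w^T x_i + b)
print(margins)               # [2. 1. 2. 3.]
print(np.all(margins >= 1))  # True: every constraint is satisfied
# The point (1, 1) attains equality (margin exactly 1), so it lies on a
# margin line: it would be a support vector.
```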
1. Proof with dot product
- $\hat{\bold{w}}=\frac{\bold{w}}{||\bold{w}||}$ is the unit vector orthogonal to the hyperplane.
- Why is $\hat{\bold{w}}$ orthogonal to $\bold{w}^T\bold{x}+b=0$? For any two points $\bold{x_1}, \bold{x_2}$ on the hyperplane, $\bold{w}^T(\bold{x_1}-\bold{x_2})=(\bold{w}^T\bold{x_1}+b)-(\bold{w}^T\bold{x_2}+b)=0-0=0$, so $\bold{w}$ is orthogonal to every vector lying in the hyperplane. (For a 2D line $y=kx+b$ this can also be checked directly.)
- Let $\bold{c}=\bold{x}^+-\bold{x}^-$ be the vector starting at $\bold{x}^-$ and ending at $\bold{x}^+$. Then use the dot product to calculate the projected length of $\bold{c}$ along $\hat{\bold{w}}$:
$$\text{margin}=||\bold{c}||\times \cos\theta=\frac{\bold{c}\cdot \bold{w}}{||\bold{w}||}=\bold{c}\cdot \hat{\bold{w}}=(\bold{x}^+-\bold{x}^-)\cdot \hat{\bold{w}}=\bold{x}^+\cdot \frac{\bold{w}}{||\bold{w}||} -\bold{x}^-\cdot \frac{\bold{w}}{||\bold{w}||}$$
- With (eq. 1), we have $\text{margin}=\bold{x}^+\cdot \frac{\bold{w}}{||\bold{w}||} -\bold{x}^-\cdot \frac{\bold{w}}{||\bold{w}||}=\frac{1-b}{||\bold{w}||} -\frac{-1-b}{||\bold{w}||}=\frac{2}{||\bold{w}||}$
- All in one (figure of the full derivation; image omitted)
2. Proof with distance formula
- Using the formula for the distance from a point to a line:
$$\text{margin}=r^++r^-=\frac{|\bold{w}^T\bold{x}^++b|}{||\bold{w}||}+\frac{|\bold{w}^T\bold{x}^-+b|}{||\bold{w}||}$$
- With (eq. 1), we have $\text{margin}=\frac{|+1|}{||\bold{w}||}+\frac{|-1|}{||\bold{w}||}=\frac{2}{||\bold{w}||}$
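Both proofs give the same quantity; a quick NumPy check on a hypothetical hyperplane and support vectors (values made up to satisfy eq. 1):

```python
import numpy as np

# Hypothetical hyperplane and support vectors chosen to satisfy (eq. 1).
w, b = np.array([1.0, 1.0]), -3.0
x_pos = np.array([2.0, 2.0])   # w.x + b = +1
x_neg = np.array([1.0, 1.0])   # w.x + b = -1

norm_w = np.linalg.norm(w)

# Proof 1: project c = x+ - x- onto the unit normal w / ||w||.
margin_dot = (x_pos - x_neg) @ (w / norm_w)

# Proof 2: sum the two point-to-hyperplane distances.
margin_dist = abs(w @ x_pos + b) / norm_w + abs(w @ x_neg + b) / norm_w

assert np.isclose(margin_dot, margin_dist) and np.isclose(margin_dot, 2 / norm_w)
print(margin_dot)   # 2 / sqrt(2) ≈ 1.414
```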
Our objective function is then: $\max\frac{2}{||\bold{w}||}\implies\max\frac{1}{||\bold{w}||}\implies\min||\bold{w}||\implies\min\frac{||\bold{w}||^2}{2}$ (squaring removes the square root and the factor $\frac{1}{2}$ gives a cleaner gradient; neither changes the minimizer).
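Since the objective is a convex quadratic program, one way to solve it numerically is to hand it to a generic convex solver. A minimal sketch with CVXPY (assuming the cvxpy package and its default QP solver are available; the toy data is made up):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (made-up example).
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()

# min ||w||^2 / 2   s.t.   y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 2 / np.linalg.norm(w.value))
```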
Soft Margin SVM
- Margin: the distance of the closest vectors from the hyperplane.
- Hard margin: all data points must be separated correctly.
- Soft margin: some margin violations are allowed.
- It’s not always feasible to classify all the data points correctly with a hyperplane, so we need to tolerate some misclassified points that do not satisfy the restriction $y_i(\bold{w}^T\bold{x_i}+b)\ge 1$ (eq. 2). We call this soft margin SVM.
Objective function of soft margin SVM
$$\min_{\bold{w},b}\frac{||\bold{w}||^2}{2}+C\sum_{i=1}^m\ell_{0/1}\big(y_i(\bold{w}^T\bold{x_i}+b)-1\big),\quad \ell_{0/1}(z)=\begin{cases} 1, &\text{if } z\lt 0; \\ 0, &\text{otherwise.} \end{cases}$$
- When $C=+\infty$, the objective function forces all the data points to satisfy the restriction (eq. 2); for finite $C$, the function tolerates some misclassifications.
- Because $\ell_{0/1}$ is neither convex nor continuous, we usually replace it with another function, called a "surrogate loss"; common choices are listed below (a hinge-loss training sketch follows the list).
- hinge loss: $\ell_{hinge}(z)=\max(0,1-z)$
- exponential loss: $\ell_{exp}(z)=\exp(-z)$
- logistic loss: $\ell_{\log}(z)=\log (1+\exp(-z))$
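Substituting the hinge loss gives $\min_{\bold{w},b}\frac{||\bold{w}||^2}{2}+C\sum_{i=1}^m\max(0, 1-y_i(\bold{w}^T\bold{x_i}+b))$, which can be minimized directly by subgradient descent. A minimal NumPy sketch (the learning rate, $C$, epoch count, and toy data are arbitrary choices):

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Subgradient descent on ||w||^2/2 + C * sum(max(0, 1 - y(w.x + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1                 # points with nonzero hinge loss
        # Subgradient: -y_i x_i (and -y_i for b) for each violating point.
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up toy data: two noisy clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
w, b = train_soft_margin_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```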
Kernel Tricks (TBD)
If data points are not separable in a low-dimensional space, use a kernel function to map them to a higher-dimensional space.
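Although this section is TBD, a quick illustration with scikit-learn (assuming scikit-learn is available): the classic concentric-circles dataset is not linearly separable in 2D, but an RBF kernel separates it:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # near chance (~0.5)
print("rbf kernel accuracy:   ", rbf.score(X, y))     # near 1.0
```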
Solving the objective function (TBD)
- Lagrange Multiplier
Reference: A Top Machine Learning Algorithm Explained: Support Vector Machines (SVMs)