Support Vector Machine (支持向量机)

Vapnik (of the former Soviet Union)
Well suited to prediction when the number of samples is small
Compiled from notes on a Zhejiang University graduate course; link

0. No Free Lunch Theorem

(Just dropping this here.)
If we make no prior assumptions about the feature space, all algorithms perform the same on average.
We assume that samples whose features are close together are more likely to belong to the same class, so machine learning is not learned for nothing!


A Brief Introduction to SVM


(Figure: image from Baidu Baike)

It can be shown that for a linearly separable sample space there are infinitely many lines that can separate the two classes, so which one counts as the best? The line that maximizes the margin $d$, with $d_1 = d_2 = \frac{d}{2}$, is unique and optimal. The sample points lying on the dashed lines in the figure are called support vectors.

Definitions:
① Training data and labels $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where $x_i$ is a vector holding the feature values of sample $i$, and $y_i$ is the label, which in SVM is usually taken as $+1$ or $-1$ (why not 1 and 0? this is explained later).
② A linear model $(\boldsymbol\omega, b)$ represents the hyperplane $\boldsymbol\omega^T\boldsymbol{x}+b=0$.

What do we want to solve for?

Naturally we want to maximize the margin $d$, but that is hard to optimize directly, so we transform it into the following (result first, explanation after):
$$\begin{cases} \min\ \frac{1}{2}\|\boldsymbol{\omega}\|^2 \\ \text{s.t. } y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1 \end{cases}$$
(This is a quadratic programming problem from convex optimization theory.)
Consider the following two facts:
Fact 1: $\boldsymbol\omega^T\boldsymbol{x}+b=0$ and $a\boldsymbol\omega^T\boldsymbol{x}+ab=0\ (a\in \mathbb{R}^+)$ are the same hyperplane, i.e. $(\boldsymbol\omega, b)$ and $(a\boldsymbol\omega, ab)$ are equivalent.
Fact 2: the distance from a point $\boldsymbol{x_0}$ to the hyperplane $(\boldsymbol\omega, b)$ (by analogy with the point-to-line distance) is $d=\frac{|\boldsymbol\omega^T\boldsymbol{x_0}+b|}{\|\boldsymbol\omega\|}$.
Now for the analysis:
Rescale the hyperplane, $(\boldsymbol\omega_0, b_0)\stackrel{a}{\longrightarrow}(\boldsymbol\omega, b)$: there always exists an $a$ such that $|\boldsymbol\omega^T\boldsymbol{x}+b|=1$ on the support vectors, and then $d=\frac{1}{\|\boldsymbol\omega\|}$. Maximizing $d$ is therefore equivalent to $\min\frac{1}{2}\|\boldsymbol{\omega}\|^2$ (the $\frac{1}{2}$ is only there to make differentiation cleaner).
For a non-support vector, $\frac{|\boldsymbol\omega^T\boldsymbol{x}+b|}{\|\boldsymbol\omega\|}=d>\frac{1}{\|\boldsymbol\omega\|}$, so $|\boldsymbol\omega^T\boldsymbol{x}+b|>1$. To drop the absolute value sign, notice that $y_i$ has the same sign as $\boldsymbol\omega^T\boldsymbol{x}+b$, so $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]>1$. (I don't know why the teacher uses [] instead of ().)
For a support vector, $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]=1$.
Combining the two cases, we get the constraint $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]\geq1$.

PS: When the set is not linearly separable, the inequality $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]\geq1$ has no solution.
PPS: In fact, you could require $y_i[\boldsymbol\omega^T\boldsymbol{x}+b]$ to be no less than any positive number; 1 is simply the most convenient choice.
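To make the result above concrete, here is a minimal numerical sketch (assuming NumPy and scikit-learn are available; the toy data is made up) that approximates the hard-margin SVM by using a very large $C$, then checks that the support vectors sit at functional margin 1 and that the geometric margin equals $1/\|\boldsymbol\omega\|$:

```python
# Sketch only: approximate the hard-margin SVM with a huge C on separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, size=(20, 2)),     # class +1 cloud
               rng.normal([-2, -2], 0.3, size=(20, 2))])  # class -1 cloud
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # C -> infinity approximates the hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors should satisfy y_i [w^T x_i + b] = 1 (up to numerical error) ...
print(y[clf.support_] * (X[clf.support_] @ w + b))
# ... and the geometric margin d equals 1 / ||w||.
print("d =", 1.0 / np.linalg.norm(w))
```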

How do we use SVM to deal with non-separable data?

$$\begin{cases} \min\ \frac{1}{2}\|\boldsymbol{\omega}\|^2+C\sum\limits_{i=1}^N\xi_i \\ \text{s.t. } y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1-\xi_i, \quad \xi_i\geq0 \end{cases}$$
We introduce the slack variables $\xi_i$ to make the inequality satisfiable. At the same time, we have to keep $\xi_i$ from getting too big, so we add a regularization term to the objective (regularization terms are very common in machine learning). $C$ is a hyperparameter that we fix in advance. In practice, we usually set an upper bound, a lower bound, and a step, and try every value in that grid to find the best $C$; a sketch of such a search follows below.
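A hedged illustration of that grid search (the bounds, the step, and the synthetic data below are my own arbitrary choices, not values from the course):

```python
# Sketch only: scan C over a log-spaced grid with 5-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": np.logspace(-2, 3, 6)}          # lower bound 1e-2, upper bound 1e3, 6 steps
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best C:", search.best_params_["C"])
```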
One of the most important differences between SVM and other algorithms is how they deal with non-linearly separable data. Other algorithms try to use circles, rectangles, etc. to draw the boundary; SVM tries to find a linear boundary in a higher-dimensional space.
$$\boldsymbol{X}\stackrel{\phi}{\longrightarrow}\phi(\boldsymbol{X})$$
Example:
$$\boldsymbol{X} = \begin{bmatrix} a \\ b \end{bmatrix}\to\phi(\boldsymbol{X}) = \begin{bmatrix} a^2 \\ b^2 \\ a \\ b \\ ab \end{bmatrix}$$
Warning: the dimension of $\boldsymbol{\omega}$ changes too.
It can be shown that the probability of a data set being linearly separable grows as the dimension grows; in an infinite-dimensional space, every data set is linearly separable.
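A tiny sketch of that explicit feature map (my own toy code, not from the course notes):

```python
import numpy as np

def phi(x):
    """Explicit feature map [a, b] -> [a^2, b^2, a, b, ab] from the example above."""
    a, b = x
    return np.array([a**2, b**2, a, b, a * b])

print(phi(np.array([2.0, 3.0])))   # [4. 9. 2. 3. 6.]
# omega now lives in the 5-dimensional feature space as well.
```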

Kernel Function

We do not need to know the explicit expression of $\phi(x)$ if we have a kernel function.
$$K(x_1, x_2) = \phi(x_1)^T\phi(x_2)$$
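As a quick sanity check of this identity (a standard textbook example, not something from the course notes): the degree-2 polynomial kernel $(x_1^Tx_2+1)^2$ in 2D corresponds to an explicit 6-dimensional feature map, and evaluating either side gives the same number:

```python
# Sketch only: verify K(x1, x2) = phi(x1)^T phi(x2) for the degree-2 polynomial kernel.
import numpy as np

def phi(x):
    a, b = x
    return np.array([a**2, b**2, np.sqrt(2)*a*b, np.sqrt(2)*a, np.sqrt(2)*b, 1.0])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((np.dot(x1, x2) + 1) ** 2)      # kernel computed in the input space -> 4.0
print(np.dot(phi(x1), phi(x2)))       # inner product in the feature space -> 4.0
```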
Here are some common kernel functions:

  1. Gaussian kernel: $K(x_1, x_2)=\exp\left(-\frac{\|x_1-x_2\|^2}{2\sigma^2}\right)$
  2. Polynomial kernel: $K(x_1, x_2)=(\gamma x_1^Tx_2+c)^n$
  3. Linear kernel (i.e. no kernel at all): $K(x_1, x_2)=x_1^Tx_2$
  4. Sigmoid kernel: $K(x_1, x_2)=\tanh(\gamma x_1^Tx_2+c)$
  5. Laplace kernel: $K(x_1, x_2) = \exp\left(-\frac{\|x_1-x_2\|}{\sigma}\right)$
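The same kernels written out as plain functions (a sketch; the hyperparameters $\sigma$, $\gamma$, $c$, $n$ are placeholder defaults you would tune):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x1, x2, gamma=1.0, c=1.0, n=2):
    return (gamma * np.dot(x1, x2) + c) ** n

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def sigmoid_kernel(x1, x2, gamma=1.0, c=0.0):
    return np.tanh(gamma * np.dot(x1, x2) + c)

def laplace_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) / sigma)
```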

A valid kernel function must satisfy the following conditions:
① Symmetry: $K(x_1, x_2) = K(x_2, x_1)$
② Positive semi-definiteness: for any coefficients $C_i$ and vectors $\boldsymbol{x_i}$, $\sum\limits_{i=1}^N\sum\limits_{j=1}^NC_iC_jK(\boldsymbol{x_i}, \boldsymbol{x_j})\geq0$
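A quick numerical check of both conditions on random points (a sketch: the Gram matrix of a valid kernel should be symmetric with non-negative eigenvalues):

```python
# Sketch only: check symmetry and positive semi-definiteness of a Gaussian Gram matrix.
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

points = np.random.default_rng(0).normal(size=(10, 3))
K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

print("symmetric:", np.allclose(K, K.T))
print("PSD:", np.all(np.linalg.eigvalsh(K) >= -1e-10))   # eigenvalues >= 0 up to rounding
```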

Use the primal problem and the dual problem to avoid $\phi(x)$

What are the primal problem and the dual problem? Check this
Primal problem:
$$\begin{cases} \text{minimize } f(\boldsymbol\omega) \\ \text{s.t. } g_i(\boldsymbol\omega)\leq0\ (i=1\sim K), \quad h_i(\boldsymbol\omega) = 0\ (i=1\sim N) \end{cases}$$
Dual problem:
$$\begin{cases} \Theta(\boldsymbol\alpha, \boldsymbol\beta) = \min\limits_{\text{all } \boldsymbol\omega}\{L(\boldsymbol\omega, \boldsymbol\alpha, \boldsymbol\beta)\} \\ \text{s.t. } \alpha_i \geq0 \end{cases}$$
where $L(\boldsymbol\omega, \boldsymbol\alpha, \boldsymbol\beta)=f(\boldsymbol\omega)+\sum\limits_{i=1}^K\alpha_ig_i(\boldsymbol\omega)+\sum\limits_{i=1}^N\beta_ih_i(\boldsymbol\omega)$ is the Lagrangian.
We need to find:
$$\begin{cases} \min\ \frac{1}{2}\|\boldsymbol{\omega}\|^2+C\sum\limits_{i=1}^N\xi_i \\ \text{s.t. } y_i[\boldsymbol\omega^T\phi(\boldsymbol{x_i})+b]\geq1-\xi_i, \quad \xi_i\geq0 \end{cases}$$
Treating this as the primal problem, we can write down its dual problem:
$$\begin{cases} \Theta(\boldsymbol\alpha, \boldsymbol\beta) = \min\limits_{\text{all } \boldsymbol\omega,\ \xi_i,\ b} \left\{ \frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum\limits_{i=1}^N\xi_i - \sum\limits_{i=1}^N\beta_i\xi_i + \sum\limits_{i=1}^N\alpha_i\left[1 - \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib\right] \right\} \\ \text{s.t. } \alpha_i \geq0,\ \beta_i \geq 0\end{cases}$$
(Mapping onto the generic primal/dual notation above: the primal variable $\boldsymbol\omega$ corresponds to $(\boldsymbol\omega, \xi_i, b)$, the multipliers $\boldsymbol\alpha$ correspond to $(\alpha_i, \beta_i)$, and the multipliers $\boldsymbol\beta$ correspond to nothing, since there are no equality constraints here.)
(In the generic form, $\alpha$ multiplies the inequality constraints and $\beta$ multiplies the equality constraints.)

Let’s start!
$$L=\frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum\limits_{i=1}^N\xi_i - \sum\limits_{i=1}^N\beta_i\xi_i + \sum\limits_{i=1}^N\alpha_i\left[1 - \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib\right]$$
$$\begin{cases} \frac{\partial L}{\partial\boldsymbol\omega}=0 \\ \frac{\partial L}{\partial\xi_i}=0 \\ \frac{\partial L}{\partial b}=0 \end{cases}$$
Setting these three derivatives to zero, we get:
$$\begin{cases} \boldsymbol\omega=\sum\limits_{i=1}^N\alpha_iy_i\phi(x_i) \\ \alpha_i + \beta_i = C \\ \sum\limits_{i=1}^N\alpha_iy_i=0 \end{cases}$$
Substituting these three equations back in lets us work out $\Theta(\boldsymbol\alpha)$. Remember what we want: to replace $\phi(x_i)$ with $K(x_i,x_j)$.
$$L_{min}=\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + \sum\limits_{i=1}^N\alpha_i -\sum\limits_{i=1}^N\alpha_iy_i\boldsymbol\omega^T \phi(x_i)$$
$$L_{min}= \sum\limits_{i=1}^N\alpha_i + \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_j\phi(x_j)^T\phi(x_i)$$
Attention: $K(x_i,x_j) = \phi(x_i)^T\phi(x_j)$, so
$$L_{min} = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j)$$
Finally, we get our optimize target:
$$\begin{cases} \max\Theta(\boldsymbol\alpha) = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j) \\ \text{s.t. } 0\leq\alpha_i \leq C, \quad \sum\limits_{i=1}^N\alpha_i y_i = 0 \end{cases}$$
This is still a convex optimization problem, but now it is one we can actually solve, because we know what $K$ is. There are different algorithms for solving convex optimization problems; we won't go into them here. SMO is one of the most common.
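For intuition only, here is a brute-force sketch that hands this dual to a generic constrained solver (SciPy's SLSQP) rather than SMO; it is much slower than SMO but shows exactly what is being optimized. The helper name solve_dual is my own:

```python
# Sketch only: maximize Theta(alpha) subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y, kernel, C):
    N = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    YKY = y[:, None] * y[None, :] * K                      # y_i y_j K(x_i, x_j)

    def neg_theta(alpha):                                  # minimize -Theta(alpha)
        return 0.5 * alpha @ YKY @ alpha - alpha.sum()

    res = minimize(neg_theta, np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,                            # 0 <= alpha_i <= C
                   constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0
    return res.x
```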

Evaluating $\boldsymbol\omega$ and $b$

$\boldsymbol\omega$:
Remember how we predict $y$ for a new sample $\boldsymbol{x}$:
if $\boldsymbol\omega^T\phi(\boldsymbol{x})+b\geq0$, then $y = 1$
if $\boldsymbol\omega^T\phi(\boldsymbol{x})+b<0$, then $y = -1$
So we never need the explicit value of $\boldsymbol\omega$; we only need $\boldsymbol\omega^T\phi(\boldsymbol{x})$, and since $\boldsymbol\omega=\sum_i\alpha_iy_i\phi(x_i)$ this can be computed with the kernel alone:
$$\boldsymbol\omega^T\phi(\boldsymbol{x}) = \sum\limits_{i=1}^N\alpha_iy_iK(x_i,\boldsymbol{x})$$

$b$:
We need the KKT condition (complementary slackness): $\forall i=1\sim K$, $\alpha_i=0$ or $g_i(\boldsymbol\omega) = 0$. For our problem this means: either $\beta_i = 0$ or $\xi_i = 0$, and either $\alpha_i = 0$ or $1 - \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib = 0$.
Pick any $i$ with $0<\alpha_i<C$ (so $\alpha_i\neq0$ and $\beta_i=C-\alpha_i\neq0$); then:
$$\begin{cases} \xi_i = 0\\ 1-\xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib = 0 \end{cases} \quad\Rightarrow\quad b = \frac{1 - y_i\sum\limits_{j=1}^N\alpha_jy_jK(x_j,x_i)}{y_i}$$
In practice, we often pick several such $\alpha_i$ and use the mean of the resulting $b$ values as the final parameter, as sketched below.
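A sketch of that averaging step (solve_b is my own helper name; alpha is assumed to come from a dual solver such as the one sketched earlier):

```python
# Sketch only: recover b by averaging over support vectors with 0 < alpha_i < C.
import numpy as np

def solve_b(alpha, X, y, kernel, C, tol=1e-6):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    idx = np.where((alpha > tol) & (alpha < C - tol))[0]          # 0 < alpha_i < C
    bs = [(1 - y[i] * np.sum(alpha * y * K[:, i])) / y[i] for i in idx]
    return float(np.mean(bs))
```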


Summary

① Train the model

  • input $\{(x_i,y_i)\}_{i=1\sim N}$
  • solve the optimization problem (with SMO, etc.): $$\begin{cases} \max\Theta(\boldsymbol\alpha) = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j) \\ \text{s.t. } 0\leq\alpha_i \leq C, \quad \sum\limits_{i=1}^N\alpha_i y_i = 0 \end{cases}$$
  • evaluate b

② Test the model

  • input x

$$\begin{cases} \text{if } \sum\limits_{i=1}^N\alpha_iy_iK(x_i,\boldsymbol{x})+b\geq0, \text{ then } y = 1\\ \text{if } \sum\limits_{i=1}^N\alpha_iy_iK(x_i,\boldsymbol{x})+b<0, \text{ then } y = -1 \end{cases}$$

  • output y (a code sketch of this decision rule follows below)
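Putting the test step into code (a sketch consistent with the training sketches above; predict is my own name):

```python
# Sketch only: classify a new point x using the kernel, the dual coefficients alpha, and b.
import numpy as np

def predict(x, X_train, y_train, alpha, b, kernel):
    score = sum(alpha[i] * y_train[i] * kernel(X_train[i], x)
                for i in range(len(y_train))) + b
    return 1 if score >= 0 else -1
```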