This article records my study process. Link: SVM. Lagrange multipliers and the KKT conditions are not covered here; we mainly use gradient descent.
Binary Classification
Because $g(x)$ only outputs $+1$ or $-1$, the ideal loss $\delta$ cannot be minimized by gradient descent: $\delta$ is a piecewise (step) function. We use another loss function instead.
$\hat{y}^n$ is $1$ or $-1$; it is defined this way for convenience, so the loss for the two classes can be written uniformly in terms of $\hat{y}^n f(x)$. For $\hat{y}^n=1$ we hope $f(x)$ is large; for $\hat{y}^n=-1$ we hope $f(x)$ is small. $f(x)=0$ is the boundary between the two classes.
For intuition (suppose the $y$ axis is $\delta$ or $l$):
Using square loss here is unreasonable: when $\hat{y}^n f(x)$ is large, the loss is also large.
PS: I think the expression above is not meaningful unless $f(x)$ is close to $1$ or $-1$. Next, we use sigmoid + square loss. Its curve is the blue one.
Refer to 简单谈谈Cross Entropy Loss about Softmax and Cross Entropy. Next, we use sigmoid + cross entropy, which is reasonable: if $l$ is divided by $\ln 2$, this loss is an upper bound of the ideal loss, so minimizing $l$ also minimizes the ideal loss. Compared with sigmoid + square loss, gradient descent behaves better with this method; sigmoid + square loss does not make good progress when $\hat{y}^n f(x)$ is a small positive number. Cross entropy is easier to train, so we often use it (e.g., in logistic regression).
Next, we introduce the hinge loss. Sometimes the hinge loss is more robust than cross entropy, for example when there are outliers, because it is sparse over the training data and only considers the support vectors (the kernel section makes this clear), while cross entropy considers all training data.
Why do we use $1$? Because then the hinge function is an upper bound of the ideal loss.
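As a sanity check on the bounds above, here is a minimal numpy sketch (function names are my own) comparing the losses as functions of $\hat{y}^n f(x)$:

```python
import numpy as np

def ideal_loss(yf):
    # The ideal 0/1 loss delta: 1 when the sign is wrong, else 0.
    return (yf < 0).astype(float)

def sigmoid_square_loss(yf):
    # Square loss after sigmoid: (sigma(yf) - 1)^2; nearly flat for very negative yf.
    return (1.0 / (1.0 + np.exp(-yf)) - 1.0) ** 2

def sigmoid_ce_loss(yf):
    # Cross entropy after sigmoid, log(1 + e^{-yf}); divided by ln 2 it
    # upper-bounds the ideal loss.
    return np.log1p(np.exp(-yf)) / np.log(2)

def hinge_loss(yf):
    # Hinge loss max(0, 1 - yf), also an upper bound of the ideal loss.
    return np.maximum(0.0, 1.0 - yf)

yf = np.linspace(-3, 3, 601)
# Both surrogates sit above the ideal loss everywhere on this grid.
assert np.all(sigmoid_ce_loss(yf) >= ideal_loss(yf))
assert np.all(hinge_loss(yf) >= ideal_loss(yf))
```

Plotting these four curves against $\hat{y}^n f(x)$ reproduces the figure discussed above.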
Linear SVM
We give the linear SVM model. The loss $l$ and the regularization term are convex functions, so we can use gradient descent to minimize the hinge loss. Thus an SVM can even be used as the classifier layer of a deep network. There is a reference.
SVM gradient descent
Here $c^n(w)$ is $+1$, $-1$, or $0$, and $x^n_i$ is a real number, one dimension of the feature vector.
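The update above can be sketched as (sub)gradient descent on the regularized hinge loss; this is my own minimal numpy version (the data and the name `svm_gd` are made up for illustration):

```python
import numpy as np

def svm_gd(X, y, eta=0.1, lam=0.01, epochs=200):
    """Full-batch (sub)gradient descent on hinge loss + L2 regularization.

    X: (N, d) features; y: (N,) labels in {+1, -1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        f = X @ w + b
        # c^n(w) = dl/df = -y^n when y^n f(x^n) < 1, else 0.
        c = np.where(y * f < 1.0, -y, 0.0)
        w -= eta * (X.T @ c / n + lam * w)
        b -= eta * c.mean()
    return w, b

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_gd(X, y)
assert np.all(np.sign(X @ w + b) == y)  # all points classified correctly
```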
Linear SVM another formulation
Suppose the training set is linearly separable. The expression above differs from the formulation we usually see, so we transform the hinge-loss form into the familiar one. With the hinge loss we are computing the soft-margin SVM; for the hard margin, see my other article, which supplements the SVM part of Zhou Zhihua's Watermelon Book. The $\max$ becomes $\epsilon^n$ here. The variant is the following:
$$
\text{Minimize the loss function:}\quad L(f)=\sum_n\epsilon^n+\lambda||w||_2\\
\text{s.t. }\ \epsilon^n\ge 0,\qquad \hat{y}^n f(x^n)\ge 1-\epsilon^n
$$
You can verify the expression above; in fact, it is similar to the Watermelon Book by Zhou Zhihua. Here, I think we should add a hyper-parameter in front of $\sum_n\epsilon^n$: when this hyper-parameter tends to infinity, we recover the hard-margin SVM.
$$
\min_{w,b}\ \frac{1}{2}||w||^2\\
\text{s.t. }\ y_i(w^Tx_i+b)\ge 1,\quad i=1,2,\dots,m,
$$
where $w$ denotes the weights.
In short, we look for the weights of smallest norm that satisfy $y^n f(x^n)\ge 1$. If the training set is linearly separable, we can achieve $\epsilon^n=0$: for example, scale $w$ and $b$ so that the minimum margin equals $1$. So we need to minimize $||w||^2$. The difference between the two expressions is the type of margin.
In practice it is hard to make data linearly separable. To relax the problem, we use a soft margin, which allows a few samples to be classified wrongly.
Why is this variant equivalent to the expression proposed before? We want to minimize $L(f)$, i.e., the $\epsilon$'s. When the $\epsilon$'s are minimized, this quadratic program is equivalent to minimizing the hinge loss, with $\epsilon^n=\max(0,1-\hat{y}^n f(x^n))$. $\epsilon$ cannot be arbitrarily large: $\epsilon^n$ is the smallest number that is at least $0$ and at least $1-\hat{y}^n f(x^n)$. So under minimization the two formulations are equal. $\epsilon^n$ is a slack variable.
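This equivalence can be checked numerically; a tiny sketch (`optimal_slack` is my own name) of the fact that the smallest feasible slack is exactly the hinge loss:

```python
def optimal_slack(yf):
    # Smallest epsilon satisfying epsilon >= 0 and yf >= 1 - epsilon:
    # exactly the hinge loss max(0, 1 - yf).
    return max(0.0, 1.0 - yf)

for yf in [-2.0, 0.5, 1.5]:
    eps = optimal_slack(yf)
    # Feasibility: both constraints hold.
    assert eps >= 0.0 and yf >= 1.0 - eps
    # Minimality: any smaller epsilon violates a constraint.
    if eps > 0.0:
        smaller = eps - 1e-6
        assert not (smaller >= 0.0 and yf >= 1.0 - smaller)
```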
Dual representation
Many SVM slides derive the expression below via Lagrange duality. Here we explain from another perspective why $\hat{w}$ is a linear combination of the $x^n$:
$$
\hat{w}=\sum_n\hat{a}_n x^n
$$
where $\hat{a}_n$ may be sparse. (Up to the learning rate it is an integer, since each gradient step changes it by $\eta c^n(w)$ with $c^n(w)\in\{0,\pm1\}$.)
If we initialize $w$ to $0$:
$$
w=w-\eta\sum_nc^n(w)x^n\\
c^n(w)=\frac{\partial l(f(x^n),\hat{y}^n)}{\partial f(x^n)}=0\ \text{or}\ -\hat{y}^n\ (\text{i.e., }+1\ \text{or}\ -1)
$$
It is then obvious that $\hat{w}$ is a linear combination of the $x^n$. The hinge-loss gradient is usually zero, so many $x^n$ are not used to determine $\hat{w}$; the points with non-zero $\hat{a}_n$ are the support vectors. For logistic regression the coefficients are always non-zero, because it uses cross entropy, whose gradient is never exactly zero. So with a cross-entropy loss we cannot get a sparse $\hat{a}$: every data point influences the result $\hat{w}$. As mentioned above, this is why the hinge loss is more robust.
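The contrast between the two gradients can be checked numerically; a sketch with made-up data (the weight vector $w$ is just illustrative):

```python
import numpy as np

np.random.seed(0)
# Two well-separated clusters.
X = np.vstack([np.random.randn(20, 2) + 2.0, np.random.randn(20, 2) - 2.0])
y = np.hstack([np.ones(20), -np.ones(20)])
w = np.array([1.0, 1.0])       # an illustrative weight vector
f = X @ w

# Hinge loss: dl/df is 0 whenever y^n f(x^n) >= 1 -> sparse a-hat.
hinge_grad = np.where(y * f < 1.0, -y, 0.0)
# Cross entropy l = log(1 + e^{-yf}): dl/df = -y / (1 + e^{yf}), never 0.
ce_grad = -y / (1.0 + np.exp(y * f))

assert np.count_nonzero(hinge_grad) < len(y)   # most points drop out
assert np.all(ce_grad != 0.0)                  # every point still pulls on w
```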
Step 1
Now we can derive a new formulation of $f(x)$. Suppose $x$ lives in a linear (inner-product) space.
$$
\text{Since }\ w=\sum_n a_nx^n=[x^1,x^2,\dots,x^n]\begin{bmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{bmatrix}=Xa,\\
f(x)=w^Tx=a^TX^Tx=[a_1,a_2,\dots,a_n]\begin{bmatrix} (x^{1})^T\\(x^2)^T\\\vdots\\(x^n)^T \end{bmatrix}\begin{bmatrix} x_1\\x_2\\\vdots\\x_m \end{bmatrix}\\
\text{Finally,}\quad f(x)=\sum_na_n\big((x^n)^Tx\big)=\sum_n a_nK(x^n,x)
$$
where $x^n$ is a column vector and $K$ is defined as the kernel function. Now we have a new model, and we want to find $a_1,\dots,a_n$. Because our $x$ is usually not in a linear space, we use the function $K$ to replace the inner product of the linear space.
Steps 2 and 3
We find the best $a_1,\dots,a_n$ to minimize the loss function (PS: the YouTuber does not give an update method here, but I think gradient descent or a QP package could be used). We substitute the new model $f(x)$ for the original $f(x)$ in the loss function:
$$
L(f)=\sum_nl\Big(\sum_{n'} a_{n'}K(x^{n'},x^n),\ \hat{y}^n\Big)
$$
We do not need to know the specific $x^n$; we only need to know $K$. This is solved with the kernel trick. The kernel trick can be used wherever it is effective, e.g., logistic regression and linear regression.
Kernel trick
Why do we define a kernel function rather than directly computing the inner product between two transformed vectors? Because it makes the inner-product computation efficient: we can define different kernel functions that compute the inner product cheaply. How does this save computation? Without a kernel function, to apply a linear model we must compute the feature transformation explicitly; for example, in a neural network, hidden layers compute feature transformations. The transformed feature is a high-dimensional vector (denoted $\phi$), and computing inner products of such vectors is expensive. Conversely, with the kernel we compute the inner product of $x$ and $z$ directly. The two ways are equivalent, as follows ($\phi$ is the feature vector of $x$ or $z$):
For example:
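A concrete 2-D sketch of this equivalence for the square kernel $K(x,z)=(x^Tz)^2$ (my own toy feature map, without bias terms):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the 2-D square kernel K(x, z) = (x^T z)^2.
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def K(x, z):
    return float(x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# Computing K directly equals the inner product in the transformed space,
# without ever materializing phi.
assert abs(K(x, z) - phi(x) @ phi(z)) < 1e-9
```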
Radial Basis Function Kernel
Another kernel function (we use a Taylor expansion to expand the exponential):
Note that it may overfit due to the infinite dimension. In the kernel trick section we used a square expression, whose feature dimension is finite (at most 4).
Sigmoid Kernel
$$
K(x,z)=\tanh(x^Tz)
$$
You can use a similar Taylor expansion to find two high-dimensional vectors $\phi(x),\phi(z)$ whose inner product equals $K(x,z)$.
If we use the sigmoid kernel, we get a neural network with one hidden layer.
Thinking about the inner-product computation: when two data points are very close, the value of $K$ is large. So $K$ behaves like a similarity between two data points, and we can work with the value of $K$ alone. You can define your own kernel function and then use Mercer's theorem to check whether it equals an inner product of two high-dimensional vectors. Here we give the reference.
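One practical face of the Mercer check is that a valid kernel's Gram matrix on any finite data set must be symmetric positive semidefinite; a numpy sketch of this necessary condition (`is_valid_gram` is my own name):

```python
import numpy as np

def is_valid_gram(G, tol=1e-9):
    # Necessary Mercer condition on a finite sample: the Gram matrix
    # must be symmetric and positive semidefinite.
    if not np.allclose(G, G.T):
        return False
    return bool(np.linalg.eigvalsh(G).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# RBF Gram matrix: exp(-||x_i - x_j||^2) is a valid kernel.
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
assert is_valid_gram(np.exp(-sq_dist))

# A negated linear "kernel" fails the check.
assert not is_valid_gram(-(X @ X.T))
```

Passing this test on one data set does not prove the kernel valid in general, but failing it on any data set disproves it.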
The kernel function is a hyper-parameter and influences the effectiveness of the model. Some kernel functions cannot make the data separable. For different tasks we choose different kernel functions; if you are unsure, choose the RBF kernel. For text data we usually use the linear kernel, $k(x_i,x_j)=x_i^Tx_j$.
SVM related methods
Our SVM can’t conduct multi-classifier task. It need to be extended. In contrast with SVM,regression need more data.
Relationship with deep learning
A hidden layer transforms features; our kernel function likewise maps data to a high-dimensional space, where the data becomes linearly separable. Then we use the hinge loss to get a linear classifier. The kernel function can also be learnable; the paper below is the reference.