SVM from another perspective

This article records my study process (link: SVM). There are no Lagrange multipliers or KKT conditions here; we mainly use gradient descent.

Binary Classification

Because $g(x)$ only outputs $+1$ or $-1$, the ideal loss $\delta$ cannot be optimized with gradient descent ($\delta$ is a piecewise-constant function), so we use another loss function. $\hat{y}^n$ is $1$ or $-1$; it is defined this way purely for convenience, so that the loss for both classes can be written uniformly in terms of $\hat{y}^n f(x)$. For $\hat{y}^n=1$ we want a larger $f(x)$; for $\hat{y}^n=-1$ we want a smaller $f(x)$. $f(x)=0$ is the boundary between the two classes.
[Figure]
For intuition (suppose the y axis is $\delta$ or $l$):
If we use the square loss, it is unreasonable: when $\hat{y}^n f(x)$ is large, the loss is also large.
[Figure]
PS: I think the expression above is not very meaningful (it only asks $f(x)$ to be close to 1 or -1). Next, we use sigmoid + square loss; its curve is the blue one.
[Figure]
Refer to 简单谈谈Cross Entropy Loss for background on Softmax and Cross Entropy. Next, we use sigmoid + cross entropy, which is reasonable. Moreover, if $l$ is divided by $\ln 2$, this loss is an upper bound of the ideal loss, so minimizing $l$ also drives down the ideal loss. Compared with sigmoid + square loss, gradient descent behaves better with this loss: sigmoid + square loss makes little progress when $\hat{y}^n f(x)$ is only a tiny positive number, while cross entropy is easier to train. That is why it is commonly used, for example in logistic regression.
[Figure]
Next, we introduce the hinge loss. Sometimes the hinge loss is more robust than cross entropy, for example when there are outliers, because it is sparse over the training data and only the support vectors matter (we will see this in the kernel section), while cross entropy takes all training data into account.
[Figure]
Why do we use 1? Because with it the hinge function is an upper bound of the ideal loss.
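As a quick numeric check of these curves (a minimal sketch of my own, not from the original post), the snippet below evaluates the ideal loss, the square loss, sigmoid + cross entropy divided by $\ln 2$, and the hinge loss at a few values of $\hat{y}^n f(x)$; the last two always stay above the ideal loss.

```python
import numpy as np

# Losses as functions of s = y_hat * f(x)
ideal  = lambda s: (s < 0).astype(float)                # 0/1 ideal loss
square = lambda s: (s - 1.0) ** 2                       # square loss
ce     = lambda s: np.log(1 + np.exp(-s)) / np.log(2)   # sigmoid + cross entropy, divided by ln 2
hinge  = lambda s: np.maximum(0.0, 1.0 - s)             # hinge loss

s = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])
for name, fn in [("ideal", ideal), ("square", square), ("ce/ln2", ce), ("hinge", hinge)]:
    print(f"{name:7s}", np.round(fn(s), 3))
```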

Linear SVM

We now give the linear SVM model. Both $l$ and the regularization term are convex functions, so we can use gradient descent to minimize the hinge loss objective. This also means we can use an SVM as the classifier layer of a deep network; there is a reference for this.
[Figure]

SVM gradient descent

[Figure]
Here $c^n(w)$ is $+1$, $-1$, or $0$, and $x^n_i$ is a real number: the $i$-th dimension of the feature vector $x^n$.
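A minimal sketch of this update rule (my own code, assuming a linear model $f(x)=w^Tx$, labels in $\{+1,-1\}$, and an L2 regularization term; the names and defaults are mine):

```python
import numpy as np

def train_linear_svm(X, y, eta=0.01, lam=0.01, epochs=100):
    """X: (N, d) feature matrix, y: (N,) labels in {+1, -1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        f = X @ w                                # f(x^n) for every sample
        # c^n(w) = dl/df = -y^n when the margin is violated (y^n f(x^n) < 1), else 0
        c = np.where(y * f < 1, -y, 0.0)
        grad = X.T @ c + 2 * lam * w             # sum_n c^n(w) x^n plus regularization
        w -= eta * grad
    return w
```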

Another formulation of linear SVM

Suppose the training set is linearly separable. The expression above looks different from the one we usually see, so we transform the hinge loss formulation into the familiar form. Note that with the hinge loss we are in fact solving the soft-margin SVM; for the hard margin, see my other article, which supplements the SVM part of Z.H. Zhou's Watermelon Book. The $\max$ term is written as $\epsilon^n$ here. The variant is the following:
$$\text{Minimize the loss function}\quad L(f)=\sum_n\epsilon^n+\lambda\,\|w\|_2$$
$$\text{s.t.}\quad \epsilon^n\ge 0,\qquad \hat{y}^n f(x^n)\ge 1-\epsilon^n$$
Think the expression above through; in fact it is similar to the formulation in ZhiHua Zhou's Watermelon Book. Here I think we should add a hyper-parameter in front of $\sum_n\epsilon^n$: when that hyper-parameter goes to infinity, we recover the hard-margin SVM.
$$\min_{w,b}\ \frac{1}{2}\|w\|^2\qquad \text{s.t.}\ \ y_i(w^Tx_i+b)\ge 1,\ \ i=1,2,\dots,m$$
where $w$ is the weight vector.
In both cases we look for the weights with the smallest norm that still satisfy $y^n f(x^n)\ge 1$. If the training set is linearly separable, we can have $\epsilon^n=0$: for example, you can rescale $w$ and $b$ so that the minimum margin is $1$, and then we only need to minimize $\|w\|^2$. The difference between the two expressions is the type of margin.
In practice it is hard to make the data linearly separable. To relax the problem we use the soft margin, which allows a few samples to be classified wrongly.
[Figure]
Why is this variant equivalent to the expression proposed before? We want to minimize $L(f)$, i.e., the $\epsilon^n$. Under minimization, $\epsilon^n$ cannot become arbitrarily large: it is the smallest number that is at least $0$ and at least $1-\hat{y}^n f(x^n)$, which is exactly $\epsilon^n=\max(0,\,1-\hat{y}^n f(x^n))$, the hinge loss. So with the minimization the two formulations are equal. $\epsilon^n$ is a slack variable.
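A small worked check with my own numbers: take $\hat{y}^n f(x^n)=0.3$, then

$$\epsilon^n\ge 0,\qquad \epsilon^n\ge 1-0.3=0.7\ \Longrightarrow\ \min\epsilon^n=0.7=\max(0,\,1-0.3),$$

and for $\hat{y}^n f(x^n)=1.5$ the binding constraint is $\epsilon^n\ge 0$, so $\min\epsilon^n=0=\max(0,\,1-1.5)$, again the hinge loss.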

Dual representation

Many SVM slides derive the expression below with Lagrange multipliers. Here we interpret from another perspective why $\hat{w}$ is a linear combination of the $x^n$.
$$\hat{w}=\sum_n\hat{a}_n\,x^n$$
where the coefficients $\hat{a}_n$ may be sparse.
If we initialize $w$ to $0$:
$$w \leftarrow w-\eta\sum_n c^n(w)\,x^n,\qquad c^n(w)=\frac{\partial l(f(x^n),\hat{y}^n)}{\partial f(x^n)}=0\ \text{or}\ -\hat{y}^n\ (\text{i.e. }+1\ \text{or}\ -1)$$
From this it is obvious that $\hat{w}$ is a linear combination of the $x^n$. The hinge loss (and hence $c^n(w)$) is often exactly zero, so many $x^n$ are never used to determine $\hat{w}$; the points with non-zero $\hat{a}_n$ are called support vectors. For logistic regression the coefficients are always non-zero, because it uses the cross-entropy loss, whose gradient is never exactly zero. So with cross entropy we cannot get a sparse $\hat{a}$, and every data point influences the result $\hat{w}$. As mentioned above, this is why the hinge loss is more robust.
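Here is a sketch of the same training loop as before, but tracking the coefficients $a_n$ instead of $w$ directly (my own illustration; the regularization term is omitted so that $w$ stays an exact linear combination of the $x^n$). After training, the points whose $a_n$ is non-zero are the support vectors.

```python
import numpy as np

def train_dual_coefficients(X, y, eta=0.01, epochs=100):
    """Track a_n such that w = sum_n a_n x^n (regularization omitted for clarity)."""
    N, d = X.shape
    a = np.zeros(N)
    for _ in range(epochs):
        w = X.T @ a                         # w is always a linear combination of the x^n
        f = X @ w
        c = np.where(y * f < 1, -y, 0.0)    # hinge-loss (sub)gradient w.r.t. f(x^n)
        a -= eta * c                        # only margin-violating points update their a_n
    return a

# support_vector_indices = np.nonzero(train_dual_coefficients(X, y))[0]
```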

Step 1

Now we can write a new formulation of $f(x)$. Suppose $x$ lives in a linear (vector) space.
Since
$$w=\sum_n a_n x^n=[x^1,x^2,\dots,x^N]\begin{bmatrix}a_1\\a_2\\\vdots\\a_N\end{bmatrix}=Xa,$$
we have
$$f(x)=w^Tx=a^TX^Tx=[a_1,a_2,\dots,a_N]\begin{bmatrix}(x^1)^T\\(x^2)^T\\\vdots\\(x^N)^T\end{bmatrix}x.$$
Finally,
$$f(x)=\sum_n a_n\,\big((x^n)^T x\big)=\sum_n a_n\,K(x^n,x).$$
Here $x^n$ is a column vector, and $K$ is called the kernel function. Now we have a new model, and we want to find $a_1,\dots,a_N$. Because our $x$ is not always in a linear space, we use the function $K$ to replace the inner product of that space.

Steps 2 and 3

We find the best $a_1,\dots,a_N$ to minimize the loss function (PS: the YouTuber does not give an update method here, but I think gradient descent or QP packages can be used). We substitute the new model $f(x)$ for the original $f(x)$ in the loss function.
$$L(f)=\sum_n l\Big(\sum_{n'} a_{n'}\,K(x^{n'},x^n),\ \hat{y}^n\Big)$$
We do not need to know the specific $x^n$; we only need to know $K$. This is the kernel trick. The kernel trick can be used wherever it is effective, e.g., in logistic regression or linear regression.
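Here is a sketch of minimizing this loss by gradient descent on the $a_{n'}$ with a precomputed kernel matrix and the hinge loss (my own reading, since the lecture leaves the update method open; the regularization term is again omitted):

```python
import numpy as np

def train_kernel_svm(K, y, eta=0.01, epochs=200):
    """K: (N, N) matrix with K[n, m] = K(x^n, x^m); y: (N,) labels in {+1, -1}."""
    N = K.shape[0]
    a = np.zeros(N)
    for _ in range(epochs):
        f = K @ a                           # f(x^n) = sum_n' a_n' K(x^n', x^n)
        c = np.where(y * f < 1, -y, 0.0)    # hinge-loss (sub)gradient w.r.t. f(x^n)
        a -= eta * (K @ c)                  # dL/da_m = sum_n c^n K(x^n, x^m)
    return a

def predict(K_test, a):
    """K_test[m, n] = K(x^n, test point m); returns predicted labels."""
    return np.sign(K_test @ a)
```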

Kernel trick

Why do we define a kernel function rather than directly computing the inner product between two transformed vectors? Because it makes the inner product computation efficient: we can choose kernel functions whose values are easy to compute. How does this save computation? Without the kernel function, to use a linear model we first need a feature transformation (for example, in a neural network a hidden layer computes one); the transformed feature is a high-dimensional vector, denoted $\phi$, and only then do we compute the inner product, which is expensive. With a kernel, we instead compute the value directly from $x$ and $z$. The two ways are equivalent, as follows:
[Figure]
$\phi$ denotes the transformed feature vector of $x$ or $z$.
For example:
[Figure]
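A small numeric check of this idea (my own example; I am assuming the example in the figure is the squared dot-product kernel $K(x,z)=(x\cdot z)^2$ on 2-D inputs, whose explicit feature map is $\phi(x)=(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$):

```python
import numpy as np

def phi(v):
    """Explicit feature map for K(x, z) = (x . z)^2 on 2-D inputs."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

direct      = (x @ z) ** 2       # kernel value computed directly: cheap
transformed = phi(x) @ phi(z)    # inner product after the explicit transformation
print(direct, transformed)       # both print 1.0
```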

Radial Basis Function Kernel

Another kernel function (we use the Taylor expansion to expand the exponential):
[Figure]
What you should pay attention to is that it may overfit, because the implicit feature dimension is infinite. In the kernel trick section we used a squared expression, whose feature dimension is finite (at most 4).
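For reference, a minimal sketch of evaluating this kernel (assuming the common form $K(x,z)=\exp(-\gamma\|x-z\|^2)$; the bandwidth $\gamma$ and the numbers are my own choices):

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 2.0])
z = np.array([1.5, 1.0])
print(rbf_kernel(x, z))   # close points give values near 1, distant points near 0
```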

Sigmoid Kernel

$$K(x,z)=\tanh(x\cdot z)$$
You can use a similar Taylor expansion to find two high-dimensional vectors $\phi(x)$ and $\phi(z)$ whose inner product equals $K(x,z)$.
If we use the sigmoid kernel, we get something like a neural network with one hidden layer.
[Figure]
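To make the one-hidden-layer analogy concrete, a short sketch of my own: with the sigmoid kernel, $f(x)=\sum_n a_n\tanh(x^n\cdot x)$, so each support vector $x^n$ acts as the weight vector of a hidden neuron and $a_n$ is its output weight.

```python
import numpy as np

def f_sigmoid_kernel(x, support_vectors, a):
    """f(x) = sum_n a_n * tanh(x^n . x): a one-hidden-layer network whose hidden
    weights are the support vectors and whose output weights are the a_n."""
    hidden = np.tanh(support_vectors @ x)   # one hidden unit per support vector
    return hidden @ a
```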
Thinking about the inner product, when two data points are very close the value of $K$ is large, so $K$ behaves like a similarity between two data points, and we only need to consider the values of $K$. You can define your own kernel function and then use Mercer's theorem to check whether it equals the inner product of two high-dimensional vectors. Here is the reference.
[Figure]
The kernel function is a hyper-parameter, and it affects the effectiveness of the model; some kernel functions cannot make the data separable. For different tasks we choose different kernel functions. If you are unsure, you can choose the RBF kernel; for text data, we usually use the linear kernel $k(x_i,x_j)=x_i^Tx_j$.
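In practice the kernel is usually just a parameter of a library implementation; a brief sketch with scikit-learn's SVC on toy data (the data and settings are placeholders of mine):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs, only to make the sketch runnable
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    clf = SVC(kernel=kernel)      # the kernel is a hyper-parameter of the model
    clf.fit(X, y)
    print(kernel, clf.score(X, y))
```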

SVM related methods

[Figure]
Our SVM can’t conduct multi-classifier task. It need to be extended. In contrast with SVM,regression need more data.

Relationship with deep learning

[Figure]
A hidden layer transforms the features; our kernel function also maps the data into a high-dimensional space, where the data becomes linearly separable, and then the hinge loss gives a linear classifier. The kernel function can also be learnable. The paper below is the reference.
