Structured Learning -- Structured SVM

This article is my notes on Lectures 1, 2, and 3 of HongYi Li's YouTube course.

Introduction

When we do SVM and Deep Learning, we only operate on vectors. But in practice we also need to process other kinds of data, for example sequences, lists, trees, bounding boxes, and so on.
Relation between Structured Learning and DNN:
[figure]
Structured Learning defines $F$ as an evaluation of the compatibility between $x$ and $y$. We can view $F$ as the loss function of a DNN and $x$ as $DNN\_output(x)$, so a DNN can be regarded as a special case of Structured Learning.

Structured Learning

We want a more powerful function $f$ whose input and output are both structured objects (bounding boxes, sequences, lists, trees, and so on).
We give a unified framework:
[figure]
$F(x,y)$ is an evaluation function that measures the degree of matching between $x$ and $y$. Once $F$ is trained, we can use it to do inference on a new $x$.
We have three problems:
[figure]
In this article, we will use Object Detection as the example task.
[figure]

Problem 1: Definition of Evaluation

We suppose that $F(x,y)$ is linear in a feature vector: $F(x,y)=W\cdot\phi(x,y)$, where $\phi=(\phi_1,\phi_2,\phi_3,\phi_4,\dots)$ is the feature vector, each $\phi_i$ is one characteristic, and $W=(W_1,W_2,W_3,\dots)$ is the parameter vector to be learned. $\phi$ can be computed by a CNN, as shown below, or defined by hand; it should represent the information inside the bounding box. Google does Object Detection with CNNs, but a CNN alone does not give us the bounding box: we combine a CNN with Structured Learning to get the bounding box.
[figure]
[figure]
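As a minimal sketch of this linear evaluation in Python (the feature function `phi` below is entirely hypothetical; a real $\phi$ would come from a CNN or a careful hand-crafted design):

```python
import numpy as np

def phi(x, y):
    """Hypothetical feature function: maps an (image, bounding box) pair
    to a fixed-length vector describing the content inside the box."""
    x1, y1, x2, y2 = y                       # box corners (y is the bounding box)
    crop = x[y1:y2, x1:x2]                   # pixels inside the box
    return np.array([crop.mean(),            # average intensity inside the box
                     crop.std(),             # contrast inside the box
                     (x2 - x1) * (y2 - y1),  # box area
                     1.0])                   # bias feature

def F(w, x, y):
    """Linear evaluation function: F(x, y) = W . phi(x, y)."""
    return np.dot(w, phi(x, y))
```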

Problem 2: Inference

We enumerate all $y$ and take the one with the highest score as $\hat{y}$:
[figure]
There are several ways to solve Problem 2; the algorithm depends on $\phi$ and on the task.
[figure]
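A brute-force sketch of the inference step, reusing the `F` from the sketch above (`candidate_boxes`, the list of all candidate bounding boxes, is a hypothetical input; real systems replace this loop with a task-specific search):

```python
def infer(w, x, candidate_boxes):
    """Problem 2 by exhaustive enumeration: return the y with the highest F(x, y)."""
    best_y, best_score = None, float("-inf")
    for y in candidate_boxes:
        score = F(w, x, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```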
For now, we assume Problem 2 has been solved.

Problem 3: Training

We want the score of the correct answer to be higher than the score of every incorrect answer.
[figure]
Assuming Problem 1 and Problem 2 have been solved, we now turn to Problem 3.

Outline

[figure]

Separable case

We assume the data is separable: the difference between the correct score and every incorrect score is at least $\delta$. If we can find a good feature function $\phi$, this can come true, as shown below:
[figure]
If we can find such a $\phi$, we can use the following algorithm to get $W$.

Structured Perceptron

The following algorithm, called the Structured Perceptron, finds $W$:
[figure]
Why do we use $W+\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n)$ to update $W$?
Here is my derivation:
$$
W=\begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \end{pmatrix},\quad
\phi(x^n,\hat{y}^n)=\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix},\quad
\phi(x^n,\widetilde{y}^n)=\begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix},\quad
new\_W=\begin{pmatrix} w_1+y_1-z_1 \\ w_2+y_2-z_2 \\ w_3+y_3-z_3 \\ w_4+y_4-z_4 \end{pmatrix}
$$
$$
new\_W\cdot\bigl(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n)\bigr)=\sum_{i=1}^4\bigl(w_i(y_i-z_i)+(y_i-z_i)^2\bigr)
$$
$$
\text{for }W:\quad W\cdot\bigl(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n)\bigr)=\sum_{i=1}^4 w_i(y_i-z_i)
$$
Because of the extra non-negative squared term, $new\_W$ increases the score gap between $W\cdot\phi(x^n,\hat{y}^n)$ and $W\cdot\phi(x^n,\widetilde{y}^n)$, pushing the correct answer's score above the wrong one's. When $W$ is no longer updated, the algorithm terminates.
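A sketch of the Structured Perceptron loop, reusing the hypothetical `phi` and `infer` helpers from the earlier sketches (the stopping rule matches the description above: stop once a full pass makes no update):

```python
def structured_perceptron(data, candidate_boxes, dim, max_epochs=100):
    """data: list of (x, y_hat) pairs, where y_hat is the correct box."""
    w = np.zeros(dim)
    for _ in range(max_epochs):
        updated = False
        for x, y_hat in data:
            y_tilde = infer(w, x, candidate_boxes)       # current best guess
            if y_tilde != y_hat:                         # mistake: move w
                w = w + phi(x, y_hat) - phi(x, y_tilde)
                updated = True
        if not updated:                                  # no mistakes left: done
            break
    return w
```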

Can the algorithm above converge?

First, we state the conclusion:
[figure]
The number of iterations does not depend on the number of candidate $y$. Below we give the proof of termination:
[figure]
The proof relies on the separable assumption; the following steps hold only in the separable case.
[figure]
If $W^0$ is initialized to $0$, we can infer $\hat{w}\cdot w^k\geq k\delta$. Although we have shown that the numerator grows with $k$, the denominator might also be changing.
[figure]
Now consider the denominator:
[figure]
Because $w^{k-1}$ is not $\hat{w}$, we update $w$ to remedy a mistake, and since a mistake was made, $w^{k-1}\cdot\bigl(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n)\bigr)<0$. Now we combine the numerator and the denominator:
[figure]
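Putting the two bounds together (my own restatement of the slide's argument, assuming the separating $\hat{w}$ is normalized so that $\|\hat{w}\|=1$, and $R$ is the largest distance between any two feature vectors):

$$
\hat{w}\cdot w^k\geq k\delta,\qquad
\|w^k\|^2\leq\|w^{k-1}\|^2+R^2\leq kR^2
$$
$$
\cos\rho_k=\frac{\hat{w}\cdot w^k}{\|\hat{w}\|\,\|w^k\|}\geq\frac{k\delta}{\sqrt{k}\,R}=\sqrt{k}\,\frac{\delta}{R}
$$

Since $\cos\rho_k\leq 1$, this forces $\sqrt{k}\,\frac{\delta}{R}\leq 1$, i.e. $k\leq(R/\delta)^2$: the number of updates is finite and independent of the number of candidate $y$.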
$\sqrt{k}\cdot\frac{\delta}{R}$ is a lower bound of $\cos\rho_k$, and $\cos\rho_k$ stays between this lower bound and the upper bound of $1$. I think the actual running time is still related to the number of candidate $y$: the proof only uses the update step and ignores the cost of the $\arg\max$ step. It merely shows that the number of iterations is finite and that this number does not depend on the number of candidate $y$.
A larger $\delta$ means fewer iterations, but if we simply scale the features so that $\delta$ doubles, $R$ doubles as well and nothing is gained. You should find a better feature function that genuinely produces a larger margin, rather than just multiplying by a constant.
[figure]

Non-separable case

It is very hard to find a feature function that makes the data separable, so we now discuss the non-separable case. For non-separable data, we define a cost function $C$ that evaluates how bad a given $W$ is, and then pick the $W$ minimizing $C$, even though no $W$ separates the data.
[figure]
The minimum value of this cost is zero. The expression is simple; we need not choose another cost function, such as the sum of the top three scores minus $W\cdot\phi(x^n,\hat{y}^n)$, which might in any case be too hard to compute.
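Written out, the per-example cost described above and the total cost are:

$$
C^n=\max_{y}\bigl[W\cdot\phi(x^n,y)\bigr]-W\cdot\phi(x^n,\hat{y}^n),\qquad
C=\sum_{n=1}^{N}C^n
$$

Because $\hat{y}^n$ itself is one of the candidates inside the $\max$, $C^n\geq 0$, and $C^n=0$ exactly when $\hat{y}^n$ already has the highest score.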

How do we pick the $W$ minimizing $C$?

We can use stochastic gradient descent. Of course, we need to consider how to compute the gradient of a $\max$; all we need is $\nabla C^n$. We discuss this in the space of $W$:
[figure]
When a specific $W$ is given, the $y$ maximizing $W\cdot\phi(x^n,y)$ is fixed, so within that region the gradient of $C^n$ is easy to compute. The inner $\max\limits_{y}\bigl[W\cdot\phi(x^n,y)\bigr]$ is exactly Problem 2, so we can reuse its solution. The algorithm is as follows:
[figure]
There is an error on this slide: $\max$ should be $\arg\max$. We locate the region of $W$ by the $\arg\max$ operation. If we set $\eta=1$, we are doing the Structured Perceptron, so this stochastic gradient descent is a generalization of the Structured Perceptron algorithm.
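A sketch of one such stochastic gradient step, reusing the hypothetical `phi` and `infer` helpers (with `eta = 1` it reproduces the Structured Perceptron update):

```python
def sgd_step(w, x, y_hat, candidate_boxes, eta=0.1):
    """One SGD step on C^n = max_y[w . phi(x, y)] - w . phi(x, y_hat).
    Inside the region where y_tilde is the argmax, the gradient of C^n
    is phi(x, y_tilde) - phi(x, y_hat)."""
    y_tilde = infer(w, x, candidate_boxes)     # argmax over y, i.e. Problem 2
    grad = phi(x, y_tilde) - phi(x, y_hat)
    return w - eta * grad
```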

Considering Errors

The algorithm above treats all incorrect $y$ equally, which is not good: a $y$ that is close to the correct answer gets no credit, because the algorithm only cares about putting $\hat{y}$ at the top. Instead, we want the ranking given by $W\cdot\phi$ to reflect quality: a $y$ close to $\hat{y}$ should score higher than one far from $\hat{y}$. This algorithm is better than the one above and is also safer at test time: if the test data differs slightly from the training data, the top-scoring $y$ will not be far from $\hat{y}$, so the result is still acceptable.
在这里插入图片描述
We therefore modify the cost function to get the following behavior: a $y$ close to the correct box should have a score close to the score of the correct box.
[figure]
First, we define an error function to evaluate the distance between $y$ and $\hat{y}$. The error function depends on your task; its definition is up to you. Our definition is shown below, where $\delta$ is a positive number and the function $A$ computes the area of a box.
[figure]
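A sketch of one such error function for boxes: the overlap-based form below (one minus intersection-over-union) is my reading of the slide, with `A` computing the area of a box:

```python
def A(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def delta(y_hat, y):
    """Error between the correct box y_hat and a candidate y:
    1 - overlap / union, so it is 0 when the boxes coincide."""
    ix1, iy1 = max(y_hat[0], y[0]), max(y_hat[1], y[1])
    ix2, iy2 = min(y_hat[2], y[2]), min(y_hat[3], y[3])
    inter = A((ix1, iy1, ix2, iy2))
    union = A(y_hat) + A(y) - inter
    return 1.0 - inter / union
```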
We obtain another cost function, shown below. We take the $y$ with the largest $w\cdot\phi+\delta$ and require the score of $\hat{y}$ to exceed it. The minimum of $C^n$ is zero: when $w\cdot\phi(x^n,\hat{y}^n)$ exceeds every $w\cdot\phi+\delta$, the gap $w\cdot\phi(x^n,\hat{y}^n)-w\cdot\phi$ is larger than $\delta$ for every $y$. This cost function asks that a $y$ close to $\hat{y}$ have a score close to that of $\hat{y}$; the score difference is called the margin.
[figure]
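In symbols, the new cost (this style of adding the error inside the $\max$ is usually called margin rescaling) is:

$$
C^n=\max_{y}\Bigl[\delta(\hat{y}^n,y)+W\cdot\phi(x^n,y)\Bigr]-W\cdot\phi(x^n,\hat{y}^n)
$$

It equals zero only when $W\cdot\phi(x^n,\hat{y}^n)\geq W\cdot\phi(x^n,y)+\delta(\hat{y}^n,y)$ for every $y$, i.e. the margin for each $y$ is at least $\delta(\hat{y}^n,y)$.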
The definition of $\delta$ is up to you, but note that the new $\max$ is no longer exactly Problem 2, so we need a new way to solve it. If you define a complicated $\delta$, this $\max$ becomes hard to solve, so think carefully about the definition of $\delta$.
We update $W$ with a similar rule. The new cost function only changes which $y$ attains the maximum, so the update formula for $W$ only changes the $y$ that is used. Using a different $y$ may give a different result, and this algorithm performs better than the one using $\max\limits_{y}(w\cdot\phi)-w\cdot\phi(x^n,\hat{y}^n)$.
[figure]
There is an error in the expression above: $\max$ should be $\arg\max$. There is another viewpoint for interpreting the cost function above.
[figure]
$C'$ is hard to minimize: when $W$ changes a little, $\delta$ may not change at all, so $\delta$ is a staircase function of $W$. In most places the gradient with respect to $W$ is zero, and at the edges it is infinite. Here $\widetilde{y}^n$ is our predicted label, and $C'$ measures the difference between the predicted label and the target label. We would like to minimize $C'$, but that is hard, so $C$ serves as a surrogate (an upper bound) for $C'$. Although minimizing $C^n$ does not guarantee that $C'$ is minimized, it pushes $C'$ down as far as possible, so the new cost function is useful.
Proof that $\delta(\hat{y}^n,\widetilde{y}^n)\leq C^n$:
[figure]
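The argument in one line (my restatement): since $\widetilde{y}^n$ is just one of the candidates inside the $\max$,

$$
C^n\geq\delta(\hat{y}^n,\widetilde{y}^n)+W\cdot\phi(x^n,\widetilde{y}^n)-W\cdot\phi(x^n,\hat{y}^n)\geq\delta(\hat{y}^n,\widetilde{y}^n),
$$

where the last step uses $W\cdot\phi(x^n,\widetilde{y}^n)\geq W\cdot\phi(x^n,\hat{y}^n)$, which holds because $\widetilde{y}^n=\arg\max_y W\cdot\phi(x^n,y)$.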
It is simple. More cost functions are given below. The Slack Variable Rescaling bound is also easy to prove, so I omit the proof. Why is Slack Variable Rescaling proposed? $\delta$ might be on a different scale than $w\cdot\phi$, so adding them could let one term swamp the other, whereas multiplying them alleviates this problem.
[figure]

Regularization

We want to improve the model's generalization ability for better test results, so we add a regularization term to the cost function. $\lambda$ is just a hyperparameter. If you are familiar with DNN loss functions, this is straightforward.
[figure]
The gradient of the new cost function is given below; it works the same way as in a DNN.
[figure]
I think the $\frac{1}{2}$ is there just to make the gradient cleaner.
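One possible write-up of the regularized objective and its per-example stochastic gradient (the placement of $\lambda$ and the $\frac{1}{2}$ follows my reading of the slide; $\overline{y}^n$ denotes the $y$ attaining the inner $\max$):

$$
C=\frac{1}{2}\|W\|^2+\lambda\sum_{n=1}^{N}C^n
$$
$$
\nabla_W\Bigl(\frac{1}{2}\|W\|^2+\lambda C^n\Bigr)=W+\lambda\bigl(\phi(x^n,\overline{y}^n)-\phi(x^n,\hat{y}^n)\bigr)
$$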

Structured SVM

Structured SVM is a specific case of Structured Learning. It is a linear model; you can define a non-linear $F$ to produce a non-linear Structured Learning model. First, we transform $C^n$:
[figure]
Is $C^n$ equivalent to the expression in the blue box? Yes, because we are minimizing $C$: a larger value such as $C^n+1$ also satisfies the constraints in the blue box, but minimization pushes the value down to the smallest feasible one, which is exactly the $\max$, i.e. $C^n$. So the two are equivalent. We then transform the original expression into the following one; the expression in the green box is equivalent to the one in the yellow box.
[figure]
$\epsilon$ is called a slack variable. When $W$ is fixed, the $C^n$ in the green box is determined, but the $\epsilon^n$ in the yellow box is not; it has to be found as part of the optimization. It may feel odd, but the two formulations give approximately the same result. When $y=\hat{y}^n$ the constraint is trivial, so we drop it. Since $C^n\geq 0$, we also have $\epsilon^n\geq 0$.
[figure]
For intuition, we draw the following picture. Ideally $C^n=0$, which means the margin equals $\delta$. But there may be no $w$ achieving this; in other words, at the minimum of $C$, $C^n$ may be non-zero, and the following inequalities may be impossible to satisfy.
[figure]
So we add a slack variable $\epsilon$ to relax the constraints. With $\epsilon\geq 0$, a smaller margin becomes acceptable; this is the intuitive interpretation of $\epsilon\geq 0$. Allowing a smaller margin via $\epsilon$ corresponds to $C^n\neq 0$ at the minimum of $C$: the constraints on $W$ are relaxed.
[figure]
$\epsilon$ relaxes the constraints, hence the name slack variable. $\epsilon$ cannot be too big, otherwise any $w$ would satisfy all constraints, so we also minimize $\epsilon$: the constraints should be relaxed as little as possible.
[figure]
For example:
[figure]
We then obtain a quadratic program. Because the ordinary SVM is also a quadratic program, the following formulation is called Structured SVM. We can use a QP package to solve it.
[figure]
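Collecting the pieces from the derivation above, the quadratic program is:

$$
\min_{W,\;\epsilon^1,\dots,\epsilon^N}\;\frac{1}{2}\|W\|^2+\lambda\sum_{n=1}^{N}\epsilon^n
$$
$$
\text{s.t.}\quad W\cdot\phi(x^n,\hat{y}^n)-W\cdot\phi(x^n,y)\geq\delta(\hat{y}^n,y)-\epsilon^n,\quad \epsilon^n\geq 0,\quad\text{for all }n\text{ and all }y\neq\hat{y}^n
$$

The objective is quadratic in $W$ and all the constraints are linear, which is exactly the form a QP solver expects.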

Cutting Plane Algorithm for Structured SVM

How do we solve a quadratic program with this many constraints?

The Cutting Plane Algorithm (CPA) can solve it.
[figure]
We are minimizing $C$, and all the constraints are linear inequalities in $w$ and $\epsilon$. A set of linear inequalities can be drawn in the plane (a standard exercise); the result is the left graph above. We then choose the point that minimizes $C$ inside the diamond-like feasible region. $A^n$ is our working set, and only its elements influence the solution.
The algorithm proceeds as follows:
[figure]

How does the Add operation work?

Initially, $A^n$ is empty (null), meaning there are no constraints at all.
[figure]
We then get the blue point (an $\epsilon$ and a $W$), which violates many constraints. We find the most violated constraint and add it to the set $A^n$.
[figure]
Next we compute the new point that minimizes $C$ within the region defined by the updated $A^n$, move to that point, and again find the currently most violated constraint. The process looks like this:
[figure]
We add that constraint to $A^n$ and keep iterating. When no constraint is violated any more, the algorithm terminates.
[figure]

How to find the most violated one?

[figure]
Why do we keep a separate working set $A^n$ for each training example? I think one reason is that the constraint on $\epsilon^n$ only involves a single training example; another is that once none of the working sets change any more, we are left with a small but valid working set. This is just one strategy; you can define your own Add strategy and your own measure of the degree of violation. In the definition of the degree of violation, every $y$ shares the same $\epsilon$ and the same $\phi(x,\hat{y})$ for a given training example, so those terms are omitted. The degree of violation is up to you; other measures can be chosen.

The complete pseudo-code of CPA

[figure]
$\epsilon$ does not affect the relative degree of violation, so we do not need to remember any $\epsilon$; each $\epsilon$ is recomputed whenever $w$ is computed.
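A compact sketch of the whole loop, reusing the hypothetical `phi` and `delta` helpers from the earlier sketches (`solve_qp_over` stands for the inner QP solver that only sees the constraints collected in the working sets; it is hypothetical here):

```python
def find_most_violated(w, x, y_hat, candidates):
    """Most violated constraint for one example: argmax_y [delta + w . phi],
    since epsilon^n and w . phi(x, y_hat) are the same for every y."""
    return max(candidates, key=lambda y: delta(y_hat, y) + np.dot(w, phi(x, y)))

def cutting_plane(data, candidates, dim, max_rounds=50):
    working_sets = [set() for _ in data]        # one working set A^n per example
    w = np.zeros(dim)
    for _ in range(max_rounds):
        changed = False
        for n, (x, y_hat) in enumerate(data):
            y_bar = find_most_violated(w, x, y_hat, candidates)
            if y_bar not in working_sets[n]:
                working_sets[n].add(y_bar)      # add the new constraint to A^n
                changed = True
        if not changed:                         # no working set changed: finished
            break
        # re-solve the QP using only the constraints in the working sets
        w = solve_qp_over(data, working_sets)   # hypothetical QP solver
    return w
```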

An Example

First, initialize $A^1$ and $A^2$:
[figure]
Second,
[figure]
Third, we get a QP with two constraints.
[figure]
We get the $W_1$ that minimizes $C$, then look for the next most violated constraints.
[figure]
You can assume that $\overline{y}^1$ and $\overline{y}^2$ are computed by an optimization routine rather than by enumerating every $y$. We then solve a QP with four constraints, and the process repeats iteratively.
[figure]
According to the original Structured SVM paper, the number of iterations has an upper bound. As described above, the definition of the most violated constraint is up to you. The original paper does not directly add the most violated constraint to $A^n$; it defines a violation tolerance, in other words a threshold that the most violated constraint must exceed before it is added to $A^n$. If $\lambda$ is larger, convergence is slower; if the tolerance is larger, convergence is faster. In short, with these small modifications the algorithm is guaranteed to converge, its number of iterations has an upper bound, and it is usable in practice. The original paper notes that this upper bound does not depend on the number of candidate $y$, the same property as for the Structured Perceptron. The upper bound on the number of iterations depends on parameters such as $\lambda$ and the violation tolerance.

Multi-class and Binary SVM

Multi-class SVM has a fixed, finite set of $y$, so the problem is simpler: we only need to measure the degree of matching between $x$ and finitely many $y$. In effect it is a multi-class or binary classification problem.
We can also regard the object detection problem as a classification problem, since "$x$ corresponds to $y$" can be read as "$x$ belongs to class $y$".

Multi-class

Question 1: Evaluation

[figure]
It is simple: we can express Multi-class SVM with the $F$ of Structured SVM. For example, $F(\vec{x},k)=w^k\cdot\vec{x}$, where $w^k$ measures the degree of correspondence between $x$ and class $k$.
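One way to see the correspondence in code (a sketch: the class-specific vectors $w^k$ are stacked into one long $w$, and a hypothetical $\phi(\vec{x},k)$ places $\vec{x}$ into the $k$-th block, so that $w\cdot\phi(\vec{x},k)=w^k\cdot\vec{x}$):

```python
import numpy as np

def phi_multiclass(x_vec, k, num_classes):
    """Put x into the k-th block of a long feature vector."""
    d = len(x_vec)
    feat = np.zeros(num_classes * d)
    feat[k * d:(k + 1) * d] = x_vec
    return feat

def F_multiclass(w, x_vec, k, num_classes):
    """F(x, k) = w . phi(x, k) = w^k . x for the stacked w = (w^1, ..., w^K)."""
    return np.dot(w, phi_multiclass(x_vec, k, num_classes))

def predict(w, x_vec, num_classes):
    """Inference is just enumeration over the finitely many classes."""
    return max(range(num_classes),
               key=lambda k: F_multiclass(w, x_vec, k, num_classes))
```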

Question 2: Inference

[figure]
Because there are only finitely many $y$, we can simply enumerate them.

Question 3: Training

[figure]
We want the correct class score $W^{\hat{y}^n}\cdot\vec{x}$ to be larger than every incorrect class score $W^{y}\cdot\vec{x}$; the form is the same as in Structured SVM. The error $\delta$ is up to you. For example, with $y\in\{dog,cat,bus,car\}$, if you particularly do not want $cat$ to be misclassified as $car$, set a larger $\delta(cat,car)$, which forces a bigger score gap between $cat$ and $car$. We only need to iterate over $N(K-1)$ constraints.
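With the stacked-$w$ view, the training constraints have exactly the Structured SVM form, one per example $n$ and per wrong class $y\neq\hat{y}^n$ (hence $N(K-1)$ of them):

$$
W^{\hat{y}^n}\cdot\vec{x}^n-W^{y}\cdot\vec{x}^n\geq\delta(\hat{y}^n,y)-\epsilon^n,\qquad\epsilon^n\geq 0
$$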

Binary SVM

[figure]
Binary SVM is a special case of multi-class SVM: we just set $K=2$, and $\delta(\hat{y}^n,y)$ can be set to $1$. So we only have two constraints.

Beyond Structured SVM

Structured SVM is a linear model, so its expressive power is limited. If you want a more powerful Structured SVM, you can use a DNN (Deep Neural Network) to extract the features.
[figure]
If we optimize $W$ with gradient descent rather than quadratic programming, the DNN and the Structured SVM can be trained jointly.
[figure]
If you do not want a linear model at all, you can replace the Structured SVM itself with a DNN. $C$ is our loss function, and the $\max$ operation can still be handled by gradient descent.
[figure]

Summary

For separable data, we can use the Structured Perceptron (a gradient-descent-style update) to get $W$, and the number of iterations is finite. For non-separable data, we can use stochastic gradient descent or the Cutting Plane Algorithm to get $W$. These correspond to two views of the problem: the first directly minimizes a cost function, the second transforms the problem into a quadratic program. With a suitable definition of the most violated constraint, the latter is guaranteed to converge.
