This article is my notes on Lectures 1, 2, and 3 of Hung-yi Lee's Structured Learning course on YouTube.
Introduction
In SVM and deep learning we operate only on vectors, but in practice we need to process other kinds of data, for example sequences, lists, trees, and bounding boxes.
Relation between Structured Learning and DNN:
Structured Learning defines $F$ as an evaluation of the compatibility between $x$ and $y$. We can view $F$ as playing the role of a DNN's loss function, evaluating $y$ against $DNN\_output(x)$. So a DNN can be considered a special case of structured learning.
Structured Learning
We want a more powerful function $f$ whose input and output are both structured objects (bounding boxes, sequences, lists, trees).
We give a unified framework:
$F(x,y)$ is an evaluation function used to evaluate the degree of matching between $x$ and $y$. Once we have a trained $F$, we can use it to infer the best $y$ for a new $x$.
We have three problems:
In this article, we will use object detection as the example task.
Problem 1: Definition of Evaluation
We suppose the evaluation is linear in a feature representation of $x$ and $y$: $F(x,y)=W\cdot\phi(x,y)$.
$\phi=(\phi_1,\phi_2,\phi_3,\phi_4,...)$ is the feature vector; each $\phi_i$ is one characteristic.
$W=(W_1,W_2,W_3,...)$ is the parameter vector that needs to be learned. Your $\phi$ can be computed by a CNN, as below, or defined by hand; $\phi$ should represent the information inside the bounding box. Google does object detection with a CNN, but a CNN alone cannot output a bounding box: we combine the CNN with structured learning to get the bounding box.
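To make the linear form concrete, here is a minimal sketch; the weights and feature values are made-up numbers, since in practice $\phi$ would come from a CNN or hand-crafted features:

```python
# Linear evaluation function: F(x, y) = W . phi(x, y).
# phi_xy stands for an already-computed feature vector phi(x, y).

def evaluate(W, phi_xy):
    """Degree of matching between x and y under weights W."""
    return sum(w_i * f_i for w_i, f_i in zip(W, phi_xy))

W = [0.5, -1.0, 2.0]        # learned parameter vector (toy values)
phi_xy = [1.0, 0.0, 3.0]    # phi(x, y) for one candidate y (toy values)
print(evaluate(W, phi_xy))  # 0.5*1 + (-1)*0 + 2*3 = 6.5
```

A candidate $y$ with a larger `evaluate` score is a better match for $x$.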
Problem 2: Inference
We enumerate all $y$ and take $\hat{y}=\arg\max_y F(x,y)$.
There are several solutions to problem 2; the algorithm depends on $\phi$ and the task.
For now, we suppose problem 2 has been solved.
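Under a finite-candidate assumption, exhaustive inference can be sketched as follows (the `phi` in the toy example is a hypothetical indicator feature, not the lecture's):

```python
# Problem 2 (inference) by brute force: enumerate all candidate y and
# return the one maximizing F(x, y) = W . phi(x, y).

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def infer(W, x, candidates, phi):
    """phi(x, y) -> feature vector; candidates is the (finite) set of y."""
    return max(candidates, key=lambda y: dot(W, phi(x, y)))

# Toy example: y is one of 3 class labels, phi is an indicator feature.
phi = lambda x, y: [x if y == k else 0.0 for k in range(3)]
print(infer([1.0, 2.0, -1.0], 5.0, range(3), phi))  # 1 (highest score: 10.0)
```

Real tasks replace the enumeration with a task-specific search, as the text notes.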
Problem 3: Training
We want the score of the correct label to be higher than the score of every incorrect label.
Supposing problems 1 and 2 have been solved, we now solve problem 3.
Outline
Separable case
We assume separable data: the correct score exceeds every incorrect score by at least $\delta$. This can hold if we can find a good feature function $\phi$, as follows. If we can find such a $\phi$, we can use the following algorithm to obtain $W$.
Structured Perceptron
We can use the following algorithm, called the Structured Perceptron, to obtain $W$:
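The loop can be sketched as follows, assuming a finite candidate set so that inference is exhaustive (the feature function `phi` in the usage line is a made-up toy):

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def structured_perceptron(data, candidates, phi, dim, max_epochs=100):
    """data: list of (x, y_hat) pairs, y_hat being the correct label."""
    W = [0.0] * dim
    for _ in range(max_epochs):
        updated = False
        for x, y_hat in data:
            # Inference: current best y under W (problem 2).
            y_tilde = max(candidates, key=lambda y: dot(W, phi(x, y)))
            if y_tilde != y_hat:
                # Update: W <- W + phi(x, y_hat) - phi(x, y_tilde).
                f_hat, f_tilde = phi(x, y_hat), phi(x, y_tilde)
                W = [w + a - b for w, a, b in zip(W, f_hat, f_tilde)]
                updated = True
        if not updated:   # a full pass with no mistakes: terminate
            break
    return W

# Toy usage: two labels, indicator-style features (a hypothetical phi).
phi = lambda x, y: [x, 0.0] if y == 0 else [0.0, x]
print(structured_perceptron([(1.0, 1)], [0, 1], phi, dim=2))  # [-1.0, 1.0]
```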
Why do we update $W$ with $W+\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n)$?
Here is my derivation:
$$
W=\left|\begin{array}{c} w_1 \\ w_2 \\ w_3 \\ w_4 \end{array}\right|
\quad
\phi(x^n,\hat{y}^n)=\left|\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ y_4 \end{array}\right|
\quad
\phi(x^n,\widetilde{y}^n)=\left|\begin{array}{c} z_1 \\ z_2 \\ z_3 \\ z_4 \end{array}\right|
$$

$$
new\_W=\left|\begin{array}{c} w_1+y_1-z_1 \\ w_2+y_2-z_2 \\ w_3+y_3-z_3 \\ w_4+y_4-z_4 \end{array}\right|
$$

$$
new\_W\cdot(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n))=\sum_{i=1}^4\left(w_i(y_i-z_i)+(y_i-z_i)^2\right)
$$

whereas for the old $W$:

$$
W\cdot(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n))=\sum_{i=1}^4 w_i(y_i-z_i)
$$
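A quick numeric check of the identity above, using arbitrary 4-dimensional vectors:

```python
# Check: new_W . (phi_hat - phi_tilde) = W . (phi_hat - phi_tilde) + ||phi_hat - phi_tilde||^2

W   = [1.0, -2.0, 0.5, 3.0]   # current weights (arbitrary)
y_v = [2.0, 1.0, 0.0, -1.0]   # phi(x^n, y_hat^n)
z_v = [1.0, 1.0, 2.0, 0.0]    # phi(x^n, y_tilde^n)

d     = [a - b for a, b in zip(y_v, z_v)]   # phi_hat - phi_tilde
new_W = [w + di for w, di in zip(W, d)]     # perceptron update
dot   = lambda u, v: sum(a * b for a, b in zip(u, v))

lhs = dot(new_W, d)
rhs = dot(W, d) + dot(d, d)
print(lhs == rhs)  # True: the update grows the gap by ||phi_hat - phi_tilde||^2
```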
$new\_W$ raises $W\cdot\phi(x^n,\hat{y}^n)$ relative to $W\cdot\phi(x^n,\widetilde{y}^n)$: comparing the two sums above, the gap grows by $\sum_{i=1}^4(y_i-z_i)^2\geq 0$. When $W$ stops being updated, we can end the algorithm.
Can the algorithm above converge?
First, the conclusion: the number of iterations is unrelated to the number of candidate $y$. We now give a proof of termination:
Our proof assumes the separable case; the following derivation only holds when the data are separable.
If $W^0$ is initialized to 0, we can infer $\hat{w}\cdot w^k\geq k\delta$. We have shown that the numerator grows with $k$, but the denominator might also be changing.
Discussing the denominator: because $w^{k-1}$ is not $\hat{w}$, we need to update $w$ to remedy its mistake, and because a mistake occurred, $w^{k-1}\cdot(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n))<0$. Now, merging denominator and numerator: $\sqrt{k}\cdot\frac{\delta}{R}$ is a lower bound of $\cos\rho_k$.
$\cos\rho_k$ varies between this lower bound and the upper bound 1; since $\cos\rho_k\leq 1$, the number of updates satisfies $k\leq(R/\delta)^2$. I think the time actually consumed is still related to the number of candidate $y$, because the proof only uses the update step and ignores the $\arg\max$ step. The proof only shows that the number of iterations is finite, and that this number is unrelated to the number of candidate $y$.
If $\delta$ is larger, the number of iterations can be smaller. But if we simply double $\delta$ by rescaling, $R$ also becomes twice as big as before. You should find a good feature function that produces a larger margin, rather than simply multiplying.
Non-separable case
It is very hard to find a feature function that makes the data separable, so we turn to non-separable data. In this case we define a cost function $C$ that evaluates how bad a $W$ is, then pick the $W$ minimizing $C$, even though that $W$ does not separate the data.
The minimum value of this cost is zero. The expression is simple; we need not choose a more complex cost, such as the sum of the first three scores minus $W\cdot\phi(x^n,\hat{y}^n)$, which besides might be too hard to compute.
How do we pick the $W$ minimizing $C$?
We can use stochastic gradient descent. Of course, we need to consider how to compute the gradient of the $\max$. All we need is $\nabla C^n$. Consider the space of $W$: when a specific $W$ is given, the $y$ maximizing $W\cdot\phi(x^n,y)$ is fixed, so within that region we can compute the gradient of $C^n$. For $\max\limits_{y} W\cdot\phi(x^n,y)$, we can use the solution of problem 2. The algorithm is as follows:
There is an error in this slide: $\max$ should be $\arg\max$. We locate the region of $W$ via the $\arg\max$ operation. If we set $\eta=1$, we are doing the Structured Perceptron, so this stochastic gradient descent is a generalization of the Structured Perceptron algorithm.
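A sketch of one stochastic (sub)gradient step under these assumptions (finite candidates, toy `phi`); with $\eta=1$ it reproduces the perceptron update:

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def sgd_step(W, x, y_hat, candidates, phi, eta):
    """One (sub)gradient step on C^n = max_y W.phi(x,y) - W.phi(x,y_hat)."""
    y_tilde = max(candidates, key=lambda y: dot(W, phi(x, y)))  # argmax, not max
    grad = [a - b for a, b in zip(phi(x, y_tilde), phi(x, y_hat))]
    return [w - eta * g for w, g in zip(W, grad)]

# With eta = 1 this is exactly the Structured Perceptron update.
phi = lambda x, y: [x, 0.0] if y == 0 else [0.0, x]
print(sgd_step([0.0, 0.0], 1.0, 1, [0, 1], phi, 1.0))  # [-1.0, 1.0]
```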
Considering Errors
The algorithm above treats all incorrect $y$ equally, which is not good: better $y$'s get no credit for being close; the algorithm only wants to put $\hat{y}$ at the top. Now we want to sort the scores $W\cdot\phi$ so that $y$'s close to $\hat{y}$ score higher than those far from $\hat{y}$. This algorithm is better than the one above, and it is safer for testing: if the test data differ slightly from the training data, the distance between the top-scoring $y$ and $\hat{y}$ is still not large, so the result is acceptable and better than the result of the algorithm above.
Now we modify our cost function to obtain the following behavior: a $y$ close to the correct box should have a smaller gap in evaluation score from the correct box.
First, we define an error function $\delta$ to evaluate the distance between $y$ and $\hat{y}$. The error function depends on your task; it is up to you. With our definition below, $\delta$ is a positive number, and the function $A$ computes the area of a box.
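As one concrete possibility, here is the $1-\mathrm{IoU}$ form of $\delta$ built from the area function $A$; storing boxes as `(x1, y1, x2, y2)` corners is my assumption, not the lecture's:

```python
# delta(y_hat, y) = 1 - intersection/union of the two boxes.
# A computes the area of a box given as (x1, y1, x2, y2) with x1<x2, y1<y2.

def A(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def delta(y_hat, y):
    """0 for identical boxes; approaches 1 as the boxes stop overlapping."""
    ix1, iy1 = max(y_hat[0], y[0]), max(y_hat[1], y[1])
    ix2, iy2 = min(y_hat[2], y[2]), min(y_hat[3], y[3])
    inter = A((ix1, iy1, ix2, iy2))
    union = A(y_hat) + A(y) - inter
    return 1.0 - inter / union

print(delta((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 (perfect match)
print(delta((0, 0, 2, 2), (1, 0, 3, 2)))  # larger for worse overlap
```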
We get another cost function, shown below. We choose the $y$ with the biggest $w\cdot\phi+\delta$, then require the score of $\hat{y}$ to exceed that of this $y$. The minimum of $C^n$ is zero: when $w\cdot\phi(x^n,\hat{y}^n)$ is bigger than every $w\cdot\phi+\delta$, the difference $w\cdot\phi(x^n,\hat{y}^n)-w\cdot\phi$ is bigger than $\delta$ for every incorrect $y$. This cost function aims to give a $y$ close to $\hat{y}$ a smaller score difference from $\hat{y}$. That score difference is named the margin.
The definition of $\delta$ is up to you, but note this is no longer exactly problem 2: we need to re-derive a solution for this $\max$ question. If you define a complex $\delta$, the question becomes difficult to solve, so think carefully about the definition of $\delta$.
We update $W$ by a similar method. The new cost function only changes which $y$ attains the maximum, so the mathematical expression for updating $W$ only changes in its $y$. Using a different $y$, the result may differ; the current algorithm performs better than using $\max\limits_{y}(w\cdot\phi)-w\cdot\phi(x^n,\hat{y}^n)$.
There is an error in the expression above: $\max$ should be $\arg\max$. There is another viewpoint for interpreting the cost function above.
$C'$ is hard to minimize: when $W$ changes a little, $\delta$ may not change at all, which means $\delta$ is a staircase function of $W$; in most places its gradient is zero, and at the edges it is infinite. Here $\widetilde{y}^n$ is our predicted label, and $C'$ evaluates the difference between the predicted label and the target label. We want to minimize $C'$, but it is hard, so $C$ is the surrogate of $C'$. Although the minimum of $C^n$ does not mean $C'$ is minimal, we can still make $C'$ as small as possible, so the new cost function is useful.
Proof that $\delta(\hat{y}^n,\widetilde{y}^n)<C^n$:
It is simple. We give more cost functions below. It is also very simple to prove the Slack Variable Rescaling method, so I omit the proof. Why was Slack Variable Rescaling proposed? $\delta$ might be on a different scale from $w\cdot\phi$, and summing them could produce a meaningless number, while multiplication alleviates this problem.
Regularization
We want to improve the model's generalization for better test results, so we add regularization to the cost function. $\lambda$ is just a hyperparameter; if you are familiar with DNN loss functions, this is straightforward. We give the gradient of the new cost function below; it is the same as in a DNN. I think the $\frac{1}{2}$ is there for convenience when computing the gradient.
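Indeed, the $\frac{1}{2}$ just cancels the exponent when differentiating the regularizer:

```latex
\frac{\partial}{\partial W}\left(\frac{1}{2}\|W\|^2\right)
  = \frac{1}{2}\cdot 2W = W
```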
Structured SVM
Structured SVM is a specific case of structured learning. It is a linear model; you can define a non-linear function $F$ to produce a non-linear structured learning model. First, we transform $C^n$:
Is $C^n$ equivalent to the expression in the blue box? Yes, and the reason is that we are minimizing $C$: $C^n+1$ also satisfies the expression in the blue box, but it does not minimize $C$, whereas $C^n$, being exactly the $\max$, does minimize $C$. So they are equivalent. Then we transform the original expression into the following one: the expression in the green box is equivalent to the one in the yellow box.
$\epsilon$ is defined as a slack variable. When $W$ is fixed, $C^n$ in the green box is determined, but $\epsilon^n$ in the yellow box is not; we must solve for $\epsilon^n$. Although this may feel odd, the results of the two formulations are approximately equal. When $y=\hat{y}^n$ the constraint is vacuous, so we remove it. Since $C^n\geq 0$, we have $\epsilon^n\geq 0$.
To build intuition for the expression above, we draw the following graph. Our ideal result is $C^n=0$, which means the margin equals $\delta$. But we might find no $w$ that achieves this; in other words, when we reach the minimum of $C$, $C^n$ might be non-zero, and the following inequalities might be impossible to satisfy.
So we add a slack variable $\epsilon$ to relax the constraints. With $\epsilon\geq 0$, a smaller margin becomes feasible; this is the intuitive interpretation of $\epsilon\geq 0$. Adding $\epsilon$ to permit a smaller margin corresponds to minimizing $C$ with $C^n\neq 0$: the constraints on $W$ are relaxed. Because $\epsilon$ relaxes the constraints, it is named a slack variable. $\epsilon$ cannot be very large, otherwise any $w$ would satisfy all constraints; so we also minimize $\epsilon$, relaxing the constraints as little as possible.
For example:
Then we get a quadratic program. Because SVM is also a quadratic programming problem, the name of the following formulation (Structured SVM) contains SVM. We can use a QP package to solve Structured SVM.
Cutting Plane Algorithm for Structured SVM
How do we solve a quadratic program with this many constraints? The Cutting Plane Algorithm (CPA) can solve it.
We are minimizing $C$, and all constraints are linear inequalities in $w$ and $\epsilon$; as we know, linear inequalities can be drawn in a plane (a standard mathematical exercise), giving the left graph above. We then choose the point that minimizes $C$ within the diamond-shaped feasible region. $A^n$ is our working set, and only its elements influence the solution.
The algorithm proceeds as follows. How do we solve the Add operation?
Initially, $A^n$ is empty, meaning we have no constraints. Given the blue point (an $\epsilon$ and a $W$), many constraints are violated. We find the most violated constraint and add it to the set $A^n$. Then we compute the new point minimizing $C$ within the constraint region of the new $A^n$, move to that point, and compute the currently most violated constraint. The process continues: add the constraint to $A^n$, then iterate, and iterate... When no constraint is violated, the algorithm is complete.
How do we find the most violated constraint?
Why do we keep a working set $A$ for each training example? I think one reason is that the constraint on $\epsilon^n$ only involves one training example; another is that we obtain a small valid working set once all the $A$'s stop changing. This is just one strategy: you can define your own Add strategy and your own measure of the degree of violation. In the definition of the degree of violation, every $y$ shares the same $\epsilon$ and the same $\phi(x,\hat{y})$ for a given training example, so those terms are omitted. The degree of violation is up to you; you can choose another method.
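As a sketch: because $\epsilon^n$ and $W\cdot\phi(x^n,\hat{y}^n)$ are shared by every candidate, ranking violations for one example reduces to maximizing $\delta(\hat{y}^n,y)+W\cdot\phi(x^n,y)$ over $y$ (the finite candidate set and 0/1 $\delta$ below are toy assumptions):

```python
# Most violated constraint for one training example: the shared terms
# epsilon^n and W.phi(x^n, y_hat^n) are dropped from the ranking.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def most_violated(W, x, y_hat, candidates, phi, delta):
    return max((y for y in candidates if y != y_hat),
               key=lambda y: delta(y_hat, y) + dot(W, phi(x, y)))

# Toy run: 3 labels, indicator features, 0/1 error function.
phi = lambda x, y: [x if y == k else 0.0 for k in range(3)]
delta01 = lambda a, b: 0.0 if a == b else 1.0
print(most_violated([0.0, 3.0, 1.0], 1.0, 0, range(3), phi, delta01))  # 1
```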
The whole pseudo-code of CPA
$\epsilon$ does not influence the relative degree of violation, so we need not remember any $\epsilon$; but each $\epsilon$ is computed whenever $w$ is computed.
An Example
First, initialize $A^1$ and $A^2$.
Second, find the most violated constraint for each example. Third, we get a QP with two constraints.
We get the $W_1$ minimizing $C$, then find the next most violated constraints.
You can suppose that $\overline{y}^{1}$ or $\overline{y}^{2}$ is computed by an optimization routine rather than by visiting every $y$. Then we solve a QP with four constraints, and the process repeats iteratively.
According to the original Structured SVM paper, the number of iterations has an upper bound. As described above, the definition of the most violated constraint is up to you. The original paper does not directly add the most violated constraint to $A^n$; it defines a degree of badness for constraints, in other words, a threshold on the most violated constraint: a constraint added to $A^n$ must exceed this degree. If $\lambda$ is larger, convergence is slower; with a larger badness threshold, convergence is faster. In short, with small modifications to the expressions above, the algorithm still converges with a bounded number of iterations, and you can use it. The original paper notes that this upper bound is unrelated to the number of candidate $y$, the same property as the Structured Perceptron. The upper bound on the number of iterations depends on parameters such as $\lambda$ and the badness threshold.
Multi-class and Binary SVM
Multi-class SVM has a fixed, finite number of classes $y$, so our question is simpler: we only need to analyze the degree of matching between finitely many $y$ and $x$. In fact, it is a multi-class or binary classification problem.
We can also regard the object detection problem as a classification problem, because "$x$ corresponds to $y$" can be read as "$x$ belongs to class $y$".
Multi-class
Question 1: Evaluation
It is simple: we can express Multi-class SVM with the $F$ of Structured SVM. For example, $F(\vec{x},k)=w^k\cdot\vec{x}$, where $w^k$ computes the degree of correspondence between $x$ and class $k$.
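A minimal sketch of this $F$ with one (made-up) weight vector per class:

```python
# Multi-class SVM evaluation: F(x, k) = w^k . x, one weight vector per class.
# Inference simply picks the class with the highest score.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

W = {                      # hypothetical learned weights, one vector per class
    "dog": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "car": [-1.0, -1.0],
}

def F(x, k):
    return dot(W[k], x)

def classify(x):
    return max(W, key=lambda k: F(x, k))

print(classify([2.0, 0.5]))  # dog: score 2.0 beats cat's 0.5 and car's -2.5
```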
Question 2: Inference
Because there are finitely many $y$, we can enumerate them.
Question 3: Training
We want the correct class score $W^{\hat{y}^n}\cdot\vec{x}$ to be bigger than every incorrect class score $W^{y}\cdot\vec{x}$; the form is the same as in Structured SVM. The choice of $\delta$ is up to you. For example, with $y\in\{dog,cat,bus,car\}$, if you particularly do not want $cat$ misclassified as $car$, set a bigger $\delta(cat,car)$, which forces a bigger score difference between $cat$ and $car$. We only need to iterate over $N(K-1)$ constraints.
Binary SVM
Binary SVM is the special case of multi-class SVM with $K=2$. $\delta(\hat{y}^n,y)$ can be set to 1, so we only have two constraints.
Beyond Structured SVM
Structured SVM is a linear model, so it has limited expressive power. If you want a more powerful Structured SVM, you can use a DNN (Deep Neural Network) to extract features. If we optimize $W$ with gradient descent rather than quadratic programming, the DNN and the Structured SVM can be jointly trained. If you do not want a linear model at all, you can replace the Structured SVM with a DNN; $C$ is then our loss function. The $\max$ op is also compatible with gradient descent.
Summary
For separable data, we can use gradient descent (the Structured Perceptron) to obtain $W$, and the number of iterations is finite. For non-separable data, we can use gradient descent or the cutting plane algorithm to obtain $W$, corresponding to two lines of thinking: the first minimizes a cost function directly; the second transforms the problem into a quadratic program, which converges under a suitable definition of the most violated constraint.