This article is my notes on Lectures 1, 2, and 3 of Hung-yi Lee's Structured Learning course on YouTube.
Introduction
In SVM and deep learning we operate only on vectors, but in practice we need to process other kinds of data, for example sequences, lists, trees, and bounding boxes.
Relation between Structured Learning and DNN:
Structured Learning defines $F$ as an evaluation of the compatibility between $x$ and $y$. We can view $F$ as playing the role of a DNN's loss function, evaluating $y$ against $DNN\_output(x)$. So a DNN can be considered a special case of structured learning.
Structured Learning
We want a more powerful function $f$ whose input and output are both structured objects (bounding boxes, sequences, lists, trees).
We give a unified framework:
$F(x,y)$ is an evaluation function used to evaluate the degree of matching between $x$ and $y$. Once we have a trained $F$, we can use it to infer the best $y$ for a new $x$.
We have three problems:
In this article, we will use object detection as the example task.
Problem 1: Definition of Evaluation
We suppose the evaluation is linear in a feature representation of $x$ and $y$: $F(x,y)=W\cdot\phi(x,y)$.
$\phi=(\phi_1,\phi_2,\phi_3,\phi_4,...)$ is the feature vector; each $\phi_i$ is one characteristic.
$W=(W_1,W_2,W_3,...)$ is the parameter vector that needs to be learned. Your $\phi$ can be computed by a CNN, as below, or defined by hand; $\phi$ should represent the information inside the bounding box. Google does object detection with a CNN, but a CNN alone cannot output a bounding box: we combine the CNN with structured learning to get the bounding box.
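To make the linear form concrete, here is a minimal sketch; the weights and feature values are made-up numbers, since in practice $\phi$ would come from a CNN or hand-crafted features:

```python
# Linear evaluation function: F(x, y) = W . phi(x, y).
# phi_xy stands for an already-computed feature vector phi(x, y).

def evaluate(W, phi_xy):
    """Degree of matching between x and y under weights W."""
    return sum(w_i * f_i for w_i, f_i in zip(W, phi_xy))

W = [0.5, -1.0, 2.0]        # learned parameter vector (toy values)
phi_xy = [1.0, 0.0, 3.0]    # phi(x, y) for one candidate y (toy values)
print(evaluate(W, phi_xy))  # 0.5*1 + (-1)*0 + 2*3 = 6.5
```

A candidate $y$ with a larger `evaluate` score is a better match for $x$.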
Problem 2: Inference
We enumerate all $y$ and take $\hat{y}=\arg\max_y F(x,y)$.
There are several solutions to problem 2; the algorithm depends on $\phi$ and the task.
For now, we suppose problem 2 has been solved.
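Under a finite-candidate assumption, exhaustive inference can be sketched as follows (the `phi` in the toy example is a hypothetical indicator feature, not the lecture's):

```python
# Problem 2 (inference) by brute force: enumerate all candidate y and
# return the one maximizing F(x, y) = W . phi(x, y).

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def infer(W, x, candidates, phi):
    """phi(x, y) -> feature vector; candidates is the (finite) set of y."""
    return max(candidates, key=lambda y: dot(W, phi(x, y)))

# Toy example: y is one of 3 class labels, phi is an indicator feature.
phi = lambda x, y: [x if y == k else 0.0 for k in range(3)]
print(infer([1.0, 2.0, -1.0], 5.0, range(3), phi))  # 1 (highest score: 10.0)
```

Real tasks replace the enumeration with a task-specific search, as the text notes.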
Problem 3: Training
We want the score of the correct label to be higher than the score of every incorrect label.
Supposing problems 1 and 2 have been solved, we now solve problem 3.
Outline
Separable case
We assume separable data: the correct score exceeds every incorrect score by at least $\delta$. This can hold if we can find a good feature function $\phi$, as follows. If we can find such a $\phi$, we can use the following algorithm to obtain $W$.
Structured Perceptron
We can use the following algorithm, called the Structured Perceptron, to obtain $W$:
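The loop can be sketched as follows, assuming a finite candidate set so that inference is exhaustive (the feature function `phi` in the usage line is a made-up toy):

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def structured_perceptron(data, candidates, phi, dim, max_epochs=100):
    """data: list of (x, y_hat) pairs, y_hat being the correct label."""
    W = [0.0] * dim
    for _ in range(max_epochs):
        updated = False
        for x, y_hat in data:
            # Inference: current best y under W (problem 2).
            y_tilde = max(candidates, key=lambda y: dot(W, phi(x, y)))
            if y_tilde != y_hat:
                # Update: W <- W + phi(x, y_hat) - phi(x, y_tilde).
                f_hat, f_tilde = phi(x, y_hat), phi(x, y_tilde)
                W = [w + a - b for w, a, b in zip(W, f_hat, f_tilde)]
                updated = True
        if not updated:   # a full pass with no mistakes: terminate
            break
    return W

# Toy usage: two labels, indicator-style features (a hypothetical phi).
phi = lambda x, y: [x, 0.0] if y == 0 else [0.0, x]
print(structured_perceptron([(1.0, 1)], [0, 1], phi, dim=2))  # [-1.0, 1.0]
```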
Why do we update $W$ with $W+\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n)$?
Here is my derivation:
$$
W=\left|\begin{array}{c} w_1 \\ w_2 \\ w_3 \\ w_4 \end{array}\right|
\quad
\phi(x^n,\hat{y}^n)=\left|\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ y_4 \end{array}\right|
\quad
\phi(x^n,\widetilde{y}^n)=\left|\begin{array}{c} z_1 \\ z_2 \\ z_3 \\ z_4 \end{array}\right|
$$

$$
new\_W=\left|\begin{array}{c} w_1+y_1-z_1 \\ w_2+y_2-z_2 \\ w_3+y_3-z_3 \\ w_4+y_4-z_4 \end{array}\right|
$$

$$
new\_W\cdot(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n))=\sum_{i=1}^4\left(w_i(y_i-z_i)+(y_i-z_i)^2\right)
$$

whereas for the old $W$:

$$
W\cdot(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n))=\sum_{i=1}^4 w_i(y_i-z_i)
$$
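A quick numeric check of the identity above, using arbitrary 4-dimensional vectors:

```python
# Check: new_W . (phi_hat - phi_tilde) = W . (phi_hat - phi_tilde) + ||phi_hat - phi_tilde||^2

W   = [1.0, -2.0, 0.5, 3.0]   # current weights (arbitrary)
y_v = [2.0, 1.0, 0.0, -1.0]   # phi(x^n, y_hat^n)
z_v = [1.0, 1.0, 2.0, 0.0]    # phi(x^n, y_tilde^n)

d     = [a - b for a, b in zip(y_v, z_v)]   # phi_hat - phi_tilde
new_W = [w + di for w, di in zip(W, d)]     # perceptron update
dot   = lambda u, v: sum(a * b for a, b in zip(u, v))

lhs = dot(new_W, d)
rhs = dot(W, d) + dot(d, d)
print(lhs == rhs)  # True: the update grows the gap by ||phi_hat - phi_tilde||^2
```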
$new\_W$ raises $W\cdot\phi(x^n,\hat{y}^n)$ relative to $W\cdot\phi(x^n,\widetilde{y}^n)$: comparing the two sums above, the gap grows by $\sum_{i=1}^4(y_i-z_i)^2\geq 0$. When $W$ stops being updated, we can end the algorithm.
Can the algorithm above converge?
First, the conclusion: the number of iterations is unrelated to the number of candidate $y$. We now give a proof of termination:
Our proof assumes the separable case; the following derivation only holds when the data are separable.
If $W^0$ is initialized to 0, we can infer $\hat{w}\cdot w^k\geq k\delta$. We have shown that the numerator grows with $k$, but the denominator might also be changing.
Discussing the denominator: because $w^{k-1}$ is not $\hat{w}$, we need to update $w$ to remedy its mistake, and because a mistake occurred, $w^{k-1}\cdot(\phi(x^n,\hat{y}^n)-\phi(x^n,\widetilde{y}^n))<0$. Now, merging denominator and numerator: $\sqrt{k}\cdot\frac{\delta}{R}$ is a lower bound of $\cos\rho_k$.
$\cos\rho_k$ varies between this lower bound and the upper bound 1; since $\cos\rho_k\leq 1$, the number of updates satisfies $k\leq(R/\delta)^2$. I think the time actually consumed is still related to the number of candidate $y$, because the proof only uses the update step and ignores the $\arg\max$ step. The proof only shows that the number of iterations is finite, and that this number is unrelated to the number of candidate $y$.
If $\delta$ is larger, the number of iterations can be smaller. But if we simply double $\delta$ by rescaling, $R$ also becomes twice as big as before. You should find a good feature function that produces a larger margin, rather than simply multiplying.
Non-separable case
It is very hard to find a feature function that makes the data separable, so we turn to non-separable data. In this case we define a cost function $C$ that evaluates how bad a $W$ is, then pick the $W$ minimizing $C$, even though that $W$ does not separate the data.
The minimum value of this cost is zero. The expression is simple; we need not choose a more complex cost, such as the sum of the first three scores minus $W\cdot\phi(x^n,\hat{y}^n)$, which besides might be too hard to compute.
How do we pick the $W$ minimizing $C$?
We can use stochastic gradient descent. Of course, we need to consider how to compute the gradient of the $\max$. All we need is $\nabla C^n$. Consider the space of $W$: when a specific $W$ is given, the $y$ maximizing $W\cdot\phi(x^n,y)$ is fixed, so within that region we can compute the gradient of $C^n$. For $\max\limits_{y} W\cdot\phi(x^n,y)$, we can use the solution of problem 2. The algorithm is as follows:
There is an error in this slide: $\max$ should be $\arg\max$. We locate the region of $W$ via the $\arg\max$ operation. If we set $\eta=1$, we are doing the Structured Perceptron, so this stochastic gradient descent is a generalization of the Structured Perceptron algorithm.
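A sketch of one stochastic (sub)gradient step under these assumptions (finite candidates, toy `phi`); with $\eta=1$ it reproduces the perceptron update:

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def sgd_step(W, x, y_hat, candidates, phi, eta):
    """One (sub)gradient step on C^n = max_y W.phi(x,y) - W.phi(x,y_hat)."""
    y_tilde = max(candidates, key=lambda y: dot(W, phi(x, y)))  # argmax, not max
    grad = [a - b for a, b in zip(phi(x, y_tilde), phi(x, y_hat))]
    return [w - eta * g for w, g in zip(W, grad)]

# With eta = 1 this is exactly the Structured Perceptron update.
phi = lambda x, y: [x, 0.0] if y == 0 else [0.0, x]
print(sgd_step([0.0, 0.0], 1.0, 1, [0, 1], phi, 1.0))  # [-1.0, 1.0]
```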
Considering Errors
The algorithm above treats all incorrect $y$ equally, which is not good: better $y$'s get no credit for being close; the algorithm only wants to put $\hat{y}$ at the top. Now we want to sort the scores $W\cdot\phi$ so that $y$'s close to $\hat{y}$ score higher than those far from $\hat{y}$. This algorithm is better than the one above, and it is safer for testing: if the test data differ slightly from the training data, the distance between the top-scoring $y$ and $\hat{y}$ is still not large, so the result is acceptable and better than the result of the algorithm above.
Now we modify our cost function to obtain the following behavior: a $y$ close to the correct box should have a smaller gap in evaluation score from the correct box.
First, we define an error function $\delta$ to evaluate the distance between $y$ and $\hat{y}$. The error function depends on your task; it is up to you. With our definition below, $\delta$ is a positive number, and the function $A$ computes the area of a box.
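As one concrete possibility, here is the $1-\mathrm{IoU}$ form of $\delta$ built from the area function $A$; storing boxes as `(x1, y1, x2, y2)` corners is my assumption, not the lecture's:

```python
# delta(y_hat, y) = 1 - intersection/union of the two boxes.
# A computes the area of a box given as (x1, y1, x2, y2) with x1<x2, y1<y2.

def A(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def delta(y_hat, y):
    """0 for identical boxes; approaches 1 as the boxes stop overlapping."""
    ix1, iy1 = max(y_hat[0], y[0]), max(y_hat[1], y[1])
    ix2, iy2 = min(y_hat[2], y[2]), min(y_hat[3], y[3])
    inter = A((ix1, iy1, ix2, iy2))
    union = A(y_hat) + A(y) - inter
    return 1.0 - inter / union

print(delta((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 (perfect match)
print(delta((0, 0, 2, 2), (1, 0, 3, 2)))  # larger for worse overlap
```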
We get another cost function, shown below. We choose the $y$ with the biggest $w\cdot\phi+\delta$, then require the score of $\hat{y}$ to exceed that of this $y$. The minimum of $C^n$ is zero: when $w\cdot\phi(x^n,\hat{y}^n)$ is bigger than every $w\cdot\phi+\delta$, the difference $w\cdot\phi(x^n,\hat{y}^n)-w\cdot\phi$ is bigger than $\delta$ for every incorrect $y$. This cost function aims to give a $y$ close to $\hat{y}$ a smaller score difference from $\hat{y}$. That score difference is named the margin.
The definition of $\delta$ is up to you, but note this is no longer exactly problem 2: we need to re-derive a solution for this $\max$ question. If you define a complex $\delta$, the question becomes difficult to solve, so think carefully about the definition of $\delta$.
We update $W$ by a similar method. The new cost function only changes which $y$ attains the maximum, so the mathematical expression for updating $W$ only changes in its $y$. Using a different $y$, the result may differ; the current algorithm performs better than using $\max\limits_{y}(w\cdot\phi)-w\cdot\phi(x^n,\hat{y}^n)$.
There is an error in the expression above: $\max$ should be $\arg\max$. There is another viewpoint for interpreting the cost function above.
$C'$ is hard to minimize: when $W$ changes a little, $\delta$ may not change at all, which means $\delta$ is a staircase function of $W$; in most places its gradient is zero, and at the edges it is infinite. Here $\widetilde{y}^n$ is our predicted label, and $C'$ evaluates the difference between the predicted label and the target label. We want to minimize $C'$, but it is hard, so $C$ is the surrogate of $C'$. Although the minimum of $C^n$ does not mean $C'$ is minimal, we can still make $C'$ as small as possible, so the new cost function is useful.
Proof that $\delta(\hat{y}^n,\widetilde{y}^n)<C^n$:
It is simple. We give more cost functions below. It is also very simple to prove the Slack Variable Rescaling method, so I omit the proof. Why was Slack Variable Rescaling proposed? $\delta$ might be on a different scale from $w\cdot\phi$, and summing them could produce a meaningless number, while multiplication alleviates this problem.
Regularization
We want to improve the model's generalization for better test results, so we add regularization to the cost function. $\lambda$ is just a hyperparameter; if you are familiar with DNN loss functions, this is straightforward. We give the gradient of the new cost function below; it is the same as in a DNN. I think the $\frac{1}{2}$ is there for convenience when computing the gradient.
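Indeed, the $\frac{1}{2}$ just cancels the exponent when differentiating the regularizer:

```latex
\frac{\partial}{\partial W}\left(\frac{1}{2}\|W\|^2\right)
  = \frac{1}{2}\cdot 2W = W
```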
Structured SVM
Structured SVM is a specific case of structured learning. It is a linear model; you can define a non-linear function $F$ to produce a non-linear structured learning model. First, we transform $C^n$:
Is $C^n$ equivalent to the expression in the blue box? Yes, and the reason is that we are minimizing $C$: $C^n+1$ also satisfies the expression in the blue box, but it does not minimize $C$, whereas $C^n$, being exactly the $\max$, does minimize $C$. So they are equivalent. Then we transform the original expression into the following one: the expression in the green box is equivalent to the one in the yellow box.
$\epsilon$ is defined as a slack variable. When $W$ is fixed, $C^n$ in the green box is determined, but $\epsilon^n$ in the yellow box is not; we must solve for $\epsilon^n$. Although this may feel odd, the results of the two formulations are approximately equal. When $y=\hat{y}^n$ the constraint is vacuous, so we remove it. Since $C^n\geq 0$, we have $\epsilon^n\geq 0$.
To build intuition for the expression above, we draw the following graph. Our ideal result is $C^n=0$, which means the margin equals $\delta$. But we might find no $w$ that achieves this; in other words, when we reach the minimum of $C$, $C^n$ might be non-zero, and the following inequalities might be impossible to satisfy.
So we add a slack variable $\epsilon$ to relax the constraints. With $\epsilon\geq 0$, a smaller margin becomes feasible; this is the intuitive interpretation of $\epsilon\geq 0$. Adding $\epsilon$ to permit a smaller margin corresponds to minimizing $C$ with $C^n\neq 0$: the constraints on $W$ are relaxed. Because $\epsilon$ relaxes the constraints, it is named a slack variable. $\epsilon$ cannot be very large, otherwise any $w$ would satisfy all constraints; so we also minimize $\epsilon$, relaxing the constraints as little as possible.
For example:
Then we get a quadratic program. Because SVM is also a quadratic programming problem, the name of the following formulation (Structured SVM) contains SVM. We can use a QP package to solve Structured SVM.
Cutting Plane Algorithm for Structured SVM
How do we solve a quadratic program with this many constraints? The Cutting Plane Algorithm (CPA) can solve it.
We are minimizing $C$, and all constraints are linear inequalities in $w$ and $\epsilon$; as we know, linear inequalities can be drawn in a plane (a standard mathematical exercise), giving the left graph above. We then choose the point that minimizes $C$ within the diamond-shaped feasible region. $A^n$ is our working set, and only its elements influence the solution.
The algorithm proceeds as follows. How do we solve the Add operation?
Initially, $A^n$ is empty, meaning we have no constraints. Given the blue point (an $\epsilon$ and a $W$), many constraints are violated. We find the most violated constraint and add it to the set $A^n$. Then we compute the new point minimizing $C$ within the constraint region of the new $A^n$, move to that point, and compute the currently most violated constraint. The process continues: add the constraint to $A^n$, then iterate, and iterate... When no constraint is violated, the algorithm is complete.
How do we find the most violated constraint?
Why do we keep a working set $A$ for each training example? I think one reason is that the constraint on $\epsilon^n$ only involves one training example; another is that we obtain a small valid working set once all the $A$'s stop changing. This is just one strategy: you can define your own Add strategy and your own measure of the degree of violation. In the definition of the degree of violation, every $y$ shares the same $\epsilon$ and the same $\phi(x,\hat{y})$ for a given training example, so those terms are omitted. The degree of violation is up to you; you can choose another method.
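As a sketch: because $\epsilon^n$ and $W\cdot\phi(x^n,\hat{y}^n)$ are shared by every candidate, ranking violations for one example reduces to maximizing $\delta(\hat{y}^n,y)+W\cdot\phi(x^n,y)$ over $y$ (the finite candidate set and 0/1 $\delta$ below are toy assumptions):

```python
# Most violated constraint for one training example: the shared terms
# epsilon^n and W.phi(x^n, y_hat^n) are dropped from the ranking.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def most_violated(W, x, y_hat, candidates, phi, delta):
    return max((y for y in candidates if y != y_hat),
               key=lambda y: delta(y_hat, y) + dot(W, phi(x, y)))

# Toy run: 3 labels, indicator features, 0/1 error function.
phi = lambda x, y: [x if y == k else 0.0 for k in range(3)]
delta01 = lambda a, b: 0.0 if a == b else 1.0
print(most_violated([0.0, 3.0, 1.0], 1.0, 0, range(3), phi, delta01))  # 1
```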
The whole pseudo-code of CPA
$\epsilon$ does not influence the relative degree of violation, so we need not remember any $\epsilon$; but each $\epsilon$ is computed whenever $w$ is computed.
An Example
First, initialize $A^1$ and $A^2$.
Second, find the most violated constraint for each example. Third, we get a QP with two constraints.
We get the $W_1$ minimizing $C$, then find the next most violated constraints.
You can suppose that $\overline{y}^{1}$ or $\overline{y}^{2}$ is computed by an optimization routine rather than by visiting every $y$. Then we solve a QP with four constraints, and the process repeats iteratively.
According to the original Structured SVM paper, the number of iterations has an upper bound. As described above, the definition of the most violated constraint is up to you. The original paper does not directly add the most violated constraint to $A^n$; it defines a degree of badness for constraints, in other words, a threshold on the most violated constraint: a constraint added to $A^n$ must exceed this degree. If $\lambda$ is larger, convergence is slower; with a larger badness threshold, convergence is faster. In short, with small modifications to the expressions above, the algorithm still converges with a bounded number of iterations, and you can use it. The original paper notes that this upper bound is unrelated to the number of candidate $y$, the same property as the Structured Perceptron. The upper bound on the number of iterations depends on parameters such as $\lambda$ and the badness threshold.
Multi-class and Binary SVM
Multi-class SVM has a fixed, finite number of classes $y$, so our question is simpler: we only need to analyze the degree of matching between finitely many $y$ and $x$. In fact, it is a multi-class or binary classification problem.
We can also regard the object detection problem as a classification problem, because "$x$ corresponds to $y$" can be read as "$x$ belongs to class $y$".
Multi-class
Question 1: Evaluation
It is simple: we can express Multi-class SVM with the $F$ of Structured SVM. For example, $F(\vec{x},k)=w^k\cdot\vec{x}$, where $w^k$ computes the degree of correspondence between $x$ and class $k$.
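A minimal sketch of this $F$ with one (made-up) weight vector per class:

```python
# Multi-class SVM evaluation: F(x, k) = w^k . x, one weight vector per class.
# Inference simply picks the class with the highest score.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

W = {                      # hypothetical learned weights, one vector per class
    "dog": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "car": [-1.0, -1.0],
}

def F(x, k):
    return dot(W[k], x)

def classify(x):
    return max(W, key=lambda k: F(x, k))

print(classify([2.0, 0.5]))  # dog: score 2.0 beats cat's 0.5 and car's -2.5
```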
Question 2: Inference
Because there are finitely many $y$, we can enumerate them.
Question 3: Training
We want the correct class score $W^{\hat{y}^n}\cdot\vec{x}$ to be bigger than every incorrect class score $W^{y}\cdot\vec{x}$; the form is the same as in Structured SVM. The choice of $\delta$ is up to you. For example, with $y\in\{dog,cat,bus,car\}$, if you particularly do not want $cat$ misclassified as $car$, set a bigger $\delta(cat,car)$, which forces a bigger score difference between $cat$ and $car$. We only need to iterate over $N(K-1)$ constraints.
Binary SVM
Binary SVM is the special case of multi-class SVM with $K=2$. $\delta(\hat{y}^n,y)$ can be set to 1, so we only have two constraints.
Beyond Structured SVM
Structured SVM is a linear model, so it has limited expressive power. If you want a more powerful Structured SVM, you can use a DNN (Deep Neural Network) to extract features. If we optimize $W$ with gradient descent rather than quadratic programming, the DNN and the Structured SVM can be jointly trained. If you do not want a linear model at all, you can replace the Structured SVM with a DNN; $C$ is then our loss function. The $\max$ op is also compatible with gradient descent.
Summary
For separable data, we can use gradient descent (the Structured Perceptron) to obtain $W$, and the number of iterations is finite. For non-separable data, we can use gradient descent or the cutting plane algorithm to obtain $W$, corresponding to two lines of thinking: the first minimizes a cost function directly; the second transforms the problem into a quadratic program, which converges under a suitable definition of the most violated constraint.