This article records my study process. Link: SVM. Lagrange multipliers and the KKT conditions are not covered here; we mainly use gradient descent.
Binary Classification
Because $g(x)$ only outputs $+1$ or $-1$, the ideal loss $\delta$ cannot be minimized by gradient descent: $\delta$ is a piecewise (step) function. We use another loss function instead.
$\hat{y}^n$ is $1$ or $-1$; it is defined this way for convenience, so the loss for the two classes can be written uniformly in terms of $\hat{y}^n f(x)$. For $\hat{y}^n=1$ we hope $f(x)$ is large; for $\hat{y}^n=-1$ we hope $f(x)$ is small. $f(x)=0$ is the boundary between the two classes.
For intuition (suppose the $y$ axis is $\delta$ or $l$):
Using square loss here is unreasonable: when $\hat{y}^n f(x)$ is large, the loss is also large.
PS: I think the expression above is not meaningful unless $f(x)$ is close to $1$ or $-1$. Next, we use sigmoid + square loss. Its curve is the blue one.
Refer to 简单谈谈Cross Entropy Loss about Softmax and Cross Entropy. Next, we use sigmoid + cross entropy, which is reasonable: if $l$ is divided by $\ln 2$, this loss is an upper bound of the ideal loss, so minimizing $l$ also minimizes the ideal loss. Compared with sigmoid + square loss, gradient descent behaves better with this method; sigmoid + square loss does not make good progress when $\hat{y}^n f(x)$ is a small positive number. Cross entropy is easier to train, so we often use it (e.g., in logistic regression).
Next, we introduce the hinge loss. Sometimes the hinge loss is more robust than cross entropy, for example when there are outliers, because it is sparse over the training data and only considers the support vectors (the kernel section makes this clear), while cross entropy considers all training data.
Why do we use $1$? Because then the hinge function is an upper bound of the ideal loss.
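As a sanity check on the bounds above, here is a minimal numpy sketch (function names are my own) comparing the losses as functions of $\hat{y}^n f(x)$:

```python
import numpy as np

def ideal_loss(yf):
    # The ideal 0/1 loss delta: 1 when the sign is wrong, else 0.
    return (yf < 0).astype(float)

def sigmoid_square_loss(yf):
    # Square loss after sigmoid: (sigma(yf) - 1)^2; nearly flat for very negative yf.
    return (1.0 / (1.0 + np.exp(-yf)) - 1.0) ** 2

def sigmoid_ce_loss(yf):
    # Cross entropy after sigmoid, log(1 + e^{-yf}); divided by ln 2 it
    # upper-bounds the ideal loss.
    return np.log1p(np.exp(-yf)) / np.log(2)

def hinge_loss(yf):
    # Hinge loss max(0, 1 - yf), also an upper bound of the ideal loss.
    return np.maximum(0.0, 1.0 - yf)

yf = np.linspace(-3, 3, 601)
# Both surrogates sit above the ideal loss everywhere on this grid.
assert np.all(sigmoid_ce_loss(yf) >= ideal_loss(yf))
assert np.all(hinge_loss(yf) >= ideal_loss(yf))
```

Plotting these four curves against $\hat{y}^n f(x)$ reproduces the figure discussed above.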
Linear SVM
We give the linear SVM model. The loss $l$ and the regularization term are convex functions, so we can use gradient descent to minimize the hinge loss. Thus an SVM can even be used as the classifier layer of a deep network. There is a reference.
SVM gradient descent
Here $c^n(w)$ is $+1$, $-1$, or $0$, and $x^n_i$ is a real number, one dimension of the feature vector.
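The update above can be sketched as (sub)gradient descent on the regularized hinge loss; this is my own minimal numpy version (the data and the name `svm_gd` are made up for illustration):

```python
import numpy as np

def svm_gd(X, y, eta=0.1, lam=0.01, epochs=200):
    """Full-batch (sub)gradient descent on hinge loss + L2 regularization.

    X: (N, d) features; y: (N,) labels in {+1, -1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        f = X @ w + b
        # c^n(w) = dl/df = -y^n when y^n f(x^n) < 1, else 0.
        c = np.where(y * f < 1.0, -y, 0.0)
        w -= eta * (X.T @ c / n + lam * w)
        b -= eta * c.mean()
    return w, b

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_gd(X, y)
assert np.all(np.sign(X @ w + b) == y)  # all points classified correctly
```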
Linear SVM another formulation
Suppose the training set is linearly separable. The expression above differs from the formulation we usually see, so we transform the hinge-loss form into the familiar one. With the hinge loss we are computing the soft-margin SVM; for the hard margin, see my other article, which supplements the SVM part of Zhou Zhihua's Watermelon Book. The $\max$ becomes $\epsilon^n$ here. The variant is the following:
$$
\text{Minimize the loss function:}\quad L(f)=\sum_n\epsilon^n+\lambda||w||_2\\
\text{s.t. }\ \epsilon^n\ge 0,\qquad \hat{y}^n f(x^n)\ge 1-\epsilon^n
$$
You can verify the expression above; in fact, it is similar to the Watermelon Book by Zhou Zhihua. Here, I think we should add a hyper-parameter in front of $\sum_n\epsilon^n$: when this hyper-parameter tends to infinity, we recover the hard-margin SVM.
$$
\min_{w,b}\ \frac{1}{2}||w||^2\\
\text{s.t. }\ y_i(w^Tx_i+b)\ge 1,\quad i=1,2,\dots,m,
$$
where $w$ denotes the weights.
In short, we look for the weights of smallest norm that satisfy $y^n f(x^n)\ge 1$. If the training set is linearly separable, we can achieve $\epsilon^n=0$: for example, scale $w$ and $b$ so that the minimum margin equals $1$. So we need to minimize $||w||^2$. The difference between the two expressions is the type of margin.
In practice it is hard to make data linearly separable. To relax the problem, we use a soft margin, which allows a few samples to be classified wrongly.
Why is this variant equivalent to the expression proposed before? We want to minimize $L(f)$, i.e., the $\epsilon$'s. When the $\epsilon$'s are minimized, this quadratic program is equivalent to minimizing the hinge loss, with $\epsilon^n=\max(0,1-\hat{y}^n f(x^n))$. $\epsilon$ cannot be arbitrarily large: $\epsilon^n$ is the smallest number that is at least $0$ and at least $1-\hat{y}^n f(x^n)$. So under minimization the two formulations are equal. $\epsilon^n$ is a slack variable.
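This equivalence can be checked numerically; a tiny sketch (`optimal_slack` is my own name) of the fact that the smallest feasible slack is exactly the hinge loss:

```python
def optimal_slack(yf):
    # Smallest epsilon satisfying epsilon >= 0 and yf >= 1 - epsilon:
    # exactly the hinge loss max(0, 1 - yf).
    return max(0.0, 1.0 - yf)

for yf in [-2.0, 0.5, 1.5]:
    eps = optimal_slack(yf)
    # Feasibility: both constraints hold.
    assert eps >= 0.0 and yf >= 1.0 - eps
    # Minimality: any smaller epsilon violates a constraint.
    if eps > 0.0:
        smaller = eps - 1e-6
        assert not (smaller >= 0.0 and yf >= 1.0 - smaller)
```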
Dual representation
Many SVM slides derive the expression below via Lagrange duality. Here we explain from another perspective why $\hat{w}$ is a linear combination of the $x^n$:
$$
\hat{w}=\sum_n\hat{a}_n x^n
$$
where $\hat{a}_n$ may be sparse. (Up to the learning rate it is an integer, since each gradient step changes it by $\eta c^n(w)$ with $c^n(w)\in\{0,\pm1\}$.)
If we initialize $w$ to $0$:
$$
w=w-\eta\sum_nc^n(w)x^n\\
c^n(w)=\frac{\partial l(f(x^n),\hat{y}^n)}{\partial f(x^n)}=0\ \text{or}\ -\hat{y}^n\ (\text{i.e., }+1\ \text{or}\ -1)
$$
It is then obvious that $\hat{w}$ is a linear combination of the $x^n$. The hinge-loss gradient is usually zero, so many $x^n$ are not used to determine $\hat{w}$; the points with non-zero $\hat{a}_n$ are the support vectors. For logistic regression the coefficients are always non-zero, because it uses cross entropy, whose gradient is never exactly zero. So with a cross-entropy loss we cannot get a sparse $\hat{a}$: every data point influences the result $\hat{w}$. As mentioned above, this is why the hinge loss is more robust.
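The contrast between the two gradients can be checked numerically; a sketch with made-up data (the weight vector $w$ is just illustrative):

```python
import numpy as np

np.random.seed(0)
# Two well-separated clusters.
X = np.vstack([np.random.randn(20, 2) + 2.0, np.random.randn(20, 2) - 2.0])
y = np.hstack([np.ones(20), -np.ones(20)])
w = np.array([1.0, 1.0])       # an illustrative weight vector
f = X @ w

# Hinge loss: dl/df is 0 whenever y^n f(x^n) >= 1 -> sparse a-hat.
hinge_grad = np.where(y * f < 1.0, -y, 0.0)
# Cross entropy l = log(1 + e^{-yf}): dl/df = -y / (1 + e^{yf}), never 0.
ce_grad = -y / (1.0 + np.exp(y * f))

assert np.count_nonzero(hinge_grad) < len(y)   # most points drop out
assert np.all(ce_grad != 0.0)                  # every point still pulls on w
```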
Step 1
Now we can derive a new formulation of $f(x)$. Suppose $x$ lives in a linear (inner-product) space.
$$
\text{Since }\ w=\sum_n a_nx^n=[x^1,x^2,\dots,x^n]\begin{bmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{bmatrix}=Xa,\\
f(x)=w^Tx=a^TX^Tx=[a_1,a_2,\dots,a_n]\begin{bmatrix} (x^{1})^T\\(x^2)^T\\\vdots\\(x^n)^T \end{bmatrix}\begin{bmatrix} x_1\\x_2\\\vdots\\x_m \end{bmatrix}\\
\text{Finally,}\quad f(x)=\sum_na_n\big((x^n)^Tx\big)=\sum_n a_nK(x^n,x)
$$
where $x^n$ is a column vector and $K$ is defined as the kernel function. Now we have a new model, and we want to find $a_1,\dots,a_n$. Because our $x$ is usually not in a linear space, we use the function $K$ to replace the inner product of the linear space.
Steps 2 and 3
We find the best $a_1,\dots,a_n$ to minimize the loss function (PS: the YouTuber does not give an update method here, but I think gradient descent or a QP package could be used). We substitute the new model $f(x)$ for the original $f(x)$ in the loss function:
$$
L(f)=\sum_nl\Big(\sum_{n'} a_{n'}K(x^{n'},x^n),\ \hat{y}^n\Big)
$$
We do not need to know the specific $x^n$; we only need to know $K$. This is solved with the kernel trick. The kernel trick can be used wherever it is effective, e.g., logistic regression and linear regression.
Kernel trick
Why do we define a kernel function rather than directly computing the inner product between two transformed vectors? Because it makes the inner-product computation efficient: we can define different kernel functions that compute the inner product cheaply. How does this save computation? Without a kernel function, to apply a linear model we must compute the feature transformation explicitly; for example, in a neural network, hidden layers compute feature transformations. The transformed feature is a high-dimensional vector (denoted $\phi$), and computing inner products of such vectors is expensive. Conversely, with the kernel we compute the inner product of $x$ and $z$ directly. The two ways are equivalent, as follows ($\phi$ is the feature vector of $x$ or $z$):
For example:
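A concrete 2-D sketch of this equivalence for the square kernel $K(x,z)=(x^Tz)^2$ (my own toy feature map, without bias terms):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the 2-D square kernel K(x, z) = (x^T z)^2.
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def K(x, z):
    return float(x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# Computing K directly equals the inner product in the transformed space,
# without ever materializing phi.
assert abs(K(x, z) - phi(x) @ phi(z)) < 1e-9
```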
Radial Basis Function Kernel
Another kernel function (we use a Taylor expansion to expand the exponential):
Note that it may overfit due to the infinite dimension. In the kernel trick section we used a square expression, whose feature dimension is finite (at most 4).
Sigmoid Kernel
$$
K(x,z)=\tanh(x^Tz)
$$
You can use a similar Taylor expansion to find two high-dimensional vectors $\phi(x),\phi(z)$ whose inner product equals $K(x,z)$.
If we use the sigmoid kernel, we get a neural network with one hidden layer.
Thinking about the inner-product computation: when two data points are very close, the value of $K$ is large. So $K$ behaves like a similarity between two data points, and we can work with the value of $K$ alone. You can define your own kernel function and then use Mercer's theorem to check whether it equals an inner product of two high-dimensional vectors. Here we give the reference.
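One practical face of the Mercer check is that a valid kernel's Gram matrix on any finite data set must be symmetric positive semidefinite; a numpy sketch of this necessary condition (`is_valid_gram` is my own name):

```python
import numpy as np

def is_valid_gram(G, tol=1e-9):
    # Necessary Mercer condition on a finite sample: the Gram matrix
    # must be symmetric and positive semidefinite.
    if not np.allclose(G, G.T):
        return False
    return bool(np.linalg.eigvalsh(G).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# RBF Gram matrix: exp(-||x_i - x_j||^2) is a valid kernel.
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
assert is_valid_gram(np.exp(-sq_dist))

# A negated linear "kernel" fails the check.
assert not is_valid_gram(-(X @ X.T))
```

Passing this test on one data set does not prove the kernel valid in general, but failing it on any data set disproves it.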
The kernel function is a hyper-parameter and influences the effectiveness of the model. Some kernel functions cannot make the data separable. For different tasks we choose different kernel functions; if you are unsure, choose the RBF kernel. For text data we usually use the linear kernel, $k(x_i,x_j)=x_i^Tx_j$.
SVM related methods
Our SVM can’t conduct multi-classifier task. It need to be extended. In contrast with SVM,regression need more data.
Relationship with deep learning
A hidden layer transforms features; our kernel function likewise maps data to a high-dimensional space, where the data becomes linearly separable. Then we use the hinge loss to get a linear classifier. The kernel function can also be learnable; the paper below is the reference.