1、An Alternative Derivation of the SVM
Instead of starting from the separating hyperplane, we directly postulate the two boundaries of the margin:
$$
\begin{aligned}
\text{Plus-plane} &= \{\boldsymbol{x}: \boldsymbol{w} \cdot \boldsymbol{x} + b = +1\} \\
\text{Minus-plane} &= \{\boldsymbol{x}: \boldsymbol{w} \cdot \boldsymbol{x} + b = -1\}
\end{aligned}
$$
The margin is then simply the distance between these two planes.
Recall:
Given two parallel lines with equations $ax + by + c_1 = 0$ and $ax + by + c_2 = 0$, the distance between them is given by:

$$
d = \frac{\left|c_2 - c_1\right|}{\sqrt{a^2 + b^2}}
$$
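Applying this to the plus-plane $\boldsymbol{w} \cdot \boldsymbol{x} + (b - 1) = 0$ and the minus-plane $\boldsymbol{w} \cdot \boldsymbol{x} + (b + 1) = 0$, with $\|\boldsymbol{w}\|$ playing the role of $\sqrt{a^2 + b^2}$, gives the margin width that appears in the objective below:

$$
\text{Margin} = \frac{|(b + 1) - (b - 1)|}{\|\boldsymbol{w}\|} = \frac{2}{\|\boldsymbol{w}\|}
$$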
Thus the problem becomes:
$$
\begin{aligned}
& \text{maximize} \quad \frac{2}{\|\mathbf{w}\|} \\
& \text{such that} \\
& \quad \text{for } y_i = +1: \quad \mathbf{w}^T \mathbf{x}_i + b \geq 1 \\
& \quad \text{for } y_i = -1: \quad \mathbf{w}^T \mathbf{x}_i + b \leq -1
\end{aligned}
$$
Going a step further, maximizing $\frac{2}{\|\mathbf{w}\|}$ is equivalent to minimizing $\|\mathbf{w}\|^2 = \sum_i w_i^2$, so we have
$$
\begin{aligned}
& \underset{\mathbf{w}, b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2 \\
& \text{subject to } \forall \mathbf{x}_i \in D: \; y_i\left(\mathbf{x}_i \cdot \mathbf{w} + b\right) \geq 1
\end{aligned}
$$
This is exactly the same model as before.
2、Quadratic Programming
A quadratic programming (QP) problem has the following form:
$$
\text{Find } \underset{\mathbf{u}}{\arg\max} \quad c + \mathbf{d}^T \mathbf{u} + \frac{\mathbf{u}^T R \mathbf{u}}{2}
$$
subject to a number of inequality constraints
$$
\begin{gathered}
a_{11} u_1 + a_{12} u_2 + \ldots + a_{1m} u_m \leq b_1 \\
a_{21} u_1 + a_{22} u_2 + \ldots + a_{2m} u_m \leq b_2 \\
\vdots \\
a_{n1} u_1 + a_{n2} u_2 + \ldots + a_{nm} u_m \leq b_n
\end{gathered}
$$
and a number of equality constraints
$$
\begin{gathered}
a_{(n+1)1} u_1 + a_{(n+1)2} u_2 + \ldots + a_{(n+1)m} u_m = b_{n+1} \\
a_{(n+2)1} u_1 + a_{(n+2)2} u_2 + \ldots + a_{(n+2)m} u_m = b_{n+2} \\
\vdots \\
a_{(n+e)1} u_1 + a_{(n+e)2} u_2 + \ldots + a_{(n+e)m} u_m = b_{n+e}
\end{gathered}
$$
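To make the general form concrete, here is a minimal sketch of a tiny QP in exactly this shape, solved with the cvxpy modeling library (this assumes cvxpy and numpy are available; all coefficient values below are made up purely for illustration):

```python
import numpy as np
import cvxpy as cp

# A tiny QP in the general form: maximize c + d^T u + (1/2) u^T R u
# subject to A_ineq u <= b_ineq and A_eq u = b_eq.
m = 2
c = 1.0
d = np.array([1.0, -1.0])
R = -2.0 * np.eye(m)             # R must be negative semidefinite so the maximization is concave

A_ineq = np.array([[1.0, 1.0]])  # u_1 + u_2 <= 3
b_ineq = np.array([3.0])
A_eq = np.array([[1.0, -1.0]])   # u_1 - u_2 = 0
b_eq = np.array([0.0])

u = cp.Variable(m)
objective = cp.Maximize(c + d @ u + cp.quad_form(u, R) / 2)
constraints = [A_ineq @ u <= b_ineq, A_eq @ u == b_eq]
cp.Problem(objective, constraints).solve()
print(u.value)  # the maximizer u*
```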
The problem our linear SVM needs to solve is
$$
\begin{aligned}
& \left\{\vec{w}^*, b^*\right\} = \min_{\vec{w}, b} \sum_i w_i^2 \\
& \text{subject to } y_i\left(\vec{w} \cdot \vec{x}_i + b\right) \geq 1 \text{ for all training data } \left(\vec{x}_i, y_i\right)
\end{aligned}
$$
which is, in essence, exactly a QP problem:
$$
\left\{\vec{w}^*, b^*\right\} = \underset{\vec{w}, b}{\operatorname{argmax}}\left\{0 + \vec{0} \cdot \vec{w} - \vec{w}^T \mathbf{I}_n \vec{w}\right\}
$$

subject to

$$
\begin{aligned}
& y_1\left(\vec{w} \cdot \vec{x}_1 + b\right) \geq 1 \\
& y_2\left(\vec{w} \cdot \vec{x}_2 + b\right) \geq 1 \\
& \ldots \\
& y_N\left(\vec{w} \cdot \vec{x}_N + b\right) \geq 1
\end{aligned}
$$
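To make the correspondence concrete, here is a minimal sketch that feeds this hard-margin formulation to a QP solver via cvxpy on a tiny linearly separable toy set (the data points are made up for illustration; any off-the-shelf QP solver would do):

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, dim = X.shape

w = cp.Variable(dim)
b = cp.Variable()

# Hard-margin primal QP: minimize sum_i w_i^2
# subject to y_i (w . x_i + b) >= 1 for every training point.
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
print("margin =", 2 / np.linalg.norm(w.value))
```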
3、Soft Margin SVM
The hard-margin SVM requires the samples to be linearly separable (just look at its constraints). So what happens when the data are not linearly separable?
In that case we want a large margin while keeping the classification loss as small as possible, so the problem becomes
$$
\text{Minimize} \quad \boldsymbol{w} \cdot \boldsymbol{w} + C\,(\#\text{train errors})
$$
This raises two issues. First, this is no longer a QP problem, and QP solvers are very mature. Second, there is a hyperparameter C, also called the tradeoff parameter: a larger C means you care more about a small classification error, while a smaller C means you care more about a large margin, so choosing C is itself something of an art.
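In practice, library implementations expose this tradeoff directly as the C parameter. As a rough illustration (a minimal sketch assuming scikit-learn is available; the toy data, including one deliberately mislabeled point, is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one mislabeled "noise" point (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])   # the last point is the noise

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"train accuracy = {clf.score(X, y):.2f}")
```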
To make the problem a QP again, we model the error as the distance from a misclassified point to the separating plane:
$$
\begin{aligned}
& \left\{\vec{w}^*, b^*\right\} = \min_{\vec{w}, b} \sum_{i=1}^{d} w_i^2 + c \sum_{j=1}^{N} \varepsilon_j \\
& y_1\left(\vec{w} \cdot \vec{x}_1 + b\right) \geq 1 - \varepsilon_1 \\
& y_2\left(\vec{w} \cdot \vec{x}_2 + b\right) \geq 1 - \varepsilon_2 \\
& \ldots \\
& y_N\left(\vec{w} \cdot \vec{x}_N + b\right) \geq 1 - \varepsilon_N
\end{aligned}
$$
But what if $\varepsilon_i < 0$? That is something we do not want: when a sample is classified correctly, we want its loss to be $0$. So we additionally require $\varepsilon_i \geq 0$:
$$
\begin{aligned}
& \left\{\vec{w}^*, b^*\right\} = \min_{\vec{w}, b} \sum_{i=1}^{d} w_i^2 + c \sum_{j=1}^{N} \varepsilon_j \\
& y_1\left(\vec{w} \cdot \vec{x}_1 + b\right) \geq 1 - \varepsilon_1, \quad \varepsilon_1 \geq 0 \\
& y_2\left(\vec{w} \cdot \vec{x}_2 + b\right) \geq 1 - \varepsilon_2, \quad \varepsilon_2 \geq 0 \\
& \ldots \\
& y_N\left(\vec{w} \cdot \vec{x}_N + b\right) \geq 1 - \varepsilon_N, \quad \varepsilon_N \geq 0
\end{aligned}
$$
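Here is a minimal sketch of this soft-margin QP, again using cvxpy (the toy data and the value of C are made up for illustration; the mislabeled point forces at least one slack variable to be positive):

```python
import numpy as np
import cvxpy as cp

# Toy data with one mislabeled point, so no hard-margin solution exists.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
n, dim = X.shape
C = 1.0  # tradeoff parameter, chosen arbitrarily for illustration

w = cp.Variable(dim)
b = cp.Variable()
eps = cp.Variable(n)  # slack variables, one per training point

# Soft-margin primal QP: minimize sum_i w_i^2 + C * sum_j eps_j
# subject to y_j (w . x_j + b) >= 1 - eps_j and eps_j >= 0.
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + b) >= 1 - eps, eps >= 0]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
print("slacks =", np.round(eps.value, 3))
```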
Now imagine a sample point that is wildly misclassified, infinitely far from the decision boundary: its corresponding loss is also unbounded. This is why we say that the SVM is sensitive to noise.
The loss function of the SVM is the hinge loss:
$$
\operatorname{hinge}(x) = \max(1 - x, 0)
$$
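At the optimum, each slack variable equals the hinge loss of its point's margin, so the soft-margin problem can also be read as minimizing $\sum_i w_i^2 + C \sum_j \operatorname{hinge}\!\left(y_j(\vec{w} \cdot \vec{x}_j + b)\right)$. A tiny numpy sketch of the hinge loss and this objective (function names and data are illustrative only):

```python
import numpy as np

def hinge(x):
    """hinge(x) = max(1 - x, 0), applied elementwise."""
    return np.maximum(1.0 - x, 0.0)

def soft_margin_objective(w, b, X, y, C):
    """||w||^2 plus C times the total hinge loss of the margins y_j (w . x_j + b)."""
    margins = y * (X @ w + b)
    return np.dot(w, w) + C * hinge(margins).sum()

# Illustrative values only.
X = np.array([[2.0, 2.0], [-2.0, -2.0], [2.5, 2.5]])
y = np.array([1.0, -1.0, -1.0])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```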