5 Methods of Pattern Recognition
5.1 Why Can Learning Machines Generalize?
Suppose that, following the ERM principle, we design a very complex learning machine (one with a large VC dimension) for a given number of training samples. The empirical risk on the training samples can then be made very small, but the confidence interval becomes large; this phenomenon is called overfitting (over-adaptation). We therefore want a trade-off between the two terms, which leads to two approaches:
- Keep the confidence interval fixed (by choosing an appropriately constructed machine) and minimize the empirical risk; neural networks implement this idea.
- Keep the empirical risk fixed (e.g., equal to zero in the completely separable case) and minimize the confidence interval; support vector machines implement this idea.
5.2 Sigmoid Approximation of Indicator Functions
Consider the set of indicator functions
$$ f(x, \omega) = \operatorname{sgn}\{ \omega \cdot x + b \}, \quad \omega \in \mathbb{R}^n,\ b \in \mathbb{R}. $$
When the training data cannot be separated without error for any $\omega \in \mathbb{R}^n$, the best we can hope for is the classification with the fewest errors. Finding it is NP-complete, and gradient-based methods cannot even locate a local minimum, because the derivative of the indicator function is either zero or undefined. This motivated replacing the indicator function with a differentiable approximation, the sigmoid.
A smooth monotonic function $S$ is called a sigmoid function if it satisfies
$$ S(-\infty) = -1, \qquad S(+\infty) = 1. $$
A typical example is
$$ S(u) = \tanh(u) = \dfrac{e^u - e^{-u}}{e^u + e^{-u}}. $$
For simplicity, ignore the constant bias and set
$$ f(x, \omega) = S(\omega \cdot x), \quad \omega \in \mathbb{R}^n, \\ R_{emp}(\omega) = \dfrac{1}{l} \sum\limits_{i=1}^{l} \big( y_i - S(\omega \cdot x_i) \big)^2. $$
This yields the following gradient descent procedure (where $n$ indexes the iterations):
$$ \operatorname{grad}_\omega R_{emp}(\omega) = -\dfrac{2}{l} \sum\limits_{j=1}^{l} \big( y_j - S(\omega \cdot x_j) \big) S'(\omega \cdot x_j)\, x_j^T \\ \omega_{new} = \omega_{old} - \gamma(n)\, \operatorname{grad}_\omega R_{emp}(\omega_{old}) $$
A sufficient condition for the gradient descent method to converge to a local minimum is that the gradient is bounded and that the coefficients satisfy
$$ \sum\limits_{n=1}^{\infty} \gamma(n) = \infty, \qquad \sum\limits_{n=1}^{\infty} \gamma^2(n) < \infty. $$
# In practice, however, the learning rate $\gamma$ is held constant and the iteration is stopped after finitely many steps (perhaps one imagines that $\gamma$ would satisfy the conditions above over the hypothetical infinite iterations following the finite stopping point).
# Also, $\dfrac{\partial \omega^T x}{\partial \omega} = Ix = x$ rather than $x^T$, so the $x_j^T$ above is questionable. (Intuitively, too: $\omega$ and $x$ have the same shape, and the gradient of the empirical risk must have the same shape as $\omega$, so it should be $x_j$ rather than $x_j^T$.)
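As a concrete illustration of this procedure, here is a minimal numpy sketch; the synthetic data, $S = \tanh$, and the constant learning rate are assumptions for the example. Following the notes above, it keeps $\gamma$ constant and uses $x_j$ (same shape as $\omega$) in the gradient:

```python
import numpy as np

def train_sigmoid_classifier(X, y, gamma=0.1, n_iters=1000):
    """Gradient descent on R_emp(w) = (1/l) sum_i (y_i - tanh(w . x_i))^2.

    X: (l, n) samples as rows; y: (l,) labels in {-1, +1}.
    Per the notes above, gamma stays constant and the gradient term
    uses x_j (same shape as w) rather than x_j^T.
    """
    l, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        s = np.tanh(X @ w)                        # S(w . x_j) for all j
        ds = 1.0 - s ** 2                         # S'(u) = 1 - tanh(u)^2
        grad = -(2.0 / l) * X.T @ ((y - s) * ds)  # grad_w R_emp(w)
        w -= gamma * grad
    return w

# Illustrative usage on synthetic data (assumed, not from the text):
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=100)
X = y[:, None] * np.array([1.5, 1.5]) + rng.normal(size=(100, 2))
w = train_sigmoid_classifier(X, y)
print("training error rate:", np.mean(np.sign(X @ w) != y))
```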
5.3 Neural Networks
5.3.1 The Back-Propagation Method
Suppose a neural network has $m+1$ layers, the last of which is a single-output perceptron, and the first $m$ layers satisfy
$$ x_i(k) = S(w(k) x_i(k-1)), \quad k = 1, 2, \ldots, m \\ u_i(k) = w(k) x_i(k-1) = [u_i^1(k), \ldots, u_i^{n_k}(k)]^T \\ S(u_i(k)) = [S(u_i^1(k)), \ldots, S(u_i^{n_k}(k))]^T $$
Here $x_i(k)$ is the layer-$k$ vector of the $i$-th sample, and $w(k)$ is the weight matrix connecting layer $k-1$ to layer $k$. The goal is to minimize the empirical functional
$$ I(w(1), \ldots, w(m)) = \dfrac{1}{l} \sum\limits_{i=1}^{l} (y_i - x_i(m))^2 $$
We treat this as a convex optimization problem with equality constraints and solve it with the method of Lagrange multipliers:
$$ L(w, x, b) = \dfrac{1}{l} \sum\limits_{i=1}^{l} (y_i - x_i(m))^2 - \sum\limits_{i=1}^{l} \sum\limits_{k=1}^{m} \big( b_i(k) \cdot [x_i(k) - S(w(k) x_i(k-1))] \big) $$
# Note that $x_i(m)$ is a scalar, while $x_i(k)$ for $k \neq m$ are vectors. For the rules of matrix differentiation, see Matrix Analysis and Applications (Zhang Xianda), Chapter 5. Also, several of the conditions derived below differ from the results in the original text and remain open to discussion.
First condition (forward dynamics)
$$ \dfrac{\partial L}{\partial b_i(k)} = 0 \ \ \to \ \ x_i(k) = S(w(k) x_i(k-1)), \quad i = 1, \ldots, l,\ k = 1, \ldots, m $$
Second condition (backward dynamics)
$$ \dfrac{\partial L}{\partial x_i(m)} = 0 \ \ \to \ \ b_i(m) = \dfrac{-2}{l} (y_i - x_i(m)), \quad i = 1, \ldots, l \\ \dfrac{\partial L}{\partial x_i(k)} = 0,\ k \neq m \ \ \to \ \ 0 = \dfrac{\partial \big( -b_i(k) \cdot x_i(k) + b_i(k+1) \cdot S(w(k+1) x_i(k)) \big)}{\partial x_i(k)} = -b_i(k) + \dfrac{\partial S(w(k+1) x_i(k))}{\partial x_i(k)} b_i(k+1) \\ \to \ \ b_i(k) = \dfrac{\partial S(w(k+1) x_i(k))}{\partial x_i(k)} b_i(k+1) $$
Third condition (weight update)
At an extremum $\dfrac{\partial L}{\partial w(k)} = 0$; when we are not at an extremum, update:
$$ w(k) \leftarrow w(k) - \gamma(n) \dfrac{\partial L}{\partial w(k)}, \qquad \dfrac{\partial L}{\partial w(k)} = \sum\limits_{i=1}^{l} b_i(k) \dfrac{\partial S(w(k) x_i(k-1))}{\partial w(k)} $$
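Read as an algorithm, the three dynamics give the back-propagation method directly. Below is a minimal numpy sketch, not the text's own pseudocode: the layer sizes, data, and learning rate are assumptions, $S = \tanh$, and the Jacobian $\partial S(w(k+1) x_i(k)) / \partial x_i(k)$ is applied in transposed form so that the shapes work out:

```python
import numpy as np

def S(u):  return np.tanh(u)              # sigmoid
def dS(u): return 1.0 - np.tanh(u) ** 2   # S'(u)

def backprop_step(W, X, y, gamma):
    """One pass of the three dynamics. W[k-1] is the text's w(k);
    X: (l, n_0) samples as rows; y: (l,) targets."""
    l = X.shape[0]
    # Forward dynamics: x(k) = S(w(k) x(k-1)); columns are samples.
    xs, us = [X.T], []
    for Wk in W:
        us.append(Wk @ xs[-1])
        xs.append(S(us[-1]))
    out = xs[-1].ravel()                  # single-output last layer
    # Backward dynamics: b(m) = -(2/l)(y - x(m)),
    # b(k) = w(k+1)^T [S'(u(k+1)) * b(k+1)].
    b = [None] * len(W)
    b[-1] = (-(2.0 / l) * (y - out))[None, :]
    for k in range(len(W) - 2, -1, -1):
        b[k] = W[k + 1].T @ (dS(us[k + 1]) * b[k + 1])
    # Weight update: w(k) <- w(k) - gamma * sum_i [S'(u_i(k)) * b_i(k)] x_i(k-1)^T
    for k in range(len(W)):
        W[k] -= gamma * (dS(us[k]) * b[k]) @ xs[k].T
    return np.mean((y - out) ** 2)

# Illustrative usage (architecture and data assumed):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])            # a nonlinearly separable target
W = [rng.normal(scale=0.5, size=(8, 2)),  # w(1)
     rng.normal(scale=0.5, size=(1, 8))]  # w(2): single output
for _ in range(2000):
    mse = backprop_step(W, X, y, gamma=0.2)
print("final MSE:", mse)
```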
5.3.2 The Back-Propagation Algorithm
5.3.3 Neural Networks for Regression Estimation
Simply replace the sigmoid in the last layer with a linear function, as in the sketch below.
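A minimal sketch of this change (the two-layer weights and the input are invented for illustration):

```python
import numpy as np

# Hidden layer keeps the sigmoid; the output layer is linear for regression.
W1 = np.array([[0.5, -0.2], [0.1, 0.3]])  # w(1), assumed values
W2 = np.array([[1.0, -1.0]])              # w(2), assumed values
x0 = np.array([0.2, 0.7])                 # input sample, assumed

x1 = np.tanh(W1 @ x0)   # x(1) = S(w(1) x(0))
x2 = W2 @ x1            # x(2) = w(2) x(1): identity instead of S
print(x2)
```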
5.3.4 Remarks on the Back-Propagation Method
5.4 The Optimal Separating Hyperplane
5.4.1 The Optimal Hyperplane
Suppose the training data (a set of vectors)
$$ (x_1, y_1), \ldots, (x_l, y_l), \quad x \in \mathbb{R}^n,\ y \in \{-1, +1\} $$
can be separated by the hyperplane
$$ w \cdot x - b = 0. $$
If the separation is completely correct and the distance between the hyperplane and its closest vector is maximal among all such hyperplanes, we say the set of vectors is separated by the optimal (maximal-margin) hyperplane.
Correct separation by the hyperplane means
$$ (w \cdot x_i - b) \begin{cases} \ge 1 & \text{if } y_i = 1 \\ \le -1 & \text{if } y_i = -1 \end{cases} $$
or, written compactly,
$$ y_i [w \cdot x_i - b] \ge 1, \quad i = 1, \ldots, l. $$
Besides separating correctly, the optimal hyperplane must also maximize the distance to the closest vectors; under the normalization above the margin equals $1/||w||$, so this amounts to
$$ \min \{ \Phi(w) = ||w||^2 \}. $$
5.4.2 The Δ-Margin Separating Hyperplane
A hyperplane
$$ w^* \cdot x - b = 0, \quad ||w^*|| = 1 $$
classifies a vector $x$ as follows:
$$ y = \begin{cases} 1 & \text{if } w^* \cdot x - b \ge \Delta \\ -1 & \text{if } w^* \cdot x - b \le -\Delta \end{cases} $$
Such a hyperplane is called a $\Delta$-margin separating hyperplane. Clearly, the optimal hyperplane is the $\Delta$-margin separating hyperplane with $\Delta = 1/||w||$.
Theorem 5.1
Let the vectors $x \in X$ belong to a ball of radius $R$. Then the VC dimension $h$ of the set of $\Delta$-margin separating hyperplanes satisfies
$$ h \le \min \left\{ \left[ \dfrac{R^2}{\Delta^2} \right], n \right\} + 1 $$
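As a numeric illustration (the values are assumed for the example): for vectors in a ball of radius $R = 1$ with margin $\Delta = 0.2$ in $\mathbb{R}^{100}$,
$$ h \le \min \left\{ \left[ \dfrac{1}{0.04} \right], 100 \right\} + 1 = \min\{25, 100\} + 1 = 26, $$
far below the dimension-based bound $n + 1 = 101$. This is how a large margin keeps the confidence interval small even in high-dimensional spaces, realizing the trade-off described in 5.1.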
5.5 Constructing the Optimal Hyperplane
That is, minimize the functional
$$ \Phi(w) = \dfrac{1}{2} w \cdot w $$
subject to the constraints
$$ y_i [w \cdot x_i - b] \ge 1, \quad i = 1, \ldots, l. $$
Using the method of Lagrange multipliers with inequality constraints,
$$ L(w, b, \alpha) = \dfrac{1}{2} w \cdot w + \sum\limits_{i=1}^{l} \alpha_i \big( 1 - y_i [w \cdot x_i - b] \big), \quad \alpha_i \ge 0. $$
The goal is $\max\limits_{\alpha} \min\limits_{w, b} L$. Taking gradients of the Lagrangian gives
$$ \dfrac{\partial L}{\partial w} = 0 \to w_0 = \sum\limits_{i=1}^{l} \alpha_i^0 y_i x_i \\ \dfrac{\partial L}{\partial b} = 0 \to \sum\limits_{i=1}^{l} \alpha_i^0 y_i = 0 $$
Substituting these back into the Lagrangian yields the dual problem: maximize the functional
$$ W(\alpha) = \sum\limits_{i=1}^{l} \alpha_i - \dfrac{1}{2} \sum\limits_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) $$
subject to the constraints
$$ \alpha_i \ge 0,\ i = 1, \ldots, l, \qquad \sum\limits_{i=1}^{l} \alpha_i y_i = 0 $$
Let $\alpha_0 = (\alpha_1^0, \ldots, \alpha_l^0)$ be the solution of this problem. The classification rule of the optimal hyperplane is then
$$ f(x) = \operatorname{sgn} \Big\{ \sum\limits_{\alpha_i^0 \neq 0} \alpha_i^0 y_i (x_i \cdot x) - b_0 \Big\} $$
The Kuhn-Tucker conditions (necessary and sufficient for the extremum here) must also hold:
$$ \alpha_i^0 \big( 1 - y_i [w_0 \cdot x_i - b_0] \big) = 0, \quad i = 1, \ldots, l $$
From the Kuhn-Tucker conditions,
$$ b_0 = w_0 \cdot x_i^{sv} - y_i = \dfrac{1}{2} \big[ w_0 \cdot x_{i, y_i = 1}^{sv} + w_0 \cdot x_{j, y_j = -1}^{sv} \big] $$
where $x_i^{sv}$ denotes a support vector, and $x_{i, y_i = 1}^{sv}$ is an arbitrary support vector with $y_i = 1$. The support vectors are the vectors with $\alpha_i^0 \neq 0$.
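To make the construction concrete, the dual can be handed to a generic constrained optimizer. The sketch below is an illustration, not a production QP solver; the toy data and the $10^{-6}$ support-vector threshold are assumptions. It minimizes $-W(\alpha)$ with scipy's SLSQP and recovers $w_0$ and $b_0$ from the Kuhn-Tucker conditions:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_dual(X, y):
    """Maximize W(a) = sum_i a_i - (1/2) sum_ij a_i a_j y_i y_j (x_i . x_j)
    subject to a_i >= 0 and sum_i a_i y_i = 0."""
    l = X.shape[0]
    Yx = y[:, None] * X
    K = Yx @ Yx.T                                # K_ij = y_i y_j (x_i . x_j)
    res = minimize(lambda a: 0.5 * a @ K @ a - a.sum(),  # minimize -W(a)
                   np.zeros(l),
                   jac=lambda a: K @ a - np.ones(l),
                   method="SLSQP",
                   bounds=[(0.0, None)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    a = res.x
    w = (a * y) @ X                              # w_0 = sum_i a_i y_i x_i
    sv = a > 1e-6                                # support vectors: a_i != 0
    b = np.mean(X[sv] @ w - y[sv])               # b_0 = w_0 . x_sv - y_sv
    return w, b, a

# Illustrative usage on separable toy data (assumed):
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, a = hard_margin_dual(X, y)
print(np.sign(X @ w - b))                        # decision rule sgn{w_0 . x - b_0}
```

Averaging $w_0 \cdot x_i^{sv} - y_i$ over all support vectors, rather than picking one, is simply a numerically robust version of the expression for $b_0$ above.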
Generalization to the non-separable case
To construct an optimal-type hyperplane when the data are not linearly separable, introduce variables $\xi_i$ for the classification error and a constant parameter $\sigma > 0$, and try to minimize the functional (i.e., minimize the error)
$$ F_\sigma(\xi) = \sum\limits_{i=1}^{l} \xi_i^\sigma $$
subject to "classify correctly up to the error $\xi_i$" and "keep the margin at least $\Delta$", i.e.
$$ \xi_i \ge 0, \qquad y_i (w \cdot x_i - b) \ge 1 - \xi_i, \qquad w \cdot w \le \Delta^{-2} $$
This is a good example of SRM: the procedure above defines a structure (by Theorem 5.1 the VC dimensions of its elements increase, so it is indeed a structure)
$$ S_n = \{ w \cdot x - b : w \cdot w \le c_n = \Delta_n^{-2} \}, \qquad \Delta_n \le \Delta_{n-1} $$
For computational convenience, we take $\sigma = 1$.
Constructing the Δ-margin hyperplane
The Lagrangian is
$$ L(w, b, \alpha, \beta, \lambda) = \sum\limits_{i=1}^{l} \xi_i^\sigma + \sum\limits_{i=1}^{l} \alpha_i \big( 1 - \xi_i - y_i (w \cdot x_i - b) \big) + \sum\limits_{i=1}^{l} \beta_i (-\xi_i) + \dfrac{\lambda}{2} (w \cdot w - \Delta^{-2}) \\ \sigma = 1, \quad \alpha_i \ge 0, \quad \beta_i \ge 0, \quad \lambda \ge 0 $$
The goal is $\max\limits_{\alpha, \beta, \lambda} \min\limits_{w, b, \xi} L$. Taking gradients of the Lagrangian gives
$$ \dfrac{\partial L}{\partial w} = 0 \to w = \dfrac{1}{\lambda} \sum\limits_{i=1}^{l} \alpha_i y_i x_i \\ \dfrac{\partial L}{\partial b} = 0 \to \sum\limits_{i=1}^{l} \alpha_i y_i = 0 \\ \dfrac{\partial L}{\partial \xi_i} = 0 \to \alpha_i + \beta_i = 1 $$
Substituting these back into the Lagrangian yields the dual problem: maximize the functional
$$ W(\alpha, \lambda) = \sum\limits_{i=1}^{l} \alpha_i - \dfrac{1}{2\lambda} \sum\limits_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \dfrac{\lambda}{2\Delta^2} $$
subject to the constraints
$$ \sum\limits_{i=1}^{l} \alpha_i y_i = 0, \qquad \lambda \ge 0, \qquad 0 \le \alpha_i \le 1, \qquad \beta_i = 1 - \alpha_i $$
The Kuhn-Tucker conditions must also hold:
$$ \alpha_i \big( 1 - \xi_i - y_i (w \cdot x_i - b) \big) = 0, \qquad \beta_i \xi_i = 0, \qquad \dfrac{\lambda}{2} (w \cdot w - \Delta^{-2}) = 0 $$
Constructing the soft-margin separating hyperplane (the generalized optimal hyperplane)
To simplify the problem, adopt the soft-margin idea: instead of hard-coding a condition the margin must satisfy (such as $w \cdot w \le c_n$), fix a constant $C > 0$ and minimize the functional
$$ \Phi(w, \xi) = \dfrac{1}{2} w \cdot w + C \sum\limits_{i=1}^{l} \xi_i $$
Clearly, when $C = \lambda^0$ (the optimal $\lambda$ of the previous problem), this problem is fully equivalent to the previous one. The constraints are
$$ \xi_i \ge 0, \qquad y_i (w \cdot x_i - b) \ge 1 - \xi_i $$
The Lagrangian is
$$ L(w, b, \alpha, \beta) = \dfrac{1}{2} w \cdot w + C \sum\limits_{i=1}^{l} \xi_i + \sum\limits_{i=1}^{l} \alpha_i \big( 1 - \xi_i - y_i (w \cdot x_i - b) \big) + \sum\limits_{i=1}^{l} \beta_i (-\xi_i), \quad \alpha_i \ge 0,\ \beta_i \ge 0 $$
The goal is $\max\limits_{\alpha, \beta} \min\limits_{w, b, \xi} L$. Taking gradients of the Lagrangian gives
$$ \dfrac{\partial L}{\partial w} = 0 \to w = \sum\limits_{i=1}^{l} \alpha_i y_i x_i \\ \dfrac{\partial L}{\partial b} = 0 \to \sum\limits_{i=1}^{l} \alpha_i y_i = 0 \\ \dfrac{\partial L}{\partial \xi_i} = 0 \to \alpha_i + \beta_i = C $$
Substituting these back into the Lagrangian yields the dual problem: maximize the functional
$$ W(\alpha) = \sum\limits_{i=1}^{l} \alpha_i - \dfrac{1}{2} \sum\limits_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) $$
subject to the constraints
$$ \sum\limits_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \qquad \beta_i = C - \alpha_i $$
The Kuhn-Tucker conditions must also hold:
$$ \alpha_i \big( 1 - \xi_i - y_i (w \cdot x_i - b) \big) = 0, \qquad \beta_i \xi_i = 0 $$
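Relative to the separable sketch in 5.5, only the box constraints change: $0 \le \alpha_i \le C$. A minimal sketch under the same assumptions (the value of $C$ and the overlapping toy data are invented for illustration); $b_0$ is taken from margin support vectors with $0 < \alpha_i < C$, for which $\xi_i = 0$ because $\beta_i \xi_i = 0$ and $\beta_i = C - \alpha_i > 0$:

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_dual(X, y, C=1.0):
    """Maximize the same dual W(a), now with 0 <= a_i <= C."""
    l = X.shape[0]
    Yx = y[:, None] * X
    K = Yx @ Yx.T
    res = minimize(lambda a: 0.5 * a @ K @ a - a.sum(),
                   np.zeros(l),
                   jac=lambda a: K @ a - np.ones(l),
                   method="SLSQP",
                   bounds=[(0.0, C)] * l,        # the only change: a_i <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    a = res.x
    w = (a * y) @ X
    # b_0 from margin support vectors, where xi_i = 0 by the Kuhn-Tucker
    # conditions (assumes at least one unbounded support vector exists):
    on_margin = (a > 1e-6) & (a < C - 1e-6)
    b = np.mean(X[on_margin] @ w - y[on_margin])
    return w, b

# Illustrative usage on overlapping data (assumed):
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=40)
X = y[:, None] * np.array([1.0, 1.0]) + rng.normal(size=(40, 2))
w, b = soft_margin_dual(X, y, C=1.0)
print("training error rate:", np.mean(np.sign(X @ w - b) != y))
```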