5 Methods of Pattern Recognition
5.6 Support Vector Machines
The support vector machine (Support Vector Machine, SVM, SV machine) implements the following scheme: map the input $x$ to a vector $z$ in a high-dimensional space $Z$ via some mapping chosen in advance, and construct the optimal separating hyperplane in $Z$.
5.6.1 Generalization in High-Dimensional Spaces
Theorem 5.2
Suppose a training set of $l$ samples is separated without error by the maximal-margin hyperplane. Let $P_{error}$ be the probability of test error, $m$ the number of support vectors, $R$ the radius of the smallest sphere containing all training vectors, $\Delta$ the margin value, and $n$ the dimension of the input space. Then

$$E[P_{error}] \le E\Big[\min\Big\{\dfrac{m}{l},\ \dfrac{[R^2\Delta^{-2}]}{l},\ \dfrac{n}{l}\Big\}\Big]$$
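As a quick numerical illustration (all values below are made up), the three terms inside the bound of Theorem 5.2 can be compared directly:

```python
# Evaluate the bound of Theorem 5.2 for illustrative (made-up) values.
l = 1000      # number of training samples
m = 30        # number of support vectors
R = 1.0       # radius of the smallest enclosing sphere
Delta = 0.2   # margin value
n = 50        # input dimension

terms = {"m/l": m / l, "R^2/(Delta^2 l)": R**2 / Delta**2 / l, "n/l": n / l}
bound = min(terms.values())
print(terms, bound)  # the tightest of the three terms gives the bound
```

Whichever quantity happens to be smallest (few support vectors, a large margin relative to the sphere radius, or a low input dimension) controls the expected test error.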
# In general $R(\alpha) = f(R_{emp}(\alpha), \Phi(\zeta))$, and with complete separation $R_{emp}(\alpha) = 0$. In Case 3 of Section 3.4, $R(\alpha) = \dfrac{R_{emp}(\alpha)}{(1-a(p)\tau\sqrt\zeta)_+}$; does the confidence interval still play a role there? In Case 2, $R(\alpha) = R_{emp}(\alpha) + \Phi(\zeta)$, and then I think the theorem above directly implies $R(\alpha) \lesssim h / l$; isn't that result a bit too good?
5.6.2 Convolution of the Inner Product
Consider a general expression for the inner product in a Hilbert space (Courant and Hilbert, 1953):

$$z_i \cdot z_j = K(x_i, x_j), \qquad K(x_i, x_j) = \sum\limits_{k=1}^\infty a_k \psi_k(x_i)\psi_k(x_j), \quad a_k > 0$$
Theorem 5.3
(Mercer) A symmetric function $K(u,v)$ in $L_2$ can be expanded with positive coefficients $a_k > 0$ as
$$K(u, v) = \sum\limits_{k=1}^\infty a_k \psi_k(u)\psi_k(v)$$
if and only if
$$\forall g \neq 0,\ \int g^2(u)\,du < \infty: \qquad \iint K(u,v)g(u)g(v)\,du\,dv > 0$$
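A finite-sample analogue of Mercer's condition is that every Gram matrix built from the kernel must be positive semidefinite; the sketch below (illustrative data and kernels) checks this numerically for a valid and an invalid kernel:

```python
import numpy as np

# Empirical check of Mercer's condition on a finite sample: for a valid
# kernel, every Gram matrix K[i, j] = K(x_i, x_j) must be positive
# semidefinite (the discrete analogue of the integral condition above).
rng = np.random.default_rng(0)
x = rng.normal(size=20)

def gram(kernel, xs):
    return np.array([[kernel(u, v) for v in xs] for u in xs])

rbf = lambda u, v: np.exp(-(u - v) ** 2)   # satisfies Mercer's condition
bad = lambda u, v: -u * v                  # violates Mercer's condition

eig_rbf = np.linalg.eigvalsh(gram(rbf, x))
eig_bad = np.linalg.eigvalsh(gram(bad, x))
print(eig_rbf.min() >= -1e-10)   # no negative eigenvalues
print(eig_bad.min() < 0)         # a negative eigenvalue appears
```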
5.6.3 Constructing SV Machines

$$f(x) = sgn\Big\{ \sum\limits_{\alpha_i \neq 0} y_i \alpha_i K(x_i, x) - b\Big\}$$
The solution procedure is exactly the same as before. The dual problem is to maximize the functional

$$W(\alpha) = \sum\limits_{i=1}^l \alpha_i - \dfrac{1}{2} \sum\limits_{i,j=1}^l \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to the constraints

$$\sum\limits_{i=1}^l \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad \alpha_i\Big(1-y_i\Big(\sum\limits_{j=1}^l y_j \alpha_j K(x_j, x_i)-b\Big)\Big) = 0$$
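As a sanity check of these constraints, here is a minimal two-point example with the linear kernel, where the dual solution is known in closed form to be $\alpha_1 = \alpha_2 = 1/2$:

```python
import numpy as np

# Two 1-D training points: x = +1 with y = +1 and x = -1 with y = -1.
# For the linear kernel the maximal-margin solution is w = 1, b = 0,
# which corresponds to alpha = (1/2, 1/2) in the dual.
x = np.array([1.0, -1.0])
y = np.array([1.0, -1.0])
K = np.outer(x, x)                 # linear kernel Gram matrix
alpha = np.array([0.5, 0.5])
b = 0.0

# Dual constraints: sum_i alpha_i y_i = 0 and alpha_i >= 0.
assert abs(np.sum(alpha * y)) < 1e-12
assert np.all(alpha >= 0)

# Decision function f(x) = sgn{ sum_i y_i alpha_i K(x_i, x) - b }.
f = lambda t: np.sign(np.sum(y * alpha * (x * t)) - b)
print(f(0.3), f(-2.0))   # -> 1.0 -1.0
```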
5.6.4 Examples of SV Machines
To bound the test error of a given SV machine we need to estimate the VC dimension $h \approx R^2 \|w\|^2$, where

$$\|w\|^2 = \sum\limits_{i,j=1}^l \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

Summing the Kuhn-Tucker conditions over $i$ gives

$$0 = \sum\limits_{i=1}^l \alpha_i\Big(1-y_i\Big(\sum\limits_{j=1}^l y_j \alpha_j K(x_j, x_i)-b\Big)\Big) = \sum\limits_{i=1}^l \alpha_i - \sum\limits_{i,j=1}^l \alpha_i \alpha_j y_i y_j K(x_i, x_j) + b\sum\limits_{i=1}^l \alpha_i y_i$$

and hence $\|w\|^2 = \sum\limits_{i=1}^l \alpha_i$.
$R$ is found as follows:

$$R^2 = R^2(K) = \min\limits_{a}\max\limits_{x_i} [K(x_i, x_i) - 2K(x_i,a) + K(a,a)]$$

where $a$ is the center of the smallest enclosing sphere.
- Polynomial learning machines: $K(u,v) = [(u \cdot v) + 1]^d$
- Radial basis function (RBF) machines: $K(u,v) = K_\gamma(\|u - v\|)$, which for fixed $\gamma$ is a nonnegative monotonically decreasing function tending to 0, e.g. $K(u,v) = \exp\{-\gamma \|u - v\|^2\}$
- Two-layer neural networks: $K(x, x_i) = S[v(x \cdot x_i) + c]$, where the parameters $v, c$ satisfy the Mercer condition of Theorem 5.3 only for certain values.
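The three kernel families above can be written down directly; the parameter values below ($d$, $\gamma$, the sigmoid scale and offset) are illustrative choices only:

```python
import numpy as np

# The three kernel families listed above, for vector inputs u, v.
def poly_kernel(u, v, d=3):
    # Polynomial learning machine: [(u . v) + 1]^d
    return (np.dot(u, v) + 1.0) ** d

def rbf_kernel(u, v, gamma=0.5):
    # Radial basis function machine: exp{-gamma ||u - v||^2}
    return np.exp(-gamma * np.dot(u - v, u - v))

def sigmoid_kernel(u, v, scale=0.5, c=-1.0):
    # Two-layer network kernel S[v (u . x) + c] with S = tanh.
    # Note: satisfies Mercer's condition only for some (scale, c).
    return np.tanh(scale * np.dot(u, v) + c)

u = np.array([1.0, 0.0]); v = np.array([0.5, 0.5])
print(poly_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```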
5.7 Experiments with SV Machines
5.8 Remarks on SV Machines
5.9 SVM and Logistic Regression
5.9.1 Logistic Regression
For the pattern recognition problem we can construct the Bayesian (optimal) decision rule

$$r(x) = sgn\Big\{ \ln \dfrac{P\{y=1|x\}}{1 - P\{y=1|x\}} \Big\} = sgn\{ f(x, w_0)\}$$
From this we obtain the following result (called logistic regression):

$$P\{y=1|x\} = \dfrac{e^{f(x, w_0)}}{1+e^{f(x, w_0)}}$$
Our goal is to estimate the parameters of the logistic regression from a given sample set, using the risk functional

$$R_x(w) = E_{y|x}\ln(1+e^{-yf(x,w)})$$
We must verify that this risk functional indeed attains its extremum at $w_0$:

$$\dfrac{\partial R_x(w)}{\partial w} = P\{y=1|x\}\dfrac{\partial \ln(1+e^{-f(x,w)})}{\partial w} + P\{y=-1|x\}\dfrac{\partial \ln(1+e^{f(x,w)})}{\partial w}\\ = \dfrac{(e^{f(x,w)} - e^{f(x,w_0)})f'(x,w)}{(1+e^{f(x,w)})(1+e^{f(x,w_0)})} = 0$$
Assume the logistic regression we seek is a linear function, i.e.

$$f(x,w) = x \cdot w + b, \qquad R_x(w) = E_{y|x}\ln(1+e^{-y[x \cdot w + b]})$$
Using the SRM method, define the structure $w \cdot w \le r$; following the soft-margin idea, this becomes minimizing the functional

$$\Phi(w,b) = C \sum\limits_{i=1}^l \ln (1+e^{-y_i[x_i \cdot w + b]}) + \dfrac{1}{2} w \cdot w$$
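As a minimal sketch, this functional can be minimized by plain gradient descent on a toy one-dimensional sample; the data, step size, and iteration count below are all made up for illustration:

```python
import numpy as np

# Gradient descent on Phi(w, b) = C sum_i ln(1 + e^{-y_i(x_i w + b)}) + w^2/2
# for a toy separable 1-D sample (illustrative data and hyperparameters).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 20), rng.normal(2, 0.5, 20)])
y = np.concatenate([-np.ones(20), np.ones(20)])
C, lr = 1.0, 0.05
w = b = 0.0
for _ in range(500):
    s = 1.0 / (1.0 + np.exp(y * (x * w + b)))  # = e^{-yf} / (1 + e^{-yf})
    w -= lr * (-C * np.sum(y * x * s) + w)     # dPhi/dw
    b -= lr * (-C * np.sum(y * s))             # dPhi/db
print(np.mean(np.sign(x * w + b) == y))        # training accuracy
```

Unlike the separable hard-margin case, the regularizer $\frac{1}{2}w \cdot w$ keeps $w$ finite even when the data are perfectly separated.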
Setting the gradient of $\Phi(w,b)$ to zero gives

$$\dfrac{\partial \Phi(w, b)}{\partial b} = 0 \to \sum\limits_{i=1}^l y_i \dfrac{e^{-y_i[x_i \cdot w + b]}}{1+e^{-y_i[x_i \cdot w + b]}} = 0\\ \dfrac{\partial \Phi(w, b)}{\partial w} = 0 \to C\sum\limits_{i=1}^l y_i x_i \dfrac{e^{-y_i[x_i \cdot w + b]}}{1+e^{-y_i[x_i \cdot w + b]}} = w$$
Introducing the variables

$$\alpha_i = \dfrac{e^{-y_i[x_i \cdot w + b]}}{1+e^{-y_i[x_i \cdot w + b]}}$$
we obtain

$$w=C\sum\limits_{i=1}^l \alpha_i y_i x_i, \quad \sum\limits_{i=1}^l \alpha_i y_i = 0, \quad 0 < \alpha_i < 1$$

$$P\{y=1|x\} = \dfrac{e^{C\sum\limits_{i=1}^l \alpha_i y_i (x_i \cdot x) + b}}{1+e^{C\sum\limits_{i=1}^l \alpha_i y_i (x_i \cdot x) + b}}$$

$$\Phi = \dfrac{C^2}{2}\sum\limits_{i,j=1}^l \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + C \sum\limits_{i=1}^l \ln \Big(1+e^{-y_i[C\sum\limits_{j=1}^l \alpha_j y_j (x_j \cdot x_i) + b]}\Big)$$
# The original text says the minimum of the above $\Phi$ can be found by gradient descent, but how does one determine whether, with respect to $\alpha$, a maximum or a minimum is required?
5.9.2 The Risk Function of the SVM
We approximate the loss function of logistic regression by the following loss function:

$$Q(x, w) = c_1 (1- y(w \cdot x + b))_+$$
and introduce variables (which form the constraints)

$$\xi_i = (1- y(w \cdot x + b))_+ \to y(w \cdot x + b)\ge 1- \xi_i, \quad \xi_i \ge 0$$
The earlier target functional $\Phi(w,b)$ then becomes

$$\Phi(w,b) = C \sum\limits_{i=1}^l \xi_i + \dfrac{1}{2} w \cdot w$$

This problem was already discussed in Section 5.5.
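For comparison with the smooth logistic loss of the previous subsection, here is the hinge-loss functional minimized by subgradient descent on a toy one-dimensional sample (data and hyperparameters are again made up):

```python
import numpy as np

# Subgradient descent on Phi(w, b) = C sum_i xi_i + w^2/2, where
# xi_i = (1 - y_i(w x_i + b))_+ is the hinge loss (toy 1-D data,
# ad hoc step size and iteration count).
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 0.5, 20), rng.normal(2, 0.5, 20)])
y = np.concatenate([-np.ones(20), np.ones(20)])
C, lr = 1.0, 0.01
w = b = 0.0
for _ in range(1000):
    margin = y * (x * w + b)
    active = margin < 1.0                       # points with xi_i > 0
    w -= lr * (w - C * np.sum((y * x)[active])) # subgradient in w
    b -= lr * (-C * np.sum(y[active]))          # subgradient in b
print(np.mean(np.sign(x * w + b) == y))         # training accuracy
```

Only the "active" points, those inside or on the wrong side of the margin, contribute to the subgradient, mirroring the fact that only support vectors carry nonzero $\alpha_i$ in the dual.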
5.9.3 The SVMn Approximation of Logistic Regression
We approximate the loss function of logistic regression by the following loss function (called the spline approximation):

$$Q(x, w) = \sum\limits_{k=1}^n c_k (d_k- y(w \cdot x + b))_+$$
and introduce variables (which form the constraints)

$$\xi_{k,i} = (d_k - y(w \cdot x + b))_+ \to y(w \cdot x + b)\ge d_k- \xi_{k,i}, \quad \xi_{k,i} \ge 0$$
The earlier target functional $\Phi(w,b)$ then becomes

$$\Phi(w,b) = C \sum\limits_{i=1}^l \sum\limits_{k=1}^n c_k \xi_{k,i} + \dfrac{1}{2} w \cdot w$$
The Lagrangian is

$$L(w, b, \alpha, \beta) = \dfrac{1}{2} w \cdot w + C\sum\limits_{i=1}^l\sum\limits_{k=1}^n c_k \xi_{k,i} + \sum\limits_{i=1}^l\sum\limits_{k=1}^n \alpha_{k,i} (d_k- \xi_{k,i} - y_i(w \cdot x_i +b)) + \sum\limits_{i=1}^l\sum\limits_{k=1}^n \beta_{k,i} (-\xi_{k,i}),\\ \alpha_{k,i} \ge 0, \quad \beta_{k,i} \ge 0$$
The goal is $\max\limits_{\alpha, \beta}\min\limits_{w, b, \xi}L$. Setting the gradients of the Lagrangian to zero gives

$$\dfrac{\partial L}{\partial w} = 0 \to w = \sum\limits_{i=1}^l \Big(\sum\limits_{k=1}^n \alpha_{k,i}\Big) y_i x_i\\ \dfrac{\partial L}{\partial b} = 0 \to \sum\limits_{i=1}^l \Big(\sum\limits_{k=1}^n \alpha_{k,i}\Big) y_i = 0\\ \dfrac{\partial L}{\partial \xi_{k,i}} = 0 \to \alpha_{k,i} + \beta_{k,i} = Cc_k$$
Substituting back into the Lagrangian yields the dual problem: maximize the functional

$$W(\alpha) = \sum\limits_{i=1}^l\sum\limits_{k=1}^n \alpha_{k,i}d_k - \dfrac{1}{2}\sum\limits_{i,j=1}^l \Big(\sum\limits_{k=1}^n\alpha_{k,i}\Big) \Big(\sum\limits_{k=1}^n\alpha_{k,j}\Big) y_i y_j (x_i \cdot x_j)$$
subject to the constraints

$$\sum\limits_{i=1}^l \Big(\sum\limits_{k=1}^n\alpha_{k,i}\Big) y_i = 0, \quad 0 \le \alpha_{k,i} \le Cc_k, \quad \beta_{k,i} = Cc_k- \alpha_{k,i}$$
In addition the Kuhn-Tucker conditions must hold:

$$\alpha_{k,i} (d_k- \xi_{k,i} - y_i(w \cdot x_i +b)) = 0, \quad \beta_{k,i} \xi_{k,i} = 0$$
The logistic regression approximation is

$$f(x,w) = \sum\limits_{i=1}^l \Big(\sum\limits_{k=1}^n\alpha_{k,i}\Big) y_i (x_i \cdot x) + b, \qquad P\{y=1|x\} = \dfrac{e^{f(x, w)}}{1+e^{f(x, w)}}$$
Or, more generally, with a kernel satisfying the Mercer condition:

$$f(x,w) = \sum\limits_{i=1}^l \Big(\sum\limits_{k=1}^n\alpha_{k,i}\Big) y_i K(x_i, x) + b$$

$$W(\alpha) = \sum\limits_{i=1}^l\sum\limits_{k=1}^n \alpha_{k,i}d_k - \dfrac{1}{2}\sum\limits_{i,j=1}^l \Big(\sum\limits_{k=1}^n\alpha_{k,i}\Big) \Big(\sum\limits_{k=1}^n\alpha_{k,j}\Big) y_i y_j K(x_i, x_j)$$

$$\sum\limits_{i=1}^l \Big(\sum\limits_{k=1}^n\alpha_{k,i}\Big) y_i = 0, \quad 0 \le \alpha_{k,i} \le Cc_k, \quad \beta_{k,i} = Cc_k- \alpha_{k,i}$$

$$\alpha_{k,i} \Big(d_k- \xi_{k,i} - y_i\Big(\sum\limits_{j=1}^l \Big(\sum\limits_{k=1}^n \alpha_{k,j}\Big) y_j K(x_j, x_i) +b\Big)\Big) = 0, \quad \beta_{k,i} \xi_{k,i} = 0$$
However, Vapnik notes that their experiments did not show SVMn having any significant advantage over SVM in approximating logistic regression.
5.10 Combinations of SVMs
5.10.1 The AdaBoost Method
Reconsider the risk functional (no longer using logistic regression)
$$R(\alpha) = E_{y|x}e^{-yf(x,\alpha)}$$
The function set contains the function

$$f(x, \alpha_0) = \dfrac{1}{2}\ln\dfrac{P(y=1|x)}{P(y=-1|x)}$$
i.e.

$$P(y=1|x) = \dfrac{e^{f(x,\alpha_0)}}{e^{f(x,\alpha_0)} + e^{-f(x,\alpha_0)}}, \qquad P(y=-1|x) = \dfrac{e^{-f(x,\alpha_0)}}{e^{f(x,\alpha_0)} + e^{-f(x,\alpha_0)}}$$
It is easy to see that exactly this function minimizes $R(\alpha)$, and that $sgn\{f(x, \alpha_0)\}$ gives the correct classification. Replacing the risk by the empirical risk, we minimize

$$R_{emp}(\alpha) = \dfrac{1}{l}\sum\limits_{i=1}^l e^{-y_i f(x_i, \alpha)}$$
Greedy optimization procedure
At the $k$-th iteration the method produces

$$f(x, \beta_k) = \sum\limits_{r=1}^k d_r \phi_r(x), \quad d_1 = 1$$

where $\phi_r(x)$ belongs to a given set of indicator functions (presumably $\phi(x) = sgn\{K(x)\}$).
- At the first iteration we choose the function $\phi_1(x)$ that minimizes the empirical risk functional, i.e. minimizes (dropping the constant factor $1/l$)
  $$R_{emp}(\beta_1) = \sum\limits_{i=1}^l e^{-y_i \phi_1(x_i)}$$
  By the derivation below, this is just
  $$R_{emp}(\beta_{k+1})\big|_{c^{k+1}_i = 1,\ d_{k+1} = 1}$$
  so some of the conclusions obtained later should also apply to the first iteration.
- At the $k$-th step we have obtained the empirical risk value
  $$R_{emp}(\beta_k) = \sum\limits_{i=1}^l e^{-y_i f(x_i, \beta_k)}$$
  and wish, with $f(x, \beta_k)$ already fixed, to minimize the empirical risk
  $$R_{emp}(\beta_{k+1}) = \sum\limits_{i=1}^l e^{-y_i f(x_i, \beta_{k+1})} = \sum\limits_{i=1}^l c^{k+1}_i e^{- y_i d_{k+1}\phi_{k+1}(x_i)}, \quad c^{k+1}_i = e^{-y_i f(x_i, \beta_k)}$$
  Introduce the parameters
  $$c^{k+1}_+ = \sum\limits_{\{i:\, y_i \phi_{k+1}(x_i) = 1\}} c^{k+1}_i, \qquad c^{k+1}_- = \sum\limits_{\{i:\, y_i \phi_{k+1}(x_i) = -1\}} c^{k+1}_i$$
  Then
  $$\dfrac{\partial R_{emp}(\beta_{k+1})}{\partial d_{k+1}} = 0 \to \sum\limits_{i=1}^l y_i \phi_{k+1}(x_i) c^{k+1}_i e^{- d_{k+1} y_i \phi_{k+1}(x_i)} = 0 \\ \to \sum\limits_{\{i:\, y_i \phi_{k+1}(x_i) = 1\}} c^{k+1}_i e^{- d_{k+1}} = \sum\limits_{\{i:\, y_i \phi_{k+1}(x_i) = -1\}} c^{k+1}_i e^{d_{k+1}}, \quad \sum\limits_{i=1}^l c^{k+2}_i y_i \phi_{k+1}(x_i) = 0 \\ \to d_{k+1} = \dfrac{1}{2} \ln \dfrac{c^{k+1}_+}{c^{k+1}_-}, \quad \sum\limits_{i=1}^l c^{k+1}_i y_i \phi_k(x_i) = 0$$
In fact,

$$\min\limits_{\phi} \Big\{\sum\limits_{i=1}^l c_i e^{- d y_i \phi(x_i)} \Big\} \Leftrightarrow \min\limits_{z_i} \Big\{\sum\limits_{i=1}^l -c_i z_i : z_i \in \{e^{-d}, e^d\} \Big\} \\ \Leftrightarrow \min\limits_{z_i} \Big\{\sum\limits_{i=1}^l -c_i z_i : z_i \in \Big\{\dfrac{e^{-d} - e^d}{2}, \dfrac{e^d - e^{-d}}{2}\Big\} \Big\} \\ \Leftrightarrow \min\limits_{z_i} \Big\{\sum\limits_{i=1}^l -c_i z_i : z_i \in \{-1, 1\}\Big\}$$
Hence the original text gives (without derivation) the rule for choosing $\phi_{k+1}(x)$: minimize the functional

$$R(\phi) = -\sum\limits_{i=1}^l c^{k+1}_i y_i \phi_{k+1}(x_i)$$
- Using the recursion above we obtain the decision rule (called the AdaBoost decision rule)
  $$sgn\{f(x, \alpha_N)\} = sgn\Big\{\sum\limits_{r=1}^N d_r \phi_r(x)\Big\}$$
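The recursion above can be sketched as a minimal AdaBoost with threshold stumps as the indicator set; the data and the stump grid below are illustrative assumptions:

```python
import numpy as np

# Minimal AdaBoost with threshold "stumps" phi(x) = sgn(s (x - t)) as the
# indicator set; weights c_i and coefficients d_r follow the recursion above.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
y = np.where(np.abs(x) < 0.5, 1.0, -1.0)       # not representable by one stump
stumps = [(s, t) for s in (1.0, -1.0) for t in np.linspace(-1, 1, 41)]
phi = lambda s, t, u: np.where(s * (u - t) >= 0, 1.0, -1.0)

c = np.ones(len(x))                             # c_i^1 = 1
F = np.zeros(len(x))
for _ in range(10):
    # choose phi minimizing R(phi) = -sum_i c_i y_i phi(x_i)
    s, t = min(stumps, key=lambda p: -np.sum(c * y * phi(*p, x)))
    h = phi(s, t, x)
    cp = np.sum(c[y * h == 1]); cm = np.sum(c[y * h == -1])
    d = 0.5 * np.log(cp / max(cm, 1e-12))       # d_{k+1} = ln(c+/c-)/2
    F += d * h
    c = np.exp(-y * F)                          # c_i^{k+1} = e^{-y_i f(x_i, beta_k)}
print(np.mean(np.sign(F) == y))                 # training accuracy
```

No single stump classifies the interval target correctly, but the weighted vote $sgn\{\sum_r d_r \phi_r(x)\}$ does after a few rounds.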
5.10.2 Combinations of SVMs
# The following interpretation may not match the original text (which directly gives the target functional containing $\xi$ and the constraints); moreover, the conclusion derived here differs slightly from the original, so it is open to question.
Following the AdaBoost idea above, we construct a combination of SVMs: find $N$ soft-margin optimal hyperplanes by greedily minimizing the functional
$$R(w, b) = \sum\limits_{i=1}^l \exp\Big\{ -y_i\sum\limits_{k=1}^N d_k\, sgn\{w_k \cdot x_i + b_k\} \Big\}$$
From Section 5.10.1, the $k$-th iteration is equivalent to minimizing the functional (including the first iteration)

$$R(w_k, b_k) = -\sum\limits_{i=1}^l c^k_i y_i\, sgn\{w_k \cdot x_i + b_k\}, \quad c^1_i = 1$$
Noting that when classification is correct, or when the above is minimized,

$$y_i\, sgn\{w_k \cdot x_i + b_k\} = 1 \Leftrightarrow y_i [w_k \cdot x_i + b_k] \ge 1, \quad \exists\, w_k, b_k$$
we therefore adopt the new (approximately equivalent) loss function

$$Q = c^k_i (1- y_i [w_k \cdot x_i + b_k])_+ = c^k_i \xi_{k,i}$$
and, under the constraint $w_k \cdot w_k \le \Delta^{-2}$, minimize the functional

$$R(w_k, b_k) = \sum\limits_{i=1}^l c^k_i \xi_{k,i}$$
Using the soft-margin optimal hyperplane, we instead minimize the functional

$$R(w_k, b_k) = \dfrac{1}{2} w_k \cdot w_k + C \sum\limits_{i=1}^l c^k_i \xi_{k,i}$$
subject to the constraints

$$\xi_{k,i}\ge 0, \quad 1- \xi_{k,i} - y_i [w_k \cdot x_i + b_k] \le 0$$
At the $k$-th iteration (the index $k$ is omitted in the derivation below) the Lagrangian is

$$L(w, b, \alpha, \beta) = \dfrac{1}{2} w \cdot w + C\sum\limits_{i=1}^l c_i \xi_i + \sum\limits_{i=1}^l \alpha_i (1- \xi_i - y_i(w \cdot x_i +b)) + \sum\limits_{i=1}^l \beta_i (-\xi_i),\\ \alpha_i \ge 0, \quad \beta_i \ge 0$$
The goal is $\max\limits_{\alpha, \beta}\min\limits_{w, b, \xi}L$. Setting the gradients of the Lagrangian to zero gives

$$\dfrac{\partial L}{\partial w} = 0 \to w = \sum\limits_{i=1}^l \alpha_i y_i x_i\\ \dfrac{\partial L}{\partial b} = 0 \to \sum\limits_{i=1}^l \alpha_i y_i = 0\\ \dfrac{\partial L}{\partial \xi_i} = 0 \to \alpha_i + \beta_i = Cc_i$$
Substituting back into the Lagrangian yields the dual problem: maximize the functional

$$W(\alpha) = \sum\limits_{i=1}^l \alpha_i - \dfrac{1}{2}\sum\limits_{i,j=1}^l \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
subject to the constraints

$$\sum\limits_{i=1}^l \alpha_i y_i = 0, \quad 0 \le \alpha_i \le Cc_i, \quad \beta_i = Cc_i- \alpha_i$$
In addition the Kuhn-Tucker conditions must hold:

$$\alpha_i (1- \xi_i - y_i(w \cdot x_i +b)) = 0, \quad \beta_i \xi_i = 0$$
Recalling Section 5.10.1 gives the values of $c^k_i$ and $d_k$, so we can obtain the soft-margin optimal separating hyperplanes and hence the decision rule of the SVM combination:

$$\Phi(x) = sgn\Big\{\sum\limits_{r=1}^N d_r (w_r \cdot x + b_r)\Big\} = sgn\Big\{\sum\limits_{r=1}^N d_r \Big(\sum\limits_{i=1}^l \alpha_{r,i} y_i (x_i \cdot x) + b_r\Big)\Big\}$$
If a kernel satisfying the Mercer condition is used, simply replace $x_i \cdot x_j$ by $K(x_i, x_j)$.
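The combined decision rule can be evaluated directly; the component hyperplanes $(w_r, b_r)$ and weights $d_r$ below are hypothetical placeholders:

```python
import numpy as np

# The combined decision rule Phi(x) = sgn{ sum_r d_r (w_r . x + b_r) }
# for hypothetical component hyperplanes and weights (all values made up).
W = np.array([[1.0, 0.0], [0.5, 0.5]])   # rows are w_1, w_2
b = np.array([0.0, -0.25])               # offsets b_1, b_2
d = np.array([0.7, 0.3])                 # combination weights d_1, d_2

def combined(x):
    return np.sign(d @ (W @ x + b))

print(combined(np.array([1.0, 1.0])), combined(np.array([-1.0, -1.0])))  # -> 1.0 -1.0
```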