1. Perceptron
Source: 《统计学习方法(第2版)》 (Statistical Learning Methods, 2nd ed.), Chapter 2
1.1 Model
The sign function:
$$
sign(x)=\left\{ \begin{aligned} +1& , & x \geqslant 0\\ -1 & , & x < 0 \end{aligned} \right. \tag{1}
$$
The perceptron model:
$$
f(x)=sign(w \cdot x +b)=\left\{ \begin{aligned} +1&,&w\cdot x +b \geqslant 0\\ -1&,& w\cdot x+b < 0 \end{aligned} \right. \tag{2}
$$
where $w=(w_1,w_2,...,w_n)$ is the weight vector and $x = (x_1, x_2, ..., x_n)^T$ is the input vector.
Geometric interpretation of the perceptron:
The linear equation $w \cdot x + b = 0$ corresponds to a hyperplane $S$, where $w$ is the normal vector of the hyperplane and $b$ is its intercept.
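As a minimal sketch, the model in equation (2) can be written directly in numpy; the weights, intercept, and points below are made up for illustration:

```python
import numpy as np

def perceptron_predict(w, b, x):
    """f(x) = sign(w·x + b), with sign(0) = +1 as in equation (1)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([1.0, 1.0])   # hypothetical normal vector of the hyperplane
b = -3.0                   # hypothetical intercept
print(perceptron_predict(w, b, np.array([3.0, 3.0])))   # point on the positive side -> 1
print(perceptron_predict(w, b, np.array([1.0, 1.0])))   # point on the negative side -> -1
```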
1.2 Loss Function
The distance from a single misclassified point to the separating hyperplane:
$$
-\frac{1}{||w||}y_i(w\cdot x_i+b) \tag{3}
$$
The total distance from all misclassified points to the separating hyperplane:
$$
-\frac{1}{||w||}\sum_{x_i \in M}y_i(w \cdot x_i + b) \tag{4}
$$
The loss function (the factor $\frac{1}{||w||}$ is dropped, since it does not change which points are misclassified):
$$
L(w,b)=-\sum_{x_i \in M}y_i(w \cdot x_i + b) \tag{5}
$$
1.3 Algorithm
We need to find a suitable pair $w,b$, so take the gradients of the loss:
$$
\begin{aligned} \nabla_w L(w,b)&=-\sum_{x_i \in M}y_ix_i \\ \nabla_b L(w,b)&=-\sum_{x_i \in M} y_i \end{aligned} \tag{6}
$$
Update rule:
$$
\begin{aligned} w&\leftarrow w + \eta y_ix_i \\ b &\leftarrow b + \eta y_i \end{aligned} \tag{7}
$$
Note: we do not descend on all misclassified points at once; instead, one misclassified point is picked at random at a time and gradient descent is applied to it (stochastic gradient descent).
1.3.1 Algorithm 1
(1). Choose initial values $w_0,b_0$;
(2). Pick a sample point from the training set;
(3). If $y_i(w\cdot x_i + b) \leqslant 0$, the point is misclassified; perform the update step;
(4). Go back to (2), until there are no misclassified points left in the training set.
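The four steps above can be sketched in numpy as follows. The toy dataset is Example 2.1 from the book (positive points $(3,3)^T$, $(4,3)^T$, negative point $(1,1)^T$); `max_epochs` is a safety cap I add so the loop terminates even on non-separable data:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron: scan for a misclassified point and apply update (7)."""
    w = np.zeros(X.shape[1])  # initial value w_0 = 0
    b = 0.0                   # initial value b_0 = 0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (step 3)
                w += eta * y_i * x_i              # w <- w + eta * y_i * x_i
                b += eta * y_i                    # b <- b + eta * y_i
                errors += 1
        if errors == 0:       # no misclassified points left (step 4)
            break
    return w, b

# Example 2.1: positive points (3,3), (4,3); negative point (1,1)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = train_perceptron(X, y)   # with eta=1, scanning in order: w=(1,1), b=-3
```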
1.3.2 Algorithm 2 (dual form)
Algorithm 2 is an improvement on Algorithm 1. When $w_0=0,b_0=0$, the original update equations become:
$$
\begin{aligned} w &\leftarrow \eta y_i x_i \\ b &\leftarrow \eta y_i \end{aligned} \tag{8}
$$
Suppose updates are performed $n$ times in total. Then the contributions to $w$ and $b$ can be written as $\alpha_iy_ix_i$ and $\alpha_iy_i$, where $\alpha_i=n_i\eta$, $\eta$ is the learning rate, and $i$ indexes the instances. When $\eta=1$, $\alpha_i=n_i$ is the number of updates caused by instance $i$ being misclassified. The learned $w,b$ can therefore be expressed as:
$$
\begin{aligned} w=&\sum_{i=1}^{N}\alpha_iy_ix_i \\ b=&\sum_{i=1}^{N} \alpha_i y_i \end{aligned} \tag{9}
$$
Algorithm 2, summarized:
(1). $\alpha \leftarrow 0,b \leftarrow 0$;
(2). Iterate over the training data;
(3). If $y_i(\sum_{j=1}^N\alpha_jy_jx_j\cdot x_i +b)\leqslant 0$, update

$$
\begin{aligned} \alpha_i &\leftarrow \alpha_i +\eta \\ b &\leftarrow b +\eta y_i \end{aligned} \tag{10}
$$
(4). Go back to (2), until there are no misclassified points left in the dataset.
Note: in the condition $y_i(\sum_{j=1}^N\alpha_jy_jx_j \cdot x_i+b) \leqslant 0$ of step (3), only $\alpha_i,b$ are unknown; $(x_i,y_i)$ and $(x_j,y_j)$ are all known. So $\alpha_i$ and $b$ are what we update. $b$ is updated exactly as in Algorithm 1; how should $\alpha_i$ be updated? Recall its definition: $\alpha_i=n_i\eta$. For example, $n_2=3$ means the 2nd instance was misclassified 3 times, so $\alpha_2=\eta + \eta + \eta$. The update rule is therefore defined as $\alpha_i \leftarrow \alpha_i +\eta$.
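Algorithm 2 can be sketched the same way. The Gram matrix of inner products $x_j \cdot x_i$ is precomputed once, which is the practical advantage of the dual form; same toy data as in Algorithm 1, and `max_epochs` is again an added safety cap:

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=100):
    """Dual perceptron: maintain alpha_i and b, using the Gram matrix."""
    N = X.shape[0]
    alpha = np.zeros(N)            # alpha <- 0
    b = 0.0                        # b <- 0
    gram = X @ X.T                 # precompute all x_j . x_i
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            # condition of step (3): y_i (sum_j alpha_j y_j x_j . x_i + b) <= 0
            if y[i] * (np.sum(alpha * y * gram[:, i]) + b) <= 0:
                alpha[i] += eta    # alpha_i <- alpha_i + eta
                b += eta * y[i]    # b <- b + eta * y_i
                errors += 1
        if errors == 0:
            break
    w = (alpha * y) @ X            # recover w via equation (9)
    return w, b, alpha

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b, alpha = train_perceptron_dual(X, y)   # same hyperplane as the primal form
```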
1.4 Summary
Drawbacks:
(1). cannot handle data that are not linearly separable;
(2). different initial values and different orders of picking misclassified points yield different separating hyperplanes.
2. SVM
Source: 《统计学习方法(第2版)》 (Statistical Learning Methods, 2nd ed.), Chapter 7
2.1 Modeling
The functional margin of a single sample point with respect to the hyperplane:
$$
\hat{\gamma_i}=y_i(w\cdot x_i+b) \tag{11}
$$
Take the smallest functional margin over all sample points as the functional margin of the whole dataset:
$$
\hat{\gamma}=\min_{i=1,2,...,N} \hat{\gamma_i} \tag{12}
$$
Scaling $w,b$ by the same factor changes the functional margin without changing the hyperplane, so we introduce the geometric margin:
$$
\gamma_i =y_i(\frac{w}{||w||}\cdot x_i+\frac{b}{||w||}) \tag{13}
$$
The geometric margin of the whole dataset:
$$
\gamma=\min_{i=1,2,...,N}\gamma_i \tag{14}
$$
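A quick numeric check of equations (11)–(14), using a made-up hyperplane and points: scaling $(w, b)$ by a constant changes the functional margin but leaves the geometric margin unchanged:

```python
import numpy as np

# Hypothetical hyperplane w·x + b = 0 and a few labeled points.
w = np.array([1.0, 1.0]); b = -3.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

func_margins = y * (X @ w + b)                  # equation (11), per point
gamma_hat = func_margins.min()                  # equation (12)
geo_margins = func_margins / np.linalg.norm(w)  # equation (13), per point
gamma = geo_margins.min()                       # equation (14)

# Scaling (w, b) by 10 scales the functional margin by 10 ...
func2 = y * (X @ (10 * w) + 10 * b)
# ... but the geometric margin func2 / ||10w|| stays equal to gamma.
```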
Having modeled the classification problem, we maximize the margin, since a larger margin is taken to mean higher confidence in the classification:
$$
\begin{aligned} \max_{w,b}\ \ \ &\gamma \\ s.t.\ \ \ &y_i(\frac{w}{||w||}\cdot x_i+\frac{b}{||w||})\geqslant \gamma,&i=1,2,...,N \end{aligned} \tag{15}
$$
The functional and geometric margins are related by $\gamma=\frac{\hat{\gamma}}{||w||}$. Substituting this into (15):
$$
\begin{aligned} \max_{w,b}\ \ \ &\frac{\hat{\gamma}}{||w||} \\ s.t.\ \ \ &y_i(w\cdot x_i + b) \geqslant \hat{\gamma},&i=1,2,...,N \end{aligned} \tag{16}
$$
Setting $\hat{\gamma}=1$ does not affect the final solution. Also, maximizing $\frac{1}{||w||}$ is equivalent to minimizing $\frac{1}{2}||w||^2$, so the problem becomes:
$$
\begin{aligned} \min_{w,b} \ \ \ &\frac{1}{2}||w||^2 \\ s.t. \ \ \ &y_i(w\cdot x_i+b)-1 \geqslant 0,&i=1,2,...,N \end{aligned} \tag{17}
$$
This is a convex quadratic programming problem.
2.2 Solving the Problem
2.2.1 Derivation
Equation (17) cannot be solved directly in this form, so we convert the primal problem into its dual problem:
(1). Construct the Lagrangian:
$$
\begin{aligned} L(w,b,\alpha)=&\frac{1}{2}||w||^2-\sum_{i=1}^N\alpha_i(y_i(w \cdot x_i+b)-1)\\ =&\frac{1}{2}||w||^2-\sum_{i=1}^N\alpha_iy_i(w\cdot x_i +b)+\sum_{i=1}^N\alpha_i \end{aligned} \tag{18}
$$
The dual problem is then the max-min problem:
$$
\max_{\alpha} \min_{w,b} L(w,b,\alpha) \tag{19}
$$
(2). Solve the inner minimization:
Take the partial derivatives and set them to zero:
$$
\begin{aligned} \nabla_w L(w,b,\alpha)=&w-\sum_{i=1}^N \alpha_i y_i x_i=0 \\ \nabla_b L(w,b,\alpha)=&-\sum_{i=1}^N \alpha_i y_i = 0 \end{aligned} \tag{20}
$$
That is:
$$
\begin{aligned} w=&\sum_{i=1}^N \alpha_iy_ix_i \\ &\sum_{i=1}^N \alpha_i y_i =0 \end{aligned} \tag{21}
$$
Substituting (21) into (18):
$$
\begin{aligned} L(w,b,\alpha)=&\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_jy_iy_j(x_i \cdot x_j)-\sum_{i=1}^N\alpha_iy_i\left(\left(\sum_{j=1}^N\alpha_jy_jx_j\right) \cdot x_i + b\right)+\sum_{i=1}^N\alpha_i \\ =&-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_jy_iy_j(x_i\cdot x_j)+\sum_{i=1}^N\alpha_i \end{aligned} \tag{22}
$$
(3). Solve the outer maximization:
$$
\begin{aligned} \max_{\alpha}\ \ \ &-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j(x_i \cdot x_j) + \sum_{i=1}^N\alpha_i \\ s.t. \ \ \ &\sum_{i=1}^N \alpha_iy_i=0 \\ &\alpha_i \geqslant0,i=1,2,...,N \end{aligned} \tag{23}
$$
Convert the maximization into an equivalent minimization:
$$
\begin{aligned} \min_{\alpha}\ \ \ &\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j(x_i \cdot x_j)-\sum_{i=1}^N\alpha_i \\ s.t.\ \ \ &\sum_{i=1}^N\alpha_iy_i=0 \\ &\alpha_i \geqslant0,i=1,2,...,N \end{aligned} \tag{24}
$$
2.2.2 Linearly Separable SVM Algorithm
(1). Construct and solve the constrained optimization problem:
$$
\begin{aligned} \min_{\alpha}\ \ \ &\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_jy_i y_j(x_i \cdot x_j)-\sum_{i=1}^N\alpha_i \\ s.t.\ \ \ &\sum_{i=1}^N\alpha_iy_i=0 \\ &\alpha_i \geqslant0,i=1,2,...,N \end{aligned} \tag{25}
$$
Solving it yields the optimal solution $\alpha^*=(\alpha_{1}^*,\alpha_{2}^*,...,\alpha_{N}^*)^T$.
(2). Compute
$$
w^*=\sum_{i=1}^N\alpha_i^*y_ix_i \tag{26}
$$
and pick a positive component $\alpha_{j}^*>0$ of $\alpha^*$, then compute
$$
b^*=y_j-\sum_{i=1}^N\alpha_{i}^*y_i(x_i \cdot x_j) \tag{27}
$$
(3). The separating hyperplane is then:
$$
w^* \cdot x +b^* = 0 \tag{28}
$$
and the classification decision function:
$$
f(x)=sign(w^*\cdot x+b^*) \tag{29}
$$
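Steps (1)–(3) can be sketched with a general-purpose solver. Here I use scipy's SLSQP method as a stand-in for a dedicated QP/SMO solver, on a made-up linearly separable toy set (same points as in the perceptron section):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical, for illustration only).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
N = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j (x_i . x_j)

def objective(a):
    """Objective of problem (25): 1/2 a^T Q a - sum(a)."""
    return 0.5 * a @ Q @ a - a.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum alpha_i y_i = 0
bounds = [(0, None)] * N                                # alpha_i >= 0

res = minimize(objective, np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
w_star = (alpha * y) @ X                                # equation (26)
j = int(np.argmax(alpha))                               # a positive component alpha_j > 0
b_star = y[j] - np.sum(alpha * y * (X @ X[j]))          # equation (27)
# Decision function (29): sign(w_star . x + b_star)
```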
2.3 Extended Modeling
The previous formulation has a flaw: it does not allow any outliers. To make the algorithm more robust, we introduce the concept of the soft margin.
The primal problem can then be defined with slack variables $\xi_i$ and a penalty parameter $C>0$ (the soft-margin formulation of Chapter 7 of the book):

$$
\begin{aligned} \min_{w,b,\xi} \ \ \ &\frac{1}{2}||w||^2+C\sum_{i=1}^N\xi_i \\ s.t. \ \ \ &y_i(w\cdot x_i+b) \geqslant 1-\xi_i,&i=1,2,...,N \\ &\xi_i \geqslant 0,&i=1,2,...,N \end{aligned}
$$