Preface
Equation (6.8) on page 123 of the Watermelon Book (西瓜书):
$$L(w,b,\alpha)=\frac{1}{2}\|w\|^{2}+\sum_{i=1}^{m}\alpha_i\bigl(1-y_i(w^{T}x_i+b)\bigr) \tag{6.8}$$
where $\alpha=(\alpha_1,\alpha_2,\dots,\alpha_m)$. Setting $\displaystyle\frac{\partial L}{\partial w}=0$ and $\displaystyle\frac{\partial L}{\partial b}=0$ yields
$$w=\sum_{i=1}^{m}\alpha_iy_ix_i \tag{6.9}$$
$$0=\sum_{i=1}^{m}\alpha_iy_i \tag{6.10}$$
The derivation relies on vector norms from linear algebra and on matrix differentiation from matrix theory, so we first review both topics and then work through the derivation for $L(w,b,\alpha)$.
1. Vector Norms
The norm (length) of a vector $w=(w_1,w_2,\dots,w_n)\in\Re^n$ is defined as
$$\|w\|=\sqrt{w_1^2+w_2^2+\dots+w_n^2}$$
Therefore, treating $w$ as a column vector (as the term $w^{T}x_i$ in (6.8) does),
$$\|w\|=\sqrt{w_1^2+w_2^2+\dots+w_n^2}=\sqrt{w^{T}w}$$
that is,
$$\|w\|^2=w_1^2+w_2^2+\dots+w_n^2=w^{T}w$$
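The identity $\|w\|^2=w^{T}w$ is easy to sanity-check numerically. The sketch below (not from the original post) uses NumPy with an arbitrary example vector:

```python
import numpy as np

# Arbitrary example vector (any real vector works).
w = np.array([3.0, 4.0, 12.0])

# ||w|| from the definition: square root of the sum of squared components.
norm_by_definition = np.sqrt(np.sum(w ** 2))

# The same squared length via the inner product w^T w.
norm_squared = w @ w

print(norm_by_definition)                                 # 13.0
print(np.isclose(norm_squared, norm_by_definition ** 2))  # True
```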
2. Matrix Differentiation
Example:
Suppose we have a function of two variables:
$$f(x_1,x_2)=3x_1+2x_2$$
Taking the partial derivative with respect to each variable gives
$$\frac{\partial f}{\partial x_1}=3,\qquad \frac{\partial f}{\partial x_2}=2$$
Collecting the variables into a vector, the vector variable can be written as
$$x=[x_1,x_2]^T$$
Differentiating the function $f$ with respect to the vector variable $x$ then gives
$$\frac{\partial f(x)}{\partial x}=\begin{bmatrix} 3 \\ 2 \end{bmatrix}$$
This is the result of differentiating with respect to the vector $x$.
The bivariate function $f(x_1,x_2)=3x_1+2x_2$ can be written as
$$f(x)=A^Tx$$
where $A=[3,2]^T$ and $x=[x_1,x_2]^T$.
Therefore
$$\frac{\partial f(x)}{\partial x}=\frac{\partial (A^Tx)}{\partial x}=A=\begin{bmatrix} 3 \\ 2 \end{bmatrix}$$
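The gradient of $f(x)=A^Tx$ can be verified with a central finite-difference approximation. A small NumPy sketch, where the evaluation point $x_0$ is an arbitrary choice:

```python
import numpy as np

A = np.array([3.0, 2.0])

def f(x):
    # f(x) = A^T x = 3*x1 + 2*x2
    return A @ x

x0 = np.array([1.0, -1.0])  # arbitrary evaluation point
eps = 1e-6

# Central finite differences along each coordinate direction.
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

print(np.allclose(grad, A, atol=1e-4))  # True
```

Because $f$ is linear, the finite-difference gradient matches $A$ exactly up to rounding.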
1. The Gradient-Vector Form of Vector Differentiation
In general, let $f(x)$ be a function of the vector variable $x=[x_1,x_2,\dots,x_n]^T$.
Then
$$\frac{\partial f(x)}{\partial x}=\left[\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},\dots,\frac{\partial f}{\partial x_n}\right]^T$$
This expression is also known as the gradient-vector form of vector differentiation:
$$\nabla_x f(x)=\frac{\partial f(x)}{\partial x}=\left[\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},\dots,\frac{\partial f}{\partial x_n}\right]^T$$
Matrix differentiation works analogously to vector differentiation.
1. Result 1
$$\frac{\partial (x^TA)}{\partial x}=\frac{\partial (A^Tx)}{\partial x}=A$$
Proof: let $A=[a_1,a_2,\dots,a_n]^T$, where $a_1,a_2,\dots,a_n$ are constants. Then
$$\begin{aligned} \frac{\partial (x^TA)}{\partial x}&=\frac{\partial (A^Tx)}{\partial x}\\ &=\frac{\partial (a_1x_1+a_2x_2+\dots+a_nx_n)}{\partial x}\\ &=\begin{bmatrix} \dfrac{\partial (a_1x_1+a_2x_2+\dots+a_nx_n)}{\partial x_1}\\ \dfrac{\partial (a_1x_1+a_2x_2+\dots+a_nx_n)}{\partial x_2}\\ \vdots\\ \dfrac{\partial (a_1x_1+a_2x_2+\dots+a_nx_n)}{\partial x_n} \end{bmatrix}\\ &=[a_1,a_2,\dots,a_n]^T\\ &=A \end{aligned}$$
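Result 1 can also be checked numerically for a randomly chosen constant vector $A$. A sketch (the dimension and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=n)    # constant vector a_1, ..., a_n
x0 = rng.normal(size=n)   # arbitrary evaluation point

def f(x):
    # Scalar function f(x) = x^T A
    return x @ A

eps = 1e-6

# Central finite differences along each coordinate direction.
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

# The finite-difference gradient matches A componentwise.
print(np.allclose(grad, A, atol=1e-4))  # True
```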
2. Result 2
$$\frac{\partial (x^Tx)}{\partial x}=2x$$
Proof:
$$\begin{aligned} \frac{\partial (x^Tx)}{\partial x}&=\begin{bmatrix} \dfrac{\partial (x_1^2+x_2^2+\dots+x_n^2)}{\partial x_1}\\ \dfrac{\partial (x_1^2+x_2^2+\dots+x_n^2)}{\partial x_2}\\ \vdots\\ \dfrac{\partial (x_1^2+x_2^2+\dots+x_n^2)}{\partial x_n} \end{bmatrix}\\ &=[2x_1,2x_2,\dots,2x_n]^T\\ &=2x \end{aligned}$$
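As with Result 1, the identity $\partial(x^Tx)/\partial x = 2x$ can be confirmed with a finite-difference check (random point and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
x0 = rng.normal(size=n)  # arbitrary evaluation point

def f(x):
    # Scalar function f(x) = x^T x
    return x @ x

eps = 1e-6

# Central finite differences along each coordinate direction.
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

# The finite-difference gradient matches 2*x0 componentwise.
print(np.allclose(grad, 2 * x0, atol=1e-4))  # True
```

Since $f$ is quadratic, the central difference is exact up to floating-point rounding.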
3. The Result of $\displaystyle\frac{\partial \|w\|^2}{\partial w}$
From the discussion in Sections 1 and 2, we have
$$\frac{\partial \|w\|^2}{\partial w}=\frac{\partial (w^{T}w)}{\partial w}=2w$$
4. The Computation for Equation (6.8)
When $\displaystyle\frac{\partial L}{\partial w}=0$, we have
$$\begin{aligned} \frac{\partial L}{\partial w}&=\frac{\partial \left(\frac{1}{2}\|w\|^{2}+\sum\limits_{i=1}^{m}\alpha_i(1-y_i(w^{T}x_i+b))\right)}{\partial w}\\ &=\frac{\partial \frac{1}{2}\|w\|^{2}}{\partial w}+\frac{\partial \left(\sum\limits_{i=1}^{m}\alpha_i\right)}{\partial w}-\frac{\partial \left(\sum\limits_{i=1}^{m}\alpha_iy_iw^{T}x_i\right)}{\partial w}-\frac{\partial \left(\sum\limits_{i=1}^{m}\alpha_iy_ib\right)}{\partial w} \end{aligned}$$
Since $\alpha_i$, $y_i$, and $b$ do not depend on $w$,
$$\frac{\partial \left(\sum\limits_{i=1}^{m}\alpha_i\right)}{\partial w}=0$$
$$\frac{\partial \left(\sum\limits_{i=1}^{m}\alpha_iy_ib\right)}{\partial w}=0$$
Also, by Section 3,
$$\frac{1}{2}\frac{\partial \|w\|^{2}}{\partial w}=w$$
Therefore, applying Result 1 to the remaining term,
$$\begin{aligned} \frac{\partial L}{\partial w}&=w-\frac{\partial \left(\sum\limits_{i=1}^{m}\alpha_iy_iw^{T}x_i\right)}{\partial w}\\ &=w-\sum\limits_{i=1}^{m}\alpha_iy_ix_i\\ &=0 \end{aligned}$$
that is,
$$w=\sum_{i=1}^{m}\alpha_iy_ix_i \tag{6.9}$$
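The stationarity condition can be checked end to end: with $w=\sum_i\alpha_iy_ix_i$ plugged in, the gradient of $L$ with respect to $w$ should vanish. The sketch below uses randomly generated data, labels, and multiplier values purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 6, 3
X = rng.normal(size=(m, d))          # sample points x_i as rows
y = rng.choice([-1.0, 1.0], size=m)  # labels y_i
alpha = rng.uniform(0, 1, size=m)    # illustrative multipliers alpha_i
b = 0.5                              # arbitrary bias

def L(w):
    # L(w, b, alpha) from eq. (6.8)
    return 0.5 * (w @ w) + alpha @ (1 - y * (X @ w + b))

w_star = (alpha * y) @ X  # eq. (6.9): w = sum_i alpha_i y_i x_i

eps = 1e-6
grad = np.array([(L(w_star + eps * e) - L(w_star - eps * e)) / (2 * eps)
                 for e in np.eye(d)])

# The gradient of L with respect to w vanishes at w_star.
print(np.allclose(grad, 0.0, atol=1e-4))  # True
```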
When $\displaystyle\frac{\partial L}{\partial b}=0$: since $\alpha_i$, $y_i$, $x_i$, and $w^T$ do not depend on $b$,
$$\frac{\partial L}{\partial b}=0-\sum_{i=1}^{m}\alpha_iy_i=0$$
Therefore
$$\sum_{i=1}^{m}\alpha_iy_i=0 \tag{6.10}$$
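The derivative with respect to $b$ can be checked the same way: for any fixed $w$, a finite difference of $L$ in $b$ should equal $-\sum_i\alpha_iy_i$. Again the data and multiplier values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 6, 3
X = rng.normal(size=(m, d))          # sample points x_i as rows
y = rng.choice([-1.0, 1.0], size=m)  # labels y_i
alpha = rng.uniform(0, 1, size=m)    # illustrative multipliers alpha_i
w = rng.normal(size=d)               # arbitrary fixed w

def L(b):
    # L as a function of b alone, from eq. (6.8)
    return 0.5 * (w @ w) + alpha @ (1 - y * (X @ w + b))

eps = 1e-6
dL_db = (L(eps) - L(-eps)) / (2 * eps)

# dL/db equals -sum_i alpha_i y_i; setting it to zero gives eq. (6.10).
print(np.isclose(dL_db, -(alpha * y).sum(), atol=1e-5))  # True
```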
Summary
Support vector machines demand a great deal of mathematics; this note serves as a record for myself.
This article draws on the Bilibili video 机器学习中的矩阵求导方法 (Matrix Differentiation Methods in Machine Learning): https://www.bilibili.com/medialist/detail/ml1590616425?type=1&spm_id_from=333.999.0.0