Note: this article does not consider complex matrices; only real matrices are treated.
Notation
Symbol | Meaning | Symbol | Meaning |
---|---|---|---|
$\mathbf{X},\mathbf{A},\mathbf{B}$ | matrices | $\mathbf{F}(\cdot)$ | matrix-valued function |
$\mathbf{x},\mathbf{y},\mathbf{z}$ | vectors | $\mathbf{f}(\cdot),\mathbf{g}(\cdot)$ | vector-valued functions |
$x,y$ | scalars | $f(\cdot)$ | scalar-valued function |
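In the previous article we computed gradients of matrix functions directly from the definition and gave the expression for the Hessian matrix of a multivariate function. In practice that approach is still somewhat tedious, so this article presents a standard form for computing gradients.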
I. Hadamard product of matrices
The Hadamard product of an $m \times n$ matrix $\mathbf A=[a_{ij}]$ and an $m \times n$ matrix $\mathbf B=[b_{ij}]$ is written $\mathbf A\odot\mathbf B$. It is again an $m\times n$ matrix, whose entries are the products of the corresponding entries of the two matrices:
$$(\mathbf A\odot\mathbf B)_{ij}=a_{ij}b_{ij}$$
In other words, the Hadamard product is a map $\mathbb R^{m\times n}\times \mathbb R^{m\times n}\rightarrow\mathbb R^{m\times n}$.
Some properties of the Hadamard product are listed below.
$$
\begin{aligned}
&(1)\ (\mathbf A\odot \mathbf B)^T=\mathbf A^T\odot\mathbf B^T\\
&(2)\ c(\mathbf A\odot\mathbf B)=(c\mathbf A)\odot\mathbf B=\mathbf A\odot (c\mathbf B)\\
&(3)\ \mathbf A\odot \mathbf B=\mathbf B\odot \mathbf A\\
&(4)\ \text{If the } m\times m \text{ matrices } \mathbf A,\mathbf B \text{ are positive definite (or positive semidefinite), then so is } \mathbf A\odot\mathbf B.\\
&(5)\ \mathrm{tr}(\mathbf A^T(\mathbf B\odot\mathbf C))=\mathrm{tr}((\mathbf A^T\odot\mathbf B^T)\mathbf C)\\
&(6)\ \mathrm{vec}(\mathbf A\odot\mathbf X)=\mathrm{diag}(\mathrm{vec}(\mathbf A))\,\mathrm{vec}(\mathbf X)
\end{aligned}
$$
In (6), $\mathrm{diag}(\mathrm{vec}(\mathbf A))$ is the diagonal matrix whose diagonal entries are the entries of $\mathrm{vec}(\mathbf A)$.
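The properties above can be spot-checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: the matrix sizes, the random test data and the helper `vec` are assumptions chosen for illustration, with `vec` stacking columns as in the usual definition of the vec operator.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

# In NumPy the Hadamard product is the elementwise `*` operator.
H = A * B

# Property (1): the transpose distributes over the Hadamard product.
prop1 = np.allclose(H.T, A.T * B.T)
# Property (5): both traces equal the sum of a_ij * b_ij * c_ij.
prop5 = np.isclose(np.trace(A.T @ (B * C)), np.trace((A.T * B.T) @ C))
# Property (6): vec(A ⊙ X) = diag(vec(A)) vec(X), here with X = C.
prop6 = np.allclose(vec(A * C), np.diag(vec(A)) @ vec(C))

# Property (4): Hadamard product of two positive definite matrices.
# M M^T + I is symmetric positive definite for any real square M.
M1, M2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
P, Q = M1 @ M1.T + np.eye(4), M2 @ M2.T + np.eye(4)
min_eig = np.linalg.eigvalsh(P * Q).min()  # should be strictly positive
```

A check on random matrices is of course no proof; it only guards against transcription mistakes such as a misplaced transpose.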
II. Kronecker product of matrices
The Kronecker product of two matrices comes in two forms: the left Kronecker product and the right Kronecker product.
The right Kronecker product of an $m \times n$ matrix $\mathbf A=[a_{ij}]$ and a $p \times q$ matrix $\mathbf B=[b_{ij}]$ is written $\mathbf A\otimes\mathbf B$. It is an $mp\times nq$ matrix defined as
$$
\mathbf A\otimes\mathbf B=[a_{ij}\mathbf B]_{i=1,j=1}^{m,n}=\begin{bmatrix}a_{11}\mathbf B &a_{12}\mathbf B&\cdots&a_{1n}\mathbf B\\ a_{21}\mathbf B &a_{22}\mathbf B&\cdots&a_{2n}\mathbf B\\ \vdots&\vdots&\ddots&\vdots\\ a_{m1}\mathbf B &a_{m2}\mathbf B&\cdots&a_{mn}\mathbf B \end{bmatrix}
$$
The left Kronecker product of an $m \times n$ matrix $\mathbf A=[a_{ij}]$ and a $p \times q$ matrix $\mathbf B=[b_{ij}]$ is written $[\mathbf A\otimes\mathbf B]_{\mathrm{left}}$. It is an $mp\times nq$ matrix defined as
$$
[\mathbf A\otimes\mathbf B]_{\mathrm{left}}=[b_{ij}\mathbf A]_{i=1,j=1}^{p,q}=\begin{bmatrix}b_{11}\mathbf A &b_{12}\mathbf A&\cdots&b_{1q}\mathbf A\\ b_{21}\mathbf A &b_{22}\mathbf A&\cdots&b_{2q}\mathbf A\\ \vdots&\vdots&\ddots&\vdots\\ b_{p1}\mathbf A &b_{p2}\mathbf A&\cdots&b_{pq}\mathbf A \end{bmatrix}
$$
Here the block indices run over the entries of $\mathbf B$, so $i=1,\cdots,p$ and $j=1,\cdots,q$.
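Clearly the left Kronecker product can be expressed through the right one, since $[\mathbf A\otimes\mathbf B]_{\mathrm{left}}=\mathbf B\otimes\mathbf A$. From here on, "Kronecker product" always means the right Kronecker product.
The Kronecker product has the following commonly used properties.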
$$
\begin{aligned}
&(1)\ \mathbf A\otimes\mathbf B \neq \mathbf B\otimes\mathbf A \quad \text{(in general)}\\
&(2)\ \alpha\mathbf A\otimes\beta\mathbf B=\alpha\beta(\mathbf A\otimes\mathbf B)\\
&(3)\ (\mathbf A\mathbf B)\otimes(\mathbf C\mathbf D)=(\mathbf A\otimes\mathbf C)(\mathbf B\otimes\mathbf D) \quad \text{(whenever } \mathbf{AB} \text{ and } \mathbf{CD} \text{ are defined)}\\
&(4)\ (\mathbf A\otimes\mathbf B)^T=\mathbf A^T\otimes\mathbf B^T\\
&(5)\ (\mathbf A\otimes\mathbf B)^{-1}=\mathbf A^{-1}\otimes\mathbf B^{-1} \quad \text{(for invertible } \mathbf A,\mathbf B\text{)}\\
&(6)\ \mathrm{vec}(\mathbf A\mathbf X \mathbf B)=(\mathbf B^T\otimes \mathbf A)\,\mathrm{vec}(\mathbf X)\\
&(7)\ \mathrm{vec}(\mathbf a \mathbf b^T)=\mathbf b\otimes \mathbf a
\end{aligned}
$$
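As a sanity check on the list above, the sketch below evaluates several of the identities with `numpy.kron`, which computes the right Kronecker product. The shapes, the random data and the column-stacking helper `vec` are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def vec(M):
    """Column-stacking vectorization, the convention assumed for vec here."""
    return M.reshape(-1, order="F")

A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
a, b = rng.standard_normal(3), rng.standard_normal(4)

# (4) transpose rule
prop4 = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
# (6) vec(AXB) = (B^T ⊗ A) vec(X)
prop6 = np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))
# (7) vec(a b^T) = b ⊗ a
prop7 = np.allclose(vec(np.outer(a, b)), np.kron(b, a))

# (3) mixed-product rule, with shapes chosen so that PQ and RS exist
P, Q = rng.standard_normal((2, 3)), rng.standard_normal((3, 2))
R, S = rng.standard_normal((4, 2)), rng.standard_normal((2, 3))
prop3 = np.allclose(np.kron(P @ Q, R @ S), np.kron(P, R) @ np.kron(Q, S))

# (5) and (1) on square matrices; M M^T + I is positive definite, hence invertible
M1, M2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
U, V = M1 @ M1.T + np.eye(3), M2 @ M2.T + np.eye(3)
prop5 = np.allclose(np.linalg.inv(np.kron(U, V)),
                    np.kron(np.linalg.inv(U), np.linalg.inv(V)))
commutes = np.allclose(np.kron(U, V), np.kron(V, U))  # expected to be False
```

Identity (6) is the one used most often in practice, because it turns a matrix equation in $\mathbf X$ into an ordinary linear system in $\mathrm{vec}(\mathbf X)$.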
III. Elementwise functions
Suppose a function $f(x)$ takes a scalar $x$ as its input. Given a set of $K$ scalars $x_1,x_2,\cdots,x_K$, applying $f$ to each of them yields another set of $K$ scalars $z_1,z_2,\cdots,z_K$:
$$z_k=f(x_k),\qquad \forall k=1,\cdots,K$$
Define $\mathbf x=[x_1,\cdots,x_K]^T$ and $\mathbf z=[z_1,\cdots,z_K]^T$. Then
$$\mathbf z=\mathbf f(\mathbf x)$$
where $\mathbf f(\mathbf x)$ acts entry by entry, that is, $[\mathbf f(\mathbf x)]_k=f(x_k)$. A function of this kind is called an elementwise function. The same definition carries over from vector arguments to matrix arguments.
When $x$ is a scalar, the derivative of $f(x)$ is written $f'(x)$. When the input is a $K$-dimensional vector $\mathbf x = [x_1,\cdots,x_K]^T$, the derivative is a diagonal matrix:
$$
\begin{aligned}
\frac{\partial \mathbf{f}(\mathbf x)}{\partial \mathbf x}&=
\begin{bmatrix}
f'(x_1) & 0 & \cdots & 0\\
0 & f'(x_2) & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & f'(x_K)
\end{bmatrix}\\
&=\mathrm{diag}(\mathbf f'(\mathbf x))
\end{aligned}
$$
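The diagonal structure can be confirmed with finite differences. The sketch below assumes $f=\tanh$ as the elementwise function and uses a hypothetical helper `numerical_jacobian`; neither appears in the original text.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian with J[i, j] = d func(x)[j] / d x[i]."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((x.size, func(x).size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return J

x = np.array([0.5, -1.0, 2.0])

# Elementwise function f = tanh, whose scalar derivative is 1 - tanh(x)^2.
J_numeric = numerical_jacobian(np.tanh, x)
J_formula = np.diag(1.0 - np.tanh(x) ** 2)

# Off-diagonal entries vanish because output k depends only on input k.
off_diagonal = J_numeric - np.diag(np.diag(J_numeric))
```

Because the Jacobian is diagonal, multiplying by it is the same as a Hadamard product with the vector $\mathbf f'(\mathbf x)$, which is how it is normally implemented.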
IV. Standard differential form of real-valued functions
The matrix differential is denoted $\mathrm{d}\mathbf X$ and is defined as $\mathrm{d}\mathbf X\overset{\mathrm{def}}{=}[\mathrm{d}x_{ij}]^{m,n}_{i=1,j=1}$.
Commonly used properties of the matrix differential are listed below.
$$
\begin{aligned}
&(1)\ \mathrm{d}(\mathrm{tr}(\mathbf X))=\mathrm{tr}(\mathrm{d}\mathbf X)\\
&(2)\ \mathrm{d}(\mathbf X \mathbf Y)=(\mathrm d\mathbf X)\mathbf Y+\mathbf X(\mathrm d\mathbf Y)\\
&(3)\ \mathrm d(\mathbf X^T)=(\mathrm d\mathbf X)^T\\
&(4)\ \mathrm d(\alpha\mathbf X+\beta\mathbf Y)=\alpha\,\mathrm d\mathbf X+\beta\,\mathrm d\mathbf Y\\
&(5)\ \mathrm d\mathbf A=\mathbf 0 \quad \text{(the differential of a constant matrix is zero)}\\
&(6)\ \mathrm{d}(\mathbf{X}^{-1})=-\mathbf{X}^{-1}(\mathrm{d}\mathbf{X})\mathbf{X}^{-1} \quad \text{(differentiate both sides of } \mathbf X\mathbf X^{-1}=\mathbf I\text{)}\\
&(7)\ \mathrm{d}|\mathbf X|=|\mathbf{X}|\,\mathrm{tr}(\mathbf X^{-1}\mathrm{d}\mathbf{X}) \quad \text{(can be proved by Laplace expansion)}\\
&(8)\ \mathrm d(\mathbf{X}\odot \mathbf{Y})=(\mathrm d\mathbf{X})\odot \mathbf{Y}+\mathbf{X}\odot(\mathrm d\mathbf{Y})\\
&(9)\ \mathrm d(\mathbf{X}\otimes \mathbf{Y})=(\mathrm d\mathbf{X})\otimes \mathbf{Y}+\mathbf{X}\otimes(\mathrm d\mathbf{Y})\\
&(10)\ \mathrm d(\mathrm{vec}(\mathbf{X}))=\mathrm{vec}(\mathrm d\mathbf{X})\\
&(11)\ \mathrm d\log\mathbf{X}=\mathbf{X}^{-1}\mathrm d\mathbf{X}\\
&(12)\ \mathrm d\,\mathbf f(\mathbf X)=\mathbf f'(\mathbf X)\odot\mathrm d\mathbf X \quad (\mathbf f \text{ is an elementwise function})
\end{aligned}
$$
If $\log$ in (11) denotes the matrix logarithm, the rule holds when $\mathbf X$ and $\mathrm d\mathbf X$ commute (the scalar case, for instance); the trace form $\mathrm d\,\mathrm{tr}(\log\mathbf X)=\mathrm{tr}(\mathbf X^{-1}\mathrm d\mathbf X)$ holds without that restriction.
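Rules (6) and (7) are the two that are easiest to get wrong, so here is a small first-order check. It is only a sketch under assumed test data: `X` is built as a symmetric positive definite matrix so that it is safely invertible, and `dX` is a small random perturbation standing in for the differential.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
X = M @ M.T + np.eye(4)                    # positive definite, so invertible
dX = 1e-6 * rng.standard_normal((4, 4))    # small perturbation playing the role of dX
X_inv = np.linalg.inv(X)
det_X = np.linalg.det(X)

# Rule (6): d(X^{-1}) = -X^{-1} (dX) X^{-1}
actual_inv_change = np.linalg.inv(X + dX) - X_inv
predicted_inv_change = -X_inv @ dX @ X_inv
err_inv = np.linalg.norm(actual_inv_change - predicted_inv_change)

# Rule (7): d|X| = |X| tr(X^{-1} dX)
actual_det_change = np.linalg.det(X + dX) - det_X
predicted_det_change = det_X * np.trace(X_inv @ dX)
err_det = abs(actual_det_change - predicted_det_change)
```

Both errors are second order in the size of `dX`, which is what it means for the right-hand sides to be the differentials: they capture the change exactly up to first order.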
Consider a real-valued scalar function $f(\mathbf x)$. Its total differential is
$$
\begin{aligned}
\mathrm df&=\frac{\partial f}{\partial x_1}\mathrm d x_1+\frac{\partial f}{\partial x_2}\mathrm d x_2+\cdots+\frac{\partial f}{\partial x_n}\mathrm d x_n\\
&=\left[\frac{\partial f}{\partial x_1},\cdots,\frac{\partial f}{\partial x_n}\right]\begin{bmatrix}\mathrm dx_1\\ \vdots \\ \mathrm dx_n \end{bmatrix}\\
&=\left(\frac{\partial f}{\partial \mathbf x}\right)^T\mathrm d\mathbf x
\end{aligned}
$$
Next consider a scalar function $f(\mathbf X)$ whose argument is an $m\times n$ matrix $\mathbf X$. From the total differential it follows that
$$
\begin{aligned}
\mathrm df&=\frac{\partial f}{\partial x_{11}}\mathrm d x_{11}+\frac{\partial f}{\partial x_{12}}\mathrm d x_{12}+\cdots+\frac{\partial f}{\partial x_{mn}}\mathrm d x_{mn}\\
&=\left[\frac{\partial f}{\partial x_{11}},\cdots,\frac{\partial f}{\partial x_{mn}}\right]\begin{bmatrix}\mathrm d x_{11}\\ \vdots \\ \mathrm d x_{mn} \end{bmatrix}\\
&=\left(\frac{\partial f}{\partial\, \mathrm{vec}(\mathbf X)}\right)^T\mathrm d\,\mathrm{vec}(\mathbf X)\\
&=\left(\mathrm{vec}\left(\frac{\partial f(\mathbf X)}{\partial \mathbf X}\right)\right)^T\mathrm d\,\mathrm{vec}(\mathbf X)\qquad \text{(the same form as the matrix inner product)}\\
&=\mathrm{tr}\left(\left(\frac{\partial f(\mathbf X)}{\partial \mathbf X}\right)^T\mathrm d\mathbf X\right)
\end{aligned}
$$
The last step uses $\mathrm{vec}(\mathbf A)^T\mathrm{vec}(\mathbf B)=\mathrm{tr}(\mathbf A^T\mathbf B)$.
This yields the following important result:
$$
\begin{aligned}
\mathrm d f(\mathbf x)&=\left(\frac{\partial f(\mathbf x)}{\partial \mathbf x}\right)^T\mathrm d\mathbf x\\
\mathrm df(\mathbf X)&=\mathrm{tr}\left(\left(\frac{\partial f(\mathbf X)}{\partial \mathbf X}\right)^T\mathrm d\mathbf X\right)
\end{aligned}
$$
In other words, once $\mathrm d f$ has been written in this form in terms of $\mathrm d\mathbf x$ (or $\mathrm d\mathbf X$), the gradient of the function can be read off directly.
Examples
- $y=\mathbf a^T\mathbf{Xb}$, where $\mathbf a$ and $\mathbf b$ are constant vectors.
$$
\begin{aligned}
\mathrm dy&=\mathrm d(\mathbf a^T)\mathbf{Xb}+\mathbf a^T\mathrm d(\mathbf{Xb})\\
&=\mathbf a^T(\mathrm d\mathbf{X})\mathbf b\\
&=\mathrm{tr}(\mathbf a^T(\mathrm d\mathbf{X})\mathbf b)\\
&=\mathrm{tr}(\mathbf b\mathbf a^T\mathrm d\mathbf{X})
\end{aligned}
$$
so $\frac{\partial y}{\partial \mathbf X}=(\mathbf{b}\mathbf{a}^T)^T=\mathbf a\mathbf b^T$.
- Maximum likelihood estimate of the covariance. A random vector $\mathbf x$ follows $\mathcal N(\boldsymbol\mu,\mathbf \Sigma)$ and we are given samples $\mathbf x_1,\mathbf x_2,\cdots,\mathbf x_N$; find the maximum likelihood estimate of the covariance matrix.
Up to an additive constant and a positive scale factor, the negative log-likelihood is
$$L=\log |\mathbf \Sigma|+\frac{1}{N}\sum_{i=1}^N(\mathbf x_i-\boldsymbol\mu)^T\mathbf\Sigma^{-1}(\mathbf x_i-\boldsymbol\mu)$$
so the estimate minimizes $L$. Its differential is
$$
\begin{aligned}
\mathrm{d}L&=\frac{1}{|\mathbf \Sigma|}\mathrm d|\mathbf \Sigma|+\frac{1}{N}\sum_{i=1}^N(\mathbf x_i-\boldsymbol\mu)^T(\mathrm d\mathbf\Sigma^{-1})(\mathbf x_i-\boldsymbol\mu)\\
&=\mathrm{tr}(\mathbf \Sigma^{-1}\mathrm d\mathbf \Sigma)-\frac{1}{N}\sum_{i=1}^N(\mathbf x_i-\boldsymbol\mu)^T\mathbf{\Sigma}^{-1}(\mathrm{d}\mathbf{\Sigma})\mathbf{\Sigma}^{-1}(\mathbf x_i-\boldsymbol\mu)\\
&=\mathrm{tr}(\mathbf \Sigma^{-1}\mathrm d\mathbf \Sigma)-\frac{1}{N}\mathrm{tr}\left(\mathbf{\Sigma}^{-1}\sum_{i=1}^N(\mathbf x_i-\boldsymbol\mu)(\mathbf x_i-\boldsymbol\mu)^T\mathbf{\Sigma}^{-1}\mathrm{d}\mathbf{\Sigma}\right)\\
&=\mathrm{tr}\left(\left(\mathbf \Sigma^{-1}-\frac{1}{N}\mathbf{\Sigma}^{-1}\sum_{i=1}^N(\mathbf x_i-\boldsymbol\mu)(\mathbf x_i-\boldsymbol\mu)^T\mathbf{\Sigma}^{-1}\right)\mathrm{d}\mathbf{\Sigma}\right)
\end{aligned}
$$
Note that $\mathbf\Sigma^{-1}$ does not commute with the sample scatter matrix in general, so the two factors $\mathbf\Sigma^{-1}$ must stay on either side of it. Since all matrices involved are symmetric, $\nabla L=\mathbf 0$ requires
$$\mathbf \Sigma=\frac{1}{N}\sum_{i=1}^N(\mathbf x_i-\boldsymbol\mu)(\mathbf x_i-\boldsymbol\mu)^T$$
- Least squares estimate. $L=||\mathbf X\mathbf w-\mathbf y||^2$
$$
\begin{aligned}
\mathrm dL&=\mathrm d[(\mathbf X\mathbf w-\mathbf y)^T(\mathbf X\mathbf w-\mathbf y)]\\
&=2(\mathbf X\mathbf w-\mathbf y)^T\mathbf X\,\mathrm d\mathbf w
\end{aligned}
$$
so $\nabla L=2\mathbf X^T(\mathbf X\mathbf w-\mathbf y)$. Setting the gradient to zero gives the normal equations $\mathbf X^T\mathbf X\mathbf w=\mathbf X^T\mathbf y$, hence $\mathbf w=(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$ when $\mathbf X^T\mathbf X$ is invertible.
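The three examples can be cross-checked against finite differences. The sketch below is illustrative only: the helper `numerical_gradient`, the problem sizes, the random data and the choice $\boldsymbol\mu=\mathbf 0$ are assumptions, and the data matrix of the least squares example is named `D` to avoid clashing with the `X` of the first example.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f, same shape as X."""
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)

# Example 1: y = a^T X b, gradient a b^T.
a, b = rng.standard_normal(3), rng.standard_normal(4)
X = rng.standard_normal((3, 4))
grad_bilinear = np.outer(a, b)
grad_bilinear_num = numerical_gradient(lambda M: a @ M @ b, X)

# Example 2: L(Sigma) = log|Sigma| + tr(Sigma^{-1} S), with S the sample scatter / N.
samples = rng.standard_normal((50, 3))
mu = np.zeros(3)                       # mean treated as known (assumption)
centered = samples - mu
S = centered.T @ centered / len(samples)

def scaled_neg_log_likelihood(Sigma):
    _, logdet = np.linalg.slogdet(Sigma)
    return logdet + np.trace(np.linalg.inv(Sigma) @ S)

grad_at_S = numerical_gradient(scaled_neg_log_likelihood, S)  # should vanish

# Example 3: L(w) = ||D w - y||^2, gradient 2 D^T (D w - y).
D = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
w = rng.standard_normal(3)
grad_ls = 2 * D.T @ (D @ w - y)
grad_ls_num = numerical_gradient(lambda v: np.sum((D @ v - y) ** 2), w)
w_star = np.linalg.solve(D.T @ D, D.T @ y)   # solves the normal equations
```

For real problems `numpy.linalg.lstsq` is usually preferable to forming $\mathbf X^T\mathbf X$ explicitly, since it is better conditioned; the normal equations are used here only to mirror the derivation.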
V. Derivatives of composite functions
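Suppose $\mathbf Y$ is a function of $\mathbf X$. If the gradient of the objective function $f$ with respect to $\mathbf Y$ is already known, the standard differential form yields the gradient of $f$ with respect to $\mathbf X$.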
Example
- $f=f(\mathbf Y)$, $\mathbf Y=\mathbf B\mathbf X$, where $\mathbf B$ is a constant matrix.
$$
\begin{aligned}
\mathrm df&=\mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf Y}\right)^T\mathrm d\mathbf Y\right)\\
&=\mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf Y}\right)^T\mathrm d(\mathbf {BX})\right)\\
&=\mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf Y}\right)^T\mathbf B\,\mathrm d\mathbf {X}\right)\\
&=\mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf X}\right)^T\mathrm d\mathbf X\right)
\end{aligned}
$$
Comparing the last two lines gives $\frac{\partial f}{\partial \mathbf X}=\mathbf B^T\frac{\partial f}{\partial \mathbf Y}$.
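To see the rule $\frac{\partial f}{\partial \mathbf X}=\mathbf B^T\frac{\partial f}{\partial \mathbf Y}$ at work, the sketch below picks one concrete objective, $f(\mathbf Y)=\sum_{ij}\sin(y_{ij})$, whose gradient with respect to $\mathbf Y$ is $\cos(\mathbf Y)$ elementwise. The objective, the shapes and the helper `numerical_gradient` are assumptions introduced for this illustration.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f, same shape as X."""
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
X = rng.standard_normal((3, 4))

def f(Y):
    """Example objective: sum of sin over all entries of Y."""
    return np.sum(np.sin(Y))

def grad_f_wrt_Y(Y):
    """Gradient of the example objective with respect to Y."""
    return np.cos(Y)

# Rule from the text: df/dX = B^T df/dY, evaluated at Y = B X.
grad_X = B.T @ grad_f_wrt_Y(B @ X)
grad_X_num = numerical_gradient(lambda M: f(B @ M), X)
```

The result has the shape of $\mathbf X$ (here 3 by 4), which is a quick way to confirm that $\mathbf B^T$, not $\mathbf B$, belongs on the left.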
The following commonly used rules are easy to prove.
- If $\mathbf{x} \in \mathbb R^{m}$, $\mathbf{y}=\mathbf f(\mathbf x) \in \mathbb R^{n}$ and $\mathbf{z}=\mathbf g(\mathbf x)\in \mathbb R^{n}$, then
$$\frac{\partial \mathbf{y}^T\mathbf{z}}{\partial \mathbf x}=\frac{\partial \mathbf{y}}{\partial \mathbf x}\mathbf{z}+\frac{\partial \mathbf{z}}{\partial \mathbf x}\mathbf{y}\ \in \mathbb R^m \tag{3-3}$$
- If $\mathbf{x} \in \mathbb R^{m}$, $y=f(\mathbf x) \in \mathbb R$ and $\mathbf{z}=\mathbf g(\mathbf x)\in \mathbb R^{n}$, then
$$\frac{\partial y\mathbf{z}}{\partial \mathbf x}=\frac{\partial y}{\partial \mathbf x}\mathbf{z}^T+\frac{\partial \mathbf{z}}{\partial \mathbf x}y\ \in \mathbb R^{m \times n} \tag{3-5}$$
- If $\mathbf{x} \in \mathbb R^{m}$, $\mathbf{y}=\mathbf f(\mathbf x) \in \mathbb R^{n}$ and $\mathbf{z}=\mathbf g(\mathbf y)\in \mathbb R^{k}$, then
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\frac{\partial \mathbf{y}}{\partial\mathbf{x}}\frac{\partial \mathbf{z}}{\partial \mathbf{y}}\ \in \mathbb R^{m \times k}\tag{3-2}$$
- If $f(\mathbf X)$ and $g(\mathbf X)$ are both real-valued functions of the matrix $\mathbf X$, then
$$\frac{\partial [f(\mathbf X)g(\mathbf X)]}{\partial \mathbf X}=f(\mathbf X)\frac{\partial g(\mathbf X)}{\partial \mathbf X}+g(\mathbf X)\frac{\partial f(\mathbf X)}{\partial \mathbf X}$$
- If in addition $g(\mathbf X)\neq 0$, then
$$\frac{\partial [f(\mathbf X)/g(\mathbf X)]}{\partial \mathbf X}=\frac{1}{g(\mathbf X)^2}\left[g(\mathbf X)\frac{\partial f(\mathbf X)}{\partial \mathbf X}-f(\mathbf X)\frac{\partial g(\mathbf X)}{\partial \mathbf X}\right]$$
In the chain rule (3-2), pay attention to the order of the factors: if the order is reversed, the dimensions generally no longer match and the product cannot be formed.
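The ordering remark can be made concrete with a small numerical experiment. The sketch below assumes $\mathbf y=\mathbf A\mathbf x$ and $\mathbf z=\mathbf C\sin(\mathbf y)$ as example functions (they are not from the original text) and stores every Jacobian in the layout used above, where $\frac{\partial \mathbf z}{\partial \mathbf x}\in\mathbb R^{m\times k}$.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian with J[i, j] = d func(x)[j] / d x[i]."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((x.size, func(x).size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
m, n, k = 3, 4, 2
A = rng.standard_normal((n, m))    # y = A x,      y in R^n
C = rng.standard_normal((k, n))    # z = C sin(y), z in R^k
x = rng.standard_normal(m)
y = A @ x

dy_dx = A.T                         # shape (m, n)
dz_dy = np.diag(np.cos(y)) @ C.T    # shape (n, k)
dz_dx = dy_dx @ dz_dy               # chain rule (3-2), shape (m, k)

dz_dx_num = numerical_jacobian(lambda v: C @ np.sin(A @ v), x)

# Reversing the factors would multiply (n, k) by (m, n); with k != m
# NumPy refuses to form the product.
try:
    dz_dy @ dy_dx
    reversed_order_works = True
except ValueError:
    reversed_order_works = False
```

In the other common convention, where Jacobians are stored transposed relative to the one here, the same chain rule reads with the factors in the opposite order, so it pays to fix one layout and keep to it.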
VI. Derivatives of activation functions
- Derivative of the sigmoid function.
$$\mathbf y=\frac{1}{1+\exp(-\mathbf x)}$$
The derivative of the standard scalar sigmoid function $f(x)=\frac{1}{1+\exp(-x)}$ is
$$f'(x)=f(x)(1-f(x))$$
When the input is a $K$-dimensional vector, the derivative is
$$\frac{\partial\mathbf f(\mathbf x)}{\partial \mathbf x}=\mathrm{diag}(\mathbf f(\mathbf x)\odot(\mathbf 1_K-\mathbf f(\mathbf x)))$$
- Derivative of the softmax function.
The softmax function is defined as
$$z_k=\mathrm{softmax}(x_k)=\frac{\exp(x_k)}{\sum_{i=1}^{K}\exp(x_i)}$$
Using a $K$-dimensional vector $\mathbf x=[x_1,\cdots,x_K]^T$ to represent the input of the softmax function,
$$
\begin{aligned}
\mathbf z&=\mathrm{softmax}(\mathbf x)\\
&=\frac{1}{\sum_{i=1}^{K}\exp(x_i)}\begin{bmatrix} \exp(x_1) \\ \exp(x_2) \\ \vdots\\ \exp(x_K) \end{bmatrix}\\
&=\frac{\exp(\mathbf x)}{\mathbf 1_K^T\exp(\mathbf x)}
\end{aligned}
$$
Applying the product rule (3-5) to the scalar $1/(\mathbf 1_K^T\exp(\mathbf x))$ and the vector $\exp(\mathbf x)$, the derivative of the softmax function is
$$
\begin{aligned}
\frac{\partial\, \mathrm{softmax}(\mathbf x)}{\partial \mathbf x}&=\frac{\partial}{\partial \mathbf x}\left(\frac{\exp(\mathbf x)}{\mathbf 1_K^T\exp(\mathbf x)}\right)\\
&=\frac{1}{\mathbf 1_K^T\exp(\mathbf x)}\frac{\partial \exp(\mathbf x)}{\partial \mathbf x}+\frac{\partial}{\partial \mathbf x}\left(\frac{1}{\mathbf 1_K^T\exp(\mathbf x)}\right)(\exp(\mathbf x))^T\\
&=\frac{\mathrm{diag}(\exp(\mathbf x))}{\mathbf 1_K^T\exp(\mathbf x)}-\frac{1}{(\mathbf 1_K^T\exp(\mathbf x))^2}\frac{\partial (\mathbf 1_K^T\exp(\mathbf x))}{\partial \mathbf x}(\exp(\mathbf x))^T\\
&=\frac{\mathrm{diag}(\exp(\mathbf x))}{\mathbf 1_K^T\exp(\mathbf x)}-\frac{1}{(\mathbf 1_K^T\exp(\mathbf x))^2}\mathrm{diag}(\exp(\mathbf x))\mathbf 1_K(\exp(\mathbf x))^T\\
&=\frac{\mathrm{diag}(\exp(\mathbf x))}{\mathbf 1_K^T\exp(\mathbf x)}-\frac{1}{(\mathbf 1_K^T\exp(\mathbf x))^2}\exp(\mathbf x)(\exp(\mathbf x))^T\\
&=\mathrm{diag}\left(\frac{\exp(\mathbf x)}{\mathbf 1_K^T\exp(\mathbf x)}\right)-\frac{\exp(\mathbf x)}{\mathbf 1_K^T\exp(\mathbf x)}\frac{(\exp(\mathbf x))^T}{\mathbf 1_K^T\exp(\mathbf x)}\\
&=\mathrm{diag}(\mathrm{softmax}(\mathbf x))-\mathrm{softmax}(\mathbf x)\,\mathrm{softmax}(\mathbf x)^T
\end{aligned}
$$
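Both closed forms are easy to verify against finite differences. The following is a sketch under assumptions: the function names, the test point and the helper `numerical_jacobian` are illustrative, and the softmax subtracts $\max(\mathbf x)$ before exponentiating, a standard stabilization that does not change its value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    shifted = np.exp(x - np.max(x))   # shift for numerical stability
    return shifted / shifted.sum()

def sigmoid_jacobian(x):
    """diag(f(x) ⊙ (1 - f(x)))"""
    s = sigmoid(x)
    return np.diag(s * (1.0 - s))

def softmax_jacobian(x):
    """diag(softmax(x)) - softmax(x) softmax(x)^T"""
    z = softmax(x)
    return np.diag(z) - np.outer(z, z)

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian with J[i, j] = d func(x)[j] / d x[i]."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((x.size, func(x).size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return J

x = np.array([0.2, -1.3, 0.7, 2.0])
J_sigmoid, J_sigmoid_num = sigmoid_jacobian(x), numerical_jacobian(sigmoid, x)
J_softmax, J_softmax_num = softmax_jacobian(x), numerical_jacobian(softmax, x)

# Each row of the softmax Jacobian sums to zero because the outputs sum to one.
row_sums = J_softmax.sum(axis=1)
```

The softmax Jacobian is symmetric, so for this function the choice of Jacobian layout makes no difference.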
VII. Backpropagation in a multilayer perceptron
Let $\mathbf a^d,\mathbf z^d,\mathbf W^d,\mathbf b^d$ denote the activated output, the pre-activation output, the weight matrix and the bias of layer $d$, and let $\mathbf f$ be the activation function:
$$
\begin{aligned}
\mathbf a^{d}&=\mathbf f(\mathbf z^d)\\
\mathbf z^d&=\mathbf W^d\mathbf a^{d-1}+\mathbf b^d
\end{aligned}
$$
We now compute the partial derivative of the loss $L$ with respect to the weight matrix of layer $d$. Holding $\mathbf a^{d-1}$ and $\mathbf b^d$ fixed,
$$
\begin{aligned}
\mathrm dL&=\mathrm{tr}\left(\left(\frac{\partial L}{\partial \mathbf W^d}\right)^T\mathrm d\mathbf W^d\right)\\
&=\mathrm{tr}\left(\left(\frac{\partial L}{\partial \mathbf z^d}\right)^T\mathrm d\mathbf z^d\right)\\
&=\mathrm{tr}\left(\left(\frac{\partial L}{\partial \mathbf z^d}\right)^T\mathrm d(\mathbf W^d \mathbf a^{d-1})\right)\\
&=\mathrm{tr}\left(\left(\frac{\partial L}{\partial \mathbf z^d}\right)^T(\mathrm d\mathbf W^d)\,\mathbf a^{d-1}\right)\\
&=\mathrm{tr}\left(\mathbf a^{d-1}\left(\frac{\partial L}{\partial \mathbf z^d}\right)^T\mathrm d\mathbf W^d\right)
\end{aligned}
$$
Comparing the first and last lines gives $\frac{\partial L}{\partial \mathbf W^d}=\frac{\partial L}{\partial \mathbf z^d}(\mathbf a^{d-1})^T$.
Define $\frac{\partial L}{\partial \mathbf z^d}\overset{\mathrm{def}}{=}\boldsymbol\delta^{d}$, called the error of layer $d$. By the chain rule,
$$
\begin{aligned}
\boldsymbol\delta^{d}&=\frac{\partial L}{\partial \mathbf z^d}\\
&=\frac{\partial \mathbf a^d}{\partial \mathbf z^d}\frac{\partial \mathbf z^{d+1}}{\partial \mathbf a^d}\frac{\partial L}{\partial \mathbf z^{d+1}}
\end{aligned}
$$
Moreover,
$$
\begin{aligned}
\frac{\partial \mathbf a^d}{\partial\mathbf z^{d}}&=\mathrm{diag}(\mathbf f'(\mathbf z^d))\\
\frac{\partial \mathbf z^{d+1}}{\partial\mathbf a^{d}}&=(\mathbf W^{d+1})^T
\end{aligned}
$$
so the final result is
$$
\begin{aligned}
\boldsymbol\delta^d&=\mathrm{diag}(\mathbf f'(\mathbf z^d))(\mathbf W^{d+1})^T\boldsymbol\delta^{d+1}=\mathbf f'(\mathbf z^d)\odot\left((\mathbf W^{d+1})^T\boldsymbol\delta^{d+1}\right)\\
\frac{\partial L}{\partial \mathbf W^d}&=\boldsymbol\delta^d(\mathbf a^{d-1})^T
\end{aligned}
$$
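The two recursions translate almost line by line into code. The sketch below is an illustration under stated assumptions rather than a reference implementation: it uses $\tanh$ as the activation, a squared-error loss $L=\frac12\|\mathbf a^{\text{last}}-\mathbf t\|^2$, and layer sizes 3, 4, 2, none of which come from the original text. The list `weights[d]` holds the matrix that maps `activations[d]` to the next layer.

```python
import numpy as np

def f(z):
    return np.tanh(z)

def f_prime(z):
    return 1.0 - np.tanh(z) ** 2

def forward(weights, biases, a0):
    """Return all activations (including the input) and all pre-activations."""
    activations, pre_activations = [a0], []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        pre_activations.append(z)
        activations.append(f(z))
    return activations, pre_activations

def loss(weights, biases, a0, target):
    activations, _ = forward(weights, biases, a0)
    return 0.5 * np.sum((activations[-1] - target) ** 2)

def backward(weights, biases, a0, target):
    """Gradients of the loss with respect to every weight matrix."""
    activations, zs = forward(weights, biases, a0)
    # Error of the last layer for the assumed squared-error loss.
    delta = f_prime(zs[-1]) * (activations[-1] - target)
    grads = [None] * len(weights)
    for d in reversed(range(len(weights))):
        grads[d] = np.outer(delta, activations[d])      # delta^d (a^{d-1})^T
        if d > 0:
            # delta^{d-1} = f'(z^{d-1}) ⊙ (W^d)^T delta^d
            delta = f_prime(zs[d - 1]) * (weights[d].T @ delta)
    return grads

rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
a0, target = rng.standard_normal(3), rng.standard_normal(2)
grads = backward(weights, biases, a0, target)

def numerical_grad_W(layer, eps=1e-6):
    """Central-difference gradient with respect to weights[layer]."""
    W = weights[layer]
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        up = loss(weights, biases, a0, target)
        W[idx] = original - eps
        down = loss(weights, biases, a0, target)
        W[idx] = original
        G[idx] = (up - down) / (2 * eps)
    return G
```

Comparing `grads[d]` with `numerical_grad_W(d)` for each layer is the usual gradient check. The Hadamard form of the recursion is used instead of building $\mathrm{diag}(\mathbf f'(\mathbf z^d))$, which avoids an unnecessary dense matrix.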