Personal study notes; if you spot any problems, corrections are very welcome!
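Matrix Derivatives
The previous post, which covered derivatives of the matrix trace and of matrix norms, left out the most general case: differentiating a matrix with respect to a matrix. This case seems to come up especially often in neural networks. This section proves the product rule for it. The method is exactly the same as in the column-vector proof. The only difference is that, for column vectors, factoring the common factor out of the second term produces a block matrix, whereas for matrix-by-matrix derivatives no block matrix appears and a Kronecker product (direct product) is obtained instead. The details are as follows.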
Let $\bm{A}\in\mathbb{C}^{l\times m}$, $\bm{B}\in\mathbb{C}^{m\times n}$, and $\bm{W}\in\mathbb{C}^{p\times q}$. We prove:
$$
\begin{align*}
&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}\bm{W}}=\frac{\mathrm{d}\bm{A}}{\mathrm{d}\bm{W}}\big(\bm{I}_p\otimes\bm{B}\big)+\big(\bm{I}_q\otimes\bm{A}\big)\frac{\mathrm{d}\bm{B}}{\mathrm{d}\bm{W}}
\end{align*}
$$
By the definition of the derivative of a matrix with respect to a matrix:
$$
\begin{align*}
\frac{\mathrm{d}\bm{AB}}{\mathrm{d}\bm{W}}&=
\begin{bmatrix}
\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{11}}&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{21}}&\cdots&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{p1}}\\
\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{12}}&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{22}}&\cdots&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{p2}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{1q}}&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{2q}}&\cdots&\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{pq}}
\end{bmatrix}
\end{align*}
$$
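Since every block has exactly the same form, it suffices to consider the first one, $\displaystyle\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{11}}$. This is the derivative of a matrix with respect to a scalar, so the ordinary product rule applies directly: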
$$
\begin{align}
\frac{\mathrm{d}\bm{AB}}{\mathrm{d}w_{11}}=\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{11}}\bm{B}+\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{11}}
\end{align}
$$
The final result must therefore contain two terms. Consider first the term represented by $\displaystyle\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{ij}}\bm{B}$:
$$
\begin{align*}
&\begin{bmatrix}
\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{11}}\bm{B}&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{21}}\bm{B}&\cdots&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{p1}}\bm{B}\\
\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{12}}\bm{B}&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{22}}\bm{B}&\cdots&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{p2}}\bm{B}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{1q}}\bm{B}&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{2q}}\bm{B}&\cdots&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{pq}}\bm{B}
\end{bmatrix}\\
&=\begin{bmatrix}
\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{11}}&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{21}}&\cdots&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{p1}}\\
\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{12}}&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{22}}&\cdots&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{p2}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{1q}}&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{2q}}&\cdots&\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{pq}}
\end{bmatrix}\cdot\big(\bm{I}_p\otimes\bm{B}\big)\\
&=\frac{\mathrm{d}\bm{A}}{\mathrm{d}\bm{W}}\big(\bm{I}_p\otimes\bm{B}\big)
\end{align*}
$$
This is the first term of the claimed result. Next consider the term represented by $\displaystyle\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{ij}}$:
$$
\begin{align*}
&\begin{bmatrix}
\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{11}}&\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{21}}&\cdots&\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{p1}}\\
\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{12}}&\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{22}}&\cdots&\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{p2}}\\
\vdots&\vdots&&\vdots\\
\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{1q}}&\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{2q}}&\cdots&\bm{A}\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{pq}}
\end{bmatrix}\\
&=\big(\bm{I}_q\otimes\bm{A}\big)\cdot\begin{bmatrix}
\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{11}}&\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{21}}&\cdots&\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{p1}}\\
\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{12}}&\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{22}}&\cdots&\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{p2}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{1q}}&\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{2q}}&\cdots&\frac{\mathrm{d}\bm{B}}{\mathrm{d}w_{pq}}
\end{bmatrix}\\
&=\big(\bm{I}_q\otimes\bm{A}\big)\frac{\mathrm{d}\bm{B}}{\mathrm{d}\bm{W}}
\end{align*}
$$
Combining the two terms completes the proof of the formula.
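The product rule can also be checked numerically. The following is only a sketch under stated assumptions: real matrices, NumPy, and two hypothetical linear test functions $\bm{A}(\bm{W})=\bm{PWQ}$ and $\bm{B}(\bm{W})=\bm{R}\bm{W}^{\mathrm{T}}\bm{S}$. The names `P`, `Q`, `R`, `S`, `unit`, and `block_derivative` are made up for this illustration. Blocks are arranged as in the definition above, with the derivative with respect to $w_{ij}$ in block row $j$ and block column $i$.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, n, p, q = 2, 3, 4, 3, 2   # A is l x m, B is m x n, W is p x q

# Hypothetical linear test functions (assumptions, not from the text).
P, Q = rng.standard_normal((l, p)), rng.standard_normal((q, m))
R, S = rng.standard_normal((m, q)), rng.standard_normal((p, n))


def A_of(W):
    return P @ W @ Q


def B_of(W):
    return R @ W.T @ S


def unit(i, j):
    """p x q matrix with a single 1 at position (i, j)."""
    E = np.zeros((p, q))
    E[i, j] = 1.0
    return E


def block_derivative(entry_derivative):
    """Assemble the q x p grid of blocks: block row j, block column i holds d/dw_ij."""
    return np.block([[entry_derivative(i, j) for i in range(p)] for j in range(q)])


W = rng.standard_normal((p, q))
A, B = A_of(W), B_of(W)

# Exact entrywise derivatives of the two linear maps.
dA_dW = block_derivative(lambda i, j: P @ unit(i, j) @ Q)
dB_dW = block_derivative(lambda i, j: R @ unit(i, j).T @ S)

# Central differences of AB; AB is quadratic in W, so these are exact up to rounding.
h = 1e-4


def dAB_entry(i, j):
    Wp, Wm = W + h * unit(i, j), W - h * unit(i, j)
    return (A_of(Wp) @ B_of(Wp) - A_of(Wm) @ B_of(Wm)) / (2 * h)


dAB_dW = block_derivative(dAB_entry)

# Right-hand side of the product rule with Kronecker products.
rhs = dA_dW @ np.kron(np.eye(p), B) + np.kron(np.eye(q), A) @ dB_dW
product_rule_holds = np.allclose(dAB_dW, rhs, atol=1e-6)
print(dAB_dW.shape, product_rule_holds)
```

The shapes make the bookkeeping visible: the left-hand side is $(ql)\times(pn)$, and each Kronecker factor is sized so that both products on the right have that same shape.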
Matrix Trace
Let $\bm{A},\bm{B}\in\mathbb{C}^{n\times n}$ and $\bm{W}\in\mathbb{C}^{p\times q}$.
Definition
For a square matrix, the trace is defined as the sum of all entries on the main diagonal:
$$
\mathrm{tr}(\bm{A})=\sum_{i=1}^na_{ii}
$$
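Properties
The trace is discussed only for square matrices; it is not defined for general rectangular matrices.
The trace equals the sum of all eigenvalues of the matrix, counted with algebraic multiplicity: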
$$
\mathrm{tr}(\bm{A})=\sum_{i=1}^n\lambda_i
$$
The trace is unchanged when the two factors of a product are swapped:
$$
\mathrm{tr}(\bm{AB})=\mathrm{tr}(\bm{BA})
$$
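Both properties are easy to confirm numerically. The snippet below is a minimal sketch assuming NumPy and two random complex $4\times4$ matrices; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# tr(A) equals the sum of the eigenvalues of A.
trace_equals_eig_sum = np.isclose(np.trace(A), np.linalg.eigvals(A).sum())

# tr(AB) equals tr(BA), even though AB and BA differ in general.
trace_commutes = np.isclose(np.trace(A @ B), np.trace(B @ A))

print(trace_equals_eig_sum, trace_commutes)
```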
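Differential properties
The trace is a scalar, so it is differentiated like any other scalar. For an expression built from matrix operations, the trace and the derivative can be interchanged entry by entry, $\displaystyle\frac{\mathrm{d}}{\mathrm{d}w_{ij}}\mathrm{tr}(\bm{A})=\mathrm{tr}\Big(\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{ij}}\Big)$. Collecting these entries in the layout used above gives the compact form below, in which the trace on the right-hand side is taken over each $n\times n$ block of $\displaystyle\frac{\mathrm{d}\bm{A}}{\mathrm{d}\bm{W}}$ separately:
$$
\begin{align*}
&\frac{\mathrm{d}}{\mathrm{d}\bm{W}}\mathrm{tr}(\bm{A})=\mathrm{tr}\Big(\frac{\mathrm{d}\bm{A}}{\mathrm{d}\bm{W}}\Big)
\end{align*}
$$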
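As a small worked example, the entrywise rule $\frac{\mathrm{d}}{\mathrm{d}w_{ij}}\mathrm{tr}(\bm{A})=\mathrm{tr}\big(\frac{\mathrm{d}\bm{A}}{\mathrm{d}w_{ij}}\big)$ can be checked by hand. This is a sketch under assumptions that do not come from the text: a real $\bm{W}\in\mathbb{R}^{p\times q}$ and the hypothetical choice $\bm{A}=\bm{W}^{\mathrm{T}}\bm{W}$. Here $\bm{E}_{ij}$ denotes the $p\times q$ matrix with a single 1 in position $(i,j)$.

$$
\begin{align*}
\frac{\mathrm{d}}{\mathrm{d}w_{ij}}\mathrm{tr}\big(\bm{W}^{\mathrm{T}}\bm{W}\big)&=\frac{\mathrm{d}}{\mathrm{d}w_{ij}}\sum_{k=1}^{p}\sum_{r=1}^{q}w_{kr}^2=2w_{ij}\\
\mathrm{tr}\Big(\frac{\mathrm{d}\,\bm{W}^{\mathrm{T}}\bm{W}}{\mathrm{d}w_{ij}}\Big)&=\mathrm{tr}\big(\bm{E}_{ij}^{\mathrm{T}}\bm{W}+\bm{W}^{\mathrm{T}}\bm{E}_{ij}\big)=w_{ij}+w_{ij}=2w_{ij}
\end{align*}
$$

Both routes agree. Placing the entry for $w_{ij}$ in row $j$ and column $i$, as in the matrix-by-matrix layout, collects the results into $2\bm{W}^{\mathrm{T}}$.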
Matrix Norms
For any $\bm{A}\in\mathbb{C}^{m\times n}$, define the following norms:
Native matrix norms
Sum norm: $\displaystyle\|\bm{A}\|_M=\sum^n_{j=1}\sum^m_{i=1} |a_{ij}|$
Frobenius norm (F-norm): $\displaystyle\|\bm{A}\|_F=\sqrt{\sum^n_{j=1}\sum^m_{i=1} |a_{ij}|^2}$
G-norm: $\displaystyle\|\bm{A}\|_G=n\cdot \max_{i,j}|a_{ij}|$ (written for the square case $m=n$; for a general $m\times n$ matrix the factor is usually taken to be $\sqrt{mn}$)
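Matrix norms induced by vector norms
Let $s_i$ denote the singular values of $\bm{A}$. The largest singular value is written $\displaystyle\rho({\bm{A}})=\max\{s_i\}_{i=1}^{n}$ here. Note that the term spectral radius properly refers to the largest eigenvalue modulus $\max_i|\lambda_i|$, which coincides with the largest singular value for normal matrices but not in general.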
Row-sum norm: $\displaystyle\|\bm{A}\|_{\infty}=\max_i\sum^n_{j=1} |a_{ij}|$
Column-sum norm: $\displaystyle\|\bm{A}\|_1=\max_j\sum^m_{i=1} |a_{ij}|$
Spectral norm: $\displaystyle\|\bm{A}\|_2=\max\{s_i\}_{i=1}^{n}$
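To make the definitions concrete, the sketch below evaluates several of these norms for a small real matrix straight from their formulas. It assumes NumPy; the example matrix and variable names are illustrative only, and the Frobenius norm is taken with the square root so that its square matches $\mathrm{tr}(\bm{A}^{\mathrm{H}}\bm{A})$.

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [-4.0, 5.0, -6.0]])   # m = 2 rows, n = 3 columns
abs_A = np.abs(A)

sum_norm = abs_A.sum()                        # sum of all |a_ij|
frobenius = np.sqrt((abs_A ** 2).sum())       # square root of the sum of |a_ij|^2
row_sum_norm = abs_A.sum(axis=1).max()        # largest row sum, ||A||_inf
col_sum_norm = abs_A.sum(axis=0).max()        # largest column sum, ||A||_1
spectral = np.linalg.svd(A, compute_uv=False).max()  # largest singular value, ||A||_2

print(sum_norm, frobenius, row_sum_norm, col_sum_norm, spectral)
```

For comparison, `np.linalg.norm(A, ord)` with `ord` set to `'fro'`, `np.inf`, `1`, or `2` returns the Frobenius, row-sum, column-sum, and spectral norms respectively.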
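Derivatives of Matrix Norms
Most matrix norms cannot be differentiated directly. A common one that can is the Frobenius norm, whose square is conveniently computed from: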
$$
\|\bm{A}\|_F^2=\mathrm{tr}(\bm{A}^{\mathrm{H}}\bm{A})
$$
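For simplicity the norm is squared here; error terms of this form appear frequently in the regularization commonly used in machine learning. Differentiating the squared norm gives: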
$$
\begin{align*}
\frac{\mathrm{d}\|\bm{A}\|_F^2}{\mathrm{d}\bm{W}}&=\frac{\mathrm{d}\big[\mathrm{tr}(\bm{A}^{\mathrm{H}}\bm{A})\big]}{\mathrm{d}\bm{W}}=\mathrm{tr}\Big(\frac{\mathrm{d}\bm{A}^{\mathrm{H}}\bm{A}}{\mathrm{d}\bm{W}}\Big)\\
&=\mathrm{tr}\Big[\frac{\mathrm{d}\bm{A}^\mathrm{H}}{\mathrm{d}\bm{W}}\big(\bm{I}_p\otimes\bm{A}\big)+\big(\bm{I}_q\otimes\bm{A}^\mathrm{H}\big)\frac{\mathrm{d}\bm{A}}{\mathrm{d}\bm{W}}\Big]
\end{align*}
$$
In particular, when $\bm{W}\in\mathbb{C}^{m\times m}$, i.e. the differentiation variable is also a square matrix, the cyclic property of the trace merges the two terms in one step:
$$
\begin{align*}
\frac{\mathrm{d}\|\bm{A}\|_F^2}{\mathrm{d}\bm{W}}&=\mathrm{tr}\Big[\frac{\mathrm{d}\bm{A}^\mathrm{H}}{\mathrm{d}\bm{W}}\big(\bm{I}_m\otimes\bm{A}\big)+\big(\bm{I}_m\otimes\bm{A}^\mathrm{H}\big)\frac{\mathrm{d}\bm{A}}{\mathrm{d}\bm{W}}\Big]\\
&=2\cdot\mathrm{tr}\Big[\frac{\mathrm{d}\bm{A}^\mathrm{H}}{\mathrm{d}\bm{W}}\big(\bm{I}_m\otimes\bm{A}\big)\Big]
\end{align*}
$$
The merge also uses $\mathrm{tr}(\bm{X})=\mathrm{tr}(\bm{X}^{\mathrm{T}})$, so it is exact for real $\bm{A}$; for complex $\bm{A}$ depending on real parameters, the two terms are complex conjugates of each other and their sum is twice the real part of either.
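The Kronecker-product expression can be sanity-checked numerically. The sketch below rests on several assumptions that are not in the text: real matrices, a hypothetical linear test function $\bm{A}(\bm{W})=\bm{PWQ}$, NumPy, and the reading that the trace is taken over each $n\times n$ block of the bracketed matrix, so that the entry for $w_{ij}$ lands in row $j$, column $i$. The helper names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p, q = 3, 2, 2, 4          # A is m x n, W is p x q

# Hypothetical linear test function A(W) = P W Q with real entries.
P, Q = rng.standard_normal((m, p)), rng.standard_normal((q, n))


def A_of(W):
    return P @ W @ Q


def unit(i, j):
    """p x q matrix with a single 1 at position (i, j)."""
    E = np.zeros((p, q))
    E[i, j] = 1.0
    return E


def block_derivative(entry_derivative):
    """Block row j, block column i holds the derivative with respect to w_ij."""
    return np.block([[entry_derivative(i, j) for i in range(p)] for j in range(q)])


W = rng.standard_normal((p, q))
A = A_of(W)

dA_dW = block_derivative(lambda i, j: P @ unit(i, j) @ Q)        # blocks are m x n
dAT_dW = block_derivative(lambda i, j: (P @ unit(i, j) @ Q).T)   # blocks are n x m

# Product rule applied to A^T A, followed by the trace of every n x n block.
M = dAT_dW @ np.kron(np.eye(p), A) + np.kron(np.eye(q), A.T) @ dA_dW
grad_formula = np.einsum("jaia->ji", M.reshape(q, n, p, n))      # shape q x p

# Central differences of ||A(W)||_F^2 (quadratic in W, so exact up to rounding).
h = 1e-4
grad_numeric = np.array([[(np.sum(A_of(W + h * unit(i, j)) ** 2)
                           - np.sum(A_of(W - h * unit(i, j)) ** 2)) / (2 * h)
                          for i in range(p)] for j in range(q)])

frobenius_rule_holds = np.allclose(grad_formula, grad_numeric, atol=1e-6)
print(grad_formula.shape, frobenius_rule_holds)
```

For this particular test function the result also has the closed form $2(\bm{P}^{\mathrm{T}}\bm{A}\bm{Q}^{\mathrm{T}})^{\mathrm{T}}$, which gives a second, independent check.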
Derivative of the Vector Norm
Reducing the matrices to column vectors gives the derivative of the vector norm, which is the more common case. For column vectors $\bm{Y}\in\mathbb{C}^{m\times1}$ and $\bm{X}\in\mathbb{C}^{n\times1}$:
$$
\begin{align*}
\frac{\mathrm{d}\|\bm{Y}\|_2^2}{\mathrm{d}\bm{X}}&=\frac{\mathrm{d}\bm{Y}^\mathrm{H}}{\mathrm{d}\bm{X}}\big(\bm{I}_n\otimes\bm{Y}\big)+\big(\bm{I}_1\otimes\bm{Y}^\mathrm{H}\big)\frac{\mathrm{d}\bm{Y}}{\mathrm{d}\bm{X}}\\
&=2\bm{Y}^\mathrm{H}\frac{\mathrm{d}\bm{Y}}{\mathrm{d}\bm{X}}
\end{align*}
$$
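The vector norm is defined differently from the matrix norm: $\displaystyle\|\bm{Y}\|_2^2=\bm{Y}^\mathrm{H}\bm{Y}$ contains no trace, so the result looks slightly different. Setting $\bm{Y}=\bm{X}$ then gives a particularly elegant formula from convex optimization: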
$$
\frac{\mathrm{d}\|\bm{X}\|_2^2}{\mathrm{d}\bm{X}}=2\bm{X}^\mathrm{H}
$$
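Both vector formulas can be verified with finite differences. The sketch below assumes real vectors, NumPy, and a hypothetical linear map $\bm{Y}=\bm{MX}$, for which $\frac{\mathrm{d}\bm{Y}}{\mathrm{d}\bm{X}}=\bm{M}$; the matrix `M` and the helper `numeric_row_gradient` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5
M = rng.standard_normal((m, n))   # hypothetical real matrix defining Y = M X
x = rng.standard_normal(n)


def squared_norm(v):
    return float(v @ v)


def numeric_row_gradient(f, x, h=1e-4):
    """Central-difference derivative of a scalar function, as a length-n row."""
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)])


# Y = M X: dY/dX = M, so the formula gives 2 Y^T M.
grad_linear = 2 * (M @ x) @ M
linear_case_holds = np.allclose(
    grad_linear, numeric_row_gradient(lambda v: squared_norm(M @ v), x), atol=1e-6)

# Y = X: dY/dX = I, so the formula reduces to 2 X^T.
identity_case_holds = np.allclose(
    2 * x, numeric_row_gradient(squared_norm, x), atol=1e-6)

print(linear_case_holds, identity_case_holds)
```

The linear case is the familiar least-squares gradient $2\bm{X}^{\mathrm{T}}\bm{M}^{\mathrm{T}}\bm{M}$, written as a row vector to match the layout used here.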