Techniques for matrix differentiation find wide application in statistics, control theory, machine learning, and other fields. This article is a primer in two parts: the first covers derivatives of scalars with respect to matrices, and the second covers derivatives of matrices with respect to matrices. Throughout, a lowercase letter $x$ denotes a scalar, a bold lowercase letter $\boldsymbol{x}$ denotes a (column) vector, and an uppercase letter $X$ denotes a matrix.
Let us first examine the definition. The derivative of a scalar $f$ with respect to a matrix $X$ is defined as $\frac{\partial f}{\partial X}=\left[\frac{\partial f}{\partial X_{ij}}\right]$, i.e., $f$ is differentiated with respect to each element of $X$ and the results are arranged into a matrix of the same size as $X$. However, this definition is awkward to compute with. The practical reason is that element-by-element differentiation is hard to carry out for complicated functions; the philosophical reason is that it destroys the holistic view. After all, why regard $f$ as a function of the matrix $X$ rather than of the individual elements $X_{ij}$? Because matrix operations are tidier. So when differentiating, we should not break the matrix apart; we need a method that works with the matrix as a whole.
To this end, recall that in single-variable calculus the derivative (of a scalar with respect to a scalar) is linked to the differential: $df=f'(x)dx$. In multivariable calculus, the gradient (derivative of a scalar with respect to a vector) is likewise linked to the differential: $df=\sum_{i=1}^n \frac{\partial f}{\partial x_i}dx_i=\frac{\partial f^T}{\partial \boldsymbol x}d\boldsymbol x$. Here the first equality is the total-differential formula, and the second expresses the link between gradient and differential: the total differential $df$ is the inner product of the gradient vector $\frac{\partial f}{\partial \boldsymbol x}$ ($n\times 1$) and the differential vector $d\boldsymbol x$ ($n\times 1$). Inspired by this, we connect the matrix derivative with the differential: $df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial X_{ij}}dX_{ij}=\mathrm{tr}\left(\frac{\partial f^T}{\partial X}dX\right)$. Here $\mathrm{tr}$ denotes the trace, the sum of a square matrix's diagonal elements, which satisfies: for matrices $A,B$ of the same size, $\mathrm{tr}(A^TB)=\sum_{i,j}A_{ij}B_{ij}$, i.e., $\mathrm{tr}(A^TB)$ is the inner product of $A$ and $B$. As with the gradient, the first equality is the total-differential formula and the second expresses the link between matrix derivative and differential: the total differential $df$ is the inner product of the derivative $\frac{\partial f}{\partial X}$ ($m\times n$) and the differential matrix $dX$ ($m\times n$).
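The two facts above, the trace-as-inner-product identity and the link $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial X}dX\right)$, can be sanity-checked numerically. The following is a minimal sketch (my addition, using numpy; the test function $f(X)=\sum_{ij}X_{ij}^2$ with gradient $2X$ is a choice made here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# tr(A^T B) equals the elementwise inner product sum_ij A_ij B_ij
ok_inner = np.isclose(np.trace(A.T @ B), np.sum(A * B))

# df = tr((∂f/∂X)^T dX) to first order; take f(X) = Σ X_ij^2, whose gradient is 2X
X = rng.standard_normal((3, 4))
dX = 1e-6 * rng.standard_normal((3, 4))
f = lambda M: np.sum(M ** 2)
df_actual = f(X + dX) - f(X)            # exact change in f
df_predicted = np.trace((2 * X).T @ dX)  # inner product of gradient with dX
ok_df = np.isclose(df_actual, df_predicted)
```

Both checks agree up to the second-order term $\sum_{ij}dX_{ij}^2$, which is negligible for small $dX$.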
Next we establish rules of computation. When we meet a relatively complicated one-variable function such as $f=\log(2+\sin x)e^{\sqrt{x}}$, how do we differentiate it? Usually not by taking limits from the definition, but by first establishing differentiation rules for elementary functions together with rules for arithmetic operations and composition, and then applying those rules. In the same spirit, we now set up the commonly used rules of matrix differentials:
1. Addition/subtraction: $d(X\pm Y)=dX\pm dY$; matrix multiplication: $d(XY)=(dX)Y+X(dY)$; transpose: $d(X^T)=(dX)^T$; trace: $d\,\mathrm{tr}(X)=\mathrm{tr}(dX)$.
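The product rule $d(XY)=(dX)Y+X(dY)$ is easy to verify with a small numerical perturbation. A minimal check (my addition, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 5))
dX = 1e-6 * rng.standard_normal((3, 4))
dY = 1e-6 * rng.standard_normal((4, 5))

# d(XY) = (dX)Y + X(dY); the discrepancy is the second-order term dX @ dY
actual = (X + dX) @ (Y + dY) - X @ Y
predicted = dX @ Y + X @ dY
ok_product = np.allclose(actual, predicted, atol=1e-9)
```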
2. Inverse: $dX^{-1}=-X^{-1}(dX)X^{-1}$. This can be proved by taking the differential of both sides of $XX^{-1}=I$.
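The inverse rule can likewise be confirmed numerically; the discrepancy below is second order in the perturbation. A minimal check (my addition, assuming numpy; the diagonal shift keeps $X$ well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned test matrix
dX = 1e-6 * rng.standard_normal((4, 4))

# d(X^{-1}) = -X^{-1} dX X^{-1}, to first order
actual = np.linalg.inv(X + dX) - np.linalg.inv(X)
predicted = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)
ok_inv = np.allclose(actual, predicted, atol=1e-9)
```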
3. Determinant: $d|X|=\mathrm{tr}(X^{\#}dX)$, where $X^{\#}$ is the adjugate matrix of $X$. When $X$ is invertible this can also be written $d|X|=|X|\,\mathrm{tr}(X^{-1}dX)$. It can be proved via the Laplace expansion; see p. 279 of Zhang Xianda's 《矩阵分析与应用》 (Matrix Analysis and Applications).
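A numerical illustration of the invertible case $d|X|=|X|\,\mathrm{tr}(X^{-1}dX)$ (my addition, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # invertible test matrix
dX = 1e-6 * rng.standard_normal((4, 4))

# d|X| = |X| tr(X^{-1} dX), to first order in dX
actual = np.linalg.det(X + dX) - np.linalg.det(X)
predicted = np.linalg.det(X) * np.trace(np.linalg.inv(X) @ dX)
ok_det = np.isclose(actual, predicted, atol=1e-8)
```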
4. Elementwise multiplication: $d(X\odot Y)=dX\odot Y+X\odot dY$, where $\odot$ denotes elementwise (Hadamard) multiplication of same-sized matrices $X,Y$.
5. Elementwise function: $d\sigma(X)=\sigma'(X)\odot dX$, where $\sigma(X)=\left[\sigma(X_{ij})\right]$ applies a scalar function elementwise and $\sigma'(X)=\left[\sigma'(X_{ij})\right]$ is the elementwise derivative. For example: $X=\begin{bmatrix} x_{11} & x_{12}\\ x_{21} & x_{22} \end{bmatrix},\ d\sin(X)=\begin{bmatrix} \cos x_{11}\,dx_{11} & \cos x_{12}\,dx_{12}\\ \cos x_{21}\,dx_{21} & \cos x_{22}\,dx_{22} \end{bmatrix}=\cos(X)\odot dX$.
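The $\sin$ example above can be checked directly, since numpy applies $\sin$ and $\cos$ elementwise (my addition):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((2, 2))
dX = 1e-6 * rng.standard_normal((2, 2))

# d sin(X) = cos(X) ⊙ dX, elementwise, to first order
actual = np.sin(X + dX) - np.sin(X)
predicted = np.cos(X) * dX
ok_sin = np.allclose(actual, predicted, atol=1e-10)
```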
We now want to exploit the link between matrix derivative and differential, $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial X}dX\right)$: once the differential $df$ on the left has been computed, how do we massage it into the form on the right and read off the derivative? This requires some trace tricks:
1. A scalar equals its own trace: $a=\mathrm{tr}(a)$.
2. Transpose: $\mathrm{tr}(A^T)=\mathrm{tr}(A)$.
3. Linearity: $\mathrm{tr}(A\pm B)=\mathrm{tr}(A)\pm\mathrm{tr}(B)$.
4. Commuting a matrix product: $\mathrm{tr}(AB)=\mathrm{tr}(BA)$, where $A$ and $B^T$ have the same size. Both sides equal $\sum_{i,j}A_{ij}B_{ji}$.
5. Exchanging matrix product and elementwise product: $\mathrm{tr}(A^T(B\odot C))=\mathrm{tr}((A\odot B)^TC)$, where $A,B,C$ all have the same size. Both sides equal $\sum_{i,j}A_{ij}B_{ij}C_{ij}$.
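Tricks 4 and 5 are the two that do real work in the examples below, so here is a quick numerical confirmation of both (my addition, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(5)

# trick 4: tr(AB) = tr(BA) when A and B^T have the same size
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
ok_cyclic = np.isclose(np.trace(A @ B), np.trace(B @ A))

# trick 5: tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C) for same-sized A, B, C
C = rng.standard_normal((3, 4))
D = rng.standard_normal((3, 4))
E = rng.standard_normal((3, 4))
ok_hadamard = np.isclose(np.trace(C.T @ (D * E)), np.trace((C * D).T @ E))
```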
From these observations we can assert: if the scalar function $f$ is built from the matrix $X$ by addition, subtraction, multiplication, inversion, determinant, elementwise functions, and so on, then differentiating $f$ with the corresponding rules, wrapping $df$ in a trace via the trace tricks, and commuting the other factors to the left of $dX$ yields the derivative.
To conclude the toolkit, let us discuss composition. Suppose we have already obtained $\frac{\partial f}{\partial Y}$, and $Y$ is a function of $X$; how do we find $\frac{\partial f}{\partial X}$? In calculus we have the scalar chain rule $\frac{\partial f}{\partial x}=\frac{\partial f}{\partial y}\frac{\partial y}{\partial x}$, but we cannot simply carry it over here, because the derivative of a matrix with respect to a matrix, $\frac{\partial Y}{\partial X}$, is so far undefined. So we trace the chain rule back to its source, which is again the differential, and build the composition rule directly from differentials: first write $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial Y}dY\right)$, then express $dY$ in terms of $dX$ and substitute, and use the trace tricks to commute the other factors to the left of $dX$; this yields $\frac{\partial f}{\partial X}$.
Now for some worked examples. A word of caution: compute strictly by the rules established above, and do not blindly transplant results about scalar derivatives from ordinary calculus; claiming, for instance, that the derivative of $AX$ with respect to $X$ is $A$ is unfounded and meaningless.
Example 1: $f=\boldsymbol a^TX\boldsymbol b$; find $\frac{\partial f}{\partial X}$. Here $\boldsymbol a$ is an $m\times 1$ column vector, $X$ is an $m\times n$ matrix, $\boldsymbol b$ is an $n\times 1$ column vector, and $f$ is a scalar.
Solution: first take the differential using the matrix-multiplication rule; here $\boldsymbol a,\boldsymbol b$ are constants, so $d\boldsymbol a=\boldsymbol 0,d\boldsymbol b=\boldsymbol 0$, giving $df=\boldsymbol a^T(dX)\boldsymbol b$. Then wrap it in a trace and commute the product: $df=\mathrm{tr}(\boldsymbol a^TdX\boldsymbol b)=\mathrm{tr}(\boldsymbol b\boldsymbol a^TdX)$, where we used $\mathrm{tr}(AB)=\mathrm{tr}(BA)$ to swap $\boldsymbol a^TdX$ and $\boldsymbol b$. Comparing with the link $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial X}dX\right)$, we obtain $\frac{\partial f}{\partial X}=(\boldsymbol b\boldsymbol a^T)^T=\boldsymbol a\boldsymbol b^T$.
Note: one must not write $\frac{\partial f}{\partial X}=\boldsymbol a^T\frac{\partial X}{\partial X}\boldsymbol b=?$; commuting a derivative past constant-matrix factors is not a legal operation (whereas doing so with differentials is). Some references skip the differential step when computing matrix derivatives, which is logically indefensible.
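The result $\frac{\partial f}{\partial X}=\boldsymbol a\boldsymbol b^T$ can be verified entry by entry with finite differences. A minimal sketch (my addition, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 4
a = rng.standard_normal(m)
b = rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda M: a @ M @ b
grad = np.outer(a, b)              # claimed derivative a b^T

# finite-difference derivative with respect to each element X_ij
eps = 1e-6
fd = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        fd[i, j] = (f(Xp) - f(X)) / eps
ok_ex1 = np.allclose(fd, grad, atol=1e-5)
```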
Example 2: $f=\boldsymbol a^T\exp(X\boldsymbol b)$; find $\frac{\partial f}{\partial X}$. Here $\boldsymbol a$ is an $m\times 1$ column vector, $X$ is an $m\times n$ matrix, $\boldsymbol b$ is an $n\times 1$ column vector, $\exp$ denotes the elementwise exponential, and $f$ is a scalar.
Solution: first take the differential using the matrix-multiplication and elementwise-function rules: $df=\boldsymbol a^T\left(\exp(X\boldsymbol b)\odot(dX\boldsymbol b)\right)$. Then wrap in a trace and commute: $df=\mathrm{tr}\left(\boldsymbol a^T\left(\exp(X\boldsymbol b)\odot(dX\boldsymbol b)\right)\right)=\mathrm{tr}\left((\boldsymbol a\odot\exp(X\boldsymbol b))^TdX\boldsymbol b\right)=\mathrm{tr}\left(\boldsymbol b(\boldsymbol a\odot\exp(X\boldsymbol b))^TdX\right)$, where we first used $\mathrm{tr}(A^T(B\odot C))=\mathrm{tr}((A\odot B)^TC)$ to exchange $\boldsymbol a$, $\exp(X\boldsymbol b)$ and $dX\boldsymbol b$, and then $\mathrm{tr}(AB)=\mathrm{tr}(BA)$ to swap $(\boldsymbol a\odot\exp(X\boldsymbol b))^TdX$ and $\boldsymbol b$. Comparing with $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial X}dX\right)$, we obtain $\frac{\partial f}{\partial X}=\left(\boldsymbol b(\boldsymbol a\odot\exp(X\boldsymbol b))^T\right)^T=(\boldsymbol a\odot\exp(X\boldsymbol b))\boldsymbol b^T$.
Example 3: $f=\mathrm{tr}(Y^TMY)$ with $Y=\sigma(WX)$; find $\frac{\partial f}{\partial X}$. Here $W$ is an $l\times m$ matrix, $X$ is an $m\times n$ matrix, $Y$ is an $l\times n$ matrix, $M$ is an $l\times l$ matrix, $\sigma$ is an elementwise function, and $f$ is a scalar.
Solution: first find $\frac{\partial f}{\partial Y}$. Take the differential using the matrix-multiplication and transpose rules: $df=\mathrm{tr}((dY)^TMY)+\mathrm{tr}(Y^TMdY)=\mathrm{tr}(Y^TM^TdY)+\mathrm{tr}(Y^TMdY)=\mathrm{tr}(Y^T(M+M^T)dY)$. Comparing with the derivative-differential link gives $\frac{\partial f}{\partial Y}=(M+M^T)Y=2MY$, where the last step assumes $M$ is a symmetric matrix. To find $\frac{\partial f}{\partial X}$, write $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial Y}dY\right)$, express $dY$ in terms of $dX$ and substitute, and use the trick for exchanging matrix product with elementwise product: $df=\mathrm{tr}\left(\frac{\partial f^T}{\partial Y}(\sigma'(WX)\odot(WdX))\right)=\mathrm{tr}\left(\left(\frac{\partial f}{\partial Y}\odot\sigma'(WX)\right)^TWdX\right)$. Comparing with the derivative-differential link, we obtain $\frac{\partial f}{\partial X}=W^T\left(\frac{\partial f}{\partial Y}\odot\sigma'(WX)\right)=W^T\left((2M\sigma(WX))\odot\sigma'(WX)\right)$.
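To make the result concrete, here is a finite-difference check with a specific choice of elementwise function, $\sigma=\tanh$ (my addition; $\tanh$ and the random shapes are assumptions for illustration, and $M$ is symmetrized as the derivation requires):

```python
import numpy as np

rng = np.random.default_rng(11)
l_, m, n = 3, 4, 5
W = rng.standard_normal((l_, m))
X = rng.standard_normal((m, n))
M0 = rng.standard_normal((l_, l_))
M = M0 + M0.T                      # symmetric, as assumed in the example

def f(X):
    Y = np.tanh(W @ X)             # σ = tanh, so σ'(a) = 1 - tanh(a)^2
    return np.trace(Y.T @ M @ Y)

# claimed derivative: W^T ((2 M σ(WX)) ⊙ σ'(WX))
Y = np.tanh(W @ X)
grad = W.T @ ((2 * M @ Y) * (1 - Y ** 2))

eps = 1e-6
fd = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        fd[i, j] = (f(Xp) - f(X)) / eps
ok_ex3 = np.allclose(fd, grad, atol=1e-4)
```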
Example 4 (linear regression): $l=\left\|X\boldsymbol\omega-\boldsymbol y\right\|^2$; find the least-squares estimate of $\boldsymbol\omega$, i.e., the zero of $\frac{\partial l}{\partial\boldsymbol\omega}$. Here $\boldsymbol y$ is an $m\times 1$ column vector, $X$ is an $m\times n$ matrix, $\boldsymbol\omega$ is an $n\times 1$ column vector, and $l$ is a scalar.
Solution: strictly speaking this is a derivative of a scalar with respect to a vector, but a vector can be treated as a special case of a matrix. First rewrite the squared norm as the inner product of the vector with itself: $l=(X\boldsymbol\omega-\boldsymbol y)^T(X\boldsymbol\omega-\boldsymbol y)$. Take the differential using the matrix-multiplication and transpose rules: $dl=(Xd\boldsymbol\omega)^T(X\boldsymbol\omega-\boldsymbol y)+(X\boldsymbol\omega-\boldsymbol y)^T(Xd\boldsymbol\omega)=2(X\boldsymbol\omega-\boldsymbol y)^TXd\boldsymbol\omega$. Comparing with $dl=\frac{\partial l^T}{\partial\boldsymbol\omega}d\boldsymbol\omega$, we get $\frac{\partial l}{\partial\boldsymbol\omega}=\left(2(X\boldsymbol\omega-\boldsymbol y)^TX\right)^T=2X^T(X\boldsymbol\omega-\boldsymbol y)$. Its zero, the least-squares estimate, is $\boldsymbol\omega=(X^TX)^{-1}X^T\boldsymbol y$.
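The gradient formula and the closed-form solution can be checked against numpy's own least-squares solver (my addition, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 8, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# w = (X^T X)^{-1} X^T y makes the gradient 2 X^T (X w - y) vanish
w = np.linalg.solve(X.T @ X, X.T @ y)
grad = 2 * X.T @ (X @ w - y)
ok_zero = np.allclose(grad, 0, atol=1e-8)

# agrees with numpy's built-in least-squares solver
w_np = np.linalg.lstsq(X, y, rcond=None)[0]
ok_match = np.allclose(w, w_np)
```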
Example 5 (maximum-likelihood estimate of the covariance): given samples $\boldsymbol x_1,\dots,\boldsymbol x_N\sim\mathcal N(\boldsymbol\mu,\Sigma)$, find the maximum-likelihood estimate of the covariance $\Sigma$. In formulas: $l=\log\left|\Sigma\right|+\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})$; find the zero of $\frac{\partial l}{\partial\Sigma}$. Here each $\boldsymbol x_i$ is an $m\times 1$ vector, $\Sigma$ is an $m\times m$ symmetric positive-definite matrix, $\bar{\boldsymbol x}=\frac{1}{N}\sum_{i=1}^N\boldsymbol x_i$ is the sample mean, $l$ is a scalar, and $\log$ denotes the natural logarithm.
Solution: first take the differential using the matrix-multiplication, determinant, and inverse rules. The first term gives $d\log\left|\Sigma\right|=\left|\Sigma\right|^{-1}d\left|\Sigma\right|=\mathrm{tr}(\Sigma^{-1}d\Sigma)$, and the second term gives $\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^Td\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})=-\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})$. Wrap the second term in a trace and commute: $\mathrm{tr}\left(\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})\right)=\frac{1}{N}\sum_{i=1}^N\mathrm{tr}\left((\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})\right)=\frac{1}{N}\sum_{i=1}^N\mathrm{tr}\left(\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}d\Sigma\right)=\mathrm{tr}\left(\Sigma^{-1}S\Sigma^{-1}d\Sigma\right)$, where we first exchange trace and sum, then commute $\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})$ to the left, and finally exchange trace and sum again, defining the sample covariance matrix $S=\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})(\boldsymbol x_i-\bar{\boldsymbol x})^T$. Altogether, $dl=\mathrm{tr}\left((\Sigma^{-1}-\Sigma^{-1}S\Sigma^{-1})d\Sigma\right)$. Comparing with the derivative-differential link, $\frac{\partial l}{\partial\Sigma}=(\Sigma^{-1}-\Sigma^{-1}S\Sigma^{-1})^T$, whose zero, the maximum-likelihood estimate, is $\Sigma=S$.
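As a sanity check, the gradient $\Sigma^{-1}-\Sigma^{-1}S\Sigma^{-1}$ indeed vanishes when evaluated at $\Sigma=S$. A minimal sketch (my addition, assuming numpy; the standard-normal samples are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(8)
N, m = 200, 3
xs = rng.standard_normal((N, m))
xbar = xs.mean(axis=0)
S = (xs - xbar).T @ (xs - xbar) / N   # sample covariance matrix

# gradient Σ^{-1} - Σ^{-1} S Σ^{-1} evaluated at Σ = S should be zero
Sinv = np.linalg.inv(S)
grad_at_S = Sinv - Sinv @ S @ Sinv
ok_mle = np.allclose(grad_at_S, 0, atol=1e-10)
```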
Example 6 (multiclass logistic regression): $l=-\boldsymbol y^T\log\mathrm{softmax}(W\boldsymbol x)$; find $\frac{\partial l}{\partial W}$. Here $\boldsymbol y$ is an $m\times 1$ one-hot column vector (one element is 1 and the rest are 0), $W$ is an $m\times n$ matrix, $\boldsymbol x$ is an $n\times 1$ column vector, and $l$ is a scalar; $\log$ is the natural logarithm, $\mathrm{softmax}(\boldsymbol a)=\frac{\exp(\boldsymbol a)}{\boldsymbol 1^T\exp(\boldsymbol a)}$, where $\exp(\boldsymbol a)$ is the elementwise exponential and $\boldsymbol 1$ is the all-ones vector.
Solution 1: first substitute in the softmax function and write $l=-\boldsymbol y^T\left(\log(\exp(W\boldsymbol x))-\boldsymbol 1\log(\boldsymbol 1^T\exp(W\boldsymbol x))\right)=-\boldsymbol y^TW\boldsymbol x+\log(\boldsymbol 1^T\exp(W\boldsymbol x))$. Note that the elementwise $\log$ satisfies $\log(\boldsymbol u/c)=\log(\boldsymbol u)-\boldsymbol 1\log(c)$, and $\boldsymbol y$ satisfies $\boldsymbol y^T\boldsymbol 1=1$. Take the differential using the matrix-multiplication and elementwise-function rules: $dl=-\boldsymbol y^TdW\boldsymbol x+\frac{\boldsymbol 1^T\left(\exp(W\boldsymbol x)\odot(dW\boldsymbol x)\right)}{\boldsymbol 1^T\exp(W\boldsymbol x)}$. Then wrap in a trace and commute; note the simplification $\boldsymbol 1^T\left(\exp(W\boldsymbol x)\odot(dW\boldsymbol x)\right)=\exp(W\boldsymbol x)^TdW\boldsymbol x$, which follows from the identity $\boldsymbol 1^T(\boldsymbol u\odot\boldsymbol v)=\boldsymbol u^T\boldsymbol v$. Hence $dl=\mathrm{tr}\left(-\boldsymbol y^TdW\boldsymbol x+\frac{\exp(W\boldsymbol x)^TdW\boldsymbol x}{\boldsymbol 1^T\exp(W\boldsymbol x)}\right)=\mathrm{tr}\left(\boldsymbol x(\mathrm{softmax}(W\boldsymbol x)-\boldsymbol y)^TdW\right)$. Comparing with the derivative-differential link, $\frac{\partial l}{\partial W}=(\mathrm{softmax}(W\boldsymbol x)-\boldsymbol y)\boldsymbol x^T$.
Solution 2: define $\boldsymbol a=W\boldsymbol x$, so $l=-\boldsymbol y^T\log\mathrm{softmax}(\boldsymbol a)$. First obtain, as above, $\frac{\partial l}{\partial\boldsymbol a}=\mathrm{softmax}(\boldsymbol a)-\boldsymbol y$; then apply the composition rule: $dl=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a}d\boldsymbol a\right)=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a}dW\boldsymbol x\right)=\mathrm{tr}\left(\boldsymbol x\frac{\partial l^T}{\partial\boldsymbol a}dW\right)$, which gives $\frac{\partial l}{\partial W}=\frac{\partial l}{\partial\boldsymbol a}\boldsymbol x^T$.
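The gradient $(\mathrm{softmax}(W\boldsymbol x)-\boldsymbol y)\boldsymbol x^T$ can be confirmed entry by entry with finite differences (my addition, assuming numpy; subtracting the max inside softmax is a standard numerical-stability device, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 4, 5
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = np.zeros(m)
y[1] = 1.0                          # one-hot label

def softmax(a):
    e = np.exp(a - a.max())         # shift for numerical stability
    return e / e.sum()

def loss(W):
    return -y @ np.log(softmax(W @ x))

# claimed gradient: (softmax(Wx) - y) x^T
grad = np.outer(softmax(W @ x) - y, x)

eps = 1e-6
fd = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        Wp = W.copy()
        Wp[i, j] += eps
        fd[i, j] = (loss(Wp) - loss(W)) / eps
ok_lr = np.allclose(fd, grad, atol=1e-4)
```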
The final example is reserved for the classic neural network. Differentiating a neural network is an important achievement in the history of the field, with a name of its own: the backpropagation (BP) algorithm. I believe many people struggle when first deriving BP, yet with matrix differentiation the derivation is really not complicated. For simplicity, we derive BP for a two-layer network.
Example 7 (two-layer neural network): $l=-\boldsymbol y^T\log\mathrm{softmax}(W_2\sigma(W_1\boldsymbol x))$; find $\frac{\partial l}{\partial W_1}$ and $\frac{\partial l}{\partial W_2}$. Here $\boldsymbol y$ is an $m\times 1$ one-hot column vector, $W_2$ is an $m\times p$ matrix, $W_1$ is a $p\times n$ matrix, $\boldsymbol x$ is an $n\times 1$ column vector, and $l$ is a scalar; $\log$ is the natural logarithm, $\mathrm{softmax}(\boldsymbol a)=\frac{\exp(\boldsymbol a)}{\boldsymbol 1^T\exp(\boldsymbol a)}$ as above, and $\sigma$ is the elementwise sigmoid function.
Solution: define $\boldsymbol a_1=W_1\boldsymbol x$, $\boldsymbol h_1=\sigma(\boldsymbol a_1)$, $\boldsymbol a_2=W_2\boldsymbol h_1$, so that $l=-\boldsymbol y^T\log\mathrm{softmax}(\boldsymbol a_2)$. From the previous example we already have $\frac{\partial l}{\partial\boldsymbol a_2}=\mathrm{softmax}(\boldsymbol a_2)-\boldsymbol y$. By the composition rule, $dl=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a_2}d\boldsymbol a_2\right)=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a_2}dW_2\boldsymbol h_1\right)+\underbrace{\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a_2}W_2d\boldsymbol h_1\right)}_{dl_2}$. The matrix-product trace trick applied to the first term gives $\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial\boldsymbol a_2}\boldsymbol h_1^T$, and the second term gives $\frac{\partial l}{\partial\boldsymbol h_1}=W_2^T\frac{\partial l}{\partial\boldsymbol a_2}$. Next, apply the composition rule to the second term again to find $\frac{\partial l}{\partial\boldsymbol a_1}$, using the trick for exchanging matrix product with elementwise product: $dl_2=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol h_1}d\boldsymbol h_1\right)=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol h_1}(\sigma'(\boldsymbol a_1)\odot d\boldsymbol a_1)\right)=\mathrm{tr}\left(\left(\frac{\partial l}{\partial\boldsymbol h_1}\odot\sigma'(\boldsymbol a_1)\right)^Td\boldsymbol a_1\right)$, giving $\frac{\partial l}{\partial\boldsymbol a_1}=\frac{\partial l}{\partial\boldsymbol h_1}\odot\sigma'(\boldsymbol a_1)$. Finally, to find $\frac{\partial l}{\partial W_1}$, use the composition rule once more: $dl_2=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a_1}d\boldsymbol a_1\right)=\mathrm{tr}\left(\frac{\partial l^T}{\partial\boldsymbol a_1}dW_1\boldsymbol x\right)=\mathrm{tr}\left(\boldsymbol x\frac{\partial l^T}{\partial\boldsymbol a_1}dW_1\right)$, giving $\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial\boldsymbol a_1}\boldsymbol x^T$.
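The whole backward pass derived above fits in a few lines and can be validated against finite differences (my addition, assuming numpy; the identity $\sigma'(a)=\sigma(a)(1-\sigma(a))$ for the sigmoid is used):

```python
import numpy as np

rng = np.random.default_rng(10)
m, p, n = 3, 4, 5
W1 = rng.standard_normal((p, n))
W2 = rng.standard_normal((m, p))
x = rng.standard_normal(n)
y = np.zeros(m)
y[0] = 1.0                                   # one-hot label

sigmoid = lambda a: 1 / (1 + np.exp(-a))
softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()

def loss(W1, W2):
    return -y @ np.log(softmax(W2 @ sigmoid(W1 @ x)))

# backward pass following the derivation above
a1 = W1 @ x
h1 = sigmoid(a1)
a2 = W2 @ h1
dl_da2 = softmax(a2) - y                     # ∂l/∂a2
g_W2 = np.outer(dl_da2, h1)                  # ∂l/∂W2 = (∂l/∂a2) h1^T
dl_dh1 = W2.T @ dl_da2                       # ∂l/∂h1
dl_da1 = dl_dh1 * h1 * (1 - h1)              # ∂l/∂a1, with σ'(a) = σ(a)(1-σ(a))
g_W1 = np.outer(dl_da1, x)                   # ∂l/∂W1 = (∂l/∂a1) x^T

# finite-difference check of ∂l/∂W1
eps = 1e-6
fd = np.zeros_like(W1)
for i in range(p):
    for j in range(n):
        W1p = W1.copy()
        W1p[i, j] += eps
        fd[i, j] = (loss(W1p, W2) - loss(W1, W2)) / eps
ok_bp = np.allclose(fd, g_W1, atol=1e-4)
```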
Generalization: given samples $(\boldsymbol x_1,\boldsymbol y_1),\dots,(\boldsymbol x_N,\boldsymbol y_N)$ and $l=-\sum_{i=1}^N\boldsymbol y_i^T\log\mathrm{softmax}(W_2\sigma(W_1\boldsymbol x_i+\boldsymbol b_1)+\boldsymbol b_2)$, where $\boldsymbol b_1$ is a $p\times 1$ column vector, $\boldsymbol b_2$ is an $m\times 1$ column vector, and everything else is defined as above.
Solution 1: define $\boldsymbol a_{1,i}=W_1\boldsymbol x_i+\boldsymbol b_1$, $\boldsymbol h_{1,i}=\sigma(\boldsymbol a_{1,i})$, $\boldsymbol a_{2,i}=W_2\boldsymbol h_{1,i}+\boldsymbol b_2$, so that $l=-\sum_{i=1}^N\boldsymbol y_i^T\log\mathrm{softmax}(\boldsymbol a_{2,i})$. As before, $\frac{\partial l}{\partial\boldsymbol a_{2,i}}=\mathrm{softmax}(\boldsymbol a_{2,i})-\boldsymbol y_i$. By the composition rule, $dl=\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{2,i}}d\boldsymbol a_{2,i}\right)=\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{2,i}}dW_2\boldsymbol h_{1,i}\right)+\underbrace{\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{2,i}}W_2d\boldsymbol h_{1,i}\right)}_{dl_2}+\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{2,i}}d\boldsymbol b_2\right)$. The first term gives $\frac{\partial l}{\partial W_2}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}\boldsymbol h_{1,i}^T$, the second gives $\frac{\partial l}{\partial\boldsymbol h_{1,i}}=W_2^T\frac{\partial l}{\partial\boldsymbol a_{2,i}}$, and the third gives $\frac{\partial l}{\partial\boldsymbol b_2}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}$. Applying the composition rule to the second term again yields $\frac{\partial l}{\partial\boldsymbol a_{1,i}}=\frac{\partial l}{\partial\boldsymbol h_{1,i}}\odot\sigma'(\boldsymbol a_{1,i})$. To find $\frac{\partial l}{\partial W_1},\frac{\partial l}{\partial\boldsymbol b_1}$, use the composition rule once more: $dl_2=\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{1,i}}d\boldsymbol a_{1,i}\right)=\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{1,i}}dW_1\boldsymbol x_i\right)+\mathrm{tr}\left(\sum_{i=1}^N\frac{\partial l^T}{\partial\boldsymbol a_{1,i}}d\boldsymbol b_1\right)$, giving $\frac{\partial l}{\partial W_1}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}\boldsymbol x_i^T$ and $\frac{\partial l}{\partial\boldsymbol b_1}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}$.
Solution 2: the $N$ samples can be packed into matrices to simplify the form. Define $X=[\boldsymbol x_1,\dots,\boldsymbol x_N]$, $A_1=[\boldsymbol a_{1,1},\dots,\boldsymbol a_{1,N}]=W_1X+\boldsymbol b_1\boldsymbol 1^T$, $H_1=[\boldsymbol h_{1,1},\dots,\boldsymbol h_{1,N}]=\sigma(A_1)$, $A_2=[\boldsymbol a_{2,1},\dots,\boldsymbol a_{2,N}]=W_2H_1+\boldsymbol b_2\boldsymbol 1^T$; note that the all-ones vector is used here to broadcast the bias across columns. As before, $\frac{\partial l}{\partial A_2}=[\mathrm{softmax}(\boldsymbol a_{2,1})-\boldsymbol y_1,\dots,\mathrm{softmax}(\boldsymbol a_{2,N})-\boldsymbol y_N]$. By the composition rule, $dl=\mathrm{tr}\left(\frac{\partial l^T}{\partial A_2}dA_2\right)=\mathrm{tr}\left(\frac{\partial l^T}{\partial A_2}dW_2H_1\right)+\underbrace{\mathrm{tr}\left(\frac{\partial l^T}{\partial A_2}W_2dH_1\right)}_{dl_2}+\mathrm{tr}\left(\frac{\partial l^T}{\partial A_2}d\boldsymbol b_2\boldsymbol 1^T\right)$. The first term gives $\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial A_2}H_1^T$, the second gives $\frac{\partial l}{\partial H_1}=W_2^T\frac{\partial l}{\partial A_2}$, and the third gives $\frac{\partial l}{\partial\boldsymbol b_2}=\frac{\partial l}{\partial A_2}\boldsymbol 1$. Applying the composition rule to the second term again yields $\frac{\partial l}{\partial A_1}=\frac{\partial l}{\partial H_1}\odot\sigma'(A_1)$. To find $\frac{\partial l}{\partial W_1},\frac{\partial l}{\partial\boldsymbol b_1}$, use the composition rule once more: $dl_2=\mathrm{tr}\left(\frac{\partial l^T}{\partial A_1}dA_1\right)=\mathrm{tr}\left(\frac{\partial l^T}{\partial A_1}dW_1X\right)+\mathrm{tr}\left(\frac{\partial l^T}{\partial A_1}d\boldsymbol b_1\boldsymbol 1^T\right)$, giving $\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial A_1}X^T$ and $\frac{\partial l}{\partial\boldsymbol b_1}=\frac{\partial l}{\partial A_1}\boldsymbol 1$.