Matrix Derivative Techniques

Matrix derivative techniques are widely used in statistics, control theory, machine learning, and other fields. This article is an introductory exposition in two parts: the first part covers derivatives of scalars with respect to matrices, and the second covers derivatives of matrices with respect to matrices. Throughout, a lowercase letter $x$ denotes a scalar, a bold lowercase letter $\boldsymbol{x}$ denotes a (column) vector, and an uppercase letter $X$ denotes a matrix.

Let us first examine the definition. The derivative of a scalar $f$ with respect to a matrix $X$ is defined as $\frac{\partial f}{\partial X}=\left[\frac{\partial f}{\partial X_{ij}}\right]$, i.e. the element-wise derivatives of $f$ with respect to $X$, arranged into a matrix of the same size as $X$. However, this definition is inconvenient in practice: pragmatically, element-wise differentiation is hard to carry out when the function is complicated; conceptually, it destroys the whole-matrix view. After all, why do we regard $f$ as a function of the matrix $X$ rather than of its individual elements $X_{ij}$? Because matrix operations are tidier. So we should not break the matrix apart when differentiating; instead we want an algorithm that works with the matrix as a whole.

To this end, recall that in single-variable calculus the derivative (of a scalar with respect to a scalar) is linked to the differential: $df=f'(x)\,dx$. In multivariable calculus the gradient (the derivative of a scalar with respect to a vector) is likewise linked to the differential: $df=\sum_{i=1}^n \frac{\partial f}{\partial x_i}dx_i=\frac{\partial f}{\partial \boldsymbol x}^T d\boldsymbol x$. Here the first equality is the total-differential formula, and the second expresses the link between gradient and differential: the total differential $df$ is the inner product of the gradient vector $\frac{\partial f}{\partial \boldsymbol x}$ ($n\times 1$) and the differential vector $d\boldsymbol x$ ($n\times 1$). Inspired by this, we connect the matrix derivative to the differential: $df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial X_{ij}}\,dX_{ij}=\operatorname{tr}\!\left(\frac{\partial f}{\partial X}^T dX\right)$. Here $\operatorname{tr}$ denotes the trace, the sum of the diagonal elements of a square matrix, which satisfies: for matrices $A,B$ of the same size, $\operatorname{tr}(A^TB)=\sum_{i,j}A_{ij}B_{ij}$, i.e. $\operatorname{tr}(A^TB)$ is the inner product of $A$ and $B$. As with the gradient, the first equality is the total-differential formula and the second expresses the link between the matrix derivative and the differential: the total differential $df$ is the inner product of the derivative $\frac{\partial f}{\partial X}$ ($m\times n$) and the differential matrix $dX$ ($m\times n$).
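As a quick sanity check, here is a minimal NumPy sketch (my own illustration, not part of the original article; the choice $f(X)=\sum_{i,j}X_{ij}^2$ and all variable names are assumptions) verifying that $df\approx\operatorname{tr}\!\left(\frac{\partial f}{\partial X}^T dX\right)$ and that $\operatorname{tr}(A^TB)$ is the element-wise inner product:

```python
import numpy as np

# Check df = tr((df/dX)^T dX) for f(X) = sum(X**2), whose derivative is 2X,
# by comparing f(X + dX) - f(X) with tr(grad^T @ dX) for a tiny dX.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
dX = 1e-6 * rng.standard_normal((3, 4))

f = lambda X: np.sum(X**2)
grad = 2 * X                                   # df/dX for this particular f

lhs = f(X + dX) - f(X)                         # actual change in f
rhs = np.trace(grad.T @ dX)                    # tr((df/dX)^T dX)
print(lhs, rhs)                                # agree to first order in dX

# tr(A^T B) equals the element-wise inner product of A and B
A, B = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))
```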

Next we establish the rules of calculation. Recall how we differentiate a moderately complicated single-variable function such as $f=\log(2+\sin x)\,e^{\sqrt{x}}$: usually not by taking limits from the definition, but by first establishing rules for differentiating elementary functions and for arithmetic operations, composition, and so on, and then applying those rules. Accordingly, we now set up the commonly used rules of matrix differentials:
1. Addition and subtraction: $d(X\pm Y)=dX\pm dY$; matrix multiplication: $d(XY)=(dX)Y+X\,dY$; transpose: $d(X^T)=(dX)^T$; trace: $d\operatorname{tr}(X)=\operatorname{tr}(dX)$.
2. Inverse: $dX^{-1}=-X^{-1}(dX)X^{-1}$. This can be proved by taking the differential of both sides of $XX^{-1}=I$.
3. Determinant: $d|X|=\operatorname{tr}(X^{\#}dX)$, where $X^{\#}$ denotes the adjugate matrix of $X$; when $X$ is invertible this can also be written as $d|X|=|X|\operatorname{tr}(X^{-1}dX)$. This can be proved using the Laplace expansion; see Zhang Xianda, Matrix Analysis and Applications, p. 279. (Rules 2 and 3 are also checked numerically in the sketch after this list.)
4. Element-wise multiplication: $d(X\odot Y)=dX\odot Y+X\odot dY$, where $\odot$ denotes element-wise multiplication of matrices $X,Y$ of the same size.
5. Element-wise function: $d\sigma(X)=\sigma'(X)\odot dX$, where $\sigma(X)=[\sigma(X_{ij})]$ applies a scalar function element-wise and $\sigma'(X)=[\sigma'(X_{ij})]$ is its element-wise derivative. For example:
   $X=\begin{bmatrix} x_{11} & x_{12}\\ x_{21} & x_{22}\end{bmatrix}$, $\ d\sin(X)=\begin{bmatrix}\cos x_{11}\,dx_{11} & \cos x_{12}\,dx_{12}\\ \cos x_{21}\,dx_{21} & \cos x_{22}\,dx_{22}\end{bmatrix}=\cos(X)\odot dX$
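As an aside, rules like these are easy to check numerically. Below is a rough NumPy sketch (my own illustration, with an arbitrarily chosen well-conditioned $X$) for the inverse and determinant rules:

```python
import numpy as np

# Check d(X^{-1}) = -X^{-1} dX X^{-1}  and  d|X| = |X| tr(X^{-1} dX).
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned square matrix
dX = 1e-6 * rng.standard_normal((4, 4))

Xinv = np.linalg.inv(X)
lhs_inv = np.linalg.inv(X + dX) - Xinv            # actual change of X^{-1}
rhs_inv = -Xinv @ dX @ Xinv                       # predicted first-order change
print(np.max(np.abs(lhs_inv - rhs_inv)))          # tiny, of order |dX|^2

lhs_det = np.linalg.det(X + dX) - np.linalg.det(X)
rhs_det = np.linalg.det(X) * np.trace(Xinv @ dX)
print(lhs_det, rhs_det)                           # agree to first order
```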
We want to exploit the link between the matrix derivative and the differential, $df=\operatorname{tr}\!\left(\frac{\partial f}{\partial X}^T dX\right)$: once the differential $df$ on the left has been computed, how do we rewrite it in the form on the right and read off the derivative? This requires a few trace tricks:
1. A scalar equals its own trace: $a=\operatorname{tr}(a)$.
2. Transpose: $\operatorname{tr}(A^T)=\operatorname{tr}(A)$.
3. Linearity: $\operatorname{tr}(A\pm B)=\operatorname{tr}(A)\pm\operatorname{tr}(B)$.
4. Commuting matrix products: $\operatorname{tr}(AB)=\operatorname{tr}(BA)$, where $A$ and $B^T$ have the same size. Both sides equal $\sum_{i,j}A_{ij}B_{ji}$.
5. Commuting a matrix product with an element-wise product: $\operatorname{tr}(A^T(B\odot C))=\operatorname{tr}((A\odot B)^TC)$, where $A,B,C$ have the same size. Both sides equal $\sum_{i,j}A_{ij}B_{ij}C_{ij}$. (Tricks 4 and 5 are verified numerically in the sketch below.)
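A small NumPy sketch (my own illustration, random test matrices) confirming tricks 4 and 5:

```python
import numpy as np

# Numerical check of trace tricks 4 and 5.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # trick 4

A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))    # same size
lhs = np.trace(A.T @ (B * C))                                # tr(A^T (B ⊙ C))
rhs = np.trace((A * B).T @ C)                                # tr((A ⊙ B)^T C)
print(np.isclose(lhs, rhs), np.isclose(lhs, np.sum(A * B * C)))
```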

From these observations we can assert: if a scalar function $f$ is built from a matrix $X$ by addition, subtraction, multiplication, inversion, determinants, element-wise functions, and so on, then we obtain the derivative by applying the corresponding rules to compute the differential of $f$, wrapping $df$ in a trace with the trace tricks, and commuting the other factors to the left of $dX$.

To finish setting up the rules, let us discuss composition. Suppose we have already obtained $\frac{\partial f}{\partial Y}$, and $Y$ is a function of $X$; how do we find $\frac{\partial f}{\partial X}$? In scalar calculus there is the chain rule $\frac{\partial f}{\partial x}=\frac{\partial f}{\partial y}\frac{\partial y}{\partial x}$, but we cannot simply carry it over here, because the derivative of a matrix with respect to a matrix, $\frac{\partial Y}{\partial X}$, has not yet been defined. So we trace the chain rule back to its source, which is again the differential, and build the composition rule directly from differentials: first write $df=\operatorname{tr}\!\left(\frac{\partial f}{\partial Y}^T dY\right)$, then express $dY$ in terms of $dX$ and substitute, and use the trace tricks to commute the other factors to the left of $dX$; this yields $\frac{\partial f}{\partial X}$.

Next we work through some examples. A particular reminder: compute according to the rules we have established, and do not casually apply results for scalar derivatives from calculus, e.g. by claiming that the derivative of $AX$ with respect to $X$ is $A$; such a claim is unfounded and not even well defined.

Example 1: $f=\boldsymbol a^TX\boldsymbol b$; find $\frac{\partial f}{\partial X}$. Here $\boldsymbol a$ is an $m\times 1$ column vector, $X$ is an $m\times n$ matrix, $\boldsymbol b$ is an $n\times 1$ column vector, and $f$ is a scalar.

Solution: First compute the differential using the matrix multiplication rule; here $\boldsymbol a,\boldsymbol b$ are constants, so $d\boldsymbol a=\boldsymbol 0$, $d\boldsymbol b=\boldsymbol 0$, giving $df=\boldsymbol a^T dX\,\boldsymbol b$. Then wrap it in a trace and commute the matrix product: $df=\operatorname{tr}(\boldsymbol a^T dX\,\boldsymbol b)=\operatorname{tr}(\boldsymbol b\boldsymbol a^T dX)$; note that we used $\operatorname{tr}(AB)=\operatorname{tr}(BA)$ to swap $\boldsymbol a^T dX$ and $\boldsymbol b$. Comparing with the link between derivative and differential, $df=\operatorname{tr}\!\left(\frac{\partial f}{\partial X}^T dX\right)$, we obtain $\frac{\partial f}{\partial X}=(\boldsymbol b\boldsymbol a^T)^T=\boldsymbol a\boldsymbol b^T$.
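A finite-difference check of this result, as a NumPy sketch of my own (random $\boldsymbol a$, $\boldsymbol b$, $X$ are assumptions):

```python
import numpy as np

# f(X) = a^T X b  has derivative  df/dX = a b^T.
rng = np.random.default_rng(3)
m, n = 3, 4
a, b, X = rng.standard_normal(m), rng.standard_normal(n), rng.standard_normal((m, n))

f = lambda X: a @ X @ b
eps, num_grad = 1e-6, np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)   # central difference

print(np.allclose(num_grad, np.outer(a, b)))                 # a b^T
```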

Note: one cannot write $\frac{\partial f}{\partial X}=\boldsymbol a^T\frac{\partial X}{\partial X}\boldsymbol b=\,?$ here; commuting a derivative with multiplication by a constant matrix is not a legitimate operation (whereas it is legitimate for differentials). Some references skip the differential step when computing matrix derivatives, which cannot be justified logically.

Example 2: $f=\boldsymbol a^T\exp(X\boldsymbol b)$; find $\frac{\partial f}{\partial X}$. Here $\boldsymbol a$ is an $m\times 1$ column vector, $X$ is an $m\times n$ matrix, $\boldsymbol b$ is an $n\times 1$ column vector, $\exp$ denotes the element-wise exponential, and $f$ is a scalar.

Solution: First compute the differential using the matrix multiplication and element-wise function rules: $df=\boldsymbol a^T\big(\exp(X\boldsymbol b)\odot(dX\,\boldsymbol b)\big)$. Then wrap it in a trace and commute: $df=\operatorname{tr}\!\big(\boldsymbol a^T(\exp(X\boldsymbol b)\odot(dX\,\boldsymbol b))\big)=\operatorname{tr}\!\big((\boldsymbol a\odot \exp(X\boldsymbol b))^T dX\,\boldsymbol b\big)=\operatorname{tr}\!\big(\boldsymbol b(\boldsymbol a\odot \exp(X\boldsymbol b))^T dX\big)$. Note that we first used $\operatorname{tr}(A^T(B\odot C))=\operatorname{tr}((A\odot B)^TC)$ to swap $\boldsymbol a$, $\exp(X\boldsymbol b)$, and $dX\,\boldsymbol b$, and then used $\operatorname{tr}(AB)=\operatorname{tr}(BA)$ to swap $(\boldsymbol a\odot \exp(X\boldsymbol b))^T dX$ and $\boldsymbol b$. Comparing with $df=\operatorname{tr}\!\left(\frac{\partial f}{\partial X}^T dX\right)$, we obtain $\frac{\partial f}{\partial X}=\big(\boldsymbol b(\boldsymbol a\odot \exp(X\boldsymbol b))^T\big)^T=(\boldsymbol a\odot \exp(X\boldsymbol b))\boldsymbol b^T$.
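Again a finite-difference check (my own NumPy sketch, with random test data):

```python
import numpy as np

# f(X) = a^T exp(X b)  has derivative  df/dX = (a ⊙ exp(X b)) b^T.
rng = np.random.default_rng(4)
m, n = 3, 4
a, b, X = rng.standard_normal(m), rng.standard_normal(n), rng.standard_normal((m, n))

f = lambda X: a @ np.exp(X @ b)
eps, num_grad = 1e-6, np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

analytic = np.outer(a * np.exp(X @ b), b)      # (a ⊙ exp(Xb)) b^T
print(np.allclose(num_grad, analytic))
```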

Example 3: $f=\operatorname{tr}(Y^TMY)$, $Y=\sigma(WX)$; find $\frac{\partial f}{\partial X}$. Here $W$ is an $l\times m$ matrix, $X$ is an $m\times n$ matrix, $Y$ is an $l\times n$ matrix, $M$ is an $l\times l$ matrix, $\sigma$ is an element-wise function, and $f$ is a scalar.

Solution: First find $\frac{\partial f}{\partial Y}$. Take the differential using the matrix multiplication and transpose rules: $df=\operatorname{tr}((dY)^TMY)+\operatorname{tr}(Y^TM\,dY)=\operatorname{tr}(Y^TM^TdY)+\operatorname{tr}(Y^TM\,dY)=\operatorname{tr}(Y^T(M+M^T)dY)$. Comparing with the link between derivative and differential, we obtain $\frac{\partial f}{\partial Y}=(M+M^T)Y=2MY$, where the last equality holds because $M$ is taken to be a symmetric matrix here. To find $\frac{\partial f}{\partial X}$, write $df=\operatorname{tr}\!\left(\frac{\partial f}{\partial Y}^T dY\right)$, express $dY$ in terms of $dX$ and substitute, and use the trick for commuting matrix products with element-wise products: $df=\operatorname{tr}\!\left(\frac{\partial f}{\partial Y}^T\big(\sigma'(WX)\odot(W\,dX)\big)\right)=\operatorname{tr}\!\left(\left(\frac{\partial f}{\partial Y}\odot\sigma'(WX)\right)^T W\,dX\right)$. Comparing again, we obtain $\frac{\partial f}{\partial X}=W^T\left(\frac{\partial f}{\partial Y}\odot\sigma'(WX)\right)=W^T\big((2M\sigma(WX))\odot\sigma'(WX)\big)$.
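A finite-difference check of this result (my own NumPy sketch; taking $\sigma=\tanh$ and a symmetric $M$ as assumptions):

```python
import numpy as np

# With sigma = tanh and symmetric M:  df/dX = W^T ((2 M σ(WX)) ⊙ σ'(WX)).
rng = np.random.default_rng(5)
l, m, n = 3, 4, 2
W, X = rng.standard_normal((l, m)), rng.standard_normal((m, n))
M = rng.standard_normal((l, l)); M = M + M.T          # make M symmetric

f = lambda X: np.trace(np.tanh(W @ X).T @ M @ np.tanh(W @ X))
eps, num_grad = 1e-6, np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

Y = np.tanh(W @ X)
analytic = W.T @ ((2 * M @ Y) * (1 - Y**2))            # σ'(z) = 1 - tanh(z)^2
print(np.allclose(num_grad, analytic))
```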

Example 4 (linear regression): $l=\|X\boldsymbol\omega-\boldsymbol y\|^2$; find the least-squares estimate of $\boldsymbol\omega$, i.e. the zero of $\frac{\partial l}{\partial\boldsymbol\omega}$. Here $\boldsymbol y$ is an $m\times 1$ column vector, $X$ is an $m\times n$ matrix, $\boldsymbol\omega$ is an $n\times 1$ column vector, and $l$ is a scalar.

Solution: Strictly speaking this is the derivative of a scalar with respect to a vector, but a vector can be viewed as a special case of a matrix. First rewrite the squared norm as the inner product of the vector with itself: $l=(X\boldsymbol\omega-\boldsymbol y)^T(X\boldsymbol\omega-\boldsymbol y)$. Take the differential using the matrix multiplication and transpose rules: $dl=(X\,d\boldsymbol\omega)^T(X\boldsymbol\omega-\boldsymbol y)+(X\boldsymbol\omega-\boldsymbol y)^T(X\,d\boldsymbol\omega)=2(X\boldsymbol\omega-\boldsymbol y)^TX\,d\boldsymbol\omega$, where the first term, being a scalar, equals its own transpose and hence equals the second term. Comparing with the link between derivative and differential, $dl=\frac{\partial l}{\partial\boldsymbol\omega}^T d\boldsymbol\omega$, we obtain $\frac{\partial l}{\partial\boldsymbol\omega}=\big(2(X\boldsymbol\omega-\boldsymbol y)^TX\big)^T=2X^T(X\boldsymbol\omega-\boldsymbol y)$. The zero of $\frac{\partial l}{\partial\boldsymbol\omega}$, i.e. the least-squares estimate of $\boldsymbol\omega$, is $\boldsymbol\omega=(X^TX)^{-1}X^T\boldsymbol y$.
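As an illustration (my own NumPy sketch, with random test data), the normal-equation solution can be compared against a library least-squares solver, and the gradient can be checked to vanish there:

```python
import numpy as np

# The normal-equation solution (X^T X)^{-1} X^T y agrees with np.linalg.lstsq.
rng = np.random.default_rng(6)
m, n = 20, 3
X, y = rng.standard_normal((m, n)), rng.standard_normal(m)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)       # zero of 2 X^T (Xw - y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal_eq, w_lstsq))

# The gradient 2 X^T (Xw - y) indeed vanishes at the solution.
print(np.allclose(2 * X.T @ (X @ w_normal_eq - y), 0))
```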

Example 5 (maximum-likelihood estimation of the covariance): given samples $\boldsymbol x_1,\dots,\boldsymbol x_N\sim\mathcal N(\boldsymbol\mu,\Sigma)$, find the maximum-likelihood estimate of the covariance $\Sigma$. Written out, this amounts to minimizing $l=\log|\Sigma|+\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})$, i.e. finding the zero of $\frac{\partial l}{\partial\Sigma}$. Here $\boldsymbol x_i$ is an $m\times 1$ vector, $\Sigma$ is an $m\times m$ symmetric positive-definite matrix, $\bar{\boldsymbol x}=\frac{1}{N}\sum_{i=1}^N\boldsymbol x_i$ is the sample mean, $l$ is a scalar, and $\log$ denotes the natural logarithm.

Solution: First take the differential, using the matrix multiplication, determinant, and inverse rules. The first term is $d\log|\Sigma|=|\Sigma|^{-1}d|\Sigma|=\operatorname{tr}(\Sigma^{-1}d\Sigma)$. The second term is $\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T d\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})=-\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})$. Now wrap the second term in a trace and commute: $\operatorname{tr}\!\Big(\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})\Big)=\frac{1}{N}\sum_{i=1}^N\operatorname{tr}\!\big((\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})\big)=\frac{1}{N}\sum_{i=1}^N\operatorname{tr}\!\big(\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})(\boldsymbol x_i-\bar{\boldsymbol x})^T\Sigma^{-1}d\Sigma\big)=\operatorname{tr}(\Sigma^{-1}S\Sigma^{-1}d\Sigma)$, where we first exchange the trace with the sum, then commute $\Sigma^{-1}(\boldsymbol x_i-\bar{\boldsymbol x})$ to the left, then exchange the trace with the sum again, defining $S=\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar{\boldsymbol x})(\boldsymbol x_i-\bar{\boldsymbol x})^T$ as the sample covariance matrix. We thus obtain $dl=\operatorname{tr}\!\big((\Sigma^{-1}-\Sigma^{-1}S\Sigma^{-1})\,d\Sigma\big)$. Comparing with the link between derivative and differential, $\frac{\partial l}{\partial\Sigma}=(\Sigma^{-1}-\Sigma^{-1}S\Sigma^{-1})^T$, and its zero, the maximum-likelihood estimate of $\Sigma$, is $\Sigma=S$.
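A numerical check of the derived gradient (my own NumPy sketch; the synthetic data and the test point $\Sigma=S+\tfrac12 I$ are assumptions):

```python
import numpy as np

# The gradient of l(Σ) = log|Σ| + (1/N) Σ_i (x_i - x̄)^T Σ^{-1} (x_i - x̄)
# is Σ^{-1} - Σ^{-1} S Σ^{-1}, which vanishes at Σ = S.
rng = np.random.default_rng(7)
m, N = 3, 200
Xs = rng.standard_normal((N, m)) @ rng.standard_normal((m, m))   # correlated samples
xbar = Xs.mean(axis=0)
S = (Xs - xbar).T @ (Xs - xbar) / N                               # sample covariance

def loss(Sigma):
    diffs = Xs - xbar
    return np.linalg.slogdet(Sigma)[1] + np.mean(
        np.einsum('ni,ij,nj->n', diffs, np.linalg.inv(Sigma), diffs))

Sigma = S + 0.5 * np.eye(m)                       # some symmetric test point
Sinv = np.linalg.inv(Sigma)
analytic = Sinv - Sinv @ S @ Sinv                 # derived gradient

eps, num_grad = 1e-6, np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m)); E[i, j] = eps
        num_grad[i, j] = (loss(Sigma + E) - loss(Sigma - E)) / (2 * eps)
print(np.allclose(num_grad, analytic, atol=1e-4))

# At Σ = S the gradient is (numerically) zero.
Sinv = np.linalg.inv(S)
print(np.allclose(Sinv - Sinv @ S @ Sinv, 0))
```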

Example 6 (multi-class logistic regression): $l=-\boldsymbol y^T\log\operatorname{softmax}(W\boldsymbol x)$; find $\frac{\partial l}{\partial W}$. Here $\boldsymbol y$ is an $m\times 1$ one-hot column vector (one element equals 1, the rest are 0), $W$ is an $m\times n$ matrix, $\boldsymbol x$ is an $n\times 1$ column vector, and $l$ is a scalar; $\log$ denotes the natural logarithm, and $\operatorname{softmax}(\boldsymbol a)=\frac{\exp(\boldsymbol a)}{\boldsymbol 1^T\exp(\boldsymbol a)}$, where $\exp(\boldsymbol a)$ denotes the element-wise exponential and $\boldsymbol 1$ is the all-ones vector.

Solution 1: First substitute the softmax function and write $l=-\boldsymbol y^T\big(\log(\exp(W\boldsymbol x))-\boldsymbol 1\log(\boldsymbol 1^T\exp(W\boldsymbol x))\big)=-\boldsymbol y^TW\boldsymbol x+\log(\boldsymbol 1^T\exp(W\boldsymbol x))$. Note that the element-wise $\log$ satisfies $\log(\boldsymbol u/c)=\log(\boldsymbol u)-\boldsymbol 1\log(c)$, and $\boldsymbol y$ satisfies $\boldsymbol y^T\boldsymbol 1=1$. Take the differential using the matrix multiplication and element-wise function rules: $dl=-\boldsymbol y^TdW\,\boldsymbol x+\frac{\boldsymbol 1^T\big(\exp(W\boldsymbol x)\odot(dW\,\boldsymbol x)\big)}{\boldsymbol 1^T\exp(W\boldsymbol x)}$. Then wrap in a trace and commute; note the simplification $\boldsymbol 1^T\big(\exp(W\boldsymbol x)\odot(dW\,\boldsymbol x)\big)=\exp(W\boldsymbol x)^TdW\,\boldsymbol x$, which follows from the identity $\boldsymbol 1^T(\boldsymbol u\odot\boldsymbol v)=\boldsymbol u^T\boldsymbol v$. Hence $dl=\operatorname{tr}\!\left(-\boldsymbol y^TdW\,\boldsymbol x+\frac{\exp(W\boldsymbol x)^TdW\,\boldsymbol x}{\boldsymbol 1^T\exp(W\boldsymbol x)}\right)=\operatorname{tr}\!\big(\boldsymbol x(\operatorname{softmax}(W\boldsymbol x)-\boldsymbol y)^TdW\big)$. Comparing with the link between derivative and differential, we obtain $\frac{\partial l}{\partial W}=(\operatorname{softmax}(W\boldsymbol x)-\boldsymbol y)\boldsymbol x^T$.
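A finite-difference check of this gradient (my own NumPy sketch; the max-subtraction in softmax is only for numerical stability and does not change its value):

```python
import numpy as np

# l = -y^T log softmax(W x)  has derivative  dl/dW = (softmax(Wx) - y) x^T.
rng = np.random.default_rng(8)
m, n = 4, 3
W, x = rng.standard_normal((m, n)), rng.standard_normal(n)
y = np.zeros(m); y[1] = 1.0                               # one-hot label

def softmax(a):
    e = np.exp(a - a.max())                               # stabilized softmax
    return e / e.sum()

loss = lambda W: -y @ np.log(softmax(W @ x))
eps, num_grad = 1e-6, np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        num_grad[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.allclose(num_grad, np.outer(softmax(W @ x) - y, x)))
```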

Solution 2: Define $\boldsymbol a=W\boldsymbol x$, so that $l=-\boldsymbol y^T\log\operatorname{softmax}(\boldsymbol a)$. First obtain, as above, $\frac{\partial l}{\partial\boldsymbol a}=\operatorname{softmax}(\boldsymbol a)-\boldsymbol y$. Then apply the composition rule: $dl=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a}^T d\boldsymbol a\right)=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a}^T dW\,\boldsymbol x\right)=\operatorname{tr}\!\left(\boldsymbol x\frac{\partial l}{\partial\boldsymbol a}^T dW\right)$, which gives $\frac{\partial l}{\partial W}=\frac{\partial l}{\partial\boldsymbol a}\boldsymbol x^T$.

The last example is reserved for the classic neural network. Differentiating a neural network is an important result in the history of the field and even has its own name, the backpropagation (BP) algorithm. I believe many people find their first derivation of BP rather taxing; in fact, with matrix derivative techniques the derivation is not complicated. For simplicity we derive the BP algorithm for a two-layer network.

Example 7 (two-layer neural network): $l=-\boldsymbol y^T\log\operatorname{softmax}(W_2\,\sigma(W_1\boldsymbol x))$; find $\frac{\partial l}{\partial W_1}$ and $\frac{\partial l}{\partial W_2}$. Here $\boldsymbol y$ is an $m\times 1$ one-hot column vector, $W_2$ is an $m\times p$ matrix, $W_1$ is a $p\times n$ matrix, $\boldsymbol x$ is an $n\times 1$ column vector, and $l$ is a scalar; $\log$ denotes the natural logarithm, $\operatorname{softmax}(\boldsymbol a)=\frac{\exp(\boldsymbol a)}{\boldsymbol 1^T\exp(\boldsymbol a)}$ as before, and $\sigma$ is the element-wise sigmoid function.

Solution: Define $\boldsymbol a_1=W_1\boldsymbol x$, $\boldsymbol h_1=\sigma(\boldsymbol a_1)$, $\boldsymbol a_2=W_2\boldsymbol h_1$, so that $l=-\boldsymbol y^T\log\operatorname{softmax}(\boldsymbol a_2)$. From the previous example we already have $\frac{\partial l}{\partial\boldsymbol a_2}=\operatorname{softmax}(\boldsymbol a_2)-\boldsymbol y$. Applying the composition rule, $dl=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a_2}^T d\boldsymbol a_2\right)=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a_2}^T dW_2\,\boldsymbol h_1\right)+\underbrace{\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a_2}^T W_2\,d\boldsymbol h_1\right)}_{dl_2}$. Using the matrix-product-commutation trace trick, the first term gives $\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial\boldsymbol a_2}\boldsymbol h_1^T$ and the second gives $\frac{\partial l}{\partial\boldsymbol h_1}=W_2^T\frac{\partial l}{\partial\boldsymbol a_2}$. Next, continue with the composition rule on the second term to find $\frac{\partial l}{\partial\boldsymbol a_1}$, using the trick for commuting matrix products with element-wise products: $dl_2=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol h_1}^T d\boldsymbol h_1\right)=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol h_1}^T\big(\sigma'(\boldsymbol a_1)\odot d\boldsymbol a_1\big)\right)=\operatorname{tr}\!\left(\left(\frac{\partial l}{\partial\boldsymbol h_1}\odot\sigma'(\boldsymbol a_1)\right)^T d\boldsymbol a_1\right)$, which gives $\frac{\partial l}{\partial\boldsymbol a_1}=\frac{\partial l}{\partial\boldsymbol h_1}\odot\sigma'(\boldsymbol a_1)$. To find $\frac{\partial l}{\partial W_1}$, apply the composition rule once more: $dl_2=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a_1}^T d\boldsymbol a_1\right)=\operatorname{tr}\!\left(\frac{\partial l}{\partial\boldsymbol a_1}^T dW_1\,\boldsymbol x\right)=\operatorname{tr}\!\left(\boldsymbol x\frac{\partial l}{\partial\boldsymbol a_1}^T dW_1\right)$, giving $\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial\boldsymbol a_1}\boldsymbol x^T$.
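The derivation translates directly into code. Below is a minimal NumPy sketch of my own (random data and one particular label are assumptions) implementing the forward pass, the backward pass derived above, and a finite-difference check on $W_1$:

```python
import numpy as np

rng = np.random.default_rng(9)
m, p, n = 4, 5, 3
W1, W2 = rng.standard_normal((p, n)), rng.standard_normal((m, p))
x = rng.standard_normal(n)
y = np.zeros(m); y[2] = 1.0

sigmoid = lambda z: 1 / (1 + np.exp(-z))
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def loss(W1, W2):
    return -y @ np.log(softmax(W2 @ sigmoid(W1 @ x)))

# Backward pass following the derivation.
a1 = W1 @ x; h1 = sigmoid(a1); a2 = W2 @ h1
dl_da2 = softmax(a2) - y
dW2 = np.outer(dl_da2, h1)                      # dl/dW2 = dl/da2 h1^T
dl_dh1 = W2.T @ dl_da2                          # dl/dh1 = W2^T dl/da2
dl_da1 = dl_dh1 * h1 * (1 - h1)                 # σ'(a1) = σ(a1)(1 - σ(a1))
dW1 = np.outer(dl_da1, x)                       # dl/dW1 = dl/da1 x^T

eps, num_dW1 = 1e-6, np.zeros_like(W1)
for i in range(p):
    for j in range(n):
        E = np.zeros_like(W1); E[i, j] = eps
        num_dW1[i, j] = (loss(W1 + E, W2) - loss(W1 - E, W2)) / (2 * eps)
print(np.allclose(num_dW1, dW1))
```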

Generalization: given samples $(\boldsymbol x_1,\boldsymbol y_1),\dots,(\boldsymbol x_N,\boldsymbol y_N)$ and $l=-\sum_{i=1}^N\boldsymbol y_i^T\log\operatorname{softmax}(W_2\,\sigma(W_1\boldsymbol x_i+\boldsymbol b_1)+\boldsymbol b_2)$, find $\frac{\partial l}{\partial W_1}$, $\frac{\partial l}{\partial\boldsymbol b_1}$, $\frac{\partial l}{\partial W_2}$, and $\frac{\partial l}{\partial\boldsymbol b_2}$. Here $\boldsymbol b_1$ is a $p\times 1$ column vector, $\boldsymbol b_2$ is an $m\times 1$ column vector, and the remaining definitions are as above.

Solution 1: Define $\boldsymbol a_{1,i}=W_1\boldsymbol x_i+\boldsymbol b_1$, $\boldsymbol h_{1,i}=\sigma(\boldsymbol a_{1,i})$, $\boldsymbol a_{2,i}=W_2\boldsymbol h_{1,i}+\boldsymbol b_2$, so that $l=-\sum_{i=1}^N\boldsymbol y_i^T\log\operatorname{softmax}(\boldsymbol a_{2,i})$. As before, $\frac{\partial l}{\partial\boldsymbol a_{2,i}}=\operatorname{softmax}(\boldsymbol a_{2,i})-\boldsymbol y_i$. Applying the composition rule, $dl=\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}^T d\boldsymbol a_{2,i}\right)=\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}^T dW_2\,\boldsymbol h_{1,i}\right)+\underbrace{\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}^T W_2\,d\boldsymbol h_{1,i}\right)}_{dl_2}+\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}^T d\boldsymbol b_2\right)$. The first term gives $\frac{\partial l}{\partial W_2}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}\boldsymbol h_{1,i}^T$, the second gives $\frac{\partial l}{\partial\boldsymbol h_{1,i}}=W_2^T\frac{\partial l}{\partial\boldsymbol a_{2,i}}$, and the third gives $\frac{\partial l}{\partial\boldsymbol b_2}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{2,i}}$. Next, continuing with the composition rule on the second term gives $\frac{\partial l}{\partial\boldsymbol a_{1,i}}=\frac{\partial l}{\partial\boldsymbol h_{1,i}}\odot\sigma'(\boldsymbol a_{1,i})$. To find $\frac{\partial l}{\partial W_1}$ and $\frac{\partial l}{\partial\boldsymbol b_1}$, apply the composition rule once more: $dl_2=\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}^T d\boldsymbol a_{1,i}\right)=\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}^T dW_1\,\boldsymbol x_i\right)+\operatorname{tr}\!\left(\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}^T d\boldsymbol b_1\right)$, giving $\frac{\partial l}{\partial W_1}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}\boldsymbol x_i^T$ and $\frac{\partial l}{\partial\boldsymbol b_1}=\sum_{i=1}^N\frac{\partial l}{\partial\boldsymbol a_{1,i}}$.

Solution 2: The $N$ samples can be represented with matrices to simplify the expressions. Define $X=[\boldsymbol x_1,\dots,\boldsymbol x_N]$, $A_1=[\boldsymbol a_{1,1},\dots,\boldsymbol a_{1,N}]=W_1X+\boldsymbol b_1\boldsymbol 1^T$, $H_1=[\boldsymbol h_{1,1},\dots,\boldsymbol h_{1,N}]=\sigma(A_1)$, $A_2=[\boldsymbol a_{2,1},\dots,\boldsymbol a_{2,N}]=W_2H_1+\boldsymbol b_2\boldsymbol 1^T$; note the use of the all-ones vector to broadcast the bias across columns. As before, $\frac{\partial l}{\partial A_2}=[\operatorname{softmax}(\boldsymbol a_{2,1})-\boldsymbol y_1,\dots,\operatorname{softmax}(\boldsymbol a_{2,N})-\boldsymbol y_N]$. Applying the composition rule, $dl=\operatorname{tr}\!\left(\frac{\partial l}{\partial A_2}^T dA_2\right)=\operatorname{tr}\!\left(\frac{\partial l}{\partial A_2}^T dW_2\,H_1\right)+\underbrace{\operatorname{tr}\!\left(\frac{\partial l}{\partial A_2}^T W_2\,dH_1\right)}_{dl_2}+\operatorname{tr}\!\left(\frac{\partial l}{\partial A_2}^T d\boldsymbol b_2\boldsymbol 1^T\right)$. The first term gives $\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial A_2}H_1^T$, the second gives $\frac{\partial l}{\partial H_1}=W_2^T\frac{\partial l}{\partial A_2}$, and the third gives $\frac{\partial l}{\partial\boldsymbol b_2}=\frac{\partial l}{\partial A_2}\boldsymbol 1$. Next, continuing with the composition rule on the second term gives $\frac{\partial l}{\partial A_1}=\frac{\partial l}{\partial H_1}\odot\sigma'(A_1)$. To find $\frac{\partial l}{\partial W_1}$ and $\frac{\partial l}{\partial\boldsymbol b_1}$, apply the composition rule once more: $dl_2=\operatorname{tr}\!\left(\frac{\partial l}{\partial A_1}^T dA_1\right)=\operatorname{tr}\!\left(\frac{\partial l}{\partial A_1}^T dW_1\,X\right)+\operatorname{tr}\!\left(\frac{\partial l}{\partial A_1}^T d\boldsymbol b_1\boldsymbol 1^T\right)$, giving $\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial A_1}X^T$ and $\frac{\partial l}{\partial\boldsymbol b_1}=\frac{\partial l}{\partial A_1}\boldsymbol 1$.
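The matrix form also translates directly into code. Below is a minimal NumPy sketch of my own (a random batch and arbitrary one-hot labels are assumptions) implementing the batched backward pass and checking $\frac{\partial l}{\partial\boldsymbol b_1}$ against finite differences:

```python
import numpy as np

rng = np.random.default_rng(10)
m, p, n, N = 4, 5, 3, 7
W1, W2 = rng.standard_normal((p, n)), rng.standard_normal((m, p))
b1, b2 = rng.standard_normal(p), rng.standard_normal(m)
X = rng.standard_normal((n, N))
Y = np.eye(m)[:, rng.integers(0, m, N)]          # one-hot labels as columns

sigmoid = lambda z: 1 / (1 + np.exp(-z))
def softmax_cols(A):
    E = np.exp(A - A.max(axis=0))
    return E / E.sum(axis=0)

def loss(W1, b1, W2, b2):
    H1 = sigmoid(W1 @ X + b1[:, None])
    P = softmax_cols(W2 @ H1 + b2[:, None])
    return -np.sum(Y * np.log(P))

# Backward pass in matrix form.
A1 = W1 @ X + b1[:, None]; H1 = sigmoid(A1); A2 = W2 @ H1 + b2[:, None]
dA2 = softmax_cols(A2) - Y
dW2, db2 = dA2 @ H1.T, dA2 @ np.ones(N)          # dl/dW2, dl/db2
dH1 = W2.T @ dA2                                 # dl/dH1
dA1 = dH1 * H1 * (1 - H1)                        # dl/dA1, σ' = σ(1-σ)
dW1, db1_grad = dA1 @ X.T, dA1 @ np.ones(N)      # dl/dW1, dl/db1

eps, num_db1 = 1e-6, np.zeros(p)
for i in range(p):
    e = np.zeros(p); e[i] = eps
    num_db1[i] = (loss(W1, b1 + e, W2, b2) - loss(W1, b1 - e, W2, b2)) / (2 * eps)
print(np.allclose(num_db1, db1_grad))
```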
