Mathematics for Machine Learning: Differentiation Techniques

Notation

$x$: a scalar

$\boldsymbol{x}$: a vector

$\boldsymbol{X}$: a matrix

$tr(\boldsymbol{A})$: the trace of the matrix $\boldsymbol{A}$

$\langle \boldsymbol{A},\boldsymbol{B} \rangle$: the inner product (scalar product) of the matrices $\boldsymbol{A}$ and $\boldsymbol{B}$

$\boldsymbol{A}*\boldsymbol{B}$: the element-wise (Hadamard) product of the matrices $\boldsymbol{A}$ and $\boldsymbol{B}$

$\boldsymbol{A}\otimes\boldsymbol{B}$: the tensor (Kronecker) product of the matrices $\boldsymbol{A}$ and $\boldsymbol{B}$

$vec(\boldsymbol{A})$: the column-wise vectorization of the matrix $\boldsymbol{A}$

Note: all quantities in this article are real-valued.


I. Definitions of derivatives and layout conventions

1. Definitions of derivatives

Depending on whether the independent variable and the dependent variable are each a scalar, a vector, or a matrix, there are nine possible kinds of derivative:

| Independent \ Dependent | Scalar $y$ | Vector $\boldsymbol{y}$ | Matrix $\boldsymbol{Y}$ |
| --- | --- | --- | --- |
| Scalar $x$ | $\frac{\partial y}{\partial x}$ | $\frac{\partial \boldsymbol{y}}{\partial x}$ | $\frac{\partial \boldsymbol{Y}}{\partial x}$ |
| Vector $\boldsymbol{x}$ | $\frac{\partial y}{\partial \boldsymbol{x}}$ | $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$ | $\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{x}}$ |
| Matrix $\boldsymbol{X}$ | $\frac{\partial y}{\partial \boldsymbol{X}}$ | $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{X}}$ | $\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}$ |

Elementary calculus covers the derivative of a scalar with respect to a scalar. Every other kind of derivative ultimately reduces to scalar-by-scalar derivatives, whose results are arranged in a prescribed way and written as a vector or a matrix.

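2. Numerator layout and denominator layout

Take the derivative of an $m$-dimensional vector $\boldsymbol{y}$ with respect to a scalar $x$ as an example. The result $\frac{\partial \boldsymbol{y}}{\partial x}$ is again an $m$-dimensional vector, each component being the derivative of the corresponding component of $\boldsymbol{y}$ with respect to $x$. Should $\frac{\partial \boldsymbol{y}}{\partial x}$ be written as a row vector or as a column vector? Either is valid, but choosing arbitrarily from step to step makes a long calculation hard to read and to carry out. Layout conventions fix the choice:

  • Numerator layout: the shape of the derivative follows the numerator.
  • Denominator layout: the shape of the derivative follows the denominator.

This gives the following shapes (N stands for numerator layout, D for denominator layout):

| Independent \ Dependent | Scalar $y$ | $m$-dimensional column vector $\boldsymbol{y}$ | $p \times q$ matrix $\boldsymbol{Y}$ |
| --- | --- | --- | --- |
| Scalar $x$ | / | N: $m$-dimensional column vector; D: $m$-dimensional row vector | N: $p \times q$ matrix; D: $q \times p$ matrix |
| $n$-dimensional column vector $\boldsymbol{x}$ | N: $n$-dimensional row vector; D: $n$-dimensional column vector | N: $m \times n$ matrix; D: $n \times m$ matrix | / |
| $s \times t$ matrix $\boldsymbol{X}$ | N: $t \times s$ matrix; D: $s \times t$ matrix | / | / |

Note 1: The derivative of an $m$-dimensional column vector $\boldsymbol{y}$ with respect to an $n$-dimensional column vector $\boldsymbol{x}$, arranged in numerator layout, is the $m \times n$ Jacobian matrix.

Note 2: The numerator-layout result and the denominator-layout result differ by a transpose.

Note 3: Derivations of machine learning algorithms usually follow the conventions below, which this article adopts:

  • Vector or matrix by scalar: numerator layout
  • Scalar by vector or matrix: denominator layout
  • Vector by vector: denominator layout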

II. Derivative of a scalar with respect to a matrix

1. By definition

Let $y$ be a scalar and $\boldsymbol{X}$ an $m \times n$ matrix. The derivative of $y$ with respect to $\boldsymbol{X}$ is the $m \times n$ matrix of entry-wise partial derivatives:
$$\frac{\partial y}{\partial \boldsymbol{X}}=\bigg[\frac{\partial y}{\partial x_{ij}}\bigg]_{i=1,j=1}^{m,n}$$


Example 1: Let $f=\boldsymbol{a}^T\boldsymbol{X}\boldsymbol{b}$. Use the definition to find $\frac{\partial f}{\partial \boldsymbol{X}}$, where $\boldsymbol{a}$ is an $m \times 1$ column vector, $\boldsymbol{X}$ is an $m \times n$ matrix, $\boldsymbol{b}$ is an $n \times 1$ column vector, and $f$ is a scalar.

Solution: Expanding the product gives
$$\boldsymbol{a}^T\boldsymbol{X}\boldsymbol{b}=\sum_{i=1}^{m}\sum_{j=1}^{n}a_ix_{ij}b_j$$
Only the term containing $x_{ij}$ survives differentiation, so
$$\frac{\partial f}{\partial x_{ij}}=\frac{\partial \boldsymbol{a}^T\boldsymbol{X}\boldsymbol{b}}{\partial x_{ij}}=a_ib_j$$
Collecting the entries $a_ib_j$ into an $m \times n$ matrix gives
$$\frac{\partial f}{\partial \boldsymbol{X}}=\boldsymbol{a}\boldsymbol{b}^T$$
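The result $\frac{\partial f}{\partial \boldsymbol{X}}=\boldsymbol{a}\boldsymbol{b}^T$ can be sanity-checked numerically. The NumPy sketch below is an illustration only: the sizes, the random test data, and the step size `eps` are assumptions. It estimates each entry $\frac{\partial f}{\partial x_{ij}}$ by central differences, following the entry-by-entry definition, and compares the result with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                       # illustrative sizes
a = rng.normal(size=(m, 1))
X = rng.normal(size=(m, n))
b = rng.normal(size=(n, 1))

def f(X):
    """f(X) = a^T X b, returned as a Python float."""
    return (a.T @ X @ b).item()

# Definition: G[i, j] = df/dx_ij, estimated by a central difference.
eps = 1e-6
G_numeric = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        G_numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

G_closed = a @ b.T                # closed form a b^T, shape (m, n)
max_err = np.max(np.abs(G_numeric - G_closed))
```

Because $f$ is linear in $\boldsymbol{X}$, the central difference is exact up to rounding, so `max_err` should be tiny.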


2. The differential method

In single-variable calculus, the derivative (scalar by scalar) is tied to the differential by

$$df=f'(x)dx$$
In multivariable calculus, the gradient (scalar by vector) is tied to the differential in the same way:

$$df=\sum_{i=1}^{n}\frac{\partial f}{\partial x_i}dx_i=\frac{\partial f}{\partial \boldsymbol{x}}^Td\boldsymbol{x}$$
Extending this relation from vectors to matrices gives

$$df=\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\partial f}{\partial x_{ij}}dx_{ij}=tr\bigg(\frac{\partial f}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\bigg)$$
In other words, the total differential $df$ is the inner product (scalar product) of the derivative $\frac{\partial f}{\partial \boldsymbol{X}}$ ($m \times n$) and the differential matrix $d\boldsymbol{X}$ ($m \times n$). Since the inner product of two matrices can be written as a trace, $\langle \boldsymbol{A},\boldsymbol{B} \rangle=tr(\boldsymbol{A}^T\boldsymbol{B})$, the derivative of a scalar $f$ with respect to a matrix $\boldsymbol{X}$ can be found as follows:

  • (1) Compute $df$ from the given $f$.
  • (2) Wrap $df$ in a trace; since $df$ is a scalar, $df=tr(df)$.
  • (3) Simplify $tr(df)$ and read off $\frac{\partial f}{\partial \boldsymbol{X}}$ from the relation $df=tr\bigg(\frac{\partial f}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\bigg)$.
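The relation $df=tr\big(\frac{\partial f}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\big)$ behind these steps can be observed numerically. The sketch below is a hedged illustration, not part of the original derivation: it assumes the test function $f(\boldsymbol{X})=tr(\boldsymbol{X}^T\boldsymbol{X})$, whose gradient is $2\boldsymbol{X}$, and a random small perturbation `dX`. It compares the actual change in $f$ with the trace form and with the entry-wise inner product.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
dX = 1e-6 * rng.normal(size=(3, 4))    # small perturbation playing the role of dX

def f(X):
    """Assumed test function: f(X) = tr(X^T X) = sum of squared entries."""
    return np.sum(X ** 2)

G = 2 * X                              # gradient of the assumed test function

df_actual = f(X + dX) - f(X)           # true change in f
df_trace = np.trace(G.T @ dX)          # tr(G^T dX)
df_inner = np.sum(G * dX)              # <G, dX>, entry-wise inner product
```

`df_trace` and `df_inner` agree up to rounding, and both differ from `df_actual` only by the second-order term $\|d\boldsymbol{X}\|^2$.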

Example 2: Let $f=\boldsymbol{a}^T\boldsymbol{X}\boldsymbol{b}$. Use the differential method to find $\frac{\partial f}{\partial \boldsymbol{X}}$, where $\boldsymbol{a}$ is an $m \times 1$ column vector, $\boldsymbol{X}$ is an $m \times n$ matrix, $\boldsymbol{b}$ is an $n \times 1$ column vector, and $f$ is a scalar.

Solution: (1) Compute $df$. Since $\boldsymbol{a}$ and $\boldsymbol{b}$ are constants, $d\boldsymbol{a}=\boldsymbol{0}$ and $d\boldsymbol{b}=\boldsymbol{0}$, so
$$df=d\boldsymbol{a}^T\boldsymbol{X}\boldsymbol{b}+\boldsymbol{a}^Td\boldsymbol{X}\boldsymbol{b}+\boldsymbol{a}^T\boldsymbol{X}d\boldsymbol{b}=\boldsymbol{a}^Td\boldsymbol{X}\boldsymbol{b}$$
(2) Wrap $df$ in a trace and simplify, using $tr(\boldsymbol{A}\boldsymbol{B})=tr(\boldsymbol{B}\boldsymbol{A})$:
$$df=tr(df)=tr(\boldsymbol{a}^Td\boldsymbol{X}\boldsymbol{b})=tr(\boldsymbol{b}\boldsymbol{a}^Td\boldsymbol{X})=tr((\boldsymbol{a}\boldsymbol{b}^T)^Td\boldsymbol{X})$$
(3) Comparing with $df=tr\bigg(\frac{\partial f}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\bigg)$ gives
$$\frac{\partial f}{\partial \boldsymbol{X}}=\boldsymbol{a}\boldsymbol{b}^T$$


Example 3: Let $f=\boldsymbol{a}^Texp(\boldsymbol{X}\boldsymbol{b})$. Find $\frac{\partial f}{\partial \boldsymbol{X}}$, where $\boldsymbol{a}$ is an $m \times 1$ column vector, $\boldsymbol{X}$ is an $m \times n$ matrix, $\boldsymbol{b}$ is an $n \times 1$ column vector, $exp$ is the element-wise exponential, and $f$ is a scalar.

Solution: (1) Compute $df$. For an element-wise function, $d\,exp(\boldsymbol{u})=exp(\boldsymbol{u})*d\boldsymbol{u}$, so
$$df=\boldsymbol{a}^T(exp(\boldsymbol{X}\boldsymbol{b})*(d\boldsymbol{X}\boldsymbol{b}))$$
(2) Wrap $df$ in a trace and simplify, using $tr(\boldsymbol{A}^T(\boldsymbol{B}*\boldsymbol{C}))=tr((\boldsymbol{A}*\boldsymbol{B})^T\boldsymbol{C})$ and then $tr(\boldsymbol{A}\boldsymbol{B})=tr(\boldsymbol{B}\boldsymbol{A})$:
$$\begin{aligned} df & =tr(df)=tr(\boldsymbol{a}^T(exp(\boldsymbol{X}\boldsymbol{b})*(d\boldsymbol{X}\boldsymbol{b}))) \\ & = tr((\boldsymbol{a}*exp(\boldsymbol{X}\boldsymbol{b}))^Td\boldsymbol{X}\boldsymbol{b}) \\ & = tr(\boldsymbol{b}(\boldsymbol{a}*exp(\boldsymbol{X}\boldsymbol{b}))^Td\boldsymbol{X}) \\ & = tr(((\boldsymbol{a}*exp(\boldsymbol{X}\boldsymbol{b}))\boldsymbol{b}^T)^Td\boldsymbol{X}) \end{aligned}$$
(3) Comparing with $df=tr\bigg(\frac{\partial f}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\bigg)$ gives
$$\frac{\partial f}{\partial \boldsymbol{X}}=(\boldsymbol{a}*exp(\boldsymbol{X}\boldsymbol{b}))\boldsymbol{b}^T$$
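A finite-difference check is a cheap way to catch a misplaced transpose in a result such as $(\boldsymbol{a}*exp(\boldsymbol{X}\boldsymbol{b}))\boldsymbol{b}^T$. The following sketch assumes NumPy, random test data, and a hypothetical helper `numerical_grad` introduced only for this illustration.

```python
import numpy as np

def numerical_grad(f, X, eps=1e-6):
    """Hypothetical helper: central-difference estimate of df/dX, shaped like X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
m, n = 3, 4
a = rng.normal(size=(m, 1))
X = 0.5 * rng.normal(size=(m, n))   # scaled down to keep exp() moderate
b = rng.normal(size=(n, 1))

def f(X):
    """f(X) = a^T exp(X b) with an element-wise exponential."""
    return (a.T @ np.exp(X @ b)).item()

# Closed form: (a * exp(X b)) b^T, where * is the Hadamard product.
G_closed = (a * np.exp(X @ b)) @ b.T
max_err = np.max(np.abs(numerical_grad(f, X) - G_closed))
```

In NumPy, `*` between two arrays of the same shape is the Hadamard product and `@` is the matrix product, which matches the notation used here.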


Example 4: Let $l=|| \boldsymbol{X}\boldsymbol{w}-\boldsymbol{y} ||^2$. Find the least-squares estimate of $\boldsymbol{w}$, that is, the zero of $\frac{\partial l}{\partial \boldsymbol{w}}$, where $\boldsymbol{y}$ is an $m \times 1$ column vector, $\boldsymbol{X}$ is an $m \times n$ matrix, $\boldsymbol{w}$ is an $n \times 1$ column vector, and $l$ is a scalar.

Solution: (1) Compute $dl$.

Write the squared norm as an inner product:
$$l=|| \boldsymbol{X}\boldsymbol{w}-\boldsymbol{y} ||^2=(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})^T(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})$$
Then
$$dl=(\boldsymbol{X}d\boldsymbol{w})^T(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})+(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})^T(\boldsymbol{X}d\boldsymbol{w})$$
(2) Wrap $dl$ in a trace and simplify. The two terms are scalars that are transposes of each other, hence equal:
$$\begin{aligned} dl & =tr(dl)=tr((\boldsymbol{X}d\boldsymbol{w})^T(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})+(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})^T(\boldsymbol{X}d\boldsymbol{w})) \\ & = tr(2(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})^T(\boldsymbol{X}d\boldsymbol{w})) \\ & = tr((2\boldsymbol{X}^T(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y}))^Td\boldsymbol{w}) \end{aligned}$$
(3) Comparing with $dl=tr\bigg(\frac{\partial l}{\partial \boldsymbol{w}}^Td\boldsymbol{w}\bigg)$ gives
$$\frac{\partial l}{\partial \boldsymbol{w}}=2\boldsymbol{X}^T(\boldsymbol{X}\boldsymbol{w}-\boldsymbol{y})$$
(4) Find the zero of $\frac{\partial l}{\partial \boldsymbol{w}}$.

Setting $\frac{\partial l}{\partial \boldsymbol{w}}=\boldsymbol{0}$ yields the normal equations
$$\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w}=\boldsymbol{X}^T\boldsymbol{y}$$
When $\boldsymbol{X}^T\boldsymbol{X}$ is invertible, the least-squares estimate of $\boldsymbol{w}$ is
$$\boldsymbol{w}=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$
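As an illustration of the normal equations $\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w}=\boldsymbol{X}^T\boldsymbol{y}$, the sketch below solves them on synthetic data and compares the result with NumPy's own least-squares routine. The data, the noise level, and the name `w_true` are assumptions made for the example; the code also assumes $\boldsymbol{X}^T\boldsymbol{X}$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 3
X = rng.normal(size=(m, n))
w_true = np.array([[1.0], [-2.0], [0.5]])        # assumed ground truth
y = X @ w_true + 0.01 * rng.normal(size=(m, 1))  # noisy observations

# Solve X^T X w = X^T y directly; no explicit inverse is formed.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient 2 X^T (X w - y) should vanish at the solution.
grad_at_solution = 2 * X.T @ (X @ w_normal - y)
```

Using `np.linalg.solve` rather than computing $(\boldsymbol{X}^T\boldsymbol{X})^{-1}$ explicitly is a common numerical choice; for ill-conditioned $\boldsymbol{X}$, `np.linalg.lstsq` is usually the safer option.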


Example 5: Let $l=-\boldsymbol{y}^Tln(softmax(\boldsymbol{W}\boldsymbol{x}))$. Find $\frac{\partial l}{\partial \boldsymbol{W}}$, where $\boldsymbol{y}$ is an $m \times 1$ column vector with one entry equal to $1$ and all others $0$, $\boldsymbol{W}$ is an $m \times n$ matrix, $\boldsymbol{x}$ is an $n \times 1$ column vector, and $l$ is a scalar. Here $ln$ is the element-wise natural logarithm and $softmax(\boldsymbol{a})=\frac{exp(\boldsymbol{a})}{\boldsymbol{1}^Texp(\boldsymbol{a})}$, where $exp(\boldsymbol{a})$ is the element-wise exponential and $\boldsymbol{1}$ is the all-ones column vector.

Solution: (1) Compute $dl$.

Using $ln(\boldsymbol{u}/c)=ln(\boldsymbol{u})-\boldsymbol{1}ln(c)$ and $\boldsymbol{y}^T\boldsymbol{1}=1$:
$$\begin{aligned} l & =-\boldsymbol{y}^Tln(exp(\boldsymbol{W}\boldsymbol{x}))+\boldsymbol{y}^T\boldsymbol{1}ln(\boldsymbol{1}^Texp(\boldsymbol{W}\boldsymbol{x})) \\ & = -\boldsymbol{y}^T\boldsymbol{W}\boldsymbol{x}+ln(\boldsymbol{1}^Texp(\boldsymbol{W}\boldsymbol{x})) \end{aligned}$$
Therefore, using $\boldsymbol{1}^T(\boldsymbol{u}*\boldsymbol{v})=\boldsymbol{u}^T\boldsymbol{v}$ in the second step:
$$\begin{aligned} dl & = -\boldsymbol{y}^Td\boldsymbol{W}\boldsymbol{x}+\frac{\boldsymbol{1}^T(exp(\boldsymbol{W}\boldsymbol{x})*(d\boldsymbol{W}\boldsymbol{x}))}{\boldsymbol{1}^Texp(\boldsymbol{W}\boldsymbol{x})} \\ & = -\boldsymbol{y}^Td\boldsymbol{W}\boldsymbol{x}+\frac{exp(\boldsymbol{W}\boldsymbol{x})^T(d\boldsymbol{W}\boldsymbol{x})}{\boldsymbol{1}^Texp(\boldsymbol{W}\boldsymbol{x})} \\ & = (softmax(\boldsymbol{W}\boldsymbol{x})-\boldsymbol{y})^Td\boldsymbol{W}\boldsymbol{x} \end{aligned}$$
(2) Wrap $dl$ in a trace and simplify:
$$\begin{aligned} dl & = tr(dl)=tr((softmax(\boldsymbol{W}\boldsymbol{x})-\boldsymbol{y})^Td\boldsymbol{W}\boldsymbol{x}) \\ & = tr(\boldsymbol{x}(softmax(\boldsymbol{W}\boldsymbol{x})-\boldsymbol{y})^Td\boldsymbol{W}) \\ & = tr(((softmax(\boldsymbol{W}\boldsymbol{x})-\boldsymbol{y})\boldsymbol{x}^T)^Td\boldsymbol{W}) \end{aligned}$$
(3) Comparing with $dl=tr\bigg(\frac{\partial l}{\partial \boldsymbol{W}}^Td\boldsymbol{W}\bigg)$ gives
$$\frac{\partial l}{\partial \boldsymbol{W}}=(softmax(\boldsymbol{W}\boldsymbol{x})-\boldsymbol{y})\boldsymbol{x}^T$$
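The gradient $(softmax(\boldsymbol{W}\boldsymbol{x})-\boldsymbol{y})\boldsymbol{x}^T$ is the familiar "prediction minus label" form of the softmax cross-entropy loss. The sketch below checks it by finite differences. It assumes NumPy and random test data, and its `softmax` subtracts the maximum before exponentiating, a standard stabilization that leaves the value unchanged; `numerical_grad` is a hypothetical helper for this example.

```python
import numpy as np

def softmax(a):
    """Softmax of a column vector; subtracting max(a) only improves stability."""
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def numerical_grad(f, X, eps=1e-6):
    """Hypothetical helper: central-difference estimate of df/dX, shaped like X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(4)
m, n = 4, 3
W = rng.normal(size=(m, n))
x = rng.normal(size=(n, 1))
y = np.zeros((m, 1))
y[1] = 1.0                         # one-hot label (class index chosen arbitrarily)

def loss(W):
    """l(W) = -y^T ln(softmax(W x))."""
    return (-y.T @ np.log(softmax(W @ x))).item()

G_closed = (softmax(W @ x) - y) @ x.T      # shape (m, n), same as W
max_err = np.max(np.abs(numerical_grad(loss, W) - G_closed))
```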


Example 6: Find the derivative of $tr(\boldsymbol{A}\boldsymbol{B})$ with respect to the matrix $\boldsymbol{A}$, where $\boldsymbol{A}$ and $\boldsymbol{B}^T$ have the same shape.

Solution: (1) Let $f=tr(\boldsymbol{A}\boldsymbol{B})$ and compute $df$:
$$df=d(tr(\boldsymbol{A}\boldsymbol{B}))=tr(d(\boldsymbol{A}\boldsymbol{B}))=tr((d\boldsymbol{A})\boldsymbol{B})=tr(\boldsymbol{B}d\boldsymbol{A})$$
(2) Comparing with $df=tr\bigg(\frac{\partial f}{\partial \boldsymbol{A}}^Td\boldsymbol{A}\bigg)$ gives
$$\frac{\partial f}{\partial \boldsymbol{A}}=\boldsymbol{B}^T$$


III. Derivative of a vector with respect to a vector

1. By definition

Let $\boldsymbol{y}$ be an $m \times 1$ column vector and $\boldsymbol{x}$ an $n \times 1$ column vector. The derivative of $\boldsymbol{y}$ with respect to $\boldsymbol{x}$ (denominator layout) is the $n \times m$ matrix
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}= \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \dots & \frac{\partial y_m}{\partial x_n} \\ \end{pmatrix}$$

2. The differential method

The derivative of a column vector $\boldsymbol{f}$ with respect to a column vector $\boldsymbol{x}$ (denominator layout) is tied to the differential by
$$d\boldsymbol{f}=\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}^Td\boldsymbol{x}$$
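To make the denominator layout concrete, the sketch below builds the $n \times m$ derivative numerically, with entry $(i,j)$ equal to $\frac{\partial y_j}{\partial x_i}$, for the linear map $\boldsymbol{y}=\boldsymbol{A}\boldsymbol{x}$. This test function is an assumption chosen because its denominator-layout derivative is simply $\boldsymbol{A}^T$. The helper name `denominator_layout_jacobian` is hypothetical.

```python
import numpy as np

def denominator_layout_jacobian(func, x, eps=1e-6):
    """Estimate dy/dx in denominator layout: entry (i, j) is d y_j / d x_i."""
    n = x.shape[0]
    m = func(x).shape[0]
    D = np.zeros((n, m))
    for i in range(n):
        e = np.zeros_like(x)
        e[i] = eps
        # Row i holds the derivative of every y_j with respect to x_i.
        D[i, :] = ((func(x + e) - func(x - e)) / (2 * eps)).ravel()
    return D

rng = np.random.default_rng(5)
m, n = 2, 3
A = rng.normal(size=(m, n))
x = rng.normal(size=(n, 1))

def func(x):
    """Assumed test map y = A x."""
    return A @ x

D = denominator_layout_jacobian(func, x)   # shape (n, m), expected to equal A^T

# Differential relation dy = (dy/dx)^T dx for a small step dx.
dx = 1e-3 * rng.normal(size=(n, 1))
dy = func(x + dx) - func(x)
```

The transpose `D.T` is the usual $m \times n$ Jacobian of the numerator layout, which is why the differential relation carries a transpose in this convention.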


IV. Chain rules

Sometimes an explicit chain rule is not needed, because substituting one differential into another is enough, as in the following example:

Example 7: Let $f=tr(\boldsymbol{Y}^T\boldsymbol{M}\boldsymbol{Y})$ with $\boldsymbol{Y}=\sigma(\boldsymbol{W}\boldsymbol{X})$. Find $\frac{\partial f}{\partial \boldsymbol{X}}$, where $\boldsymbol{W}$ is an $l \times m$ matrix, $\boldsymbol{X}$ is an $m \times n$ matrix, $\boldsymbol{Y}$ is an $l \times n$ matrix, $\boldsymbol{M}$ is an $l \times l$ symmetric matrix, $\sigma$ is an element-wise function, and $f$ is a scalar.

Solution: First find $\frac{\partial f}{\partial \boldsymbol{Y}}$.

(1) Compute $df$ with $\boldsymbol{Y}$ as the independent variable:
$$\begin{aligned} df & = tr((d\boldsymbol{Y})^T\boldsymbol{M}\boldsymbol{Y})+tr(\boldsymbol{Y}^T\boldsymbol{M}d\boldsymbol{Y}) \\ & = tr(\boldsymbol{Y}^T\boldsymbol{M}^Td\boldsymbol{Y})+tr(\boldsymbol{Y}^T\boldsymbol{M}d\boldsymbol{Y}) \\ & = tr(\boldsymbol{Y}^T(\boldsymbol{M}+\boldsymbol{M}^T)d\boldsymbol{Y}) \end{aligned}$$
(2) Comparing with $df=tr\bigg(\frac{\partial f}{\partial \boldsymbol{Y}}^Td\boldsymbol{Y}\bigg)$ and using the symmetry of $\boldsymbol{M}$ gives
$$\frac{\partial f}{\partial \boldsymbol{Y}}=(\boldsymbol{M}^T+\boldsymbol{M})\boldsymbol{Y}=2\boldsymbol{M}\boldsymbol{Y}$$
Next find $\frac{\partial f}{\partial \boldsymbol{X}}$.

(3) Compute $df$ with $\boldsymbol{X}$ as the independent variable, substituting $d\boldsymbol{Y}=\sigma'(\boldsymbol{W}\boldsymbol{X})*(\boldsymbol{W}d\boldsymbol{X})$:
$$\begin{aligned} df & = tr\bigg(\frac{\partial f}{\partial \boldsymbol{Y}}^Td\boldsymbol{Y}\bigg) \\ & = tr\bigg(\frac{\partial f}{\partial \boldsymbol{Y}}^T(\sigma'(\boldsymbol{W}\boldsymbol{X})*(\boldsymbol{W}d\boldsymbol{X}))\bigg) \\ & = tr\Bigg(\bigg(\frac{\partial f}{\partial \boldsymbol{Y}}*\sigma'(\boldsymbol{W}\boldsymbol{X})\bigg)^T\boldsymbol{W}d\boldsymbol{X}\Bigg) \end{aligned}$$
(4) Comparing with $df=tr\bigg(\frac{\partial f}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\bigg)$ gives
$$\frac{\partial f}{\partial \boldsymbol{X}} =\boldsymbol{W}^T\bigg(\frac{\partial f}{\partial \boldsymbol{Y}}*\sigma'(\boldsymbol{W}\boldsymbol{X})\bigg) =\boldsymbol{W}^T((2\boldsymbol{M}\sigma(\boldsymbol{W}\boldsymbol{X}))*\sigma'(\boldsymbol{W}\boldsymbol{X}))$$
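The closed form $\boldsymbol{W}^T((2\boldsymbol{M}\sigma(\boldsymbol{W}\boldsymbol{X}))*\sigma'(\boldsymbol{W}\boldsymbol{X}))$ holds for any differentiable element-wise $\sigma$. The sketch below checks it for one concrete choice, $\sigma=\tanh$ with $\sigma'=1-\tanh^2$, which is an assumption of this example and not fixed by the text. The sizes, random data, and the helper `numerical_grad` are likewise illustrative.

```python
import numpy as np

def numerical_grad(f, X, eps=1e-6):
    """Hypothetical helper: central-difference estimate of df/dX, shaped like X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(6)
rows, m, n = 3, 4, 2                 # W is rows x m, X is m x n
W = rng.normal(size=(rows, m))
X = rng.normal(size=(m, n))
M0 = rng.normal(size=(rows, rows))
M = (M0 + M0.T) / 2                  # symmetric, as the derivation requires

def f(X):
    """f = tr(Y^T M Y) with Y = tanh(W X)."""
    Y = np.tanh(W @ X)
    return np.trace(Y.T @ M @ Y)

Y = np.tanh(W @ X)
sigma_prime = 1.0 - Y ** 2           # derivative of tanh, element-wise
G_closed = W.T @ ((2 * M @ Y) * sigma_prime)
max_err = np.max(np.abs(numerical_grad(f, X) - G_closed))
```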


Often, however, the dependent variable depends on the independent variable through a long chain of intermediate quantities, and the computation becomes cumbersome without a chain rule.

1. Chain rule for vector-by-vector derivatives

Suppose the vectors $\boldsymbol{x}(m \times 1),\boldsymbol{y}(n \times 1),\boldsymbol{z}(p \times 1)$ depend on one another as follows:
$$\boldsymbol{x}\rightarrow\boldsymbol{y}\rightarrow\boldsymbol{z}$$
Then the chain rule (in denominator layout) reads
$$\frac{\partial \boldsymbol{z}}{\partial\boldsymbol{x}}=\frac{\partial \boldsymbol{y}}{\partial\boldsymbol{x}}\frac{\partial \boldsymbol{z}}{\partial\boldsymbol{y}}$$
A dimension count confirms that the factors are in the right order:

The left-hand side is an $m \times p$ matrix, and the right-hand side is the product of an $m \times n$ matrix and an $n \times p$ matrix, so the dimensions are compatible.
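In denominator layout the factors of the chain rule appear in the order $\frac{\partial \boldsymbol{y}}{\partial\boldsymbol{x}}\frac{\partial \boldsymbol{z}}{\partial\boldsymbol{y}}$, the reverse of the usual Jacobian product. The sketch below checks this on an assumed composition $\boldsymbol{y}=\tanh(\boldsymbol{A}\boldsymbol{x})$, $\boldsymbol{z}=\boldsymbol{B}\boldsymbol{y}$. The maps, sizes, and the helper `denominator_layout_jacobian` are all choices made for this illustration.

```python
import numpy as np

def denominator_layout_jacobian(func, x, eps=1e-6):
    """Estimate d func / d x in denominator layout: entry (i, j) = d func_j / d x_i."""
    n_in = x.shape[0]
    n_out = func(x).shape[0]
    D = np.zeros((n_in, n_out))
    for i in range(n_in):
        e = np.zeros_like(x)
        e[i] = eps
        D[i, :] = ((func(x + e) - func(x - e)) / (2 * eps)).ravel()
    return D

rng = np.random.default_rng(7)
m, n, p = 4, 3, 2
A = rng.normal(size=(n, m))
B = rng.normal(size=(p, n))
x = rng.normal(size=(m, 1))

def y_of_x(x):
    return np.tanh(A @ x)            # x (m x 1) -> y (n x 1)

def z_of_y(y):
    return B @ y                     # y (n x 1) -> z (p x 1)

y = y_of_x(x)
dy_dx = A.T @ np.diag(1.0 - y.ravel() ** 2)   # m x n
dz_dy = B.T                                   # n x p
dz_dx_chain = dy_dx @ dz_dy                   # m x p, chain rule

dz_dx_numeric = denominator_layout_jacobian(lambda v: z_of_y(y_of_x(v)), x)
max_err = np.max(np.abs(dz_dx_chain - dz_dx_numeric))
```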

2. Chain rule for a scalar through several vectors

Suppose the dependency is
$$\boldsymbol{y}_1\rightarrow\boldsymbol{y}_2\rightarrow\dots\rightarrow\boldsymbol{y}_n\rightarrow z$$
Then the chain rule reads
$$\frac{\partial z}{\partial\boldsymbol{y}_1}=\frac{\partial \boldsymbol{y}_2}{\partial\boldsymbol{y}_1}\frac{\partial \boldsymbol{y}_3}{\partial\boldsymbol{y}_2}\dots\frac{\partial \boldsymbol{y}_n}{\partial\boldsymbol{y}_{n-1}}\frac{\partial z}{\partial\boldsymbol{y}_n}$$

3. Chain rule for a scalar through several matrices

Suppose the dependency is
$$\boldsymbol{X}\rightarrow\boldsymbol{Y}\rightarrow z$$
Then the chain rule reads
$$\frac{\partial z}{\partial x_{ij}} =\sum_{k,l}\frac{\partial z}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}} =tr\bigg((\frac{\partial z}{\partial \boldsymbol{Y}})^T\frac{\partial \boldsymbol{Y}}{\partial x_{ij}}\bigg)$$
Matrix-by-matrix derivatives are more involved and are treated separately in the next section. The formula above only gives the chain rule for a single entry of the matrix, that is, how to obtain $\frac{\partial z}{\partial x_{ij}}$, not the whole of $\frac{\partial z}{\partial \boldsymbol{X}}$. Even so, some useful results for $\frac{\partial z}{\partial \boldsymbol{X}}$ are easy to obtain, as the following example shows.


Example 8: Let $z=f(\boldsymbol{Y})$ with $\boldsymbol{Y}=\boldsymbol{A}\boldsymbol{X}+\boldsymbol{B}$. Find $\frac{\partial z}{\partial\boldsymbol{X}}$, where $\boldsymbol{A},\boldsymbol{B},\boldsymbol{X},\boldsymbol{Y}$ are matrices and $z$ is a scalar.

Solution: (1) From the relation between the scalar-by-matrix derivative and the differential, together with $d\boldsymbol{Y}=\boldsymbol{A}d\boldsymbol{X}$:
$$dz = tr\bigg(\frac{\partial z}{\partial \boldsymbol{Y}}^Td\boldsymbol{Y}\bigg) =tr\bigg(\frac{\partial z}{\partial \boldsymbol{Y}}^T\boldsymbol{A}d\boldsymbol{X}\bigg)$$

(2) Comparing with $dz=tr\bigg(\frac{\partial z}{\partial \boldsymbol{X}}^Td\boldsymbol{X}\bigg)$ gives
$$\frac{\partial z}{\partial\boldsymbol{X}}=\boldsymbol{A}^T\frac{\partial z}{\partial \boldsymbol{Y}}$$
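The rule $\frac{\partial z}{\partial\boldsymbol{X}}=\boldsymbol{A}^T\frac{\partial z}{\partial \boldsymbol{Y}}$ is what a linear layer uses to pass gradients backwards. The sketch below checks it for an assumed scalar function $f(\boldsymbol{Y})=\sum_{k,l}\sin(y_{kl})$, whose derivative with respect to $\boldsymbol{Y}$ is $\cos(\boldsymbol{Y})$ element-wise. The choice of $f$, the sizes, and the helper `numerical_grad` are illustrative assumptions.

```python
import numpy as np

def numerical_grad(f, X, eps=1e-6):
    """Hypothetical helper: central-difference estimate of df/dX, shaped like X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(8)
p, m, n = 3, 4, 2
A = rng.normal(size=(p, m))
B = rng.normal(size=(p, n))
X = rng.normal(size=(m, n))

def z_of_X(X):
    """z = f(Y) with Y = A X + B and the assumed f(Y) = sum(sin(Y))."""
    return np.sum(np.sin(A @ X + B))

dz_dY = np.cos(A @ X + B)            # derivative of the assumed f, shape (p, n)
dz_dX = A.T @ dz_dY                  # rule from the example, shape (m, n)
max_err = np.max(np.abs(numerical_grad(z_of_X, X) - dz_dX))
```

The constant $\boldsymbol{B}$ drops out of the gradient, since $d\boldsymbol{B}=\boldsymbol{0}$.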


V. Derivative of a matrix with respect to a matrix

1. Basic method

First vectorize the two matrices $\boldsymbol{Y}(p \times q)$ and $\boldsymbol{X}(m \times n)$ column by column:
$$\begin{aligned} vec(\boldsymbol{Y}) & = [y_{11},\dots,y_{p1},y_{12},\dots,y_{p2},\dots,y_{1q},\dots,y_{pq}]^T \\ vec(\boldsymbol{X}) & = [x_{11},\dots,x_{m1},x_{12},\dots,x_{m2},\dots,x_{1n},\dots,x_{mn}]^T \end{aligned}$$
The matrix-by-matrix derivative is then defined as a vector-by-vector derivative (denominator layout):
$$\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}=\frac{\partial vec(\boldsymbol{Y})}{\partial vec(\boldsymbol{X})}$$
Here $\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}$ is an $mn \times pq$ matrix. By the relation between the vector-by-vector derivative and the differential,
$$vec(d\boldsymbol{Y})=\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}^Tvec(d\boldsymbol{X})$$
Under this definition the notation $\frac{\partial f}{\partial \boldsymbol{X}}$ for a scalar $f$ becomes ambiguous. Let $\boldsymbol{X}$ be an $m \times n$ matrix; then:

  • By the scalar-by-matrix convention, the result should be an $m \times n$ matrix.
  • By the matrix-by-matrix convention, the result should be an $mn \times 1$ matrix.

To avoid confusion, from here on $\nabla_{\boldsymbol{X}}f$ denotes the scalar-by-matrix derivative (an $m \times n$ matrix), while $\frac{\partial f}{\partial \boldsymbol{X}}=vec(\nabla_{\boldsymbol{X}}f)$.

The second derivative of a scalar with respect to a matrix, also called the Hessian matrix, is defined as
$$\nabla_{\boldsymbol{X}}^2f=\frac{\partial^2 f}{\partial \boldsymbol{X}^2}=\frac{\partial \nabla_{\boldsymbol{X}}f}{\partial \boldsymbol{X}}$$
Here $\nabla_{\boldsymbol{X}}^2f$ is an $mn \times mn$ symmetric matrix.

The differential method for the derivative of a matrix $\boldsymbol{F}$ with respect to a matrix $\boldsymbol{X}$ can be summarized as follows:

  • (1) Compute $d\boldsymbol{F}$ from the given $\boldsymbol{F}$.
  • (2) Vectorize $d\boldsymbol{F}$ into $vec(d\boldsymbol{F})$ and simplify.
  • (3) Read off $\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}$ from the relation $vec(d\boldsymbol{F})=\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}^Tvec(d\boldsymbol{X})$.

2. Chain rule

Suppose the dependency is
$$\boldsymbol{X}\rightarrow\boldsymbol{Y}\rightarrow\boldsymbol{F}$$
By the relation between the derivative and the differential,
$$vec(d\boldsymbol{F})=\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}}^Tvec(d\boldsymbol{Y}) =\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}}^T\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}^Tvec(d\boldsymbol{X})$$
Transposing the coefficient of $vec(d\boldsymbol{X})$ gives the chain rule:
$$\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}=\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}}$$


Example 9: Let $\boldsymbol{F}=\boldsymbol{A}\boldsymbol{X}$, where $\boldsymbol{X}$ is an $m \times n$ matrix. Find $\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}$.

Solution: (1) Compute $d\boldsymbol{F}$ from the given $\boldsymbol{F}$:
$$d\boldsymbol{F}=\boldsymbol{A}d\boldsymbol{X}$$
(2) Vectorize $d\boldsymbol{F}$ into $vec(d\boldsymbol{F})$ and simplify, using $vec(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})=(\boldsymbol{B}^T\otimes\boldsymbol{A})vec(\boldsymbol{X})$ with $\boldsymbol{B}=\boldsymbol{I}_n$:
$$vec(d\boldsymbol{F})=vec(\boldsymbol{A}d\boldsymbol{X})=(\boldsymbol{I}_n \otimes \boldsymbol{A})vec(d\boldsymbol{X})$$
(3) Comparing with $vec(d\boldsymbol{F})=\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}^Tvec(d\boldsymbol{X})$ gives
$$\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}=\boldsymbol{I}_n \otimes \boldsymbol{A}^T$$
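The vectorization step $vec(\boldsymbol{A}\boldsymbol{X})=(\boldsymbol{I}_n \otimes \boldsymbol{A})vec(\boldsymbol{X})$ is easy to confirm with NumPy, provided the column-wise $vec$ is implemented with Fortran (column-major) order; NumPy's default reshape is row-major. The sketch below is illustrative: the sizes, the random data, and the helper name `vec` are assumptions.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization: stack the columns of M into one long vector."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(9)
p, m, n = 2, 3, 4
A = rng.normal(size=(p, m))
X = rng.normal(size=(m, n))

lhs = vec(A @ X)                          # vec(AX), length p*n
rhs = np.kron(np.eye(n), A) @ vec(X)      # (I_n kron A) vec(X)

# Denominator-layout derivative from the example: I_n kron A^T, shape (mn, pn).
dF_dX = np.kron(np.eye(n), A.T)
```

`np.kron` implements the right Kronecker product $[a_{ij}\boldsymbol{B}]$, which is the convention adopted in this article.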


Example 10: Let $f=ln|\boldsymbol{X}|$, where $\boldsymbol{X}$ is an $n \times n$ matrix. Find $\nabla_{\boldsymbol{X}}f$ and $\nabla_{\boldsymbol{X}}^2f$.

Solution: (1) First find $\nabla_{\boldsymbol{X}}f$.

Since $df=tr(\boldsymbol{X}^{-1}d\boldsymbol{X})$, we have $\nabla_{\boldsymbol{X}}f=(\boldsymbol{X}^{-1})^T$. The problem therefore becomes: given $\boldsymbol{F}=(\boldsymbol{X}^{-1})^T$, find $\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}$.

(2) Compute $d\boldsymbol{F}$ from the given $\boldsymbol{F}$, using $d(\boldsymbol{X}^{-1})=-\boldsymbol{X}^{-1}d\boldsymbol{X}\boldsymbol{X}^{-1}$:
$$d\boldsymbol{F}=-(\boldsymbol{X}^{-1}d\boldsymbol{X}\boldsymbol{X}^{-1})^T$$
(3) Vectorize $d\boldsymbol{F}$ into $vec(d\boldsymbol{F})$ and simplify:
$$vec(d\boldsymbol{F})=-\boldsymbol{K}_{nn}vec(\boldsymbol{X}^{-1}d\boldsymbol{X}\boldsymbol{X}^{-1})=-\boldsymbol{K}_{nn}((\boldsymbol{X}^{-1})^T\otimes\boldsymbol{X}^{-1})vec(d\boldsymbol{X})$$
Here $\boldsymbol{K}_{nn}$ is the commutation matrix, which satisfies $\boldsymbol{K}_{nn}vec(\boldsymbol{A})=vec(\boldsymbol{A}^T)$ for every $n \times n$ matrix $\boldsymbol{A}$. Note that the product between $(\boldsymbol{X}^{-1})^T$ and $\boldsymbol{X}^{-1}$ is the Kronecker product, which follows from $vec(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})=(\boldsymbol{B}^T\otimes\boldsymbol{A})vec(\boldsymbol{X})$.

(4) Comparing with $vec(d\boldsymbol{F})=\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}^Tvec(d\boldsymbol{X})$, and using the symmetry of the Hessian for the first equality, gives
$$\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}=\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}^T =-\boldsymbol{K}_{nn}((\boldsymbol{X}^{-1})^T\otimes\boldsymbol{X}^{-1})$$
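The Hessian of $ln|\boldsymbol{X}|$ can be checked numerically by differentiating the gradient $(\boldsymbol{X}^{-1})^T$ entry by entry. The sketch below reads the product in the closed form as the Kronecker product $(\boldsymbol{X}^{-1})^T\otimes\boldsymbol{X}^{-1}$, as $vec(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})=(\boldsymbol{B}^T\otimes\boldsymbol{A})vec(\boldsymbol{X})$ requires. The helper `commutation_matrix`, the test matrix, and the tolerance are assumptions of this illustration.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization."""
    return M.reshape(-1, order="F")

def commutation_matrix(n):
    """K such that K @ vec(A) == vec(A.T) for any n x n matrix A."""
    K = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            # vec(A.T)[i + j*n] = A[j, i] = vec(A)[j + i*n]
            K[i + j * n, j + i * n] = 1.0
    return K

def grad(X):
    """Gradient of ln|X| with respect to X."""
    return np.linalg.inv(X).T

rng = np.random.default_rng(10)
n = 3
X = 3.0 * np.eye(n) + 0.3 * rng.normal(size=(n, n))   # comfortably invertible

X_inv = np.linalg.inv(X)
K = commutation_matrix(n)
H_closed = -K @ np.kron(X_inv.T, X_inv)

# Row a of the numeric Hessian: derivative of vec(grad) w.r.t. vec(X)[a].
eps = 1e-6
H_numeric = np.zeros((n * n, n * n))
for a in range(n * n):
    e = np.zeros(n * n)
    e[a] = eps
    E = e.reshape((n, n), order="F")
    H_numeric[a, :] = (vec(grad(X + E)) - vec(grad(X - E))) / (2 * eps)

max_err = np.max(np.abs(H_numeric - H_closed))
```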


Example 11: Let $\boldsymbol{F}=\boldsymbol{A}exp(\boldsymbol{X}\boldsymbol{B})$. Find $\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}$, where $\boldsymbol{A}$ is an $l \times m$ matrix, $\boldsymbol{X}$ is an $m \times n$ matrix, $\boldsymbol{B}$ is an $n \times p$ matrix, and $exp$ is the element-wise exponential.

Solution: (1) Compute $d\boldsymbol{F}$ from the given $\boldsymbol{F}$:
$$d\boldsymbol{F}=\boldsymbol{A}(exp(\boldsymbol{X}\boldsymbol{B})*(d\boldsymbol{X}\boldsymbol{B}))$$
(2) Vectorize $d\boldsymbol{F}$ into $vec(d\boldsymbol{F})$ and simplify:
$$\begin{aligned} vec(d\boldsymbol{F}) & = (\boldsymbol{I}_p \otimes \boldsymbol{A})vec(exp(\boldsymbol{X}\boldsymbol{B})*(d\boldsymbol{X}\boldsymbol{B})) \\ & = (\boldsymbol{I}_p \otimes \boldsymbol{A})diag(exp(\boldsymbol{X}\boldsymbol{B}))vec(d\boldsymbol{X}\boldsymbol{B}) \\ & = (\boldsymbol{I}_p \otimes \boldsymbol{A})diag(exp(\boldsymbol{X}\boldsymbol{B}))(\boldsymbol{B}^T \otimes \boldsymbol{I}_m)vec(d\boldsymbol{X}) \end{aligned}$$
Here $diag(\boldsymbol{A})$ denotes the diagonal matrix whose diagonal holds the entries of $\boldsymbol{A}$ in column-major order, so $diag(exp(\boldsymbol{X}\boldsymbol{B}))$ is $mp \times mp$.

(3) Comparing with $vec(d\boldsymbol{F})=\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}^Tvec(d\boldsymbol{X})$ gives
$$\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}=(\boldsymbol{B} \otimes \boldsymbol{I}_m)diag(exp(\boldsymbol{X}\boldsymbol{B}))(\boldsymbol{I}_p \otimes \boldsymbol{A}^T)$$
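The three-factor result $(\boldsymbol{B} \otimes \boldsymbol{I}_m)diag(exp(\boldsymbol{X}\boldsymbol{B}))(\boldsymbol{I}_p \otimes \boldsymbol{A}^T)$ has shape $mn \times lp$, and the shapes are a useful first check. The sketch below builds it with `np.kron` and compares it with a finite-difference estimate of the denominator-layout derivative. The helper `matrix_derivative`, the variable `rows` (standing in for $l$), the sizes, and the random data are assumptions made for this example.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization."""
    return M.reshape(-1, order="F")

def matrix_derivative(func, X, eps=1e-6):
    """Denominator layout: row a holds d vec(F) / d vec(X)[a] (central differences)."""
    v = vec(X)
    rows_out = []
    for a in range(v.size):
        e = np.zeros_like(v)
        e[a] = eps
        Xp = (v + e).reshape(X.shape, order="F")
        Xm = (v - e).reshape(X.shape, order="F")
        rows_out.append((vec(func(Xp)) - vec(func(Xm))) / (2 * eps))
    return np.array(rows_out)

rng = np.random.default_rng(11)
rows, m, n, p = 2, 3, 2, 4           # A: rows x m, X: m x n, B: n x p
A = rng.normal(size=(rows, m))
X = 0.5 * rng.normal(size=(m, n))
B = rng.normal(size=(n, p))

def F(X):
    """F = A exp(X B) with an element-wise exponential."""
    return A @ np.exp(X @ B)

D_closed = (np.kron(B, np.eye(m))
            @ np.diag(vec(np.exp(X @ B)))
            @ np.kron(np.eye(p), A.T))          # shape (m*n, rows*p)
max_err = np.max(np.abs(matrix_derivative(F, X) - D_closed))
```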


Example 12: Let $l=-y\boldsymbol{x}^T\boldsymbol{w}+ln(1+exp(\boldsymbol{x}^T\boldsymbol{w}))$. Find $\nabla_{\boldsymbol{w}}l$ and $\nabla_{\boldsymbol{w}}^2l$, where $y$ is a scalar equal to $0$ or $1$, and $\boldsymbol{x},\boldsymbol{w}$ are $n \times 1$ column vectors.

Solution: (1) $\nabla_{\boldsymbol{w}}l$ is a scalar-by-vector derivative. The differential is
$$dl=-y\boldsymbol{x}^Td\boldsymbol{w}+(1+exp(\boldsymbol{x}^T\boldsymbol{w}))^{-1}exp(\boldsymbol{x}^T\boldsymbol{w})\,\boldsymbol{x}^Td\boldsymbol{w}=(\sigma(\boldsymbol{x}^T\boldsymbol{w})-y)\boldsymbol{x}^Td\boldsymbol{w}$$
Therefore
$$\nabla_{\boldsymbol{w}}l=\boldsymbol{x}(\sigma(\boldsymbol{x}^T\boldsymbol{w})-y)$$
where $\sigma(a)=\frac{exp(a)}{1+exp(a)}$ is the sigmoid function.

(2) $\nabla_{\boldsymbol{w}}^2l$ is a vector-by-vector derivative. The differential of the gradient is
$$d\nabla_{\boldsymbol{w}}l=\boldsymbol{x}\sigma'(\boldsymbol{x}^T\boldsymbol{w})\boldsymbol{x}^Td\boldsymbol{w}$$
Therefore
$$\nabla_{\boldsymbol{w}}^2l=\boldsymbol{x}\sigma'(\boldsymbol{x}^T\boldsymbol{w})\boldsymbol{x}^T$$


Example 13: Let $l=\sum_{i=1}^{N}(-y_i\boldsymbol{x}_i^T\boldsymbol{w}+ln(1+exp(\boldsymbol{x}_i^T\boldsymbol{w})))$. Find $\nabla_{\boldsymbol{w}}l$ and $\nabla_{\boldsymbol{w}}^2l$, where each $y_i$ is a scalar and $\boldsymbol{x}_i,\boldsymbol{w}$ are $n \times 1$ column vectors.

Solution: (1) Find $\nabla_{\boldsymbol{w}}l$ (a scalar-by-vector derivative).

Define the $N \times n$ matrix $\boldsymbol{X}=[\boldsymbol{x}_1,\dots,\boldsymbol{x}_N]^T$, whose $i$-th row is $\boldsymbol{x}_i^T$, and the vector $\boldsymbol{y}=[y_1,\dots,y_N]^T$. Then $l$ can be written in matrix form:
$$l=-\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{w}+\boldsymbol{1}^Tln(\boldsymbol{1}+exp(\boldsymbol{X}\boldsymbol{w}))$$
where $\boldsymbol{1}$ is the all-ones $N \times 1$ column vector.

The differential of $l$ is (the fraction is taken element-wise):
$$\begin{aligned} dl & = -\boldsymbol{y}^T\boldsymbol{X}d\boldsymbol{w}+\boldsymbol{1}^T\bigg(\frac{1}{\boldsymbol{1}+exp(\boldsymbol{X}\boldsymbol{w})}*exp(\boldsymbol{X}\boldsymbol{w})*(\boldsymbol{X}d\boldsymbol{w})\bigg) \\ & = -\boldsymbol{y}^T\boldsymbol{X}d\boldsymbol{w}+\boldsymbol{1}^T(\sigma(\boldsymbol{X}\boldsymbol{w})*(\boldsymbol{X}d\boldsymbol{w})) \\ & = -\boldsymbol{y}^T\boldsymbol{X}d\boldsymbol{w}+(\boldsymbol{1}*\sigma(\boldsymbol{X}\boldsymbol{w}))^T\boldsymbol{X}d\boldsymbol{w} \\ & = -\boldsymbol{y}^T\boldsymbol{X}d\boldsymbol{w}+\sigma(\boldsymbol{X}\boldsymbol{w})^T\boldsymbol{X}d\boldsymbol{w} \end{aligned}$$
Hence
$$\nabla_{\boldsymbol{w}}l=\boldsymbol{X}^T(\sigma(\boldsymbol{X}\boldsymbol{w})-\boldsymbol{y})$$
(2) Find $\nabla_{\boldsymbol{w}}^2l$ (a vector-by-vector derivative).

The differential of $\nabla_{\boldsymbol{w}}l$ is
$$d\nabla_{\boldsymbol{w}}l=\boldsymbol{X}^T(\sigma'(\boldsymbol{X}\boldsymbol{w})*(\boldsymbol{X}d\boldsymbol{w}))=\boldsymbol{X}^Tdiag(\sigma'(\boldsymbol{X}\boldsymbol{w}))\boldsymbol{X}d\boldsymbol{w}$$
Therefore
$$\nabla_{\boldsymbol{w}}^2l=\boldsymbol{X}^Tdiag(\sigma'(\boldsymbol{X}\boldsymbol{w}))\boldsymbol{X}$$
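The gradient $\boldsymbol{X}^T(\sigma(\boldsymbol{X}\boldsymbol{w})-\boldsymbol{y})$ and Hessian $\boldsymbol{X}^Tdiag(\sigma'(\boldsymbol{X}\boldsymbol{w}))\boldsymbol{X}$ are the quantities used by gradient descent and Newton's method for logistic regression. The sketch below checks both by finite differences on synthetic data. It assumes NumPy, uses $\sigma'(a)=\sigma(a)(1-\sigma(a))$, and evaluates $ln(1+exp(\cdot))$ with `np.logaddexp` for numerical stability; the data and sizes are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(12)
N, n = 20, 3
X = rng.normal(size=(N, n))                           # row i is x_i^T
y = (rng.random(size=(N, 1)) < 0.5).astype(float)     # random 0/1 labels
w = rng.normal(size=(n, 1))

def loss(w):
    """l = -y^T X w + 1^T ln(1 + exp(X w)); logaddexp(0, a) = ln(1 + exp(a))."""
    return (-y.T @ X @ w + np.sum(np.logaddexp(0.0, X @ w))).item()

def grad(w):
    return X.T @ (sigmoid(X @ w) - y)

s = sigmoid(X @ w)
hess = X.T @ np.diag((s * (1.0 - s)).ravel()) @ X     # X^T diag(sigma') X

eps = 1e-6
grad_numeric = np.zeros_like(w)
hess_numeric = np.zeros((n, n))
for j in range(n):
    e = np.zeros_like(w)
    e[j] = eps
    grad_numeric[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
    hess_numeric[:, j] = ((grad(w + e) - grad(w - e)) / (2 * eps)).ravel()

grad_err = np.max(np.abs(grad_numeric - grad(w)))
hess_err = np.max(np.abs(hess_numeric - hess))
```

Because $\sigma'>0$, the Hessian is positive semidefinite, which is the reason this loss is convex in $\boldsymbol{w}$.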


Appendix

1. The trace of a matrix

1.1 Definition of the trace

The sum of the diagonal entries of an $n \times n$ matrix $\boldsymbol{A}$ is called the trace of $\boldsymbol{A}$, written $tr(\boldsymbol{A})$. The trace is not defined for non-square matrices.

1.2 Identities involving the trace

(1) If $\boldsymbol{A}$ and $\boldsymbol{B}$ are both $n \times n$ matrices, then $tr(\boldsymbol{A} \pm \boldsymbol{B})=tr(\boldsymbol{A}) \pm tr(\boldsymbol{B})$

(2) If $\boldsymbol{A}$ and $\boldsymbol{B}$ are both $n \times n$ matrices and $c_1$, $c_2$ are constants, then $tr(c_1\boldsymbol{A} \pm c_2\boldsymbol{B})=c_1tr(\boldsymbol{A}) \pm c_2tr(\boldsymbol{B})$

(3) $tr(\boldsymbol{A}^T)=tr(\boldsymbol{A})$

(4) If $\boldsymbol{A}$ and $\boldsymbol{B}^T$ have the same shape, then $tr(\boldsymbol{A}\boldsymbol{B})=tr(\boldsymbol{B}\boldsymbol{A})$

(5) If $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C}$ are all $m \times n$ matrices, then $tr(\boldsymbol{A}^T(\boldsymbol{B}*\boldsymbol{C}))=tr((\boldsymbol{A}*\boldsymbol{B})^T\boldsymbol{C})$

(6) If $\boldsymbol{A}$ and $\boldsymbol{B}$ are both square, then $tr(\boldsymbol{A}\otimes\boldsymbol{B})=tr(\boldsymbol{A})tr(\boldsymbol{B})$

(7) If $\boldsymbol{A}$ and $\boldsymbol{B}$ have the same shape, then $tr(\boldsymbol{A}^T\boldsymbol{B})=(vec(\boldsymbol{A}))^Tvec(\boldsymbol{B})=\langle \boldsymbol{A},\boldsymbol{B} \rangle$
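Identities (4) to (7) are the ones used most often when simplifying $tr(df)$, and they are quick to spot-check on random matrices. The sketch below is an illustration only; the shapes, random data, and variable names are assumptions.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(13)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))      # B^T has the same shape as A
C = rng.normal(size=(3, 4))
D = rng.normal(size=(3, 4))
P = rng.normal(size=(2, 2))
Q = rng.normal(size=(3, 3))

# (4) tr(AB) = tr(BA)
id4 = np.isclose(np.trace(A @ B), np.trace(B @ A))

# (5) tr(A^T (C * D)) = tr((A * C)^T D), with * the Hadamard product
id5 = np.isclose(np.trace(A.T @ (C * D)), np.trace((A * C).T @ D))

# (6) tr(P kron Q) = tr(P) tr(Q)
id6 = np.isclose(np.trace(np.kron(P, Q)), np.trace(P) * np.trace(Q))

# (7) tr(A^T C) = vec(A)^T vec(C) = sum of entry-wise products
id7 = (np.isclose(np.trace(A.T @ C), vec(A) @ vec(C))
       and np.isclose(np.trace(A.T @ C), np.sum(A * C)))
```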

2. The Hadamard Product

2.1 Definition of the Hadamard Product

The Hadamard product of an $m \times n$ matrix $\boldsymbol{A}=[a_{ij}]$ and an $m \times n$ matrix $\boldsymbol{B}=[b_{ij}]$ is written $\boldsymbol{A}*\boldsymbol{B}$. It is again an $m \times n$ matrix, whose entries are the products of the corresponding entries of the two matrices:
$$(\boldsymbol{A}*\boldsymbol{B})_{ij}=a_{ij}b_{ij}$$
The Hadamard product is also called the Schur product or the elementwise product.

2.2 Properties of the Hadamard Product

(1) If $\boldsymbol{A}$ and $\boldsymbol{B}$ are both $m \times n$ matrices, then $(\boldsymbol{A}*\boldsymbol{B})^T=(\boldsymbol{A}^T*\boldsymbol{B}^T)$.

(2) If $c$ is a constant, then $c(\boldsymbol{A}*\boldsymbol{B})=(c\boldsymbol{A})*\boldsymbol{B}=\boldsymbol{A}*(c\boldsymbol{B})$.

(3) If $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C},\boldsymbol{D}$ are all $m \times n$ matrices, then $(\boldsymbol{A}+\boldsymbol{B})*(\boldsymbol{C}+\boldsymbol{D})=\boldsymbol{A}*\boldsymbol{C}+\boldsymbol{A}*\boldsymbol{D}+\boldsymbol{B}*\boldsymbol{C}+\boldsymbol{B}*\boldsymbol{D}$.
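In NumPy the Hadamard product of two equally shaped arrays is the `*` operator (the ordinary matrix product is `@`). A short sketch of the definition and the three properties, with arbitrarily chosen shapes and constant:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C, D = (rng.normal(size=(2, 3)) for _ in range(4))
c = 2.5  # an arbitrary constant

H = A * B  # Hadamard product: entry (i, j) is a_ij * b_ij
entrywise_ok = all(
    np.isclose(H[i, j], A[i, j] * B[i, j]) for i in range(2) for j in range(3)
)
transpose_ok = np.allclose((A * B).T, A.T * B.T)                                   # property (1)
scalar_ok = np.allclose(c * (A * B), (c * A) * B) and np.allclose((c * A) * B, A * (c * B))  # (2)
distributive_ok = np.allclose((A + B) * (C + D), A * C + A * D + B * C + B * D)    # property (3)
print(entrywise_ok, transpose_ok, scalar_ok, distributive_ok)
```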

3. The Kronecker Product

3.1 Definition of the Kronecker Product

The Kronecker product of two matrices comes in two forms, the right Kronecker product and the left Kronecker product.

(1) Right Kronecker product: the right Kronecker product $\boldsymbol{A}\otimes\boldsymbol{B}$ of an $m \times n$ matrix $\boldsymbol{A}$ and a $p \times q$ matrix $\boldsymbol{B}$ is the $mp \times nq$ matrix defined by
$$\boldsymbol{A}\otimes\boldsymbol{B}=[a_{ij}\boldsymbol{B}]_{i=1,j=1}^{m,n}$$
(2) Left Kronecker product: the left Kronecker product $\boldsymbol{A}\otimes\boldsymbol{B}$ of an $m \times n$ matrix $\boldsymbol{A}$ and a $p \times q$ matrix $\boldsymbol{B}$ is the $mp \times nq$ matrix defined by
$$\boldsymbol{A}\otimes\boldsymbol{B}=[\boldsymbol{A}b_{ij}]_{i=1,j=1}^{p,q}$$
In other words, the left Kronecker product of $\boldsymbol{A}$ and $\boldsymbol{B}$ coincides with the right Kronecker product of $\boldsymbol{B}$ and $\boldsymbol{A}$.

The right Kronecker product is the more common convention. To avoid confusion, this article uses the right Kronecker product throughout and refers to it simply as the Kronecker product (tensor product).
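The two block definitions can be written out directly and compared with NumPy's `np.kron`, which implements the right Kronecker product. The helper names `right_kron` and `left_kron` and the small integer matrices are illustrative choices, not part of the original text.

```python
import numpy as np

def right_kron(A, B):
    # [a_ij B]: an m x n grid of blocks, block (i, j) is a_ij * B
    m, n = A.shape
    return np.block([[A[i, j] * B for j in range(n)] for i in range(m)])

def left_kron(A, B):
    # [A b_ij]: a p x q grid of blocks, block (i, j) is A * b_ij
    p, q = B.shape
    return np.block([[A * B[i, j] for j in range(q)] for i in range(p)])

A = np.array([[1, 2],
              [3, 4]])          # m x n = 2 x 2
B = np.array([[0, 1, 2]])       # p x q = 1 x 3

R = right_kron(A, B)            # shape (mp, nq) = (2, 6)
L = left_kron(A, B)             # same shape, different arrangement
print(R)
print(L)
print(np.array_equal(R, np.kron(A, B)), np.array_equal(L, np.kron(B, A)))
```

The last line reflects that `np.kron(A, B)` matches the right Kronecker product, while the left Kronecker product of `A` and `B` is `np.kron(B, A)`.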

3.2 Properties of the Kronecker Product

(1) If $\alpha$ and $\beta$ are constants, then $(\alpha\boldsymbol{A})\otimes(\beta\boldsymbol{B})=\alpha\beta(\boldsymbol{A}\otimes\boldsymbol{B})$.

(2) The Kronecker product of identity matrices satisfies $\boldsymbol{I}_m \otimes \boldsymbol{I}_n=\boldsymbol{I}_{mn}$.

(3) For matrices $\boldsymbol{A}_{m \times n},\boldsymbol{B}_{n \times k},\boldsymbol{C}_{l \times p},\boldsymbol{D}_{p \times q}$, we have $(\boldsymbol{AB})\otimes(\boldsymbol{CD})=(\boldsymbol{A}\otimes\boldsymbol{C})(\boldsymbol{B}\otimes\boldsymbol{D})$.

(4) For matrices $\boldsymbol{A}_{m \times n},\boldsymbol{B}_{p \times q},\boldsymbol{C}_{p \times q}$, we have
$$\begin{aligned} \boldsymbol{A}\otimes(\boldsymbol{B}\pm\boldsymbol{C}) & = \boldsymbol{A}\otimes\boldsymbol{B}\pm\boldsymbol{A}\otimes\boldsymbol{C} \\ (\boldsymbol{B}\pm\boldsymbol{C})\otimes \boldsymbol{A}& = \boldsymbol{B}\otimes\boldsymbol{A}\pm\boldsymbol{C}\otimes\boldsymbol{A} \end{aligned}$$

(5) The transpose of a Kronecker product satisfies $(\boldsymbol{A}\otimes\boldsymbol{B})^T=\boldsymbol{A}^T\otimes\boldsymbol{B}^T$.

(6) If $\boldsymbol{A}$ and $\boldsymbol{B}$ are invertible, the inverse of their Kronecker product satisfies $(\boldsymbol{A}\otimes\boldsymbol{B})^{-1}=\boldsymbol{A}^{-1}\otimes\boldsymbol{B}^{-1}$.

(7) The determinant of a Kronecker product satisfies $det(\boldsymbol{A}_{n \times n}\otimes\boldsymbol{B}_{m \times m})=(det\boldsymbol{A})^m(det\boldsymbol{B})^n$.

(8) For matrices $\boldsymbol{A}_{m \times n},\boldsymbol{B}_{m \times n},\boldsymbol{C}_{p \times q},\boldsymbol{D}_{p \times q}$, we have $(\boldsymbol{A}+\boldsymbol{B})\otimes(\boldsymbol{C}+\boldsymbol{D})=\boldsymbol{A}\otimes\boldsymbol{C}+\boldsymbol{A}\otimes\boldsymbol{D}+\boldsymbol{B}\otimes\boldsymbol{C}+\boldsymbol{B}\otimes\boldsymbol{D}$.

(9) For matrices $\boldsymbol{A}_{m \times n},\boldsymbol{B}_{p \times q},\boldsymbol{C}_{k \times l}$, we have $(\boldsymbol{A}\otimes\boldsymbol{B})\otimes\boldsymbol{C}=\boldsymbol{A}\otimes(\boldsymbol{B}\otimes\boldsymbol{C})$.

(10) For matrices $\boldsymbol{A}_{m \times n},\boldsymbol{B}_{k \times l},\boldsymbol{C}_{p \times q},\boldsymbol{D}_{r \times s}$, we have $(\boldsymbol{A}\otimes\boldsymbol{B})\otimes(\boldsymbol{C}\otimes\boldsymbol{D})=\boldsymbol{A}\otimes\boldsymbol{B}\otimes\boldsymbol{C}\otimes\boldsymbol{D}$.

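(11) For square matrices $\boldsymbol{A}_{m \times m}$ and $\boldsymbol{B}_{n \times n}$, we have $exp(\boldsymbol{A}\otimes\boldsymbol{I}_n+\boldsymbol{I}_m\otimes\boldsymbol{B})=exp(\boldsymbol{A})\otimes exp(\boldsymbol{B})$. The exponent on the left is the Kronecker sum of $\boldsymbol{A}$ and $\boldsymbol{B}$. With the Kronecker product $\boldsymbol{A}\otimes\boldsymbol{B}$ itself in the exponent the identity fails in general: for the $1 \times 1$ matrices $\boldsymbol{A}=\boldsymbol{B}=1$ it would read $e=e^2$.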

(12) For matrices $\boldsymbol{A}_{m \times n},\boldsymbol{B}_{p \times q}$, with $\boldsymbol{K}$ denoting the commutation matrix defined in Section 5.2 of this appendix, we have
$$\begin{aligned} \boldsymbol{K}_{pm}(\boldsymbol{A}\otimes\boldsymbol{B}) &= (\boldsymbol{B}\otimes\boldsymbol{A})\boldsymbol{K}_{qn} \\ \boldsymbol{K}_{pm}(\boldsymbol{A}\otimes\boldsymbol{B})\boldsymbol{K}_{nq} &= \boldsymbol{B}\otimes\boldsymbol{A} \end{aligned}$$
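Several of the Kronecker product properties can be spot-checked with `np.kron`. This is a sketch under stated assumptions: the matrices are arbitrary examples, `commutation_matrix` is a hypothetical helper that builds $\boldsymbol{K}_{mn}$ from its defining relation $\boldsymbol{K}_{mn}vec(\boldsymbol{A}_{m \times n})=vec(\boldsymbol{A}^T)$, and `expm_sym` is a hypothetical helper that only handles symmetric matrices. The exponential check uses the Kronecker sum $\boldsymbol{P}\otimes\boldsymbol{I}+\boldsymbol{I}\otimes\boldsymbol{Q}$ of square matrices, for which the identity $exp(\boldsymbol{P}\otimes\boldsymbol{I}+\boldsymbol{I}\otimes\boldsymbol{Q})=exp(\boldsymbol{P})\otimes exp(\boldsymbol{Q})$ is known to hold.

```python
import numpy as np

def commutation_matrix(m, n):
    # K_mn, defined by K_mn vec(A) = vec(A^T) for an m x n matrix A
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

def expm_sym(S):
    # Matrix exponential of a symmetric matrix via its eigendecomposition
    lam, V = np.linalg.eigh(S)
    return (V * np.exp(lam)) @ V.T

rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))
C, D = rng.normal(size=(2, 2)), rng.normal(size=(2, 3))
E = rng.normal(size=(4, 2))                        # plays the role of the p x q matrix in (12)
P = np.array([[2.0, 1.0], [1.0, 3.0]])             # symmetric, invertible, n = 2
Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])  # symmetric, invertible, m = 3
inv, det, kron = np.linalg.inv, np.linalg.det, np.kron

checks = {
    "(3) mixed product": np.allclose(kron(A @ B, C @ D), kron(A, C) @ kron(B, D)),
    "(5) transpose": np.allclose(kron(A, D).T, kron(A.T, D.T)),
    "(6) inverse": np.allclose(inv(kron(P, Q)), kron(inv(P), inv(Q))),
    "(7) determinant": np.isclose(det(kron(P, Q)), det(P) ** 3 * det(Q) ** 2),
    "(9) associativity": np.allclose(kron(kron(A, C), D), kron(A, kron(C, D))),
    "(11) exp of Kronecker sum": np.allclose(
        expm_sym(kron(P, np.eye(3)) + kron(np.eye(2), Q)), kron(expm_sym(P), expm_sym(Q))
    ),
    # (12) with A of shape m x n = 2 x 3 and E of shape p x q = 4 x 2
    "(12a) K_pm (A kron E) = (E kron A) K_qn": np.allclose(
        commutation_matrix(4, 2) @ kron(A, E), kron(E, A) @ commutation_matrix(2, 3)
    ),
    "(12b) K_pm (A kron E) K_nq = E kron A": np.allclose(
        commutation_matrix(4, 2) @ kron(A, E) @ commutation_matrix(3, 2), kron(E, A)
    ),
}
for name, ok in checks.items():
    print(name, bool(ok))
```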

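4. Permutation Matrices

4.1 Definition of a Permutation Matrix

A square matrix is called a permutation matrix if every row and every column contains exactly one nonzero entry, equal to 1, with all other entries equal to 0.

4.2 Properties of Permutation Matrices

A permutation matrix $\boldsymbol{P}$ is orthogonal, that is, $\boldsymbol{P}^T\boldsymbol{P}=\boldsymbol{P}\boldsymbol{P}^T=\boldsymbol{I}$.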
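A small illustration of the definition and of orthogonality: a permutation matrix can be obtained by reordering the rows of the identity. The particular permutation `perm` and vector `x` below are arbitrary examples.

```python
import numpy as np

perm = np.array([2, 0, 3, 1])          # a hypothetical permutation of 0..3
P = np.eye(4, dtype=int)[perm]          # row i of P is the unit vector e_{perm[i]}
x = np.array([10, 20, 30, 40])

one_per_row = bool(np.all(P.sum(axis=1) == 1))
one_per_col = bool(np.all(P.sum(axis=0) == 1))
orthogonal = np.array_equal(P.T @ P, np.eye(4, dtype=int)) and np.array_equal(
    P @ P.T, np.eye(4, dtype=int)
)
print(P @ x)                            # reorders x to x[perm]
print(one_per_row, one_per_col, orthogonal)
```

Because $\boldsymbol{P}^{-1}=\boldsymbol{P}^T$, undoing a permutation costs only a transpose.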

5. Vectorization of a Matrix

5.1 Definition of the Vectorization Operator

The (column) vectorization $vec(\boldsymbol{A})$ of an $m \times n$ matrix $\boldsymbol{A}=[a_{ij}]$ stacks the columns of $\boldsymbol{A}$ into a single $mn \times 1$ vector:
$$vec(\boldsymbol{A})=[a_{11},\dots,a_{m1},\dots,a_{1n},\dots,a_{mn}]^T$$
A matrix can also be stacked row by row into a row vector $rvec(\boldsymbol{A})$, called the row vectorization of the matrix and defined by
$$rvec(\boldsymbol{A})=[a_{11},\dots,a_{1n},\dots,a_{m1},\dots,a_{mn}]$$
The vectorization and the row vectorization are related by
$$vec(\boldsymbol{A}^T)=(rvec(\boldsymbol{A}))^T$$
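In NumPy, column stacking corresponds to reshaping in column-major (Fortran) order and row stacking to the default row-major order. A minimal sketch, with `vec` and `rvec` as illustrative helper names:

```python
import numpy as np

def vec(A):
    # Stack the columns: column-major order, result has shape (mn, 1)
    return A.reshape(-1, 1, order="F")

def rvec(A):
    # Stack the rows into one row: row-major order, result has shape (1, mn)
    return A.reshape(1, -1)

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(vec(A).ravel())    # [1 4 2 5 3 6]
print(rvec(A).ravel())   # [1 2 3 4 5 6]
relation_holds = np.array_equal(vec(A.T), rvec(A).T)   # vec(A^T) = (rvec(A))^T
print(relation_holds)
```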

5.2 Definition of the Commutation Matrix

For an $m \times n$ matrix $\boldsymbol{A}$, the vectors $vec(\boldsymbol{A})$ and $vec(\boldsymbol{A}^T)$ contain the same entries in a different order. Hence there is a unique $mn \times mn$ permutation matrix that maps the vectorization $vec(\boldsymbol{A})$ of every such matrix to the vectorization $vec(\boldsymbol{A}^T)$ of its transpose. This permutation matrix is called the commutation matrix, written $\boldsymbol{K}_{mn}$, and is defined by
$$\boldsymbol{K}_{mn}vec(\boldsymbol{A}_{m \times n})=vec(\boldsymbol{A}^T)$$
Similarly, the commutation matrix that maps $vec(\boldsymbol{A}^T)$ back to $vec(\boldsymbol{A})$ is an $nm \times nm$ permutation matrix, written $\boldsymbol{K}_{nm}$ and defined by
$$\boldsymbol{K}_{nm}vec(\boldsymbol{A}^T)=vec(\boldsymbol{A}_{m \times n})$$
The $mn \times mn$ commutation matrix $\boldsymbol{K}_{mn}$ can be constructed as follows. Each row contains exactly one entry equal to 1, and all other entries are 0. Start with a 1 in the first column of row 1. Shifting that position $m$ columns to the right gives the position of the 1 in row 2, shifting by another $m$ columns gives the position in row 3, and so on until every row has its 1. Whenever a shift would run past column $mn$, wrap around to column 1, continue counting from there, and move one extra column to the right before placing the 1.
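The construction above can be sketched in two ways: directly from the defining relation (the entry $a_{ij}$ sits at position $i+jm$ of $vec(\boldsymbol{A})$ and at position $j+in$ of $vec(\boldsymbol{A}^T)$, counting from 0), and by the row-by-row shifting procedure. Both function names are illustrative, and columns are 0-based in the code.

```python
import numpy as np

def commutation_by_index(m, n):
    # From the definition: move entry a_ij from slot i + j*m to slot j + i*n
    K = np.zeros((m * n, m * n), dtype=int)
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1
    return K

def commutation_by_shifting(m, n):
    # Shifting procedure from the text, with 0-based column indices
    K = np.zeros((m * n, m * n), dtype=int)
    col = 0
    for row in range(m * n):
        K[row, col] = 1
        col += m                       # shift m columns to the right
        if col >= m * n:               # ran past the last column:
            col = col - m * n + 1      # wrap around and move one extra column
    return K

def vec(A):
    return A.reshape(-1, order="F")

m, n = 2, 3
A = np.arange(1, m * n + 1).reshape(m, n)
K = commutation_by_shifting(m, n)
print(K)
print(np.array_equal(K, commutation_by_index(m, n)))
print(np.array_equal(K @ vec(A), vec(A.T)))
```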

5.3 Properties of the Commutation Matrix

(1) $\boldsymbol{K}_{nm}\boldsymbol{K}_{mn}=\boldsymbol{K}_{mn}\boldsymbol{K}_{nm}=\boldsymbol{I}_{mn}$

(2) $\boldsymbol{K}_{mn}^T\boldsymbol{K}_{mn}=\boldsymbol{K}_{mn}\boldsymbol{K}_{mn}^T=\boldsymbol{I}_{mn}$

(3) $\boldsymbol{K}_{mn}^T=\boldsymbol{K}_{nm}$

(4) $\boldsymbol{K}_{1n}=\boldsymbol{K}_{n1}=\boldsymbol{I}_n$
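These four properties can be confirmed for a concrete size. The sketch below rebuilds the commutation matrix from its defining relation $\boldsymbol{K}_{mn}vec(\boldsymbol{A}_{m \times n})=vec(\boldsymbol{A}^T)$; the helper name and the choice $m=3$, $n=4$ are arbitrary.

```python
import numpy as np

def commutation_matrix(m, n):
    # K_mn, defined by K_mn vec(A) = vec(A^T) for an m x n matrix A
    K = np.zeros((m * n, m * n), dtype=int)
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1
    return K

m, n = 3, 4
K_mn, K_nm = commutation_matrix(m, n), commutation_matrix(n, m)
I = np.eye(m * n, dtype=int)

checks = {
    "(1) K_nm K_mn = K_mn K_nm = I": np.array_equal(K_nm @ K_mn, I) and np.array_equal(K_mn @ K_nm, I),
    "(2) K_mn is orthogonal": np.array_equal(K_mn.T @ K_mn, I) and np.array_equal(K_mn @ K_mn.T, I),
    "(3) K_mn^T = K_nm": np.array_equal(K_mn.T, K_nm),
    "(4) K_1n = K_n1 = I_n": np.array_equal(commutation_matrix(1, n), np.eye(n, dtype=int))
    and np.array_equal(commutation_matrix(n, 1), np.eye(n, dtype=int)),
}
for name, ok in checks.items():
    print(name, bool(ok))
```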

5.4 Properties of the Vectorization Operator

(1) Vectorization of a sum of matrices: $vec(\boldsymbol{A}+\boldsymbol{B})=vec(\boldsymbol{A})+vec(\boldsymbol{B})$.

(2) Vectorization of the Hadamard product of $m \times n$ matrices $\boldsymbol{A}$ and $\boldsymbol{B}$: $vec(\boldsymbol{A}*\boldsymbol{B})=vec(\boldsymbol{A})*vec(\boldsymbol{B})=diag(vec(\boldsymbol{A}))vec(\boldsymbol{B})$,

where $diag(vec(\boldsymbol{A}))$ denotes the diagonal matrix whose diagonal entries are the entries of $vec(\boldsymbol{A})$ (in column-stacked order).

(3) Kronecker product of two vectors and the vectorization operator: $\boldsymbol{a}\otimes\boldsymbol{b}=vec(\boldsymbol{b}\boldsymbol{a}^T)$.

(4) Vectorization of the product $\boldsymbol{A}_{m \times p}\boldsymbol{B}_{p \times q}\boldsymbol{C}_{q \times n}$ in terms of Kronecker products:
$$\begin{aligned} vec(\boldsymbol{ABC}) &= (\boldsymbol{C}^T\boldsymbol{B}^T\otimes\boldsymbol{I}_m)vec(\boldsymbol{A}) \\ vec(\boldsymbol{ABC}) &= (\boldsymbol{C}^T\otimes\boldsymbol{A})vec(\boldsymbol{B}) \\ vec(\boldsymbol{ABC}) &= (\boldsymbol{I}_n\otimes\boldsymbol{AB})vec(\boldsymbol{C}) \end{aligned}$$
In the third form the identity matrix has order $n$, the number of columns of $\boldsymbol{C}$, so that $\boldsymbol{I}_n\otimes\boldsymbol{AB}$ is $mn \times qn$ and can act on the $qn \times 1$ vector $vec(\boldsymbol{C})$.
(5) Vectorization of a Kronecker product: for a $p \times m$ matrix $\boldsymbol{X}$ and an $n \times q$ matrix $\boldsymbol{Y}$,
$$vec(\boldsymbol{X}\otimes\boldsymbol{Y})=(\boldsymbol{I}_m\otimes\boldsymbol{K}_{qp}\otimes\boldsymbol{I}_n)(vec(\boldsymbol{X}) \otimes vec(\boldsymbol{Y}))$$
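The vectorization identities lend themselves to a numerical spot-check. The sketch below uses random matrices with arbitrarily chosen shapes and a hypothetical `commutation_matrix` helper built from $\boldsymbol{K}_{mn}vec(\boldsymbol{A}_{m \times n})=vec(\boldsymbol{A}^T)$. The third form of (4) is checked with an identity matrix of order $n$, the number of columns of $\boldsymbol{C}$, which is what the dimensions of $vec(\boldsymbol{C})$ require.

```python
import numpy as np

def vec(M):
    # Column-stacking vectorization as a 1-D array
    return np.asarray(M).reshape(-1, order="F")

def commutation_matrix(m, n):
    # K_mn, defined by K_mn vec(A) = vec(A^T) for an m x n matrix A
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

rng = np.random.default_rng(0)
m, p, q, n = 2, 3, 4, 5
A, B, C = rng.normal(size=(m, p)), rng.normal(size=(p, q)), rng.normal(size=(q, n))
a, b = rng.normal(size=3), rng.normal(size=4)
M, N = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))

# For (5): X is px x mx and Y is ny x qy, in the notation of the text
px, mx, ny, qy = 2, 3, 2, 3
X, Y = rng.normal(size=(px, mx)), rng.normal(size=(ny, qy))
perm = np.kron(np.kron(np.eye(mx), commutation_matrix(qy, px)), np.eye(ny))

ABC = vec(A @ B @ C)
checks = {
    "(1) sum": np.allclose(vec(M + N), vec(M) + vec(N)),
    "(2) Hadamard": np.allclose(vec(M * N), vec(M) * vec(N))
    and np.allclose(vec(M * N), np.diag(vec(M)) @ vec(N)),
    "(3) a kron b = vec(b a^T)": np.allclose(np.kron(a, b), vec(np.outer(b, a))),
    "(4) in terms of vec(A)": np.allclose(ABC, np.kron(C.T @ B.T, np.eye(m)) @ vec(A)),
    "(4) in terms of vec(B)": np.allclose(ABC, np.kron(C.T, A) @ vec(B)),
    "(4) in terms of vec(C)": np.allclose(ABC, np.kron(np.eye(n), A @ B) @ vec(C)),
    "(5) vec of a Kronecker product": np.allclose(vec(np.kron(X, Y)), perm @ np.kron(vec(X), vec(Y))),
}
for name, ok in checks.items():
    print(name, bool(ok))
```

The second form of (4) is the one most often used in matrix calculus, since it turns a linear map $\boldsymbol{B}\mapsto\boldsymbol{ABC}$ into an ordinary matrix acting on $vec(\boldsymbol{B})$.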

6. Differentials of Real Matrices

6.1 Definition of the Matrix Differential

The differential of an $m \times n$ matrix $\boldsymbol{X}$ is denoted $d\boldsymbol{X}$ and defined entrywise as $d\boldsymbol{X}=[dx_{ij}]_{i=1,j=1}^{m,n}$.

6.2 Common Formulas for Matrix Differentials

(1) The differential of a constant matrix is the zero matrix: $d\boldsymbol{A}=\boldsymbol{O}$.

(2) The differential of the product of a constant $\alpha$ and a matrix $\boldsymbol{X}$ is $d(\alpha\boldsymbol{X})=\alpha d\boldsymbol{X}$.

(3) The differential of a transpose is $d(\boldsymbol{X}^T)=(d\boldsymbol{X})^T$.

(4) The differential of a sum (or difference) of two matrix functions is $d(\boldsymbol{U}\pm\boldsymbol{V})=d\boldsymbol{U} \pm d\boldsymbol{V}$.

(5) The differential of a product of two matrix functions is $d(\boldsymbol{UV})=(d\boldsymbol{U})\boldsymbol{V}+\boldsymbol{U}(d\boldsymbol{V})$.

(6) The differential of the trace is $d(tr(\boldsymbol{X}))=tr(d\boldsymbol{X})$.

(7) The differential of the determinant is $d|\boldsymbol{X}|=tr(\boldsymbol{X}^*d\boldsymbol{X})$, where $\boldsymbol{X}^*$ denotes the adjugate matrix of $\boldsymbol{X}$ (the superscript here is not the Hadamard product). When $\boldsymbol{X}$ is invertible this can also be written as $d|\boldsymbol{X}|=|\boldsymbol{X}|tr(\boldsymbol{X}^{-1}d\boldsymbol{X})$.

(8) The differential of a Hadamard product is $d(\boldsymbol{U}*\boldsymbol{V})=(d\boldsymbol{U})*\boldsymbol{V}+\boldsymbol{U}*(d\boldsymbol{V})$.

(9) The differential of a Kronecker product is $d(\boldsymbol{U}\otimes\boldsymbol{V})=(d\boldsymbol{U})\otimes\boldsymbol{V}+\boldsymbol{U}\otimes(d\boldsymbol{V})$.

(10) The differential of the vectorization is $d(vec(\boldsymbol{X}))=vec(d\boldsymbol{X})$.

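(11) The differential of the matrix logarithm is $d(ln\boldsymbol{X})=\boldsymbol{X}^{-1}d\boldsymbol{X}$. This form holds when $\boldsymbol{X}$ and $d\boldsymbol{X}$ commute; for general $\boldsymbol{X}$ only the trace version $tr(d(ln\boldsymbol{X}))=tr(\boldsymbol{X}^{-1}d\boldsymbol{X})$ is guaranteed.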

(12) The differential of the inverse is $d(\boldsymbol{X}^{-1})=-\boldsymbol{X}^{-1}(d\boldsymbol{X})\boldsymbol{X}^{-1}$.

(13) The differential of the log-determinant is $d(ln|\boldsymbol{X}|)=tr(\boldsymbol{X}^{-1}d\boldsymbol{X})$, where $\boldsymbol{X}$ is invertible.

(14) The differential of an elementwise function is $d(f(\boldsymbol{X}))=f'(\boldsymbol{X})*d\boldsymbol{X}$, where $f$ and $f'$ are applied entry by entry.
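A differential formula can be checked numerically by comparing the actual change $f(\boldsymbol{X}+d\boldsymbol{X})-f(\boldsymbol{X})$ for a small perturbation with the predicted first-order term; the mismatch should be of second order. The sketch below does this for several of the formulas. The test matrix, the perturbation size, the tolerance and the helper `agrees_to_first_order` are illustrative assumptions, and `tanh` stands in for a generic elementwise function.

```python
import numpy as np

def agrees_to_first_order(actual_change, differential, tol=1e-4):
    # The mismatch should be second order in dX, i.e. tiny relative to the differential
    actual = np.atleast_1d(actual_change)
    diff = np.atleast_1d(differential)
    return bool(np.linalg.norm(actual - diff) <= tol * np.linalg.norm(diff))

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
X = M @ M.T + 3.0 * np.eye(3)          # symmetric positive definite: invertible, |X| > 0
dX = 1e-7 * rng.normal(size=(3, 3))    # small perturbation standing in for dX
Xinv = np.linalg.inv(X)
det = np.linalg.det

checks = {
    "(5) product, U = V = X": agrees_to_first_order(
        (X + dX) @ (X + dX) - X @ X, dX @ X + X @ dX),
    "(7) determinant": agrees_to_first_order(
        det(X + dX) - det(X), det(X) * np.trace(Xinv @ dX)),
    "(8) Hadamard product, U = V = X": agrees_to_first_order(
        (X + dX) * (X + dX) - X * X, dX * X + X * dX),
    "(12) inverse": agrees_to_first_order(
        np.linalg.inv(X + dX) - Xinv, -Xinv @ dX @ Xinv),
    "(13) log-determinant": agrees_to_first_order(
        np.log(det(X + dX)) - np.log(det(X)), np.trace(Xinv @ dX)),
    "(14) elementwise tanh": agrees_to_first_order(
        np.tanh(X + dX) - np.tanh(X), (1.0 - np.tanh(X) ** 2) * dX),
}
for name, ok in checks.items():
    print(name, ok)
```

The same pattern (perturb, compare, expect a second-order mismatch) is a convenient way to test any hand-derived gradient before using it in an optimizer.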

