CS224 Notes: Lecture 3a Matrix Calculus

Matrix Calculus

1 Jacobian Matrix

Suppose we have a function $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$, i.e., one that maps a vector of length $n$ to a vector of length $m$:

$$\mathbf{f}(\mathbf{x}) = [f_1(x_1, x_2, ..., x_n),\ f_2(x_1, x_2, ..., x_n),\ ...,\ f_m(x_1, x_2, ..., x_n)]$$

The Jacobian matrix is the $m \times n$ matrix defined as:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

That is, $\left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial f_i}{\partial x_j}$. Multiplying Jacobian matrices together is what implements the chain rule.
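
These Jacobians are easy to sanity-check numerically. Below is a minimal NumPy sketch (our own, not from the lecture; the helper `jacobian` and the test functions `f` and `g` are arbitrary choices) that approximates Jacobians with central differences and confirms that the Jacobian of a composition is the product of the individual Jacobians:

```python
import numpy as np

def jacobian(fn, v, eps=1e-6):
    """Approximate the Jacobian of fn at v by central differences,
    one input coordinate (one column) at a time."""
    return np.column_stack([(fn(v + eps * e) - fn(v - eps * e)) / (2 * eps)
                            for e in np.eye(v.size)])

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((2, 3))
f = lambda v: np.tanh(A @ v)   # f: R^4 -> R^3
g = lambda u: B @ u            # g: R^3 -> R^2
x = rng.standard_normal(4)

# Chain rule: the Jacobian of g(f(x)) is the product of the Jacobians.
lhs = jacobian(lambda v: g(f(v)), x)       # shape (2, 4)
rhs = jacobian(g, f(x)) @ jacobian(f, x)   # (2, 3) @ (3, 4)
assert np.allclose(lhs, rhs, atol=1e-5)
```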

2 Useful Identities

(1) $\mathbf{z} = \mathbf{W}\mathbf{x}$; then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{W}$.

Suppose $\mathbf{W} \in \mathbb{R}^{n \times m}$, so multiplication by $\mathbf{W}$ maps an $m$-dimensional vector to an $n$-dimensional vector. Written out in scalar form:

$$\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{pmatrix} = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1m} \\ \vdots & \vdots & & \vdots \\ W_{n1} & W_{n2} & \cdots & W_{nm} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$

$$z_i = \sum_{k=1}^{m} W_{ik} x_k$$

So the Jacobian is $n \times m$:

$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{m} W_{ik} x_k = \sum_{k=1}^{m} W_{ik} \frac{\partial x_k}{\partial x_j} = W_{ij}$$

Therefore, $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{W}$.
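
A quick numerical check of identity (1), under the same shape convention ($\mathbf{W}$ is $n \times m$, $\mathbf{x}$ is a length-$m$ column vector); the finite-difference construction below is our own sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Central-difference Jacobian of z = W x with respect to x.
eps = 1e-6
f = lambda v: W @ v
J = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(m)])
assert np.allclose(J, W, atol=1e-6)   # the Jacobian recovers W
```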

(2) For row vectors $\mathbf{x}$ ($1 \times n$) and $\mathbf{z}$ ($1 \times m$) with $\mathbf{z} = \mathbf{x}\mathbf{W}$, we have $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{W}^T$.

As above, the Jacobian is:

$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{n} x_k W_{ki} = \sum_{k=1}^{n} W_{ki} \frac{\partial x_k}{\partial x_j} = W_{ji}$$

Therefore, $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{W}^T$.
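
The same check for the row-vector case (again our own sketch; a 1-D NumPy array stands in for the $1 \times n$ row vector):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)   # the 1 x n row vector, stored as a 1-D array

# z = x W has length m; the Jacobian is m x n and should equal W^T.
eps = 1e-6
f = lambda v: v @ W
J = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
assert np.allclose(J, W.T, atol=1e-6)
```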

(3) $\mathbf{z} = \mathbf{x}$; then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{I}$.

$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial x_i}{\partial x_j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

The Jacobian has ones on the diagonal and zeros everywhere else, so it is the identity matrix.

(4) $\mathbf{z} = f(\mathbf{x})$ with $f$ applied elementwise; then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathrm{diag}(f'(\mathbf{x}))$.

Since $z_i = f(x_i)$, we have:

$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} f(x_i) = \begin{cases} f'(x_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

The Jacobian is a diagonal matrix with $f'(x_i)$ on the diagonal.
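
As a concrete instance of identity (4), take $f$ to be the sigmoid, whose derivative is $f'(x) = f(x)(1 - f(x))$; a small numerical check (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))   # applied elementwise

# Numerical Jacobian of z = sigmoid(x).
eps = 1e-6
J = np.column_stack([(sigmoid(x + eps * e) - sigmoid(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Analytic form: diag(f'(x)) with f'(x) = sigmoid(x) * (1 - sigmoid(x)).
s = sigmoid(x)
assert np.allclose(J, np.diag(s * (1 - s)), atol=1e-6)
```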

(5) Matrix times column vector with respect to the matrix

$\mathbf{z} = \mathbf{W}\mathbf{x}$, $\boldsymbol{\delta} = \frac{\partial J}{\partial \mathbf{z}}$; then $\frac{\partial J}{\partial \mathbf{W}} = \frac{\partial J}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W}} = \boldsymbol{\delta}^T \mathbf{x}^T$.

Suppose we have a loss function $J$ (a scalar) and want its gradient with respect to $\mathbf{W} \in \mathbb{R}^{n \times m}$. We can view $J$ as a function of $\mathbf{W}$ with $nm$ inputs (the entries of $\mathbf{W}$) and one output ($J$), so the Jacobian $\frac{\partial J}{\partial \mathbf{W}}$ is a $1 \times nm$ vector. In practice, however, that shape is not very useful. It is more convenient to arrange the derivatives as follows, matching the shape of $\mathbf{W}$, so that gradient descent can subtract the result from $\mathbf{W}$ directly. We therefore take this as $\frac{\partial J}{\partial \mathbf{W}}$:

$$\frac{\partial J}{\partial \mathbf{W}} = \begin{pmatrix} \frac{\partial J}{\partial W_{11}} & \frac{\partial J}{\partial W_{12}} & \cdots & \frac{\partial J}{\partial W_{1m}} \\ \vdots & \vdots & & \vdots \\ \frac{\partial J}{\partial W_{n1}} & \frac{\partial J}{\partial W_{n2}} & \cdots & \frac{\partial J}{\partial W_{nm}} \end{pmatrix}$$

$$z_k = \sum_{l=1}^{m} W_{kl} x_l, \qquad \frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^{m} x_l \frac{\partial W_{kl}}{\partial W_{ij}}$$

Here $\frac{\partial W_{kl}}{\partial W_{ij}} = 1$ if $i = k$ and $j = l$, and $0$ otherwise. So $\frac{\partial \mathbf{z}}{\partial W_{ij}}$ is a column vector with $x_j$ in the $i$-th position and zeros everywhere else:

$$\frac{\partial \mathbf{z}}{\partial W_{ij}} = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

$$\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial W_{ij}} = \sum_{k=1}^{n} \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j$$

Shape check: $\mathbf{x}$ is $(m, 1)$, $\mathbf{W}$ is $(n, m)$, $\mathbf{z}$ is $(n, 1)$, $\boldsymbol{\delta} = \frac{\partial J}{\partial \mathbf{z}}$ is $(1, n)$, and $\frac{\partial J}{\partial \mathbf{W}} = \boldsymbol{\delta}^T \mathbf{x}^T$ is $(n, m)$, matching the shape of $\mathbf{W}$ as desired.
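
To make identity (5) concrete we need to fix some scalar loss; below we pick $J = \frac{1}{2}\lVert\mathbf{W}\mathbf{x}\rVert^2$ (an arbitrary choice of ours, for which $\boldsymbol{\delta} = \mathbf{z}^T$) and check the outer-product formula $\delta_i x_j$ entry by entry:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# A concrete scalar loss: J = 0.5 * ||W x||^2, so delta = dJ/dz = z^T.
J = lambda M: 0.5 * np.sum((M @ x) ** 2)

# Finite-difference gradient with respect to every entry W_ij.
eps = 1e-6
grad = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        grad[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)

delta = W @ x                           # delta^T, as a length-n column
assert np.allclose(grad, np.outer(delta, x), atol=1e-6)   # delta^T x^T
```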

(6) Row vector times matrix with respect to the matrix

$\mathbf{z} = \mathbf{x}\mathbf{W}$, $\boldsymbol{\delta} = \frac{\partial J}{\partial \mathbf{z}}$; then $\frac{\partial J}{\partial \mathbf{W}} = \frac{\partial J}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W}} = \mathbf{x}^T \boldsymbol{\delta}$.

The derivation is the same as in (5).

(7) Cross-entropy loss with respect to logits

$\hat{\mathbf{y}} = \mathrm{softmax}(\boldsymbol{\theta})$, $J = CE(\mathbf{y}, \hat{\mathbf{y}})$; then $\frac{\partial J}{\partial \boldsymbol{\theta}} = \hat{\mathbf{y}} - \mathbf{y}$.

The derivation is as follows:

$$\hat{\mathbf{y}} = \mathrm{softmax}(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\exp(\theta_1)}{\sum_{i=1}^{n} \exp(\theta_i)} \\ \vdots \\ \frac{\exp(\theta_n)}{\sum_{i=1}^{n} \exp(\theta_i)} \end{pmatrix} \in \mathbb{R}^n$$

$$J = CE(\mathbf{y}, \hat{\mathbf{y}}) = -\mathbf{y}^T \log(\hat{\mathbf{y}})$$

Because $\mathbf{y}$ is the one-hot encoding of the correct label, only the correct position is 1 and all other positions are 0:

$$\mathbf{y} = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

Without loss of generality, let $y_k = 1$, i.e., the $k$-th class is the correct one. Then $J$ becomes:

$$J = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k) = -\log\left(\frac{\exp(\theta_k)}{\sum_{i=1}^{n} \exp(\theta_i)}\right) = -\theta_k + \log\left(\sum_{i=1}^{n} \exp(\theta_i)\right)$$

Then,

$$\begin{aligned} \frac{\partial J}{\partial \theta_i} &= \frac{\partial}{\partial \theta_i}\left(-\log(\hat{y}_k)\right) \\ &= \frac{\partial}{\partial \theta_i}\left(-\theta_k + \log\left(\sum_{x=1}^{n} \exp(\theta_x)\right)\right) \\ &= -\frac{\partial \theta_k}{\partial \theta_i} + \frac{\partial}{\partial \theta_i} \log\left(\sum_{x=1}^{n} \exp(\theta_x)\right) \\ &= -\frac{\partial \theta_k}{\partial \theta_i} + \frac{1}{\sum_{x=1}^{n} \exp(\theta_x)} \frac{\partial}{\partial \theta_i} \sum_{x=1}^{n} \exp(\theta_x) \\ &= -\frac{\partial \theta_k}{\partial \theta_i} + \frac{\exp(\theta_i)}{\sum_{x=1}^{n} \exp(\theta_x)} \\ &= -\frac{\partial \theta_k}{\partial \theta_i} + \hat{y}_i \end{aligned}$$

The first term is 1 only when $i = k$ and 0 otherwise, i.e., it is exactly $y_i$, so:

$$\frac{\partial J}{\partial \boldsymbol{\theta}} = -\mathbf{y} + \hat{\mathbf{y}} = \hat{\mathbf{y}} - \mathbf{y}$$
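
This clean result, $\hat{\mathbf{y}} - \mathbf{y}$, is one reason the softmax/cross-entropy pairing is so convenient in practice. A small numerical check (our own sketch; subtracting the max inside `softmax` is a standard numerical-stability trick, separate from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
theta = rng.standard_normal(n)   # logits
y = np.zeros(n)
y[k] = 1.0                       # one-hot label with y_k = 1

def softmax(t):
    e = np.exp(t - t.max())      # shift by the max for numerical stability
    return e / e.sum()

# Cross-entropy for the true class k: J = -log(softmax(theta)_k).
J = lambda t: -np.log(softmax(t)[k])

# Finite-difference gradient of J with respect to theta.
eps = 1e-6
grad = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
assert np.allclose(grad, softmax(theta) - y, atol=1e-6)   # y_hat - y
```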
