Matrix Calculus
1 Jacobian Matrix
Suppose we have a function $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$, i.e. one that maps a vector of length $n$ to a vector of length $m$:
$$\mathbf{f}(\mathbf{x}) = [f_1(x_1,x_2,...,x_n),\ f_2(x_1,x_2,...,x_n),\ ...,\ f_m(x_1,x_2,...,x_n)]$$
The Jacobian is the $m\times n$ matrix defined as follows:
$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}}= \left(\begin{array}{ccc}\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{array}\right)$$
That is, $\left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)_{ij}=\frac{\partial f_i}{\partial x_j}$. Multiplying Jacobians together is exactly what implements the chain rule.
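This definition can be checked numerically. Below is a minimal sketch, assuming NumPy is available; `numerical_jacobian` is a hypothetical helper defined here (not from the text), which approximates each entry $\partial f_i/\partial x_j$ by central differences.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m-by-n Jacobian of f: R^n -> R^m at x
    using central differences: J[i, j] ~= d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# Example: f(x) = [x1*x2, sin(x1)] has analytic Jacobian [[x2, x1], [cos(x1), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([1.0, 2.0])
J = numerical_jacobian(f, x)
```

For this example `J` should be close to $\begin{pmatrix} 2 & 1 \\ \cos 1 & 0\end{pmatrix}$, matching the $(i,j) = \partial f_i/\partial x_j$ convention above.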
2 Useful Identities
(1) If $\mathbf{z}=\mathbf{W}\mathbf{x}$, then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathbf{W}$
Suppose $\mathbf{W}\in \mathbb{R}^{n\times m}$, which can be viewed as mapping an $m$-dimensional vector to an $n$-dimensional vector. Written out in scalar form:
$$\left(\begin{array}{c} z_1\\ z_2\\ \vdots\\ z_n \end{array}\right) = \left(\begin{array}{cccc} W_{11} & W_{12} & \cdots & W_{1m}\\ \vdots & \vdots & \ddots & \vdots\\ W_{n1} & W_{n2} & \cdots & W_{nm} \end{array}\right) \left(\begin{array}{c} x_1\\ x_2\\ \vdots\\ x_m \end{array}\right)$$
$$z_i = \sum_{k=1}^{m}W_{ik}x_k$$
So the Jacobian is $n\times m$:
$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij} =\frac{\partial z_i}{\partial x_j} =\frac{\partial}{\partial x_j}\sum_{k=1}^{m}W_{ik}x_k =\sum_{k=1}^{m}W_{ik}\frac{\partial x_k}{\partial x_j}=W_{ij}$$
Therefore, $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathbf{W}$.
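As a sanity check, the identity $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathbf{W}$ can be verified numerically. The sketch below (assuming NumPy, with arbitrary illustrative shapes) builds the Jacobian of $\mathbf{z}=\mathbf{W}\mathbf{x}$ column by column via central differences and compares it to $\mathbf{W}$.

```python
import numpy as np

np.random.seed(0)
n, m = 3, 4
W = np.random.randn(n, m)
x = np.random.randn(m)
f = lambda v: W @ v          # z = W x

# Central-difference Jacobian of z with respect to x.
eps = 1e-6
J = np.zeros((n, m))
for j in range(m):
    e = np.zeros(m)
    e[j] = eps
    J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)

# Identity (1): dz/dx = W, independent of x since z is linear in x.
assert np.allclose(J, W, atol=1e-4)
```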
(2) For row vectors $\mathbf{x}$ ($1\times n$) and $\mathbf{z}$ ($1\times m$) with $\mathbf{z}=\mathbf{x}\mathbf{W}$, $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathbf{W}^T$
As above, the Jacobian is:
$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij} =\frac{\partial z_i}{\partial x_j} =\frac{\partial}{\partial x_j}\sum_{k=1}^{n}x_kW_{ki} =\sum_{k=1}^{n}W_{ki}\frac{\partial x_k}{\partial x_j} =W_{ji}$$
Therefore, $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathbf{W}^T$.
(3) If $\mathbf{z} = \mathbf{x}$, then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathbf{I}$
$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij}=\frac{\partial z_i}{\partial x_j}=\frac{\partial x_i}{\partial x_j}= \begin{cases}1, & \text{if } i=j\\ 0, & \text{otherwise}\end{cases}$$
The Jacobian has ones on the diagonal and zeros everywhere else, so it is the identity matrix.
(4) If $\mathbf{z}=f(\mathbf{x})$ with $f$ applied elementwise, then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}=\mathrm{diag}(f'(\mathbf{x}))$
Since $z_i=f(x_i)$, we have:
$$\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)_{ij}=\frac{\partial z_i}{\partial x_j}=\frac{\partial}{\partial x_j}f(x_i)= \begin{cases}f'(x_i), & \text{if } i=j\\ 0, & \text{otherwise}\end{cases}$$
The Jacobian is the diagonal matrix with $f'(x_i)$ on the diagonal.
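A quick numerical check of identity (4), assuming NumPy and using $\tanh$ as an illustrative elementwise nonlinearity (any smooth $f$ would do): the finite-difference Jacobian should match $\mathrm{diag}(f'(\mathbf{x}))$.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
f = np.tanh                            # elementwise nonlinearity
fprime = lambda v: 1 - np.tanh(v) ** 2  # its derivative, tanh'(v) = 1 - tanh(v)^2

# Central-difference Jacobian of z = f(x) with respect to x.
eps = 1e-6
n = x.size
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)

# Identity (4): the Jacobian is diagonal, with f'(x_i) on the diagonal.
assert np.allclose(J, np.diag(fprime(x)), atol=1e-4)
```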
(5)Matrix times column vector with respect to the matrix
$\mathbf{z}=\mathbf{W}\mathbf{x}$, $\delta = \frac{\partial J}{\partial \mathbf{z}}$, $\frac{\partial J}{\partial \mathbf{W}}=\frac{\partial J}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{W}}=\delta^T\mathbf{x}^T$
Suppose we have a scalar loss function $J$ and want to compute its gradient with respect to $\mathbf{W}\in \mathbb{R}^{n\times m}$. We can view $J$ as a function of $\mathbf{W}$ with $nm$ inputs (the entries of $\mathbf{W}$) and one output ($J$), so the Jacobian $\frac{\partial J}{\partial \mathbf{W}}$ is a $1\times nm$ vector. In practice, however, this layout is not very useful. It is more convenient to arrange the derivative so that it has the same shape as $\mathbf{W}$, letting gradient descent subtract it from $\mathbf{W}$ directly. We therefore take $\frac{\partial J}{\partial \mathbf{W}}$ to be:
$$\frac{\partial J}{\partial \mathbf{W}}= \left(\begin{array}{cccc} \frac{\partial J}{\partial W_{11}} & \frac{\partial J}{\partial W_{12}} & \cdots & \frac{\partial J}{\partial W_{1m}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial J}{\partial W_{n1}} & \frac{\partial J}{\partial W_{n2}} & \cdots & \frac{\partial J}{\partial W_{nm}} \end{array}\right)$$
$$z_k = \sum_{l=1}^{m}W_{kl}x_l,\qquad \frac{\partial z_k}{\partial W_{ij}}= \sum_{l=1}^{m}x_l\frac{\partial W_{kl}}{\partial W_{ij}}$$
Here $\frac{\partial W_{kl}}{\partial W_{ij}}=1$ if $i=k$ and $j=l$, and $0$ otherwise. Thus $\frac{\partial \mathbf{z}}{\partial W_{ij}}$ is a column vector with $x_j$ in the $i$-th position and zeros everywhere else:
$$\frac{\partial \mathbf{z}}{\partial W_{ij}} = \left(\begin{array}{c} 0\\ \vdots\\ 0\\ x_j\\ 0\\ \vdots\\ 0 \end{array}\right)$$
$$\frac{\partial J}{\partial W_{ij}}=\frac{\partial J}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial W_{ij}} =\sum_{k=1}^n\delta_k \frac{\partial z_k}{\partial W_{ij}} =\delta_i x_j$$
Collecting the shapes: $\mathbf{x}$ is $(m,1)$, $\mathbf{W}$ is $(n,m)$, $\mathbf{z}$ is $(n,1)$, $\delta=\frac{\partial J}{\partial \mathbf{z}}$ is $(1,n)$, and $\frac{\partial J}{\partial \mathbf{W}}=\delta^T\mathbf{x}^T$ is $(n,m)$.
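Identity (5) can also be checked numerically. Below is a sketch assuming NumPy; the loss $J = \mathbf{a}\cdot(\mathbf{W}\mathbf{x})$ with an arbitrary vector $\mathbf{a}$ is an illustrative choice (not from the text), picked so that $\delta = \frac{\partial J}{\partial \mathbf{z}} = \mathbf{a}$ is known in closed form.

```python
import numpy as np

np.random.seed(1)
n, m = 3, 4
W = np.random.randn(n, m)
x = np.random.randn(m)
a = np.random.randn(n)            # defines a toy scalar loss J = a . (W x)

J_fn = lambda Wmat: a @ (Wmat @ x)
delta = a                         # dJ/dz for this loss is just a, shape (1, n) as a row

# Finite-difference gradient of J with respect to each entry W_ij.
eps = 1e-6
G = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        E = np.zeros((n, m))
        E[i, j] = eps
        G[i, j] = (J_fn(W + E) - J_fn(W - E)) / (2 * eps)

# Identity (5): dJ/dW = delta^T x^T, i.e. the outer product delta_i * x_j.
assert np.allclose(G, np.outer(delta, x), atol=1e-4)
```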
(6) Row vector times matrix with respect to the matrix
$\mathbf{z}=\mathbf{x}\mathbf{W}$, $\delta = \frac{\partial J}{\partial \mathbf{z}}$, $\frac{\partial J}{\partial \mathbf{W}}=\frac{\partial J}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{W}}=\mathbf{x}^T\delta$
The derivation is the same as in (5).
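The row-vector form can be verified the same way. Assuming NumPy, with $\mathbf{x}$ of size $n$, $\mathbf{W}$ of shape $(n,m)$, and an illustrative loss $J = (\mathbf{x}\mathbf{W})\cdot\mathbf{b}$ (so that $\delta = \mathbf{b}$ is known), the finite-difference gradient should equal $\mathbf{x}^T\delta$.

```python
import numpy as np

np.random.seed(2)
n, m = 3, 4
W = np.random.randn(n, m)
x = np.random.randn(n)            # row vector, shape (1, n)
b = np.random.randn(m)            # defines a toy scalar loss J = (x W) . b

J_fn = lambda Wmat: (x @ Wmat) @ b
delta = b                         # dJ/dz for this loss, shape (1, m)

# Finite-difference gradient of J with respect to each entry W_ij.
eps = 1e-6
G = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        E = np.zeros((n, m))
        E[i, j] = eps
        G[i, j] = (J_fn(W + E) - J_fn(W - E)) / (2 * eps)

# Identity (6): dJ/dW = x^T delta, i.e. the outer product x_i * delta_j.
assert np.allclose(G, np.outer(x, delta), atol=1e-4)
```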
(7)Cross-Entropy loss with respect to logits
$\hat{\mathbf{y}}=\mathrm{softmax}(\boldsymbol{\theta})$, $J = CE(\mathbf{y},\hat{\mathbf{y}})$, $\frac{\partial J}{\partial \boldsymbol{\theta}}=\hat{\mathbf{y}}-\mathbf{y}$
The derivation is as follows:
$$\hat{\mathbf{y}}=\mathrm{softmax}(\boldsymbol{\theta})=\left(\begin{array}{c} \frac{\exp(\theta_1)}{\sum_{i=1}^n \exp(\theta_i)}\\ \vdots\\ \frac{\exp(\theta_n)}{\sum_{i=1}^n \exp(\theta_i)} \end{array}\right) \in \mathbb{R}^n$$
$$J = CE(\mathbf{y},\hat{\mathbf{y}}) = -\mathbf{y}^T\log(\hat{\mathbf{y}})$$
Since $\mathbf{y}$ is the correct label in one-hot encoding, only the correct position is $1$ and all other positions are $0$:
$$\mathbf{y} = \left(\begin{array}{c} 0\\ \vdots\\ 0\\ 1\\ 0\\ \vdots\\ 0 \end{array}\right)$$
Without loss of generality, let $y_k=1$, i.e. the $k$-th class is the correct one, so $J$ becomes:
$$J = -y_k\log(\hat{y}_k) = -\log(\hat{y}_k)=-\log\left(\frac{\exp(\theta_k)}{\sum_{i=1}^n \exp(\theta_i)}\right)=-\theta_k+\log\left(\sum_{i=1}^n\exp(\theta_i)\right)$$
Then,
$$\begin{aligned} \frac{\partial J}{\partial \theta_i}&=\frac{\partial}{\partial \theta_i}\left(-\log(\hat{y}_k)\right)\\ &= \frac{\partial}{\partial \theta_i}\left(-\theta_k+\log\left(\sum_{x=1}^n\exp(\theta_x)\right)\right)\\ &= -\frac{\partial \theta_k}{\partial \theta_i}+\frac{\partial}{\partial \theta_i}\log\left(\sum_{x=1}^n\exp(\theta_x)\right)\\ &=-\frac{\partial \theta_k}{\partial \theta_i}+\frac{1}{\sum_{x=1}^n\exp(\theta_x)}\frac{\partial}{\partial \theta_i}\sum_{x=1}^n\exp(\theta_x)\\ &=-\frac{\partial \theta_k}{\partial \theta_i}+\frac{\exp(\theta_i)}{\sum_{x=1}^n\exp(\theta_x)}\\ &=-\frac{\partial \theta_k}{\partial \theta_i}+\hat{y}_i \end{aligned}$$
The first term, $\frac{\partial \theta_k}{\partial \theta_i}$, equals $1$ only when $i=k$ and is $0$ otherwise, which is exactly $y_i$. Therefore:
$$\frac{\partial J}{\partial \boldsymbol{\theta}}=-\mathbf{y}+\hat{\mathbf{y}}$$
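This elegant result, $\frac{\partial J}{\partial \boldsymbol{\theta}}=\hat{\mathbf{y}}-\mathbf{y}$, is easy to confirm with a gradient check. Below is a sketch assuming NumPy; the `softmax` helper subtracts the maximum before exponentiating, a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())   # shift by max for numerical stability
    return e / e.sum()

np.random.seed(3)
n = 5
theta = np.random.randn(n)
y = np.zeros(n)
y[2] = 1.0                            # one-hot label, correct class k = 2

ce = lambda t: -np.log(softmax(t)[2])  # J = CE(y, softmax(theta)) = -log(y_hat_k)

# Finite-difference gradient of the cross-entropy loss with respect to theta.
eps = 1e-6
g = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    g[i] = (ce(theta + e) - ce(theta - e)) / (2 * eps)

# Identity (7): dJ/dtheta = y_hat - y.
assert np.allclose(g, softmax(theta) - y, atol=1e-4)
```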