(1) Matrix times column vector with respect to the column vector ($z = Wx$, what is $\frac{\partial z}{\partial x}$?), where $W \in \mathbb{R}^{n \times m}$ and $x \in \mathbb{R}^{m}$:

$$\frac{\partial z}{\partial x} = W$$
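As a quick sanity check, here is a minimal NumPy sketch (the shapes $n=3$, $m=4$ and the random seed are arbitrary choices) that compares a finite-difference Jacobian of $z = Wx$ against $W$:

```python
import numpy as np

# Numerical check of dz/dx = W for z = W x.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

eps = 1e-6
jac = np.zeros((n, m))  # Jacobian entry (i, j) = dz_i / dx_j
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    jac[:, j] = (W @ (x + dx) - W @ (x - dx)) / (2 * eps)

assert np.allclose(jac, W, atol=1e-5)  # the Jacobian is W itself
```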
(2) Row vector times matrix with respect to the row vector ($z = xW$, what is $\frac{\partial z}{\partial x}$?), where $W \in \mathbb{R}^{n \times m}$ and $x \in \mathbb{R}^{1 \times n}$:

$$\frac{\partial z}{\partial x} = W^{T}$$
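The same finite-difference check works for the row-vector case; this sketch (again with arbitrary shapes) confirms the Jacobian is $W^{T}$:

```python
import numpy as np

# Numerical check of dz/dx = W^T for z = x W, with x treated as a 1 x n row vector.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)

eps = 1e-6
jac = np.zeros((m, n))  # Jacobian entry (j, i) = dz_j / dx_i
for i in range(n):
    dx = np.zeros(n)
    dx[i] = eps
    jac[:, i] = ((x + dx) @ W - (x - dx) @ W) / (2 * eps)

assert np.allclose(jac, W.T, atol=1e-5)  # the Jacobian is W transpose
```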
(3) A vector with itself ($z = x$, what is $\frac{\partial z}{\partial x}$?) This is just the identity matrix:

$$\frac{\partial z}{\partial x} = I$$

When applying the chain rule, this term will disappear, because a matrix or vector multiplied by the identity matrix is unchanged.
(4) An elementwise function applied to a vector ($z = f(x)$, what is $\frac{\partial z}{\partial x}$?) Since $f$ is applied elementwise, $z_i = f(x_i)$, so

$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} f(x_i) = \begin{cases} f'(x_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
We can write this as $\frac{\partial z}{\partial x} = \mathrm{diag}(f'(x))$. Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write $\circ f'(x)$ when applying the chain rule.
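To see the $\mathrm{diag}(f'(x))$ identity concretely, here is a small check using $f = \tanh$ as an assumed example of a smooth elementwise function (so $f'(x) = 1 - \tanh^2(x)$):

```python
import numpy as np

# Numerical check of dz/dx = diag(f'(x)) for the elementwise function f = tanh.
rng = np.random.default_rng(0)
m = 5
x = rng.standard_normal(m)

eps = 1e-6
jac = np.zeros((m, m))
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    jac[:, j] = (np.tanh(x + dx) - np.tanh(x - dx)) / (2 * eps)

# Off-diagonal entries are zero; the diagonal holds f'(x) = 1 - tanh(x)^2.
assert np.allclose(jac, np.diag(1 - np.tanh(x) ** 2), atol=1e-5)
```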
(5) Matrix times column vector with respect to the matrix ($z = Wx$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = \delta\frac{\partial z}{\partial W}$?), where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{m}$, and $z \in \mathbb{R}^{n}$. Writing out a single entry of $z$:

$$z_k = \sum_{l=1}^{m} W_{kl} x_l$$

so

$$\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^{m} x_l \frac{\partial}{\partial W_{ij}} W_{kl}$$

Note that $\frac{\partial}{\partial W_{ij}} W_{kl} = 1$ if $i = k$ and $j = l$, and $0$ otherwise. So if $k \neq i$, everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when $l = j$, so

$$\frac{\partial z_k}{\partial W_{ij}} = x_j$$
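A quick numerical spot-check of this per-entry result: perturbing a single entry $W_{ij}$ should change only $z_i$, and by exactly $x_j$ per unit of perturbation (the index choice below is arbitrary):

```python
import numpy as np

# Spot-check dz_k/dW_ij for z = W x by perturbing one entry of W.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
i, j = 1, 2  # an arbitrary entry of W

eps = 1e-6
dW = np.zeros((n, m))
dW[i, j] = eps
dz = ((W + dW) @ x - (W - dW) @ x) / (2 * eps)  # dz_k/dW_ij for all k

expected = np.zeros(n)
expected[i] = x[j]  # nonzero only at k = i, where it equals x_j
assert np.allclose(dz, expected, atol=1e-5)
```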
Now let’s compute
$$\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W_{ij}} = \delta\frac{\partial z}{\partial W_{ij}} = \sum_{k=1}^{n} \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j$$

(the only nonzero term in the sum is $\delta_i \frac{\partial z_i}{\partial W_{ij}}$). To get $\frac{\partial J}{\partial W}$, we want a matrix where entry $(i, j)$ is $\delta_i x_j$. This matrix is equal to the outer product

$$\frac{\partial J}{\partial W} = \delta^{T} x^{T}$$

where $\delta^{T}$ is an $n \times 1$ column vector and $x^{T}$ is a $1 \times m$ row vector.
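To verify the outer-product form numerically, we need a concrete scalar loss; the sketch below uses the toy loss $J = a \cdot z$ (so $\delta = a$), where $a$ is an assumed stand-in for whatever upstream gradient the network produces:

```python
import numpy as np

# Check dJ/dW = outer(delta, x) for z = W x and the toy loss J = a . z.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
a = rng.standard_normal(n)  # delta = dJ/dz = a for this toy loss

def J(W):
    return a @ (W @ x)

eps = 1e-6
grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        dW = np.zeros((n, m))
        dW[i, j] = eps
        grad[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)

assert np.allclose(grad, np.outer(a, x), atol=1e-5)  # delta^T x^T as an outer product
```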
(6) Row vector times matrix with respect to the matrix ($z = xW$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \delta\frac{\partial z}{\partial W}$?), where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{1 \times n}$, and $z \in \mathbb{R}^{1 \times m}$. A computation similar to (5) shows that

$$\frac{\partial J}{\partial W} = x^{T} \delta$$
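The analogous check for the row-vector case, again with a toy loss chosen so that $\delta$ is known in closed form:

```python
import numpy as np

# Check dJ/dW = outer(x, delta) = x^T delta for z = x W and the toy loss J = z . a.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)   # row-vector input
a = rng.standard_normal(m)   # delta = dJ/dz = a for this toy loss

def J(W):
    return (x @ W) @ a

eps = 1e-6
grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        dW = np.zeros((n, m))
        dW[i, j] = eps
        grad[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)

assert np.allclose(grad, np.outer(x, a), atol=1e-5)  # x^T delta as an outer product
```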
(7) Cross-entropy loss with respect to logits ($\hat{y} = \mathrm{softmax}(\theta)$, $J = CE(y, \hat{y})$, what is $\frac{\partial J}{\partial \theta}$?), where $y$ is the true label distribution (e.g., a one-hot vector):

$$\frac{\partial J}{\partial \theta} = \hat{y} - y$$
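Finally, a numerical check of the softmax/cross-entropy gradient, with an arbitrarily chosen one-hot label:

```python
import numpy as np

# Numerical check of dJ/dtheta = y_hat - y for J = CE(y, softmax(theta)).
rng = np.random.default_rng(0)
k = 5
theta = rng.standard_normal(k)
y = np.zeros(k)
y[2] = 1.0  # one-hot true label; class 2 is an arbitrary choice

def softmax(t):
    e = np.exp(t - t.max())  # shift for numerical stability
    return e / e.sum()

def J(t):
    return -np.sum(y * np.log(softmax(t)))  # cross-entropy loss

eps = 1e-6
grad = np.array([(J(theta + eps * np.eye(k)[i]) - J(theta - eps * np.eye(k)[i])) / (2 * eps)
                 for i in range(k)])

assert np.allclose(grad, softmax(theta) - y, atol=1e-5)
```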