(1) Matrix times column vector with respect to the column vector ($z = Wx$, what is $\frac{\partial z}{\partial x}$?), where $W \in \mathbb{R}^{n \times m}$ and $x \in \mathbb{R}^{m}$:

$$\frac{\partial z}{\partial x} = W$$
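As a quick sanity check, here is a minimal NumPy sketch (the shapes $n=3$, $m=4$ and the random seed are arbitrary choices) that compares a finite-difference Jacobian of $z = Wx$ against $W$:

```python
import numpy as np

# Numerical check of dz/dx = W for z = W x.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

eps = 1e-6
jac = np.zeros((n, m))  # Jacobian entry (i, j) = dz_i / dx_j
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    jac[:, j] = (W @ (x + dx) - W @ (x - dx)) / (2 * eps)

assert np.allclose(jac, W, atol=1e-5)  # the Jacobian is W itself
```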
(2) Row vector times matrix with respect to the row vector ($z = xW$, what is $\frac{\partial z}{\partial x}$?), where $W \in \mathbb{R}^{n \times m}$ and $x \in \mathbb{R}^{1 \times n}$:

$$\frac{\partial z}{\partial x} = W^{T}$$
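The same finite-difference check works for the row-vector case; this sketch (again with arbitrary shapes) confirms the Jacobian is $W^{T}$:

```python
import numpy as np

# Numerical check of dz/dx = W^T for z = x W, with x treated as a 1 x n row vector.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)

eps = 1e-6
jac = np.zeros((m, n))  # Jacobian entry (j, i) = dz_j / dx_i
for i in range(n):
    dx = np.zeros(n)
    dx[i] = eps
    jac[:, i] = ((x + dx) @ W - (x - dx) @ W) / (2 * eps)

assert np.allclose(jac, W.T, atol=1e-5)  # the Jacobian is W transpose
```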
(3) A vector with itself ($z = x$, what is $\frac{\partial z}{\partial x}$?) This is just the identity matrix:

$$\frac{\partial z}{\partial x} = I$$

When applying the chain rule, this term will disappear, because a matrix or vector multiplied by the identity matrix is unchanged.
(4) An elementwise function applied to a vector ($z = f(x)$, what is $\frac{\partial z}{\partial x}$?) Since $f$ is applied elementwise, $z_i = f(x_i)$, so

$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} f(x_i) = \begin{cases} f'(x_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
We can write this as $\frac{\partial z}{\partial x} = \mathrm{diag}(f'(x))$. Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write $\circ f'(x)$ when applying the chain rule.
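To see the $\mathrm{diag}(f'(x))$ identity concretely, here is a small check using $f = \tanh$ as an assumed example of a smooth elementwise function (so $f'(x) = 1 - \tanh^2(x)$):

```python
import numpy as np

# Numerical check of dz/dx = diag(f'(x)) for the elementwise function f = tanh.
rng = np.random.default_rng(0)
m = 5
x = rng.standard_normal(m)

eps = 1e-6
jac = np.zeros((m, m))
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    jac[:, j] = (np.tanh(x + dx) - np.tanh(x - dx)) / (2 * eps)

# Off-diagonal entries are zero; the diagonal holds f'(x) = 1 - tanh(x)^2.
assert np.allclose(jac, np.diag(1 - np.tanh(x) ** 2), atol=1e-5)
```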
(5) Matrix times column vector with respect to the matrix ($z = Wx$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = \delta\frac{\partial z}{\partial W}$?), where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{m}$, and $z \in \mathbb{R}^{n}$. Writing out a single entry of $z$:

$$z_k = \sum_{l=1}^{m} W_{kl} x_l$$

so

$$\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^{m} x_l \frac{\partial}{\partial W_{ij}} W_{kl}$$

Note that $\frac{\partial}{\partial W_{ij}} W_{kl} = 1$ if $i = k$ and $j = l$, and $0$ otherwise. So if $k \neq i$, everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when $l = j$, so

$$\frac{\partial z_k}{\partial W_{ij}} = x_j$$
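A quick numerical spot-check of this per-entry result: perturbing a single entry $W_{ij}$ should change only $z_i$, and by exactly $x_j$ per unit of perturbation (the index choice below is arbitrary):

```python
import numpy as np

# Spot-check dz_k/dW_ij for z = W x by perturbing one entry of W.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
i, j = 1, 2  # an arbitrary entry of W

eps = 1e-6
dW = np.zeros((n, m))
dW[i, j] = eps
dz = ((W + dW) @ x - (W - dW) @ x) / (2 * eps)  # dz_k/dW_ij for all k

expected = np.zeros(n)
expected[i] = x[j]  # nonzero only at k = i, where it equals x_j
assert np.allclose(dz, expected, atol=1e-5)
```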
Now let’s compute
$$\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W_{ij}} = \delta\frac{\partial z}{\partial W_{ij}} = \sum_{k=1}^{n} \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j$$

(the only nonzero term in the sum is $\delta_i \frac{\partial z_i}{\partial W_{ij}}$). To get $\frac{\partial J}{\partial W}$, we want a matrix where entry $(i, j)$ is $\delta_i x_j$. This matrix is equal to the outer product

$$\frac{\partial J}{\partial W} = \delta^{T} x^{T}$$

where $\delta^{T}$ is an $n \times 1$ column vector and $x^{T}$ is a $1 \times m$ row vector.
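To verify the outer-product form numerically, we need a concrete scalar loss; the sketch below uses the toy loss $J = a \cdot z$ (so $\delta = a$), where $a$ is an assumed stand-in for whatever upstream gradient the network produces:

```python
import numpy as np

# Check dJ/dW = outer(delta, x) for z = W x and the toy loss J = a . z.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
a = rng.standard_normal(n)  # delta = dJ/dz = a for this toy loss

def J(W):
    return a @ (W @ x)

eps = 1e-6
grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        dW = np.zeros((n, m))
        dW[i, j] = eps
        grad[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)

assert np.allclose(grad, np.outer(a, x), atol=1e-5)  # delta^T x^T as an outer product
```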
(6) Row vector times matrix with respect to the matrix ($z = xW$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \delta\frac{\partial z}{\partial W}$?), where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{1 \times n}$, and $z \in \mathbb{R}^{1 \times m}$. A computation similar to (5) shows that

$$\frac{\partial J}{\partial W} = x^{T} \delta$$
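The analogous check for the row-vector case, again with a toy loss chosen so that $\delta$ is known in closed form:

```python
import numpy as np

# Check dJ/dW = outer(x, delta) = x^T delta for z = x W and the toy loss J = z . a.
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)   # row-vector input
a = rng.standard_normal(m)   # delta = dJ/dz = a for this toy loss

def J(W):
    return (x @ W) @ a

eps = 1e-6
grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        dW = np.zeros((n, m))
        dW[i, j] = eps
        grad[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)

assert np.allclose(grad, np.outer(x, a), atol=1e-5)  # x^T delta as an outer product
```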
(7) Cross-entropy loss with respect to logits ($\hat{y} = \mathrm{softmax}(\theta)$, $J = CE(y, \hat{y})$, what is $\frac{\partial J}{\partial \theta}$?), where $y$ is the true label distribution (e.g., a one-hot vector):

$$\frac{\partial J}{\partial \theta} = \hat{y} - y$$
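Finally, a numerical check of the softmax/cross-entropy gradient, with an arbitrarily chosen one-hot label:

```python
import numpy as np

# Numerical check of dJ/dtheta = y_hat - y for J = CE(y, softmax(theta)).
rng = np.random.default_rng(0)
k = 5
theta = rng.standard_normal(k)
y = np.zeros(k)
y[2] = 1.0  # one-hot true label; class 2 is an arbitrary choice

def softmax(t):
    e = np.exp(t - t.max())  # shift for numerical stability
    return e / e.sum()

def J(t):
    return -np.sum(y * np.log(softmax(t)))  # cross-entropy loss

eps = 1e-6
grad = np.array([(J(theta + eps * np.eye(k)[i]) - J(theta - eps * np.eye(k)[i])) / (2 * eps)
                 for i in range(k)])

assert np.allclose(grad, softmax(theta) - y, atol=1e-5)
```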