Computing Neural Network Gradients

  1. Matrix times column vector with respect to the column vector ($z = Wx$, what is $\frac{\partial z}{\partial x}$?)
    where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{m}$

    $\frac{\partial z}{\partial x} = W$

  2. Row vector times matrix with respect to the row vector ($z = xW$, what is $\frac{\partial z}{\partial x}$?)
    where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{1 \times n}$

    $\frac{\partial z}{\partial x} = W^T$

  3. A vector with itself ($z = x$, what is $\frac{\partial z}{\partial x}$?) This is just the identity matrix:

    $\frac{\partial z}{\partial x} = I$

    When applying the chain rule, this term will disappear, because multiplying a matrix by the identity matrix leaves it unchanged.

  4. An elementwise function applied to a vector ($z = f(x)$, what is $\frac{\partial z}{\partial x}$?)

    $\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} f(x_i) = \begin{cases} f'(x_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$

    We can write this as $\frac{\partial z}{\partial x} = \mathrm{diag}(f'(x))$. Since multiplication by a diagonal matrix is the same as elementwise multiplication by the diagonal, we could also write $f'(x) \odot$ when applying the chain rule. (Identities 1–4 are checked numerically in the first sketch after this list.)

  5. Matrix times column vector with respect to the matrix ($z = Wx$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z} \frac{\partial z}{\partial W} = \delta \frac{\partial z}{\partial W}$?) where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{m}$, $z \in \mathbb{R}^{n}$

    $z_k = \sum_{l=1}^{m} W_{kl} x_l$

    $\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^{m} x_l \frac{\partial}{\partial W_{ij}} W_{kl}$

    Note that $\frac{\partial}{\partial W_{ij}} W_{kl} = 1$ if $i = k$ and $j = l$, and $0$ otherwise. So if $k \neq i$, everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when $l = j$, so
    $\frac{\partial z_k}{\partial W_{ij}} = x_j$

    Now let's compute $\frac{\partial J}{\partial W_{ij}}$:
    $\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial z} \frac{\partial z}{\partial W_{ij}} = \delta \frac{\partial z}{\partial W_{ij}} = \sum_{k=1}^{n} \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j$

    (the only nonzero term in the sum is $\delta_i \frac{\partial z_i}{\partial W_{ij}}$). To get $\frac{\partial J}{\partial W}$ we want a matrix where entry $(i, j)$ is $\delta_i x_j$. This matrix is equal to the outer product
    $\frac{\partial J}{\partial W} = \delta x^T$

  6. Row vector times matrix with respect to the matrix ($z = xW$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \delta \frac{\partial z}{\partial W}$?) where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{1 \times n}$, $z \in \mathbb{R}^{1 \times m}$
    A similar computation to (5) shows that

    $\frac{\partial J}{\partial W} = x^T \delta$

    (identities 5–6 are checked in the second sketch after this list).

  7. Cross-entropy loss with respect to logits ($\hat{y} = \mathrm{softmax}(\theta)$, $J = CE(y, \hat{y})$, what is $\frac{\partial J}{\partial \theta}$?), where $y$ is a one-hot label vector
    $\frac{\partial J}{\partial \theta} = \hat{y} - y$
    (checked numerically in the final sketch after this list).
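
The identities above are easy to sanity-check numerically. Below is a minimal sketch for identities (1)–(4), assuming only NumPy; `jacobian_fd` is a small helper written for this check (not a library function), and `tanh` stands in for a generic elementwise $f$.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Estimate the Jacobian dz/dx of f at x by central finite differences."""
    z = f(x)
    J = np.zeros((z.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)   # "column" vector, stored as a 1-D array
v = rng.standard_normal(n)   # "row" vector for identity (2)

# (1) z = Wx  =>  dz/dx = W
assert np.allclose(jacobian_fd(lambda x: W @ x, x), W)
# (2) z = xW  =>  dz/dx = W^T
assert np.allclose(jacobian_fd(lambda v: v @ W, v), W.T)
# (3) z = x   =>  dz/dx = I
assert np.allclose(jacobian_fd(lambda x: x, x), np.eye(m))
# (4) z = f(x) elementwise (f = tanh here)  =>  dz/dx = diag(f'(x))
assert np.allclose(jacobian_fd(np.tanh, x), np.diag(1 - np.tanh(x) ** 2))
print("identities (1)-(4) verified")
```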
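
The same trick covers the weight gradients in (5) and (6). Taking $J = \delta \cdot z$ makes $\frac{\partial J}{\partial z} = \delta$ exactly, so the finite-difference gradient with respect to $W$ should match the outer-product formulas. `grad_W_fd` is again a helper written only for this sketch.

```python
import numpy as np

def grad_W_fd(J, W, eps=1e-6):
    """Estimate dJ/dW entrywise by central finite differences."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            dW = np.zeros_like(W)
            dW[i, j] = eps
            G[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
n, m = 4, 3
W = rng.standard_normal((n, m))

# (5) z = Wx with delta = dJ/dz in R^n  =>  dJ/dW = delta x^T
x = rng.standard_normal(m)
delta = rng.standard_normal(n)
assert np.allclose(grad_W_fd(lambda W: delta @ (W @ x), W), np.outer(delta, x))

# (6) z = xW with delta = dJ/dz in R^(1 x m)  =>  dJ/dW = x^T delta
x_row = rng.standard_normal(n)
delta_row = rng.standard_normal(m)
assert np.allclose(grad_W_fd(lambda W: delta_row @ (x_row @ W), W),
                   np.outer(x_row, delta_row))
print("identities (5)-(6) verified")
```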
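
Finally, a check of (7) under the one-hot assumption. `softmax` and `cross_entropy` below are written out directly for the check rather than imported from a library.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

rng = np.random.default_rng(2)
k = 5
theta = rng.standard_normal(k)
y = np.zeros(k)
y[2] = 1.0                            # one-hot label

# Central-difference gradient of J = CE(y, softmax(theta)) w.r.t. theta
eps = 1e-6
g = np.zeros(k)
for j in range(k):
    d = np.zeros(k)
    d[j] = eps
    g[j] = (cross_entropy(y, softmax(theta + d))
            - cross_entropy(y, softmax(theta - d))) / (2 * eps)

assert np.allclose(g, softmax(theta) - y)   # dJ/dtheta = y_hat - y
print("identity (7) verified")
```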