Improving Deep Neural Networks

1.10 Vanishing/Exploding Gradients

This article explains the topic particularly well: 详解机器学习中的梯度消失、爆炸原因及其解决方法 (a detailed explanation of the causes of vanishing/exploding gradients in machine learning and how to address them)

One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.

Assume the activation function is $g(z) = z$, i.e. the activation is linear, and that there is no bias term $b$.

$$
\begin{aligned}
z^{[1]} &= W^{[1]} x \\
a^{[1]} &= z^{[1]} \\
z^{[2]} &= W^{[2]} a^{[1]} = W^{[2]} W^{[1]} x \\
&\;\;\vdots \\
\hat{y} &= W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[2]} W^{[1]} x
\end{aligned}
$$

  • If the weights W are all just a little bit bigger than one, that is, just a little bit bigger than the identity matrix, then with a very deep network the activations can explode.
  • And if W is just a little bit less than the identity, say 0.9 times the identity, then with a very deep network the activations will decrease exponentially.
  • And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives or gradients the computer is going to compute will also increase or decrease exponentially as a function of the number of layers (see the sketch after this list).
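As a quick illustration of this argument, here is a minimal numpy sketch (my own, not from the course) of a deep linear network whose layers all use $W = \text{scale} \cdot I$; the layer count and the scale values 1.5 and 0.9 are made-up example numbers.

```python
# Minimal sketch: deep "linear" network y_hat = W^[L] ... W^[1] x with weights
# slightly bigger or slightly smaller than the identity.
import numpy as np

np.random.seed(0)
n, L = 4, 50                      # units per layer, number of layers (example values)
x = np.random.randn(n, 1)

for scale in (1.5, 0.9):          # every layer uses W = scale * I
    a = x
    W = scale * np.eye(n)
    for _ in range(L):
        a = W @ a                 # linear activation g(z) = z, no bias
    print(f"scale={scale}: norm of a^[L] = {np.linalg.norm(a):.3e}")

# scale=1.5 -> the activation norm blows up (roughly 1.5^L times larger);
# scale=0.9 -> it shrinks toward zero (roughly 0.9^L times smaller).
```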

1.11 Weight initialization for deep networks


It turns out that a partial solution to vanishing/exploding gradients (it doesn't solve the problem entirely, but it helps a lot) is a better or more careful choice of the random initialization for your neural network.

So in order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want $W_i$ to be, right? Because z is the sum of the $W_i x_i$, and so if you're adding up a lot of these terms you want each of these terms to be smaller. One reasonable thing to do would be to set the variance of $W_i$ to be equal to 1 over n, where n is the number of input features going into the neuron (see the quick numerical check below).
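Here is a small numerical check of that variance argument (my own sketch, with made-up values of n): drawing each $W_i$ with variance $1/n$ keeps $z = \sum_i W_i x_i$ at roughly unit variance no matter how large n gets.

```python
# Check: if Var(w_i) = 1/n, then z = sum_i w_i * x_i stays at roughly unit variance.
import numpy as np

np.random.seed(1)
for n in (10, 100, 1000):
    x = np.random.randn(10000, n)               # inputs with mean 0, variance 1
    w = np.random.randn(10000, n) / np.sqrt(n)  # weights with Var(w_i) = 1/n
    z = np.sum(w * x, axis=1)                   # z = sum_i w_i x_i, one z per example
    print(f"n={n:5d}: Var(z) = {z.var():.3f}")

# Without the 1/sqrt(n) factor, Var(z) would grow linearly with n.
```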

ReLU (He initialization): $\mathrm{Var}(W) = \dfrac{2}{n^{[l-1]}}$, i.e. `W[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1])`, where $n^{[l-1]}$ denotes the number of units in layer $l-1$.

tanh (Xavier initialization): $\mathrm{Var}(W) = \dfrac{1}{n^{[l-1]}}$.

So in practice, what you can do is set the weight matrix W for a certain layer to be np.random.randn with whatever the shape of the matrix is, and then times the square root of 1 over the number of features fed into each neuron, which is $n^{[l-1]}$, because that's the number of units feeding into each of the units in layer $l$.

So if the input features or activations are roughly mean 0 and variance 1, then this would cause z to also take on a similar scale. This doesn't solve the problem, but it definitely helps reduce the vanishing/exploding gradients problem, because it's trying to set each of the weight matrices W so that it's not too much bigger than 1 and not too much less than 1, so it doesn't explode or vanish too quickly. The ReLU and tanh formulas above are some of the variants (see the sketch below).
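Putting the two formulas together, a minimal numpy sketch of this kind of initialization might look like the following; the helper name `initialize_layer` and the layer sizes are my own illustrative choices, not from the course.

```python
import numpy as np

def initialize_layer(n_prev, n_curr, activation="relu"):
    """Return (W, b) for a layer with n_prev inputs and n_curr units."""
    if activation == "relu":            # He initialization: Var(W) = 2 / n^[l-1]
        W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)
    else:                               # tanh (Xavier): Var(W) = 1 / n^[l-1]
        W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
    b = np.zeros((n_curr, 1))           # biases are simply initialized to zero
    return W, b

W1, b1 = initialize_layer(n_prev=1024, n_curr=256, activation="relu")
print(W1.std())   # roughly sqrt(2/1024), about 0.044
```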
