Coursera | Andrew Ng (02-week-3-3.4) - Normalizing Activations in a Network

This series only adds personal study notes and supplementary derivations on top of the original course material; if you find any errors, corrections are welcome. The notes were compiled from Andrew Ng's course to make review and look-up easier. Since I have been studying English, the series is primarily in English, and readers are encouraged to rely mainly on the English with the Chinese as support, as preparation for reading academic papers in this field later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79119257


3.4 Normalizing activations in a network

(Subtitle source: NetEase Cloud Classroom)


In the rise of deep learning, one of the most important ideas has been an algorithm called Batch Normalization, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch Normalization makes your hyperparameter search problem much easier, makes the neural network much more robust to the choice of hyperparameters, so that a much bigger range of hyperparameters works well, and will also enable you to train even very deep networks much more easily. Let's see how Batch Normalization works.

When training a model such as logistic regression, you might remember that normalizing the input features can speed up learning. You compute the means, subtract them off from your training set, and compute the variances (the average of the element-wise squares of $x^{(i)}$ after the mean has been subtracted); then you normalize your data set according to the variances. We saw in an earlier video how this can turn the contours of your learning problem from something that might be very elongated into something more round, which is easier for an algorithm like gradient descent to optimize. So this works for normalizing the input feature values to a neural network or to logistic regression.
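As a concrete illustration, below is a minimal NumPy sketch of this input normalization; the array name `X` and the layout (features in rows, training examples in columns) are assumptions for the example, not notation from the lecture.

```python
import numpy as np

# Hypothetical training set: n_x features in rows, m examples in columns.
X = np.random.randn(3, 1000) * 5.0 + 2.0

mu = np.mean(X, axis=1, keepdims=True)           # per-feature mean
X = X - mu                                       # subtract the mean
sigma2 = np.mean(X ** 2, axis=1, keepdims=True)  # per-feature variance (element-wise squaring)
X = X / np.sqrt(sigma2)                          # scale so each feature has unit variance

# Each row of X now has (approximately) zero mean and unit variance.
```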

Now, how about a deeper model? You have not just input features x; in this layer you have activations $a^{[1]}$, in this layer activations $a^{[2]}$, and so on. So if you want to train the parameters, say, $w^{[3]}$ and $b^{[3]}$, wouldn't it be nice if you could normalize the mean and variance of $a^{[2]}$ to make the training of $w^{[3]}$ and $b^{[3]}$ more efficient? In the case of logistic regression, we saw how normalizing $x_1$, $x_2$, $x_3$ can help you train $w$ and $b$ more efficiently. So here the question is: for any hidden layer, can we normalize the values of $a$, say $a^{[2]}$ in this example but really any hidden layer, so as to train $w^{[3]}$ and $b^{[3]}$ faster, since $a^{[2]}$ is the input to the next layer and therefore affects the training of $w^{[3]}$ and $b^{[3]}$? This is what Batch Normalization, or Batch Norm for short, does. Technically, though, we will actually normalize not the values of $a^{[2]}$ but the values of $z^{[2]}$. There is some debate in the deep learning literature about whether you should normalize the value before the activation function, $z^{[2]}$, or after applying the activation function, $a^{[2]}$. In practice, normalizing $z^{[2]}$ is done much more often, so that is the version I present here and the one I would recommend as the default choice.

So here is how you would implement Batch Norm. Given some intermediate values in your neural net, let's say you have some hidden unit values $z^{(1)}$ up to $z^{(m)}$, and these really come from some hidden layer $l$, so it would be more accurate to write them as $z^{[l](i)}$ for $i = 1$ through $m$; but to simplify the notation on this line, I am going to omit the square bracket $[l]$. Given these values, you compute the mean $\mu = \frac{1}{m}\sum_i z^{(i)}$ (again, all of this is specific to some layer $l$, but I am omitting the $[l]$), and then you compute the variance $\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$, using pretty much the formula you would expect. Then you take each of the $z^{(i)}$'s and normalize it: $z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, subtracting off the mean and dividing by the standard deviation. For numerical stability, you usually add an epsilon to the denominator, just in case $\sigma^2$ turns out to be zero in some estimate. So now we have taken these values $z$ and normalized them to have mean zero and unit variance: every component of $z$ has mean zero and variance one. But we don't want the hidden units to always have mean zero and variance one; maybe it makes sense for the hidden units to have a different distribution. So what we do instead is compute $\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$, where $\gamma$ and $\beta$ are learnable parameters of your model. So using gradient descent, or some other algorithm like gradient descent with momentum, or Nesterov, or Adam, you would update the parameters $\gamma$ and $\beta$ just as you update the weights of the neural network.
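The four equations above map directly onto code. Here is a minimal sketch of the Batch Norm forward step for one hidden layer; the function and variable names (`batch_norm_forward`, `Z`, `gamma`, `beta`, `eps`) and the shape convention (hidden units in rows, mini-batch examples in columns) are my own illustrative choices, not a reference implementation from the course.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize the pre-activations Z (shape: n_units x m) over the mini-batch,
    then rescale and shift with the learnable parameters gamma and beta."""
    mu = np.mean(Z, axis=1, keepdims=True)       # mean over the mini-batch
    sigma2 = np.var(Z, axis=1, keepdims=True)    # variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta              # learnable mean and variance
    return Z_tilde, Z_norm, mu, sigma2

# Example: a layer with 4 hidden units and a mini-batch of 64 examples.
Z = np.random.randn(4, 64)
gamma = np.ones((4, 1))    # starts out as plain normalization
beta = np.zeros((4, 1))
Z_tilde, Z_norm, mu, sigma2 = batch_norm_forward(Z, gamma, beta)
```

With gamma initialized to ones and beta to zeros, the layer initially outputs exactly the normalized values, and gradient descent is then free to move the mean and variance away from 0 and 1 if that helps.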

Now, notice that the effect of $\gamma$ and $\beta$ is that they allow you to set the mean of $\tilde{z}$ to be whatever you want it to be. In fact, if $\gamma = \sqrt{\sigma^2 + \varepsilon}$, that is, if $\gamma$ were equal to the denominator term, and if $\beta$ were equal to $\mu$, the value up above, then the effect of $\gamma z^{(i)}_{\text{norm}} + \beta$ is that it would exactly invert the normalization equation. So if this is true, then actually $\tilde{z}^{(i)} = z^{(i)}$. By an appropriate setting of the parameters $\gamma$ and $\beta$, this normalization step, that is, these four equations, is essentially just computing the identity function. By choosing other values of $\gamma$ and $\beta$, you can make the hidden unit values have other means and variances as well. The way you fit this into your network is that whereas previously you were using the values $z^{(1)}$, $z^{(2)}$, and so on, you would now use $\tilde{z}^{(i)}$ instead of $z^{(i)}$ for the later computations in your neural network. And if you want to put back the square bracket $[l]$ to denote explicitly which layer this is in, you can put it back there.
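To make the identity-function point concrete, the following check (reusing the hypothetical `batch_norm_forward` sketch above) sets $\gamma = \sqrt{\sigma^2 + \varepsilon}$ and $\beta = \mu$ and confirms that $\tilde{z}^{(i)}$ recovers $z^{(i)}$; this is only an illustrative verification.

```python
import numpy as np

eps = 1e-8
Z = np.random.randn(4, 64)
mu = np.mean(Z, axis=1, keepdims=True)
sigma2 = np.var(Z, axis=1, keepdims=True)

# Choose gamma and beta so that the scale-and-shift exactly undoes the normalization.
gamma = np.sqrt(sigma2 + eps)
beta = mu

Z_tilde, _, _, _ = batch_norm_forward(Z, gamma, beta, eps)  # sketch defined above
print(np.allclose(Z_tilde, Z))   # True: the four equations reduce to the identity
```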

So the intuition I hope you take away from this is that we saw how normalizing the input features $x$ can help learning in a neural network. What Batch Norm does is apply that normalization process not just to the input layer but also to values deep in some hidden layer of the network: you apply this type of normalization to the mean and variance of some of your hidden unit values $z$. One difference between the training inputs and these hidden unit values, though, is that you might not want your hidden unit values to be forced to have mean zero and variance one. For example, if you have a sigmoid activation function, you don't want your values to always be clustered around zero; you might want them to have a larger variance, or a mean different from zero, in order to better take advantage of the non-linearity of the sigmoid function, rather than having all your values lie only in its roughly linear regime. That is why, with the parameters $\gamma$ and $\beta$, you can make sure that your $z^{(i)}$ values have the range of values that you want. What it really does is ensure that your hidden units have a standardized mean and variance, where the mean and variance are controlled by two explicit parameters, $\gamma$ and $\beta$, which the learning algorithm can set to whatever it wants. So it normalizes the mean and variance of these hidden unit values, really the $z^{[l](i)}$'s, to have some fixed mean and variance, and that mean and variance could be zero and one, or could be some other values, controlled by the parameters $\gamma$ and $\beta$.
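As a rough numeric illustration of that point (my own example, not from the lecture): with mean 0 and variance 1, most pre-activations land where the sigmoid is nearly linear, while a larger variance pushes many of them into the curved, saturating parts of the function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_unit = np.random.randn(100_000)          # mean 0, variance 1
z_wide = 4.0 * np.random.randn(100_000)    # mean 0, much larger variance

def frac_saturated(z):
    a = sigmoid(z)
    return np.mean((a < 0.2) | (a > 0.8))  # fraction in the flat, saturating tails

print(frac_saturated(z_unit))   # roughly 0.17: mostly in the near-linear middle
print(frac_saturated(z_wide))   # roughly 0.73: the non-linearity is actually exercised
```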

So I hope that gives you a sense of the mechanics of how to implement Batch Norm, at least for a single layer of the neural network. In the next video I want to show you how to fit Batch Norm into a neural network, even a deep neural network, and how to make it work for the many different layers of a network. After that, we will build some more intuition about why Batch Norm can help you train your neural networks. So in case why it works still seems a little bit mysterious, stay with me; I think within the next two videos we will make that clear.


Key points:

Normalizing activations in the network

In logistic regression, normalizing the input features speeds up model training. For deeper neural networks, can we likewise normalize the hidden-layer outputs $a^{[l]}$, or the pre-activation values $z^{[l]}$, in order to speed up training? The answer is yes.

The common practice is to normalize the hidden layer's pre-activation values $z^{[l]}$.

Implementing Batch Norm

Take the intermediate values of some hidden layer: $z^{(1)}, z^{(2)}, \dots, z^{(m)}$.

$$\mu = \frac{1}{m}\sum_i z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_i \left(z^{(i)} - \mu\right)^2, \qquad z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

Here $\varepsilon$ is added to keep the computation numerically stable.

At this point every component of $z$ has mean 0 and variance 1, but we do not always want the hidden units constrained this way; a different distribution may be more useful, so we compute further:

$$\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$$

Here $\gamma$ and $\beta$ are learnable parameters, updated just like the network weights $w$; their values determine the distribution (mean and variance) of $\tilde{z}^{(i)}$.
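Since $\gamma$ and $\beta$ are learned exactly like the weights, the update step looks the same as for $w$. Below is a minimal sketch of the parameter gradients through $\tilde{z} = \gamma z_{\text{norm}} + \beta$ and a plain gradient-descent update; the incoming gradient `dZ_tilde` and all names are illustrative assumptions.

```python
import numpy as np

def batch_norm_param_grads(dZ_tilde, Z_norm):
    """Gradients of the loss w.r.t. gamma and beta, given the gradient dZ_tilde
    flowing into Z_tilde and the cached normalized values Z_norm (n_units x m)."""
    dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)
    dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)
    return dgamma, dbeta

# Plain gradient-descent update, exactly as for the weights w
# (momentum or Adam could be used instead):
# gamma = gamma - learning_rate * dgamma
# beta  = beta  - learning_rate * dbeta
```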

References:

[1] 大树先生. Notes on Andrew Ng's Coursera DeepLearning.ai Deep Learning course (2-3): Hyperparameter tuning and Batch Norm.


PS: You are welcome to follow the WeChat official account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in activities for early rising, reading, exercise, English, and more.
