Coursera | Andrew Ng (02-week-3-3.4) - Normalizing Activations in a Network

This series only adds personal study notes and supplementary derivations on top of the original course material; if you find any errors, corrections are welcome. The notes were compiled from Andrew Ng's course to make review and look-up easier. Since I have been studying English, the series is primarily in English, and readers are encouraged to rely mainly on the English with the Chinese as support, as preparation for reading academic papers in this field later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79119257


3.4 Normalizing activations in a network

(Subtitle source: NetEase Cloud Classroom)


In the rise of deep learning, one of the most important ideas has been an algorithm called Batch Normalization, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch Normalization makes your hyperparameter search problem much easier, makes the neural network much more robust to the choice of hyperparameters, so that a much bigger range of hyperparameters works well, and will also enable you to train even very deep networks much more easily. Let's see how Batch Normalization works.

When training a model such as logistic regression, you might remember that normalizing the input features can speed up learning. You compute the means, subtract them off from your training set, and compute the variances (the average of the element-wise squares of $x^{(i)}$ after the mean has been subtracted); then you normalize your data set according to the variances. We saw in an earlier video how this can turn the contours of your learning problem from something that might be very elongated into something more round, which is easier for an algorithm like gradient descent to optimize. So this works for normalizing the input feature values to a neural network or to logistic regression.
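As a concrete illustration, below is a minimal NumPy sketch of this input normalization; the array name `X` and the layout (features in rows, training examples in columns) are assumptions for the example, not notation from the lecture.

```python
import numpy as np

# Hypothetical training set: n_x features in rows, m examples in columns.
X = np.random.randn(3, 1000) * 5.0 + 2.0

mu = np.mean(X, axis=1, keepdims=True)           # per-feature mean
X = X - mu                                       # subtract the mean
sigma2 = np.mean(X ** 2, axis=1, keepdims=True)  # per-feature variance (element-wise squaring)
X = X / np.sqrt(sigma2)                          # scale so each feature has unit variance

# Each row of X now has (approximately) zero mean and unit variance.
```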

Now, how about a deeper model? You have not just input features x; in this layer you have activations $a^{[1]}$, in this layer activations $a^{[2]}$, and so on. So if you want to train the parameters, say, $w^{[3]}$ and $b^{[3]}$, wouldn't it be nice if you could normalize the mean and variance of $a^{[2]}$ to make the training of $w^{[3]}$ and $b^{[3]}$ more efficient? In the case of logistic regression, we saw how normalizing $x_1$, $x_2$, $x_3$ can help you train $w$ and $b$ more efficiently. So here the question is: for any hidden layer, can we normalize the values of $a$, say $a^{[2]}$ in this example but really any hidden layer, so as to train $w^{[3]}$ and $b^{[3]}$ faster, since $a^{[2]}$ is the input to the next layer and therefore affects the training of $w^{[3]}$ and $b^{[3]}$? This is what Batch Normalization, or Batch Norm for short, does. Technically, though, we will actually normalize not the values of $a^{[2]}$ but the values of $z^{[2]}$. There is some debate in the deep learning literature about whether you should normalize the value before the activation function, $z^{[2]}$, or after applying the activation function, $a^{[2]}$. In practice, normalizing $z^{[2]}$ is done much more often, so that is the version I present here and the one I would recommend as the default choice.

So here is how you would implement Batch Norm. Given some intermediate values in your neural net, let's say you have some hidden unit values $z^{(1)}$ up to $z^{(m)}$, and these really come from some hidden layer $l$, so it would be more accurate to write them as $z^{[l](i)}$ for $i = 1$ through $m$; but to simplify the notation on this line, I am going to omit the square bracket $[l]$. Given these values, you compute the mean $\mu = \frac{1}{m}\sum_i z^{(i)}$ (again, all of this is specific to some layer $l$, but I am omitting the $[l]$), and then you compute the variance $\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$, using pretty much the formula you would expect. Then you take each of the $z^{(i)}$'s and normalize it: $z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, subtracting off the mean and dividing by the standard deviation. For numerical stability, you usually add an epsilon to the denominator, just in case $\sigma^2$ turns out to be zero in some estimate. So now we have taken these values $z$ and normalized them to have mean zero and unit variance: every component of $z$ has mean zero and variance one. But we don't want the hidden units to always have mean zero and variance one; maybe it makes sense for the hidden units to have a different distribution. So what we do instead is compute $\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$, where $\gamma$ and $\beta$ are learnable parameters of your model. So using gradient descent, or some other algorithm like gradient descent with momentum, or Nesterov, or Adam, you would update the parameters $\gamma$ and $\beta$ just as you update the weights of the neural network.
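The four equations above map directly onto code. Here is a minimal sketch of the Batch Norm forward step for one hidden layer; the function and variable names (`batch_norm_forward`, `Z`, `gamma`, `beta`, `eps`) and the shape convention (hidden units in rows, mini-batch examples in columns) are my own illustrative choices, not a reference implementation from the course.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize the pre-activations Z (shape: n_units x m) over the mini-batch,
    then rescale and shift with the learnable parameters gamma and beta."""
    mu = np.mean(Z, axis=1, keepdims=True)       # mean over the mini-batch
    sigma2 = np.var(Z, axis=1, keepdims=True)    # variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta              # learnable mean and variance
    return Z_tilde, Z_norm, mu, sigma2

# Example: a layer with 4 hidden units and a mini-batch of 64 examples.
Z = np.random.randn(4, 64)
gamma = np.ones((4, 1))    # starts out as plain normalization
beta = np.zeros((4, 1))
Z_tilde, Z_norm, mu, sigma2 = batch_norm_forward(Z, gamma, beta)
```

With gamma initialized to ones and beta to zeros, the layer initially outputs exactly the normalized values, and gradient descent is then free to move the mean and variance away from 0 and 1 if that helps.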

Now, notice that the effect of $\gamma$ and $\beta$ is that they allow you to set the mean of $\tilde{z}$ to be whatever you want it to be. In fact, if $\gamma = \sqrt{\sigma^2 + \varepsilon}$, that is, if $\gamma$ were equal to the denominator term, and if $\beta$ were equal to $\mu$, the value up above, then the effect of $\gamma z^{(i)}_{\text{norm}} + \beta$ is that it would exactly invert the normalization equation. So if this is true, then actually $\tilde{z}^{(i)} = z^{(i)}$. By an appropriate setting of the parameters $\gamma$ and $\beta$, this normalization step, that is, these four equations, is essentially just computing the identity function. By choosing other values of $\gamma$ and $\beta$, you can make the hidden unit values have other means and variances as well. The way you fit this into your network is that whereas previously you were using the values $z^{(1)}$, $z^{(2)}$, and so on, you would now use $\tilde{z}^{(i)}$ instead of $z^{(i)}$ for the later computations in your neural network. And if you want to put back the square bracket $[l]$ to denote explicitly which layer this is in, you can put it back there.
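To make the identity-function point concrete, the following check (reusing the hypothetical `batch_norm_forward` sketch above) sets $\gamma = \sqrt{\sigma^2 + \varepsilon}$ and $\beta = \mu$ and confirms that $\tilde{z}^{(i)}$ recovers $z^{(i)}$; this is only an illustrative verification.

```python
import numpy as np

eps = 1e-8
Z = np.random.randn(4, 64)
mu = np.mean(Z, axis=1, keepdims=True)
sigma2 = np.var(Z, axis=1, keepdims=True)

# Choose gamma and beta so that the scale-and-shift exactly undoes the normalization.
gamma = np.sqrt(sigma2 + eps)
beta = mu

Z_tilde, _, _, _ = batch_norm_forward(Z, gamma, beta, eps)  # sketch defined above
print(np.allclose(Z_tilde, Z))   # True: the four equations reduce to the identity
```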

So the intuition I hope you take away from this is that we saw how normalizing the input features $x$ can help learning in a neural network. What Batch Norm does is apply that normalization process not just to the input layer but also to values deep in some hidden layer of the network: you apply this type of normalization to the mean and variance of some of your hidden unit values $z$. One difference between the training inputs and these hidden unit values, though, is that you might not want your hidden unit values to be forced to have mean zero and variance one. For example, if you have a sigmoid activation function, you don't want your values to always be clustered around zero; you might want them to have a larger variance, or a mean different from zero, in order to better take advantage of the non-linearity of the sigmoid function, rather than having all your values lie only in its roughly linear regime. That is why, with the parameters $\gamma$ and $\beta$, you can make sure that your $z^{(i)}$ values have the range of values that you want. What it really does is ensure that your hidden units have a standardized mean and variance, where the mean and variance are controlled by two explicit parameters, $\gamma$ and $\beta$, which the learning algorithm can set to whatever it wants. So it normalizes the mean and variance of these hidden unit values, really the $z^{[l](i)}$'s, to have some fixed mean and variance, and that mean and variance could be zero and one, or could be some other values, controlled by the parameters $\gamma$ and $\beta$.
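As a rough numeric illustration of that point (my own example, not from the lecture): with mean 0 and variance 1, most pre-activations land where the sigmoid is nearly linear, while a larger variance pushes many of them into the curved, saturating parts of the function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_unit = np.random.randn(100_000)          # mean 0, variance 1
z_wide = 4.0 * np.random.randn(100_000)    # mean 0, much larger variance

def frac_saturated(z):
    a = sigmoid(z)
    return np.mean((a < 0.2) | (a > 0.8))  # fraction in the flat, saturating tails

print(frac_saturated(z_unit))   # roughly 0.17: mostly in the near-linear middle
print(frac_saturated(z_wide))   # roughly 0.73: the non-linearity is actually exercised
```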

So I hope that gives you a sense of the mechanics of how to implement Batch Norm, at least for a single layer of the neural network. In the next video I want to show you how to fit Batch Norm into a neural network, even a deep neural network, and how to make it work for the many different layers of a network. After that, we will build some more intuition about why Batch Norm can help you train your neural networks. So in case why it works still seems a little bit mysterious, stay with me; I think within the next two videos we will make that clear.


Key points:

Normalizing activations in the network

In logistic regression, normalizing the input features speeds up model training. For deeper neural networks, can we likewise normalize the hidden-layer outputs $a^{[l]}$, or the pre-activation values $z^{[l]}$, in order to speed up training? The answer is yes.

The common practice is to normalize the hidden layer's pre-activation values $z^{[l]}$.

Implementing Batch Norm

Take the intermediate values of some hidden layer: $z^{(1)}, z^{(2)}, \dots, z^{(m)}$.

$$\mu = \frac{1}{m}\sum_i z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_i \left(z^{(i)} - \mu\right)^2, \qquad z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

Here $\varepsilon$ is added to keep the computation numerically stable.

At this point every component of $z$ has mean 0 and variance 1, but we do not always want the hidden units constrained this way; a different distribution may be more useful, so we compute further:

$$\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$$

Here $\gamma$ and $\beta$ are learnable parameters, updated just like the network weights $w$; their values determine the distribution (mean and variance) of $\tilde{z}^{(i)}$.
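Since $\gamma$ and $\beta$ are learned exactly like the weights, the update step looks the same as for $w$. Below is a minimal sketch of the parameter gradients through $\tilde{z} = \gamma z_{\text{norm}} + \beta$ and a plain gradient-descent update; the incoming gradient `dZ_tilde` and all names are illustrative assumptions.

```python
import numpy as np

def batch_norm_param_grads(dZ_tilde, Z_norm):
    """Gradients of the loss w.r.t. gamma and beta, given the gradient dZ_tilde
    flowing into Z_tilde and the cached normalized values Z_norm (n_units x m)."""
    dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)
    dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)
    return dgamma, dbeta

# Plain gradient-descent update, exactly as for the weights w
# (momentum or Adam could be used instead):
# gamma = gamma - learning_rate * dgamma
# beta  = beta  - learning_rate * dbeta
```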

References:

[1] 大树先生. Notes on Andrew Ng's Coursera DeepLearning.ai Deep Learning course (2-3): Hyperparameter tuning and Batch Norm.


PS: You are welcome to follow the WeChat official account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in activities for early rising, reading, exercise, English, and more.
