Coursera | Andrew Ng (01-week-3-3.6): Activation Functions

This series only adds personal study notes and supplementary derivations to selected points of the original course; corrections and criticism are welcome. Having studied Andrew Ng's course, I organized it into text to make review easier. Because I have been studying English, the series is primarily in English with Chinese as a supplement, which I also recommend to readers as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/78998236


3.6 Activation Functions

(Subtitle source: NetEase Cloud Classroom 网易云课堂)


When you build a neural network, one of the choices you get to make is which activation function to use in the hidden layers, as well as at the output unit of your neural network. So far we have just been using the sigmoid activation function, but sometimes other choices can work much better, so let's take a look at some of the options. In the forward-propagation steps for the neural network, we have two steps where we use the sigmoid function; that sigmoid is called an activation function, and it is the familiar $a = \frac{1}{1 + e^{-z}}$. In the more general case, we can use a different function $g(z)$, where $g$ can be a nonlinear function that need not be the sigmoid function.
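As a concrete, hedged illustration of these two steps, here is a minimal NumPy sketch (not code from the course) with a pluggable activation $g$; the layer sizes and variable names are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: a = 1 / (1 + e^(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(A_prev, W, b, g):
    """One forward-propagation step: Z = W @ A_prev + b, then A = g(Z)."""
    Z = W @ A_prev + b
    return g(Z)

# Tiny example: 2 input features, 3 hidden units, 1 training example (shapes assumed).
rng = np.random.default_rng(0)
A0 = rng.standard_normal((2, 1))
W1, b1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
A1 = layer_forward(A0, W1, b1, sigmoid)  # g is pluggable: any nonlinearity can be passed here
```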

For example, the sigmoid function goes between 0 and 1. An activation function that almost always works better than the sigmoid function is the tanh function, or hyperbolic tangent function: $a = \tanh(z)$, which goes between +1 and -1. The formula is $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, and mathematically it is a shifted version of the sigmoid function: take the sigmoid, shift it so that it crosses the origin, and rescale it so that it goes from -1 to +1.
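The "shifted and rescaled sigmoid" claim can be checked directly from the definitions, writing $\sigma(z) = \frac{1}{1 + e^{-z}}$:

$$
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{1 - e^{-2z}}{1 + e^{-2z}} = \frac{2}{1 + e^{-2z}} - 1 = 2\,\sigma(2z) - 1
$$

So tanh is the sigmoid squeezed by a factor of 2 along the input axis, stretched by 2 along the output axis, and shifted down by 1, which is why it crosses the origin and runs from -1 to +1.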


It turns out that for hidden units, if you let the activation function $g(z)$ be $\tanh(z)$, it almost always works better than the sigmoid function, because with values between +1 and -1 the mean of the activations coming out of the hidden layer is closer to zero. Just as you sometimes center your data so that it has zero mean when training a learning algorithm, using tanh instead of sigmoid has the effect of centering the data, so that the mean of the activations is close to 0 rather than 0.5, and this actually makes learning for the next layer a little bit easier. We will say more about this in the second course, when we talk about optimization algorithms. One takeaway is that I pretty much never use the sigmoid activation function anymore; the tanh function is almost always strictly superior.
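A quick numeric check of this centering effect (my own illustration; the pre-activations here are just standard-normal samples, not outputs of a trained network):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)                    # roughly zero-mean pre-activations

sigmoid_mean = (1.0 / (1.0 + np.exp(-z))).mean()    # sigmoid outputs cluster around 0.5
tanh_mean = np.tanh(z).mean()                       # tanh outputs cluster around 0.0

print(f"mean sigmoid activation: {sigmoid_mean:.3f}")   # close to 0.5
print(f"mean tanh activation:    {tanh_mean:.3f}")      # close to 0.0
```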

The one exception is the output layer, because if $y$ is either 0 or 1, then it makes sense for $\hat{y}$ to be a number between 0 and 1 rather than between -1 and 1. So the one exception where I would use the sigmoid activation function is binary classification, where you might use sigmoid for the output layer, i.e. $g(z^{[2]}) = \sigma(z^{[2]})$. In this example you might have a tanh activation function for the hidden layer and sigmoid for the output layer, so the activation functions can be different for different layers. To denote that, we use square-bracket superscripts: $g^{[1]}$ may be different from $g^{[2]}$, where superscript $[1]$ refers to the hidden layer and superscript $[2]$ refers to the output layer.
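Putting the two choices together, here is a hedged sketch of a two-layer forward pass with $g^{[1]} = \tanh$ and $g^{[2]} = \sigma$ (the shapes, initialization scale, and names are assumptions made for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Hidden layer uses tanh (g[1]); output layer uses sigmoid (g[2])."""
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)      # g[1] = tanh, hidden activations in (-1, 1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)      # g[2] = sigmoid, y_hat in (0, 1) for binary classification
    return A2

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 5))                                # 2 features, 5 examples
W1, b1 = 0.01 * rng.standard_normal((4, 2)), np.zeros((4, 1))
W2, b2 = 0.01 * rng.standard_normal((1, 4)), np.zeros((1, 1))
y_hat = forward(X, W1, b1, W2, b2)                             # shape (1, 5), values in (0, 1)
```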


One downside of both the sigmoid function and the tanh function is that if $z$ is very large or very small, then the derivative, the slope of the function, becomes very close to zero, and this can slow down gradient descent. So one choice that is very popular in machine learning is the rectified linear unit (ReLU). The ReLU function is $a = \max(0, z)$: the derivative is 1 as long as $z$ is positive, and the derivative (slope) is 0 when $z$ is negative. Technically, the derivative when $z$ is exactly 0 is not well defined, but when you implement this on a computer, the odds that you get $z$ exactly equal to 0 are very small.
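A minimal sketch of the ReLU and its slope; as the next paragraph notes, the value assigned to the derivative at exactly $z = 0$ is an arbitrary convention (here it is 0):

```python
import numpy as np

def relu(z):
    """ReLU: a = max(0, z), applied elementwise."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Slope is 1 for z > 0 and 0 for z < 0; at z == 0 we arbitrarily return 0,
    which is fine in practice because landing exactly on 0 is vanishingly unlikely."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.0, 0.0, 0.0, 0.5, 2.0]
print(relu_derivative(z))  # [0.0, 0.0, 0.0, 1.0, 1.0]
```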


So you don't need to worry about this in practice: when $z$ is exactly 0, you can pretend the derivative is either 1 or 0 and it works just fine, even though, strictly speaking, the function is not differentiable there. Here are some rules of thumb for choosing activation functions. If your output is a 0/1 value, that is, if you are doing binary classification, then the sigmoid activation function is a very natural choice for the output layer. For all other units, the ReLU, or rectified linear unit, is increasingly the default choice of activation function.

So if you are not sure what to use for your hidden layer, I would just use the ReLU activation function; that is what you see most people using these days, although sometimes people also use the tanh activation function. One disadvantage of the ReLU is that the derivative is equal to zero when $z$ is negative. In practice this works just fine, but there is another version called the leaky ReLU (the formula appears on a later slide): instead of being zero when $z$ is negative, it takes a slight slope. The leaky ReLU usually works better than the ReLU activation function, although it is just not used as much in practice. Either one should be fine.

If you had to pick one, I usually just use the ReLU. The advantage of both the ReLU and the leaky ReLU is that for much of the space of $z$, the derivative of the activation function, its slope, is far from zero. So in practice, using the ReLU activation function your neural network will often learn much faster than with the tanh or sigmoid activation function, mainly because there is less of the effect where the slope of the function goes to zero and slows down learning. For half of the range of $z$ the slope of the ReLU is zero, but in practice enough of your hidden units will have $z$ greater than zero, so learning can still be quite fast for most training examples.


Let's quickly recap the pros and cons of the different activation functions. The sigmoid activation function: I would say never use it except for the output layer when you are doing binary classification, or maybe almost never use it, because the tanh is pretty much strictly superior. The tanh activation function is the second option. Then the default, most commonly used activation function is the ReLU; if you are not sure what else to use, use this one. And feel free also to try the leaky ReLU, $a = \max(0.01z, z)$, i.e. $a$ is the maximum of $0.01z$ and $z$, which gives the function a slight bend on the negative side.
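A hedged sketch of the leaky ReLU with the 0.01 negative-side slope from the recap above (the parameter name is my own):

```python
import numpy as np

def leaky_relu(z, negative_slope=0.01):
    """Leaky ReLU: a = max(negative_slope * z, z), applied elementwise."""
    return np.maximum(negative_slope * z, z)

def leaky_relu_derivative(z, negative_slope=0.01):
    """Slope is 1 for z > 0 and negative_slope for z <= 0, so it never vanishes."""
    return np.where(z > 0, 1.0, negative_slope)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))             # approximately [-0.02, -0.005, 0.5, 2.0]
print(leaky_relu_derivative(z))  # [0.01, 0.01, 1.0, 1.0]
```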


You might ask: why is that constant 0.01? Well, you can also make it another parameter of the learning algorithm (a sketch of that idea follows this paragraph), and some people say that works even better, but I hardly ever see people do that. If you feel like trying it in your application, please feel free to do so; see how well it works, and stick with it if it gives you a good result. I hope that gives you a sense of the choices of activation functions you can use in your network. One of the themes we will see in deep learning is that you often have a lot of different choices in how you build your neural network, ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights (which we will see later), and it turns out that it is sometimes difficult to get good guidelines for exactly what will work best for your problem.
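Making that constant learnable is usually called a parametric ReLU (PReLU). This is a minimal sketch of the idea, not something covered in the course; the names and the scalar-slope choice are my own assumptions:

```python
import numpy as np

def prelu(z, alpha):
    """Parametric ReLU: the negative-side slope alpha is learned instead of fixed at 0.01."""
    return np.where(z > 0, z, alpha * z)

def prelu_backward(z, alpha, dA):
    """Backprop through PReLU: gradients w.r.t. the pre-activation z and the slope alpha."""
    dZ = dA * np.where(z > 0, 1.0, alpha)
    dalpha = np.sum(dA * np.where(z > 0, 0.0, z))  # alpha only affects the negative side
    return dZ, dalpha

z = np.array([-1.5, 0.3, -0.2])
dZ, dalpha = prelu_backward(z, alpha=0.05, dA=np.ones_like(z))
# dalpha would then feed a gradient step such as: alpha -= learning_rate * dalpha
```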

Throughout these three courses I will keep giving you a sense of what I see in the industry, in terms of what is more or less popular, but for your application, with its own idiosyncrasies, it is very difficult to know in advance exactly what will work best. So a piece of advice: if you are not sure which of these activation functions works best, try them all, evaluate on a holdout validation set or a development set (which we will talk about later), see which one works better, and go with that (a code sketch of this loop follows this paragraph). I think that by testing these different choices for your application, you will be better at future-proofing your neural network architecture against the idiosyncrasies of your problem, as well as against evolutions of the algorithms, compared with being told to always use a ReLU activation and nothing else, which may or may not apply to whatever problem you end up working on in the near or distant future.
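One way to follow this advice in code is to treat the hidden-layer activation as a hyperparameter and compare dev-set scores; the `train_and_score` callback below is hypothetical, standing in for whatever training loop you already have:

```python
import numpy as np

CANDIDATE_ACTIVATIONS = {
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(0, z),
    "leaky_relu": lambda z: np.maximum(0.01 * z, z),
}

def pick_activation(X_train, y_train, X_dev, y_dev, train_and_score):
    """Train one model per candidate activation and keep the best dev-set score.

    `train_and_score` is a hypothetical callback:
    (X_train, y_train, X_dev, y_dev, activation) -> dev-set accuracy.
    """
    scores = {name: train_and_score(X_train, y_train, X_dev, y_dev, g)
              for name, g in CANDIDATE_ACTIVATIONS.items()}
    best = max(scores, key=scores.get)
    return best, scores
```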

All right, so that was the choice of activation functions; you have seen the most popular ones. There is one other question that is sometimes asked: why do you even need an activation function at all? Why not just do away with it? Let's talk about that in the next video, where you will see why neural networks do need some sort of nonlinear activation function.


Key takeaways:

Choosing an activation function

Several different activation functions $g(z)$ (a consolidated code sketch follows the list below):


Sigmoid: $a = \dfrac{1}{1 + e^{-z}}$

  • Derivative: $a' = a(1 - a)$

tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

  • Derivative: $a' = 1 - a^{2}$

ReLU (rectified linear unit): $a = \max(0, z)$

Leaky ReLU: $a = \max(0.01z, z)$
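A consolidated sketch of these activations and the derivatives listed above, written in terms of the activation value $a$ where the formula allows it (for the tanh activation itself, `np.tanh` can be used directly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime_from_a(a):
    return a * (1.0 - a)          # a' = a(1 - a)

def tanh_prime_from_a(a):
    return 1.0 - a ** 2           # a' = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)  # slope 1 for z > 0, else 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_prime(z):
    return np.where(z > 0, 1.0, 0.01)
```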

Choosing an activation function:

Comparing the sigmoid and tanh functions:

Hidden layers: tanh performs better than sigmoid, because tanh's range is [-1, +1] and its outputs are distributed around 0 with zero mean, so the data passed from the hidden layer onward is effectively normalized (zero mean).

Output layer: for binary classification the targets take values in {0, 1}, so sigmoid is usually chosen.
However, when $|z|$ is large, the gradients of both sigmoid and tanh become very small, so gradient-based updates slow down late in training. In practice, try to keep $|z|$ close to 0.

ReLU remedies this shortcoming of the previous two: when $z > 0$ the gradient is always 1, which speeds up gradient-based training of the network. When $z < 0$ the gradient is 0, but in practice this has little impact.

Leaky ReLU keeps the gradient nonzero even when $z < 0$.

When choosing an activation function, pick ReLU if you do not know what else to use. Of course there is no fixed answer; validate the choice on a cross-validation (dev) set for your actual problem.

References:

[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记 (1-3) - 浅层神经网络 (Andrew Ng's Coursera Deep Learning course, DeepLearning.ai distilled notes (1-3): Shallow Neural Networks).


PS: You are welcome to scan the QR code and follow the official account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and also runs occasional check-in groups for early rising, reading, exercise, English, and more.
