The Softmax function: concept and computation. Coursera | Andrew Ng (02-week3-3.9): Training a Softmax classifier

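Copyright notice: This is an original article by the blogger ZJ and may not be reproduced without the blogger's permission. https://blog.csdn.net/JUNJUN_ZHAO/article/details/79122927

This series only adds personal study notes and supplementary derivations for some topics on top of the original course. If you find any errors, corrections are welcome. After studying Andrew Ng's course, I compiled it into text to make later review easier. Because I am still learning English, this series is written mainly in English, and I suggest readers also rely mainly on the English, with the Chinese as support, to prepare for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom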


If you repost, please credit the author and source: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79122927


3.9 Training a Softmax classifier

(Subtitle source: NetEase Cloud Classroom)


In the last video, you learned about the Softmax layer and the Softmax activation function. In this video, you deepen your understanding of Softmax classification, and also learn how to train a model that uses a Softmax layer. Recall our earlier example where the output layer computes $z^{[L]}$. You may have noticed that in the $z$ vector the biggest element was 5, and the biggest probability ends up being the first probability.



The name Softmax comes from contrasting it with what's called a hard max, which would have taken the vector $z$ and mapped it to a vector with a 1 in one position and 0s elsewhere. The hard max function looks at the elements of $z$ and just puts a 1 in the position of the biggest element of $z$ and 0s everywhere else. So this is a very hard max, where the biggest element gets an output of 1 and everything else gets an output of 0. In contrast, Softmax is a more gentle mapping from $z$ to these probabilities. I'm not sure if this is a great name, but at least that was the intuition behind why we call it Softmax, all this in contrast to the hard max.

One thing I didn't really show but had alluded to is that Softmax regression, or the Softmax activation function, generalizes the logistic activation function to $C$ classes rather than just two classes. It turns out that if $C = 2$, Softmax essentially reduces to logistic regression. I'm not going to prove this in this video, but the rough outline of the proof is that if $C = 2$ and you apply Softmax, then the output layer $a^{[L]}$ will output two numbers, maybe 0.842 and 0.158. These two numbers always have to sum to 1, so they're actually redundant: you don't need to bother computing both of them, you just need to compute one. And it turns out that the way you end up computing that number reduces to the way that logistic regression computes its single output. That wasn't much of a proof, but the takeaway is that Softmax regression is a generalization of logistic regression to more than two classes.
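The contrast between hard max and Softmax, and the two-class reduction, can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the course: the function names `softmax`, `hardmax`, and `sigmoid` are chosen here, and the score vector `z` is an assumed example whose largest element is 5, as in the text.

```python
import math

def softmax(z):
    """Map a list of scores z to probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(z):
    """Put a 1 at the position of the largest element of z and 0 everywhere else."""
    winner = z.index(max(z))
    return [1 if j == winner else 0 for j in range(len(z))]

def sigmoid(x):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-x))

z = [5.0, 2.0, -1.0, 3.0]  # assumed example scores; the largest element is 5
print(hardmax(z))                         # [1, 0, 0, 0]
print([round(p, 3) for p in softmax(z)])  # [0.842, 0.042, 0.002, 0.114]

# With C = 2 the two outputs sum to 1, so one of them is redundant, and the
# first output equals the logistic function applied to the score difference.
z2 = [1.2, -0.4]  # hypothetical two-class scores
p = softmax(z2)
print(abs(p[0] - sigmoid(z2[0] - z2[1])) < 1e-12)  # True
```

The hard max output carries no information about how close the other scores were, while the Softmax output keeps a small probability for every class.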



Now let's look at how you would actually train a neural network with a Softmax output layer. In particular, let's define the loss function you use to train your neural network. Let's take an example in your training set where the target output, the ground truth label, is $y = [0, 1, 0, 0]^T$. Using the example from the previous video, this means that this is an image of a cat, because it falls into class 1. Now let's say that your neural network is currently outputting $\hat{y}$, a vector of probabilities that sum to 1: 0.3, 0.2, 0.1, 0.4 (you can check that these sum to 1), and this is going to be $a^{[L]}$. The loss for this example is $L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$. Here $y_1 = y_3 = y_4 = 0$ and only $y_2 = 1$, so all the terms of the sum with $y_j = 0$ are equal to 0.



And the only term you're left with is $-y_2 \log \hat{y}_2$, because when you sum over the indices $j$, all the terms end up 0 except when $j$ is equal to 2. And because $y_2 = 1$, this is just $-\log \hat{y}_2$. So what this means is that if your learning algorithm is trying to make the loss small, because you use gradient descent to try to reduce the loss on your training set, then the only way to make the loss small is to make $-\log \hat{y}_2$ small. And the only way to do that is to make $\hat{y}_2$ as big as possible. These are probabilities, so they can never be bigger than 1. But this makes sense: because $x$ for this example is the picture of a cat, you want that output probability to be as big as possible. More generally, what this loss function does is look at whatever is the ground truth class in your training set and try to make the corresponding probability of that class as high as possible. If you're familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation. But if you don't know what that means, don't worry about it. The intuition we just talked about will suffice.

Now, this is the loss on a single training example. How about the cost $J$ on the entire training set? The cost of a setting of the parameters, that is, of all the weights and biases, is defined pretty much as you'd guess: the average over your entire training set of the loss on your learning algorithm's predictions, $J(w^{[1]}, b^{[1]}, \dots) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$. What you do is use gradient descent to try to minimize this cost.
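The loss on one example and the cost over a training set can be sketched as follows. This is a minimal illustration in plain Python: the helper names `cross_entropy` and `cost` are chosen here, the second training example is hypothetical, and examples are stored one vector per list entry for readability rather than as matrix columns.

```python
import math

def cross_entropy(y_hat, y):
    """Loss on one example: -sum_j y_j * log(y_hat_j).
    Terms with y_j == 0 contribute nothing, so they are skipped."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj != 0)

def cost(Y_hat, Y):
    """Cost J: average of the per-example losses over the m examples."""
    m = len(Y)
    return sum(cross_entropy(p, y) for p, y in zip(Y_hat, Y)) / m

y = [0, 1, 0, 0]              # ground truth: class 1 (cat)
y_hat = [0.3, 0.2, 0.1, 0.4]  # current network output from the text
print(round(cross_entropy(y_hat, y), 4))  # 1.6094, which is -log(0.2)

# Raising the probability of the true class lowers the loss.
better = [0.1, 0.7, 0.1, 0.1]  # hypothetical improved output
print(round(cross_entropy(better, y), 4))  # 0.3567, which is -log(0.7)

# Cost over a tiny training set; the second example is made up for illustration.
Y = [y, [0, 0, 1, 0]]
Y_hat = [y_hat, [0.1, 0.1, 0.6, 0.2]]
print(round(cost(Y_hat, Y), 4))  # 1.0601
```

Only the probability assigned to the true class enters the loss, which is why minimizing it pushes that probability toward 1.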



Finally, one more implementation detail. Notice that because $C$ is equal to 4, $y$ is a 4 by 1 vector, and $\hat{y}$ is also a 4 by 1 vector. So if you're using a vectorized implementation, the matrix capital $Y$ is going to be $y^{(1)}, y^{(2)}, \dots, y^{(m)}$ stacked horizontally. For example, if the example above is your first training example, then the first column of this matrix $Y$ will be 0 1 0 0, then maybe the second example is a dog, maybe the third example is a none of the above, and so on. This matrix $Y$ ends up being a 4 by $m$ dimensional matrix. Similarly, $\hat{Y}$ will be $\hat{y}^{(1)}$ through $\hat{y}^{(m)}$ stacked horizontally, where $\hat{y}^{(1)}$ is the output on the first training example, with entries 0.3, 0.2, 0.1, and 0.4, and so on. $\hat{Y}$ itself will also be a 4 by $m$ dimensional matrix.
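Stacking the label vectors horizontally can be sketched like this. The helper `one_hot_columns` is a name chosen here, and the class numbering beyond "class 1 is cat" (2 for dog, 0 for none of the above) is an assumption made for the illustration.

```python
def one_hot_columns(labels, num_classes):
    """Return Y as num_classes rows of m entries each,
    so that column i is the one-hot label vector y^(i)."""
    return [[1 if label == c else 0 for label in labels]
            for c in range(num_classes)]

# Assumed class numbering: 1 = cat, 2 = dog, 0 = none of the above.
labels = [1, 2, 0]  # first example is a cat, second a dog, third none of the above
Y = one_hot_columns(labels, 4)

for row in Y:
    print(row)
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 0]

print(len(Y), len(Y[0]))  # 4 3, i.e. a C by m matrix with C = 4 and m = 3
```

Reading down the first column gives 0 1 0 0, the label vector of the first training example.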



Finally, let's take a look at how you'd implement gradient descent when you have a Softmax output layer. This output layer computes $z^{[L]}$, applies the Softmax activation to get $a^{[L]} = \hat{y}$, and from that you compute the loss. For the backward pass, the key expression is $dz^{[L]} = \hat{y} - y$; with it you can start off the backprop process to compute all the derivatives you need throughout your neural network. But it turns out that in this week's programming exercise, we'll start to use one of the deep learning programming frameworks, and for those frameworks it usually turns out you just need to focus on getting the forward prop right. So long as you specify the forward prop pass in the programming framework, the framework will figure out how to do back prop, how to do the backward pass, for you.



So this expression for $dz^{[L]}$ is worth keeping in mind if you ever need to implement Softmax regression, or Softmax classification, from scratch, although you won't actually need it in this week's programming exercise, because the programming framework you use will take care of this derivative computation for you. So that's it for Softmax classification. With it you can now implement learning algorithms to categorize inputs into not just one of two classes, but one of $C$ different classes. Next, I want to show you some of the deep learning programming frameworks, which can make you much more efficient in terms of implementing deep learning algorithms. Let's go on to the next video to discuss that.
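For a Softmax output layer trained with the loss $-\sum_j y_j \log \hat{y}_j$, the derivative of the loss on one example with respect to $z^{[L]}$ is $\hat{y} - y$. The sketch below checks that standard result against a finite-difference estimate. The function names and the score vector `z` are assumptions chosen for the illustration; this is not the framework code the exercise uses.

```python
import math

def softmax(z):
    """Map a list of scores z to probabilities that sum to 1."""
    shift = max(z)  # numerical stability; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_scores(z, y):
    """Cross-entropy loss of softmax(z) against the one-hot label y."""
    y_hat = softmax(z)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))

def numerical_grad(z, y, eps=1e-6):
    """Central finite-difference estimate of dLoss/dz, one coordinate at a time."""
    grad = []
    for j in range(len(z)):
        up = list(z)
        up[j] += eps
        down = list(z)
        down[j] -= eps
        grad.append((loss_from_scores(up, y) - loss_from_scores(down, y)) / (2 * eps))
    return grad

z = [5.0, 2.0, -1.0, 3.0]  # assumed example scores
y = [0, 1, 0, 0]           # ground truth: class 1

analytic = [p - t for p, t in zip(softmax(z), y)]  # dz = y_hat - y
numeric = numerical_grad(z, y)

print(max(abs(a - n) for a, n in zip(analytic, numeric)) < 1e-6)  # True
```

A check of this kind is mainly useful when writing the backward pass by hand; a framework with automatic differentiation computes the same quantity from the forward pass.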



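Key points:

Training a Softmax classifier

Understanding Softmax

Why is it called Softmax? Take the earlier example, in which the output layer computes $a^{[L]}$ from $z^{[L]}$.

To decide a model's output class, we usually pick the class with the largest output value; that is, we put a 1 at the position of the maximum and 0 everywhere else. This is the so-called "hard max". Softmax instead turns the largest score, 5, into the largest probability, 0.842, so compared with "hard max" its output is "soft" rather than "hard".

Softmax regression generalizes logistic regression from binary classification to multi-class classification.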
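The two-class case can be written out as a short derivation. This is a sketch: $z_1, z_2$ for the two entries of $z^{[L]}$, $a_1, a_2$ for the two outputs, and $\sigma$ for the logistic sigmoid are notation introduced here.

```latex
\begin{aligned}
a_1 &= \frac{e^{z_1}}{e^{z_1} + e^{z_2}}
     = \frac{1}{1 + e^{-(z_1 - z_2)}}
     = \sigma(z_1 - z_2), \\
a_2 &= 1 - a_1 .
\end{aligned}
```

So only the difference $z_1 - z_2$ matters, and the single number $a_1$ is computed the way logistic regression computes its output.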

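The Softmax loss function

With a Softmax layer, the target value $y$ and the output probabilities $\hat{y}$ at some point before training has finished are:

$$y = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \hat{y} = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.1 \\ 0.4 \end{bmatrix}$$

The loss function used with Softmax is:

$$L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$$

During training, our goal is to minimize the loss function. From the target value we know that $y_1 = y_3 = y_4 = 0$ and $y_2 = 1$, so substituting into the loss function gives:

$$L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j = -y_2 \log \hat{y}_2 = -\log \hat{y}_2$$

So to minimize the loss function, our goal becomes making the probability $\hat{y}_2$ as large as possible.

In other words, the loss function here looks at the true class in your training set and makes the corresponding probability as high as possible, which is in fact a form of maximum likelihood estimation.

The corresponding cost function is:

$$J(w^{[1]}, b^{[1]}, \dots) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$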

Gradient descent with Softmax

The gradient at the Softmax layer is:

$$\frac{\partial J}{\partial z^{[L]}} = dz^{[L]} = \hat{y} - y$$