This series only adds personal study notes and supplementary derivations on top of the original course material; corrections and feedback are welcome. After working through Andrew Ng's course, I wrote it up as text to make review and reference easier. Since I have been studying English, the series is primarily in English, and I suggest readers also work mainly from the English, using the Chinese as support, to lay the groundwork for reading academic papers in this field later on. - ZJ
Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」
知乎:https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79122737
3.8 Softmax Regression
(Subtitle source: 网易云课堂)
So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1. Is it a cat, is it not a cat? What if we have multiple possible classes? There's a generalization of logistic regression called Softmax regression that lets you make predictions where you're trying to recognize one of C, or one of multiple classes, rather than just recognize two classes. Let's take a look. Let's say that instead of just recognizing cats you want to recognize cats, dogs, and baby chicks. So I'm going to call cats class 1, dogs class 2, baby chicks class 3. And if none of the above, then there's an "other" or a "none of the above" class, which I'm going to call class 0. So here's an example of the images and the classes they belong to. That's a picture of a baby chick, so the class is 3. Cats is class 1, dog is class 2, I guess that's a koala, so that's none of the above, so that is class 0. Class 3, and so on.
So the notation we're going to use is, I'm going to use capital C to denote the number of classes you're trying to categorize your inputs into. And in this case, you have four possible classes, including the "other" or the "none of the above" class. So when you have four classes, the numbers indexing your classes would be 0 through capital C minus one. So in other words, that would be zero, one, two, or three. In this case, we're going to build a neural network where the output layer has four, or in this case capital C, output units. So $n^{[L]}$, the number of units in the output layer, which is layer L, is going to equal 4, or in general this is going to equal C. And what we want is for the units in the output layer to tell us what is the probability of each of these four classes. So the first node here is supposed to output, or we want it to output, the probability of the "other" class, given the input x. This will output the probability that there's a cat, given an x. This will output the probability that there's a dog, given an x. And that will output the probability of a baby chick, which I'm going to abbreviate to bc, given the input x. So here, the output label $\hat{y}$ is going to be a four by one dimensional vector, because it now has to output four numbers, giving you these four probabilities. And because they are probabilities, the four numbers in the output $\hat{y}$ should sum to one.
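Written out, the output vector described above is (a restatement of the transcript, with $P(\text{cat} \mid x)$ etc. denoting the four class probabilities):

$$\hat{y} = a^{[L]} = \begin{bmatrix} P(\text{other} \mid x) \\ P(\text{cat} \mid x) \\ P(\text{dog} \mid x) \\ P(\text{bc} \mid x) \end{bmatrix}, \qquad \sum_{i} \hat{y}_i = 1$$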
The standard model for getting your network to do this uses what's called a Softmax layer as the output layer in order to generate these outputs. Let me write down the math, then you can come back and get some intuition about what the Softmax is doing. So in the final layer of the neural network, you are going to compute as usual the linear part of the layer. So $z^{[L]}$, that's the z variable for the final layer; remember this is layer capital L. As usual you compute that as $z^{[L]} = w^{[L]} a^{[L-1]} + b^{[L]}$, the weights times the activation of the previous layer plus the biases for that final layer. Now having computed z, you need to apply what's called the Softmax activation function. That activation function is a bit unusual for the Softmax layer, but this is what it does. First, we're going to compute a temporary variable, which we're going to call t, which is $t = e^{z^{[L]}}$. This is applied element-wise. So $z^{[L]}$ here, in our example, is going to be four by one; this is a four dimensional vector. So $t = e^{z^{[L]}}$ is an element-wise exponentiation, and t will also be a 4 by 1 dimensional vector. Then the output $a^{[L]}$ is going to be basically the vector t, but normalized to sum to 1. So $a^{[L]} = \dfrac{e^{z^{[L]}}}{\sum_{j=1}^{4} t_j}$, because we have four classes. So in other words, $a^{[L]}$ is also a four by one vector, and the i-th element of this four dimensional vector, let's write that down, is $a^{[L]}_i = \dfrac{t_i}{\sum_{j=1}^{4} t_j}$. In case this math isn't clear, we'll do an example in a minute that will make this clearer.
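As a quick sketch of these two steps (element-wise exponentiation, then normalization), here is a minimal NumPy version of the Softmax activation; the function name `softmax` is an illustrative choice, not something defined in the course:

```python
import numpy as np

def softmax(z):
    """Softmax activation: t = e^z element-wise, then normalize so the entries sum to 1."""
    t = np.exp(z)          # temporary variable t = e^(z^[L])
    return t / np.sum(t)   # a^[L]_i = t_i / sum_j t_j
```

In practice, implementations often subtract `np.max(z)` from z before exponentiating to avoid numerical overflow; the plain form above matches the lecture's math.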
So in case this math isn't clear, let's go through a specific example that will make this clearer. Let's say that you compute $z^{[L]}$, and $z^{[L]}$ is a four dimensional vector; let's say it is $[5, 2, -1, 3]^T$. What we're going to do is use this element-wise exponentiation to compute the vector t. So t is going to be $[e^5, e^2, e^{-1}, e^3]^T$. And if you plug that into a calculator, these are the values you get: $e^5$ is 148.4, $e^2$ is about 7.4, $e^{-1}$ is 0.4, and $e^3$ is 20.1. And so, the way we go from the vector t to the vector $a^{[L]}$ is just to normalize these entries to sum to one. So if you sum up the elements of t, if you just add up those 4 numbers, you get 176.3. So finally, $a^{[L]}$ is just going to be this vector t divided by 176.3. So for example, this first node here will output $e^5 / 176.3$, which turns out to be 0.842. So saying that, for this image, if this is the value of z you get, the chance of it being class zero is 84.2%. And then the next node outputs $e^2 / 176.3$, which turns out to be 0.042, so this is a 4.2% chance. The next one is $e^{-1}$ over that, which is 0.002. And the final one is $e^3$ over that, which is 0.114. So there is an 11.4% chance that this is class number three, which is the baby chick class, right? So there's a chance of it being class zero, class one, class two, or class three. So the output of the neural network $a^{[L]}$, which is also $\hat{y}$, is a 4 by 1 vector whose elements are the four numbers that we just computed. So this algorithm takes the vector $z^{[L]}$ and maps it to four probabilities that sum to 1. And if we summarize what we just did to map from $z^{[L]}$ to $a^{[L]}$, this whole computation, computing the exponentiation to get this temporary variable t and then normalizing, we can summarize into a Softmax activation function and say $a^{[L]} = g^{[L]}(z^{[L]})$, the activation function g applied to the vector $z^{[L]}$. The unusual thing about this particular activation function is that g takes as input a 4 by 1 vector and outputs a 4 by 1 vector. Previously, our activation functions used to take a single real-valued input. So for example, the Sigmoid and the ReLU activation functions input a real number and output a real number. The unusual thing about the Softmax activation function is, because it needs to normalize across the different possible outputs, it needs to take a vector as input and then output a vector.
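Plugging the lecture's numbers into this computation reproduces the probabilities above; a minimal check in NumPy:

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])  # z^[L] from the example
t = np.exp(z)                        # element-wise exponentiation
a = t / np.sum(t)                    # normalize; np.sum(t) is about 176.3

print(np.round(t, 1))  # [148.4   7.4   0.4  20.1]
print(np.round(a, 3))  # [0.842 0.042 0.002 0.114]  -> sums to 1
```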
So what are the things that a Softmax classifier can represent? I'm going to show you some examples where you have inputs $x_1$, $x_2$, and these feed directly to a Softmax layer that has three or four or more output nodes that then outputs $\hat{y}$. So I'm going to show you a neural network with no hidden layer, and all it does is compute $z^{[1]} = w^{[1]} x + b^{[1]}$, and then the output $a^{[1]}$, or $\hat{y}$, is just the Softmax activation function applied to $z^{[1]}$. So this neural network with no hidden layers should give you a sense of the types of things a Softmax function can represent. So here's one example with just raw inputs $x_1$ and $x_2$. A Softmax layer with C equals 3 output classes can represent this type of decision boundary. Notice these are several linear decision boundaries, but this allows it to separate out the data into three classes. And in this diagram, what we did was we actually took the training set shown in this figure and trained the Softmax classifier with three output labels on the data. The colors on this plot show the thresholds between the outputs of the Softmax classifier, coloring each input based on which one of the three outputs has the highest probability. So we can see that this is like a generalization of logistic regression, with sort of linear decision boundaries, but with more than two classes: instead of the class being just 0 or 1, the class could be 0, 1, or 2. Here's another example of the decision boundary that a Softmax classifier can represent when trained on a data set with three classes. And here's another one.
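A minimal sketch of this no-hidden-layer network for C = 3 classes; the weight and bias values here are made up for illustration and are not from the lecture:

```python
import numpy as np

def softmax(z):
    t = np.exp(z)
    return t / np.sum(t)

# Hypothetical parameters: one linear layer mapping (x1, x2) to C = 3 scores.
W = np.array([[ 1.0, -2.0],
              [-1.5,  0.5],
              [ 0.5,  1.5]])   # shape (3, 2)
b = np.zeros(3)

x = np.array([0.8, -0.3])      # a 2-D input (x1, x2)
z1 = W @ x + b                 # z^[1] = w^[1] x + b^[1]
y_hat = softmax(z1)            # class probabilities summing to 1

print(y_hat, y_hat.argmax())   # predicted class = index of largest probability
```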
Right, so one intuition is that the decision boundary between any two classes will be linear. That's why you see, for example, that the decision boundary between the yellow and the red classes is linear, the boundary between purple and red is another linear boundary, and between purple and yellow is another linear decision boundary. But it is able to use these different linear functions in order to separate the space into three classes. Let's look at some examples with more classes. Here's an example with C equals 4, so with the green class added, Softmax can continue to represent these types of linear decision boundaries between multiple classes. So here's one more example with C equals 5 classes, and here's one last example with C equals 6. So this shows the type of things the Softmax classifier can do when there is no hidden layer. Of course, a much deeper neural network with x and then some hidden units, and then more hidden units, and so on, can learn even more complex non-linear decision boundaries to separate out multiple different classes. So I hope this gives you a sense of what a Softmax layer, or the Softmax activation function, in the neural network can do. In the next video, let's take a look at how you can train a neural network that uses a Softmax layer.
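To see why each pairwise boundary is linear, here is a short derivation that follows from the definitions above (not spelled out in the lecture). The classifier assigns equal probability to classes $i$ and $j$ exactly where

$$a^{[1]}_i = a^{[1]}_j \iff e^{z_i} = e^{z_j} \iff z_i = z_j \iff (w_i - w_j)^{T} x + (b_i - b_j) = 0,$$

where $w_i$ denotes the $i$-th row of $w^{[1]}$. The shared normalizing denominator cancels, leaving a linear equation in $x$, i.e., a line (or hyperplane) in the input space.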
Summary:
Softmax regression
For multi-class problems, there is a generalization of logistic regression called Softmax regression. Softmax regression converts the outputs of a multi-class task into probabilities for each class, so that the class with the largest probability is taken as the predicted class for the input example.
Formula
The Softmax formula (the original post's figure with a simple worked example is not reproduced here) is:

$$t = e^{z^{[L]}}, \qquad a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{C} t_j}$$

As the worked example above shows, Softmax takes the vector $z^{[L]}$ and computes four probabilities that sum to 1.
When there is no hidden layer, the sample features feed directly into the Softmax layer; the figures (not reproduced here) illustrate what the Softmax layer can do for different numbers of classes: the decision boundary between any two classes remains linear.