Applications of the sigmoid function in logistic regression and deep learning


Preface

Despite the similar name, logistic regression differs from the linear regression covered earlier: it is a classification method, used for tasks such as deciding whether an email is spam (yes or no), or whether a tumor, given its features, is benign or malignant. Because the possible classes are fixed in advance, we only need to assign each output to one of them, and that requires a threshold to serve as the dividing line.

I. The sigmoid function in logistic regression and deep learning

## 1. Logistic regression as classification

We call the two classes that the dependent variable may belong to the negative class and the positive class, so the dependent variable is $y \in \{0,1\}$, where 0 denotes the negative class and 1 the positive class.
For a multi-class problem, the dependent variable can be defined as $y \in \{0,1,2,3,\dots,n\}$. If the classifier is a regression model that has already been trained, we can set a threshold: if $h_\theta(x) \geq 0.5$, predict $y=1$ (a positive example); if $h_\theta(x) < 0.5$, predict $y=0$ (a negative example).
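The thresholding rule above can be sketched in a few lines of NumPy (the hypothesis outputs below are made up for illustration):

```python
import numpy as np

def predict(h):
    # threshold the hypothesis output at 0.5: y = 1 if h >= 0.5, else y = 0
    return (np.asarray(h) >= 0.5).astype(int)

# hypothetical hypothesis outputs for four examples
print(predict([0.2, 0.5, 0.7, 0.49]))  # [0 1 1 0]
```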



For example, a linear regression function can take values greater than 1 or less than 0, whereas the logistic regression function always takes values between 0 and 1.

So the two terms are basically interchangeable, and either term can be used to refer to this function g. And if we take these two equations and put them together, here is just an alternative way of writing out the form of my hypothesis: $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$. All I have done is take the variable z, where z is a real number, and plug in $\theta^T x$ in place of z. Lastly, let me show you what the sigmoid function looks like. We're going to plot it on the figure here. The sigmoid function g(z), also called the logistic function, starts off near 0, rises until it crosses 0.5 at the origin, and then flattens out again. So that's what the sigmoid function looks like. And you notice that the sigmoid function asymptotes at 1 and at 0: as z, the horizontal axis, goes to minus infinity, g(z) approaches 0, and as z approaches infinity, g(z) approaches 1. Because g(z) takes values between 0 and 1, we also have that $h_\theta(x)$ must be between 0 and 1 ($0 \leq h_\theta(x) \leq 1$). Finally, given this hypothesis representation, what we need to do next, as before, is fit the parameters $\theta$ to our data. So given a training set, we need to pick a value for the parameters $\theta$, and this hypothesis will then let us make predictions. We'll talk about a learning algorithm later for fitting the parameter $\theta$. But first let's talk a little about the interpretation of this model.


2. Hypothesis representation

The logistic regression model
Here is how I'm going to interpret the output of my hypothesis $h_\theta(x)$. When my hypothesis outputs some number, I am going to treat that number as the estimated probability that y equals 1 on a new input example x. Let's say we're using the tumor classification example. So we may have a feature vector x, with $x_0=1$ as always, and then our one feature is the size of the tumor. Suppose a patient comes in with some tumor size, I feed their feature vector x into my hypothesis, and my hypothesis outputs the number 0.7. I'm going to interpret this as follows: the hypothesis is telling me that, for a patient with features x, the probability that y equals 1 is 0.7. In other words, I'm going to tell my patient that the tumor, sadly, has a 70% or 0.7 chance of being malignant. To write this out slightly more formally, I'm going to interpret my hypothesis output as the probability that y equals 1, given x, parameterized by $\theta$, i.e., $h_\theta(x)=P(y=1\mid x;\theta)$. So, for those of you familiar with probability, this equation might make sense; if you're a little less familiar with probability, here is how I read this expression: this is the probability that y equals 1 given x, that is, given that my patient has features x. Given that my patient has a particular tumor size represented by my features x, this probability is parameterized by $\theta$. So I'm basically going to count on my hypothesis to give me estimates of the probability that y equals 1. Now, since this is a classification task, we know that y must be equal to 0 or 1, right? Those are the only two values that y could possibly take on, either in the training set or for new patients that may walk into the doctor's office in the future.
So given $h_\theta(x)$, we can therefore compute the probability that y equals 0 as well. Concretely, because y must be either 0 or 1, we know that the probability that y=0 plus the probability that y=1 must add up to 1: $P(y=0\mid x;\theta)+P(y=1\mid x;\theta)=1$. This equation looks a little complicated, but it's basically saying that the probability that y=0 for a particular patient with features x, given our parameters $\theta$, plus the probability that y=1 for that same patient, must add up to 1. If it looks complicated, feel free to mentally imagine it without the x and $\theta$: it just says that the probability of y=0 plus the probability of y=1 must equal 1, and we know this to be true because y has to be either 0 or 1. So if you take one term and move it to the right-hand side, you end up with $P(y=0\mid x;\theta)=1-P(y=1\mid x;\theta)$. And thus, if our hypothesis $h_\theta(x)$ gives the first term, you can quite simply compute the estimated probability that y equals 0 as well. So you now know what the hypothesis representation is for logistic regression, and we've seen the mathematical formula defining the hypothesis for logistic regression.
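Numerically, the complement rule from this paragraph works out as follows (the 0.7 is the hypothetical tumor-example output from above):

```python
h = 0.7        # hypothetical hypothesis output: P(y=1 | x; theta)
p_y0 = 1 - h   # P(y=0 | x; theta), by the complement rule
print(p_y0)    # 0.3 up to floating-point rounding
```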


3. Decision boundary


Concretely, this hypothesis outputs estimates of the probability that y equals 1, given x and parameterized by $\theta$. So if we wanted to predict whether y equals 1 or 0, here is something we might do. Whenever the hypothesis outputs that the probability of y being 1 is greater than or equal to 0.5, meaning it is more likely that y equals 1 than that y equals 0, let's predict y=1. Otherwise, if the estimated probability of y being 1 is less than 0.5, let's predict y=0. And I chose greater than or equal to 0.5 versus less than 0.5: if $h_\theta(x)$ equals 0.5 exactly, we could predict positive or negative, but I put the greater-than-or-equal here so that we default to predicting positive when $h_\theta(x)$ is 0.5. But that's a detail that really doesn't matter much. What I want to do is understand better when exactly $h_\theta(x)$ will be greater than or equal to 0.5, so that we end up predicting y=1. If we look at the plot of the sigmoid function, we'll notice that g(z) is greater than or equal to 0.5 whenever z is greater than or equal to 0; it is in that half of the figure that g takes on values of 0.5 and higher. So when z is positive, the sigmoid function g(z) is greater than or equal to 0.5. Since the hypothesis for logistic regression is $h_\theta(x)=g(\theta^T x)$, it is therefore going to be greater than or equal to 0.5 whenever $\theta^T x$ is greater than or equal to 0, because here $\theta^T x$ takes the role of z. So what we've shown is that our hypothesis is going to predict y=1 whenever $\theta^T x \geq 0$. Let's now consider the other case, when the hypothesis will predict y=0.
Well, by a similar argument, $h_\theta(x)$ is going to be less than 0.5 whenever g(z) is less than 0.5, because the range of values of z that causes g(z) to take on values less than 0.5 is when z is negative. So when g(z) is less than 0.5, our hypothesis will predict that y equals 0, and by the same argument as before, since $h_\theta(x)=g(\theta^T x)$, we'll predict y=0 whenever $\theta^T x < 0$. To summarize what we just worked out: deciding to predict y=1 or y=0 depending on whether the estimated probability is greater than or equal to 0.5, or less than 0.5, is the same as predicting y=1 whenever $\theta^T x \geq 0$, and y=0 whenever $\theta^T x < 0$.
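The equivalence argued above — thresholding $h_\theta(x)=g(\theta^T x)$ at 0.5 is the same as thresholding $\theta^T x$ at 0 — can be checked numerically; a small sketch with arbitrary z values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(z) >= 0.5 exactly when z >= 0, so the two decision rules agree
z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(np.all((sigmoid(z) >= 0.5) == (z >= 0)))  # True
```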


Decision Boundary
Now, let's suppose we have a training set like that shown on the slide, and suppose our hypothesis is $h_\theta(x)=g(\theta_0+\theta_1 x_1+\theta_2 x_2)$. And suppose that, via a procedure yet to be specified, we end up choosing the following values for the parameters: $\theta_0=-3$, $\theta_1=1$, $\theta_2=1$. So this means my parameter vector is $\theta=\begin{bmatrix}-3\\ 1\\ 1\end{bmatrix}$. Given this choice of hypothesis parameters, let's try to figure out where the hypothesis will end up predicting y=1 and where it will end up predicting y=0. Using the formulas from the previous slides, we know that y=1 is more likely, that is, the probability that y=1 is greater than or equal to 0.5, whenever $\theta^T x \geq 0$. And this expression, $-3+x_1+x_2$, is of course $\theta^T x$ when $\theta$ equals the parameter values we just chose. So, for any example with features x1 and x2 that satisfies $-3+x_1+x_2 \geq 0$, our hypothesis will think that y=1 is more likely, or will predict y=1. We can also take the -3, bring it to the right, and rewrite this as $x_1+x_2 \geq 3$. And so, equivalently, we find that this hypothesis will predict y=1 whenever x1+x2 is greater than or equal to 3. Let's see what that means on the figure. If I write down the equation $x_1+x_2=3$, this defines a straight line, which passes through 3 on the x1 axis and 3 on the x2 axis.
So the part of the input space, the part of the x1-x2 plane, where x1+x2 is greater than or equal to 3 is the right half-plane: everything to the upper right of the magenta line I just drew. And so the region where our hypothesis predicts y=1 is really this huge region, the half-space to the upper right. Let me just write that down: I'm going to call this the y=1 region. In contrast, the region where x1+x2 is less than 3 is where we'll predict y=0; it's really the half-plane on the left. I want to give this magenta line that I drew a name: it is called the decision boundary. Concretely, the straight line x1+x2=3 corresponds to the set of points where $h_\theta(x)$ equals 0.5 exactly. And the decision boundary, this straight line, is the line that separates the region where the hypothesis predicts y=1 from the region where it predicts y=0. And just to be clear, the decision boundary is a property of the hypothesis, including the parameters $\theta_0$, $\theta_1$, and $\theta_2$. In the figure I drew a training set, in order to help the visualization. But even if we take away the data set, the decision boundary and the regions where we predict y=1 versus y=0 remain: they are a property of the hypothesis and of its parameters, not of the data set. Later on, of course, we'll talk about how to fit the parameters, and there we'll end up using the training set, our data, to determine the value of the parameters.
But once we have particular values for the parameters $\theta_0$, $\theta_1$, and $\theta_2$, they completely define the decision boundary, and we don't actually need to plot a training set in order to plot the decision boundary.
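A minimal sketch of this linear decision boundary, using the example parameters $\theta = (-3, 1, 1)$ from the text:

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # example parameters from the text

def predict(x1, x2):
    # features are [1, x1, x2] (x0 = 1 as always); predict y = 1
    # when theta^T x >= 0, which here means x1 + x2 >= 3
    z = theta @ np.array([1.0, x1, x2])
    return int(z >= 0)

print(predict(4, 4))  # 1: upper-right of the line x1 + x2 = 3
print(predict(1, 1))  # 0: lower-left of the line
```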

Non-linear decision boundaries
Given a training set like this, how can I get logistic regression to fit this sort of data? Earlier, when we were talking about polynomial regression or linear regression, we talked about how we can add extra higher-order polynomial terms to the features. And we can do the same for logistic regression. Concretely, let's say my hypothesis adds two extra features, $x_1^2$ and $x_2^2$, so that I now have five parameters, $\theta_0$ through $\theta_4$. As before, we'll defer to the next video our discussion on how to automatically choose values for the parameters. But let's say that, via that procedure, I end up choosing $\theta_0=-1$, $\theta_1=0$, $\theta_2=0$, $\theta_3=1$, and $\theta_4=1$. What this means is that with this particular choice, my parameter vector is $\theta=\begin{bmatrix}-1\\ 0\\ 0\\ 1\\ 1\end{bmatrix}$. Following our earlier discussion, this means my hypothesis will predict y=1 whenever $-1+x_1^2+x_2^2 \geq 0$, that is, whenever $\theta^T x \geq 0$. And if I take the -1 and bring it to the right, my hypothesis will predict y=1 whenever $x_1^2+x_2^2 \geq 1$. So, what does the decision boundary look like? Well, if you plot the curve $x_1^2+x_2^2=1$, that's the equation of a circle of radius 1 centered at the origin. So that is my decision boundary. Everything outside the circle I'm going to predict as y=1, so out here is my y=1 region, and inside the circle is where I'll predict y=0.
So, by adding these more complex polynomial terms to my features, I can get more complex decision boundaries that don't just try to separate the positive and negative examples with a straight line; in this example, the decision boundary is a circle. Once again, the decision boundary is a property not of the training set, but of the hypothesis and its parameters. As long as we're given my parameter vector $\theta$, it defines the decision boundary, the circle. But the training set is not what we use to define the decision boundary. The training set may be used to fit the parameters $\theta$; we'll talk about how to do that later. But once you have the parameters $\theta$, that is what defines the decision boundary. Let me put the training set back just for visualization.
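The circular boundary from this example can be sketched the same way, with $\theta = (-1, 0, 0, 1, 1)$ over the feature vector $(1, x_1, x_2, x_1^2, x_2^2)$:

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # example parameters from the text

def predict(x1, x2):
    # predict y = 1 when theta^T x >= 0, i.e. on or outside
    # the unit circle x1^2 + x2^2 = 1
    x = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return int(theta @ x >= 0)

print(predict(2, 0))    # 1: outside the circle
print(predict(0.2, 0))  # 0: inside the circle
```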


If I have even higher-order polynomial terms, such as $h_\theta(x)=g(\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_1^2+\theta_4 x_1^2 x_2+\theta_5 x_1^2 x_2^2+\theta_6 x_1^3 x_2+\dots)$, then it's possible to show that you can get even more complex decision boundaries, and logistic regression can be used to find decision boundaries that may, for example, be an ellipse, or, with a different setting of the parameters, some other funny shape, or an even more complex shape, where for everything inside you predict y=1 and everything outside you predict y=0. So with these higher-order polynomial features you can get very complex decision boundaries. With these visualizations, I hope you have a sense of the range of hypothesis functions you can represent using the representation we have for logistic regression. Now we know what $h_\theta(x)$ can represent.

Reference: notes from Andrew Ng's machine learning course

4. Introduction to the sigmoid function

Basic properties of the function:

1. Domain: $(-\infty, +\infty)$
2. Range: $(0, 1)$
3. The function is continuous and smooth over its domain
4. Differentiable everywhere, with derivative $f'(x)=f(x)(1-f(x))$
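Property 4 can be sanity-checked against a central finite difference (a quick sketch, at an arbitrary test point):

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

# compare f'(x) = f(x) * (1 - f(x)) with a central finite difference
x, eps = 0.7, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
analytic = f(x) * (1 - f(x))
print(abs(numeric - analytic) < 1e-8)  # True
```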

import matplotlib.pyplot as plt
import numpy as np
 
 
def sigmoid(x):
    # return the sigmoid of x directly
    return 1. / (1. + np.exp(-x))
 
 
def plot_sigmoid():
    # arange arguments: start, stop, step
    x = np.arange(-8, 8, 0.2)
    y = sigmoid(x)
    plt.plot(x, y)
    plt.show()
 
 
if __name__ == '__main__':
    plot_sigmoid()

Plot of the sigmoid function

As the plot shows, the sigmoid function is continuous, smooth, and strictly monotonic, which makes it a very good threshold function. As x approaches negative infinity, y approaches 0; as x approaches positive infinity, y approaches 1; and at x=0, y=0.5. Once x moves outside roughly [-6, 6], the function value barely changes and is already very close to its limit, so that region is usually ignored in applications. The range of the sigmoid function is limited to (0, 1), which closely matches the [0, 1] range of probabilities, so this function is often used for binary-classification probabilities.
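The saturation outside roughly [-6, 6] is easy to see numerically (values rounded for readability):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))             # 0.5
print(round(sigmoid(6), 4))   # 0.9975
print(round(sigmoid(-6), 4))  # 0.0025
```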

The sigmoid function is sometimes also known as the logistic function. It is a non-linear function used not only in machine learning (logistic regression), but also in deep learning.

5. The sigmoid function in deep learning

The sigmoid has an advantage in compressing the magnitude of the data: in the forward pass of a deep network, it keeps activations within [0, 1], so their magnitude stays stable rather than blowing up.
The sigmoid function:
$$\sigma(z)=\frac{1}{1+e^{-z}}\tag{1}$$
Python implementation of the sigmoid function

import numpy as np

def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy
    
    Arguments:
    Z -- numpy array of any shape
    
    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    
    A = 1/(1+np.exp(-Z))
    cache = Z

    return A, cache

Note: Z is needed during backpropagation, so it is stored in cache first.
How sigmoid backpropagation works
Derivative of the sigmoid function:
$$\sigma'(z)=\sigma(z)\,(1-\sigma(z))\tag{2}$$

Sigmoid backpropagation
In layer $l$ of a neural network, the forward pass computes:
$$Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}\tag{3}$$
$$A^{[l]}=\sigma(Z^{[l]})\tag{4}$$

Here (3) is the linear part and (4) is the activation part, with sigmoid as the activation function.
During backpropagation, when we reach layer $l$, the next layer hands us $dA^{[l]}$ (that is, $\frac{\partial \mathcal{L}}{\partial A^{[l]}}$, where $\mathcal{L}$ is the cost function).
The current layer then needs to compute $dZ^{[l]}$ (that is, $\frac{\partial \mathcal{L}}{\partial Z^{[l]}}$), given by:
$$dZ^{[l]}=\frac{\partial \mathcal{L}}{\partial Z^{[l]}}=\frac{\partial \mathcal{L}}{\partial A^{[l]}}\cdot\frac{\partial A^{[l]}}{\partial Z^{[l]}}=dA^{[l]}*\sigma'(Z^{[l]})=dA^{[l]}*\sigma(Z^{[l]})\,(1-\sigma(Z^{[l]}))\tag{5}$$

The implementation is therefore:

Python implementation of sigmoid backpropagation
import numpy as np

def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.
    
    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently
    
    Returns:
    dZ -- Gradient of the cost with respect to Z
    """

    Z = cache

    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)

    assert (dZ.shape == Z.shape)

    return dZ
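A quick gradient check (a sketch, not part of the original) pairs the two functions above: with a loss $\mathcal{L} = \sum \sigma(Z)$, dA is all ones, and the dZ returned by `sigmoid_backward` should match a numerical derivative:

```python
import numpy as np

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    return A, Z  # cache = Z

def sigmoid_backward(dA, cache):
    s = 1 / (1 + np.exp(-cache))
    return dA * s * (1 - s)

# with L = sum(sigmoid(Z)), dA is all ones; perturbing every entry of Z
# by the same eps, dL/d(eps) should equal the sum of the entries of dZ
Z = np.array([[-1.0, 0.0, 2.0]])
dZ = sigmoid_backward(np.ones_like(Z), Z)
eps = 1e-6
numeric = (sigmoid(Z + eps)[0].sum() - sigmoid(Z - eps)[0].sum()) / (2 * eps)
print(abs(numeric - dZ.sum()) < 1e-8)  # True
```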

6. Advantages and disadvantages of the sigmoid function

Advantages of the sigmoid:
It works fairly well when the features interact in complicated ways or do not differ too wildly in scale.
Disadvantages of the sigmoid:
1. The activation is expensive to compute, and backpropagating the error gradient involves division when taking derivatives.
2. During backpropagation the gradient easily vanishes, which makes it impossible to train deep networks.
3. The sigmoid saturates and kills gradients.
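Disadvantage 2 follows from the derivative identity above: $\sigma'(z)=\sigma(z)(1-\sigma(z))$ peaks at 0.25, so chaining many sigmoid layers multiplies gradients by factors of at most 0.25 each. A quick illustration:

```python
import numpy as np

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

print(dsigmoid(0.0))  # 0.25, the maximum of the sigmoid's derivative
# after 10 layers the gradient has shrunk by at least a factor of 0.25**10
print(0.25 ** 10)     # 9.5367431640625e-07
```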

II. Logistic regression code

from numpy import *
filename='...\\testSet.txt' # data file path
def loadDataSet():   # read the data (only two features here)
    dataMat = []
    labelMat = []
    fr = open(filename)
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])   # the leading 1.0 is the constant term: with two features X1 and X2 we need three parameters, W1+W2*X1+W3*X2
        labelMat.append(int(lineArr[2]))
    return dataMat,labelMat

def sigmoid(inX):  # the sigmoid function
    return 1.0/(1+exp(-inX))

def gradAscent(dataMat, labelMat): # gradient ascent for the optimal parameters
    dataMatrix=mat(dataMat) # convert the data to a matrix
    classLabels=mat(labelMat).transpose() # convert the labels to a column vector
    m,n = shape(dataMatrix)
    alpha = 0.001  # learning rate (step size): the larger it is, the bigger each gradient-ascent step
    maxCycles = 500 # number of iterations; set it based on the actual data -- sometimes 200 is enough
    weights = ones((n,1)) # initialize all parameters to 1; the weight matrix holds the three parameters
    for k in range(maxCycles):
        h = sigmoid(dataMatrix*weights)
        error = (classLabels - h)     # residual term from the gradient of the log-likelihood
        weights = weights + alpha * dataMatrix.transpose()* error # update the weights iteratively
    return weights

def stocGradAscent0(dataMat, labelMat):  # stochastic gradient ascent: when the data set is large, using all rows in every iteration is very expensive, so each update uses only a single row
    dataMatrix=mat(dataMat)
    classLabels=labelMat
    m,n=shape(dataMatrix)
    alpha=0.01
    maxCycles = 500
    weights=ones((n,1))
    for k in range(maxCycles):
        for i in range(m): #遍历计算每一行
            h = sigmoid(sum(dataMatrix[i] * weights))
            error = classLabels[i] - h
            weights = weights + alpha * error * dataMatrix[i].transpose()
    return weights

def stocGradAscent1(dataMat, labelMat): # improved stochastic gradient ascent: each iteration picks samples at random to update the weights, and the step size shrinks as the iterations progress
    dataMatrix=mat(dataMat)
    classLabels=labelMat
    m,n=shape(dataMatrix)
    weights=ones((n,1))
    maxCycles=500
    for j in range(maxCycles): # iterations
        dataIndex=[i for i in range(m)]
        for i in range(m): # visit the rows in random order
            alpha=4/(1.0+j+i)+0.0001  # the step size shrinks as the iterations progress
            randIndex=int(random.uniform(0,len(dataIndex)))  # random sampling from the remaining rows
            h=sigmoid(sum(dataMatrix[dataIndex[randIndex]]*weights))
            error=classLabels[dataIndex[randIndex]]-h
            weights=weights+alpha*error*dataMatrix[dataIndex[randIndex]].transpose()
            del(dataIndex[randIndex]) # remove the sample that was just drawn
    return weights

def plotBestFit(weights):  # plot the final classification
    import matplotlib.pyplot as plt
    dataMat,labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i])== 1:
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

def main():
    dataMat, labelMat = loadDataSet()
    weights=gradAscent(dataMat, labelMat).getA()
    plotBestFit(weights)

if __name__=='__main__':
    main()

Decision boundary plot

References: 1. *Machine Learning in Action*
2. Andrew Ng's machine learning lecture notes and slides
