# I. The sigmoid function in logistic regression and deep learning

## 1. Logistic regression classification

So the two terms are basically interchangeable, and either one can be used to refer to this function $g$. If we take these two equations and put them together, here is an alternative way of writing out the form of my hypothesis: $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T}x}}$. All I have done is take the variable $z$, which is a real number, and plug in $\theta^{T}x$ in its place.

Lastly, let me show you what the sigmoid function looks like. The sigmoid function $g(z)$, also called the logistic function, starts off near 0, rises until it passes 0.5 at $z=0$, and then flattens out again. It asymptotes at 0 and at 1: with $z$ on the horizontal axis, as $z$ goes to minus infinity $g(z)$ approaches 0, and as $z$ goes to infinity $g(z)$ approaches 1. Because $g(z)$ outputs values between 0 and 1, $h_{\theta}(x)$ must also lie between 0 and 1 ($0\leq h_{\theta}(x)\leq 1$).

Finally, given this hypothesis representation, what we need to do next, as before, is fit the parameters $\theta$ to our data. Given a training set, we need to pick a value for the parameters $\theta$, and this hypothesis will then let us make predictions. We'll talk about a learning algorithm for fitting $\theta$ later. But first, let's talk a little about the interpretation of this model.
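
The limiting behaviour described above can be checked numerically. This is a minimal sketch in NumPy; the function names `g` and `h` and the sample values of `theta` and `x` are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def g(z):
    """Sigmoid (logistic) function."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return g(np.dot(theta, x))

# g passes through 0.5 at z = 0 and flattens towards 0 and 1 at the extremes.
print(g(0.0))
print(g(-20.0), g(20.0))

# Hypothetical parameters and input (x[0] = 1 is the intercept feature).
theta = np.array([-1.0, 0.5])
x = np.array([1.0, 4.0])
print(h(theta, x))  # a value strictly between 0 and 1
```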


## 2. Hypothesis representation

Here is how I'm going to interpret the output of my hypothesis $h_{\theta}(x)$. When my hypothesis outputs some number, I am going to treat that number as the estimated probability that $y$ equals 1 on a new input example $x$. Let's use the tumor classification example. We have a feature vector $x$ with $x_{0}=1$ as always, and our one feature is the size of the tumor. Suppose a patient comes in with some tumor size, I feed their feature vector $x$ into my hypothesis, and the hypothesis outputs 0.7. I interpret this as follows: for a patient with features $x$, the probability that $y=1$ is 0.7. In other words, I'm going to tell my patient that the tumor, sadly, has a 70% chance of being malignant.

To write this out more formally, I interpret my hypothesis output as the probability that $y=1$, given $x$, parameterized by $\theta$, i.e., $h_{\theta}(x)=P(y=1\mid x;\theta)$. For those of you who are familiar with probability, this equation might make sense; if you're a little less familiar with it, here is how I read this expression: it is the probability that $y$ equals 1 given $x$, that is, given that my patient has a particular tumor size represented by the features $x$, and this probability is parameterized by $\theta$. So I'm basically going to count on my hypothesis to give me estimates of the probability that $y$ equals 1.

Now, since this is a classification task, we know that $y$ must be either 0 or 1. Those are the only two values $y$ can take on, either in the training set or for new patients who may walk into the doctor's office in the future. So given $h_{\theta}(x)$, we can compute the probability that $y=0$ as well. Concretely, because $y$ must be either 0 or 1, the probability that $y=0$ plus the probability that $y=1$ must add up to 1:

$$P(y=0\mid x;\theta)+P(y=1\mid x;\theta)=1$$

This equation looks a little complicated, but it is basically saying that the probability of $y=0$ for a particular patient with features $x$, given our parameters $\theta$, plus the probability of $y=1$ for that same patient, must add up to 1. If it looks complicated, feel free to mentally imagine it without the $x$ and $\theta$: the probability of $y=0$ plus the probability of $y=1$ must equal 1, and we know this is true because $y$ has to be either 0 or 1. If you take the second term and move it to the right-hand side, you end up with

$$P(y=0\mid x;\theta)=1-P(y=1\mid x;\theta)$$

So if our hypothesis $h_{\theta}(x)$ gives us that last term, you can quite simply compute the estimated probability that $y=0$ as well. You now know what the hypothesis representation is for logistic regression, and we've seen the mathematical formula that defines the hypothesis.
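
The probabilistic reading of the hypothesis can be illustrated with a small computation. This is only a sketch: the parameter values and the tumor size below are made up for illustration and do not come from the lecture.

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([-6.0, 1.5])   # hypothetical parameters
x = np.array([1.0, 4.5])        # x0 = 1, x1 = tumor size (made-up units)

p_malignant = hypothesis(theta, x)   # estimate of P(y = 1 | x; theta)
p_benign = 1.0 - p_malignant         # estimate of P(y = 0 | x; theta)
print(p_malignant, p_benign)
```

The two estimates sum to 1 by construction, which is exactly the identity used in the text to obtain the probability of $y=0$.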

## 3. Decision boundary

Concretely, this hypothesis outputs estimates of the probability that $y$ equals 1, given $x$ and parameterized by $\theta$. So if we want to predict whether $y=1$ or $y=0$, here is something we might do. Whenever the hypothesis outputs a probability of $y=1$ that is greater than or equal to 0.5, meaning $y=1$ is more likely than $y=0$, let's predict $y=1$. Otherwise, if the estimated probability of $y=1$ is less than 0.5, let's predict $y=0$. If $h_{\theta}(x)$ is exactly 0.5 we could predict either positive or negative; I put "greater than or equal to" here so that we default to predicting positive, but that's a detail that really doesn't matter much.

What I want to do is understand better exactly when $h_{\theta}(x)$ will be greater than or equal to 0.5, so that we end up predicting $y=1$. If we look at the plot of the sigmoid function, we notice that $g(z)\geq 0.5$ whenever $z\geq 0$. It is in the right half of the figure that $g$ takes on values of 0.5 and higher. Since the hypothesis for logistic regression is $h_{\theta}(x)=g(\theta^{T}x)$, it is greater than or equal to 0.5 whenever $\theta^{T}x\geq 0$, because here $\theta^{T}x$ takes the role of $z$. So what we've shown is that our hypothesis predicts $y=1$ whenever $\theta^{T}x\geq 0$.

Let's now consider the other case, when the hypothesis predicts $y=0$. By a similar argument, $h_{\theta}(x)$ is less than 0.5 whenever $g(z)$ is less than 0.5, and the range of values of $z$ that causes $g(z)$ to be less than 0.5 is when $z$ is negative. Since $h_{\theta}(x)=g(\theta^{T}x)$, we predict $y=0$ whenever $\theta^{T}x<0$.

To summarize what we just worked out: if we decide to predict $y=1$ or $y=0$ depending on whether the estimated probability is greater than or equal to 0.5 or less than 0.5, that is the same as saying we predict $y=1$ whenever $\theta^{T}x\geq 0$, and $y=0$ whenever $\theta^{T}x<0$.
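
The equivalence between thresholding the probability at 0.5 and thresholding $\theta^{T}x$ at 0 can be checked on random data. This is a minimal sketch; the `predict` helper, the random parameters and the random inputs are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """Predict 1 where theta^T x >= 0, else 0; X holds one example per row."""
    return (X @ theta >= 0).astype(int)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                                         # arbitrary parameters
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 2))])    # x0 = 1 column

by_score = predict(theta, X)
by_probability = (sigmoid(X @ theta) >= 0.5).astype(int)
print(np.array_equal(by_score, by_probability))
```

In practice this means a classifier can skip the sigmoid entirely when only the predicted label is needed.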

Now, let's suppose we have a training set like the one shown on the slide, and suppose our hypothesis is $h_{\theta}(x)=g(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2})$. Suppose that, via a procedure to be specified, we end up choosing the following values for the parameters: $\theta_{0}=-3$, $\theta_{1}=1$, $\theta_{2}=1$. So my parameter vector is $\theta=\begin{bmatrix}-3\\ 1\\ 1\end{bmatrix}$. Given this choice of parameters, let's figure out where the hypothesis ends up predicting $y=1$ and where it ends up predicting $y=0$.

Using the formulas we worked out on the previous slides, we know that $y=1$ is more likely, that is, the probability that $y=1$ is greater than or equal to 0.5, whenever $\theta^{T}x\geq 0$. And the expression $-3+x_{1}+x_{2}$ is, of course, $\theta^{T}x$ when $\theta$ equals the parameter values we just chose. So for any example with features $x_{1}$ and $x_{2}$ that satisfy $-3+x_{1}+x_{2}\geq 0$, our hypothesis will think that $y=1$ is more likely, and will predict $y=1$. We can also take the $-3$ to the right-hand side and rewrite this as $x_{1}+x_{2}\geq 3$. Equivalently, this hypothesis predicts $y=1$ whenever $x_{1}+x_{2}\geq 3$.

Let's see what that means on the figure. The equation $x_{1}+x_{2}=3$ defines a straight line, which passes through 3 on the $x_{1}$ axis and 3 on the $x_{2}$ axis. The part of the $x_{1}$, $x_{2}$ plane where $x_{1}+x_{2}\geq 3$ is the half plane to the upper right of this magenta line that I just drew. So the region where our hypothesis predicts $y=1$ is this huge half plane over to the upper right; I'm going to call it the $y=1$ region. In contrast, the region where $x_{1}+x_{2}<3$ is where we predict $y=0$. It is also a half plane: the region on the lower left is where our hypothesis predicts $y=0$.

I want to give this magenta line a name: it is called the decision boundary. Concretely, the straight line $x_{1}+x_{2}=3$ corresponds to the set of points where $h_{\theta}(x)$ equals 0.5 exactly. The decision boundary is the line that separates the region where the hypothesis predicts $y=1$ from the region where it predicts $y=0$.

Just to be clear, the decision boundary is a property of the hypothesis, including the parameters $\theta_{0}$, $\theta_{1}$ and $\theta_{2}$. In the figure I drew a training set in order to help the visualization. But even if we take away the data set, this decision boundary, and the regions where we predict $y=1$ versus $y=0$, are a property of the hypothesis and its parameters, not of the data set. Later on, of course, we'll talk about how to fit the parameters, and there we'll use the training set to determine their values. But once we have particular values for $\theta_{0}$, $\theta_{1}$ and $\theta_{2}$, that completely defines the decision boundary, and we don't actually need to plot a training set in order to plot it.
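
The linear boundary from this example is easy to probe point by point. The sketch below uses the parameter vector from the text, $\theta=(-3, 1, 1)$; the `predict` helper and the three test points are illustrative choices.

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])   # parameters from the example above

def predict(x1, x2):
    """Return 1 if theta^T x >= 0 (i.e. x1 + x2 >= 3), else 0."""
    return int(theta @ np.array([1.0, x1, x2]) >= 0)

print(predict(1.0, 1.0))   # x1 + x2 = 2 < 3, so this prints 0
print(predict(2.0, 2.0))   # x1 + x2 = 4 >= 3, so this prints 1
print(predict(3.0, 0.0))   # exactly on the boundary, so this prints 1
```

Note that no training data appears anywhere in the snippet: the boundary is fixed by `theta` alone.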

Given a training set like this, how can I get logistic regression to fit this sort of data? Earlier, when we were talking about polynomial regression or linear regression, we talked about how we can add extra higher-order polynomial terms to the features. We can do the same for logistic regression. Concretely, let's say my hypothesis is $h_{\theta}(x)=g(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}^{2}+\theta_{4}x_{2}^{2})$, where I've added two extra features, $x_{1}^{2}$ and $x_{2}^{2}$. So I now have 5 parameters, $\theta_{0}$ through $\theta_{4}$. As before, we'll defer to the next video our discussion of how to automatically choose values for these parameters. But let's say that, via a procedure to be specified, I end up choosing $\theta_{0}=-1$, $\theta_{1}=0$, $\theta_{2}=0$, $\theta_{3}=1$, and $\theta_{4}=1$. With this particular choice, my parameter vector is $\theta=\begin{bmatrix}-1\\ 0\\ 0\\ 1\\ 1\end{bmatrix}$.

Following our earlier discussion, this means that my hypothesis predicts $y=1$ whenever $\theta^{T}x\geq 0$, that is, whenever $-1+x_{1}^{2}+x_{2}^{2}\geq 0$. If I take the $-1$ to the right-hand side, my hypothesis predicts $y=1$ whenever $x_{1}^{2}+x_{2}^{2}\geq 1$. So what does the decision boundary look like? If you plot the curve $x_{1}^{2}+x_{2}^{2}=1$, that is the equation of a circle of radius 1 centered at the origin. That is my decision boundary. Everything outside the circle I predict as $y=1$, so out here is my $y=1$ region, and inside the circle is where I predict $y=0$.

So, by adding these more complex polynomial terms to my features, I can get more complex decision boundaries that don't just try to separate the positive and negative examples with a straight line. In this example I get a decision boundary that is a circle. Once again, the decision boundary is a property not of the training set, but of the hypothesis and its parameters. As long as we are given the parameter vector $\theta$, that defines the decision boundary, which is the circle. The training set is not what we use to define the decision boundary; it may be used to fit the parameters $\theta$, and we'll talk about how to do that later. But once you have the parameters $\theta$, that is what defines the decision boundary. Let me put back the training set just for visualization.
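
The circular boundary can be checked the same way as the linear one. This sketch uses the parameter vector $\theta=(-1, 0, 0, 1, 1)$ from the example with the feature vector $(1, x_1, x_2, x_1^2, x_2^2)$; the `predict` helper and the sample points are illustrative assumptions.

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # parameters from the example

def predict(x1, x2):
    """Return 1 outside or on the unit circle, 0 strictly inside it."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return int(theta @ features >= 0)

print(predict(0.0, 0.0))   # centre of the circle, prints 0
print(predict(0.5, 0.5))   # 0.25 + 0.25 < 1, prints 0
print(predict(1.0, 0.0))   # on the circle, prints 1
print(predict(1.0, 1.0))   # outside the circle, prints 1
```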

If I have even higher-order polynomial terms, so things like $h_{\theta}(x)=g(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}^{2}+\theta_{4}x_{1}^{2}x_{2}+\theta_{5}x_{1}^{2}x_{2}^{2}+\theta_{6}x_{1}^{3}x_{2}+\dots)$, then it's possible to show that you can get even more complex decision boundaries. Logistic regression can be used to find decision boundaries that may, for example, be an ellipse, or, with a different setting of the parameters, some funny shape. You can also get decision boundaries with an even more complex shape, where everything inside is predicted as $y=1$ and everything outside is predicted as $y=0$. So with these higher-order polynomial features you can get very complex decision boundaries. I hope these visualizations give you a sense of the range of hypothesis functions we can represent using the representation we have for logistic regression. Now we know what $h_{\theta}(x)$ can represent.
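
One common way to produce such higher-order terms is to enumerate every monomial $x_1^{i}x_2^{j}$ up to a chosen total degree. The helper below, `map_features`, is a hypothetical sketch of that idea and is not taken from the lecture, whose example uses a hand-picked subset of terms.

```python
import numpy as np

def map_features(x1, x2, degree):
    """Return all monomials x1^i * x2^j with i + j <= degree, one column each."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = []
    for total in range(degree + 1):
        for j in range(total + 1):
            columns.append(x1 ** (total - j) * x2 ** j)
    return np.stack(columns, axis=1)

# Degree 2 gives the columns 1, x1, x2, x1^2, x1*x2, x2^2.
features = map_features([2.0], [3.0], degree=2)
print(features)
```

The number of columns grows as $(d+1)(d+2)/2$ for degree $d$, which is why high-degree feature maps can represent very flexible boundaries.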

## 4. Introduction to the sigmoid function

1. Domain: $(-\infty,+\infty)$
2. Range: $(0,1)$
3. The function is continuous and smooth over its whole domain
4. It is differentiable everywhere, with derivative $f'(x)=f(x)(1-f(x))$
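
The derivative identity $f'(x)=f(x)(1-f(x))$ can be sanity-checked against a central finite difference. This is a minimal sketch; the grid of test points and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6.0, 6.0, 25)
eps = 1e-6

# Central difference approximation of f'(x).
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
# Closed form f(x) * (1 - f(x)).
analytic = sigmoid(x) * (1.0 - sigmoid(x))

print(np.max(np.abs(numeric - analytic)))   # small: the two agree closely
```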

Plotting the sigmoid function with matplotlib:

    import matplotlib.pyplot as plt
    import numpy as np

    def sigmoid(x):
        # Return the sigmoid of x directly
        return 1. / (1. + np.exp(-x))

    def plot_sigmoid():
        # np.arange arguments: start, stop, step
        x = np.arange(-8, 8, 0.2)
        y = sigmoid(x)
        plt.plot(x, y)
        plt.show()

    if __name__ == '__main__':
        plot_sigmoid()


The sigmoid function is sometimes also known as the logistic function. It is a non-linear function used not only in machine learning (logistic regression) but also in deep learning.

## 5. The sigmoid function in deep learning

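Sigmoid is good at squashing the magnitude of data. During forward propagation in a deep network, it keeps every activation between 0 and 1, so the magnitudes stay bounded from layer to layer instead of blowing up, which helps avoid large errors.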
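
The bounding effect can be seen in a toy forward pass. The sketch below is illustrative only: the layer sizes, the random weights and the input scale are made-up assumptions, chosen so that the pre-activations `Z` are large while the activations `A` stay within the unit interval.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A = rng.normal(scale=3.0, size=(4, 5))        # made-up inputs with a wide spread

for layer in range(3):
    W = rng.normal(scale=2.0, size=(4, 4))    # hypothetical weights
    b = np.zeros((4, 1))
    Z = W @ A + b                             # pre-activations can be large
    A = sigmoid(Z)                            # activations are squashed
    print(layer, np.abs(Z).max(), A.min(), A.max())
```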
The sigmoid function:

$$\sigma(z)=\frac{1}{1+e^{-z}}\tag{1}$$

Python implementation of the sigmoid function:

    import numpy as np

    def sigmoid(Z):
        """
        Implements the sigmoid activation in numpy

        Arguments:
        Z -- numpy array of any shape

        Returns:
        A -- output of sigmoid(z), same shape as Z
        cache -- returns Z as well, useful during backpropagation
        """

        A = 1/(1+np.exp(-Z))
        cache = Z

        return A, cache


Derivative of the sigmoid function:

$$\sigma'(z)=\sigma(z)*(1-\sigma(z))\tag{2}$$
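
Equation (2) follows from the definition of $\sigma$ in a few steps. A short derivation, written out as a sketch:

```latex
\begin{aligned}
\sigma'(z) &= \frac{d}{dz}\left(1+e^{-z}\right)^{-1}
            = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}} \\
           &= \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}
            = \sigma(z)\,\bigl(1-\sigma(z)\bigr),
\qquad\text{since } 1-\sigma(z)=\frac{e^{-z}}{1+e^{-z}}.
\end{aligned}
```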

How backpropagation through the sigmoid works

Forward propagation for layer $l$:

$$Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}\tag{3}$$

$$A^{[l]}=\sigma(Z^{[l]})\tag{4}$$

During backpropagation we are given $dA^{[l]}$ (i.e. $\frac{\partial \mathcal{L}}{\partial A^{[l]}}$, where $\mathcal{L}$ is the cost function) and need to compute $dZ^{[l]}$ (i.e. $\frac{\partial \mathcal{L}}{\partial Z^{[l]}}$). The formula, with $*$ denoting the element-wise product, is:

$$dZ^{[l]}=\frac{\partial \mathcal{L}}{\partial Z^{[l]}}=\frac{\partial \mathcal{L}}{\partial A^{[l]}}\,\frac{\partial A^{[l]}}{\partial Z^{[l]}}=dA^{[l]}*\sigma'(Z^{[l]})=dA^{[l]}*\sigma(Z^{[l]})*(1-\sigma(Z^{[l]}))\tag{5}$$

Python implementation of sigmoid backpropagation:

    def sigmoid_backward(dA, cache):
        """
        Implement the backward propagation for a single SIGMOID unit.

        Arguments:
        dA -- post-activation gradient, of any shape
        cache -- 'Z' where we store for computing backward propagation efficiently

        Returns:
        dZ -- Gradient of the cost with respect to Z
        """

        Z = cache

        s = 1/(1+np.exp(-Z))
        dZ = dA * s * (1-s)

        assert (dZ.shape == Z.shape)

        return dZ
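
A backward function like this can be verified with a numerical gradient check. The sketch below restates compact versions of `sigmoid` and `sigmoid_backward` so that it runs on its own, and uses a toy cost $\mathcal{L}=\sum A$, which is an assumption made purely for the check.

```python
import numpy as np

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    return A, Z

def sigmoid_backward(dA, cache):
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

rng = np.random.default_rng(1)
Z = rng.normal(size=(3, 4))
A, cache = sigmoid(Z)

# Toy cost L = sum(A), so dL/dA is an array of ones.
dA = np.ones_like(A)
dZ = sigmoid_backward(dA, cache)

# Central finite differences of L with respect to each entry of Z.
eps = 1e-6
numeric = np.zeros_like(Z)
for idx in np.ndindex(Z.shape):
    Zp, Zm = Z.copy(), Z.copy()
    Zp[idx] += eps
    Zm[idx] -= eps
    numeric[idx] = (sigmoid(Zp)[0].sum() - sigmoid(Zm)[0].sum()) / (2 * eps)

print(np.max(np.abs(dZ - numeric)))   # should be close to zero
```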


## 6. Advantages and disadvantages of the sigmoid function

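Advantages of sigmoid:

Disadvantages of sigmoid:
1. The activation is computationally expensive: it requires an exponential, and computing the error gradient during backpropagation involves a division.
2. During backpropagation the gradient vanishes very easily, which can make it impossible to train deep networks.
3. The sigmoid saturates, and a saturated unit kills the gradient.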
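
The vanishing-gradient problem can be made concrete with a few numbers. The derivative of the sigmoid never exceeds 0.25, so, ignoring the weight matrices, the chain rule multiplies in a factor of at most 0.25 per sigmoid layer. The snippet below is a rough illustration of that bound, not a full network.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

print(sigmoid_prime(0.0))    # 0.25, the maximum of the derivative
print(sigmoid_prime(10.0))   # tiny: the unit is saturated

# Upper bound on the product of activation derivatives after n sigmoid layers.
for n in (1, 5, 10, 20):
    print(n, 0.25 ** n)
```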

# II. Logistic regression code

    import numpy as np
    import matplotlib.pyplot as plt

    filename = '...\\testSet.txt'  # path to the data file


    def loadDataSet():
        """Read the data (two features per row, label in the third column)."""
        dataMat = []
        labelMat = []
        with open(filename) as fr:
            for line in fr:
                lineArr = line.strip().split()
                # The leading 1.0 is the constant term: with two features X1, X2
                # the model needs three parameters, W1 + W2*X1 + W3*X2.
                dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
                labelMat.append(int(lineArr[2]))
        return dataMat, labelMat


    def sigmoid(inX):
        """Sigmoid function."""
        return 1.0 / (1 + np.exp(-inX))


    def gradAscent(dataMat, labelMat):
        """Batch gradient ascent for the optimal parameters."""
        dataMatrix = np.array(dataMat)                    # shape (m, n)
        classLabels = np.array(labelMat).reshape(-1, 1)   # shape (m, 1)
        m, n = dataMatrix.shape
        alpha = 0.001     # learning rate (step size): larger means bigger updates
        maxCycles = 500   # number of iterations; tune per data set (200 may be enough)
        weights = np.ones((n, 1))  # initial parameters, all set to 1, as a column vector
        for k in range(maxCycles):
            h = sigmoid(dataMatrix @ weights)
            error = classLabels - h   # difference term left after differentiating
            weights = weights + alpha * dataMatrix.T @ error   # update the weights
        return weights


    def stocGradAscent0(dataMat, labelMat):
        """Stochastic gradient ascent.

        Using the full data set in every iteration is expensive when the data
        is large, so each update here uses a single row only.
        """
        dataMatrix = np.array(dataMat)
        classLabels = labelMat
        m, n = dataMatrix.shape
        alpha = 0.01
        maxCycles = 500
        weights = np.ones(n)
        for k in range(maxCycles):
            for i in range(m):   # visit every row in order
                h = sigmoid(dataMatrix[i] @ weights)
                error = classLabels[i] - h
                weights = weights + alpha * error * dataMatrix[i]
        return weights.reshape(-1, 1)


    def stocGradAscent1(dataMat, labelMat):
        """Improved stochastic gradient ascent.

        Each pass visits the samples in random order without replacement, and
        the step size shrinks as the iterations proceed.
        """
        dataMatrix = np.array(dataMat)
        classLabels = labelMat
        m, n = dataMatrix.shape
        weights = np.ones(n)
        maxCycles = 500
        for j in range(maxCycles):
            dataIndex = list(range(m))
            for i in range(m):
                alpha = 4 / (1 + j + i) + 0.0001   # step size decays with iterations
                randIndex = int(np.random.uniform(0, len(dataIndex)))   # random draw
                sample = dataIndex[randIndex]   # row index of the drawn sample
                h = sigmoid(dataMatrix[sample] @ weights)
                error = classLabels[sample] - h
                weights = weights + alpha * error * dataMatrix[sample]
                del dataIndex[randIndex]   # remove the sample already drawn
        return weights.reshape(-1, 1)


    def plotBestFit(weights):
        """Plot the data and the fitted decision boundary."""
        dataMat, labelMat = loadDataSet()
        dataArr = np.array(dataMat)
        n = dataArr.shape[0]
        xcord1 = []; ycord1 = []
        xcord2 = []; ycord2 = []
        for i in range(n):
            if int(labelMat[i]) == 1:
                xcord1.append(dataArr[i, 1])
                ycord1.append(dataArr[i, 2])
            else:
                xcord2.append(dataArr[i, 1])
                ycord2.append(dataArr[i, 2])
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
        ax.scatter(xcord2, ycord2, s=30, c='green')
        x = np.arange(-3.0, 3.0, 0.1)
        # Decision boundary: w0 + w1*x1 + w2*x2 = 0, solved for x2.
        y = (-weights[0] - weights[1] * x) / weights[2]
        ax.plot(x, y)
        plt.xlabel('X1')
        plt.ylabel('X2')
        plt.show()


    def main():
        dataMat, labelMat = loadDataSet()
        weights = gradAscent(dataMat, labelMat)
        plotBestFit(weights)


    if __name__ == '__main__':
        main()
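
Since the listing depends on a `testSet.txt` file, the batch update rule can also be tried on synthetic data. The sketch below is self-contained and illustrative: the two Gaussian clusters, the `grad_ascent` helper name and its default hyperparameters are assumptions that mirror the structure of `gradAscent`, not the original data set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent(X, y, alpha=0.001, max_cycles=500):
    """Batch gradient ascent on the logistic log-likelihood."""
    weights = np.ones((X.shape[1], 1))
    for _ in range(max_cycles):
        error = y - sigmoid(X @ weights)
        weights = weights + alpha * X.T @ error
    return weights

# Synthetic two-class data: two well-separated Gaussian clusters (made up).
rng = np.random.default_rng(0)
m = 200
pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(m // 2, 2))
neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(m // 2, 2))
X = np.hstack([np.ones((m, 1)), np.vstack([pos, neg])])   # prepend x0 = 1
y = np.vstack([np.ones((m // 2, 1)), np.zeros((m // 2, 1))])

weights = grad_ascent(X, y)
predictions = (X @ weights >= 0).astype(float)   # predict 1 when theta^T x >= 0
print("training accuracy:", (predictions == y).mean())
```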


2. Andrew Ng, Machine Learning lecture notes and slides (《吴恩达机器学习讲义以及课件》)
