[Machine Learning] Principles of Logistic Regression

Latest recommended article published on 2022-01-28 00:21:42

Mr_health

Article link: https://blog.csdn.net/Mr_health/article/details/84728437

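Logistic regression assumes that the label follows a Bernoulli distribution, estimates its parameters by maximizing the likelihood function (solved here with gradient descent), and thereby separates the data into two classes. It is a linear model.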

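References: 手推逻辑回归--面试前拯救一下 ("Deriving logistic regression by hand: a last-minute rescue before the interview")

手推记录-logistic regression ("Hand-derivation notes: logistic regression")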

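1. Classification and Regression

Regression and classification are the two major problem types in machine learning. A regression problem has a continuous output, whereas a classification problem outputs one of a finite set of discrete values, each representing a class.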

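Regression is typically used to predict a numeric value, such as a house price or tomorrow's temperature (23, 24, 25 degrees, and so on); these outputs are continuous. A common regression algorithm is linear regression, whose formula is:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

In vector form (with $x_0 = 1$):

$h_\theta(x) = \theta^T x$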
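
The two ways of writing the linear regression hypothesis, as a weighted sum and as the inner product $\theta^T x$ with $x_0 = 1$, can be compared numerically. This is a minimal sketch; the values of `theta` and `x` are made up for illustration.

```python
import numpy as np

# Hypothetical parameters and one sample; x[0] = 1 absorbs the intercept theta_0.
theta = np.array([1.0, 2.0, -0.5])
x = np.array([1.0, 3.0, 4.0])

# Scalar form: theta_0 * x_0 + theta_1 * x_1 + theta_2 * x_2
h_scalar = sum(t * xi for t, xi in zip(theta, x))

# Vector form: theta^T x
h_vector = theta @ x

print(h_scalar, h_vector)
```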

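Classification assigns a label to an item, so the result is usually a discrete value, for example deciding whether the animal in a picture is a cat or a dog. Classification is often built on top of regression.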

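2. Logistic Regression

Despite its name, logistic regression is a classification method, used mainly for binary classification. Given the data, what we fit (regress) is the line (expression) that separates the classes.

Logistic regression builds on the linear regression model: the sigmoid activation function (below) maps $wx + b$ into the interval $(0, 1)$, and a threshold then splits the outputs, with values on one side of the threshold assigned to one class and the rest to the other.

$g(z) = \frac{1}{1 + e^{-z}}$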

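The logistic regression model can be written as follows. Typically, samples whose sigmoid output $h_\theta(x)$ is at least 0.5 are assigned to class 1, and those below 0.5 to class 0:

$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

In other words, for a given input $x$ and parameters $\theta$, $h_\theta(x)$ computes the probability that the output is $y = 1$, that is, $h_\theta(x) = P(y=1 \mid x; \theta)$. We can treat it as an estimate of the posterior probability of $y = 1$.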
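
As a minimal sketch of this model, the snippet below computes $P(y=1 \mid x)$ for three samples and thresholds the result at 0.5. The weights `w`, bias `b`, and samples `X` are hypothetical values chosen for illustration, not taken from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and three 2-feature samples (one per row).
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0],
              [-2.0, 0.0],
              [0.0, 2.0]])

z = X @ w + b                    # linear scores: 1.5, -3.5, -1.5
p = sigmoid(z)                   # estimated P(y = 1 | x), always inside (0, 1)
y_hat = (p >= 0.5).astype(int)   # class 1 when the probability reaches 0.5

print(p, y_hat)
```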

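So is logistic regression a linear model or a nonlinear one?

Although the function $g(z)$ itself is nonlinear, whether the composite model counts as linear is determined by the form of $z$, namely $z = wx$. The decision boundary we are solving for is $z = 0$, with positive examples on one side and negative examples on the other. The role of $g(z)$ is to map points on the two sides of the decision boundary to the two sides of the threshold on the logistic curve. (Apart from the sigmoid mapping, every other step of the algorithm is the same as in linear regression.)

$y = \begin{cases} 1, & g(z) \ge 0.5 \\ 0, & g(z) < 0.5 \end{cases} \Rightarrow y = \begin{cases} 1, & z = wx \ge 0 \\ 0, & z = wx < 0 \end{cases}$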
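
The equivalence between thresholding $g(z)$ at 0.5 and thresholding the linear score $z$ at 0 can be checked on random points. This is only an illustrative sketch; the weights `w` and the sample points are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.5, -2.0])          # hypothetical weights
X = rng.normal(size=(1000, 2))     # random 2-D points, one per row

z = X @ w
g = 1.0 / (1.0 + np.exp(-z))

by_probability = g >= 0.5          # threshold the sigmoid output
by_linear_score = z >= 0           # threshold the linear score directly

print(np.array_equal(by_probability, by_linear_score))
```

Because both rules give the same labels, the boundary between the classes is the hyperplane $wx = 0$, which is why the model counts as linear.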

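3. Deriving Logistic Regression

3.1 Starting from a hand derivation of linear regression

3.2 The logistic regression cost function

The first idea that comes to mind is to imitate linear regression and use the sum of squared errors as the cost function:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

The problem is that when we substitute

$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

into a cost function defined this way, the result is a non-convex function.

This means the cost function can have many local minima, which hinders gradient descent from finding the global minimum. We need a convex function as the cost function instead.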
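
The non-convexity can be seen numerically even in the simplest case. The sketch below assumes a single hypothetical sample with $x = 1$, $y = 1$ and a scalar $\theta$, and compares the curvature of the squared-error cost with that of the negative log cost $-\ln h_\theta(x)$ on a uniform grid. It only shows that the squared error has regions of negative curvature; it does not by itself exhibit multiple local minima.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hypothetical training sample with x = 1 and y = 1, so h = sigmoid(theta).
theta = np.linspace(-6.0, 6.0, 241)
h = sigmoid(theta)

squared_error = 0.5 * (h - 1.0) ** 2
log_loss = -np.log(h)

# A convex function sampled on a uniform grid has non-negative second differences.
sq_curvature = np.diff(squared_error, n=2)
ll_curvature = np.diff(log_loss, n=2)

print("squared error, min second difference:", sq_curvature.min())
print("log loss, min second difference:", ll_curvature.min())
```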

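So let us approach the problem differently. As noted above, $h_\theta(x)$ can be treated as an estimate of the posterior probability of $y = 1$, so:
$P(y=1 \mid x; \theta) = h_\theta(x) = g(\theta^T x) = g(z)$

The probability of $y = 0$ is then $P(y=0 \mid x; \theta) = 1 - g(z)$.

This shows that in logistic regression the label $y$ follows a Bernoulli distribution.

A simple example of a Bernoulli distribution is a coin toss: heads comes up with probability $p$ and tails with probability $1 - p$. The logistic regression model assumes that $h_\theta(x)$ is the probability that a sample is positive and $1 - h_\theta(x)$ the probability that it is negative.

The two expressions can be combined into a single general form:

$P(y \mid x; \theta) = g(z)^{y} \left(1 - g(z)\right)^{1-y}$

We can use this expression to build our loss function.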
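
The combined Bernoulli form $p^{y}(1-p)^{1-y}$ reduces to $p$ when $y = 1$ and to $1 - p$ when $y = 0$. A short sketch, where `bernoulli_pmf` is a helper name introduced here for illustration and `p = 0.8` is an arbitrary value standing in for $h_\theta(x)$:

```python
def bernoulli_pmf(y, p):
    """P(Y = y) for y in {0, 1} when P(Y = 1) = p."""
    return p ** y * (1 - p) ** (1 - y)

p = 0.8  # hypothetical value of h_theta(x)
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
```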

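3.3 Deriving logistic regression

Next, we use maximum likelihood estimation to estimate the parameters $\theta$ from the given training set.

Why is the logistic regression (LR) loss based on the likelihood function?

Because we want every sample to be assigned to its correct class with the highest possible probability: a sample of class 1 should have a high predicted probability of being 1, and a sample of class 0 a high predicted probability of being 0. That is, for each sample we want $\max P(Y \mid X) = \max\, g(z)^{y} \left(1 - g(z)\right)^{1-y}$. Maximizing the product of these probabilities over all samples is what we want, and that product is exactly the likelihood function.

step1: construct the likelihood function:

$L(\theta) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1-y^{(i)}}$

step2: take the logarithm to obtain the log-likelihood, which simplifies the computation:

$l(\theta) = \ln L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_\theta(x^{(i)})\right) \right]$

step3: obtain the cost function

We are looking for the $\theta$ that maximizes $l(\theta)$, whereas a cost function should get smaller as the predictions get closer to the true values. The cost function is therefore the negative log-likelihood, here averaged over the $m$ samples:

$J(\theta) = -\frac{1}{m} l(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_\theta(x^{(i)})\right) \right]$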
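
Steps 1 to 3 can be traced on a handful of numbers. In this sketch the predicted probabilities `h` and labels `y` are made-up values; the point is only that the product of per-sample probabilities, its logarithm, and the averaged negative log-likelihood are consistent with one another.

```python
import numpy as np

# Hypothetical predicted probabilities h_theta(x_i) and true labels y_i.
h = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

# step1: likelihood, the product of the per-sample probabilities
likelihood = np.prod(h ** y * (1 - h) ** (1 - y))

# step2: log-likelihood, the sum of the per-sample log-probabilities
log_likelihood = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# step3: cost, the negative log-likelihood averaged over the samples
cost = -log_likelihood / len(y)

print(likelihood, log_likelihood, cost)
```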

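To understand the cost function better, consider a single sample:
$J(g(z), y; \theta) = -\left( y \ln g(z) + (1 - y) \ln\left(1 - g(z)\right) \right)$
In other words:

$J(g(z), y; \theta) = \begin{cases} -\ln g(z), & y = 1 \\ -\ln\left(1 - g(z)\right), & y = 0 \end{cases}$

[Figure: the cost as a function of $g(z)$, for $y = 1$ and for $y = 0$]

The plot makes it easy to see that if the sample's label is 1, the cost shrinks as the estimate $g(z)$ approaches 1 and grows as it moves away; likewise, if the label is 0, the cost shrinks as $g(z)$ approaches 0 and grows as it moves away.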
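
The same behaviour can be read off a few numbers. This is a small sketch; `sample_cost` is a helper name introduced here, and the probabilities 0.99, 0.5, and 0.01 are arbitrary.

```python
import math

def sample_cost(g, y):
    """Cost of one sample with predicted probability g (0 < g < 1) and label y."""
    return -(y * math.log(g) + (1 - y) * math.log(1 - g))

for g in (0.99, 0.5, 0.01):
    print(f"g(z)={g}: cost if y=1 -> {sample_cost(g, 1):.4f}, "
          f"cost if y=0 -> {sample_cost(g, 0):.4f}")
```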

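step4: gradient descent

With this cost function in hand, we can use gradient descent to find the parameters that minimize it. Gradient descent needs the derivative of the sigmoid function, which is derived as follows:

$g'(z) = \frac{d}{dz} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right) = g(z)\left(1 - g(z)\right)$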
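
The identity $g'(z) = g(z)(1 - g(z))$ can be sanity-checked against a central finite difference. A minimal sketch; the grid of `z` values and the step `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

analytic = sigmoid(z) * (1.0 - sigmoid(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)   # central difference

print(np.max(np.abs(analytic - numeric)))
```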

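Computing the gradient:

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

So the weight update in gradient descent is:

$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Note: the update above omits the factor $1/m$; with it, the step size becomes $\alpha / m$.

Although this gradient descent update looks the same as the one for linear regression, here $h_\theta(x) = g(\theta^T x)$ is not the linear regression hypothesis, so the two are in fact different.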
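
For reference, here is a sketch of how the per-sample gradient follows from the chain rule, using $z = \theta^T x$ and $g'(z) = g(z)(1 - g(z))$; sample indices are dropped for readability.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_j}\Bigl[-\bigl(y \ln g(z) + (1-y)\ln(1-g(z))\bigr)\Bigr]
  &= -\left(\frac{y}{g(z)} - \frac{1-y}{1-g(z)}\right) g'(z)\,\frac{\partial z}{\partial \theta_j} \\
  &= -\left(\frac{y}{g(z)} - \frac{1-y}{1-g(z)}\right) g(z)\bigl(1-g(z)\bigr)\, x_j \\
  &= -\bigl(y\,(1-g(z)) - (1-y)\,g(z)\bigr)\, x_j \\
  &= \bigl(g(z) - y\bigr)\, x_j
\end{align*}
```

Averaging this expression over the $m$ samples gives $\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$.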

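4. Example

Logistic regression is often used for binary classification. The image is first flattened into a one-dimensional vector and multiplied by w as in linear regression; the result then passes through the sigmoid activation function to give a single value, which is compared against a preset threshold to decide the class.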
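
The pipeline for a single image can be sketched as follows. Everything here is a stand-in: the 4x4 RGB image is random noise, and the weights `w` and bias `b` are arbitrary rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.integers(0, 256, size=(4, 4, 3))        # hypothetical 4x4 RGB image
x = image.reshape(-1, 1) / 255.0                    # flatten to a (48, 1) column, scale to [0, 1]

w = rng.normal(scale=0.01, size=(x.shape[0], 1))    # hypothetical, untrained weights
b = 0.0

a = 1.0 / (1.0 + np.exp(-(w.T @ x + b)))            # sigmoid output, shape (1, 1)
label = int(a[0, 0] > 0.5)                          # compare with the threshold 0.5

print(x.shape, a[0, 0], label)
```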

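1. Forward computation and loss computation

2. Backpropagation and parameter update (using gradient descent)

3. Input data requirements and preprocessing (this is my own data; following this format is enough)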

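import numpy as np

# train_set_x_orig, train_set_y, test_set_x_orig and test_set_y are assumed to be
# loaded already, with images stored as arrays of shape (m, num_px, num_px, 3).
### START CODE HERE ### (≈ 3 lines of code)
m_train = train_set_x_orig.shape[0]  # number of training examples
m_test = test_set_x_orig.shape[0]    # number of test examples
num_px = train_set_x_orig.shape[1]   # image height (equal to the width)
### END CODE HERE ###
# For a 1-D array, shape holds its length: a = np.array([1, 2]) gives a.shape == (2,).
# For a 2-D array, shape is (rows, columns): a = np.array([[1, 2, 3], [4, 5, 6]]) gives a.shape == (2, 3).
# For a 3-D array such as a 2x2 RGB image stored channels-last (as here), a.shape == (2, 2, 3).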
print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
print ("Height/Width of each image: num_px = " + str(num_px))
print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")
print ("train_set_x shape: " + str(train_set_x_orig.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x shape: " + str(test_set_x_orig.shape))
print ("test_set_y shape: " + str(test_set_y.shape))

Output:

# Reshape the training and test examples

### START CODE HERE ### (≈ 2 lines of code)
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T
### END CODE HERE ###

print ("train_set_x_flatten shape: " + str(train_set_x_flatten.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x_flatten shape: " + str(test_set_x_flatten.shape))
print ("test_set_y shape: " + str(test_set_y.shape))
print ("sanity check after reshaping: " + str(train_set_x_flatten[0:5,0]))

Output:

## Preprocessing: scale pixel values to the range [0, 1]
train_set_x = train_set_x_flatten/255.
test_set_x = test_set_x_flatten/255.
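
What `reshape(m, -1).T` does is easier to see on a tiny array. In this sketch `batch` is a made-up stand-in for `train_set_x_orig`: two 2x2 RGB images whose values are simply 0 to 23.

```python
import numpy as np

# Hypothetical batch: 2 images of size 2x2 with 3 channels, values 0..23.
batch = np.arange(2 * 2 * 2 * 3).reshape(2, 2, 2, 3)

flat = batch.reshape(batch.shape[0], -1).T   # shape (12, 2): one column per image
scaled = flat / 255.0                        # pixel values scaled into [0, 1]

print(flat.shape)
print(flat[:, 0])   # all pixels of the first image
```

Each image ends up as one column, which matches the `(num_px * num_px * 3, number of examples)` layout the later functions expect.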

4. Main code implementation

(1) The sigmoid function

def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(z)
    """

    ### START CODE HERE ### (≈ 1 line of code)
    s = 1 / (1 + np.exp(-z))
    ### END CODE HERE ###
    
    return s
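
One practical caveat: for very negative `z`, `np.exp(-z)` overflows, and NumPy emits a RuntimeWarning even though the result still rounds to 0. A possible variant that avoids the overflow is sketched below. `stable_sigmoid` is a name introduced here, not part of the original code, and the original `sigmoid` is fine for the scaled pixel inputs used in this article.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))                       # always in (0, 1], never overflows
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(stable_sigmoid(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])))
```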

(2) Parameter initialization

# GRADED FUNCTION: initialize_with_zeros

def initialize_with_zeros(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.
    
    Argument:
    dim -- size of the w vector we want (or number of parameters in this case)
    
    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar (corresponds to the bias)
    """
    
    ### START CODE HERE ### (≈ 1 line of code)
    w = np.zeros((dim, 1))
    b = 0
    ### END CODE HERE ###

    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))
    
    return w, b

(3) Forward propagation

def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    cost -- negative log-likelihood cost for logistic regression
    dw -- gradient of the loss with respect to w, thus same shape as w
    db -- gradient of the loss with respect to b, thus same shape as b
    
    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """
    
    m = X.shape[1]
    
    # FORWARD PROPAGATION (FROM X TO COST)
    ### START CODE HERE ### (≈ 2 lines of code)
    A = sigmoid(np.dot(w.T, X) + b)            # compute activation
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))         # compute cost
    ### END CODE HERE ###
    
    # BACKWARD PROPAGATION (TO FIND GRAD)
    ### START CODE HERE ### (≈ 2 lines of code)
    dw = 1 / m * np.dot(X, (A - Y).T)
    db = 1 / m * np.sum(A - Y)
    ### END CODE HERE ###
    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost
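
A common way to gain confidence in gradients like `dw` and `db` is to compare them with finite differences of the cost. The sketch below is self-contained: `cost_and_grads` is a compact re-statement of the same formulas, written here for illustration, and the tiny `w`, `b`, `X`, `Y` are made-up values.

```python
import numpy as np

def cost_and_grads(w, b, X, Y):
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = X @ (A - Y).T / m
    db = np.sum(A - Y) / m
    return cost, dw, db

# Hypothetical tiny problem: 2 features, 3 examples (one per column).
w = np.array([[0.1], [-0.2]])
b = 0.3
X = np.array([[1.0, 2.0, -1.0],
              [3.0, 0.5, -2.0]])
Y = np.array([[1, 0, 1]])

cost, dw, db = cost_and_grads(w, b, X, Y)

# Central finite differences of the cost with respect to each parameter.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    numeric_dw[j, 0] = (cost_and_grads(w + step, b, X, Y)[0]
                        - cost_and_grads(w - step, b, X, Y)[0]) / (2 * eps)
numeric_db = (cost_and_grads(w, b + eps, X, Y)[0]
              - cost_and_grads(w, b - eps, X, Y)[0]) / (2 * eps)

print(np.max(np.abs(dw - numeric_dw)), abs(db - numeric_db))
```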

(4) Parameter optimization

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
    """
    This function optimizes w and b by running a gradient descent algorithm
    
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the loss every 100 steps
    
    Returns:
    params -- dictionary containing the weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
    costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.
    
    Tips:
    You basically need to write down two steps and iterate through them:
        1) Calculate the cost and the gradient for the current parameters. Use propagate().
        2) Update the parameters using gradient descent rule for w and b.
    """
    
    costs = []
    
    for i in range(num_iterations):
        
        
        # Cost and gradient calculation (≈ 1-4 lines of code)
        ### START CODE HERE ### 
        grads, cost = propagate(w, b, X, Y)
        ### END CODE HERE ###
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # update rule (≈ 2 lines of code)
        ### START CODE HERE ###
        w = w - learning_rate * dw
        b = b - learning_rate * db
        ### END CODE HERE ###
        
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
        
        # Print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

(5) Prediction function

def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
    
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    
    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''
    
    m = X.shape[1]  # number of examples in X
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    ### START CODE HERE ### (≈ 1 line of code)
    A = sigmoid(np.dot(w.T, X) + b)
    ### END CODE HERE ###

    for i in range(A.shape[1]):
        
        # Convert probabilities A[0,i] to actual predictions p[0,i]
        ### START CODE HERE ### (≈ 4 lines of code)
        if A[0, i] <= 0.5:
            Y_prediction[0, i] = 0
        else:
            Y_prediction[0, i] = 1
        ### END CODE HERE ###
    
    assert(Y_prediction.shape == (1, m))
    
    return Y_prediction
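
The per-example loop in `predict` can also be written as one vectorized comparison. A small sketch with made-up probabilities; note that 0.5 itself maps to 0, matching the `<= 0.5` branch of the loop.

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.51, 0.97]])    # hypothetical probabilities, shape (1, m)
Y_prediction = (A > 0.5).astype(float)    # 1.0 where A > 0.5, otherwise 0.0

print(Y_prediction)
```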

(6) Training

def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
    """
    Builds the logistic regression model by calling the function you've implemented previously
    
    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- Set to true to print the cost every 100 iterations
    
    Returns:
    d -- dictionary containing information about the model.
    """
    
    ### START CODE HERE ###
    
    # initialize parameters with zeros (≈ 1 line of code)
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent (≈ 1 line of code)
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
    
    # Retrieve parameters w and b from dictionary "parameters"
    w = parameters["w"]
    b = parameters["b"]
    
    # Predict test/train set examples (≈ 2 lines of code)
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    ### END CODE HERE ###

    # Print train/test Errors
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d

Results:

(7) Visualizing the loss

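import numpy as np
import matplotlib.pyplot as plt

# d is the dictionary returned by model(...) above,
# e.g. d = model(train_set_x, train_set_y, test_set_x, test_set_y).
costs = np.squeeze(d['costs'])   # one recorded cost per 100 iterations
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate =" + str(d["learning_rate"]))
plt.show()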
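
Since the image dataset is not included here, the same training loop can be tried end to end on synthetic data. This is a compact sketch under stated assumptions: the data are random 2-D points labelled by the made-up rule $x_1 + x_2 > 0$, and the learning rate and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data: 2 features, 200 examples (one per column).
m = 200
X = rng.normal(size=(2, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # label 1 when x1 + x2 > 0

w = np.zeros((2, 1))
b = 0.0
learning_rate = 0.5

for i in range(1000):
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))    # forward pass
    dw = X @ (A - Y).T / m                      # gradient of the cost w.r.t. w
    db = np.sum(A - Y) / m                      # gradient of the cost w.r.t. b
    w -= learning_rate * dw                     # gradient descent update
    b -= learning_rate * db

A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
accuracy = np.mean((A > 0.5) == (Y == 1))
print("train accuracy:", accuracy)
```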