[Notes] Machine Learning: Logistic Regression

kestiny

Category: Python / Machine Learning

Article link: https://blog.csdn.net/chlk118/article/details/80198852

Logistic regression is a generalized linear model that is used for classification. It is probably one of the most commonly used classification methods.

The sigmoid function

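In logistic regression the dependent variable is binary, and the model's estimate is a probability, so its value must lie between 0 and 1. We therefore need a function with that property, which is where the sigmoid function comes in.
The sigmoid function is defined as: $\sigma(x) = \frac{1}{1 + e^{-x}}$
When x = 0 the sigmoid function equals 0.5; as x increases its value approaches 1, and as x decreases its value approaches 0.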
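A quick numerical check of these properties, as a minimal sketch. The function is spelled `sigmoid` here (the article's own code names it `sigmod`), and the sample inputs are arbitrary values chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


# Arbitrary sample inputs, chosen only to illustrate the shape of the curve.
xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(xs)

for x, v in zip(xs, values):
    print(f"sigmoid({x:+.1f}) = {v:.5f}")
```

The printed values rise monotonically, pass through 0.5 at x = 0, and stay strictly between 0 and 1.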

How it works

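To build a logistic regression classifier, we multiply each feature by a regression coefficient, add up all the products, and feed the sum into the sigmoid function, which gives a value between 0 and 1. Any sample whose value is greater than 0.5 is assigned to class 1, and any sample whose value is less than 0.5 is assigned to class 0.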
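The decision rule can be sketched in a few lines. The coefficients and samples below are made-up numbers for illustration, not values fitted from the article's data, and `classify` is a hypothetical helper name.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def classify(features, weights, threshold=0.5):
    """Return (probability, label) for one sample."""
    z = np.dot(features, weights)   # weighted sum of the features
    p = sigmoid(z)                  # squash the sum into (0, 1)
    return p, int(p > threshold)    # class 1 above the threshold, else class 0


# Hypothetical coefficients; the first feature is a constant 1.0 (intercept).
weights = np.array([0.5, 1.0, -2.0])
sample_a = np.array([1.0, 2.0, 0.5])   # weighted sum = 0.5 + 2.0 - 1.0 = 1.5
sample_b = np.array([1.0, 0.0, 1.0])   # weighted sum = 0.5 + 0.0 - 2.0 = -1.5

p_a, label_a = classify(sample_a, weights)
p_b, label_b = classify(sample_b, weights)
print(p_a, label_a)
print(p_b, label_b)
```

A positive weighted sum gives a probability above 0.5 and therefore class 1; a negative sum gives class 0.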

Gradient ascent

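To find the best regression coefficients we need an optimization algorithm, and a commonly used one is gradient ascent. The idea is that to find the maximum of a function, we move along the direction of its gradient. Writing the gradient as grad(x, y) or $\nabla f(x, y)$, the gradient of a function f(x, y) is:
$\nabla f(x, y) = \left( \frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y} \right)$
The gradient always points in the direction in which the function value increases fastest. The size of each move along the gradient is called the step size, written α. In vector form, the iteration of the gradient ascent algorithm is $w := w + \alpha \nabla_w f(w)$. This update is applied repeatedly until a stopping condition is met, such as a fixed number of iterations or the error falling within an allowed tolerance.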
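To see the iteration at work on something simpler than logistic regression, here is a minimal sketch that maximizes a toy function. The function f(x, y) = -(x - 1)^2 - (y + 2)^2, the step size, and the iteration count are all assumptions chosen for illustration; its maximum is at (1, -2).

```python
import numpy as np


def grad_f(point):
    """Gradient of the toy function f(x, y) = -(x - 1)**2 - (y + 2)**2."""
    x, y = point
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])


alpha = 0.1              # step size
point = np.zeros(2)      # starting point (0, 0)

for _ in range(100):     # stopping condition: a fixed number of iterations
    point = point + alpha * grad_f(point)   # move along the gradient

print(point)
```

Each step moves the point a fraction of the way toward the maximum, so after enough iterations it settles very close to (1, -2).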

Python implementation

Implementing the sigmoid function

# -*- coding: utf-8 -*-
__author__ = 'kestiny'

import numpy as np
import matplotlib.pyplot as plt
import random


def sigmod(inX):
    return 1.0 / (1 + np.exp(-inX))

Finding the best regression coefficients iteratively with gradient ascent

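def gradAscent(dataMat, labelMat):
    m, n = np.shape(dataMat)
    alpha = 0.001    # step size for each gradient step
    maxCycles = 500  # number of iterations, used as the stopping condition
    weights = np.ones((n, 1))  # initialize all regression coefficients to 1
    for k in range(maxCycles):
        h = sigmod(np.dot(dataMat, weights))  # predicted probabilities, shape (m, 1)
        error = labelMat - h  # difference between the labels and the predictions
        weights = weights + alpha * np.dot(dataMat.transpose(), error)  # adjust the coefficients in the direction of the error
    return weights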
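
The update in `gradAscent` can be read as one gradient ascent step on the log-likelihood of the data. The derivation below is a standard sketch of that connection; the symbols $X$ (the data matrix), $y$ (the labels), $h$ (the predicted probabilities), and $\ell$ (the log-likelihood) are notation introduced here, not taken from the original post. It uses the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```latex
h_i = \sigma(x_i^{\top} w), \qquad
\ell(w) = \sum_{i=1}^{m} \left[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \right]

\nabla_w \ell(w)
  = \sum_{i=1}^{m} (y_i - h_i)\, x_i
  = X^{\top} (y - h)

w := w + \alpha\, X^{\top} (y - h)
```

The last line corresponds term by term to the code: `error` is $y - h$, `dataMat.transpose()` is $X^{\top}$, and `alpha` is $\alpha$.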

Test

def plotBestFit(weights, dataMat, labelMat):
    n = dataMat.shape[0]
    xcord1 = []
    xcord2 = []
    ycord1 = []
    ycord2 = []
    for i in range(n):
        if int(np.ravel(labelMat)[i]) == 1:  # works for labels shaped (m,) or (m, 1)
            xcord1.append(dataMat[i, 1])
            ycord1.append(dataMat[i, 2])
        else:
            xcord2.append(dataMat[i, 1])
            ycord2.append(dataMat[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s = 30, c='r', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = np.arange(-5.0, 5.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]
    plt.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel("X2")
    plt.show()

if __name__ == '__main__':
    file = 'testSet.txt'
    data = np.loadtxt(file, dtype=float, delimiter='\t', encoding='utf-8')
    dataMat, labelMat = np.split(data, (2,), axis=1)  # first two columns are features, the last is the label
    rows = dataMat.shape[0]
    dataMat = np.insert(dataMat, 0, values=np.ones((1, rows)), axis=1)  # prepend a column of ones for the intercept
    weights = gradAscent(dataMat, labelMat)
    print('Coefficients:', weights)
    plotBestFit(weights, dataMat, labelMat)

Result

[Figure: scatter plot of the two classes of sample points with the fitted decision boundary]

Stochastic gradient ascent

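Stochastic gradient ascent addresses the awkward problem that gradient ascent has with very large data sets. Because gradient ascent has to traverse the entire data set every time it updates the regression coefficients, its computational cost becomes too high once there are hundreds of millions of samples and thousands of features. An improvement is to update the coefficients using only one sample at a time, or a fixed number N of samples (for example 10 or 100); this method is called stochastic gradient ascent. Since the classifier can be updated incrementally as new samples arrive, stochastic gradient ascent is an online learning algorithm.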
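The variant that updates on a fixed number of samples at a time can be sketched as follows. This is an illustrative sketch under stated assumptions, not the article's implementation: the function name `minibatch_grad_ascent`, the synthetic data, the zero initialization, and the hyperparameters (`batch_size`, `epochs`, `alpha`) are all made up for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def minibatch_grad_ascent(X, y, batch_size=10, epochs=50, alpha=0.05, seed=0):
    """Update the coefficients using batch_size samples per step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    weights = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                 # shuffle once per pass
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            h = sigmoid(X[batch] @ weights)        # predictions for this batch
            error = y[batch] - h
            weights += alpha * X[batch].T @ error  # step using only this batch
    return weights


# Synthetic data (an assumption for this example): label 1 when x1 + x2 > 0.
rng = np.random.default_rng(42)
m = 200
features = rng.normal(size=(m, 2))
X = np.column_stack([np.ones(m), features])        # column of ones for the intercept
y = (features[:, 0] + features[:, 1] > 0).astype(float)

weights = minibatch_grad_ascent(X, y)
predictions = (sigmoid(X @ weights) > 0.5).astype(float)
accuracy = np.mean(predictions == y)
print('weights:', weights)
print('training accuracy:', accuracy)
```

Each update touches only `batch_size` rows, so the cost of one step does not grow with the size of the data set.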

Python implementation

Stochastic gradient ascent algorithm

def stocGradAscent(dataMat, labelMat, times=100):
    m, n = np.shape(dataMat)
    weights = np.ones(n)  # initialize all regression coefficients to 1
    for j in range(times):
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # step size shrinks as j and i grow but never drops below 0.01
            randIndex = random.randrange(m)  # pick one sample at random (with replacement)
            h = sigmod(np.sum(np.dot(dataMat[randIndex], weights)))
            error = labelMat[randIndex] - h
            weights = weights + alpha * error * dataMat[randIndex]  # update using this single sample
            print('times:%d h= %s error=%s weights=%s' % (m * j + i, h, error, weights))
    return weights
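
The step-size formula in `stocGradAscent` is worth a closer look. The sketch below evaluates it at a few points; `step_size` is a hypothetical helper that copies the formula, and the 100-sample pass is an assumption for illustration.

```python
def step_size(j, i):
    """Step size from stocGradAscent for pass j and update i within the pass."""
    return 4 / (1.0 + j + i) + 0.01


first = step_size(0, 0)         # first update of the first pass
end_of_pass = step_size(0, 99)  # last update of a 100-sample pass
next_pass = step_size(1, 0)     # first update of the second pass

print(first, end_of_pass, next_pass)
```

Within a pass the step size decays, and the constant 0.01 keeps it from reaching zero. Because the formula depends on `j + i` rather than on the total update count, the step size jumps back up at the start of each new pass, although each pass starts lower than the one before.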

Test

if __name__ == '__main__':
    file = 'testSet.txt'
    data = np.loadtxt(file, dtype=float, delimiter='\t', encoding='utf-8')
    dataMat, labelMat = np.split(data, (2,), axis=1)  # first two columns are features, the last is the label
    rows = dataMat.shape[0]
    dataMat = np.insert(dataMat, 0, values=np.ones((1, rows)), axis=1)  # prepend a column of ones for the intercept
    weights = stocGradAscent(dataMat, labelMat)
    print('Coefficients:', weights)
    plotBestFit(weights, dataMat, labelMat)