Numpy实现Logistic回归（含完整代码）

最新推荐文章于 2023-12-01 15:31:55 发布

五年打鱼三年晒网

最新推荐文章于 2023-12-01 15:31:55 发布

阅读量1.9k

点赞数 1

分类专栏：算法学习之路文章标签： numpy 回归

本文链接：https://blog.csdn.net/u013310820/article/details/128193017

版权

算法学习之路专栏收录该内容

6 篇文章 0 订阅

订阅专栏

Numpy实现对数几率回归

问题描述

对数几率回归通过一个单调可微函数（Sigmoid函数）把分类任务的真实标签与线性回归模型的预测值联系起来，进行回归学习，完成分类任务。其中，Sigmoid函数如下：
$y=\frac{1}{1+e^{-\left( \omega ^Tx+b \right)}}$
$y$ 是当前样本被分为正例的概率， $x$ 是当前样本的特征向量， $\omega$ 和 $b$ 是模型要确定的参数估计， $\omega$ 可以看成是样本特征的权重， $b$ 是偏置项。通过极大似然估计法估计相应参数。问题等价于求解以下损失函数的最小值。
$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(-y_{i} \boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}+\ln \left(1+e^{\boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}}\right)\right)$
本文通过牛顿法求解上述损失函数的最小值。迭代公式如下：
$\boldsymbol{\beta}^{t+1}=\boldsymbol{\beta}^{t}-\alpha\left(\frac{\partial^{2} \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^{\mathrm{T}}}\right)^{-1} \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}$
其中， $\alpha$ 为学习率，默认设置为0.1。通过牛顿法，最小化损失函数，进而获取 $\omega$ 和 $b$ 的参数估计，以此预测测试样本的分类。
代码实现中，逻辑回归的相关代码封装为MyLogisticRegression类。问题中有三次数据集划分、三次训练和绘图，为了避免冗余重复的代码，把划分训练集并训练、损失函数可视化、ROC曲线可视化、分类效果二维可视化打包为函数，
MyLogisticRegression类中，计算后验概率估计，当 $\boldsymbol{w}^{\text{T}}\boldsymbol{x}+b>0$ 时，np.exp()会发生数据溢出，此时将概率估计公式改写为：
$p\left( y=1\mid \boldsymbol{x} \right) =\frac{1}{1+e^{-\left( \boldsymbol{w}^{\text{T}}\boldsymbol{x}+b \right)}}$
$p\left( y=0\mid \boldsymbol{x} \right) =\frac{e^{-\left( \boldsymbol{w}^{\text{T}}\boldsymbol{x}+b \right)}}{1+e^{-\left( \boldsymbol{w}^{\text{T}}\boldsymbol{x}+b \right)}}$
同理，为了防止计算损失函数时发生数据溢出，当 $\boldsymbol{\beta}^{\text{T}}\boldsymbol{\hat{x}}>0$ 时，将损失函数改写为：
$\ell \left( \boldsymbol{\beta } \right) =\sum_{i=1}^m{\left( -y_i\boldsymbol{\beta }^{\text{T}}\boldsymbol{\hat{x}}_i-\ln \left( \frac{e^{-\left( \boldsymbol{\beta }^{\text{T}}\boldsymbol{\hat{x}} \right)}}{1+e^{-\left( \boldsymbol{\beta }^{\text{T}}\boldsymbol{\hat{x}} \right)}} \right) \right)}$

数据集描述

鸢尾花数据集（Iris）由150朵花的4种属性及其分类标签构成。其属性信息与分类类别构成如下：

萼片长度	萼片宽度	花瓣长度	花瓣宽度	所属分类
cm	cm	cm	cm	class

这150朵花分属三种类型（Setosa、Versicolour、Virginica），其中一种类型与另外两种类型是线性可分的。在训练中，选取前100朵花，即前两种花进行分类模型的训练与测试。
PCA降维可视化分类

通过对选出的两种花降维并可视化，我们可以看出两种花分为明显的两簇，簇与簇间应该较容易找到一条划分的直线。

实验结果图

50%作为训练集

选取数据集的50%作为训练集，50%作为测试集，迭代500次，得到如下损失函数变化曲线。
50%作为训练集的损失函数变化曲线

当迭代到第50次时，模型在测试集的预测准确率为50%，有以下ROC曲线：
迭代50次的ROC曲线与AUC

当迭代到500次时，模型在训练集的预测准确率为100%，ROC曲线与x=0和y=1两条直线重合，即完全包住整个图，此处不再展示。

70%作为训练集

选取数据集的70%作为训练集，30%作为测试集，迭代500次，得到如下损失函数变化曲线。
70%作为训练集的损失函数变化曲线

90%作为训练集

选取数据集的90%作为训练集，10%作为测试集，迭代500次，得到如下损失函数变化曲线。
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3Vd2awdD-1670244391474) 90%作为训练集的损失函数变化曲线

实验结果分析

分析上述结果图可知，epochs取50时的ROC曲线紧贴对角线，此时损失函数仍然呈现急剧下降趋势，此时模型在测试集的预测准确率为50%，训练效果并不好。
当迭代到150次左右时，3种划分的损失函数下降均趋于平缓，在各自测试集的预测正确率均为100%，有优秀的训练效果。相比50%作训练集的数据划分，其它两种划分的初始损失函数更大，但它们都在迭代150次左右后损失函数下降趋于平缓，在测试集上预测准确率达到100%。
在这里插入图片描述

从上图可以看出，以70%作为训练集，30%作为测试集后，测试集分类预测能较好的区分出两个簇。

完整代码

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns


class MyLogisticRegression:
    __method = "newton"  # 使用什么优化方法
    __epochs = 500  # 优化迭代次数
    __learning_rate = 0.1  # 学习率，保证能够忽略泰勒公式的高阶无穷小项
    __beta = 1  # beta矩阵（w; b）
    __if_fitted = 0  # 标记是否完成拟合，0为未拟合
    __random_state = 0  # 随机数种子
    loss = []  # 记录训练中的损失函数变化
    weight = 0  # 训练得到的权重向量
    bias = 0  # 训练得到的偏置
    score = 0  # 模型在测试集上的得分

    def __init__(self, method="newton", epochs=500, random_seed=2, learning_rate=0.1):
        """
        构造函数，若未给定参数，则把参数初始化为默认值

        :param method: 默认使用牛顿优化法
        :param epochs: 默认迭代500次
        :param random_seed: 默认使用随机种子2
        """
        self.__method = method
        self.__epochs = epochs
        self.__if_fitted = 0
        self.__random_state = random_seed
        self.__learning_rate = learning_rate

    def fit(self, x, y):
        """
        拟合模型

        :param x: 输入特征，横行为特征，纵列为不同样本
        :param y: 数据标签，一维数组
        """
        if self.__method == "newton":
            self.__beta = self._newton_method(x=x, y=y, epochs=self.__epochs, learning_rate=self.__learning_rate)
            self.weight = self.__beta[:-1]
            self.bias = self.__beta[-1]
        self.__if_fitted = 1

    def predict(self, x, y_score=False):
        """
        预测输入的x，并返回结果

        :param y_score: 返回样本是正例的概率，为列表
        :param x: 输入特征，横行为特征，纵列为不同样本
        :return: 返回预测结果(一维数组)
        """
        if self.__if_fitted == 0:
            print("The model hasn't been fitted!")
            return
        x_hat = np.hstack((x, np.ones((x.shape[0], 1))))
        p1 = []
        y_scores = []
        for i in range(x_hat.shape[0]):
            x_hat_i = x_hat[i]
            x_hat_i = x_hat_i.reshape(x_hat_i.shape[0], 1)
            tmp = self._estimate(self, y=1, x=np.dot(self.__beta.reshape(1, self.__beta.shape[0]), x_hat_i))
            res = 0
            y_scores.append(tmp[0][0])
            if tmp[0][0] > 0.5:
                res = 1
            p1.append(int(res))
        if y_score:
            return y_scores
        else:
            return np.array(p1)

    @staticmethod
    def _estimate(self, y, x):
        """
        计算后验概率估计

        因为sigmoid函数自变量为负时会发生数据溢出，所以此处优化了sigmoid函数

        :param y: y为1表示样本x作为正例的可能性，y为0表示样本x作为负例的可能性
        :return: 返回后验概率，为一个数值
        """
        if y == 0:
            # 对sigmoid函数进行优化，避免了出现数据溢出
            if x >= 0:
                return np.exp(-x) / (1 + np.exp(-x))
            else:
                return 1.0 / (1 + np.exp(x))
        elif y == 1:
            if x >= 0:
                return 1.0 / (1 + np.exp(-x))
            else:
                return np.exp(x) / (1 + np.exp(x))

    @staticmethod
    def _loss(self, x_hat, beta, y):
        """
        计算损失函数

        :param x_hat: (x; 1)
        :param beta: (w, b)
        :param y: 样本标签
        :return: 计算得到的损失函数，为一个数值
        """
        m = y.shape[0]
        loss = 0
        for i in range(m):
            tmp = np.dot(beta.reshape(1, beta.shape[0]), x_hat[i].reshape(x_hat.shape[1], 1))
            # x<0时增长率非常大。所以在使用sigma函数时，当x<0函数值增长速度极大，产生了上溢
            if tmp >= 0:
                # exp(x)可能发生数据溢出，故大于0时使用此式
                # 取对数后结果趋于负无穷大，故增加精度1e-6
                loss += (-y[i] * tmp - np.log(np.exp(-tmp) / (1 + np.exp(-tmp)) + 1e-6))
            else:
                loss += (-y[i] * tmp + np.log(1 + np.exp(tmp)))
        return loss

    def _newton_first_derivative(self, y, x_hat, beta):
        """
        牛顿法关于beta的一阶导数

        :param x_hat: (x; 1)
        :param beta: (w, b)
        :param y: 样本标签
        :return: 计算得到的一阶导数，为一个列向量
        """
        m = x_hat.shape[0]
        fd = 0
        for i in range(m):
            x_hat_i = x_hat[i]
            x_hat_i = x_hat_i.reshape(x_hat.shape[1], 1)
            p1 = self._estimate(self, y=1, x=np.dot(beta.reshape(1, beta.shape[0]), x_hat_i))
            fd += x_hat_i * (y[i] - p1)
        return -fd

    def _newton_second_derivative(self, x_hat, beta):
        """
        牛顿法关于beta的二阶导数

        :param x_hat: (x; 1)
        :param beta: (w, b)
        :param y: 样本标签
        :return: 计算得到的二阶导数，为一个黑塞矩阵
        """
        m = x_hat.shape[0]
        sd = 0
        for i in range(m):
            x_hat_i = x_hat[i]
            x_hat_i = x_hat_i.reshape(x_hat_i.shape[0], 1)
            x_hat_i_t = x_hat[i]
            p1 = self._estimate(self, y=1, x=np.dot(beta.reshape(1, beta.shape[0]), x_hat_i))
            x_hat_i_t = x_hat_i_t.reshape(1, x_hat_i_t.shape[0])
            # p1_dia1 = np.eye(x_hat.shape[1]) * p1  # 以p1 * (1 - p1)构成对角，其余为0的矩阵
            # p1_dia2 = np.eye(x_hat.shape[1]) * (1 - p1)
            sd += np.dot(x_hat_i, x_hat_i_t)
            # sd += np.dot(np.dot(np.dot(x_hat_i, x_hat_i_t), p1_dia1), p1_dia1)
            # sd += np.dot(np.dot(np.dot(p1_dia1, p1_dia2), x_hat_i), x_hat_i_t)
        return sd

    def _newton_method(self, x, y, epochs, learning_rate):
        """
        牛顿优化算法，返回beta

        :param x: 输入特征
        :param y: 数据标签
        :param epochs: 迭代次数
        :return: 返回beta矩阵（w; b）
        """
        np.random.seed(self.__random_state)
        w = np.random.normal(size=(x.shape[1], 1))
        """
        逻辑回归的权重可以初始化为0的原因： Logistic回归没有隐藏层。 
        如果将权重初始化为零，则Logistic回归中的第一个示例x将输出零，但Logistic回归的导数取决于不是零的输入x（因为没有隐藏层）。 
        因此，在第二次迭代（迭代发生在w和b值的更新中，即梯度下降）中，如果x不是常量向量，则权值遵循x的分布并且彼此不同 
        """
        b = np.array([1])
        beta = np.vstack((w, b))
        x_hat = np.hstack((x, np.ones((x.shape[0], 1))))
        self.loss.clear()
        for i in range(epochs):
            nsd = self._newton_second_derivative(x_hat, beta)
            nfd = self._newton_first_derivative(y, x_hat, beta)
            beta -= learning_rate * np.dot(np.linalg.pinv(nsd), nfd)  # 二阶导数矩阵为奇异矩阵不可逆，这里求伪逆
            loss = self._loss(self, x_hat, beta, y)
            self.loss.append(loss[0][0])
        return beta

    def score(self, x, y):
        """
        检验模型在测试集上的分类正确率

        :param x: 测试集的特征矩阵
        :param y: 测试集的真实标签
        :return: 预测正确的数量与测试集数量的比值
        """
        y_pred = self.predict(x)
        index = (y_pred == y)
        self.score = np.count_nonzero(index) / len(index)
        return self.score


def draw_loss(loss: list):
    """
    绘制损失函数下降曲线

    :param loss: 存储损失函数变化的列表
    """
    s = pd.Series(loss, name="Loss")
    sns.set_theme(style='darkgrid')
    sns.lineplot(data=s)
    plt.xlabel("Epochs")
    plt.title("Loss Function Change Curve")
    plt.figure(dpi=500)
    plt.show()


def draw_ROC(y_true, y_score):
    """
    fpr：横坐标值
    tpr：纵坐标值
    threshold：阈值
    test：测试集的结果
    score：测试集预测的结果
    pos_label=1：测试集值为1的是正类，其余为负类

    :param y_true: 数据集真实的标签
    :param y_score: 通过模型预测是正例的概率
    """
    # ROC坐标点计算
    fpr, tpr, threshold = metrics.roc_curve(y_true, y_score, pos_label=1)

    # AUC值计算
    auc = metrics.auc(fpr, tpr)

    # ROC曲线绘制
    plt.figure()
    plt.title('ROC CURVE (auc={:.2f})'.format(auc))
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.plot(fpr, tpr, color='g')
    plt.plot([0, 1], [0, 1], color='m', linestyle='--')
    plt.show()


def visualization(X, y):
    """
    数据可视化

    :param X: 数据点的特征矩阵
    :param y: 数据点的标签
    """
    pca = PCA(n_components=2)
    x = pca.fit_transform(X)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap='coolwarm')
    # plt.savefig("分类可视化.png", dpi=500)
    plt.show()


def split_and_train(X, y, train_size=0.7, epochs=500):
    # 划分训练集和测试集
    x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
    # 初始化模型并开始训练
    clf = MyLogisticRegression(epochs=epochs)
    clf.fit(x_train, y_train)
    draw_loss(clf.loss)
    draw_ROC(y_test, clf.predict(x_test, y_score=True))
    print(clf.score(x_test, y_test))
    visualization(x_test, y_test)
    return clf


if __name__ == '__main__':
    X, y = load_iris(return_X_y=True)

    # 取前两种花进行分类任务
    X = X[:100, :]
    y = y[:100]

    # 三种训练集和测试集的划分方式训练模型
    clf1 = split_and_train(X, y, train_size=0.5, epochs=50)
    clf2 = split_and_train(X, y, train_size=0.7)
    clf3 = split_and_train(X, y, train_size=0.9)
    visualization(X, clf3.predict(X))