Andrew Ng Machine Learning [8]: Regularization in linear regression and logistic regression, with a Python implementation

踏归1234

Last modified: 2022-07-27 11:17:31


First published: 2022-07-22 23:57:02

Article link: https://blog.csdn.net/qq_44391572/article/details/125926187


Regularization: application to linear regression and logistic regression, with a Python implementation

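The overfitting problem
- Identifying the problem
- Addressing overfitting
Cost function
Regularized linear regression
Regularized logistic regression
Code implementation and visualization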

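The overfitting problem

Identifying the problem

Underfitting, or high bias: the model fits the training data poorly.
Just right: the model fits the data well.
Overfitting, or high variance (the hypothesis is too complex and has too many features): the model generalizes poorly to new data.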
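
The three cases can be illustrated with a small experiment. The sketch below is not from the original post: it assumes synthetic data drawn from a noisy quadratic and uses `np.polyfit` to fit polynomials of degree 1, 2 and 9. The data, the degrees and the helper name `fit_and_score` are all illustrative assumptions.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 10)
y_train = 1.0 + 2.0 * x_train ** 2 + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 1.0 + 2.0 * x_test ** 2  # noise-free targets, used to judge generalization


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


for degree in (1, 2, 9):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree}: train MSE={train_mse:.6f}, test MSE={test_mse:.6f}")
```

The straight line (degree 1) has a large error on both sets, which is the high-bias case. The degree-9 polynomial passes through all ten training points, so its training error is essentially zero, yet that says nothing about how well it behaves between the points; this is the high-variance case.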

Addressing overfitting


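Reduce the number of features.
Regularization. Keep all the features, but reduce the magnitude of the parameters attached to them in the hypothesis, so that each feature contributes only a little to the prediction.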

Cost function

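Goal: show how regularization is applied and write down the corresponding cost function.
For example, adding penalty terms on $\theta_3$ and $\theta_4$ to the cost function drives them toward 0, so the coefficients of $x^3$ and $x^4$ in the final hypothesis are very small and the hypothesis is approximately a quadratic function.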
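
As a sketch of this idea, assume a quartic hypothesis $h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4$ and an arbitrarily chosen large constant of 1000 (both are illustrative assumptions, not values taken from this post). The penalized objective could then be written as:

$$\min_\theta\ \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+1000\,\theta_3^2+1000\,\theta_4^2$$

Any sizeable value of $\theta_3$ or $\theta_4$ is heavily penalized, so the minimizer keeps both close to 0.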
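Smaller parameter values correspond to a simpler hypothesis. In general this makes the fitted function smoother and simpler, and less prone to overfitting.
In practice we do not know in advance which features correspond to the high-order terms, so the cost function is modified to shrink all of the parameters, which gives the regularized cost function $J(\theta)$. Note that no penalty term is added for $\theta_0$; this is a convention, and in practice penalizing it or not makes little difference to the result.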
In the regularized cost function $J(\theta)$ (also called the regularized optimization objective), the rightmost sum is the regularization term and $\lambda$ is the regularization parameter. $\lambda$ controls the trade-off between two goals. Goal one: fit the training set well. Goal two: keep the parameters as small as possible.
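
For reference, and assuming the course's usual notation ($m$ training examples, $n$ features, hypothesis $h_\theta$), the regularized cost function for linear regression described above is commonly written as:

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$$

The first sum measures the fit to the training set and the second sum is the regularization term. Its index starts at $j=1$, so $\theta_0$ is not penalized.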

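The larger $\lambda$ is, the closer each $\theta_j$ ($j \ge 1$) gets to 0, which amounts to ignoring every feature in the hypothesis. The fit then degenerates into a nearly horizontal line, $h_\theta(x) \approx \theta_0$, and the model underfits. The value of the regularization parameter $\lambda$ therefore has to be chosen with care.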
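
The effect of $\lambda$ can be checked numerically. The sketch below assumes a five-point toy dataset and solves the regularized least-squares problem in closed form (the normal equation discussed in the linear-regression section) for several values of $\lambda$. The data and the helper name `fit_regularized` are illustrative assumptions.

```python
import numpy as np

# Toy data (assumed for illustration): y is roughly 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])  # first column is the intercept feature x_0 = 1


def fit_regularized(X, y, lam):
    """Closed-form regularized least squares; theta_0 is not penalized."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


for lam in (0.0, 1.0, 100.0, 1e6):
    theta0, theta1 = fit_regularized(X, y, lam)
    print(f"lambda={lam:>9}: theta_0={theta0:.3f}, theta_1={theta1:.5f}")
```

As $\lambda$ grows, the slope $\theta_1$ shrinks toward 0 and $\theta_0$ approaches the mean of $y$, which is the nearly horizontal line described above.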

Regularized linear regression

Goal: apply regularization to linear regression.
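During gradient descent for linear regression, each parameter $\theta_j$ ($j = 1, \dots, n$) is updated as follows.
$\theta_j := \theta_j(1-\alpha\frac \lambda m)-\alpha \frac 1m \sum_{i=1}^m(h_\theta (x^{(i)})-y^{(i)})x_j^{(i)}$
Here $1-\alpha\frac \lambda m$ is usually a number slightly smaller than 1.
Intuitively, then, regularization shrinks each parameter a little on every iteration. Mathematically, this is still just gradient descent on the cost function $J(\theta)$.
Note that the update for $\theta_0$ does not include the $\frac \lambda m \theta_0$ term.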
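
The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code: the function name `gradient_step`, the toy data, and the choices `alpha=0.1` and 5000 iterations are assumptions.

```python
import numpy as np


def gradient_step(theta, X, y, alpha, lam):
    """One batch gradient-descent update for regularized linear regression.

    X has shape (m, n+1) with a leading column of ones; theta_0 is not shrunk.
    """
    m = X.shape[0]
    error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
    grad = X.T @ error / m             # gradient of the unregularized cost
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                    # no penalty on theta_0
    return theta * shrink - alpha * grad


# Toy data (assumed): y = 1 + 2x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = np.zeros(2)
theta_reg = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=0.0)          # no regularization
    theta_reg = gradient_step(theta_reg, X, y, alpha=0.1, lam=1.0)  # with regularization
print("lambda=0:", theta)
print("lambda=1:", theta_reg)
```

With `lam=0.0` the iteration recovers the generating line, while with `lam=1.0` the slope settles at a smaller value, which is the shrinking effect described above.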
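Gradient descent is only one way to fit a linear regression model. Another is the normal equation.
Construct an $m \times (n+1)$ matrix $X$ in which each row is a single training example, and an $m$-dimensional vector $y$ that holds all the labels of the training set.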
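The parameters $\theta$ at the global minimum are then:
$\theta = \left(X^TX+\lambda\begin{bmatrix} 0&0&{ \cdots }&0\\ 0&1&{ \cdots }&0\\ { \vdots }&{ \vdots }&{ \ddots }&{ \vdots }\\ 0&0&{ \cdots }&1 \end{bmatrix}\right)^{-1}X^Ty$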
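For ordinary (unregularized) linear regression, when $m \le n$ the matrix $X^TX$ is not invertible and the normal equation cannot be used directly.
With regularization and $\lambda > 0$, however, $X^TX+\lambda\begin{bmatrix} 0&0&{ \cdots }&0\\ 0&1&{ \cdots }&0\\ { \vdots }&{ \vdots }&{ \ddots }&{ \vdots }\\ 0&0&{ \cdots }&1 \end{bmatrix}$ is guaranteed to be non-singular, that is, always invertible.
Regularization therefore also solves the non-invertibility problem of the normal equation.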
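
The invertibility claim can be checked on a small example. The sketch below assumes random toy data with fewer examples than features ($m=3$, $n=5$) and $\lambda=1$; all of these values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                                   # fewer examples than features (assumed toy sizes)
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # shape (m, n+1)
y = rng.normal(size=m)

gram = X.T @ X                                # (n+1) x (n+1), but its rank is at most m
print("rank of X^T X:", np.linalg.matrix_rank(gram), "out of", n + 1)

lam = 1.0
M = np.eye(n + 1)
M[0, 0] = 0.0                                 # theta_0 is not penalized
A = gram + lam * M
print("rank of X^T X + lambda*M:", np.linalg.matrix_rank(A))

theta = np.linalg.solve(A, X.T @ y)           # regularized normal equation
print(theta)
```

Calling `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; it returns the same $\theta$ as the formula above.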

Regularized logistic regression

Goal: understand how regularization is applied to logistic regression.
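As with linear regression, a $\frac \lambda m \theta_j$ term is added to the gradient-descent update of each $\theta_j$ ($j \ge 1$); the only difference is that $h_\theta$ is now the sigmoid hypothesis.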
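
A minimal sketch of the regularized logistic-regression cost and gradient follows, together with a finite-difference check that the two agree. The function names, the toy data and $\lambda=1$ are illustrative assumptions. Unlike the full implementation later in this post, the sketch includes an intercept column and leaves $\theta_0$ unpenalized.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost_and_grad(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient (theta_0 not penalized)."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    eps = 1e-12                                    # guards against log(0)
    cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    cost += lam / (2 * m) * np.sum(theta[1:] ** 2)  # regularization term
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]                # extra (lambda/m) * theta_j for j >= 1
    return cost, grad


# Toy data (assumed): 6 examples, an intercept column plus 2 features.
X = np.array([[1, 0.5, 1.2], [1, -1.0, 0.3], [1, 1.5, -0.7],
              [1, -0.2, -1.1], [1, 2.0, 0.8], [1, -1.5, -0.4]])
y = np.array([1, 0, 1, 0, 1, 0], dtype=float)
theta = np.array([0.1, -0.2, 0.3])

cost, grad = cost_and_grad(theta, X, y, lam=1.0)

# Central-difference check of the analytic gradient.
h_step = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = h_step
    numeric[j] = (cost_and_grad(theta + step, X, y, 1.0)[0]
                  - cost_and_grad(theta - step, X, y, 1.0)[0]) / (2 * h_step)
print("analytic:", grad)
print("numeric: ", numeric)
```

If the two printed vectors agree to several decimal places, the gradient is consistent with the cost, including the regularization term.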

Code implementation and visualization

Python implementation of regularized logistic regression

"logistic regression"
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler#standardization
from sklearn.metrics import confusion_matrix,roc_curve,auc,classification_report #classification metrics

class logisticRegressionGradientDescent:
    """
    Logistic regression trained with batch gradient descent and a cross-entropy loss
    """
    def __init__(self,dataset,attribute_list,alpha,mylambda):
        """
        Initialize the class
        :param dataset: dataset
        :param attribute_list: column names (the features, followed by the target)
        :param alpha: learning rate
        :param mylambda: regularization parameter
        """
        self.alpha=alpha
        self.attr_list =attribute_list[:-1]#feature names
        self.target_lable=attribute_list[-1]#target column name (the last column)
        #standardize the features
        self.X= StandardScaler().fit_transform(dataset.iloc[:,:-1])
        #encode the target values
        self.y,self.class_lables = self.target_encode(dataset.iloc[:,-1])
        #split the data with stratified sampling on y; random_state fixes the seed so the split is reproducible
        self.x_train,self.x_test,self.y_train,self.y_test=\
            train_test_split(self.X,self.y,train_size=0.8,random_state=1,stratify=self.y)
        self.n,self.k=self.x_train.shape #number of training samples, number of features
        self.cross_entropy_cost = []#mean cross-entropy loss at each iteration
        self.bdg_weight= dict()#history of each weight across iterations
        self.mylambda=mylambda#regularization parameter

    @staticmethod
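    def sigmoid(y_preval):
        '''
        Activation function
        :param y_preval: array of sample values multiplied by the weight coefficients
        :return: element-wise sigmoid of the input
        '''
        return 1/(1+np.exp(-y_preval))
    @staticmethod
    def target_encode(target):
        """
        Static method (no self, marked with @staticmethod)
        Encode the two classes as 0 and 1
        :param target: column of class labels
        :return: encoded labels and the distinct class values
        """
        class_lables=target.unique()# distinct class values
        if len(class_lables)>2:
            raise ValueError("This logistic regression only supports binary classification; use a multiclass algorithm instead")
        if(class_lables.max()==1 and class_lables.min()==0):
            return target.tolist(),class_lables
        else:
            #encode with a list comprehension
            target_y = [0 if y == class_lables[0] else 1 for y in target]
            return target_y,class_lables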

    def logistic_regression_model_train(self,max_lop,threshold):
        '''
        Train the logistic regression with batch gradient descent and a cross-entropy loss
        :param max_lop: maximum number of training iterations
        :param threshold: convergence threshold for stopping early
        :return: final weight vector
        '''
        np.random.seed(101)#fix the random seed so that every run gives the same result
        weight =np.random.random(self.k)/100 #small random initial weights, one per feature
        weight_old =weight.copy()

        for j in range(self.k):
            self.bdg_weight[str(j)]=[]
        for loop in range(max_lop):
            self.alpha*=0.95#exponentially decay the learning rate
            y_hat = self.sigmoid(self.x_train.dot(weight.T))#predicted probability of class 1, in (0,1)
            dw= ((y_hat-self.y_train)*self.x_train.T).mean(axis=1)#gradient averaged over the samples; same as (self.x_train.T*(y_hat-self.y_train)).mean(axis=1)
            weight=(1-self.alpha*self.mylambda/len(y_hat))*weight-self.alpha*dw #regularized weight update
            #weight=weight-self.alpha*dw #weight update without regularization
            for j in range(self.k):
                self.bdg_weight[str(j)].append(weight[j])
            #mean cross-entropy loss plus the regularization term; 1e-10 avoids taking log(0)
            ce_loss =-(np.array(self.y_train)*np.log(y_hat+1e-10)+
                       (1-np.array(self.y_train))*np.log(1-y_hat+1e-10)).mean()+1/2*self.mylambda*np.power(weight, 2).sum()/len(y_hat)
            # ce_loss =-(np.array(self.y_train)*np.log(y_hat+1e-10)+
            #            (1-np.array(self.y_train))*np.log(1-y_hat+1e-10)).mean()#without regularization
            self.cross_entropy_cost.append(ce_loss)
            #stopping condition: end training once the loss or the weights have stopped changing
            if(len(self.cross_entropy_cost)>2):
                if np.abs(self.cross_entropy_cost[-1]-self.cross_entropy_cost[-2])<threshold:
                    break
                elif (np.abs(weight-weight_old)<threshold).all():
                    break
                else:
                    weight_old=weight
        # #plot
        # plt.plot(self.cross_entropy_cost)
        # plt.show()
        return weight

    def plt_cost(self):
        """
        Plot the decline of the cross-entropy loss
        :return:
        """
        plt.plot(self.cross_entropy_cost)
        plt.xlabel("Training times")
        plt.ylabel("Cross entropy cost")
        plt.title("Decline curve of loss function in Logistic regression")
        # plt.show()

    def plt_weight(self):
        """
        Plot how each weight changes during training
        :return:
        """
        for k in range(self.k):
            plt.plot(self.bdg_weight[str(k)],label=self.attr_list[k])
        plt.legend()
        plt.xlabel("Training times")
        plt.ylabel("Weight")
        plt.title("Logistic regression weight coefficient update curve")

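    def predict(self,weight):
        """
        Predict the class of each test sample, encoding the class from the predicted probability
        :param weight: final trained weights
        :return: predicted labels, confusion matrix, accuracy, predicted scores
        """
        y_pred =[]#predicted classes
        y_score =self.sigmoid(self.x_test.dot(weight.T))
        threshold =0.5 # for imbalanced classes the threshold needs more thought (still to be addressed)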
        for y in y_score:
            if y<threshold:
                y_pred.append(0)
            elif y>= threshold:
                y_pred.append(1)
        cm= confusion_matrix(self.y_test,y_pred)
        acc= np.sum(np.diag(cm))/len(y_pred) #prediction accuracy
        return y_pred,cm,acc,y_score

    def plt_confusion_matrix(self,cm,acc):
        """
        Plot the confusion matrix
        :param cm: confusion matrix
        :param acc: prediction accuracy
        :return:
        """
        cm =pd.DataFrame(cm,columns=self.class_lables,index=self.class_lables)
        sns.heatmap(cm,annot=True,cbar=False,fmt='d')#draw the heatmap
        plt.xlabel("Predict")
        plt.ylabel("True")
        plt.title("Confusion matrix and accuracy =%.2f%%" %(acc*100))#%% prints a literal %

    def plt_roc_auc(self,y_score):
        """
        Plot the ROC curve and compute the AUC
        :param y_score: predicted scores of the test samples
        :return:
        """
        false_positive_rate,true_positive_rate,_ =roc_curve(self.y_test,y_score)
        roc_auc=auc(false_positive_rate,true_positive_rate)
        plt.plot(false_positive_rate,true_positive_rate,"b",label="AUC=%.2f" % roc_auc)
        plt.legend(loc="lower right")
        plt.plot([0,1],[0,1],"r--")
        plt.xlabel("False_positive_rate")
        plt.ylabel("True_positive_rate")
        plt.title("Logistic Regression of Binary Classification ROC Curve and AUC")

if __name__=='__main__':
    url="../datasets/Mtrain_set.csv"#dataset path
    data=pd.read_csv(url).dropna().iloc[:,1:]
    attribute_list =data.columns#column names (a pandas Index, so it has no .loc attribute)
    alpha =0.8
    mylambda=1000
    #print(attribute_list)
    lrgd=logisticRegressionGradientDescent(data,attribute_list,alpha,mylambda)#with regularization
    # lrgd=logisticRegressionGradientDescent(data,attribute_list,alpha,0)#without regularization (lambda=0)
    weight=lrgd.logistic_regression_model_train(1000,1e-8)
    print("Regularized logistic regression trained with batch gradient descent; final feature coefficients:")
    for i in range(lrgd.k):
        print(" %-10s %.15f" % (lrgd.attr_list[i],weight[i]))
    y_pred,cm,acc,y_score =lrgd.predict(weight)
    #plotting
    plt.figure(figsize=(12,10))
    plt.subplot(221)#split the figure window into 2 rows and 2 columns; the current position is 1
    lrgd.plt_cost()
    plt.subplot(222)
    lrgd.plt_weight()
    plt.subplot(223)
    lrgd.plt_confusion_matrix(cm,acc)
    plt.subplot(224)
    lrgd.plt_roc_auc(y_score)
    plt.show()

    #a classification report could also be printed
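
A classification report can be produced with scikit-learn's `classification_report`, which the script above already imports. The sketch below uses hypothetical hard-coded labels so that it runs on its own; in the script, the inputs would presumably be `lrgd.y_test` and the `y_pred` returned by `lrgd.predict(weight)`.

```python
from sklearn.metrics import classification_report

# Hypothetical labels for illustration only; in the script above these would be
# lrgd.y_test and the y_pred returned by lrgd.predict(weight).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Per-class precision, recall and F1, plus overall accuracy, as a formatted string.
report = classification_report(y_true, y_pred, digits=3)
print(report)
```

The report complements the confusion matrix plot by giving precision, recall and F1 for each class.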