线性回归(LInear Regression)算法介绍及numpy实现

本文链接：https://blog.csdn.net/weixin_44142858/article/details/121468047

本文介绍了线性回归模型的原理和代价函数，详细阐述了梯度下降法和正规方程法求解参数的公式推导，并通过Python代码实现了两种方法。对比了两者在处理大规模特征时的优劣，最后总结了线性回归模型的学习过程。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

文章目录

1. 概述
2. 梯度下降法求解
3. 正规方程法求解
- 公式推导
- 代码实现
4. 梯度下降法VS正规方程法
5. 总结

1. 概述

线性回归：给定数据集 $D\left \{ (x_{i} ,y_{i}) , i=1, 2, \cdots , m \right \}, x_{i}\in R^{n}, y_{i}\in R$ ，基于数据集学习一个线性预测模型 $f (x)$ ，使之对于任意的 $x_{i}\in R^{n}$ ，尽可能准确预测实值输出 $\widehat{y} = f(x)$ 。

其中模型（预测函数）为线性回归模型： $\theta^{T} \cdot x,$ 其中 $\theta = \left [ \theta_{0}, \theta_{1}, \dots , \theta_{n} \right ] ^{T}$ ， $\left [ x^{(0)}, x^{(1)}, \dots , x^{(n)} \right ] ^{T}$ ， $x^{(0)}$ 始终等于1。

模型使用的代价函数为： $J(\Theta ) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_{i}) - y_{i} ) ^{2},$ 该函数为凸函数，只有一个全局最低点。

目标：求得模型的参数 $\Theta = \underset{\Theta}{argmin} J(\Theta)$

2. 梯度下降法求解

上一篇文章单独讲了梯度下降法，不了解的可以看一下：梯度下降法求解最优化问题。
我们的目标是找到使得代价函数最小的参数向量，最值问题考虑采用梯度下降法求解，该方法主要需要知道代价函数 $J(\Theta )$ 对于参数 $\Theta$ 的偏导是多少，下面我们假设只有两个参数 $\theta_{0}$ 和 $\theta_{1}$ ，来求一下各自的偏导。

公式推导

假设只有两个参数，则 $\theta_{0} + \theta_{1}x$ ， $x^{(0)} = 1$ ， $x^{(1)} = x$ .
$\begin{aligned} J(\Theta ) &= \frac{1}{2m}\sum_{i=1}^{m}(f(x_{i}) - y_{i} ) ^{2}\\ &= \frac{1}{2m}\sum_{i=1}^{m}(\theta _{0} + \theta _{1}x_{i} - y_{i} ) ^{2} \end{aligned}$
$\begin{aligned} \frac{\partial J(\Theta )}{\partial \theta _{0} } &= \frac{\partial \frac{1}{2m}\sum_{i=1}^{m}(\theta _{0} + \theta _{1}x_{i} - y_{i} ) ^{2}}{\partial \theta _{0}}\\ &= \frac{1}{2m}\sum_{i=1}^{m}2(\theta _{0} + \theta _{1}x_{i} - y_{i})\\ &= \frac{1}{m}\sum_{i=1}^{m}(f(x_{i}) - y_{i})\\ &= \frac{1}{m}\sum_{i=1}^{m}(f(x_{i}) - y_{i})x_{i}^{(0)} \end{aligned}$
$\begin{aligned} \frac{\partial J(\Theta )}{\partial \theta _{1} } &= \frac{\partial \frac{1}{2m}\sum_{i=1}^{m}(\theta _{0} + \theta _{1}x_{i} - y_{i} ) ^{2}}{\partial \theta _{1}}\\ &= \frac{1}{2m}\sum_{i=1}^{m}2(\theta _{0} + \theta _{1}x_{i} - y_{i})x_{i}\\ &= \frac{1}{m}\sum_{i=1}^{m}(f(x_{i}) - y_{i})x_{i}\\ &= \frac{1}{m}\sum_{i=1}^{m}(f(x_{i}) - y_{i})x_{i}^{(1)} \end{aligned}$

推广到n个参数的情况， $\frac{\partial J(\Theta )}{\partial \theta _{j} } = \frac{1}{m}\sum_{i=1}^{m}(f(x_{i}) - y_{i})x_{i}^{(j)}$

特征缩放（Feature Scaling）

如果各特征的特征值相差太大，要给数据进行归一化预处理，这样能够加快梯度下降的速率。
给数据做如下变化： ${x}_{i}{}' = \frac{x_{i} - \mu }{s},$ 其中 $\mu$ 为均值向量， $\left\{\begin{matrix} 各个特征的max-min构成的向量\\ 标准差向量 \end{matrix}\right.$ .
注： $x_{0}$ 不需要进行缩放， $x_{0}=1$ .

代码实现

下面我将使用线性回归模型结合批量梯度下降算法拟合波士顿房价数据集。

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def load_data():
    boston = datasets.load_boston()
    dataX = boston.data
    dataY = boston.target
    print("dataX的shape:", dataX.shape)
    return dataX, dataY

def normalization(dataX):
    '''z-score归一化
    Parameters:
        dataX: 输入特征向量
    Returns:
        dataX1: 归一化之后的dataX
        mu: 均值向量
        sigma: 标准差向量
    '''
    mu = dataX.mean(0)
    sigma = dataX.std(0)
    dataX1 = (dataX - mu) / sigma
    return dataX1, mu, sigma

class linear_regression():
    def __init__(self, dataX, dataY, alpha=0.1):
        ones = np.ones((dataX.shape[0], 1))
        self.dataX = np.c_[ones, dataX]
        self.dataY = dataY
        self.datasize = self.dataX.shape[0]
        self.alpha = alpha # 学习率
        self.theta = np.zeros(self.dataX.shape[1]) # 参数初始化
        
    def fit(self):
        iterations = 100 # 设置停止迭代规则，迭代100次就停止更新
        for i in range(iterations):
            y = np.dot(self.dataX, self.theta).reshape(self.datasize,) # 数据集矩阵与参数向量相乘，得到一个向量，为各个样本的预测值
#             print(y)
            loss = 1 / self.datasize * sum((y - self.dataY) ** 2) # 计算平方差损失
            print("第", i, "次迭代，损失为：", loss)
            # 更新theta值
            updated_theta = []
            for j, theta_j in enumerate(self.theta):
                x_j = self.dataX[:, j]
                updated_theta_j = theta_j - self.alpha * (1 / self.datasize) * np.sum((y - self.dataY) * x_j)
                updated_theta.append(updated_theta_j)
            self.theta = np.array(updated_theta)
        print("训练完成！")
        
    def predict(self, testX):
        ones = np.ones((testX.shape[0], 1))
        testX = np.c_[ones, testX]
        y_hat = np.dot(testX, self.theta)
        return y_hat

if __name__ == "__main__":
    # 1.获取数据集
    dataX, dataY = load_data()
    # 2.划分数据集
    trainX, testX, trainY, testY = train_test_split(dataX, dataY, test_size=0.3, random_state=0)
    # 3.标准化预处理
    newTrainX, mu, sigma = normalization(trainX)
    print('mu =', mu, "sigma =", sigma)
    newTestX = (testX - mu) / sigma
    # 4.构建一个线性回归模型
    model = linear_regression(newTrainX, trainY)
    # 5.训练模型
    model.fit()
    # 6.测试模型
    y_hat = model.predict(newTestX)
    # 7.输出预测值与真实值的差距
    print("预测值：", y_hat)
    print("真实值：", testY)
    print("差距：", 1/testX.shape[0] * np.sum((y_hat - testY) ** 2))

代码执行结果：

3. 正规方程法求解

公式推导

令 $\begin{bmatrix} (x_{1})^{T} \\ (x_{2})^{T} \\ \vdots \\ (x_{m})^{T} \end{bmatrix}$ ， $\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}$ ，其中 $x_{i} = \begin{bmatrix} x_{i}^{(0)} \\ x_{i}^{(1)} \\ \vdots \\ x_{i}^{(n)} \end{bmatrix}\in \Re ^{n+1}$
首先给出结果 $\Theta = (X^{T} X)^{-1}X^{T} y,$ 下面证明这个结果是正确的：
$\begin{aligned} &\because X\Theta = y, \Theta = (X^{T} X)^{-1}X^{T} y\\ &\therefore X (X^{T} X)^{-1}X^{T} y\\ &= XX^{-1}(X^{T})^{-1}X^{T}y \\ &= E\cdot E\cdot y\\ &=y \end{aligned}$ 由于式中出现了 $X^{T} X)^{-1}$ ，所以要求 $X^{T} X$ 必须是可逆的，如果不可逆，则不能使用正规方程法求解。
另外，正规方程法不需要进行特征缩放，直接计算即可。

代码实现

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def load_data():
    boston = datasets.load_boston()
    dataX = boston.data
    dataY = boston.target
    return dataX, dataY

def get_theta(dataX, dataY):
    '''使用正规方程法求解参数
    parameters:
        dataX: shape(m, n)
        dataY: shape(m,)
    Return:
        计算得到的参数向量
        
    用到的numpy中的几个函数：
    1. np.dot(A, B): 矩阵乘法，A为m*n阶, B为n*p阶，运算结果为m*p阶
    2. np.linalg.inv(A): 返回A的逆矩阵
    3. A.T(): 返回A的转置矩阵
    '''
    XTX = np.dot(dataX.T, dataX)
    XTXI = np.linalg.inv(XTX)
    return np.dot(np.dot(XTXI, dataX.T), dataY)

if __name__ == '__main__':
    # 1.加载数据集
    dataX, dataY = load_data()
    # 2.划分数据集
    trainX, testX, trainY, testY = train_test_split(dataX, dataY, test_size=0.3, random_state=0)
    # 3.给特征向量补1
    ones = np.ones((trainX.shape[0], 1))
    trainX = np.c_[ones, trainX]
    # 3.求解参数向量
    theta = get_theta(trainX, trainY)
    # 4.根据求得的参数预测测试集输出
    ones1 = np.ones((testX.shape[0], 1))
    testX = np.c_[ones1, testX]
    n = testX.shape[0]
    y_hat = np.dot(testX, theta)
    loss = 1 / n * np.sum((y_hat - testY) ** 2)
    print('预测值：', y_hat)
    print('真实值：', testY)
    print('差距：', loss)

代码执行结果：

4. 梯度下降法VS正规方程法

梯度下降法	正规方程法
需要选择学习率 $\alpha$	不需要选择学习率 $\alpha$
需要很多次迭代	不需要迭代
适用于特征数目n较大的情况	需要计算 $X^{T}X)^{-1}$ ，计算逆矩阵的代价为O( $n^{3}$ )
$n = 10^{6}$ 考虑使用梯度下降法	当 $n$ 很大时运行很慢，适于 $n = 100, 1000, 10000$