Linear Regression

zeroml

Category: Machine Learning. Tags: linear models

Article link: https://blog.csdn.net/qq_36606305/article/details/85495289

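The Linear Regression Algorithm

Solves regression problems
Simple in form and easy to model
Embodies several fundamental ideas of machine learning
Serves as the foundation of many powerful nonlinear models

Linear Models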

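Given an example described by $d$ attributes

$x = (x_1; x_2; x_3; \dots; x_d)$

where $x_i$ is the value of $x$ on the $i$-th attribute,

a linear model tries to learn a function that makes predictions through a linear combination of the attributes, that is,

$f(x) = a_1x_1 + a_2x_2 + \dots + a_dx_d + b$

which is usually written in vector form as

$f(x) = a^Tx + b$

where $a = (a_1; a_2; \dots; a_d)$. Once $a$ and $b$ have been learned, the model is fully determined.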
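As a small illustration of the vector form, the sketch below evaluates $f(x) = a^Tx + b$ with NumPy. The values of `a`, `b`, and `x` are made up for the example and do not come from the article.

```python
import numpy as np

# Hypothetical parameters for a model with d = 3 attributes (illustrative values only).
a = np.array([0.5, -1.0, 2.0])   # one weight per attribute
b = 0.25                          # intercept

def f(x):
    """Linear model f(x) = a^T x + b for a single example x."""
    return a.dot(x) + b

x = np.array([2.0, 1.0, 0.5])     # one example with three attribute values
prediction = f(x)                 # 0.5*2 - 1.0*1 + 2.0*0.5 + 0.25
print(prediction)
```

The dot product replaces the explicit sum $a_1x_1 + \dots + a_dx_d$, which is why the vector form is the one usually implemented.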

Dataset: $D = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_m, y_m)\}$,

$x_i = (x_{i1}; x_{i2}; x_{i3}; \dots; x_{id})$, $y_i \in \mathbb{R}$,

where $x_{ij}$ is the value of the $i$-th sample $x_i$ on the $j$-th attribute.

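Simple Linear Regression

We first consider the simplest case, in which there is only one input attribute, that is,

$D = \{(x_i, y_i)\}_{i=1}^{m}$

$x_i \in \mathbb{R}$

Linear regression tries to learn

$f(x_i) = ax_i + b$, such that $f(x_i) \simeq y_i$

where $f(x_i)$ can also be written as $\hat{y}_i$.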

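How do we determine $a$ and $b$? The key is how to measure the difference between $\hat{y}_i$ and $y_i$.

We first consider the raw difference $y_i - \hat{y}_i$. If it were 0 for every sample, the line would fit the data perfectly. That is what we would like, but it is not achievable in general, and when the differences are summed over all samples, positive and negative errors cancel each other out.

We then consider the absolute value $|y_i - \hat{y}_i|$. The absolute value function is not differentiable everywhere, so we do not use it either.

Finally, we choose the squared difference $(y_i - \hat{y}_i)^2$, summed over all samples.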
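The three candidate measures can be compared numerically. The sketch below is only an illustration: it takes the five sample points used in the article's demo script and a deliberately poor model (a flat line at the mean of $y$, chosen here as an assumption for the example) and sums the errors in each of the three ways.

```python
import numpy as np

# The five sample points from the article's demo script.
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

# A deliberately poor model: a flat line at the mean of y (illustrative choice).
y_hat = np.full_like(y, y.mean())

residuals = y - y_hat
signed_sum = residuals.sum()             # positive and negative errors cancel
absolute_sum = np.abs(residuals).sum()   # no cancellation, but |.| has a kink at 0
squared_sum = (residuals ** 2).sum()     # no cancellation and smooth everywhere

print(signed_sum, absolute_sum, squared_sum)
```

The signed sum comes out at (numerically) zero even though the flat line clearly does not fit the data, which is why the raw difference is not a usable measure on its own.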

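Goal: find $a$ and $b$ such that $E(a, b) = \sum_{i=1}^{m}(y_i - ax_i - b)^2$ is as small as possible.

Setting the partial derivatives of the loss function $E(a, b)$ with respect to $a$ and $b$ to 0 gives the optimal solution.

The least squares solution:

$a = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}$

$b = \bar{y} - a\bar{x}$

where $\bar{x} = \frac{1}{m}\sum_{i=1}^{m}x_i$ is the mean of $x$ and $\bar{y} = \frac{1}{m}\sum_{i=1}^{m}y_i$ is the mean of $y$.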
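For readers who want the intermediate steps, here is one way to arrive at the closed form, writing $E(a, b) = \sum_{i=1}^{m}(y_i - ax_i - b)^2$. This is a standard derivation sketched for convenience, not reproduced from the article.

```latex
\begin{align*}
\frac{\partial E}{\partial b} &= -2\sum_{i=1}^{m}(y_i - a x_i - b) = 0
  \;\Longrightarrow\; b = \bar{y} - a\bar{x} \\
\frac{\partial E}{\partial a} &= -2\sum_{i=1}^{m} x_i\,(y_i - a x_i - b) = 0 \\
\text{substituting } b:\quad
0 &= \sum_{i=1}^{m} x_i\big((y_i - \bar{y}) - a\,(x_i - \bar{x})\big)
  \;\Longrightarrow\;
  a = \frac{\sum_{i=1}^{m} x_i (y_i - \bar{y})}{\sum_{i=1}^{m} x_i (x_i - \bar{x})} \\
a &= \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}
  \quad\text{since } \sum_{i=1}^{m}\bar{x}\,(y_i - \bar{y}) = 0
  \text{ and } \sum_{i=1}^{m}\bar{x}\,(x_i - \bar{x}) = 0
\end{align*}
```

The last step only subtracts two sums that are zero, which turns the expression into the symmetric form used in the code.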

Code implementation: simple linear regression

import numpy as np
import matplotlib.pyplot as plt

class LinearRegression1(object):
    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        # The simple linear regressor only handles single-feature training data
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data"
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = 0.0
        d = 0.0
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean) ** 2
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only predict single feature data"
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        return self.a_ * x_single + self.b_
    
    def __repr__(self):
        return "Simple Linear Regression1()"


if __name__ == '__main__':
    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1., 3., 2., 3., 5.])
    lrg = LinearRegression1()
    lrg.fit(x, y)
    y_hat = lrg.predict(x)
    plt.scatter(x, y)
    plt.plot(x, y_hat, c='r')
    plt.show()

Running the script plots the sample points together with the fitted line in red.
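As a sanity check, the closed-form result can be compared with NumPy's own line fit. The sketch below recomputes $a$ and $b$ for the same five points and compares them with `np.polyfit`, which is used here only as an independent reference and is not part of the article's code.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

# Closed-form least-squares solution, as in the fit() method above.
x_mean, y_mean = x.mean(), y.mean()
a = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b = y_mean - a * x_mean

# np.polyfit with deg=1 fits a straight line and returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

print(a, b)                # approximately 0.8 and 0.4
print(slope, intercept)
```

Both approaches should agree up to floating-point rounding, since they minimize the same sum of squared errors.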

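The code above uses a for loop over individual samples. A more common and much faster approach is vectorized computation.

For the objective $\sum_{i=1}^{m}(y_i - ax_i - b)^2$, the solution is

$a = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}$

$b = \bar{y} - a\bar{x}$

Both the numerator and the denominator of $a$ have the form $\sum_{i=1}^{m}w_i \cdot v_i$,

where $w = (w_1; w_2; w_3; \dots; w_m)$

$v = (v_1; v_2; v_3; \dots; v_m)$

so each sum is simply the dot product $w \cdot v$, and the loop can be replaced by: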

num = (x_train - x_mean).dot(y_train - y_mean)
d = (x_train - x_mean).dot(x_train - x_mean)
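To see that the two versions compute the same quantity, and to get a rough feel for the speed difference, the following sketch times both on synthetic data. The data size, the random seed, and the line $y = 3x + 4$ plus noise are arbitrary choices for the example; actual timings depend on the machine.

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
x_train = rng.random(100_000)
y_train = 3.0 * x_train + 4.0 + rng.normal(size=x_train.size)
x_mean, y_mean = x_train.mean(), y_train.mean()

def num_loop():
    # Element-by-element accumulation, as in LinearRegression1.fit
    total = 0.0
    for x, y in zip(x_train, y_train):
        total += (x - x_mean) * (y - y_mean)
    return total

def num_dot():
    # The same sum written as a single dot product
    return (x_train - x_mean).dot(y_train - y_mean)

loop_value, dot_value = num_loop(), num_dot()
loop_time = timeit(num_loop, number=3)
dot_time = timeit(num_dot, number=3)

print(loop_value, dot_value)
print(f"loop: {loop_time:.4f}s  dot: {dot_time:.4f}s")
```

The two values agree up to rounding; the dot product is typically much faster because the loop runs inside NumPy's compiled code rather than in the Python interpreter.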

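Multiple Linear Regression

The more common case is the dataset $D$ from the beginning, where each sample is described by $d$ attributes. Here we try to learn

$\hat{y}_i = \theta_1x_{i1} + \theta_2x_{i2} + \dots + \theta_dx_{id} + b$

where $b$ is the intercept and $(\theta_1; \dots; \theta_d)$ are the coefficients. Writing $b$ as $\theta_0$ and setting $x_{i0} = 1$ gives

$\hat{y}_i = \theta_0x_{i0} + \theta_1x_{i1} + \theta_2x_{i2} + \dots + \theta_dx_{id}$

So our goal is to

find $\theta_0, \theta_1, \dots, \theta_d$ such that $\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$ is as small as possible.

Let $\theta = (\theta_0; \theta_1; \dots; \theta_d)$. $\theta$ is a one-dimensional vector, which by convention we treat as a column vector.

With each sample extended to $x_i = (x_{i0}; x_{i1}; \dots; x_{id})$, where $x_{i0} = 1$, a prediction is the dot product of $\theta$ and $x_i$:

$f(x_i) = \theta^Tx_i$

Stacking the extended samples as the rows of an $m \times (d+1)$ matrix $X_b$ gives all predictions at once:

$\hat{y} = X_b\theta$

The objective can then be written in matrix form as

$E(\theta) = (y - X_b\theta)^T(y - X_b\theta)$

Making $(y - X_b\theta)^T(y - X_b\theta)$ as small as possible gives

$\theta = (X_b^TX_b)^{-1}X_b^Ty$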
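The step from the objective to the closed form can be filled in as follows. Here $X_b$ denotes the data matrix with a leading column of ones (the matrix the article's formula calls $X$), and the derivation assumes that $X_b^TX_b$ is invertible.

```latex
\begin{align*}
E(\theta) &= (y - X_b\theta)^T (y - X_b\theta)
           = y^T y - 2\,\theta^T X_b^T y + \theta^T X_b^T X_b\,\theta \\
\nabla_\theta E(\theta) &= -2\,X_b^T y + 2\,X_b^T X_b\,\theta = 0 \\
X_b^T X_b\,\theta &= X_b^T y \\
\theta &= (X_b^T X_b)^{-1} X_b^T y
\end{align*}
```

The third line is commonly called the normal equation; the last line simply solves it for $\theta$.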

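Drawback: high time complexity, $O(n^3)$ in the number of features $n$ (about $O(n^{2.4})$ with optimized algorithms).

Advantage: the features do not need to be normalized.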
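The claim about normalization can be checked with a small experiment. The sketch below fits the same synthetic data twice, once as is and once with the two features rescaled by very different factors, and compares the predictions. The data, the scale factors, and the helper `fit_predict` are all made up for the illustration, and it uses `np.linalg.lstsq` (which solves the same least-squares problem) instead of the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.7 + 0.1 * rng.normal(size=50)

def fit_predict(X_train, y_train, X_new):
    """Least-squares fit with an intercept column, then predict on X_new."""
    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    # lstsq solves the same least-squares problem as the normal equation
    theta, *_ = np.linalg.lstsq(X_b, y_train, rcond=None)
    X_new_b = np.hstack([np.ones((len(X_new), 1)), X_new])
    return X_new_b.dot(theta), theta

scale = np.array([1000.0, 0.001])        # wildly different feature scales
pred_raw, theta_raw = fit_predict(X, y, X)
pred_scaled, theta_scaled = fit_predict(X * scale, y, X * scale)

print(np.allclose(pred_raw, pred_scaled))
print(theta_raw[1:], theta_scaled[1:] * scale)
```

Rescaling a feature only rescales its coefficient by the inverse factor, so the predictions stay the same up to rounding. With extreme scale differences the explicit inverse can lose numerical precision, which is one reason to prefer a least-squares solver in practice.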

Code:

import numpy as np
from blog_metrics import r2_score  # local module; presumably holds the metric functions listed at the end of this article
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split


class LinearRegression(object):
    def __init__(self):
        self.coef_ = None
        self.interception_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        # Prepend a column of ones so that theta[0] acts as the intercept
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        # equivalent: np.c_[np.ones((len(X_train), 1)), X_train]

        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X_predict):
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "Linear Regression"


if __name__ == '__main__':
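    # Note: load_boston was removed in scikit-learn 1.2, so this script
    # needs an older scikit-learn release to run as written.
    boston = datasets.load_boston()

    X = boston.data
    y = boston.target

    # Keep only the samples whose target is below 50
    X = X[y < 50]
    y = y[y < 50]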

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    lrg = LinearRegression()
    lrg.fit_normal(X_train, y_train)

    # coefficients
    print(lrg.coef_)
    # intercept
    print(lrg.interception_)

    # R^2 score
    print(lrg.score(X_test, y_test))
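Because the script above depends on a dataset loader that newer scikit-learn releases no longer ship, a quick way to check the normal equation is to use synthetic data with known parameters. The sketch below is such a check under stated assumptions: `true_theta` and the data are invented, the targets are noise-free, and `np.linalg.solve` is shown next to the explicit inverse as an alternative way to solve the same equation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([4.0, 3.0, -2.0, 0.5])   # [intercept, coef_1, coef_2, coef_3], made up
X = rng.normal(size=(200, 3))
X_b = np.hstack([np.ones((len(X), 1)), X])      # leading column of ones for the intercept
y = X_b.dot(true_theta)                         # noise-free targets

# Normal equation written exactly as in fit_normal
theta_inv = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# The same equation solved as a linear system, without forming the inverse
theta_solve = np.linalg.solve(X_b.T.dot(X_b), X_b.T.dot(y))

print(theta_inv)
print(theta_solve)
```

With noise-free targets both variants recover `true_theta` up to rounding, which confirms that the first entry of $\theta$ is the intercept and the remaining entries are the coefficients.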

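Further Notes

The basic idea behind a class of machine learning algorithms

Modeling means finding the model that fits our data as well as possible; for a linear model, that model is the equation of a line. Fitting the data as well as possible comes down to optimizing a function. In $\sum_{i=1}^{m}(y_i - ax_i - b)^2$ that function is a loss function, which measures how much of the samples the model fails to fit. The alternative is to measure how much the model does fit, which is called a utility function.

Almost all parametric learning algorithms follow this pattern: minimize a loss function or maximize a utility function to obtain the model. Examples:

Linear regression, polynomial regression, logistic regression, SVM, neural networks, ...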

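Performance Measures

Evaluating a model's generalization performance requires not only a sound and feasible experimental estimation method but also an evaluation criterion for generalization ability. This criterion is the performance measure.

Performance measures reflect the requirements of the task. When comparing the abilities of different models, different performance measures often lead to different conclusions. This means the quality of a model is relative: what counts as a good model depends not only on the algorithm and the data but also on the task requirements.

In a prediction task:

Dataset: $D = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_m, y_m)\}$

$y_i$ is the true label of sample $x_i$. To evaluate the model, we compare the predictions $\hat{y}_i$ with the true labels $y_i$.

The most commonly used performance measure for regression is the mean squared error (MSE).

Mean squared error (MSE)

$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$

Root mean squared error (RMSE)

$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2} = \sqrt{\mathrm{MSE}}$

Mean absolute error (MAE)

$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left | y_i - \hat{y}_i \right |$

Although the absolute error was set aside as a training objective earlier because it is not differentiable everywhere, it can still be used to evaluate how good a model is.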

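R Squared (widely used)

$R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i(\bar{y} - y_i)^2}$

$\sum_i (\hat{y}_i - y_i)^2$ is the error produced by predicting with our model.

$\sum_i(\bar{y} - y_i)^2$ is the error produced by always predicting $\bar{y}$ (the baseline model); it equals $m$ times the variance of $y$.

$R^2 \le 1$

The larger $R^2$ is, the better. When our model makes no errors at all, $R^2$ reaches its maximum value of 1.

$R^2 = 0$

Our model is no better than the baseline model.

$R^2 < 0$

Our model is worse than the baseline model; the data very likely have no linear relationship at all.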
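The three cases can be reproduced with a few lines of NumPy. In the sketch below the helper `r2` and the three sets of predictions are made up for the illustration; the true values are the five targets from the article's simple regression example.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the formula above."""
    ss_res = np.sum((y_pred - y_true) ** 2)          # error of the model
    ss_tot = np.sum((y_true.mean() - y_true) ** 2)   # error of the mean baseline
    return 1 - ss_res / ss_tot

y_true = np.array([1., 3., 2., 3., 5.])

r2_perfect = r2(y_true, y_true.copy())                          # no errors at all
r2_baseline = r2(y_true, np.full_like(y_true, y_true.mean()))   # always predict the mean
r2_worse = r2(y_true, y_true[::-1])                             # predictions in reversed order

print(r2_perfect, r2_baseline, r2_worse)
```

The first value is 1, the second is 0, and the third is negative because the reversed predictions produce a larger squared error than simply predicting the mean.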

import numpy as np
from math import sqrt


def mean_squared_error(y_true, y_predict):
    assert len(y_true) == len(y_predict),  "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)


def r2_score(y_true, y_predict):
    # MSE / Var(y) equals SS_res / SS_tot because both sums are divided by m
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
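As a worked example, the same four metrics can be computed for the least-squares line through the five sample points from the simple regression section, which works out to $\hat{y} = 0.8x + 0.4$. The sketch below computes them inline with NumPy rather than importing the functions above, so that it runs on its own.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y_true = np.array([1., 3., 2., 3., 5.])
y_predict = 0.8 * x + 0.4       # least-squares line for these five points

mse = np.mean((y_true - y_predict) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_predict))
r2 = 1 - mse / np.var(y_true)   # same form as r2_score above

print(mse, rmse, mae, r2)
```

Here the MSE is about 0.48, the MAE about 0.64, and $R^2$ about 0.73. RMSE and MAE are in the same units as $y$, which makes them easier to interpret than MSE.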