2.3 The Mother of Models: Study Notes on Evaluation Metrics for Linear Regression

Latest recommended article published on 2023-06-17 18:49:55

Leorio Paladinight

Article tags: machine learning

Original link: https://mp.weixin.qq.com/s/BEmMdQd2y1hMu9wT8QYCPg

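This post is about evaluating regression models. It first introduces three commonly used evaluation metrics for linear regression, then implements them in code on a practical example, Boston house price prediction. Finally, it introduces what is arguably the best metric for linear regression: R Squared.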

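1. Evaluation criteria for linear regression

The goal of simple linear regression is: given training samples x and y, find the values of a and b that make Σ(y - ax - b)² as small as possible.

The evaluation criterion is the gap between the true values of y and the predicted values on the test set.

After obtaining a and b, we substitute x(test) into the model and can use Σ(y(test) - a·x(test) - b)² as the measure of how good the regression is.

There is a problem, though: this measure depends on the number of samples m. In practice, different test sets accumulate different amounts of error.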

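1.1 Mean squared error (MSE)

Test sets differ in size m. Because the measure is a sum, the error accumulates as more data is added, so the measure depends on m. To cancel the effect of the data size, we divide by the number of samples. The result is called the mean squared error, MSE: MSE = (1/m) Σ(y(test) - ŷ(test))², where ŷ(test) = a·x(test) + b.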
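
To make the dependence on m concrete, here is a minimal sketch with made-up numbers. The arrays and the constant error of 2 are assumptions for illustration only and are not taken from the Boston data; the helper name sum_squared_error is also hypothetical.

```python
import numpy as np

def sum_squared_error(y_true, y_predict):
    """Accumulated squared error; grows with the number of samples."""
    return np.sum((y_true - y_predict) ** 2)

def mean_squared_error(y_true, y_predict):
    """Squared error averaged over the m samples (MSE)."""
    return np.sum((y_true - y_predict) ** 2) / len(y_true)

# Hypothetical test sets: every prediction is off by exactly 2.
y_small = np.zeros(10)
y_large = np.zeros(1000)
pred_small = y_small + 2.0
pred_large = y_large + 2.0

print(sum_squared_error(y_small, pred_small))   # 40.0
print(sum_squared_error(y_large, pred_large))   # 4000.0
print(mean_squared_error(y_small, pred_small))  # 4.0
print(mean_squared_error(y_large, pred_large))  # 4.0
```

The model is equally wrong on both sets, yet the raw sum differs by a factor of 100; only the averaged version is comparable across test sets of different sizes.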

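1.2 Root mean squared error (RMSE)

MSE, however, is affected by units. For example, when evaluating house prices with y in units of 10,000 yuan, the metric comes out in (10,000 yuan) squared. To fix the unit problem we take the square root (just as taking the square root of the variance gives the standard deviation), which yields the root mean squared error, RMSE: RMSE = sqrt(MSE).

1.3 Mean absolute error (MAE)

There is another, very simple evaluation criterion for linear regression. We want the distance between the true values and the predictions to be small, so we can subtract them directly, take the absolute value, add up the m terms and divide by m. This average distance is called the mean absolute error, MAE: MAE = (1/m) Σ|y(test) - ŷ(test)|.

When choosing the loss function we did not use the absolute value, because the absolute value function is not differentiable everywhere. That does not matter when evaluating a model, so the evaluation metric can be different from the loss function.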

2. Implementing the evaluation metrics in code

2.1 Data exploration

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

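# View the dataset description
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older scikit-learn release.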
boston = datasets.load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

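Because we are testing simple linear regression, we pick a single feature for modeling. We choose:

RM: average number of rooms per dwelling

Next, some simple data exploration:

# View the dataset's list of feature names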
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

# Take all rows of the sixth column (number of rooms)
x = boston.data[:,5]
x.shape

(506,)

# Take the sample labels
y = boston.target
y.shape

(506,)

plt.scatter(x,y)
plt.show()

[Figure: scatter plot of x (RM) against y (MEDV); output_19_0.png]

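In the plot, a number of points sit at the y = 50 level (MEDV is in $1000's, so this is $50,000). These points were probably capped at an upper limit: for example, if the highest price bracket in a survey is "50 or above", every such value is recorded as 50. In this example we can therefore drop that part of the data.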

np.max(y)
# A neat trick: a comparison operator returns a boolean array, which can be used
# as an index to filter the elements directly.
x = x[y < 50.0]
y = y[y < 50.0]
plt.scatter(x,y)
plt.show()

[Figure: the same scatter plot after removing the points with y = 50; output_21_0.png]
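
The boolean filtering used above can be seen in isolation in the following small sketch. The four values are hypothetical, chosen only so the result is easy to check by eye.

```python
import numpy as np

# Hypothetical values, not the Boston data.
x = np.array([5.0, 6.0, 7.0, 8.0])
y = np.array([20.0, 50.0, 35.0, 50.0])

mask = y < 50.0      # boolean array: True where y is below the cap
x_kept = x[mask]     # keep x where the matching y is below the cap
y_kept = y[mask]     # filter y with the same mask

print(mask)    # [ True False  True False]
print(x_kept)  # [5. 7.]
print(y_kept)  # [20. 35.]
```

Storing the mask once avoids an ordering pitfall: in the two-line form `x = x[y < 50.0]` followed by `y = y[y < 50.0]`, x has to be filtered first, because the mask is computed from the unfiltered y.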

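2.2 Simple linear regression prediction

# model_selection here appears to be the project's own module, not sklearn's
# (it takes seed=, whereas sklearn's train_test_split takes random_state=)
from model_selection import train_test_split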

x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
print(x_train.shape)    # (392,)
print(y_train.shape)    #(392,)
print(x_test.shape)     #(98,)
print(y_test.shape)     #(98,)

(392,)
(392,)
(98,)
(98,)
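
The project's model_selection module is not shown in these notes. As a rough sketch of what such a helper might look like, the version below shuffles indices with a seeded generator and splits them. The default test_ratio=0.2 is an assumption inferred from the 392/98 split above, and this sketch will not reproduce the exact same shuffle as the original module.

```python
import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    """Shuffle the indices, then split x and y into train and test parts."""
    assert x.shape[0] == y.shape[0], "x and y must have the same length"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be in [0, 1]"

    rng = np.random.default_rng(seed)    # seeded generator for reproducibility
    shuffled = rng.permutation(len(x))   # random ordering of the row indices

    test_size = int(len(x) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical data with the same length as the filtered Boston sample (490).
x_demo = np.arange(490, dtype=float)
y_demo = 2.0 * x_demo
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, seed=666)
print(x_tr.shape, x_te.shape)  # (392,) (98,)
```

Because the same index arrays are applied to both x and y, each feature value stays paired with its label after the shuffle.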

from SimpleLinearRegression import SimpleLinearRegression

reg = SimpleLinearRegression()
reg.fit(x_train,y_train)
print(reg.a_)   # 7.8608543562689555
print(reg.b_)   # -27.459342806705543

7.860854356268954
-27.459342806705536
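
The SimpleLinearRegression class is imported from a project file that is not reproduced here. A minimal sketch of such a class, assuming it uses the closed-form least-squares solution for one feature and exposes the a_ and b_ attributes seen above, could look like this. It is an illustration, not the author's actual implementation.

```python
import numpy as np

class SimpleLinearRegression:
    """Least-squares fit of y = a * x + b for a single feature (illustrative sketch)."""

    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # Closed-form least-squares solution for the slope and intercept.
        num = np.dot(x_train - x_mean, y_train - y_mean)
        den = np.dot(x_train - x_mean, x_train - x_mean)
        self.a_ = num / den
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x):
        return self.a_ * x + self.b_

# Hypothetical noise-free data generated from y = 2x + 1.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 2.0 * x_demo + 1.0
reg_demo = SimpleLinearRegression().fit(x_demo, y_demo)
print(reg_demo.a_, reg_demo.b_)  # 2.0 1.0
```

On noise-free data the fit recovers the generating slope and intercept, which is a quick sanity check for any implementation of this kind.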

plt.scatter(x_train,y_train)
plt.plot(x_train, reg.predict(x_train),color='r')
plt.show()

[Figure: scatter plot of the training data with the fitted regression line in red; output_25_0.png]

y_predict = reg.predict(x_test)
print(y_predict)

[23.09381156 23.14883754 19.20268865 29.02089574 25.6014241   5.06887252
 24.66598243 26.47397893 15.52380881 28.38416654 17.29250104 13.0633614
 23.99780981 21.37228445 23.29033292 21.66313607 21.22292822 19.94946982
 22.41777808 25.17693796 19.51712283 24.14716604 24.57165218 19.07691498
 23.14097668 28.78507011 20.46042535 18.18863844 15.93257324 29.46110359
 31.65428195 19.36776659 16.77368466 38.07659996 19.72936589 20.99496345
 18.27510784 24.22577459 21.67099692 22.7086297  21.22292822 19.17910609
 15.41375685 19.41493172 16.5771633  23.13311583 23.71481905 30.13713706
 17.99211708 24.69742585 19.43065343 25.4284853  22.71649055 16.53785903
 19.13194096 18.82536764 22.15836989 18.55809859 25.03544258 29.52399042
 18.75461995 19.69006162 18.59740287 14.58836714 18.81750679 24.65812158
 20.83774636 19.77653102 27.88893272 19.32846232 22.9837596  22.86584678
 25.3262942  22.19767416 26.25387501 24.6188173  20.45256449 16.55358073
 14.2346287  26.19884903 35.96989099 19.62717479 21.01854601 15.42947856
 20.90849405 20.24818228 21.79677059 27.37797718 22.65360371 18.69959397
 23.5340194  27.31509035 32.63688875 20.02807836 19.43851428 30.38082355
 31.13546556 25.00399917]

2.3 MSE (mean squared error)

mse_test = np.sum((y_predict - y_test) ** 2) / len(y_test)
mse_test

24.156602134387438

2.4 RMSE (root mean squared error)

from math import sqrt

rmse_test = sqrt(mse_test)
rmse_test

4.914936635846635

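RMSE removes the unit mismatch: the result is 4.9, in the same units as y. Since MEDV is given in $1000's, this means that under the RMSE metric our house price predictions are off by about $4,900 on average.

2.5 MAE (mean absolute error)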

mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
mae_test

3.543097440946387

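Under the MAE metric, our house price predictions are off by about $3,540 on average (3.54 in units of $1000). The error reported by MAE is smaller than the one reported by RMSE, which shows that different metrics give different results.

Mathematically, RMSE and MAE have the same units, but RMSE comes out larger. RMSE squares each error, and squaring amplifies the samples where the prediction is far from the true value. MAE does no such amplification. Since what we want is to shrink the largest gaps, RMSE is the slightly better choice.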
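
The amplification effect can be checked with a tiny sketch. The two error vectors below are hypothetical: both have the same total absolute error, but one concentrates it in a single sample. The helper names rmse and mae are defined here only for this illustration.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error computed from a vector of residuals."""
    return np.sqrt(np.mean(errors ** 2))

def mae(errors):
    """Mean absolute error computed from a vector of residuals."""
    return np.mean(np.abs(errors))

# Hypothetical residuals with the same total absolute error (8).
even_errors = np.array([2.0, 2.0, 2.0, 2.0])
one_big_error = np.array([0.0, 0.0, 0.0, 8.0])

print(mae(even_errors), rmse(even_errors))      # 2.0 2.0
print(mae(one_big_error), rmse(one_big_error))  # 2.0 4.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the error is concentrated in one large miss.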

3. Encapsulation and usage

3.1 Encapsulating in the project file

Add the metrics above to metrics.py in the project:

import numpy as np
from math import sqrt

def accuracy_score(y_true, y_predict):
    """Compute the accuracy between y_true and y_predict."""
    # The sizes must match (the check is ==, not !=).
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(y_true == y_predict) / len(y_true)

def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)

def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict."""
    return sqrt(mean_squared_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(np.absolute(y_predict - y_true)) / len(y_predict)

3.2 Usage

from metrics import mean_squared_error
from metrics import root_mean_squared_error
from metrics import mean_absolute_error

mean_squared_error(y_test, y_predict)

24.156602134387438

root_mean_squared_error(y_test, y_predict)

4.914936635846635

mean_absolute_error(y_test, y_predict)

3.543097440946387

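3.3 MSE and MAE in sklearn

The sklearn version used in these notes has no standalone RMSE function (releases from 1.4 onward add sklearn.metrics.root_mean_squared_error), so RMSE can be obtained by taking the square root of the MSE manually: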

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

mean_squared_error(y_test, y_predict)

24.156602134387438

mean_absolute_error(y_test, y_predict)

3.543097440946387

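4. The more useful R Squared

4.1 Introducing R Squared and why it is good

Classification accuracy always takes a value between 0 and 1. RMSE and MAE have no such property: what they return is an error expressed in the units of the target. That is their limitation. For example, predicting Boston house prices gives an RMSE of 4.9 (thousand dollars), while predicting height might give an error of 10 (centimeters). We cannot say which model is more accurate, because the two errors are in entirely different units.

This limitation can be resolved with a new metric, R Squared: R^2 = 1 - Σ(ŷ(i) - y(i))² / Σ(ȳ - y(i))², where ŷ(i) is the prediction and ȳ is the mean of the true values.

Why is R Squared a good metric?

The numerator is the sum of squared differences between the predictions and the true values, i.e. the error produced by our model.
The denominator is the sum of squared differences between the mean and the true values, i.e. the error produced by the model "prediction = sample mean" (the baseline model).
The baseline model produces more error and our own model produces less. So 1 minus (the smaller error divided by the larger error) measures the portion of the data that our model fits, i.e. the part where it makes no error.

From this analysis we can conclude:

R^2 <= 1
The larger R^2 the better: a larger value means the numerator of the subtracted fraction is small, i.e. the error is low. When our model makes no errors at all, R^2 reaches its maximum of 1.
When our model is equivalent to the baseline model, R^2 = 0
If R^2 < 0, the model we learned is worse than the baseline model. In that case it is quite possible that the data has no linear relationship at all.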
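
The three cases listed above can be reproduced with a short sketch. The data is hypothetical, and the helper r2 is written out here just for this check: it divides the model's squared error by the baseline's squared error and subtracts the ratio from 1.

```python
import numpy as np

def r2(y_true, y_predict):
    """1 - (model's squared error) / (baseline's squared error)."""
    ss_res = np.sum((y_true - y_predict) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical true values.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = y_true.copy()                          # no errors at all
baseline = np.full_like(y_true, y_true.mean())   # always predict the mean
worse = y_true[::-1].copy()                      # predictions in reversed order

print(r2(y_true, perfect))   # 1.0
print(r2(y_true, baseline))  # 0.0
print(r2(y_true, worse))     # -3.0
```

The reversed predictions have four times the squared error of the mean predictor, which is why the score drops to -3 rather than stopping at 0.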

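4.2 Implementing R Squared

If we divide both the numerator and the denominator by m, the numerator becomes the mean squared error introduced earlier, and the denominator is the variance of y: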

1 - mean_squared_error(y_test, y_predict) / np.var(y_test)

0.6129316803937322

Next we add our own r2_score function to metrics.py in the project and call it:

from metrics import r2_score
r2_score(y_test, y_predict)

0.6129316803937322
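
The body of r2_score in metrics.py is not listed in these notes. One plausible sketch, assuming it follows the "MSE divided by variance" form used above, is given below. The four sample values are hypothetical and only serve as a check against the sum-of-squares definition.

```python
import numpy as np

def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)

def r2_score(y_true, y_predict):
    """R Squared computed as 1 - MSE / Var(y_true)."""
    # np.var divides by m by default (ddof=0), matching the derivation above.
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)

# Hypothetical values used only to cross-check the two formulations.
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

via_mse = r2_score(y_true_demo, y_pred_demo)
via_sums = 1 - (np.sum((y_true_demo - y_pred_demo) ** 2)
                / np.sum((y_true_demo - y_true_demo.mean()) ** 2))
print(np.isclose(via_mse, via_sums))  # True
```

Note that np.var must use its default ddof=0 here; the sample variance (ddof=1) would divide by m - 1 and no longer cancel against the MSE's division by m.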

This gives the same result as calling the function in sklearn:

from sklearn.metrics import r2_score
r2_score(y_test, y_predict)

0.6129316803937324

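The evaluation metrics for linear regression are very different from those for classification. This post introduced the mean squared error MSE (the sum of squared differences between predictions and true values, divided by the sample size), the root mean squared error RMSE (the square root of MSE, which restores the units of y), the mean absolute error MAE (the sum of absolute differences between predictions and true values, divided by the sample size), and the very important and effective R Squared (1 minus the smaller error divided by the larger error, which measures the portion of the data the model fits, i.e. the part where it makes no error).

In practice, these evaluation metrics are what we use to judge how good a model is.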