I. Introduction
We can use various metrics to evaluate the performance of ML algorithms, for both classification and regression. The metric used to evaluate ML performance must be chosen carefully, because:
- how the performance of ML algorithms is measured and compared depends entirely on the metric you choose;
- how you weigh the importance of the various characteristics in the result also depends entirely on the metric you choose.
II. Performance Metrics for Regression Problems
Here we discuss the various performance metrics that can be used to evaluate predictions for regression problems.
1. Mean Absolute Error (MAE)
MAE is the simplest error metric for regression problems. It is the mean of the absolute differences between the predicted values and the actual values. In short, MAE tells us how wrong the predictions are on average. The formula for MAE is:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
We can compute MAE with the mean_absolute_error function from sklearn.metrics.
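As a quick sketch (the numbers here are arbitrary illustrative values), MAE computed by hand agrees with sklearn's mean_absolute_error:

```python
from sklearn.metrics import mean_absolute_error

y_actual = [5, -1, 2, 10]        # illustrative actual values
y_predict = [3.5, -0.9, 2, 9.9]  # illustrative predictions

# MAE by hand: mean of the absolute differences
mae_manual = sum(abs(a - p) for a, p in zip(y_actual, y_predict)) / len(y_actual)
mae_sklearn = mean_absolute_error(y_actual, y_predict)
print(mae_manual, mae_sklearn)  # both are about 0.425
```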
2. Mean Squared Error (MSE)
MSE is like MAE, with the one difference that the differences between the actual and predicted values are squared before they are summed. Because of the squaring, MSE is more sensitive to large deviations; the smaller the MSE, the more accurately the model describes the data. The formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
We can compute MSE with the mean_squared_error function from sklearn.metrics.
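The same hand-computation check works for MSE (same illustrative numbers as above):

```python
from sklearn.metrics import mean_squared_error

y_actual = [5, -1, 2, 10]        # illustrative actual values
y_predict = [3.5, -0.9, 2, 9.9]  # illustrative predictions

# MSE by hand: mean of the squared differences
mse_manual = sum((a - p) ** 2 for a, p in zip(y_actual, y_predict)) / len(y_actual)
mse_sklearn = mean_squared_error(y_actual, y_predict)
print(mse_manual, mse_sklearn)  # both are about 0.5675
```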
3. Root Mean Squared Error (RMSE)
As the name suggests, RMSE is the square root of MSE:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Older versions of sklearn.metrics have no dedicated function for this, so we simply take the square root of the MSE (recent scikit-learn releases also provide a root_mean_squared_error function).
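A minimal sketch of the square-root approach, using Python's standard math.sqrt:

```python
from math import sqrt
from sklearn.metrics import mean_squared_error

y_actual = [5, -1, 2, 10]        # illustrative actual values
y_predict = [3.5, -0.9, 2, 9.9]  # illustrative predictions

# RMSE is just the square root of the MSE
rmse = sqrt(mean_squared_error(y_actual, y_predict))
print(rmse)  # about 0.7533
```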
4. R Squared (R²)
The R Squared metric is generally used for explanatory purposes: it indicates how well a set of predicted output values matches the actual output values. The following formula defines it:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
Dividing both the numerator and the denominator by n puts it in an easier form to interpret:
$$R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}$$
In this form the numerator is the MSE and the denominator is the variance of the y values. We can compute R Squared with the r2_score function from sklearn.metrics.
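The identity R² = 1 − MSE/Var(y) can be verified directly (same illustrative numbers as before):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_actual = np.array([5, -1, 2, 10])        # illustrative actual values
y_predict = np.array([3.5, -0.9, 2, 9.9])  # illustrative predictions

# R Squared from sklearn vs. 1 - MSE / variance of y
r2_direct = r2_score(y_actual, y_predict)
r2_from_mse = 1 - mean_squared_error(y_actual, y_predict) / np.var(y_actual)
print(r2_direct, r2_from_mse)  # both are about 0.9656
```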
III. A Simple Example
The following simple Python example shows how to apply the performance metrics explained above to a regression model's output.
from math import sqrt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
Y_actual = [5, -1, 2, 10]
Y_predict = [3.5, -0.9, 2, 9.9]
print('MAE =', mean_absolute_error(Y_actual, Y_predict))
print('MSE =', mean_squared_error(Y_actual, Y_predict))
print('RMSE =', sqrt(mean_squared_error(Y_actual, Y_predict)))
print('R Squared =', r2_score(Y_actual, Y_predict))
MAE = 0.42499999999999993
MSE = 0.5674999999999999
RMSE = 0.7533259586659681
R Squared = 0.9656060606060606
Conclusion: the smaller MAE, MSE, and RMSE are, and the closer R Squared is to 1, the better the trained model fits.
IV. Hands-On with the Boston Housing Data
1. Load the Boston Dataset
# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the Boston dataset
# Note: load_boston was removed in scikit-learn 1.2; run this with an older version
boston = datasets.load_boston()
print(boston.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
# Boston dataset description
print(boston.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# Show the dataset's feature names
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
2. Plot a Scatter Diagram
x = boston.data[:, 5]  # use only the RM feature (average number of rooms)
y = boston.target
# x.shape and y.shape are both (506,)
# Scatter plot
plt.scatter(x, y)
plt.show()
Because np.max(y) is 50.0, the target is capped at 50; these samples should be filtered out.
# Filter out the samples with y = 50
x = x[y < 50.0]
y = y[y < 50.0]
# Redraw the scatter plot
plt.scatter(x, y)
plt.show()
3. Fit a Simple Linear Regression
from sklearn.model_selection import train_test_split
# Split the data into training data and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
print(x_train.shape)
print(x_test.shape)
(392,)
(98,)
With test_size=0.2, the training data makes up 80% of the total (if test_size is omitted, scikit-learn defaults to 0.25).
# Reshape x_train and x_test into single-column 2-D arrays, as sklearn estimators expect 2-D feature matrices
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
# Train
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
# Scatter plot with the fitted regression line
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
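Once fitted, the slope and intercept of the line are available via the estimator's coef_ and intercept_ attributes. A minimal self-contained sketch on synthetic data (the Boston training arrays above would work the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

reg = LinearRegression().fit(x, y)
print(reg.coef_[0], reg.intercept_)  # slope ~ 2.0, intercept ~ 1.0
```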
4. Compute the Performance Metrics
# Predict on the test set
y_predict = reg.predict(x_test)
# Import the metric functions
from math import sqrt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# Compute MAE, MSE, RMSE, and R Squared
print('MAE =', mean_absolute_error(y_test, y_predict))
print('MSE =', mean_squared_error(y_test, y_predict))
print('RMSE =', sqrt(mean_squared_error(y_test, y_predict)))
print('R Squared =', r2_score(y_test, y_predict))
MAE = 3.543097440946387
MSE = 24.15660213438744
RMSE = 4.914936635846636
R Squared = 0.6129316803937322