I. Introduction
We can use various metrics to evaluate the performance of ML algorithms, for both classification and regression. The metric used to evaluate ML performance must be chosen carefully, because:
- how the performance of ML algorithms is measured and compared depends entirely on the metric you choose;
- how you weigh the importance of the various characteristics in the result also depends entirely on the metric you choose.
II. Performance Metrics for Regression Problems
Here we discuss the various performance metrics that can be used to evaluate predictions for regression problems.
1. Mean Absolute Error (MAE)
MAE is the simplest error metric for regression problems. It is the mean of the absolute differences between the predicted values and the actual values. In short, MAE tells us how wrong the predictions are on average. The formula for MAE is:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
We can compute MAE with the mean_absolute_error function from sklearn.metrics.
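As a quick sketch (the numbers here are arbitrary illustrative values), MAE computed by hand agrees with sklearn's mean_absolute_error:

```python
from sklearn.metrics import mean_absolute_error

y_actual = [5, -1, 2, 10]        # illustrative actual values
y_predict = [3.5, -0.9, 2, 9.9]  # illustrative predictions

# MAE by hand: mean of the absolute differences
mae_manual = sum(abs(a - p) for a, p in zip(y_actual, y_predict)) / len(y_actual)
mae_sklearn = mean_absolute_error(y_actual, y_predict)
print(mae_manual, mae_sklearn)  # both are about 0.425
```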
2. Mean Squared Error (MSE)
MSE is like MAE, with the one difference that the differences between the actual and predicted values are squared before they are summed. Because of the squaring, MSE is more sensitive to large deviations; the smaller the MSE, the more accurately the model describes the data. The formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
We can compute MSE with the mean_squared_error function from sklearn.metrics.
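The same hand-computation check works for MSE (same illustrative numbers as above):

```python
from sklearn.metrics import mean_squared_error

y_actual = [5, -1, 2, 10]        # illustrative actual values
y_predict = [3.5, -0.9, 2, 9.9]  # illustrative predictions

# MSE by hand: mean of the squared differences
mse_manual = sum((a - p) ** 2 for a, p in zip(y_actual, y_predict)) / len(y_actual)
mse_sklearn = mean_squared_error(y_actual, y_predict)
print(mse_manual, mse_sklearn)  # both are about 0.5675
```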
3. Root Mean Squared Error (RMSE)
As the name suggests, RMSE is the square root of MSE:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Older versions of sklearn.metrics have no dedicated function for this, so we simply take the square root of the MSE (recent scikit-learn releases also provide a root_mean_squared_error function).
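A minimal sketch of the square-root approach, using Python's standard math.sqrt:

```python
from math import sqrt
from sklearn.metrics import mean_squared_error

y_actual = [5, -1, 2, 10]        # illustrative actual values
y_predict = [3.5, -0.9, 2, 9.9]  # illustrative predictions

# RMSE is just the square root of the MSE
rmse = sqrt(mean_squared_error(y_actual, y_predict))
print(rmse)  # about 0.7533
```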
4. R Squared (R²)
The R Squared metric is generally used for explanatory purposes: it indicates how well a set of predicted output values matches the actual output values. The following formula defines it:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
Dividing both the numerator and the denominator by n puts it in an easier form to interpret:
$$R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}$$
In this form the numerator is the MSE and the denominator is the variance of the y values. We can compute R Squared with the r2_score function from sklearn.metrics.
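The identity R² = 1 − MSE/Var(y) can be verified directly (same illustrative numbers as before):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_actual = np.array([5, -1, 2, 10])        # illustrative actual values
y_predict = np.array([3.5, -0.9, 2, 9.9])  # illustrative predictions

# R Squared from sklearn vs. 1 - MSE / variance of y
r2_direct = r2_score(y_actual, y_predict)
r2_from_mse = 1 - mean_squared_error(y_actual, y_predict) / np.var(y_actual)
print(r2_direct, r2_from_mse)  # both are about 0.9656
```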
III. A Simple Example
The following simple Python example shows how to apply the performance metrics explained above to a regression model's output.
from math import sqrt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
Y_actual = [5, -1, 2, 10]
Y_predict = [3.5, -0.9, 2, 9.9]
print('MAE =', mean_absolute_error(Y_actual, Y_predict))
print('MSE =', mean_squared_error(Y_actual, Y_predict))
print('RMSE =', sqrt(mean_squared_error(Y_actual, Y_predict)))
print('R Squared =', r2_score(Y_actual, Y_predict))
MAE = 0.42499999999999993
MSE = 0.5674999999999999
RMSE = 0.7533259586659681
R Squared = 0.9656060606060606
Conclusion: the smaller MAE, MSE, and RMSE are, and the closer R Squared is to 1, the better the trained model fits.
IV. Hands-On with the Boston Housing Data
1. Load the Boston Dataset
# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the Boston dataset
# Note: load_boston was removed in scikit-learn 1.2; run this with an older version
boston = datasets.load_boston()
print(boston.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
# Boston dataset description
print(boston.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# Show the dataset's feature names
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
2. Plot a Scatter Diagram
x = boston.data[:, 5]  # use only the RM feature (average number of rooms)
y = boston.target
# x.shape and y.shape are both (506,)
# Scatter plot
plt.scatter(x, y)
plt.show()
Because np.max(y) is 50.0, the target is capped at 50; these samples should be filtered out.
# Filter out the samples with y = 50
x = x[y < 50.0]
y = y[y < 50.0]
# Redraw the scatter plot
plt.scatter(x, y)
plt.show()
3. Fit a Simple Linear Regression
from sklearn.model_selection import train_test_split
# Split the data into training data and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
print(x_train.shape)
print(x_test.shape)
(392,)
(98,)
With test_size=0.2, the training data makes up 80% of the total (if test_size is omitted, scikit-learn defaults to 0.25).
# Reshape x_train and x_test into single-column 2-D arrays, as sklearn estimators expect 2-D feature matrices
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
# Train
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
# Scatter plot with the fitted regression line
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
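Once fitted, the slope and intercept of the line are available via the estimator's coef_ and intercept_ attributes. A minimal self-contained sketch on synthetic data (the Boston training arrays above would work the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

reg = LinearRegression().fit(x, y)
print(reg.coef_[0], reg.intercept_)  # slope ~ 2.0, intercept ~ 1.0
```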
4. Compute the Performance Metrics
# Predict on the test set
y_predict = reg.predict(x_test)
# Import the metric functions
from math import sqrt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# Compute MAE, MSE, RMSE, and R Squared
print('MAE =', mean_absolute_error(y_test, y_predict))
print('MSE =', mean_squared_error(y_test, y_predict))
print('RMSE =', sqrt(mean_squared_error(y_test, y_predict)))
print('R Squared =', r2_score(y_test, y_predict))
MAE = 3.543097440946387
MSE = 24.15660213438744
RMSE = 4.914936635846636
R Squared = 0.6129316803937322