2.1.2.5 Ensemble Models (Regression)
1. Model Introduction
In addition to the regressor versions of the ordinary random forest and gradient boosting tree models used before, this section introduces another variant of the random forest model: extremely randomized trees (Extra Trees). Unlike an ordinary random forest, an extremely randomized tree does not exhaustively search for the best split when building each node; instead it first draws a random subset of the features, generates a candidate split for each of them at random, and then uses metrics such as information gain (Information Gain) or Gini impurity (Gini Impurity) to pick the best of these random candidates as the node's split.
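The split-selection difference can be sketched in a few lines of NumPy. The function below is an illustration only (the name `extra_tree_split` and the parameter `n_candidate_features` are made up, not part of any library): it samples a subset of features, draws one random threshold per feature, and keeps the candidate with the lowest MSE impurity, which mirrors how an extremely randomized tree chooses a split.

```python
import numpy as np

rng = np.random.default_rng(0)

def extra_tree_split(X, y, n_candidate_features=3):
    """Choose a split the way extremely randomized trees do (illustrative):
    sample a feature subset, draw ONE random threshold per feature, and keep
    the candidate with the lowest variance-based (MSE) impurity."""
    features = rng.choice(X.shape[1], size=n_candidate_features, replace=False)
    best = None
    for f in features:
        lo, hi = X[:, f].min(), X[:, f].max()
        t = rng.uniform(lo, hi)          # random threshold, not an exhaustive search
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        if len(left) == 0 or len(right) == 0:
            continue                     # degenerate split, skip it
        # Weighted variance of the two children = MSE impurity for regression.
        imp = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if best is None or imp < best[0]:
            best = (imp, f, t)
    return best                          # (impurity, feature index, threshold)

X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)
print(extra_tree_split(X, y))
```

A regular random forest would, at this point, scan every threshold of each candidate feature for the best one; drawing a single random threshold per feature is what makes Extra Trees "extremely" randomized.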
2. Data Description
(1) Description of the Boston house-price data
# Code 34: describing the Boston house-price data
# Import the Boston house-price data loader from sklearn.datasets.
from sklearn.datasets import load_boston
# Load the house-price data and store it in the variable boston.
boston = load_boston()
# Print the dataset description.
print(boston.DESCR)
Local output:
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Conclusion: In total, the dataset contains 506 records of Boston-area house prices; each record consists of 13 numerical features describing a house plus the target price. In addition, there are no missing attribute/feature values, which simplifies the subsequent analysis.
(2) Splitting the Boston house-price data
# Code 35: splitting the Boston house-price data
# Import the data splitter from sklearn.model_selection.
from sklearn.model_selection import train_test_split
# Import numpy and alias it as np.
import numpy as np
X = boston.data
y = boston.target
# Randomly sample 25% of the data as the test set; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.25)
# Examine the spread of the regression target values.
print("The max target value is", np.max(boston.target))
print("The min target value is", np.min(boston.target))
print("The average target value is", np.mean(boston.target))
Note: the original import, from sklearn.cross_validation import train_test_split, raises an error:
from sklearn.cross_validation import train_test_split
ModuleNotFoundError: No module named 'sklearn.cross_validation'
cross_validation must be replaced with model_selection:
from sklearn.model_selection import train_test_split
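If the code has to run on both old and new scikit-learn versions, the import can also be guarded. This is a common compatibility pattern, not something the original text uses:

```python
# train_test_split lived in sklearn.cross_validation until 0.18 and the old
# module was removed in 0.20; fall back to it only when the new path fails.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

print(callable(train_test_split))
```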
Local output:
The max target value is 50.0
The min target value is 5.0
The average target value is 22.532806324110677
Conclusion: the target house prices span a wide range, so both the features and the target values should be standardized.
Note: there is no need to worry about standardizing the true house prices. Although the data look very different after standardization, the scaler's inverse_transform function can always recover the original values, and the same trick can be applied to the predicted regression values.
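A minimal round-trip sketch of this claim, using made-up prices rather than the actual Boston targets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

prices = np.array([[21.0], [13.5], [50.0], [5.0]])  # toy target values, column form
scaler = StandardScaler()
scaled = scaler.fit_transform(prices)       # zero mean, unit variance
restored = scaler.inverse_transform(scaled) # undo the standardization
print(np.allclose(restored, prices))        # prints True
```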
(3) Standardizing the Boston house-price data
# Code 36: standardizing the training and test data
# Import the standardization module from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
# Initialize separate scalers for the features and the target values.
ss_X = StandardScaler()
ss_y = StandardScaler()
# Standardize the features and target values of both the training and test data.
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))
Note: the original code fails because of a toolkit version update, so the following fix is applied.
The error message points to the two offending lines of the original code:
y_train = ss_y.fit_transform(y_train)
y_test = ss_y.transform(y_test)
The problem lies in the shape of the data: a 1-D array such as [1, 2, 3, 4] triggers the error, whereas the column-vector form [[1], [2], [3], [4]] does not. The two lines are therefore rewritten as:
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))
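The shape change can be seen directly with NumPy: reshape(-1, 1) turns a flat array into the one-column matrix the scaler expects, and ravel() undoes it.

```python
import numpy as np

y = np.array([1, 2, 3, 4])   # shape (4,): the flat form that triggers the error
y_col = y.reshape(-1, 1)     # shape (4, 1): one sample per row, one feature
print(y_col.shape)           # (4, 1)
print(y_col.ravel().shape)   # (4,): back to the flat form
```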
3. Programming Practice
Use three ensemble regression models from Scikit-learn, namely RandomForestRegressor, ExtraTreesRegressor, and GradientBoostingRegressor, to perform regression prediction on the Boston house-price data.
# Code 45: training three ensemble regression models on the Boston house-price training data and predicting on the test data
# Import RandomForestRegressor, ExtraTreesRegressor, and GradientBoostingRegressor from sklearn.ensemble.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
# Train a RandomForestRegressor and predict on the test data; store the result in rfr_y_predict.
# ravel() flattens the standardized (n, 1) target back to 1-D, as fit expects.
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train.ravel())
rfr_y_predict = rfr.predict(X_test)
# Train an ExtraTreesRegressor and predict on the test data; store the result in etr_y_predict.
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train.ravel())
etr_y_predict = etr.predict(X_test)
# Train a GradientBoostingRegressor and predict on the test data; store the result in gbr_y_predict.
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train.ravel())
gbr_y_predict = gbr.predict(X_test)
4. Performance Evaluation
Evaluate the predictive power of the three ensemble regression models on the Boston house-price test data and compare their performance.
# Code 46: evaluating the three ensemble regression models on the Boston house-price test data
# Evaluate the default random forest regressor on the test set using the R-squared, MSE, and MAE metrics.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
print('R-squared value of RandomForestRegressor:', rfr.score(X_test, y_test))
print('The mean squared error of RandomForestRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rfr_y_predict)))
print('The mean absolute error of RandomForestRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rfr_y_predict)))
# Evaluate the default extremely randomized trees regressor on the test set using the same metrics.
print('R-squared value of ExtraTreesRegressor:', etr.score(X_test, y_test))
print('The mean squared error of ExtraTreesRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(etr_y_predict)))
print('The mean absolute error of ExtraTreesRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(etr_y_predict)))
# Using the trained extremely randomized trees model, print the contribution of each feature to the prediction target.
print(np.sort(list(zip(etr.feature_importances_, boston.feature_names)), axis=0))
# Evaluate the default gradient boosting regressor on the test set using the same metrics.
print('R-squared value of GradientBoostingRegressor:', gbr.score(X_test, y_test))
print('The mean squared error of GradientBoostingRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(gbr_y_predict)))
print('The mean absolute error of GradientBoostingRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(gbr_y_predict)))
Note: the original code fails with the error:
a.sort(axis=axis, kind=kind, order=order)
numpy.AxisError: axis 0 is out of bounds for array of dimension 0
Cause: zip() behaves differently in Python 2 and Python 3. In Python 3.x, zip() returns an iterator object to save memory, so it must be wrapped in list() before it can be treated as a list. The call therefore becomes:
print(np.sort(list(zip(etr.feature_importances_, boston.feature_names)), axis=0))
Local output:
R-squared value of RandomForestRegressor: 0.8057603527804128
The mean squared error of RandomForestRegressor: 15.061593700787402
The mean absolute error of RandomForestRegressor: 2.4366929133858273
R-squared value of ExtraTreesRegressor: 0.7998120972647847
The mean squared error of ExtraTreesRegressor: 15.522829133858266
The mean absolute error of ExtraTreesRegressor: 2.512834645669291
[['0.0018973421722974625' 'AGE']
['0.015609006100893458' 'B']
['0.016601690570333415' 'CHAS']
['0.022296289296773814' 'CRIM']
['0.022370329321442138' 'DIS']
['0.022796518091252754' 'INDUS']
['0.029163602082538948' 'LSTAT']
['0.029749918933513652' 'NOX']
['0.0341990933203418' 'PTRATIO']
['0.07706383242917793' 'RAD']
['0.08919109041561663' 'RM']
['0.29655508242192996' 'TAX']
['0.34250620484388805' 'ZN']]
R-squared value of GradientBoostingRegressor: 0.8369584469718909
The mean squared error of GradientBoostingRegressor: 12.642453089294607
The mean absolute error of GradientBoostingRegressor: 2.2765542114339175
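One caveat about the feature-importance listing above: np.sort(..., axis=0) sorts the importance column and the name column independently (notice that the names come out in alphabetical order), so after sorting, a value may sit next to a feature it does not belong to. A pairing-safe alternative is to sort (importance, name) tuples together. The sketch below uses made-up importance values in place of etr.feature_importances_:

```python
import numpy as np

# Hypothetical importances standing in for etr.feature_importances_.
feature_names = np.array(['CRIM', 'ZN', 'RM', 'LSTAT'])
importances = np.array([0.05, 0.02, 0.41, 0.52])

# sorted() on (importance, name) tuples keeps each pair intact,
# unlike np.sort(..., axis=0), which sorts the two columns separately.
ranked = sorted(zip(importances, feature_names))
for imp, name in ranked:
    print(f"{name:8s} {imp:.2f}")
```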
5. Characteristics Analysis
Many practitioners who develop and build commercial analytics systems in industry favor ensemble models, and they often use the performance of these models as the baseline against which newly designed models are compared. Although ensemble models take more time to train, they usually deliver higher performance and better stability.
Ranking all of the models introduced so far by their performance on the Boston house-price prediction problem likewise shows that nonlinear regression-tree models, and ensemble models in particular, achieve the best performance.
Table 2-1  Ranking of several classic regression models on the Boston house-price prediction problem

| Rank | Regressors | R-squared | MSE | MAE |
|------|------------|-----------|-------|------|
| 1 | GradientBoostingRegressor | 0.8370 | 12.64 | 2.28 |
| 2 | RandomForestRegressor | 0.8058 | 15.06 | 2.44 |
| 3 | ExtraTreesRegressor | 0.7998 | 15.52 | 2.51 |
| 4 | SVM Regressor (RBF Kernel) | 0.7560 | 18.92 | 2.61 |
| 5 | KNN Regressor (Distance-weighted) | 0.7201 | 21.70 | 2.80 |
| 6 | DecisionTreeRegressor | 0.6911 | 23.95 | 3.24 |
| 7 | KNN Regressor (Uniform-weighted) | 0.6907 | 23.98 | 2.97 |
| 8 | LinearRegression | 0.6758 | 25.14 | 3.53 |
| 9 | SGDRegressor | 0.6669 | 25.83 | 3.50 |
| 10 | SVM Regressor (Linear Kernel) | 0.6507 | 27.09 | 3.43 |
| 11 | SVM Regressor (Poly Kernel) | 0.4037 | 46.24 | 3.74 |
Note: Table 2-1 was compiled from the "local output" above and differs from the original table in one respect: RandomForestRegressor and ExtraTreesRegressor swap places; all other rankings are unchanged.