Predicting and Evaluating Boston House Prices from sklearn with Support Vector Machine Models in Python (linear, poly, and rbf kernels)

Latest recommended article published on 2024-08-19 14:45:48

ZiTalk梓言梓语

Category: University. Tags: python, data mining, support vector machine

Article link: https://blog.csdn.net/jjhhshhgg/article/details/105961013

Source code:

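from sklearn.datasets import load_boston  # Boston house-price loader (deprecated in scikit-learn 1.0, removed in 1.2)
boston = load_boston()  # Store the dataset in the variable boston
print(boston.DESCR)  # Print the dataset description
'''Where:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population with lower status
MEDV: median value of owner-occupied homes, in $1000s (the prediction target)
'''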

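import numpy as np  # Import numpy
from sklearn.model_selection import train_test_split  # Import the data splitter from sklearn.model_selection
'''
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
test_size: a float between 0 and 1 is the fraction of samples placed in the test set; an integer is the absolute number of test samples.
random_state: the seed of the random number generator that shuffles the data before splitting. Passing any fixed integer (0 included) gives the same split on every run as long as the other arguments are unchanged; leaving it as None gives a different split each time.
The random numbers are determined by the seed: different seeds give different sequences, and the same seed gives the same sequence even across different instances.
'''
x = boston.data  # Feature matrix
y = boston.target  # Target values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)  # Hold out 25% as test samples; the rest are training samples
print("Max target value:", np.max(boston.target))
print("Min target value:", np.min(boston.target))
print("Mean target value:", np.mean(boston.target))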

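from sklearn.preprocessing import StandardScaler  # Import the standardization module from sklearn.preprocessing

ss_x = StandardScaler()  # Separate scalers for the features and for the target
ss_y = StandardScaler()

x_train = ss_x.fit_transform(x_train)  # Fit each scaler on the training data only, then apply it to train and test
x_test = ss_x.transform(x_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))  # StandardScaler expects 2-D input, hence the reshape
y_test = ss_y.transform(y_test.reshape(-1, 1))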

from sklearn.svm import SVR  # Import the support vector machine (regression) model from sklearn.svm

linear_svr = SVR(kernel='linear')  # Train an SVR with the linear kernel and predict on the test samples
linear_svr.fit(x_train, y_train.ravel())
linear_svr_predict = linear_svr.predict(x_test)

poly_svr = SVR(kernel='poly')  # Train an SVR with the polynomial kernel and predict on the test samples
poly_svr.fit(x_train, y_train.ravel())
poly_svr_predict = poly_svr.predict(x_test)

rbf_svr = SVR(kernel='rbf')  # Train an SVR with the radial basis function kernel and predict on the test samples
rbf_svr.fit(x_train, y_train.ravel())
rbf_svr_predict = rbf_svr.predict(x_test)

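'''
Next, we evaluate the regression performance on the test set of the SVR models built with each kernel.
The three evaluations show that, on the same test set and with default hyperparameters, the kernels differ substantially in performance.
In this run the RBF kernel, which maps the features nonlinearly, gives the best regression performance of the three.
'''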

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error  # Evaluate the three SVR models on the same test set with R-squared, MSE and MAE

# Predictions are 1-D; recent scikit-learn versions require 2-D input for inverse_transform, hence reshape(-1, 1).
print('linear SVR default score:', linear_svr.score(x_test, y_test))
print('linear SVR R-squared:', r2_score(y_test, linear_svr_predict))
print('linear SVR mean squared error:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(linear_svr_predict.reshape(-1, 1))))
print('linear SVR mean absolute error:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(linear_svr_predict.reshape(-1, 1))), "\n******************************************************")

print('poly SVR default score:', poly_svr.score(x_test, y_test))
print('poly SVR R-squared:', r2_score(y_test, poly_svr_predict))
print('poly SVR mean squared error:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(poly_svr_predict.reshape(-1, 1))))
print('poly SVR mean absolute error:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(poly_svr_predict.reshape(-1, 1))), "\n******************************************************")

print('rbf SVR default score:', rbf_svr.score(x_test, y_test))
print('rbf SVR R-squared:', r2_score(y_test, rbf_svr_predict))
print('rbf SVR mean squared error:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rbf_svr_predict.reshape(-1, 1))))
print('rbf SVR mean absolute error:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rbf_svr_predict.reshape(-1, 1))))
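
The listing above scales the features and the target by hand and then undoes the target scaling before computing MSE and MAE. As a rough sketch of one alternative, the same steps can be wrapped in scikit-learn's `make_pipeline` and `TransformedTargetRegressor`. The synthetic data from `make_regression` and the helper name `build_svr` are assumptions made for illustration (they are not part of the original script, and `load_boston` is unavailable in recent scikit-learn releases), so the numbers it prints are not comparable to the Boston results.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption): 506 samples and 13 features,
# matching only the shape of the Boston data, not its values.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33
)


def build_svr(kernel):
    """Hypothetical helper: scale features in a pipeline, scale the target in a wrapper."""
    return TransformedTargetRegressor(
        regressor=make_pipeline(StandardScaler(), SVR(kernel=kernel)),
        transformer=StandardScaler(),
    )


results = {}
for kernel in ("linear", "poly", "rbf"):
    model = build_svr(kernel).fit(X_train, y_train)
    pred = model.predict(X_test)  # already back in the original target units
    results[kernel] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }
    print(kernel, results[kernel])
```

Because `TransformedTargetRegressor` inverts the target scaling inside `predict`, the metrics here need no explicit `inverse_transform` call, and both scalers are still fitted on the training split only.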

Results:

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Max target value: 50.0
Min target value: 5.0
Mean target value: 22.532806324110677
linear SVR default score: 0.650659546421538
linear SVR R-squared: 0.650659546421538
linear SVR mean squared error: 27.088311013556027
linear SVR mean absolute error: 3.4328013877599624
******************************************************
poly SVR default score: 0.403650651025512
poly SVR R-squared: 0.403650651025512
poly SVR mean squared error: 46.241700531039
poly SVR mean absolute error: 3.7384073710465047
******************************************************
rbf SVR default score: 0.7559887416340947
rbf SVR R-squared: 0.7559887416340947
rbf SVR mean squared error: 18.920948861538715
rbf SVR mean absolute error: 2.6067819999501114
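
For every kernel the first two lines of output are identical, which is expected: the `score` method of a scikit-learn regressor returns the coefficient of determination, the same quantity `r2_score` computes. R-squared is also unaffected by standardizing the target, whereas MSE and MAE depend on the units, which is why the script converts back with `inverse_transform` before computing them. The sketch below checks both points on made-up numbers; the arrays `y_true` and `y_pred` are hypothetical and are not the Boston predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_true = rng.uniform(5.0, 50.0, size=100)         # hypothetical prices in $1000s
y_pred = y_true + rng.normal(0.0, 4.0, size=100)  # hypothetical predictions

# Standardize targets and predictions with the same scaler.
scaler = StandardScaler().fit(y_true.reshape(-1, 1))
z_true = scaler.transform(y_true.reshape(-1, 1)).ravel()
z_pred = scaler.transform(y_pred.reshape(-1, 1)).ravel()

# R^2 is unchanged when targets and predictions share one affine rescaling.
r2_raw = r2_score(y_true, y_pred)
r2_scaled = r2_score(z_true, z_pred)

# MSE is not: it shrinks by the square of the scaler's standard deviation.
mse_raw = mean_squared_error(y_true, y_pred)
mse_scaled = mean_squared_error(z_true, z_pred)

print(r2_raw, r2_scaled)
print(mse_raw, mse_scaled * scaler.scale_[0] ** 2)
```

Each printed pair should agree up to floating-point error, so an R-squared value can be reported from either scale, while an MSE computed on standardized targets has to be multiplied by the squared standard deviation (or recomputed after `inverse_transform`) to be read in the original units.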