Data download address:
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
The data can also be loaded with sklearn.datasets.load_boston (note: deprecated in scikit-learn 1.0 and removed in 1.2).
The dataset fields are described below:
CRIM: per capita crime rate by town.
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
NOX: nitric oxides concentration (parts per 10 million).
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built prior to 1940.
DIS: weighted distances to five Boston employment centres.
RAD: index of accessibility to radial highways.
TAX: full-value property-tax rate per $10,000.
PTRATIO: pupil-teacher ratio by town.
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
LSTAT: percentage of lower-status population.
MEDV: median value of owner-occupied homes, in $1000's.
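Because `load_boston` is gone from current scikit-learn, the raw file from the download address above (mirrored from CMU's StatLib) can be parsed directly. In that file each record spans two whitespace-separated lines: 11 values, then the remaining 3. A minimal sketch; the helper name `load_housing` is made up here, and the 22-line header skip and reshaping follow the replacement snippet suggested in scikit-learn's deprecation notice:

```python
import numpy as np
import pandas as pd

def load_housing(path):
    """Parse the StatLib/UCI housing file, where each record spans two
    whitespace-separated lines: 11 values, then the remaining 3."""
    raw = pd.read_csv(path, sep=r"\s+", skiprows=22, header=None)
    # even rows hold the first 11 columns, odd rows the last 3
    data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])  # 13 features
    target = raw.values[1::2, 2]                                  # MEDV
    return data, target

# usage (downloads over the network):
# data, target = load_housing("http://lib.stat.cmu.edu/datasets/boston")
```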
Code implementation
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV  # Lasso with built-in cross-validation, used to find the best alpha
import seaborn as sns  # visualization library
house = load_boston()
print(house.DESCR)  # view the dataset description
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

    - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
    - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
x = house.data    # feature matrix
y = house.target  # target vector
df = pd.DataFrame(x, columns=house.feature_names)  # build a DataFrame
df['Target'] = y  # append the target as a column
df.head()  # show the first five rows
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
# draw a heatmap; each cell is the correlation coefficient between two variables
plt.figure(figsize=(15,15))  # create a figure with a custom size
p = sns.heatmap(df.corr(), annot=True, square=True)  # draw the heatmap
# df.corr() gives the pairwise Pearson correlation of the columns, in [-1, 1]:
# values near -1 indicate strong negative correlation, values near 1 strong positive correlation
# sns.heatmap parameter reference: https://www.cntofu.com/book/172/docs/30.md
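As a sanity check on what the heatmap displays, the same Pearson correlations can be computed directly with `df.corr()`. A small sketch on synthetic data (the column names `pos`, `neg`, and `noise` are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "pos": 2 * x + rng.normal(scale=0.1, size=200),   # strongly positive
    "neg": -3 * x + rng.normal(scale=0.1, size=200),  # strongly negative
    "noise": rng.normal(size=200),                    # unrelated to x
})
corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]
print(corr.loc["x"].round(2))
```

The `x` row of the matrix shows values near 1 for `pos`, near -1 for `neg`, and near 0 for `noise`, which is exactly the pattern the heatmap colors encode.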
# data standardization: use fit_transform on training samples, transform on test samples
# X_train = ss.fit_transform(X_train)
# X_test = ss.transform(X_test)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X = ss.fit_transform(x)  # standardize to zero mean and unit variance
print(X[:4])
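The rule in the comment above matters because `StandardScaler` learns the mean and standard deviation of whatever it is fit on; fitting it on the full dataset before splitting leaks test-set statistics into training. A leakage-free sketch on synthetic data (shapes and values are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
labels = rng.normal(size=100)                         # synthetic targets

x_tr, x_te, y_tr, y_te = train_test_split(data, labels, test_size=0.3,
                                          random_state=0)
scaler = StandardScaler()
x_tr_s = scaler.fit_transform(x_tr)  # learn mean/std from training data only
x_te_s = scaler.transform(x_te)      # reuse training statistics on the test set
```

After this, the training features have exactly zero mean and unit variance, while the test features are merely close to that, since they were scaled with statistics estimated on a different sample.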
[[-0.41978194  0.28482986 -1.2879095  -0.27259857 -0.14421743  0.41367189
  -0.12001342  0.1402136  -0.98284286 -0.66660821 -1.45900038  0.44105193
  -1.0755623 ]
 [-0.41733926 -0.48772236 -0.59338101 -0.27259857 -0.74026221  0.19427445
   0.36716642  0.55715988 -0.8678825  -0.98732948 -0.30309415  0.44105193
  -0.49243937]
 [-0.41734159 -0.48772236 -0.59338101 -0.27259857 -0.74026221  1.28271368
  -0.26581176  0.55715988 -0.8678825  -0.98732948 -0.30309415  0.39642699
  -1.2087274 ]
 [-0.41675042 -0.48772236 -1.30687771 -0.27259857 -0.83528384  1.01630251
  -0.80988851  1.07773662 -0.75292215 -1.10611514  0.1130321   0.41616284
  -1.36151682]]
# split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# note: this splits the unscaled x, so the standardized X above is never fed to
# the model; to train on standardized features, split X instead (and, strictly,
# fit the scaler on the training split only). Also, without a fixed random_state
# the split, and hence the numbers below, vary between runs.
# create and fit the model
model = LassoCV()
model.fit(x_train, y_train)
# best regularization strength alpha chosen by cross-validation
print(model.alpha_)
0.75101819879345
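`model.alpha_` is not a model weight: `LassoCV` fits Lasso over a grid of candidate alphas (stored in `alphas_`), cross-validates each, records the per-fold mean squared errors in `mse_path_` (shape `(n_alphas, n_folds)`), and keeps the alpha with the lowest mean CV error. A sketch on synthetic data showing that `alpha_` is exactly the grid point minimizing the mean CV error:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# synthetic regression problem where only 3 of 10 features matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
model = LassoCV(cv=5).fit(X, y)

mean_mse = model.mse_path_.mean(axis=1)    # average CV error for each alpha
best = model.alphas_[np.argmin(mean_mse)]  # grid point with the lowest error
print(model.alpha_, best)
```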
# fitted regression coefficients (not correlation coefficients); the exact
# zeros mark features that Lasso has dropped
print(model.coef_)
[-0.06338152  0.05225382 -0.          0.         -0.          1.81419358
  0.01723948 -0.78293398  0.31906904 -0.01794047 -0.78602545  0.01093303
 -0.68469353]
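The exact zeros in `coef_` (here INDUS, CHAS, and NOX) are the point of Lasso: the L1 penalty drives the coefficients of weakly contributing features all the way to zero, so the model doubles as a feature selector. A sketch on synthetic data where only two of eight features actually drive the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# only features 0 and 3 drive the target; the other six are pure noise
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)
kept = np.flatnonzero(model.coef_)  # indices of the surviving features
print(kept)
```

The six noise features come out with exactly zero coefficients, while features 0 and 3 survive with shrunken but clearly nonzero weights.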
# coefficient of determination (R^2) on the test set
model.score(x_test, y_test)
0.6687838293074235
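For a regressor, `score` returns the coefficient of determination, R² = 1 - SS_res/SS_tot: the fraction of the target's variance explained by the model (1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and it can go negative for very poor models). Computed by hand on a toy example and checked against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2, r2_score(y_true, y_pred))
```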