机器学习数据集之波士顿房价

最新推荐文章于 2024-09-20 21:25:03 发布

fanyamin

最新推荐文章于 2024-09-20 21:25:03 发布

阅读量2k

点赞数 2

本文链接：https://blog.csdn.net/fanyamin/article/details/103160828

版权

该数据集包含美国人口普查局收集的美国马萨诸塞州波士顿住房价格的有关信息, 数据集很小，只有506个案例。

数据集都有以下14个属性:

CRIM--城镇人均犯罪率
ZN - 占地面积超过25,000平方英尺的住宅用地比例。
INDUS - 每个城镇非零售业务的比例。
CHAS - Charles River虚拟变量（如果是河道，则为1;否则为0）
NOX - 一氧化氮浓度（每千万份）
RM - 每间住宅的平均房间数
AGE - 1940年以前建造的自住单位比例
DIS加权距离波士顿的五个就业中心
RAD - 径向高速公路的可达性指数
TAX - 每10,000美元的全额物业税率
PTRATIO - 城镇的学生与教师比例
B - 1000（Bk - 0.63）^ 2其中Bk是城镇黑人的比例
LSTAT - 人口状况下降％
MEDV - 自有住房的中位数报价, 单位1000美元

from sklearn.datasets import load_boston
import pandas as pd

import matplotlib.pyplot as plt

from sklearn import datasets
from pandas.plotting import scatter_matrix

boston = load_boston()

print('--- %s ---' % 'boston type')
print(type(boston))
print('--- %s ---' % 'boston keys')
print(boston.keys())
print('--- %s ---' % 'boston data')
print(type(boston.data))

print('--- %s ---' % 'boston target')
print(type(boston.target))
print('--- %s ---' % 'boston data shape')
print(boston.data.shape)

print('--- %s ---' % 'boston feature names')
print(boston.feature_names);


X = boston.data
y = boston.target
df = pd.DataFrame(X, columns= boston.feature_names)

print('--- %s ---' % 'df.head')
print(df.head())
print('--- %s ---' % 'df.info')
print(df.info())
print('--- %s ---' % 'df.describe')
print(df.describe())

输出:

--- boston type ---
<class 'sklearn.utils.Bunch'>
--- boston keys ---
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
--- boston data ---
<class 'numpy.ndarray'>
--- boston target ---
<class 'numpy.ndarray'>
--- boston data shape ---
(506, 13)
--- boston feature names ---
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
--- df.head ---
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  
--- df.info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
--- df.describe ---
             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.647423   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   
std     28.148861    2.105710    8.707259  168.537116    2.164946   91.294864   
min      2.900000    1.129600    1.000000  187.000000   12.600000    0.320000   
25%     45.025000    2.100175    4.000000  279.000000   17.400000  375.377500   
50%     77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   
75%     94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   
max    100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   

            LSTAT  
count  506.000000  
mean    12.653063  
std      7.141062  
min      1.730000  
25%      6.950000  
50%     11.360000  
75%     16.955000  
max     37.970000