import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
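Loading the dataset and examining its characteristics
Field meanings and observations
- longitude: longitude
- latitude: latitude
- housing_median_age: median age of the houses
- total_rooms: total number of rooms
- total_bedrooms: total number of bedrooms
- population: population
- households: number of households
- median_income: median income
- median_house_value: median house value
- ocean_proximity: proximity to the ocean (categorical)
Loading the data
# Load the data; the original dataset is available at https://github.com/ageron/handson-ml/tree/master/datasets/housing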
housing_df = pd.read_csv('https://query.data.world/s/yffqqcx3rsjlzspztxr6zt5iqd45kn')
# Inspect the structure of the data
housing_df.head(10)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.1200 | 241400.0 | NEAR BAY |
8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |
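The same `pd.read_csv` call also accepts a local path or an in-memory buffer, which can help when the remote URL is unreachable. The following is a minimal sketch under that assumption: `csv_text` and `preview_df` are illustrative names, and the two data rows are copied from the preview above.

```python
import io

import pandas as pd

# Two rows copied from the preview above, written out as CSV text.
# csv_text and preview_df are hypothetical names used only for this sketch.
csv_text = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
"""

# read_csv accepts any file-like object, so a StringIO buffer stands in for a file.
preview_df = pd.read_csv(io.StringIO(csv_text))

# Quick structural checks: number of rows and columns, then the column names.
print(preview_df.shape)
print(list(preview_df.columns))
```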
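Descriptive statistics of the data
Data attributes
# Inspect the DataFrame summary
# 1. total_bedrooms has missing values (20433 non-null out of 20640 rows, so 207 are missing); every other column is complete.
# 2. Every column except ocean_proximity is float64; ocean_proximity is an object column, so its category distribution is examined next.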
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
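The gap in total_bedrooms can also be counted directly rather than read off the `info()` output. The sketch below uses a small made-up frame (`sample_df`, a hypothetical stand-in for `housing_df`) so it runs without a download. Filling with the column median is one common option and an assumption here, not a step this notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for housing_df; one total_bedrooms value is missing.
sample_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Count missing values per column.
missing_counts = sample_df.isna().sum()
print(missing_counts)

# Assumed strategy: replace missing total_bedrooms with the column median.
bedrooms_median = sample_df["total_bedrooms"].median()
filled_df = sample_df.fillna({"total_bedrooms": bedrooms_median})
print(filled_df)
```

On the real data, `housing_df.isna().sum()` should report 207 for total_bedrooms, given the counts shown above.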
# Distribution of the ocean_proximity categories
housing_df['ocean_proximity'].value_counts()
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
Category meanings:
- <1H OCEAN: less than one hour from the ocean
- INLAND: inland
- NEAR OCEAN: near the ocean
- NEAR BAY: near the bay
- ISLAND: on an island
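Raw counts are easier to compare as shares. The sketch below rebuilds the counts printed above as a Series (`category_counts` is an illustrative name) and divides by the total. On the full frame, `housing_df['ocean_proximity'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
category_counts = pd.Series({
    "<1H OCEAN": 9136,
    "INLAND": 6551,
    "NEAR OCEAN": 2658,
    "NEAR BAY": 2290,
    "ISLAND": 5,
})

# Share of each category among all rows.
category_shares = category_counts / category_counts.sum()
print(category_shares.round(4))
```

ISLAND covers only 5 of the 20640 rows, so any statistic computed for that category alone rests on very little data.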
Summary statistics
# Summary statistics for the numeric columns
housing_df.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
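The maxima in this table (500001 for median_house_value, 52 for housing_median_age) look like round upper limits, which may mean the values were capped. That is an inference from the summary, not something the notebook states. One way to check is to measure how many rows sit exactly at each column's maximum. The sketch below does this on a small frame with made-up values (`sample_df` is a hypothetical stand-in for `housing_df`).

```python
import pandas as pd

# Hypothetical values chosen so that several rows hit the column maximum.
sample_df = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
    "median_house_value": [452600.0, 358500.0, 500001.0, 341300.0, 500001.0],
})

# Compare every value with its column maximum, then average the booleans
# to get the fraction of rows at the maximum for each column.
at_max_share = sample_df.eq(sample_df.max(), axis="columns").mean()
print(at_max_share)
```

A large share at the maximum on the real data would support the capping reading; a share near 1/20640 would not.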
Data distributions
# Histogram of each numeric attribute
housing_df.hist(bins=50, figsize=(20,15))
[Figure: 3x3 grid of histograms, one per numeric column, 50 bins each]
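With an integer `bins` argument, each histogram splits the column's range into that many equal-width bins and counts the rows in each. The sketch below reproduces only that counting step with `np.histogram`, using the housing_median_age values from the ten rows previewed earlier and 5 bins instead of 50 to keep the output short. The variable names are illustrative.

```python
import numpy as np

# housing_median_age values from the first ten rows shown in the preview.
ages = np.array([41.0, 21.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 42.0, 52.0])

# Split the range [min, max] into 5 equal-width bins and count values per bin.
bin_counts, bin_edges = np.histogram(ages, bins=5)

print(bin_counts)  # number of values in each bin
print(bin_edges)   # 6 edges delimit the 5 bins
```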
Geographic data visualization
Reference: https://www.bigendiandata.com/2017-06-27-Mapping_in_Jupyter/
housing_df.plot(kind='scatter', x='longitude', y='latitude',
s=housing_df['population']/100, c='median_house_value', cmap=plt.get_cmap('jet'),
colorbar=True, alpha=0.1, figsize=