California Housing Price Prediction Model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Using matplotlib backend: MacOSX

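Load the dataset and inspect its characteristics

Field meanings and observations

  • longitude: longitude
  • latitude: latitude
  • housing_median_age: median age of the houses
  • total_rooms: total number of rooms
  • total_bedrooms: total number of bedrooms
  • population: population
  • households: number of households
  • median_income: median income
  • median_house_value: median house value
  • ocean_proximity: proximity to the ocean (a categorical field)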

Loading the data

# Load the data; the original dataset is available at https://github.com/ageron/handson-ml/tree/master/datasets/housing
housing_df = pd.read_csv('https://query.data.world/s/yffqqcx3rsjlzspztxr6zt5iqd45kn')
# Inspect the structure of the data
housing_df.head(10)
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
5    -122.25     37.85                52.0        919.0           213.0       413.0       193.0         4.0368            269700.0         NEAR BAY
6    -122.25     37.84                52.0       2535.0           489.0      1094.0       514.0         3.6591            299200.0         NEAR BAY
7    -122.25     37.84                52.0       3104.0           687.0      1157.0       647.0         3.1200            241400.0         NEAR BAY
8    -122.26     37.84                42.0       2555.0           665.0      1206.0       595.0         2.0804            226700.0         NEAR BAY
9    -122.25     37.84                52.0       3549.0           707.0      1551.0       714.0         3.6912            261100.0         NEAR BAY
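
Since the loading step depends on a remote URL, it can help to wrap it in a small function that also checks the columns listed earlier. The following is only a sketch: `load_housing` and `EXPECTED_COLUMNS` are hypothetical names, not part of the original notebook, and the in-memory sample simply reuses the first two rows shown above so the example runs offline.

```python
import io

import pandas as pd

# Column names as listed in the article; used to sanity-check the loaded data.
EXPECTED_COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value", "ocean_proximity",
]


def load_housing(source):
    """Read the housing CSV from a path, URL, or file-like object.

    Hypothetical helper: raises ValueError if an expected column is absent.
    """
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df


# Two rows copied from the head() output, so no network access is needed.
SAMPLE_CSV = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY\n"
    "-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY\n"
)

sample_df = load_housing(io.StringIO(SAMPLE_CSV))
print(sample_df.shape)
```

The same function could be pointed at the URL used above or at a local copy of the CSV; the column check then fails early if the source ever changes shape.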

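Descriptive statistics

Data attributes
# Inspect the data
# 1. total_bedrooms is the only column with missing values (20433 non-null out of 20640 rows); all other columns are complete.
# 2. Every column except ocean_proximity is float64; the next step looks at how the ocean_proximity category values are distributed.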
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
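
The info() output implies 20640 - 20433 = 207 missing values in total_bedrooms. A minimal sketch of how that count could be confirmed, and of one common way to fill the gaps, follows. The toy frame and its values are made up for illustration, and median imputation is an assumption here, not a choice the article makes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Number of missing values per column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# One possible treatment (an assumption): fill gaps with the column median.
bedrooms_median = toy_df["total_bedrooms"].median()
filled_df = toy_df.assign(
    total_bedrooms=toy_df["total_bedrooms"].fillna(bedrooms_median)
)
print(filled_df)
```

Running `housing_df.isnull().sum()` on the real data should report the gap for total_bedrooms and zero for the other columns, in line with the non-null counts above.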
# Distribution of the ocean_proximity category values
housing_df['ocean_proximity'].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

Category meanings:

  • <1H OCEAN: less than one hour from the ocean
  • INLAND: inland
  • NEAR OCEAN: near the ocean
  • NEAR BAY: near the bay
  • ISLAND: on an island
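
To put the category counts in proportion, and to show how a text column like this is often prepared for a model, here is a small sketch. The counts are copied from the value_counts() output above; one-hot encoding with `pd.get_dummies` is an assumption about a later modelling step, which the article does not show.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "<1H OCEAN": 9136,
        "INLAND": 6551,
        "NEAR OCEAN": 2658,
        "NEAR BAY": 2290,
        "ISLAND": 5,
    },
    name="ocean_proximity",
)

# Share of each category among all rows.
shares = (counts / counts.sum()).round(4)
print(shares)

# One-hot encoding on a toy column (illustrative values only).
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "INLAND"]})
encoded = pd.get_dummies(toy, columns=["ocean_proximity"])
print(encoded.columns.tolist())
```

ISLAND accounts for only 5 of the 20640 rows, so a random train/test split may leave that category out of one side entirely.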
Summary statistics
# Summary statistics
housing_df.describe()
          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
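
In the describe() output, median_house_value tops out at 500001.0 and housing_median_age at 52.0. Round-looking maxima like these can be a sign that values were capped, although that is an inference and not something the article states. A quick way to probe it is to measure how many rows sit exactly at the maximum; `share_at_max` below is a hypothetical helper and the toy series is illustrative.

```python
import pandas as pd


def share_at_max(series):
    """Fraction of non-null values that equal the column maximum."""
    values = series.dropna()
    return float((values == values.max()).mean())


# Toy series standing in for housing_df['median_house_value'].
toy_values = pd.Series(
    [452600.0, 500001.0, 358500.0, 500001.0], name="median_house_value"
)
print(share_at_max(toy_values))
```

On the real data, `share_at_max(housing_df['median_house_value'])` returning a noticeably large fraction would support the capping hypothesis; a value near 1/20640 would argue against it.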
Data distribution
# Distribution of each numeric attribute
housing_df.hist(bins=50, figsize=(20,15))

[Figure: 3×3 grid of histograms, one per numeric attribute]

Visualizing the geographic data

Reference: https://www.bigendiandata.com/2017-06-27-Mapping_in_Jupyter/

housing_df.plot(kind='scatter', x='longitude', y='latitude', 
                s=housing_df['population']/100, c='median_house_value', cmap=plt.get_cmap('jet'),
                colorbar=True, alpha=0.1, figsize=
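
The call above is cut off in the source after `figsize=`. As a rough illustration of what the visible arguments do, here is a self-contained sketch on a toy frame built from the first rows of the dataset. The `figsize=(10, 7)` value, the higher `alpha`, and the Agg backend are assumptions made so the example runs anywhere; they are not the article's settings.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd

# Toy frame built from the first rows of the head() output.
toy_df = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24, -122.25],
    "latitude": [37.88, 37.86, 37.85, 37.85],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
})

ax = toy_df.plot(
    kind="scatter", x="longitude", y="latitude",
    s=toy_df["population"] / 100,  # marker size grows with population
    c="median_house_value",        # colour encodes the median house value
    cmap="jet",
    colorbar=True,
    alpha=0.4,                     # assumed; higher than 0.1 so four points stay visible
    figsize=(10, 7),               # assumed; the original value is truncated
)

# Render to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

With the full dataset, a low `alpha` such as the 0.1 used above makes dense areas stand out, because overlapping translucent markers accumulate into darker regions.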