Study notes: Build a model of housing prices (test_set)

1 Get the Data

California census data url:
https://github.com/ageron/handson-ml2/blob/master/datasets/housing/README.md

import pandas as pd
import os

data_path = r'C:\Users\cainm\datasets\housing'
file_name = 'housing.csv'
def load_data(path=data_path, name=file_name):
    # join the directory and file name, then read the CSV into a DataFrame
    housing_path = os.path.join(path, name)
    return pd.read_csv(housing_path)

housing = load_data()
housing.head()

Quick look

housing.columns

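Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Meaning of the columns: longitude, latitude, median house age, total rooms, total bedrooms, population, households, median income, median house value, proximity to the ocean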

housing.info()

20,640 rows in total. Every column except total_bedrooms is fully non-null, so keep that in mind for the data cleaning step later.
ocean_proximity is non-numeric.
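
The two observations above can also be checked directly instead of reading them off the info() output. A minimal sketch, assuming a tiny made-up DataFrame (`sample`, hypothetical values) in place of the real housing data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Names of the columns whose dtype is not numeric.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real data the same two calls would be run on `housing`.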

housing['ocean_proximity'].value_counts()


housing.describe()


%matplotlib inline
import matplotlib.pyplot as plt

housing.hist(bins=50,figsize=(20,15))
plt.show()

Most of the distributions are right-skewed with long tails; the 80/20 rule seems to apply everywhere.
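
Skew can also be quantified rather than judged from the histograms. A minimal sketch, assuming synthetic columns (`right_skewed` and `symmetric` are made-up names and data, not columns of the housing set) and using pandas' `skew()`, where a clearly positive value indicates a long right tail:

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: one long-tailed column, one symmetric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.5, size=2000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=2000),
})

# Sample skewness per column: positive means a longer right tail.
skewness = toy.skew()
print(skewness)
```

On the real data, `housing.skew(numeric_only=True)` would give the per-column values.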

housing.head(10)

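A closer look at the details:
1 longitude and latitude look fine.
2 median_income is the median income level, in units of $10,000. For example, the first row's 8.3252 is actually $83,252, roughly 500,000 CNY, which seems fair.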

2 Create a test set

import numpy as np

def split_train_test(data,test_ratio):
	m = len(data)
	shuffled_indices = np.random.permutation(m) # shuffle the row indices
	test_set_size = int(m*test_ratio)
	train_set_indices = shuffled_indices[test_set_size:]
	test_set_indices = shuffled_indices[:test_set_size]
	train_set = data.iloc[train_set_indices]
	test_set = data.iloc[test_set_indices]
	return train_set, test_set

train_set, test_set = split_train_test(housing,0.2)
train_set

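Calling np.random.seed() before the split makes the result reproducible:
np.random.seed(42)
There is still a problem, though: if the dataset changes, there is no unique index guaranteeing that existing rows keep the same index when new data is added.
One option is to combine longitude and latitude into a unique identifier and use that as the index.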
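
The identifier idea can be sketched as follows: hash each row's identifier and send the row to the test set when the hash falls in the lowest 20% of the hash range, so a row's assignment depends only on its own identifier. This is a minimal sketch under stated assumptions: the helper names (`is_in_test_set`, `split_train_test_by_id`), the `id` formula, and the toy coordinates are all illustrative, not taken from the notes above.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_in_test_set(identifier, test_ratio):
    # Hash the identifier (truncated to an integer) into [0, 2**32)
    # and keep the rows whose hash lands in the lowest test_ratio share.
    hashed = crc32(np.int64(identifier).tobytes()) & 0xFFFFFFFF
    return hashed < test_ratio * 2**32


def split_train_test_by_id(data, test_ratio, id_column):
    in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]


# Hypothetical toy coordinates standing in for the housing data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "longitude": rng.uniform(-124.0, -114.0, size=1000).round(2),
    "latitude": rng.uniform(32.0, 42.0, size=1000).round(2),
})
# Assumed identifier: a simple combination of longitude and latitude.
toy["id"] = toy["longitude"] * 1000 + toy["latitude"]

train_part, test_part = split_train_test_by_id(toy, 0.2, "id")
print(len(train_part), len(test_part))
```

Because the decision is made row by row, appending new rows never moves an existing row from one set to the other. The test set is only approximately 20% of the data, and rows sharing the same identifier always land in the same set.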

The Scikit-Learn way:

from sklearn.model_selection import train_test_split

train_set,test_set = train_test_split(housing, test_size=0.2, random_state=42)

train_set

The result is effectively the same kind of random 80/20 split; random_state is one of the parameters and fixes the seed.
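
A quick way to see what random_state does is to split the same data twice with the same seed and compare the results. A minimal sketch, assuming a made-up 100-row DataFrame (`toy`) instead of the housing data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a single column.
toy = pd.DataFrame({"value": np.arange(100)})

# Two independent calls with the same seed.
first_train, first_test = train_test_split(toy, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(toy, test_size=0.2, random_state=42)

print(len(first_train), len(first_test))
# True when both calls selected the same rows in the same order.
print(first_test.index.equals(second_test.index))
```

This only guarantees a stable split while the dataset itself stays unchanged.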

Everything above is purely random sampling, which is fine when the dataset is large. With a small dataset it risks introducing sampling bias.
Next, use stratified sampling:
the population is divided into homogeneous subgroups called strata,
and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.

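Suppose median_income is a strong predictor of house prices. We then want the test set to represent the full dataset as closely as possible in terms of income. As a first step, bin income into 5 categories.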

housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3., 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing['income_cat'].hist()
plt.show()


from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# n_splits=1, so the loop body runs exactly once
for train_index, test_index in split.split(housing, housing['income_cat']):
    train_set = housing.iloc[train_index]
    test_set = housing.iloc[test_index]

# proportion of each income category in the test set
test_set['income_cat'].value_counts()/len(test_set)
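
To judge whether stratification helps, the category proportions of a stratified test set and a purely random one can be compared against the full dataset. A minimal sketch, assuming synthetic log-normal incomes (made-up data, not the real median_income column) and a hypothetical helper `income_cat_proportions`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical synthetic incomes, right-skewed like the real column.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3., 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])

# Stratified test set (20%), stratified on the income category.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]

# Purely random test set of the same size, for comparison.
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)


def income_cat_proportions(data):
    # Share of rows falling into each income category.
    return data["income_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": income_cat_proportions(toy),
    "stratified": income_cat_proportions(strat_test),
    "random": income_cat_proportions(random_test),
})
comparison["stratified_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions almost exactly, while the random column can drift by an amount that depends on the seed and the sample size.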
