1 Get the Data
California census data URL:
https://github.com/ageron/handson-ml2/blob/master/datasets/housing/README.md
import pandas as pd
import os
data_path = r'C:\Users\cainm\datasets\housing'
file_name = 'housing.csv'
def load_data(path=data_path, name=file_name):
    housing_path = os.path.join(path, name)
    return pd.read_csv(housing_path)
housing = load_data()
housing.head()
Quick look
housing.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Longitude, latitude, housing median age, total rooms, total bedrooms, population, households, median income, median house value, ocean proximity
housing.info()
20,640 rows in total; every column except total_bedrooms is non-null, which we need to watch for during data cleaning later;
ocean_proximity is non-numeric
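To quantify the missing values directly (a quick check, not in the original notes):
housing['total_bedrooms'].isnull().sum()  # expect 207 missing (20640 - 20433 non-null)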
housing['ocean_proximity'].value_counts()
housing.describe()
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
Basically every distribution is right-skewed with a long tail; the 80/20 rule applies everywhere
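As an aside, one common remedy for long right tails is a log transform; a minimal sketch using total_rooms (the choice of feature is illustrative):
import numpy as np  # numpy is also imported again further below
np.log1p(housing['total_rooms']).hist(bins=50)  # log(1 + x) compresses the tail
plt.show()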
housing.head(10)
A closer look at the details:
1 Longitude and latitude look fine.
2 median_income is the median income in units of $10,000, so the first value, 8.3252, is really $83,252, roughly 500K CNY, which seems fair (see the sanity check below).
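A quick check of that unit scaling (illustrative, assuming the ×$10,000 factor):
(housing['median_income'] * 10_000).head(1)  # first row: 83252.0, i.e. $83,252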
2 Create a test set
import numpy as np
def split_train_test(data, test_ratio):
    m = len(data)
    shuffled_indices = np.random.permutation(m)  # shuffle the indices
    test_set_size = int(m * test_ratio)
    train_set_indices = shuffled_indices[test_set_size:]
    test_set_indices = shuffled_indices[:test_set_size]
    train_set = data.iloc[train_set_indices]
    test_set = data.iloc[test_set_indices]
    return train_set, test_set
train_set, test_set = split_train_test(housing,0.2)
train_set
Adding np.random.seed() on top makes the split reproducible:
np.random.seed(42)
But there is still a problem: if the data changes, there is no unique index guaranteeing that newly added rows get unique ids or that existing rows keep theirs;
we can try building a unique index from longitude and latitude, as sketched below;
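A minimal sketch of that idea, following the hash-based split from the book; housing_with_id and the id column are illustrative names:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    # Put a row in the test set if the hash of its id falls in the lowest
    # test_ratio fraction of the 32-bit hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Build a (mostly) unique, stable id from longitude and latitude
housing_with_id = housing.copy()
housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'id')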
The Scikit-Learn way:
from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(housing, test_size=0.2, random_state=42)
train_set
The results are consistent in practice; the parameters include random_state
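As a quick reproducibility check (not in the original notes), two calls with the same random_state return identical splits:
a_train, a_test = train_test_split(housing, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(housing, test_size=0.2, random_state=42)
assert a_test.index.equals(b_test.index)  # same rows, same order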
Everything above is random sampling, which works well when the dataset is large; on a small dataset it introduces sampling bias;
next, use stratified sampling:
the population is divided into homogeneous subgroups called strata,
and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population
Suppose house prices are driven mostly by median_income; then we want the test set to represent the dataset's income distribution as well as possible. Start by simply cutting income into 5 categories:
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3., 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing['income_cat'].hist()
plt.show()
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1,
                               test_size=0.2,
                               random_state=42)
for train_index, test_index in split.split(housing,
                                           housing['income_cat']):
    train_set = housing.iloc[train_index]
    test_set = housing.iloc[test_index]
# strata proportions
test_set['income_cat'].value_counts()/len(test_set)
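For comparison, the overall proportions, mirroring the book's check; the test-set proportions above should match these closely:
housing['income_cat'].value_counts() / len(housing)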