1 Get the Data
California census data URL:
https://github.com/ageron/handson-ml2/blob/master/datasets/housing/README.md
import pandas as pd
import os
data_path = r'C:\Users\cainm\datasets\housing'
file_name = 'housing.csv'
def load_data(path=data_path, name=file_name):
    housing_path = os.path.join(path, name)
    return pd.read_csv(housing_path)
housing = load_data()
housing.head()
Quick look
housing.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Longitude, latitude, housing median age, total rooms, total bedrooms, population, households, median income, median house value, ocean proximity
housing.info()
20,640 rows in total; every column except total_bedrooms is non-null, which we need to watch for during data cleaning later;
ocean_proximity is non-numeric
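To quantify the missing values directly (a quick check, not in the original notes):
housing['total_bedrooms'].isnull().sum()  # expect 207 missing (20640 - 20433 non-null)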
housing['ocean_proximity'].value_counts()
housing.describe()
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
Basically every distribution is right-skewed with a long tail; the 80/20 rule applies everywhere
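As an aside, one common remedy for long right tails is a log transform; a minimal sketch using total_rooms (the choice of feature is illustrative):
import numpy as np  # numpy is also imported again further below
np.log1p(housing['total_rooms']).hist(bins=50)  # log(1 + x) compresses the tail
plt.show()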
housing.head(10)
A closer look at the details:
1 Longitude and latitude look fine.
2 median_income is the median income in units of $10,000, so the first value, 8.3252, is really $83,252, roughly 500K CNY, which seems fair (see the sanity check below).
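A quick check of that unit scaling (illustrative, assuming the ×$10,000 factor):
(housing['median_income'] * 10_000).head(1)  # first row: 83252.0, i.e. $83,252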
2 Create a test set
import numpy as np
def split_train_test(data, test_ratio):
    m = len(data)
    shuffled_indices = np.random.permutation(m)  # shuffle the indices
    test_set_size = int(m * test_ratio)
    train_set_indices = shuffled_indices[test_set_size:]
    test_set_indices = shuffled_indices[:test_set_size]
    train_set = data.iloc[train_set_indices]
    test_set = data.iloc[test_set_indices]
    return train_set, test_set
train_set, test_set = split_train_test(housing,0.2)
train_set
Adding np.random.seed() on top makes the split reproducible:
np.random.seed(42)
But there is still a problem: if the data changes, there is no unique index guaranteeing that newly added rows get unique ids or that existing rows keep theirs;
we can try building a unique index from longitude and latitude, as sketched below;
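A minimal sketch of that idea, following the hash-based split from the book; housing_with_id and the id column are illustrative names:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    # Put a row in the test set if the hash of its id falls in the lowest
    # test_ratio fraction of the 32-bit hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Build a (mostly) unique, stable id from longitude and latitude
housing_with_id = housing.copy()
housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'id')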
The Scikit-Learn way:
from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(housing, test_size=0.2, random_state=42)
train_set
The results are consistent in practice; the parameters include random_state
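As a quick reproducibility check (not in the original notes), two calls with the same random_state return identical splits:
a_train, a_test = train_test_split(housing, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(housing, test_size=0.2, random_state=42)
assert a_test.index.equals(b_test.index)  # same rows, same order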
Everything above is random sampling, which works well when the dataset is large; on a small dataset it introduces sampling bias;
next, use stratified sampling:
the population is divided into homogeneous subgroups called strata,
and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population
Suppose house prices are driven mostly by median_income; then we want the test set to represent the dataset's income distribution as well as possible. Start by simply cutting income into 5 categories:
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3., 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing['income_cat'].hist()
plt.show()
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1,
                               test_size=0.2,
                               random_state=42)
for train_index, test_index in split.split(housing,
                                           housing['income_cat']):
    train_set = housing.iloc[train_index]
    test_set = housing.iloc[test_index]
# strata proportions
test_set['income_cat'].value_counts()/len(test_set)
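For comparison, the overall proportions, mirroring the book's check; the test-set proportions above should match these closely:
housing['income_cat'].value_counts() / len(housing)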