1. Problem Description
This example is based on the exercises for Chapter 2, "End-to-End Machine Learning Project", of Hands-On Machine Learning with Scikit-Learn and TensorFlow. In the book the solutions are interleaved with the main text, so I reorganized them and added a few things of my own (hence no guarantee that it is free of errors). Read it critically if you find it useful; it also doubles as a note to myself. 😃
The dataset is master/datasets/housing from https://github.com/ageron/handson-ml; in short, the task is to predict a house's price from its attributes. The data file contains 10 fields. Specifically, the features are
X = {'longitude', 'latitude', 'housing_median_age',
'total_rooms', 'total_bedrooms', 'population',
'households', 'median_income', 'ocean_proximity'},
where 'ocean_proximity' is a text field describing how close the house is to the ocean; it takes only 5 values: {'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'}.
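As a quick illustration of what happens to this text field later in the pipeline, here is a small sketch using scikit-learn's LabelBinarizer (which the code below wraps): the 5 categories become 5 one-hot indicator columns.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

cats = np.array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])
lb = LabelBinarizer()
onehot = lb.fit_transform(cats)
print(lb.classes_)   # the categories, sorted alphabetically
print(onehot.shape)  # (5, 5): one row per sample, one indicator column per category
```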
The label is
Y = {'median_house_value'},
which is continuous, so this is a multivariate regression problem. A rough schematic of the problem is shown below:
2. Pipeline Flow Diagram
The middle of the figure above is the flow diagram (pipeline) of this end-to-end machine-learning model; the 6 boxes on either side are the pipeline's components, i.e. the operations of the model. A few points to note:
(1) The categorical data in this dataset contains no anomalous values, so there is no outlier-handling component;
(2) Imputer (renamed SimpleImputer in scikit-learn ≥ 0.22) and StandardScaler are classes provided by scikit-learn (not by Python itself); the other classes have to be written by hand;
(3) Using the LabelBinarizer class directly inside a Pipeline raises an error (for the reason, see https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize );
the fix is to wrap it in a slightly modified class, MyLabelBinarizer (see the code below).
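A minimal reproduction of that error (the variable names here are illustrative): Pipeline forwards both X and y to the final step's fit_transform, but LabelBinarizer.fit_transform accepts only a single data argument, so the call fails with a TypeError.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

X = np.array(['NEAR BAY', 'INLAND', 'ISLAND'])
y = np.array([1.0, 2.0, 3.0])
pipe = Pipeline([('binarizer', LabelBinarizer())])
try:
    pipe.fit_transform(X, y)  # Pipeline passes (X, y); LabelBinarizer expects one argument
except TypeError as e:
    print('TypeError:', e)
```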
[For more details, please refer to the original book or the code below.]
3. Example Code
import time

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer  # 'Imputer' was removed in scikit-learn 0.22
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler

#### Read the data and split it into a training set and a test set
data = pd.read_csv('../handsOn/datasets/housing/housing.csv')
housing = data.drop("median_house_value", axis=1)
housing_labels = data["median_house_value"].copy()

# Simple sequential 80/20 split (the book uses a shuffled split);
# .iloc replaces the removed .ix indexer and avoids the off-by-one
# overlap that label-based slicing with .ix[0:L] produced
L = int(0.8 * len(housing))
train_x = housing.iloc[:L]
train_y = housing_labels.iloc[:L]
test_x = housing.iloc[L:]
test_y = housing_labels.iloc[L:]

cat_attribute = ['ocean_proximity']
housing_num = housing.drop(cat_attribute, axis=1)
num_attribute = list(housing_num)

start = time.perf_counter()  # time.clock() was removed in Python 3.8

# Build the pipeline
# Transformer that adds the derived features
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''X is an array'''
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# Transformer that selects columns by name
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''X is a DataFrame'''
        return X[self.attribute_names].values

# One-hot encode the categorical attribute; a plain LabelBinarizer breaks
# inside a Pipeline because its fit_transform takes only one data argument
class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = LabelBinarizer()

    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=None):
        return self.encoder.transform(x)

# Pipeline for the numerical features.
# Every step except the last must be a transformer; the last estimator can be of any type.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribute)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribute_adder', CombinedAttributeAdder()),
    ('std_scaler', StandardScaler())
])

# Pipeline for the categorical feature
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribute)),
    ('label_binarizer', MyLabelBinarizer())
])

# Combine the two feature groups with FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

# Feature selection: rank features by random-forest importance and keep the top k.
# Note: the attribute must keep the same name as the __init__ parameter,
# otherwise get_params()/set_params() (and hence GridSearchCV) cannot reach it.
class TopFeaturesSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importance_k=5):
        self.feature_importance_k = feature_importance_k
        self.top_k_attributes = None

    def fit(self, X, y):
        reg = RandomForestRegressor()
        reg.fit(X, y)
        feature_importance = reg.feature_importances_
        self.top_k_attributes = np.argsort(-feature_importance)[:self.feature_importance_k]
        return self

    def transform(self, X, **fit_params):
        return X[:, self.top_k_attributes]

# Pipeline that preprocesses the data and keeps the k most important features
prepare_and_top_feature_pipeline = Pipeline([
    ('full_pipeline', full_pipeline),
    ('feature_selector', TopFeaturesSelector(feature_importance_k=5))
])

# Find good random-forest parameters with GridSearchCV
train_x_ = full_pipeline.fit_transform(train_x)
# Tree model, select best parameter with GridSearchCV
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [8, 10]
}
reg = RandomForestRegressor()
grid_search = GridSearchCV(reg, param_grid=param_grid, cv=5)
grid_search.fit(train_x_, train_y)

# Final preprocessing-plus-prediction pipeline
prepare_and_predict_pipeline = Pipeline([
    ('prepare', prepare_and_top_feature_pipeline),
    ('random_forest', RandomForestRegressor(**grid_search.best_params_))
])

# Use GridSearchCV on the full pipeline to pick the best pipeline parameters
param_grid2 = {'prepare__feature_selector__feature_importance_k': [1, 3, 5, 10],
               'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent']}
grid_search2 = GridSearchCV(prepare_and_predict_pipeline, param_grid=param_grid2, cv=2,
                            scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search2.fit(train_x, train_y)
pred = grid_search2.predict(test_x)

end = time.perf_counter()
print('RMSE on test set={}'.format(np.sqrt(mean_squared_error(test_y, pred))))
print('cost time={}'.format(end - start))
print('grid_search2.best_params_=\n', grid_search2.best_params_)
The accuracy of the output above is not high: to keep the run fast, GridSearchCV was given only a very small search space. Widening the parameter ranges will improve the accuracy.
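For example, a wider (purely illustrative, not tuned) search space could also tune the forest itself; the key names must match the pipeline step names used above ('prepare', 'random_forest', etc.), with `__` separating nesting levels. ParameterGrid shows how many candidate combinations such a grid produces:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical wider grid; the specific values are illustrative
param_grid_wide = {
    'prepare__feature_selector__feature_importance_k': [3, 5, 8, 10, 16],
    'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median'],
    'random_forest__n_estimators': [50, 100, 200],
    'random_forest__max_depth': [8, 12, None],
}
print(len(ParameterGrid(param_grid_wide)))  # 5*2*3*3 = 90 candidates, each fit cv times
```

Keep in mind that the runtime grows with the product of the list lengths (times the number of CV folds), so widen the grid gradually.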