An Example of sklearn Pipeline Usage: House Price Prediction

1. Problem Description

This example is based on the end-of-chapter exercise from Chapter 2, End-to-End Machine Learning Project, of Hands-On Machine Learning with Scikit-Learn and TensorFlow. In the book the solution is mixed in with the main text, so I have reorganized it and added a few things of my own (which is why I cannot guarantee it is free of slips). It is offered only as something to be read critically by anyone who needs it, and it also doubles as a note for myself. 😃
  The dataset comes from https://github.com/ageron/handson-ml (under master/datasets/housing). In short, the task is to predict a house's price from some of its attributes. The data file contains 10 fields. Specifically, the features are
     X = {'longitude', 'latitude', 'housing_median_age',
       'total_rooms', 'total_bedrooms', 'population',
       'households', 'median_income', 'ocean_proximity'},
where 'ocean_proximity' is a text field describing how close the house is to the ocean and takes only 5 values: {'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'}.
The label is
            Y = {'median_house_value'},
which is continuous, so this is a regression problem with multiple input features. A rough schematic of the problem is shown below:
[Figure: schematic of the house price prediction task]
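
For a first look at the raw file (a minimal sketch; the path below assumes the handson-ml repository has been cloned locally):

import pandas as pd

raw = pd.read_csv('handson-ml/datasets/housing/housing.csv')  # assumed local path to the repo
raw.info()                                       # 10 columns; total_bedrooms has some missing values
print(raw['ocean_proximity'].value_counts())     # the 5 text categories listed above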

2. Pipeline Processing Flow Diagram


[Figure: pipeline flow diagram]
The middle of the figure above is the flow chart (pipeline) of this end-to-end machine learning model, and the six boxes on its two sides are the pipeline's components, i.e. the operations of this model. A few points are worth noting:
(1) The categorical data here contain no abnormal values, which is why there is no component for handling such values;
(2) SimpleImputer (called Imputer in older scikit-learn versions) and StandardScaler are classes provided by scikit-learn; the other classes have to be written by ourselves;
(3) Using the LabelBinarizer class directly inside a pipeline raises an error; the reason is explained at https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize .
The workaround is to wrap it in a small class of our own, MyLabelBinarizer (see the code below).
[For more details, please refer to the original book or to the code below.]
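
As a side note (not part of the original solution; a sketch assuming scikit-learn 0.20 or newer): the column selection and one-hot encoding can also be done with the built-in ColumnTransformer and OneHotEncoder, which avoids both DataFrameSelector and the LabelBinarizer workaround. The custom feature-adder from the code below could still be slotted into the numerical sub-pipeline.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# num_attribute / cat_attribute are the column-name lists defined in the code below
preprocess = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('std_scaler', StandardScaler())]), num_attribute),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribute)
])
# preprocess.fit_transform(train_x) then yields a purely numeric feature matrix,
# playing the same role as full_pipeline in the code below.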

3. Example Code

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np
import time

####  Read the data and split it into training and test sets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline


data = pd.read_csv('../handsOn/datasets/housing/housing.csv')
housing = data.drop("median_house_value", axis=1)
housing_labels = data["median_house_value"].copy()

L = int(0.8*len(housing))
train_x = housing.iloc[:L, :]
train_y = housing_labels.iloc[:L]
test_x = housing.iloc[L:, :]
test_y = housing_labels.iloc[L:]

cat_attribute = ['ocean_proximity']
housing_num = housing.drop(cat_attribute, axis=1)
num_attribute = list(housing_num)

start = time.perf_counter()
# Build the pipeline
# Transformer that adds the new combined features
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6  # column positions in the numeric array


class CombinedAttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return(self)

    def transform(self, X):
        '''X is an array'''
        rooms_per_household = X[:, rooms_ix]/X[:, household_ix]
        population_per_household = X[:, population_ix]/X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix]/X[:, rooms_ix]
            return(np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room])
        else:
            return(np.c_[X, rooms_per_household, population_per_household])


# Transformer that selects the given columns (by name) from a DataFrame and returns a NumPy array
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return(self)

    def transform(self, X):
        '''X is a DataFrame'''
        return(X[self.attribute_names].values)


# One-hot encode the categorical attribute
class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = LabelBinarizer()

    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=0):
        return self.encoder.transform(x)


# Pipeline for the numerical features
# Every step of a Pipeline except the last one must be a transformer; the last estimator may be of any type
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribute)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribute_adder', CombinedAttributeAdder()),
    ('std_scaler', StandardScaler())
])

# Pipeline for the categorical feature
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribute)),
    ('label_binarizer', MyLabelBinarizer())
])

# Combine the two pipelines with FeatureUnion so that both groups of features are concatenated
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])


# Feature selection: use a random forest to estimate feature importances and keep the K most important features
class TopFeaturesSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importance_k=5):
        self.top_k_attributes = None
        # store the hyperparameter under its own name so that get_params/set_params (and GridSearchCV) can reach it
        self.feature_importance_k = feature_importance_k

    def fit(self, X, y):
        reg = RandomForestRegressor()
        reg.fit(X, y)
        feature_importance = reg.feature_importances_
        top_k_attributes = np.argsort(-feature_importance)[0:self.feature_importance_k]
        self.top_k_attributes = top_k_attributes
        return(self)

    def transform(self, X, **fit_params):
        return(X[:, self.top_k_attributes])


# Pipeline that preprocesses the data and then keeps the K most important features
prepare_and_top_feature_pipeline = Pipeline([
    ('full_pipeline', full_pipeline),
    ('feature_selector', TopFeaturesSelector(feature_importance_k=5))
])

# Use GridSearchCV to find the best random forest hyperparameters
train_x_ = full_pipeline.fit_transform(train_x)
# Tree model,select best parameter with GridSearchCV
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [8, 10]
}
reg = RandomForestRegressor()
grid_search = GridSearchCV(reg, param_grid=param_grid, cv=5)
grid_search.fit(train_x_, train_y)

# Final pipeline: data preparation followed by prediction
prepare_and_predict_pipeline = Pipeline([
    ('prepare', prepare_and_top_feature_pipeline),
    ('random_forest', RandomForestRegressor(**grid_search.best_params_))
])

# Run GridSearchCV over the whole pipeline above to pick the best pipeline parameters
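# Parameter names follow the <step>__<sub_step>__<param> convention (double underscores), e.g.
# 'prepare__feature_selector__feature_importance_k' reaches the feature_importance_k
# hyperparameter of the 'feature_selector' step inside the 'prepare' step.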
param_grid2 = {'prepare__feature_selector__feature_importance_k': [1, 3, 5, 10],
               'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent']}
grid_search2 = GridSearchCV(prepare_and_predict_pipeline, param_grid=param_grid2, cv=2,
                            scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search2.fit(train_x, train_y)
pred = grid_search2.predict(test_x)

end = time.perf_counter()
print('RMSE on test set={}'.format(np.sqrt(mean_squared_error(test_y, pred))))
print('cost time={}'.format(end-start))
print('grid_search2.best_params_=\n', grid_search2.best_params_)
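
# Optional follow-up (a sketch, not from the book): persist the fitted end-to-end pipeline
# so it can be reloaded and used for prediction later without re-running the grid search.
# The file name here is arbitrary.
import joblib
joblib.dump(grid_search2.best_estimator_, 'housing_price_pipeline.pkl')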

The accuracy of the output above is not very high, because GridSearchCV is given only the simplest parameter grids in order to keep the running time down; enlarging the parameter ranges would improve the accuracy.
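
For example (the ranges below are my own illustrative guess, not values from the book), the search space could be widened like this, at the cost of a much longer run time:

param_grid2_large = {
    'prepare__feature_selector__feature_importance_k': [3, 5, 8, 12, 16],
    'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'random_forest__n_estimators': [50, 100, 200],
    'random_forest__max_depth': [8, 12, None]
}
grid_search_large = GridSearchCV(prepare_and_predict_pipeline, param_grid=param_grid2_large,
                                 cv=3, scoring='neg_mean_squared_error', n_jobs=4)
grid_search_large.fit(train_x, train_y)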
