1. Problem Description
This example is based on the exercises for Chapter 2, "End-to-End Machine Learning Project", of Hands-On Machine Learning with Scikit-Learn and TensorFlow. In the book the solutions are interleaved with the main text, so I reorganized them and added a few things of my own (hence no guarantee that it is free of errors). Read it critically if you find it useful; it also doubles as a note to myself. 😃
The dataset is master/datasets/housing from https://github.com/ageron/handson-ml; in short, the task is to predict a house's price from its attributes. The data file contains 10 fields. Specifically, the features are
X = {'longitude', 'latitude', 'housing_median_age',
'total_rooms', 'total_bedrooms', 'population',
'households', 'median_income', 'ocean_proximity'},
where 'ocean_proximity' is a text field describing how close the house is to the ocean; it takes only 5 values: {'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'}.
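As a quick illustration of what happens to this text field later in the pipeline, here is a small sketch using scikit-learn's LabelBinarizer (which the code below wraps): the 5 categories become 5 one-hot indicator columns.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

cats = np.array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])
lb = LabelBinarizer()
onehot = lb.fit_transform(cats)
print(lb.classes_)   # the categories, sorted alphabetically
print(onehot.shape)  # (5, 5): one row per sample, one indicator column per category
```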
The label is
Y = {'median_house_value'},
which is continuous, so this is a multivariate regression problem. A rough schematic of the problem is shown below:
2. Pipeline Flow Diagram
The middle of the figure above is the flow diagram (pipeline) of this end-to-end machine-learning model; the 6 boxes on either side are the pipeline's components, i.e. the operations of the model. A few points to note:
(1) The categorical data in this dataset contains no anomalous values, so there is no outlier-handling component;
(2) Imputer (renamed SimpleImputer in scikit-learn ≥ 0.22) and StandardScaler are classes provided by scikit-learn (not by Python itself); the other classes have to be written by hand;
(3) Using the LabelBinarizer class directly inside a Pipeline raises an error (for the reason, see https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize );
the fix is to wrap it in a slightly modified class, MyLabelBinarizer (see the code below).
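A minimal reproduction of that error (the variable names here are illustrative): Pipeline forwards both X and y to the final step's fit_transform, but LabelBinarizer.fit_transform accepts only a single data argument, so the call fails with a TypeError.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

X = np.array(['NEAR BAY', 'INLAND', 'ISLAND'])
y = np.array([1.0, 2.0, 3.0])
pipe = Pipeline([('binarizer', LabelBinarizer())])
try:
    pipe.fit_transform(X, y)  # Pipeline passes (X, y); LabelBinarizer expects one argument
except TypeError as e:
    print('TypeError:', e)
```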
[For more details, please refer to the original book or the code below.]
3. Example Code
import time

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer  # 'Imputer' was removed in scikit-learn 0.22
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler

#### Read the data and split it into a training set and a test set
data = pd.read_csv('../handsOn/datasets/housing/housing.csv')
housing = data.drop("median_house_value", axis=1)
housing_labels = data["median_house_value"].copy()

# Simple sequential 80/20 split (the book uses a shuffled split);
# .iloc replaces the removed .ix indexer and avoids the off-by-one
# overlap that label-based slicing with .ix[0:L] produced
L = int(0.8 * len(housing))
train_x = housing.iloc[:L]
train_y = housing_labels.iloc[:L]
test_x = housing.iloc[L:]
test_y = housing_labels.iloc[L:]

cat_attribute = ['ocean_proximity']
housing_num = housing.drop(cat_attribute, axis=1)
num_attribute = list(housing_num)

start = time.perf_counter()  # time.clock() was removed in Python 3.8

# Build the pipeline
# Transformer that adds the derived features
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''X is an array'''
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# Transformer that selects columns by name
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''X is a DataFrame'''
        return X[self.attribute_names].values

# One-hot encode the categorical attribute; a plain LabelBinarizer breaks
# inside a Pipeline because its fit_transform takes only one data argument
class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = LabelBinarizer()

    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=None):
        return self.encoder.transform(x)

# Pipeline for the numerical features.
# Every step except the last must be a transformer; the last estimator can be of any type.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribute)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribute_adder', CombinedAttributeAdder()),
    ('std_scaler', StandardScaler())
])

# Pipeline for the categorical feature
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribute)),
    ('label_binarizer', MyLabelBinarizer())
])

# Combine the two feature groups with FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

# Feature selection: rank features by random-forest importance and keep the top k.
# Note: the attribute must keep the same name as the __init__ parameter,
# otherwise get_params()/set_params() (and hence GridSearchCV) cannot reach it.
class TopFeaturesSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importance_k=5):
        self.feature_importance_k = feature_importance_k
        self.top_k_attributes = None

    def fit(self, X, y):
        reg = RandomForestRegressor()
        reg.fit(X, y)
        feature_importance = reg.feature_importances_
        self.top_k_attributes = np.argsort(-feature_importance)[:self.feature_importance_k]
        return self

    def transform(self, X, **fit_params):
        return X[:, self.top_k_attributes]

# Pipeline that preprocesses the data and keeps the k most important features
prepare_and_top_feature_pipeline = Pipeline([
    ('full_pipeline', full_pipeline),
    ('feature_selector', TopFeaturesSelector(feature_importance_k=5))
])

# Find good random-forest parameters with GridSearchCV
train_x_ = full_pipeline.fit_transform(train_x)
# Tree model, select best parameter with GridSearchCV
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [8, 10]
}
reg = RandomForestRegressor()
grid_search = GridSearchCV(reg, param_grid=param_grid, cv=5)
grid_search.fit(train_x_, train_y)

# Final preprocessing-plus-prediction pipeline
prepare_and_predict_pipeline = Pipeline([
    ('prepare', prepare_and_top_feature_pipeline),
    ('random_forest', RandomForestRegressor(**grid_search.best_params_))
])

# Use GridSearchCV on the full pipeline to pick the best pipeline parameters
param_grid2 = {'prepare__feature_selector__feature_importance_k': [1, 3, 5, 10],
               'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent']}
grid_search2 = GridSearchCV(prepare_and_predict_pipeline, param_grid=param_grid2, cv=2,
                            scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search2.fit(train_x, train_y)
pred = grid_search2.predict(test_x)

end = time.perf_counter()
print('RMSE on test set={}'.format(np.sqrt(mean_squared_error(test_y, pred))))
print('cost time={}'.format(end - start))
print('grid_search2.best_params_=\n', grid_search2.best_params_)
The accuracy of the output above is not high: to keep the run fast, GridSearchCV was given only a very small search space. Widening the parameter ranges will improve the accuracy.
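For example, a wider (purely illustrative, not tuned) search space could also tune the forest itself; the key names must match the pipeline step names used above ('prepare', 'random_forest', etc.), with `__` separating nesting levels. ParameterGrid shows how many candidate combinations such a grid produces:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical wider grid; the specific values are illustrative
param_grid_wide = {
    'prepare__feature_selector__feature_importance_k': [3, 5, 8, 10, 16],
    'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median'],
    'random_forest__n_estimators': [50, 100, 200],
    'random_forest__max_depth': [8, 12, None],
}
print(len(ParameterGrid(param_grid_wide)))  # 5*2*3*3 = 90 candidates, each fit cv times
```

Keep in mind that the runtime grows with the product of the list lengths (times the number of CV folds), so widen the grid gradually.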