[Stacking Improvement] A Stacking Algorithm Based on Random Sampling and Accuracy Weighting

Latest recommended article published on 2024-03-31 08:46:13

圈外人

Tags: python, machine learning, algorithms, artificial intelligence, data mining

Article link: https://blog.csdn.net/you_just_look/article/details/117486255

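This article presents a Stacking ensemble model based on random sampling and accuracy weighting, applied to Kaggle house price prediction. GBDT is used to determine feature importance; random sampling and feature perturbation then produce multiple subsets for training the base models. Weights are assigned according to validation-set error and a meta-model is built, improving prediction accuracy and efficiency. Experimental results show that the method outperforms classic Stacking and the base models.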

Abstract

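The rapid rise of artificial intelligence in recent years has shown the great potential of the technology, and machine learning has been applied across many fields with good results. Using the housing data from the Kaggle House Prices competition as the experimental sample, this article borrows the bootstrap sampling of Bagging and k-fold cross-validation to build a Stacking ensemble model based on pseudo-random sampling for house price prediction. First, a GBDT is trained briefly on the dataset to obtain the importance of each feature. The dataset is then randomly sampled multiple times, and attribute perturbation is applied according to feature importance, producing multiple training subsets and validation subsets. The base models are trained on these subsets; for each one, the predictions and the root mean squared error on its validation subset are computed, and weights are assigned according to the error. The second-layer meta-model is built from the base models' predictions and is finally used to predict house prices on the test set. Experimental results show that the Stacking ensemble based on random sampling and accuracy weighting achieves a lower root mean squared error than every base model and than a classic Stacking ensemble of the same structure.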
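
The pipeline summarized in the abstract (GBDT importances, random row sampling, importance-guided feature perturbation, error-based weights) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the article's implementation: the Ridge base model, the out-of-bag validation split, the importance-proportional feature draw, the inverse-RMSE weighting formula, and the names `n_subsets`, `n_keep`, and `weighted_predict` are all assumptions made for the example. It also stops at a weighted combination of base predictions and does not train a meta-model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)

# Step 1: a quick GBDT fit to obtain feature importances.
gbdt = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)
importance = gbdt.feature_importances_
# Assumption: add a small floor so zero-importance features can still be drawn.
feature_prob = (importance + 1e-3) / (importance + 1e-3).sum()

n_subsets, n_keep = 5, 8  # hypothetical settings
models, feature_sets, rmses = [], [], []
for _ in range(n_subsets):
    # Step 2: bootstrap-sample rows; the out-of-bag rows act as the validation subset.
    train_idx = rng.choice(len(X), size=len(X), replace=True)
    val_idx = np.setdiff1d(np.arange(len(X)), train_idx)
    # Step 3: feature perturbation, drawing features in proportion to importance.
    feats = rng.choice(X.shape[1], size=n_keep, replace=False, p=feature_prob)
    model = Ridge(alpha=1.0).fit(X[train_idx][:, feats], y[train_idx])
    pred = model.predict(X[val_idx][:, feats])
    rmses.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    models.append(model)
    feature_sets.append(feats)

# Step 4 (assumed formula): weights proportional to 1 / RMSE, normalized to sum to 1.
rmses = np.array(rmses)
weights = (1.0 / rmses) / (1.0 / rmses).sum()

def weighted_predict(X_new):
    """Combine the base models' predictions using the accuracy weights."""
    preds = np.column_stack([m.predict(X_new[:, f])
                             for m, f in zip(models, feature_sets)])
    return preds @ weights
```

With this weighting, the base model with the lowest validation error receives the largest weight; other monotone mappings from error to weight would serve the same purpose.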

Theoretical Basis of the Stacking Algorithm

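Stacking is a layered model-ensemble framework proposed by Wolpert in 1992. A Stacking ensemble can have many layers, but it is usually designed with two. The first layer consists of several base models: its input is the original training set and its output is each base model's predictions. The second layer has a single meta-model, which is trained on the first-layer predictions against the true values, yielding the complete ensemble. Prediction on the test set works the same way: the test data first passes through all the base models, whose predictions form the second-layer features, and the meta-model then produces the final result. To prevent overfitting, Stacking usually combines the first-layer training with k-fold cross-validation. Taking five-fold cross-validation as an example, the Stacking procedure is shown in the figure below.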
[Figure: the Stacking procedure with five-fold cross-validation]
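
As a quick way to experiment with the two-layer procedure described above, scikit-learn ships a built-in `StackingRegressor` that produces the out-of-fold predictions internally. The sketch below uses synthetic data and arbitrarily chosen base and meta-models (Ridge, GBDT, Lasso); these choices are assumptions for illustration and are not the models used in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    # First layer: the base models.
    estimators=[
        ("ridge", Ridge()),
        ("gbdt", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    # Second layer: a single meta-model trained on out-of-fold predictions.
    final_estimator=Lasso(alpha=0.1),
    cv=5,  # five-fold cross-validation for the first layer
)
stack.fit(X_train, y_train)

# Second-layer features for the test set: one column per base model.
meta_features = stack.transform(X_test)
y_pred = stack.predict(X_test)
```

Note that `StackingRegressor` refits each base model on the full training set after generating the out-of-fold predictions, so its test-time behaviour may differ from a hand-written implementation that keeps the per-fold models.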

Classic Stacking Code

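import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold, cross_val_score

# Cross-validation scoring function. `train` (feature DataFrame) and
# `target_variable` (target values) are assumed to be defined beforehand.
n_folds = 5

def rmsle_cv(model):
    # Pass the KFold object itself as cv so that the folds are shuffled
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, target_variable, scoring="neg_mean_squared_error", cv=kf))
    return rmse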
# Stacking Averaged Models        score: 0.1087 (0.0061)
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        # One list per base model, holding its fitted clones, one per fold (e.g. a 4 x 5 list)
        self.base_models_ = [list() for x in self.base_models]
        # Clone the meta-model so the instance passed in is left untouched
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train the cloned base models and collect their out-of-fold predictions,
        # which are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            # Split into train_index (for fitting) and holdout_index (for prediction)
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Train the meta-model using the out-of-fold predictions as features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    def predict(self, X):
        meta_features = np.column_stack