Datawhale AI Summer Camp, Session 2 (Machine Learning): Multi-Model Fusion with a Neural-Network Stacking Strategy

TXY242

Last modified 2024-07-17 22:05:39

Tags: Artificial Intelligence

First published 2024-07-17 20:52:30

Article link: https://blog.csdn.net/weixin_61149295/article/details/140502866

#AI Summer Camp #Datawhale Summer Camp

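Multi-Model Fusion with a Neural-Network Stacking Strategy

Improvements:

1. Data cleaning: outlier replacement (Section 2)

2. Neural-network-based stacking for model fusion (Section 6)

These improvements to Task 3 follow suggestions from another author's post: http://t.csdnimg.cn/RSC3o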

1. Importing packages

Import the required packages:

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

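2. Data overview and visualization

The competition data consists of a training set and a test set. To keep the competition fair, each date is anonymized and labeled 1 to N, where 1 is the most recent day in the dataset and days 1-10 form the test set. The dataset has the fields id (house id), dt (day index), type (house type), and target (actual electricity consumption), as shown in the figure below.

Each house has electricity consumption for days 11 through 506, for a total of 2,877,305 rows, 5,832 houses, and 19 house types. The figure below shows a sample of train.csv.

The relationship between house type and electricity consumption is shown in the figure below: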

import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.cm as cm

# Assumes `train` has already been loaded from train.csv as a DataFrame
plt.rcParams['axes.unicode_minus'] = False
# Mean consumption per house type
target = train.groupby('type')['target'].mean().reset_index()
plt.figure(figsize=(6, 4))
# One viridis color per house type
norm = plt.Normalize(vmin=0, vmax=len(target['type'].unique())-1)
colors = cm.viridis(norm(range(len(target['type'].unique()))))
plt.bar(target['type'], target['target'], color=colors)
plt.xlabel('House type')
plt.ylabel('Mean electricity consumption')
plt.title('House type vs. electricity consumption')
plt.show()

Taking the first house id as an example, the figure below shows its electricity consumption over time:

# Select all rows for a single house id
id_all = train[train['id'] == '00037f39cf']
plt.figure(figsize=(8, 4))
plt.plot(id_all['dt'], id_all['target'], linestyle='-')
plt.xlabel('Day (dt)')
plt.ylabel('Electricity consumption')
plt.title('Electricity consumption over time')
plt.show()

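The plot shows clear outliers that need to be removed and filled in. The code below therefore loops over the whole dataset, checks whether each value is plausible, replaces implausible values with the mean of their two neighbors, and writes the result to a new file, train1.csv.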

# Data preprocessing: replace outliers
products = train['id'].unique()

# Work on a copy so the original DataFrame is left untouched
modified_train = train.copy()

for product in products:
    # Select the rows for one id (assumed to be in time order)
    mask = modified_train['id'] == product
    ls = modified_train.loc[mask, 'target'].tolist()
    if len(ls) < 3:
        continue  # too short to contain a point with two neighbors

    # A point counts as an outlier when it differs from both neighbors by at least 10
    new_ls = [ls[0]]  # keep the first value (it has no left neighbor)
    for i in range(1, len(ls) - 1):
        if abs(ls[i] - ls[i - 1]) >= 10 and abs(ls[i] - ls[i + 1]) >= 10:
            new_ls.append((ls[i - 1] + ls[i + 1]) / 2)  # mean of the two neighbors
        else:
            new_ls.append(ls[i])
    new_ls.append(ls[-1])  # keep the last value (it has no right neighbor)

    # Write the cleaned values back into the target column
    modified_train.loc[mask, 'target'] = new_ls

# Finally, save the cleaned data to a CSV file
modified_train[['id', 'dt', 'type', 'target']].to_csv('train1.csv', index=None)

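As the figure shows, the curve is smoother and has fewer spikes. However, the plausibility check only compares each value with its immediate neighbors, so the cleaning is not very thorough and can be refined further.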
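
As one possible refinement in terms of speed, the same neighbor rule can be written without a Python loop over every id. The sketch below is only an illustration: the function name smooth_isolated_spikes, its threshold parameter, and the toy values are made up here, and it assumes the rows of each id are already in time order, as the loop above does.

```python
import pandas as pd

def smooth_isolated_spikes(df, threshold=10.0):
    """Replace points that differ from BOTH neighbours by >= threshold
    with the mean of the two neighbours, separately for each id."""
    out = df.copy()
    grouped = out.groupby('id')['target']
    prev_val = grouped.shift(1)    # previous row within the same id
    next_val = grouped.shift(-1)   # next row within the same id
    # Endpoints have a NaN neighbour, so the comparison is False and they are kept
    is_spike = ((out['target'] - prev_val).abs() >= threshold) & \
               ((out['target'] - next_val).abs() >= threshold)
    out.loc[is_spike, 'target'] = (prev_val[is_spike] + next_val[is_spike]) / 2
    return out

# Toy data (hypothetical values), already in time order within each id
toy = pd.DataFrame({
    'id': ['a'] * 5 + ['b'] * 3,
    'dt': [5, 4, 3, 2, 1, 3, 2, 1],
    'target': [20.0, 21.0, 60.0, 22.0, 23.0, 5.0, 40.0, 6.0],
})
cleaned = smooth_isolated_spikes(toy)
print(cleaned['target'].tolist())
```

Like the loop, this compares each point with the original (uncleaned) neighbor values, so two adjacent spikes are still not caught; a rolling median or a per-id quantile rule would be a different technique rather than a rewrite of this one.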

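3. Feature extraction

This section builds lag (historical shift) features, difference features, and rolling-window statistics, described below:

(1) Lag features: shifting the series back in time gives each row the values from earlier periods.

(2) Difference features: differences capture the change between adjacent periods and describe whether the series is rising or falling. Ratios between adjacent values and second-order differences can be built on top of them.

(3) Rolling-window statistics: windows of different sizes can be used, and statistics such as the mean, maximum, minimum, median, and variance computed over each window reflect how the data has changed recently.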
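
To make the three feature types concrete, here is a small sketch on a single made-up series. The column names shift1, shift1_diff1 and win3_mean and the values are illustrative only; the real features use larger lags and windows.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['a'] * 6,
    'dt': [6, 5, 4, 3, 2, 1],   # larger dt = older day, so the oldest row comes first
    'target': [1.0, 2.0, 4.0, 7.0, 11.0, 16.0],
})
g = toy.groupby('id')['target']

# (1) Lag feature: the value from one row (one day) earlier
toy['shift1'] = g.shift(1)

# (2) Difference feature: change between consecutive lagged values
toy['shift1_diff1'] = toy.groupby('id')['shift1'].diff(1)

# (3) Rolling-window statistic: mean of up to 3 earlier rows;
#     closed='left' excludes the current row from its own window
toy['win3_mean'] = (
    g.rolling(window=3, min_periods=1, closed='left').mean()
     .reset_index(level=0, drop=True)   # drop the id level so rows align by index
)
print(toy)
```

Because every window and lag only looks at earlier rows, none of these columns uses the current day's own target.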

# Merge the training and test data
# (`train` is assumed to hold the cleaned data, e.g. re-read from train1.csv, and `test` the test set)
data = pd.concat([train, test], axis=0).reset_index(drop=True)
# Sort by dt descending so that, within each id, older days (larger dt) come first
data = data.sort_values(['id','dt'], ascending=False).reset_index(drop=True)

# Lag features
for i in range(10,36):
    data[f'target_shift{i}'] = data.groupby('id')['target'].shift(i)

# Lag + difference features
for i in range(1,4):
    data[f'target_shift10_diff{i}'] = data.groupby('id')['target_shift10'].diff(i)

def rolling_stat(col, win, stat):
    # groupby().rolling() returns rows ordered by group key (id ascending), while `data`
    # is sorted by id descending, so assigning .values would misalign the rows.
    # Dropping the id level lets pandas align on the original row index instead.
    roll = data.groupby('id')[col].rolling(window=win, min_periods=3, closed='left')
    return getattr(roll, stat)().reset_index(level=0, drop=True)

# Rolling-window statistics
for win in [15,30,50,70]:
    for stat in ['mean', 'max', 'min', 'std']:
        data[f'target_win{win}_{stat}'] = rolling_stat('target', win, stat)

# Lag + rolling-window statistics
for win in [7,14,28,35,50,70]:
    for stat in ['mean', 'max', 'min', 'sum', 'std']:
        data[f'target_shift10_win{win}_{stat}'] = rolling_stat('target_shift10', win, stat)

# Split back into training rows (target known) and test rows (target missing)
train = data[data.target.notnull()].reset_index(drop=True)
test = data[data.target.isnull()].reset_index(drop=True)
# Use every column except the id and the label as a feature
train_cols = [f for f in data.columns if f not in ['id','target']]

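4. Predicting with three machine learning models

XGBoost, LightGBM, and CatBoost are three powerful gradient-boosted tree libraries, each with its own strengths in algorithm design, categorical-variable handling, and performance optimization. XGBoost is known for its efficiency and flexibility and suits large datasets. LightGBM uses a histogram-based algorithm and a leaf-wise split strategy and does well on sparse data. CatBoost handles categorical features particularly well through automatic feature processing and ordered boosting.

The code below defines a function named cv_model that trains and evaluates a given model with cross-validation (LightGBM, XGBoost, and CatBoost are supported). It takes the model, the training data and labels, the test data, the model name, and a random seed. Inside the function, it first sets up 5-fold cross-validation (KFold) and initializes the arrays that hold the out-of-fold (OOF) predictions on the training set, the test-set predictions, and the cross-validation scores.

In each cross-validation iteration, the function splits the data using the current training and validation indices, then picks the parameters and training procedure according to the model name (clf_name). For LightGBM (lgb), it builds two dataset objects (training and validation), sets specific parameters (learning rate, number of trees, feature sampling ratio, and so on), and trains the model with early stopping to limit overfitting. XGBoost and CatBoost follow a similar flow with different parameters and details.

In each iteration the model predicts on the validation fold, the predictions are written into the matching positions of the OOF array, and the fold's score is computed and stored. The model also predicts on the test set in every iteration, and the final test prediction is the average over all folds.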

def cv_model(clf, train_x, train_y, test_x, clf_name, seed = 2024):
    '''
    clf: model module or class to use
    train_x: training features
    train_y: training labels
    test_x: test features
    clf_name: name of the model to use ('lgb', 'xgb' or 'cat')
    seed: random seed
    '''
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros(train_x.shape[0])
    test_predict = np.zeros(test_x.shape[0])
    cv_scores = []
    
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # Positional indexing, so this works regardless of the index labels
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression',
                'metric': 'mae',
                'min_child_weight': 6,
                'num_leaves': 2 ** 6,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.05,
                'seed': 2024,
                'nthread' : 16,
                'verbose' : -1,
            }
            # Early stopping and logging are passed as callbacks: the verbose_eval and
            # early_stopping_rounds arguments of train() were removed in LightGBM 4.0
            model = clf.train(params, train_matrix, 5000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[],
                              callbacks=[clf.early_stopping(500), clf.log_evaluation(500)])
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        
        elif clf_name == "xgb":
            xgb_params = {
              'booster': 'gbtree', 
              'objective': 'reg:squarederror',
              'eval_metric': 'mae',
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.1,
              'tree_method': 'hist',
              'seed': 520,
              'nthread': 16
              }
            train_matrix = clf.DMatrix(trn_x , label=trn_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            test_matrix = clf.DMatrix(test_x)
            
            # The last entry of the watchlist is the one used for early stopping
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
            
            model = clf.train(xgb_params, train_matrix, num_boost_round=5000, evals=watchlist, verbose_eval=500, early_stopping_rounds=500)
            val_pred  = model.predict(valid_matrix)
            test_pred = model.predict(test_matrix)
            
        elif clf_name == "cat":
            # The original dict set random_seed twice (1314, then 11); only the last value took effect
            params = {'learning_rate': 0.05, 'depth': 5, 'bootstrap_type':'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 100, 'random_seed': 11, 'allow_writing_files': False}
            
            model = clf(iterations=5000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      metric_period=100,
                      use_best_model=True, 
                      cat_features=[],
                      verbose=1)
            
            val_pred  = model.predict(val_x)
            test_pred = model.predict(test_x)
        
        # Store this fold's validation predictions and accumulate the test average
        oof[valid_index] = val_pred
        test_predict += test_pred / kf.n_splits
        
        score = mean_absolute_error(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
        
    return oof, test_predict

# Run the LightGBM model
lgb_oof, lgb_test = cv_model(lgb, train[train_cols], train['target'], test[train_cols], 'lgb')
# Run the XGBoost model
xgb_oof, xgb_test = cv_model(xgb, train[train_cols], train['target'], test[train_cols], 'xgb')
# Run the CatBoost model
cat_oof, cat_test = cv_model(CatBoostRegressor, train[train_cols], train['target'], test[train_cols], 'cat')
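
Before stacking, it can help to compare the base models' OOF errors against a plain average of their predictions, which gives a baseline that any meta-learner should beat. The sketch below is not part of the original workflow: the helper names mae and compare_oof and the demo arrays are made up, and in practice the inputs would be train['target'] with lgb_oof, xgb_oof, and cat_oof.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def compare_oof(y_true, oof_by_model):
    """Return the OOF MAE of each model plus that of their simple average."""
    scores = {name: mae(y_true, oof) for name, oof in oof_by_model.items()}
    blend = np.mean(list(oof_by_model.values()), axis=0)  # element-wise mean
    scores['simple_average'] = mae(y_true, blend)
    return scores

# Hypothetical demo values standing in for the real targets and OOF arrays
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
oof_demo = {
    'lgb': np.array([11.0, 19.0, 31.0, 39.0]),
    'xgb': np.array([9.0, 22.0, 28.0, 42.0]),
    'cat': np.array([12.0, 20.0, 30.0, 38.0]),
}
scores_demo = compare_oof(y_demo, oof_demo)
print(scores_demo)
```

In this toy case the average beats every single model because the errors partly cancel; on real data the gain depends on how correlated the base models' errors are.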

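5. Model stacking with a Ridge regressor

Ridge: Ridge regression is an extension of linear regression that adds an L2 regularization term (the sum of squared weights) to the loss function. This reduces model complexity and helps avoid overfitting: by constraining the size of the coefficients, the model fits the data while keeping some ability to generalize.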
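
Written out, the objective looks as follows. The notation is introduced here for illustration: X holds the three columns of base-model OOF predictions, y the true targets, w the meta-learner's weights, b an unpenalized intercept, and \alpha the regularization strength (scikit-learn's Ridge uses alpha=1.0 when, as in this post, no value is passed).

```latex
\hat{w}, \hat{b} \;=\; \arg\min_{w,\,b}\; \lVert y - Xw - b \rVert_2^2 \;+\; \alpha \lVert w \rVert_2^2
```

With \alpha = 0 this reduces to ordinary least squares; larger values of \alpha shrink the weights toward zero.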

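The code below implements a stacking model that combines the predictions of three base models (LightGBM, XGBoost, and CatBoost) into a new ensemble. First, it concatenates the three base models' out-of-fold (OOF) predictions on the training set, and their predictions on the test set, into two new DataFrames. These concatenated predictions then serve as the new features for a linear model (a Ridge regressor here), trained with repeated K-fold cross-validation (RepeatedKFold); each fold computes and records the mean absolute error (MAE) on its validation split. The final stacked prediction is the average of the test-set predictions over all folds. The aim is to reduce the risk of overfitting and improve generalization by combining the base models' predictions. The returned stack_oof and stack_pred are the stacking model's OOF predictions on the training set and its predictions on the test set.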

from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Ridge

def stack_model(oof_1, oof_2, oof_3, predictions_1, predictions_2, predictions_3, y):  
    '''  
    oof_1, oof_2, oof_3 can be lgb_oof, xgb_oof, cat_oof  
    predictions_1, predictions_2, predictions_3 correspond to lgb_test, xgb_test, cat_test  
    '''  
    # Convert the prediction arrays to DataFrames  
    oof_1, oof_2, oof_3 = pd.DataFrame(oof_1), pd.DataFrame(oof_2), pd.DataFrame(oof_3)  
    predictions_1, predictions_2, predictions_3 = pd.DataFrame(predictions_1), pd.DataFrame(predictions_2), pd.DataFrame(predictions_3)  
      
    # Concatenate the OOF predictions and the test predictions column-wise  
    train_stack = pd.concat([oof_1, oof_2, oof_3], axis=1)  
    test_stack = pd.concat([predictions_1, predictions_2, predictions_3], axis=1)  
      
    oof = np.zeros(train_stack.shape[0])  
    predictions = np.zeros(test_stack.shape[0])  
    scores = []  
      
    folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2021)  
    # split() returns a generator, so len() cannot be used on it;
    # get_n_splits() gives the total number of fits (5 splits x 2 repeats = 10)
    n_total = folds.get_n_splits()  
      
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, y)):  
        print(f"fold n°{fold_+1}")  
        trn_data, trn_y = train_stack.iloc[trn_idx], y.iloc[trn_idx]  
        val_data, val_y = train_stack.iloc[val_idx], y.iloc[val_idx]  
          
        clf = Ridge(random_state=2021)  
        clf.fit(trn_data, trn_y)  
  
        # The second repeat overwrites the OOF values written by the first
        oof[val_idx] = clf.predict(val_data)  
        predictions += clf.predict(test_stack) / n_total  
          
        score_single = mean_absolute_error(val_y, oof[val_idx])  
        scores.append(score_single)  
        print(f'{fold_+1}/{n_total}', score_single)  
    print('mean: ', np.mean(scores))  
     
    return oof, predictions  
   
stack_oof, stack_pred = stack_model(lgb_oof, xgb_oof, cat_oof, lgb_test, xgb_test, cat_test, train['target'])
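
One advantage of a linear meta-learner is that its coefficients can be read as blending weights. The sketch below illustrates this on synthetic data only: the three noisy columns are made-up stand-ins for base-model predictions, and names such as stack_demo and weights_demo are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
y_demo = rng.uniform(10, 50, size=500)

# Three synthetic "base model" predictions: the truth plus increasing noise
stack_demo = np.column_stack([
    y_demo + rng.normal(0, 1.0, size=500),   # most accurate
    y_demo + rng.normal(0, 3.0, size=500),
    y_demo + rng.normal(0, 5.0, size=500),   # least accurate
])

# Same meta-learner as in the stacking function, fitted once on all rows
meta = Ridge().fit(stack_demo, y_demo)
weights_demo = dict(zip(['model_1', 'model_2', 'model_3'], meta.coef_))
print(weights_demo, meta.intercept_)
```

The least noisy column receives the largest weight, and the weights sum to roughly one. Applied to the real OOF columns, the same inspection indicates which base model the stack relies on most.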

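6. Model stacking with a neural network

Illustration of a simple neural network (not the layers used in this code; shown only to convey the general structure)

The code below defines a function named stack_model_nn that implements stacking with a neural network as the meta-learner. It takes the OOF (out-of-fold) predictions and test-set predictions of three base models (LightGBM, XGBoost, and CatBoost) together with the true labels of the training set. It first converts the inputs to DataFrames and concatenates the OOF predictions and the test predictions into two new DataFrames.

Next, the function trains a neural network with repeated K-fold cross-validation (RepeatedKFold). In each fold it trains the network on that fold's training split (the base models' OOF predictions) and predicts on the validation split to assess performance. The network has three dense layers, uses ReLU activations and the Adam optimizer, and takes mean absolute error (MAE) as its loss function.

After each fold, the function computes and records the MAE on the validation split and adds the network's test-set predictions to a running average, which becomes the final test prediction. At the end it prints the mean score over all folds and returns the stacking model's OOF predictions on the training set and its predictions on the test set.

In this way, stack_model_nn uses a neural network as the meta-learner to combine the base models' predictions, with the aim of improving accuracy and generalization.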

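from sklearn.model_selection import RepeatedKFold
# Assumption: the Keras API bundled with TensorFlow (the original post does not show these imports)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def stack_model_nn(oof_1, oof_2, oof_3, predictions_1, predictions_2, predictions_3, y):  
    # Convert the prediction arrays to DataFrames  
    oof_1, oof_2, oof_3 = pd.DataFrame(oof_1), pd.DataFrame(oof_2), pd.DataFrame(oof_3)  
    predictions_1, predictions_2, predictions_3 = pd.DataFrame(predictions_1), pd.DataFrame(predictions_2), pd.DataFrame(predictions_3)  
      
    # Concatenate the OOF predictions and the test predictions column-wise  
    train_stack = pd.concat([oof_1, oof_2, oof_3], axis=1)  
    test_stack = pd.concat([predictions_1, predictions_2, predictions_3], axis=1)  
      
    oof = np.zeros(train_stack.shape[0])  
    predictions = np.zeros(test_stack.shape[0])  
    scores = []  
      
    folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2021)  
    # split() returns a generator, so use get_n_splits() for the total (5 x 2 = 10)
    n_total = folds.get_n_splits()  
      
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, y)):  
        print(f"fold n°{fold_+1}")  
        trn_data, trn_y = train_stack.iloc[trn_idx], y.iloc[trn_idx]  
        val_data, val_y = train_stack.iloc[val_idx], y.iloc[val_idx]  
          
        # Define the neural network  
        model = Sequential([  
            Dense(128, activation='relu', input_shape=(trn_data.shape[1],)),  
            Dense(64, activation='relu'),  
            Dense(1)  
        ])  
        model.compile(optimizer=Adam(), loss='mean_absolute_error')  
          
        # Train the network. The original post omits this step, so the model would predict
        # with untrained weights; epochs and batch_size here are placeholders to tune,
        # not the author's settings
        model.fit(trn_data.values, trn_y.values, epochs=5, batch_size=1024, verbose=0)  
          
        # Predict on the validation split and accumulate the test average  
        oof[val_idx] = model.predict(val_data.values).flatten()  
        predictions += model.predict(test_stack.values).flatten() / n_total  
          
        # Compute and print the score  
        score_single = mean_absolute_error(val_y, oof[val_idx])  
        scores.append(score_single)  
        print(f'{fold_+1}/{n_total}', score_single)  
    print('mean: ', np.mean(scores))  
     
    return oof, predictions  

stack_oof, stack_pred1 = stack_model_nn(lgb_oof, xgb_oof, cat_oof, lgb_test, xgb_test, cat_test, train['target'])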

This code is only meant as a starting point, so the final results are not posted here. Further improvements are welcome.
