Stacking (Ensemble Learning)

 

1. Ensemble Learning

In supervised learning, we usually build a single model to make predictions. Ensemble learning instead combines multiple models (often weak learners) into a strong learner, aiming for better predictive performance.
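As a quick illustration of the idea (a minimal sketch on scikit-learn's bundled diabetes dataset, not part of the original example), even a plain average of a few weak regressors is already an ensemble:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A few shallow trees act as weak learners
weak_preds = []
for depth in (2, 3, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    weak_preds.append(tree.predict(X_te))

# The simplest combination rule: average the weak learners' predictions
avg_pred = np.mean(weak_preds, axis=0)
print("ensemble R2:", r2_score(y_te, avg_pred))
```

Stacking replaces this fixed average with a second-layer model that learns how to combine the first-layer predictions.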

2. Stacking

Stacking is one form of ensemble learning.

The training set is run through several models, and their predictions are assembled into a new dataset. The process looks like this:

[Figure: stacking data flow — the first-layer models' predictions form a new dataset]

Then a new learner is trained on this new dataset. See the example below for the details.

3. Python Example

Using the Boston housing dataset, here is a stacking example for a regression problem.

Stacking can be viewed as building a two-layer model; see the code below.

from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor as GBDT
from sklearn.ensemble import ExtraTreesRegressor as ET
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.ensemble import AdaBoostRegressor as ADA
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np

# load_boston requires scikit-learn < 1.2 (it was removed in later versions)
boston = datasets.load_boston()

X = boston.data
Y = boston.target

df = pd.DataFrame(X, columns=boston.feature_names)
df.head()
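Note that `load_boston` was removed in scikit-learn 1.2. If it is unavailable in your environment, a bundled regression dataset such as diabetes can serve as a drop-in (its features differ from Boston's, but the rest of the walkthrough runs unchanged):

```python
from sklearn import datasets
import pandas as pd

# Drop-in replacement when load_boston is unavailable (scikit-learn >= 1.2)
diabetes = datasets.load_diabetes()
X = diabetes.data      # shape (442, 10)
Y = diabetes.target    # continuous disease-progression score

df = pd.DataFrame(X, columns=diabetes.feature_names)
df.head()
```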


# Train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=123)

# Standardize the features
transfer = StandardScaler()
X_train = transfer.fit_transform(X_train)
X_test = transfer.transform(X_test)

print("Number of training examples: " + str(X_train.shape[0]))
print("Number of testing examples: " + str(X_test.shape[0]))
print("X_train shape: " + str(X_train.shape))
print("Y_train shape: " + str(Y_train.shape))


Define and train the first-layer models.

models = [GBDT(n_estimators=100),
          RF(n_estimators=100),
          ET(n_estimators=100),
          ADA(n_estimators=100)]

# Training and test data for the second-layer model:
# each first-layer model's out-of-fold predictions on the training set
# become the new training features; its test-set predictions, averaged
# over the folds, become the new test features
X_train_stack = np.zeros((X_train.shape[0], len(models)))
X_test_stack = np.zeros((X_test.shape[0], len(models)))

# First-layer training: 10-fold stacking
n_folds = 10
kf = KFold(n_splits=n_folds)
# kf.split yields (train_index, test_index) pairs for each fold

for i, model in enumerate(models):
    # One column of test-set predictions per fold
    X_stack_test_n = np.zeros((X_test.shape[0], n_folds))

    for j, (train_index, test_index) in enumerate(kf.split(X_train)):
        tr_x = X_train[train_index]
        tr_y = Y_train[train_index]
        model.fit(tr_x, tr_y)

        # Out-of-fold predictions become the stacking training features
        X_train_stack[test_index, i] = model.predict(X_train[test_index])
        X_stack_test_n[:, j] = model.predict(X_test)

    # Average the per-fold test predictions for the stacking test features
    X_test_stack[:, i] = X_stack_test_n.mean(axis=1)

# Inspect the newly built datasets
print("X_train_stack shape: " + str(X_train_stack.shape))
print("X_test_stack shape: " + str(X_test_stack.shape))
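As an aside (not in the original post), the manual fold loop that fills `X_train_stack` is essentially what scikit-learn's `cross_val_predict` provides out of the box — a sketch on the bundled diabetes dataset:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
first_layer = [RandomForestRegressor(n_estimators=50, random_state=0)]

kf = KFold(n_splits=10)
# One column of out-of-fold predictions per first-layer model
oof = np.column_stack([cross_val_predict(m, X, y, cv=kf) for m in first_layer])
print(oof.shape)
```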


The new datasets are now ready, so we move on to the second-layer model.

The second layer here is a plain linear model (built with Keras so the model can later be saved as an HDF5 file).

To avoid overfitting, this second-layer model should be kept simple.

# Second-layer training
from keras.models import Sequential
from keras.layers import Dense

model_second = Sequential()
model_second.add(Dense(units=1, input_dim=X_train_stack.shape[1]))
model_second.compile(loss='mean_squared_error', optimizer='adam')

model_second.fit(X_train_stack, Y_train, epochs=500)
pred = model_second.predict(X_test_stack).ravel()  # flatten (n, 1) output to 1-D
print("R2:", r2_score(Y_test, pred))


# Model evaluation
from sklearn.metrics import mean_absolute_error
Y_test = np.array(Y_test)
print('MAE: %f' % mean_absolute_error(Y_test, pred))
for i in range(len(Y_test)):
    print("Real: %f, Predict: %f" % (Y_test[i], float(pred[i])))

Alternatively, use scikit-learn's LinearRegression directly:

from sklearn.linear_model import LinearRegression

model_second = LinearRegression()
model_second.fit(X_train_stack, Y_train)
pred = model_second.predict(X_test_stack)
print("R2:", r2_score(Y_test, pred))

# Model evaluation
from sklearn.metrics import mean_absolute_error
print('MAE: %f' % mean_absolute_error(Y_test, pred))
for i in range(len(Y_test)):
    print("Real: %f, Predict: %f" % (Y_test[i], pred[i]))

 

 
