# Using Stacking Learning for Classification Problems

## Recommended Reading

1. Zhihu (must-read): "Kaggle机器学习之模型融合（stacking）心得" (Notes on model ensembling with stacking for Kaggle machine learning)
2. Blog: Stacking Models for Improved Predictions
3. Blog: Kaggle Ensembling Guide (footnote)
4. Blog: "如何在 Kaggle 首战中进入前 10%" (How to reach the top 10% in your first Kaggle competition)
5. GitHub: [ikki407](https://github.com/ikki407)/stacking
6. Paper: M. Paz Sesmero, Agapito I. Ledezma, Araceli Sanchis, "Generating ensembles of heterogeneous classifiers using Stacked Generalization," WIREs Data Mining and Knowledge Discovery 5: 21-34 (2015) (paper download link, password: c7rf)
7. A classic: Stacked Generalization (Stacking)

## Building a Stacking Model for a Classification Problem

### Code

```python
# -*- coding:utf-8 -*-
# Author:哈士奇说喵
# Two-level stacking learning
import numpy as np
from sklearn import metrics, preprocessing
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Load the dataset and split it into training and test data.
# NOTE: the original post does not show how `data` is loaded. load_digits is
# an assumption here; any dataset object with `.data` and `.target` works.
data = load_digits()
data_D = preprocessing.StandardScaler().fit_transform(data.data)
data_L = data.target
data_train, data_test, label_train, label_test = train_test_split(
    data_D, data_L, random_state=1, test_size=0.7)


def SelectModel(modelname):
    """Return an unfitted first-level classifier for the given name."""
    if modelname == "SVM":
        # probability=True is required so that predict_proba is available
        model = SVC(kernel='rbf', C=16, gamma=0.125, probability=True)
    elif modelname == "GBDT":
        # The body of this branch is missing in the original post;
        # scikit-learn's GradientBoostingClassifier is assumed.
        model = GradientBoostingClassifier()
    elif modelname == "RF":
        model = RandomForestClassifier()
    elif modelname == "XGBOOST":
        # Optional dependency, imported only when requested
        import xgboost as xgb
        model = xgb.XGBClassifier()
    elif modelname == "KNN":
        model = KNeighborsClassifier()
    else:
        raise ValueError("Unknown model name: %s" % modelname)
    return model


def get_oof(clf, n_folds, X_train, y_train, X_test):
    """Build out-of-fold class-probability features for one classifier."""
    ntrain = X_train.shape[0]
    ntest = X_test.shape[0]
    classnum = len(np.unique(y_train))
    # random_state only takes effect when shuffle=True
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    oof_train = np.zeros((ntrain, classnum))
    oof_test = np.zeros((ntest, classnum))

    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        kf_X_train = X_train[train_index]  # data
        kf_y_train = y_train[train_index]  # labels
        kf_X_test = X_train[test_index]    # validation fold of the k-fold split

        clf.fit(kf_X_train, kf_y_train)
        # Each training row is predicted by a model that never saw it
        oof_train[test_index] = clf.predict_proba(kf_X_test)
        # Accumulate the test-set predictions of every fold model
        oof_test += clf.predict_proba(X_test)

    # Average the test-set predictions over the folds
    oof_test = oof_test / float(n_folds)
    return oof_train, oof_test


# Baseline: a single classifier on its own
clf_second = RandomForestClassifier()
clf_second.fit(data_train, label_train)
pred = clf_second.predict(data_test)
accuracy = metrics.accuracy_score(label_test, pred) * 100
print(accuracy)
# 91.0969793323 (result reported in the original post)

# Stacking
# Level 1: rebuild the features that serve as the level-2 training set
modelist = ['SVM', 'GBDT', 'RF', 'KNN']
newfeature_list = []
newtestdata_list = []
for modelname in modelist:
    clf_first = SelectModel(modelname)
    oof_train_, oof_test_ = get_oof(clf=clf_first, n_folds=10,
                                    X_train=data_train, y_train=label_train,
                                    X_test=data_test)
    newfeature_list.append(oof_train_)
    newtestdata_list.append(oof_test_)

# Combine the features of all first-level models column-wise
newfeature = np.concatenate(newfeature_list, axis=1)
newtestdata = np.concatenate(newtestdata_list, axis=1)

# Level 2: train on the outputs of the previous level
clf_second1 = RandomForestClassifier()
clf_second1.fit(newfeature, label_train)
pred = clf_second1.predict(newtestdata)
accuracy = metrics.accuracy_score(label_test, pred) * 100
print(accuracy)
# 96.4228934817 (result reported in the original post)
```
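For comparison, recent scikit-learn versions ship a `StackingClassifier` that handles the out-of-fold step internally. The following is only a minimal sketch, not the author's method: it assumes `load_digits` as a stand-in dataset, uses a reduced set of first-level models (RF and KNN) to keep the run short, and fixes `random_state` for repeatability. One behavioral difference is worth knowing: `StackingClassifier` refits each base model on the full training set for test-time prediction, instead of averaging the fold models as `get_oof` does.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for the dataset of the post
X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.7)

stack = StackingClassifier(
    # First-level models (a reduced, illustrative selection)
    estimators=[
        ("rf", RandomForestClassifier(random_state=1)),
        ("knn", KNeighborsClassifier()),
    ],
    # Second-level model trained on the out-of-fold probabilities
    final_estimator=RandomForestClassifier(random_state=1),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test) * 100
print(stack_accuracy)
```

Setting `passthrough=True` on `StackingClassifier` would additionally feed the original features to the second-level model.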


### Pay Attention

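1. Only two levels of stacking are used here, which is enough for a basic stacking setup. Three, four, or more levels can be built in the same way.
2. The input features of the second level have changed (the predictions of the first-level classifiers form the new features), so the test set must go through the same transformation. The second-level classifier is trained on a different training set, so what it learns cannot be applied to the old test set. Keep in mind that the training and test sets were originally split at random from the whole dataset!
3. Using k-fold is essentially the idea of cross-validation, so no data is leaked (the test set is never used; the predictions come from the hold-out fold within the training set). This is why the method is also called out-of-folds.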
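The out-of-fold point can be made concrete with a small sketch. It assumes `load_digits` as a stand-in dataset and uses scikit-learn's `cross_val_predict` instead of the hand-written loop; both choices are illustrative. It compares out-of-fold predictions with in-sample predictions from a model that has already seen the rows it predicts.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

# Assumption: load_digits stands in for the dataset of the post
X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(random_state=1)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Out-of-fold: every row is predicted by a model that was not trained on it
oof_proba = cross_val_predict(clf, X, y, cv=kf, method="predict_proba")
oof_acc = np.mean(oof_proba.argmax(axis=1) == y)

# In-sample: the model predicts the very rows it was fitted on (leaky)
in_proba = clf.fit(X, y).predict_proba(X)
in_acc = np.mean(in_proba.argmax(axis=1) == y)

print(oof_proba.shape, oof_acc, in_acc)
```

The in-sample probabilities look far more confident than the model really is on unseen data. A second-level model trained on them would learn to over-trust the first level, which is what the out-of-fold construction avoids.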

### Further

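1. The original features can be merged with the new features, which amounts to feature augmentation. Take care to standardize or normalize afterwards, since the original features and the posterior-probability features produced by the first level are on different scales. Try it yourself~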
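A minimal sketch of this feature-augmentation idea follows. Several things in it are assumptions rather than part of the original post: `load_digits` as the dataset, a single first-level model, `LogisticRegression` as a scale-sensitive second-level model, and test features taken from one model refit on the full training set (a simplification of the fold averaging used in `get_oof`).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for the dataset of the post
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.7)

# Level 1: out-of-fold probabilities for the training rows
first = RandomForestClassifier(random_state=1)
oof_train = cross_val_predict(first, X_train, y_train, cv=5,
                              method="predict_proba")
# Simplification: test features come from one model refit on all training rows
oof_test = first.fit(X_train, y_train).predict_proba(X_test)

# Feature augmentation: original features next to the new probability features
aug_train = np.hstack([X_train, oof_train])
aug_test = np.hstack([X_test, oof_test])

# Put both feature groups on a common scale; fit the scaler on training data only
scaler = StandardScaler().fit(aug_train)
aug_train = scaler.transform(aug_train)
aug_test = scaler.transform(aug_test)

# Level 2: a scale-sensitive model (illustrative choice)
second = LogisticRegression(max_iter=2000)
second.fit(aug_train, y_train)
aug_accuracy = second.score(aug_test, y_test) * 100
print(aug_train.shape, aug_accuracy)
```

Tree-based second-level models such as `RandomForestClassifier` are largely insensitive to feature scale, so the scaling step matters mainly for linear, distance-based, or kernel models.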

### Acknowledgements

• Self-Trained Stacking Model for Semi-Supervised Learning
