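Learning objectives:
Learn the theory behind stacking, the accompanying code, and the code for Case Study 1.
Learning content:
Blending ensemble learning algorithm
Blending is a simplified version of Stacking.
How Blending works:
(1) Split the data into a training set and a test set (test_set), then split the training set again into a training set (train_set) and a validation set (val_set);
(2) Create several first-layer models; they can be homogeneous or heterogeneous;
(3) Train the models from step 2 on train_set, then use the trained models to predict val_set and test_set, giving val_predict and test_predict1;
(4) Create the second-layer model and train it with val_predict as its training features (the val_set labels are the targets);
(5) Use the trained second-layer model to predict on the second-layer test set test_predict1; this result is the final prediction for the whole test set.
Let's apply this ensemble method to an example: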
# Load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import seaborn as sns
# Create the data
from sklearn import datasets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0)
## Create the training set and the test set
X_train1,X_test,y_train1,y_test = train_test_split(data, target, test_size=0.2, random_state=1)
## Split the training set again into a training set and a validation set
X_train,X_val,y_train,y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1)
print("The shape of training X:",X_train.shape)
print("The shape of training y:",y_train.shape)
print("The shape of test X:",X_test.shape)
print("The shape of test y:",y_test.shape)
print("The shape of validation X:",X_val.shape)
print("The shape of validation y:",y_val.shape)
The shape of training X: (5600, 2)
The shape of training y: (5600,)
The shape of test X: (2000, 2)
The shape of test y: (2000,)
The shape of validation X: (2400, 2)
The shape of validation y: (2400,)
# Set up the first-layer classifiers
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
clfs = [SVC(probability = True),RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),KNeighborsClassifier()]
# Set up the second-layer model (a LinearRegression is used as the blender here, so it outputs continuous scores rather than class labels)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Compute the first-layer predictions on the validation set and the test set
val_features = np.zeros((X_val.shape[0],len(clfs))) # initialise the validation-set meta-features
test_features = np.zeros((X_test.shape[0],len(clfs))) # initialise the test-set meta-features
for i,clf in enumerate(clfs):
    clf.fit(X_train,y_train)
    # Use the predicted probability of class 1 as the meta-feature
    val_feature = clf.predict_proba(X_val)[:, 1]
    test_feature = clf.predict_proba(X_test)[:,1]
    val_features[:,i] = val_feature
    test_features[:,i] = test_feature
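# Feed the first-layer validation-set predictions into the second layer to train it
lr.fit(val_features,y_val)
# Evaluate the result
# Note: cross_val_score clones lr and refits it on folds of test_features, so the scores
# (R^2 for LinearRegression) come from 5-fold CV on the test-set meta-features, not from the lr fitted above.
from sklearn.model_selection import cross_val_score
cross_val_score(lr,test_features,y_test,cv=5)
The score is very good on every cross-validation fold, so this ensemble method is very effective on this dataset. The dataset is synthetic, however, so try the method on real data to see how it performs.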
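As a complement to the cross-validation call, the second-layer model that was fitted on the validation meta-features can also be scored directly on the test meta-features, which matches step (5) of the Blending procedure more literally. The following is only a minimal sketch under assumptions that are not in the original: a smaller synthetic dataset with hand-picked centres, three lightweight first-layer classifiers, and a LogisticRegression blender so that plain accuracy can be reported. The names `val_meta`, `test_meta`, `blender`, and `blend_acc` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumed toy data: two well-separated blobs (not the dataset used above).
X, y = make_blobs(n_samples=2000, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=1)

# Step 1: train/test split, then train/validation split.
X_tr_full, X_te, y_tr_full, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr_full, y_tr_full, test_size=0.3, random_state=1)

# Step 2: first-layer models (an assumed, lightweight selection).
base_models = [KNeighborsClassifier(),
               RandomForestClassifier(n_estimators=20, random_state=1),
               GaussianNB()]

# Step 3: fit on train_set, predict val_set and test_set.
val_meta = np.zeros((X_va.shape[0], len(base_models)))
test_meta = np.zeros((X_te.shape[0], len(base_models)))
for i, model in enumerate(base_models):
    model.fit(X_tr, y_tr)
    val_meta[:, i] = model.predict_proba(X_va)[:, 1]
    test_meta[:, i] = model.predict_proba(X_te)[:, 1]

# Step 4: fit the second layer on the validation meta-features.
blender = LogisticRegression()
blender.fit(val_meta, y_va)

# Step 5: predict the test meta-features with the fitted blender.
blend_acc = accuracy_score(y_te, blender.predict(test_meta))
print("Blending test accuracy: %.3f" % blend_acc)
```

Unlike `cross_val_score(lr, test_features, y_test, cv=5)`, this never refits the second layer on test data, so the reported number reflects the blender exactly as it was trained.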
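Stacking ensemble learning algorithm
The Stacking procedure works as follows (the colours refer to the accompanying figure):
- First split the full dataset into a training set and a test set (say 10000 training rows and 2500 test rows). The upper layer then runs 5-fold cross-validation: in each fold, 8000 rows of the training set are used for training and the remaining 2000 rows for validation (orange).
- Each fold trains one model on the 8000 blue rows, predicts the 2000 validation rows, and also predicts the test set, giving 2500 predictions. After the 5 folds we have 5 * 2000 validation predictions (the orange blocks in the middle, i.e. one prediction for every training row) and 5 * 2500 test-set predictions.
- The 5 * 2000 validation predictions are concatenated into a matrix with 10000 rows, denoted A1. The 5 * 2500 test-set predictions are combined by a weighted average into a single column of 2500 rows, denoted B1.
- A1 and B1 are the predictions of one base model. Ensembling 3 base models therefore yields six matrices: A1, A2, A3, B1, B2, B3.
- A1, A2, A3 are placed side by side to form a 10000-row, 3-column matrix that serves as the training data, and B1, B2, B3 are combined into a 2500-row, 3-column matrix that serves as the testing data. The lower-layer learner is then trained on these data.
- This retraining uses each base model's predictions as features (three features). The meta-learner learns which weights w to assign to the base learners' predictions so that the final prediction is as accurate as possible.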
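The construction of the A and B matrices described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the exact procedure behind the figure: the data are synthetic and scaled down to 1000 training rows and 250 test rows, the three base models are arbitrary choices, and the test-set predictions of the 5 folds are combined with a plain (equal-weight) mean. The names `A`, `B`, `meta`, and `stack_acc` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data: 1000 training rows and 250 test rows.
X, y = make_classification(n_samples=1250, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=250, random_state=0)

base_models = [KNeighborsClassifier(),
               DecisionTreeClassifier(max_depth=3, random_state=0),
               GaussianNB()]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# A holds out-of-fold predictions (columns A1..A3),
# B holds fold-averaged test predictions (columns B1..B3).
A = np.zeros((X_train.shape[0], len(base_models)))
B = np.zeros((X_test.shape[0], len(base_models)))

for j, model in enumerate(base_models):
    test_preds = np.zeros((X_test.shape[0], kf.get_n_splits()))
    for k, (fit_idx, val_idx) in enumerate(kf.split(X_train)):
        # Train on 4/5 of the training rows, predict the held-out 1/5.
        fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        A[val_idx, j] = fold_model.predict_proba(X_train[val_idx])[:, 1]
        # Every fold model also predicts the full test set.
        test_preds[:, k] = fold_model.predict_proba(X_test)[:, 1]
    # Equal-weight average of the 5 test predictions (an assumption).
    B[:, j] = test_preds.mean(axis=1)

# The lower-layer learner is trained on A and evaluated on B.
meta = LogisticRegression().fit(A, y_train)
stack_acc = meta.score(B, y_test)
print(A.shape, B.shape)  # (1000, 3) (250, 3)
```

Because every row of `A` is predicted by a model that never saw that row during training, the lower-layer learner is fitted on predictions that resemble what the base models produce on unseen data.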
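How Stacking is used in practice:
The examples below use StackingCVClassifier from the mlxtend package (pip install mlxtend). Note that scikit-learn 0.22 and later also ships its own sklearn.ensemble.StackingClassifier.
# 1. Simple stacking: 3-fold CV classification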
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
RANDOM_SEED = 42
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
# Starting from v0.16.0, StackingCVClassifier supports
# `random_state` to get deterministic result.
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],  # first-layer classifiers
                            meta_classifier=lr,  # second-layer classifier
                            random_state=RANDOM_SEED)
print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf], ['KNN', 'Random Forest', 'Naive Bayes','StackingClassifier']):
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
3-fold cross validation:
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.93 (+/- 0.02) [StackingClassifier]
# Plot the decision boundaries
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
for clf, lab, grd in zip([clf1, clf2, clf3, sclf],
                         ['KNN',
                          'Random Forest',
                          'Naive Bayes',
                          'StackingCVClassifier'],
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
plt.show()
# 2. Use probabilities as meta-features
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            use_probas=True,  # feed class probabilities, not labels, to the meta-classifier
                            meta_classifier=lr,
                            random_state=42)
print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingClassifier']):
    scores = cross_val_score(clf, X, y,
                             cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
3-fold cross validation:
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.95 (+/- 0.02) [StackingClassifier]
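For readers who prefer to stay within scikit-learn, the same idea of using class probabilities as meta-features can be sketched with `sklearn.ensemble.StackingClassifier` (available in scikit-learn 0.22 and later) by setting `stack_method="predict_proba"`. This is an assumed equivalent setup, not the mlxtend code above, so the scores need not match the output shown. The names `sk_stack`, `sk_scores`, `X_iris`, and `y_iris` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_iris, y_iris = iris.data[:, 1:3], iris.target  # same two features as above

sk_stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=1)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # class probabilities become the meta-features
    cv=3,                          # internal CV used to build the meta-features
)

sk_scores = cross_val_score(sk_stack, X_iris, y_iris, cv=3, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f) [sklearn StackingClassifier]"
      % (sk_scores.mean(), sk_scores.std()))
```

Here the inner `cv=3` controls how the out-of-fold meta-features are produced, while the outer `cv=3` in `cross_val_score` only evaluates the whole stacked model.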
# 3. Stacking with 5-fold CV classification and grid search (hyperparameter tuning via grid search)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingCVClassifier
# Initializing models
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=42)
params = {
    'kneighborsclassifier__n_neighbors': [1, 5],
    'randomforestclassifier__n_estimators': [10, 50],
    'meta_classifier__C': [0.1, 10.0]}
grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
0.947 +/- 0.03 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.933 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
Accuracy: 0.95
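# If the same algorithm is used more than once (here the KNN classifier clf1 appears twice), add a numeric suffix to its name in the parameter grid, as shown below: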
from sklearn.model_selection import GridSearchCV
# Initializing models
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)
params = {
    'kneighborsclassifier-1__n_neighbors': [1, 5],
    'kneighborsclassifier-2__n_neighbors': [1, 5],
    'randomforestclassifier__n_estimators': [10, 50],
    'meta_classifier__C': [0.1, 10.0]}
grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.960 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.960 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
Accuracy: 0.96
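# 4. Stacking classifiers that operate on different feature subsets
## Different level-1 classifiers can be fit to different feature subsets of the training data. The following example shows how to do this with scikit-learn pipelines and ColumnSelector: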
from sklearn.datasets import load_iris
from mlxtend.classifier import StackingCVClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X = iris.data
y = iris.target
pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),  # select columns 0 and 2
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),  # select columns 1, 2 and 3
                      LogisticRegression())
sclf = StackingCVClassifier(classifiers=[pipe1, pipe2],
                            meta_classifier=LogisticRegression(),
                            random_state=42)
sclf.fit(X, y)
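The per-model column selection can also be sketched with scikit-learn alone. The example below is an assumed equivalent, not the mlxtend implementation: it selects columns with a `ColumnTransformer` whose single `"passthrough"` step keeps the requested columns and drops the rest, and it stacks the two pipelines with `sklearn.ensemble.StackingClassifier`. The helper `column_subset` and the names `pipe_a`, `pipe_b`, `stack`, and `preds` are hypothetical.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def column_subset(cols):
    """Hypothetical helper: keep only the given column indices.

    ColumnTransformer drops every column that is not listed
    (remainder='drop' is its default).
    """
    return ColumnTransformer([("select", "passthrough", list(cols))])


iris = load_iris()
X, y = iris.data, iris.target

# Each level-1 pipeline sees a different subset of the four features.
pipe_a = make_pipeline(column_subset((0, 2)),
                       LogisticRegression(max_iter=1000))
pipe_b = make_pipeline(column_subset((1, 2, 3)),
                       LogisticRegression(max_iter=1000))

stack = StackingClassifier(
    estimators=[("pipe_a", pipe_a), ("pipe_b", pipe_b)],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
preds = stack.predict(X)
print(column_subset((0, 2)).fit_transform(X).shape)  # (150, 2)
```

Keeping the column selection inside each pipeline means the stacked model still accepts the full feature matrix at prediction time, and each base model picks out its own columns.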
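# 5. ROC curve with decision_function
### Like other scikit-learn classifiers, StackingCVClassifier has a decision_function method that can be used to plot ROC curves.
### Note that decision_function expects and requires the meta-classifier to implement decision_function.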
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
iris = datasets.load_iris()
X, y = iris.data[:, [0, 1]], iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
RANDOM_SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=RANDOM_SEED)
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = SVC(random_state=RANDOM_SEED)
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(sclf)
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
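Blending vs. Stacking:
Advantages of Blending:
- It is simpler than stacking (no k-fold cross-validation is needed to obtain the stacker features)
Disadvantages of Blending:
- It uses less data (the second layer is trained only on a hold-out set split off from the training data, not on CV predictions)
- The blender may overfit (most likely as a consequence of the first point)
- Stacking, which uses several CV folds, is more robust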
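The "less data" point can be made concrete by counting how many rows the second-layer model receives under each scheme. The following is a small sketch under assumptions not taken from the original: a synthetic dataset of 1000 rows, a single KNN base model, and integer split sizes chosen for readability. The names `blend_meta` and `stack_meta` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed toy data: 1000 rows, 200 of which are set aside as the test set.
X, y = make_blobs(n_samples=1000, centers=2, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=200, random_state=1)

# Blending: the base model is fitted on 560 rows and only the
# 240 hold-out rows produce meta-features for the second layer.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_dev, y_dev, test_size=240, random_state=1)
blend_meta = (KNeighborsClassifier()
              .fit(X_fit, y_fit)
              .predict_proba(X_hold)[:, 1])

# Stacking: 5-fold out-of-fold predictions give a meta-feature
# for every one of the 800 development rows.
stack_meta = cross_val_predict(
    KNeighborsClassifier(), X_dev, y_dev, cv=5,
    method="predict_proba")[:, 1]

print(len(blend_meta), len(stack_meta))  # 240 800
```

With the same development data, the stacking meta-learner here sees more than three times as many rows as the blender, at the cost of fitting the base model five times instead of once.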
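Ensemble learning Case Study 1 (happiness prediction)
Background
This case study is a baseline for a data-mining competition: happiness prediction. The competition data come from the survey results of the official Chinese General Social Survey (CGSS). The data contain 139 feature dimensions, including individual variables (gender, age, region, occupation, health, marital status, political affiliation, etc.), family variables (parents, spouse, children, family capital, etc.), and social attitudes (fairness, trust, public services).
Data information
The task is to predict individual happiness from these 139 features using a little over 8000 samples (the target takes the values 1, 2, 3, 4, 5, where 1 is the lowest and 5 the highest level of happiness). Because there are many variables and some of them are related in complex ways, the data are provided in a complete version and a condensed version. You can start with the condensed version to get familiar with the task and then use the complete version to mine more information. Here I use the complete version directly. The competition also provides an index file that gives the questionnaire item behind each variable and the meaning of its values, and a survey file containing the original questionnaire as supplementary background.
Evaluation metric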
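The final evaluation metric is the mean squared error (MSE), i.e.:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the true value and $\hat{y}_i$ the predicted value for sample $i$.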
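MSE is the mean of the squared differences between the true and predicted values. As a quick illustration, the sketch below computes it by hand and with `sklearn.metrics.mean_squared_error` on made-up values; `y_true` and `y_pred` are hypothetical and are not competition data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical happiness labels (1-5) and model predictions.
y_true = np.array([4, 3, 5, 2, 4])
y_pred = np.array([3.5, 3.0, 4.0, 2.5, 5.0])

# Squared errors: 0.25, 0, 1, 0.25, 1 -> mean = 2.5 / 5 = 0.5
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # 0.5 0.5
```

Because the errors are squared, a prediction that is off by 2 costs four times as much as one that is off by 1, so large misses on the 1 to 5 scale dominate the score.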
Import packages
import os
import time
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model