Ensemble learning is commonly divided into bagging, boosting, and stacking. Stacking works as follows: first, m base classifiers (or regressors) are trained on the training set; the predictions of these m classifiers (labels or probabilities) are then fed as new features into a Meta-Classifier, which is trained on them. The Meta-Classifier is typically a LogisticRegression. At prediction time, each sample first passes through all m base classifiers to produce the new features (labels or probabilities), and those features are fed into the Meta-Classifier to obtain the final prediction.
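The mechanism described above can be sketched by hand before reaching for mlxtend. This is a minimal illustration (the names base_models and meta_model are mine, not mlxtend's), using predicted labels as the meta-features, exactly as in the description:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: train m base classifiers on the training set (m = 2 here).
base_models = [KNeighborsClassifier(n_neighbors=1), GaussianNB()]
for m in base_models:
    m.fit(X, y)

# Step 2: their predicted labels become the new features, one column per model.
meta_features = np.column_stack([m.predict(X) for m in base_models])

# Step 3: train the Meta-Classifier on those new features.
meta_model = LogisticRegression().fit(meta_features, y)

# Prediction: run samples through all base models, then through the Meta-Classifier.
new_meta = np.column_stack([m.predict(X[:5]) for m in base_models])
print(meta_model.predict(new_meta))
```

Note that this simple scheme builds the meta-features on the same data the base models were fit on; mlxtend packages the same workflow behind a scikit-learn-style fit/predict interface.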
Example 1: Simply using predicted labels as the new features
Required packages
import itertools
import matplotlib.gridspec as gridspec
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
import numpy as np
from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn import datasets
Load the Iris data
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
Define 3 base classifiers and a Meta-Classifier (LogisticRegression here)
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)
Evaluate the models with 10-fold cross-validation
print('10-fold cross validation:\n')
for clf, clf_name in zip([clf1, clf2, clf3, sclf],
                         ['knn', 'random forest', 'naive bayes', 'stacking']):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=10, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), clf_name))
Output
Accuracy: 0.92 (+/- 0.07) [knn]
Accuracy: 0.94 (+/- 0.06) [random forest]
Accuracy: 0.91 (+/- 0.05) [naive bayes]
Accuracy: 0.93 (+/- 0.08) [stacking]
Visualize the decision boundaries of the four classifiers
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))
for clf, lab, grd in zip([clf1, clf2, clf3, sclf],
                         ['KNN',
                          'Random Forest',
                          'Naive Bayes',
                          'StackingClassifier'],
                         itertools.product([0, 1], repeat=2)):
    print('fit {}..'.format(lab))
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
plt.show()
Example 2: Using the base classifiers' output probabilities as the new features
Only the parameters need to change: set use_probas=True. If average_probas=True, the probabilities output by the classifiers are averaged class-by-class; otherwise they are concatenated. For example, suppose two base classifiers output the following probabilities on a 3-class problem:
- classifier 1: [0.2, 0.5, 0.3]
- classifier 2: [0.3, 0.4, 0.3]
With average_probas=True the new features are
- [0.25, 0.45, 0.3]
otherwise they are
- [0.2, 0.5, 0.3, 0.3, 0.4, 0.3]
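The two options can be demonstrated with NumPy on a pair of illustrative probability vectors for a 3-class problem:

```python
import numpy as np

# Illustrative per-sample probabilities from two base classifiers (3 classes each).
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.3])

# average_probas=True: class-wise mean -> 3 meta-features
avg = (p1 + p2) / 2

# average_probas=False: concatenation -> 6 meta-features
stacked = np.concatenate([p1, p2])

print(avg)
print(stacked)
```

Concatenation keeps each base classifier's opinion separate, so the Meta-Classifier can learn to weight the classifiers differently; averaging discards that distinction but keeps the feature dimension small.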
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          use_probas=True,
                          average_probas=False,
                          meta_classifier=lr)
print('10-fold cross validation:\n')
for clf, clf_name in zip([clf1, clf2, clf3, sclf], ['knn', 'random forest', 'naive bayes', 'stacking']):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=10, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), clf_name))
To be continued…