What is Stacking
Stacking uses several different classifiers to make predictions on the training set, then feeds those predictions to a second-level (meta) classifier as input features. The meta-classifier's output is the final prediction of the whole model.
Stacking therefore trains two layers of classifiers: the first-level base classifiers (e.g. decision tree + KNN + neural network + logistic regression) and the second-level meta-classifier.
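As a sketch of what the two layers actually do, here is a minimal manual stacking implementation (separate from the mlxtend code below): out-of-fold predictions from the base classifiers become the input features of the meta-classifier. The train/test split and cv=3 are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

base_learners = [KNeighborsClassifier(n_neighbors=1),
                 DecisionTreeClassifier(random_state=0),
                 LogisticRegression(max_iter=200)]

# Level 1: out-of-fold predictions become the meta-features, so the
# meta-classifier never sees a prediction made on data the base
# classifier was trained on
meta_features = np.column_stack([
    cross_val_predict(clf, x_train, y_train, cv=3)
    for clf in base_learners
])

# Level 2: train the meta-classifier on the stacked predictions
meta_clf = LogisticRegression().fit(meta_features, y_train)

# At predict time, each base learner is refit on the full training set
test_features = np.column_stack([
    clf.fit(x_train, y_train).predict(x_test)
    for clf in base_learners
])
print("Test accuracy: %0.2f" % meta_clf.score(test_features, y_test))
```

The key point is the use of cross_val_predict: training the meta-classifier on in-sample base predictions would leak label information and overfit.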
Code implementation

```python
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from mlxtend.classifier import StackingClassifier  # pip install mlxtend
import numpy as np

# Load the dataset
iris = datasets.load_iris()
# Use only the features in columns 1 and 2
x_data, y_data = iris.data[:, 1:3], iris.target

# Define three different base classifiers
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = DecisionTreeClassifier()
clf3 = LogisticRegression()

# Define a meta-classifier
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)

for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN', 'Decision Tree', 'LogisticRegression', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, x_data, y_data, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f [%s]" % (scores.mean(), label))
```
cross_val_score performs 3-fold cross-validation; the printed values are the mean accuracies:

Accuracy: 0.91 [KNN]
Accuracy: 0.93 [Decision Tree]
Accuracy: 0.95 [LogisticRegression]
Accuracy: 0.95 [StackingClassifier]
Stacking generally achieves higher accuracy than the individual base algorithms.
Next, a voting-based ensemble method.
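For reference, scikit-learn (0.22 and later) also ships its own sklearn.ensemble.StackingClassifier, which fits the meta-classifier on out-of-fold predictions internally. A sketch on the same two iris features (the cv and estimator choices here are illustrative):

```python
from sklearn import datasets
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

x_data, y_data = datasets.load_iris(return_X_y=True)
x_data = x_data[:, 1:3]  # same two features as above

# sklearn's API takes (name, estimator) pairs and a final_estimator;
# cv=3 controls the internal out-of-fold predictions
sclf = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=1)),
                ('dtree', DecisionTreeClassifier()),
                ('lr', LogisticRegression())],
    final_estimator=LogisticRegression(),
    cv=3)

scores = cross_val_score(sclf, x_data, y_data, cv=3, scoring='accuracy')
print("Accuracy: %0.2f [sklearn StackingClassifier]" % scores.mean())
```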
```python
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
import numpy as np

# Load the dataset
iris = datasets.load_iris()
# Use only the features in columns 1 and 2
x_data, y_data = iris.data[:, 1:3], iris.target

# Define three different base classifiers
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = DecisionTreeClassifier()
clf3 = LogisticRegression()

sclf = VotingClassifier([('knn', clf1), ('dtree', clf2), ('lr', clf3)])

for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN', 'Decision Tree', 'LogisticRegression', 'VotingClassifier']):
    scores = model_selection.cross_val_score(clf, x_data, y_data, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f [%s]" % (scores.mean(), label))
```
Accuracy: 0.91 [KNN]
Accuracy: 0.93 [Decision Tree]
Accuracy: 0.91 [LogisticRegression]
Accuracy: 0.93 [VotingClassifier]
Voting does not require training a meta-classifier; the base classifiers' predictions are simply combined by majority vote.
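VotingClassifier defaults to hard voting (a majority vote over predicted labels), and also supports voting='soft', which averages the predict_proba outputs and often helps when the base classifiers produce reasonable probabilities. A sketch comparing the two on the same data (exact scores may vary):

```python
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

x_data, y_data = datasets.load_iris(return_X_y=True)
x_data = x_data[:, 1:3]  # same two features as above

estimators = [('knn', KNeighborsClassifier(n_neighbors=1)),
              ('dtree', DecisionTreeClassifier()),
              ('lr', LogisticRegression())]

# 'hard': count predicted labels; 'soft': average class probabilities
for voting in ['hard', 'soft']:
    vc = VotingClassifier(estimators, voting=voting)
    scores = cross_val_score(vc, x_data, y_data, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f [%s voting]" % (scores.mean(), voting))
```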
Summary: ensemble learning
"Many hands make light work."
A good ensemble needs diversity + accuracy among its base learners.