Ensemble Algorithms
Overview:
Ensemble algorithms: combine multiple models so that the whole performs better than any single one.
During training, several classifiers are fitted on the same task.
During testing, each test sample is run through the different classifiers separately, and their outputs are aggregated into a final result.
Voting:
1. Data preparation:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
x, y = make_moons(n_samples=500, noise=0.3, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
Hard voting: take the majority class (the mode of the predictions)
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC  # SVM classifier
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svc_clf = SVC(random_state=42)
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svc_clf)],
                              voting='hard')
voting_clf.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
The comparison shows the ensemble's accuracy is indeed higher:
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912
Soft voting: weighted average of the predicted class probabilities
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC  # SVM classifier
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svc_clf = SVC(random_state=42, probability=True)
# the base classifiers must be able to output probability estimates
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svc_clf)],
                              voting='soft')
voting_clf.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
Result comparison:
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
Bagging (bootstrap aggregation):
Train multiple classifiers and average their results.
Error damping: like voltage across a parallel circuit, the error is damped in parallel, since the models are trained independently and then averaged.
The randomness is twofold: 1. random sampling of the data; 2. random selection of features (see the sketch after the advantages below).
Advantages: 1. Handles high-dimensional data without needing explicit feature selection.
2. After training, it reports which features are important.
3. Easy to parallelize, so training is fast.
4. Can be visualized, which makes analysis easier.
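A minimal sketch of the two randomness axes using sklearn's BaggingClassifier; the dataset, the name bag_demo, and the 0.8 sampling ratios are illustrative assumptions, not from the notes:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# synthetic data with enough features for column subsampling to matter
xb, yb = make_classification(n_samples=500, n_features=10, random_state=42)
bag_demo = BaggingClassifier(DecisionTreeClassifier(),
                             n_estimators=100,
                             max_samples=0.8, bootstrap=True,            # 1. random data sampling (rows)
                             max_features=0.8, bootstrap_features=True,  # 2. random feature selection (columns)
                             random_state=42)
bag_demo.fit(xb, yb)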
Typical examples:
1. Random forest:
random features, random data samples
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rf_clf.feature_importances_):
    print(name, score)
sepal length (cm) 0.11249225099876375
sepal width (cm) 0.02311928828251033
petal length (cm) 0.4410304643639577
petal width (cm) 0.4233579963547682
All the features can be fed into training; the forest itself then shows which ones matter.
2. KNN (a sketch follows, since the notes give no snippet for it)
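A minimal sketch of bagging with KNN base estimators, assuming the moons split (x_train, x_test, y_train, y_test) from the voting section; the name knn_bag_clf and the parameter values are illustrative:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn_bag_clf = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),  # KNN as the base estimator
                                n_estimators=50,
                                max_samples=0.8,  # each KNN fits a random 80% bootstrap sample
                                bootstrap=True,
                                n_jobs=-1,
                                random_state=42)
knn_bag_clf.fit(x_train, y_train)
print(accuracy_score(y_test, knn_bag_clf.predict(x_test)))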
Code implementation
Comparing the accuracy of a single decision tree against bagging:
1. Bagging:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            max_samples=100,
                            bootstrap=True,
                            n_jobs=-1,  # number of worker threads; -1 uses all available CPU cores
                            random_state=42
                            )
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)  # 0.904
2. Decision tree:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(x_train, y_train)
y_pred_tree = tree_clf.predict(x_test)
accuracy_score(y_test, y_pred_tree)  # 0.856
Bagging also helps prevent overfitting (the out-of-bag sketch below is one way to check this).
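One way to estimate generalization without touching the test set is out-of-bag (OOB) evaluation, a standard BaggingClassifier option; a sketch reusing the setup above (the name bag_oob_clf is illustrative):
bag_oob_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=500,
                                max_samples=100,
                                bootstrap=True,
                                oob_score=True,  # score each tree on the samples left out of its bootstrap
                                n_jobs=-1,
                                random_state=42)
bag_oob_clf.fit(x_train, y_train)
print(bag_oob_clf.oob_score_)  # estimate of generalization accuracy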
Boosting:
Start from weak learners and strengthen them step by step, training with reweighted samples.
Typical examples: AdaBoost, XGBoost
AdaBoost
Adjusts the sample weights according to the previous round's classification results:
samples that were misclassified in the previous round receive larger weights in the next round.
Code implementation
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200,
                             learning_rate=0.5,
                             random_state=42
                             )
ada_clf.fit(x_train, y_train)
y_pred = ada_clf.predict(x_test)
accuracy_score(y_test, y_pred)  # 0.896
Gradient Boosting
With a small learning rate, each tree contributes less, so more iterations are needed;
with a large learning rate, each tree contributes more, so fewer iterations are needed (a quick sketch follows).
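A minimal sketch of that trade-off; the names gbrt_slow/gbrt_fast and the parameter values are illustrative assumptions, chosen only to contrast the two regimes:
from sklearn.ensemble import GradientBoostingRegressor
# small learning rate: each tree corrects a little, so many trees are needed
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                      learning_rate=0.1, random_state=42)
# large learning rate: each tree corrects a lot, so far fewer trees suffice
gbrt_fast = GradientBoostingRegressor(max_depth=2, n_estimators=20,
                                      learning_rate=1.0, random_state=42)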
Early stopping strategy
More iterations are not always better:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# More iterations are not necessarily better; the validation error can bounce back up.
# Note: x, y are reused from the moons data above (the plot axes below hint that the
# source notebook used a separate 1-D regression dataset at this point).
x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=49)
gbrt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=120,
                                 random_state=42
                                 )
gbrt.fit(x_train, y_train)
# mean squared error on the validation set at every boosting stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(x_val)]
# staged_predict yields the ensemble's predictions after each stage (tree)
bst_n_estimators = np.argmin(errors)  # stage with the smallest validation MSE
gbrt_best = GradientBoostingRegressor(max_depth=2,
                                      n_estimators=bst_n_estimators,
                                      random_state=42
                                      )
# best: refit with the n_estimators value that minimizes the validation MSE
gbrt_best.fit(x_train, y_train)
GradientBoostingRegressor(max_depth=2, n_estimators=55, random_state=42)
Plotting:
import matplotlib.pyplot as plt
min_error = np.min(errors)  # lowest validation MSE, used for the dashed guide lines
plt.figure(figsize=(11, 4))
plt.subplot(121)
plt.plot(errors, 'b.-')
plt.plot([bst_n_estimators, bst_n_estimators], [0, min_error], 'k--')
plt.plot([0, 120], [min_error, min_error], 'k--')
plt.axis([0, 120, 0, 0.01])
plt.title('Val Error')
plt.subplot(122)
# plot_predictions is a plotting helper from the source notebook (not defined here)
plot_predictions([gbrt_best], x, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title('Best Model(%d trees)' % bst_n_estimators)
Text(0.5, 1.0, 'Best Model(55 trees)')
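An alternative to picking n_estimators after the fact is incremental early stopping with warm_start=True, which keeps already-fitted trees between calls to fit; a sketch reusing the split above (the name gbrt_ws and the patience of 5 rounds are assumptions):
gbrt_ws = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)
min_val_error = float('inf')
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt_ws.n_estimators = n_estimators
    gbrt_ws.fit(x_train, y_train)  # warm_start=True reuses the trees already fit
    val_error = mean_squared_error(y_val, gbrt_ws.predict(x_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # stop early: 5 straight rounds without improvement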
Stacking:
Aggregates multiple classification or regression models (can be applied in stages).
Stack assorted classifiers directly and train stage by stage: the next stage trains directly on the previous stage's outputs ==> similar to grinding a score ever higher.
But it is very time-consuming and not very efficient.
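sklearn ships a built-in StackingClassifier (available since 0.22) that trains the second-stage model on out-of-fold predictions; a minimal sketch assuming the moons split from the voting section (the name stack_clf and the model choices are illustrative):
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
stack_clf = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(random_state=42))],
    final_estimator=LogisticRegression(),  # second-stage (meta) model
    cv=5  # base models' out-of-fold predictions feed the meta model
)
stack_clf.fit(x_train, y_train)
print(accuracy_score(y_test, stack_clf.predict(x_test)))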