Data Mining – Model Fusion
- Introduction: Model fusion usually improves results across many different machine learning tasks. As the name suggests, model fusion considers several different models together and merges their outputs into one result. It covers the following techniques:
1. Voting
2. Averaging
3. Ranking
4. Bagging
5. Boosting
6. Stacking
7. Blending
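Averaging, the simplest of these after voting, combines continuous outputs (regression predictions or class probabilities), optionally with per-model weights. A minimal sketch; the probability arrays below are made up for illustration:

```python
import numpy as np

# Hypothetical positive-class probabilities from three classifiers on five samples
p1 = np.array([0.9, 0.4, 0.7, 0.2, 0.6])
p2 = np.array([0.8, 0.5, 0.6, 0.3, 0.7])
p3 = np.array([0.7, 0.3, 0.8, 0.1, 0.5])

# Simple averaging: every model contributes equally
p_mean = (p1 + p2 + p3) / 3

# Weighted averaging: stronger models get larger weights (weights sum to 1)
weights = np.array([0.5, 0.3, 0.2])
p_weighted = weights[0] * p1 + weights[1] * p2 + weights[2] * p3

print(p_mean)
print(p_weighted)
```

Thresholding the averaged probabilities at 0.5 then yields the fused class labels.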
from sklearn.model_selection import train_test_split
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,roc_auc_score,roc_curve
from sklearn.model_selection import cross_val_predict
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# Preview the dataset
dataset = pd.read_csv(r'/Users/dongxiaojie/Documents/广东1810成绩/task1.csv',encoding='gbk')
# Drop duplicate rows
dataset.drop_duplicates(inplace=True)
print(dataset.shape)
dataset.head()
(4754, 90)
features = dataset.iloc[:,:-1]
labels = dataset.iloc[:,-1]
print('feature shape:{}, label shape:{}'.format(features.shape,labels.shape))
random_state = 2018
# Split the dataset: 70% train / 30% test, then halve the training set for stacking
x_train,x_test,y_train,y_test = train_test_split(features,labels,test_size=0.3, random_state=random_state)
x_sub1,x_sub2,y_sub1,y_sub2 = train_test_split(x_train,y_train,test_size = 0.5,random_state=random_state)
print(len(x_sub1),len(x_sub2),len(x_test))
feature shape:(4754, 89), label shape:(4754,)
1663 1664 1427
# Load a pickled estimator from disk
def getEstimators(file):
    estimator = None
    with open(file, 'rb') as pf:
        try:
            estimator = pickle.load(pf)
        except (pickle.UnpicklingError, EOFError):
            print('getEstimators error: failed to unpickle {}'.format(file))
    return estimator
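For reference, the pickle files loaded above would have been written with the matching `pickle.dump` call. A minimal sketch; the file name `dt.pkl` and the model are illustrative, not the notebook's actual saved estimators:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Hypothetical base learner; in the notebook it would be fit on x_train first
clf = DecisionTreeClassifier(random_state=2018)

# Persist the estimator so getEstimators can reload it later
with open('dt.pkl', 'wb') as pf:
    pickle.dump(clf, pf)

# Round-trip check: reload and confirm we get an estimator object back
with open('dt.pkl', 'rb') as pf:
    restored = pickle.load(pf)
print(type(restored).__name__)
```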
import os
# Directory containing the pickled base estimators
root = r'/Users/dongxiaojie/Documents/广东1810成绩'
estimators = []
for file in os.listdir(root):
    file_path = os.path.join(root, file)
    print(file_path)
    estimators.append(getEstimators(file_path))
estimators
Model fusion with Stacking
Stacking is a layered ensemble framework. Taking two layers as an example: the first layer consists of several base learners whose input is the original training set, and the second-layer model is then trained on the outputs of those first-layer base learners, yielding the complete stacking model.
In standard stacking, both layers make use of the full training data (first-layer predictions are produced out-of-fold); the code below uses a simpler hold-out split (x_sub1 / x_sub2), a variant often called blending.
# Meta-feature matrices: one column of predictions per base estimator
x_sub2_predict = np.empty((len(x_sub2), len(estimators)), dtype=np.float32)
x_test_predict = np.empty((len(x_test), len(estimators)), dtype=np.float32)
for i, estimator in enumerate(estimators):
    estimator.fit(x_sub1, y_sub1)
    x_sub2_predict[:, i] = estimator.predict(x_sub2)
    x_test_predict[:, i] = estimator.predict(x_test)
# Second-layer model (blender) trained on the base learners' predictions
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(x_sub2_predict, y_sub2)
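The hold-out split above trains the base learners on only half of x_train. Standard stacking instead produces first-layer predictions out-of-fold via cross-validation, so both layers see all of the training data; the `cross_val_predict` import at the top supports exactly this. A sketch with stand-in data and base learners (these are illustrative, not the notebook's pickled estimators):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be x_train / y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=2018)

base_learners = [DecisionTreeClassifier(random_state=2018),
                 LogisticRegression(max_iter=1000)]

# Out-of-fold predictions: each row is predicted by a model that never saw it
# during fitting, so no information leaks into the second layer
meta_features = np.column_stack([
    cross_val_predict(est, X, y, cv=5, method='predict_proba')[:, 1]
    for est in base_learners
])

# The second layer trains on the first layer's out-of-fold predictions,
# so both layers effectively use the full training set
blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
blender.fit(meta_features, y)
print(meta_features.shape)
```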
Model fusion with Voting
Voting follows a majority-rule principle and comes in two flavors: hard voting and soft voting.
Hard voting: each model casts one vote for its predicted class, and the class with the most votes becomes the final prediction.
Soft voting: same principle as hard voting, but it averages predicted class probabilities and supports per-model weights, so more important models contribute more to the result.
Note: this method is for classification problems.
named_estimators = [named_estimator for named_estimator in zip(['dt','lr','rf','svc','xgb'], estimators)]
voting_clf = VotingClassifier(estimators=named_estimators, voting='hard')
new_estimators = estimators.copy()
new_estimators.append(voting_clf)
for clf in new_estimators:
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
DecisionTreeClassifier 0.7786088257292446
LogisticRegression 0.7569184741959611
RandomForestClassifier 0.7928197456993269
SVC 0.7576664173522812
VotingClassifier 0.7808526551982049
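The explanation above also mentions soft voting with per-model weights; `VotingClassifier` supports this via `voting='soft'` and the `weights` parameter (every base model must implement `predict_proba`). A sketch with stand-in data, illustrative models, and an assumed weighting that favors the random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be the task1.csv features/labels
X, y = make_classification(n_samples=500, n_features=10, random_state=2018)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=2018)

# Soft voting averages predicted probabilities; the weights make the
# random forest count twice as much as each of the other two models
soft_clf = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=2018)),
                ('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=2018))],
    voting='soft',
    weights=[1, 1, 2])
soft_clf.fit(Xtr, ytr)
print(accuracy_score(yte, soft_clf.predict(Xte)))
```

In practice the weights would be tuned on a validation split, e.g. in proportion to each model's standalone accuracy.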