Data Mining – Model Fusion
- Introduction: Model fusion usually improves results across many different machine learning tasks. As the name suggests, model fusion considers several different models together and merges their outputs into one result. It covers the following techniques:
1. Voting
2. Averaging
3. Ranking
4. Bagging
5. Boosting
6. Stacking
7. Blending
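Averaging, the simplest of these after voting, combines continuous outputs (regression predictions or class probabilities), optionally with per-model weights. A minimal sketch; the probability arrays below are made up for illustration:

```python
import numpy as np

# Hypothetical positive-class probabilities from three classifiers on five samples
p1 = np.array([0.9, 0.4, 0.7, 0.2, 0.6])
p2 = np.array([0.8, 0.5, 0.6, 0.3, 0.7])
p3 = np.array([0.7, 0.3, 0.8, 0.1, 0.5])

# Simple averaging: every model contributes equally
p_mean = (p1 + p2 + p3) / 3

# Weighted averaging: stronger models get larger weights (weights sum to 1)
weights = np.array([0.5, 0.3, 0.2])
p_weighted = weights[0] * p1 + weights[1] * p2 + weights[2] * p3

print(p_mean)
print(p_weighted)
```

Thresholding the averaged probabilities at 0.5 then yields the fused class labels.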
from sklearn.model_selection import train_test_split
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,roc_auc_score,roc_curve
from sklearn.model_selection import cross_val_predict
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# Preview the dataset
dataset = pd.read_csv(r'/Users/dongxiaojie/Documents/广东1810成绩/task1.csv',encoding='gbk')
# Drop duplicate rows
dataset.drop_duplicates(inplace=True)
print(dataset.shape)
dataset.head()
(4754, 90)
features = dataset.iloc[:,:-1]
labels = dataset.iloc[:,-1]
print('feature shape:{}, label shape:{}'.format(features.shape,labels.shape))
random_state = 2018
# Split the dataset: 70% train / 30% test, then halve the training set for stacking
x_train,x_test,y_train,y_test = train_test_split(features,labels,test_size=0.3, random_state=random_state)
x_sub1,x_sub2,y_sub1,y_sub2 = train_test_split(x_train,y_train,test_size = 0.5,random_state=random_state)
print(len(x_sub1),len(x_sub2),len(x_test))
feature shape:(4754, 89), label shape:(4754,)
1663 1664 1427
# Load a pickled estimator from disk
def getEstimators(file):
    estimator = None
    with open(file, 'rb') as pf:
        try:
            estimator = pickle.load(pf)
        except (pickle.UnpicklingError, EOFError):
            print('getEstimators error: failed to unpickle {}'.format(file))
    return estimator
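For reference, the pickle files loaded above would have been written with the matching `pickle.dump` call. A minimal sketch; the file name `dt.pkl` and the model are illustrative, not the notebook's actual saved estimators:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Hypothetical base learner; in the notebook it would be fit on x_train first
clf = DecisionTreeClassifier(random_state=2018)

# Persist the estimator so getEstimators can reload it later
with open('dt.pkl', 'wb') as pf:
    pickle.dump(clf, pf)

# Round-trip check: reload and confirm we get an estimator object back
with open('dt.pkl', 'rb') as pf:
    restored = pickle.load(pf)
print(type(restored).__name__)
```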
import os
# Directory containing the pickled base estimators
root = r'/Users/dongxiaojie/Documents/广东1810成绩'
estimators = []
for file in os.listdir(root):
    file_path = os.path.join(root, file)
    print(file_path)
    estimators.append(getEstimators(file_path))
estimators
Model fusion with Stacking
Stacking is a layered ensemble framework. Taking two layers as an example: the first layer consists of several base learners whose input is the original training set, and the second-layer model is then trained on the outputs of those first-layer base learners, yielding the complete stacking model.
In standard stacking, both layers make use of the full training data (first-layer predictions are produced out-of-fold); the code below uses a simpler hold-out split (x_sub1 / x_sub2), a variant often called blending.
# Meta-feature matrices: one column of predictions per base estimator
x_sub2_predict = np.empty((len(x_sub2), len(estimators)), dtype=np.float32)
x_test_predict = np.empty((len(x_test), len(estimators)), dtype=np.float32)
for i, estimator in enumerate(estimators):
    estimator.fit(x_sub1, y_sub1)
    x_sub2_predict[:, i] = estimator.predict(x_sub2)
    x_test_predict[:, i] = estimator.predict(x_test)
# Second-layer model (blender) trained on the base learners' predictions
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(x_sub2_predict, y_sub2)
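The hold-out split above trains the base learners on only half of x_train. Standard stacking instead produces first-layer predictions out-of-fold via cross-validation, so both layers see all of the training data; the `cross_val_predict` import at the top supports exactly this. A sketch with stand-in data and base learners (these are illustrative, not the notebook's pickled estimators):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be x_train / y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=2018)

base_learners = [DecisionTreeClassifier(random_state=2018),
                 LogisticRegression(max_iter=1000)]

# Out-of-fold predictions: each row is predicted by a model that never saw it
# during fitting, so no information leaks into the second layer
meta_features = np.column_stack([
    cross_val_predict(est, X, y, cv=5, method='predict_proba')[:, 1]
    for est in base_learners
])

# The second layer trains on the first layer's out-of-fold predictions,
# so both layers effectively use the full training set
blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
blender.fit(meta_features, y)
print(meta_features.shape)
```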
Model fusion with Voting
Voting follows a majority-rule principle and comes in two flavors: hard voting and soft voting.
Hard voting: each model casts one vote for its predicted class, and the class with the most votes becomes the final prediction.
Soft voting: same principle as hard voting, but it averages predicted class probabilities and supports per-model weights, so more important models contribute more to the result.
Note: this method is for classification problems.
named_estimators = [named_estimator for named_estimator in zip(['dt','lr','rf','svc','xgb'], estimators)]
voting_clf = VotingClassifier(estimators=named_estimators, voting='hard')
new_estimators = estimators.copy()
new_estimators.append(voting_clf)
for clf in new_estimators:
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
DecisionTreeClassifier 0.7786088257292446
LogisticRegression 0.7569184741959611
RandomForestClassifier 0.7928197456993269
SVC 0.7576664173522812
VotingClassifier 0.7808526551982049
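The explanation above also mentions soft voting with per-model weights; `VotingClassifier` supports this via `voting='soft'` and the `weights` parameter (every base model must implement `predict_proba`). A sketch with stand-in data, illustrative models, and an assumed weighting that favors the random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be the task1.csv features/labels
X, y = make_classification(n_samples=500, n_features=10, random_state=2018)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=2018)

# Soft voting averages predicted probabilities; the weights make the
# random forest count twice as much as each of the other two models
soft_clf = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=2018)),
                ('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=2018))],
    voting='soft',
    weights=[1, 1, 2])
soft_clf.fit(Xtr, ytr)
print(accuracy_score(yte, soft_clf.predict(Xte)))
```

In practice the weights would be tuned on a validation split, e.g. in proportion to each model's standalone accuracy.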