Alibaba Cloud Tianchi Financial Risk Control Training Camp [Task5 Model Fusion] Study Notes

Latest recommended article published on 2024-07-30 12:51:28

优岚岚

Tags: machine learning, python, artificial intelligence, deep learning, data mining

Article link: https://blog.csdn.net/weixin_49270402/article/details/116400621

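Financial Risk Control Training Camp Task5: Model Fusion Study Notes

These study notes cover material from the Alibaba Cloud Tianchi Longzhu Plan (龙珠计划) financial risk control training camp. Course link: https://tianchi.aliyun.com/specials/activity/promotion/aicampfr

I. Overview of Key Points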

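II. Learning Content

1. Model fusion approaches

Averaging: simple averaging / weighted averaging
Voting: simple voting / weighted voting
Combined: rank fusion / log fusion
stacking: build multiple layers of models and fit a further model on their predictions
blending: hold out part of the data, predict it with the trained models, and use those predictions as new features when fitting on the remaining data
boosting/bagging (covered in Task4)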
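
The notes name rank fusion and log fusion without showing them, so here is a minimal sketch of one common reading of each. The prediction arrays are made-up values. Treating "log fusion" as averaging in log space (a geometric mean) is an assumption on my part, since the notes do not define it.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical predicted probabilities from two models (illustrative values only)
pre1 = np.array([0.10, 0.40, 0.35, 0.80])
pre2 = np.array([0.20, 0.30, 0.90, 0.95])

# Rank fusion: replace each score by its normalized rank, then average the ranks.
# This keeps only the ordering, which is what AUC-style metrics care about.
rank1 = rankdata(pre1) / len(pre1)
rank2 = rankdata(pre2) / len(pre2)
rank_fused = (rank1 + rank2) / 2

# Log fusion (assumed meaning): average the log-predictions and map back,
# which is the geometric mean of the two predictions.
log_fused = np.exp((np.log(pre1) + np.log(pre2)) / 2)

print(rank_fused)
print(log_fused)
```

Rank fusion is mostly useful when the models output scores on different scales, because only the ordering survives the transformation.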

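2. Averaging

Commonly used

Fast and simple

Simple averaging takes the plain mean of the predictions

pre = (pre1 + pre2 + pre3 + ... + pren) / n
Weighted averaging takes a weighted mean, with weights that sum to 1

pre = 0.3*pre1 + 0.3*pre2 + 0.4*pre3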
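
The two formulas can be sketched with NumPy as follows. The three prediction arrays are hypothetical values chosen for illustration; the weights 0.3, 0.3, 0.4 are the ones from the weighted formula.

```python
import numpy as np

# Hypothetical predicted probabilities from three models (illustrative values only)
pre1 = np.array([0.2, 0.8, 0.6])
pre2 = np.array([0.4, 0.6, 0.5])
pre3 = np.array([0.3, 0.7, 0.1])
preds = np.vstack([pre1, pre2, pre3])  # shape: (n_models, n_samples)

# Simple averaging: every model counts equally
simple_avg = preds.mean(axis=0)

# Weighted averaging: weights 0.3, 0.3, 0.4, which sum to 1
weights = np.array([0.3, 0.3, 0.4])
weighted_avg = np.average(preds, axis=0, weights=weights)

print(simple_avg)
print(weighted_avg)
```

In practice the weights are usually chosen from validation scores, giving the stronger models a larger share.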

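3. Voting

Simple voting (hard voting classifier)

Aggregate the class predicted by each classifier; the class that receives the most votes becomes the ensemble's prediction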

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# x_train, y_train and x_test are assumed to come from an earlier train/test split
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7, objective='binary:logistic')

# voting='hard' (the default): each classifier casts one vote and the majority class wins
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)], voting='hard')
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))

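Weighted voting (soft voting classifier)

Average the probability that each model assigns to every class (optionally weighting the models); the class with the highest average probability is the final prediction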

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# x_train, y_train and x_test are assumed to come from an earlier train/test split
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7, objective='binary:logistic')

# voting='soft': average predicted probabilities; weights=[2, 1, 1] counts 'lr' twice as much
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)], voting='soft', weights=[2, 1, 1])
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))
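
To see why hard and soft voting can disagree, here is a small NumPy sketch that does the arithmetic by hand for a binary problem. The probabilities are invented for illustration, and thresholding the averaged class-1 probability at 0.5 stands in for picking the class with the highest average probability.

```python
import numpy as np

# Hypothetical class-1 probabilities: rows are classifiers, columns are samples
proba = np.array([
    [0.45, 0.60, 0.10, 0.90],  # classifier 1
    [0.45, 0.55, 0.40, 0.80],  # classifier 2
    [0.90, 0.20, 0.45, 0.30],  # classifier 3
])

# Hard voting: each classifier votes 0 or 1, and the majority wins
votes = (proba >= 0.5).astype(int)
hard_pred = (votes.sum(axis=0) > proba.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then decide
soft_pred = (proba.mean(axis=0) >= 0.5).astype(int)

# Weighted soft voting with weights [2, 1, 1]
weights = np.array([2, 1, 1])
weighted_soft_pred = (np.average(proba, axis=0, weights=weights) >= 0.5).astype(int)

print(hard_pred)           # [0 1 0 1]
print(soft_pred)           # [1 0 0 1]
print(weighted_soft_pred)  # [1 0 0 1]
```

The first two samples flip between the two schemes: one very confident classifier can outweigh two lukewarm ones under soft voting, but not under hard voting.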

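4. Stacking and blending in detail

Reference: https://blog.csdn.net/wuzhongqiang/article/details/105012739

Implementation walkthrough: https://zhuanlan.zhihu.com/p/25836678

stacking

Several base learners produce predictions, and those predictions form a new training set on which another learner is trained

Slow and memory-intensive

Increases model complexity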

import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_decision_regions


# Use the iris dataset bundled with scikit-learn as an example
iris = datasets.load_iris()
# Keep two features only (columns 1 and 2) so the decision regions can be drawn in 2D
X, y = iris.data[:, 1:3], iris.target


clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)


labels = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]

fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0, 1], repeat=2)


clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, labels, grid):
    # 5-fold cross-validated accuracy of each model
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())

    # Refit on all the data and draw the decision regions in the model's own subplot
    clf.fit(X, y)
    plt.subplot(gs[grd[0], grd[1]])
    plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(label)


plt.show()
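
To make the "predictions become the new training set" step explicit, here is a hand-rolled sketch of two-layer stacking using scikit-learn only. It assumes out-of-fold predictions from 5-fold cross-validation for the first layer; the choice of base models, the split, and all variable names are illustrative, not the camp's reference implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

base_models = [KNeighborsClassifier(n_neighbors=5), GaussianNB()]

# Layer 1: out-of-fold class probabilities, so each training row is predicted
# by a model that never saw that row during fitting
meta_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Refit each base model on the full training set to build the test-set features
meta_test = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])

# Layer 2: the meta-learner is trained on the layer-1 predictions only
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_train, y_train)
stack_acc = meta_model.score(meta_test, y_test)
print("Stacking accuracy: %.3f" % stack_acc)
```

Each base model contributes three probability columns (one per iris class), so the meta-learner sees six features. The extra cross-validation fits are where the time and memory cost of stacking comes from.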

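blending: the predicted values become new features, which are merged with the original features to form a new feature set used for prediction (the code below feeds only the predictions to the second-level model)
- To prevent overfitting, the training data is split into two parts, d1 and d2: the base learners are trained on d1 and predict on d2
- The predictions on d2 become new features; d2 together with these new features is the training set for the second-level model, which then predicts the test set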

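import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Use the iris dataset bundled with scikit-learn as an example.
# The first 100 rows contain only classes 0 and 1, which gives a binary problem.
iris = datasets.load_iris()
data = iris.data[:100, :]
target = iris.target[:100]

# Base learners for the fusion
clfs = [LogisticRegression(),
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier()]

# Hold out part of the data as the test set
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=914)

# Split the remaining training data into two parts, d1 and d2
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=914)
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))       # base-learner predictions on d2
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))  # base-learner predictions on the test set

for j, clf in enumerate(clfs):
    # Train each base learner on d1, then predict d2
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    # For the test set, use the base learners' predictions directly as the new features
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))

# Second-level model that performs the blending, trained on the d2 predictions
clf = GradientBoostingClassifier()
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))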
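
The description of blending mentions merging the predictions with the original features, while the code above passes only the predictions to the second layer. The following is a minimal sketch of the merged variant. The two base models, the logistic-regression blender, and the names `blend_train` and `blend_test` are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

iris = load_iris()
data, target = iris.data[:100], iris.target[:100]  # classes 0 and 1 only

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=914, stratify=target)
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=914, stratify=y_train)

base_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(random_state=0)]
for m in base_models:
    m.fit(X_d1, y_d1)  # base learners only ever see d1

# One prediction column per base learner
d2_preds = np.column_stack([m.predict_proba(X_d2)[:, 1] for m in base_models])
test_preds = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# Merge the original features with the new prediction features
blend_train = np.hstack([X_d2, d2_preds])
blend_test = np.hstack([X_test, test_preds])

blender = LogisticRegression(max_iter=1000)
blender.fit(blend_train, y_d2)
blend_auc = roc_auc_score(y_test, blender.predict_proba(blend_test)[:, 1])
print("Blending AUC (features + predictions): %f" % blend_auc)
```

Whether keeping the original features helps depends on the data; with a small d2 the wider feature set can make the second-level model easier to overfit.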

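III. Questions and Answers

Q: With so many methods, how do we choose?

A: Each method has its own character. Averaging is simple and fast, so it is used very often in competitions. Voting aggregates the predictions of the individual classifiers and follows the majority, which further reduces the chance of error. stacking/blending and boosting/bagging are more complex, but they can do things that averaging and voting cannot, and in those cases they produce a better fusion. Each method also has unavoidable drawbacks, such as the long running time and high memory use of stacking, and these have to be weighed when fusing models.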

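IV. Reflections and Summary

Differences between stacking and blending:

stacking
- Repeated cross-validation makes it more robust
- The two layers are fitted on different data, which avoids information leakage
- In a team competition, there is no need to share your random seed with teammates
Blending
- The data is split into two parts and each layer uses only one of them, so some of the information in the data is ignored in the final prediction
- When the second layer has little data, it may overfit
- Blending is relatively simple, while stacking is more complex

Every model has strengths and weaknesses, and no machine learning model wins every time. Learning how to find the best solution for the problem at hand is a core part of data mining. Combining, fusing, or integrating several machine learning models can raise the overall predictive ability of the system, which is why fusion matters so much in multi-classifier systems and ensemble learning.
Model fusion can improve the final predictive performance, and the result is generally somewhat better than the best single sub-model.

Different sub-models capture different parts of the data well. By combining what each does best, we aim for a model that is "accurate" across the board, taking the best from each to improve the result. This is the ideal case, though; in practice various errors still occur.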
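
The point about blending using only part of the data can be checked by counting how many rows reach the second layer under each scheme. This sketch assumes stacking uses 5-fold out-of-fold predictions and blending uses a single 50/50 split of d1 and d2; the sample count of 100 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_samples = 100
indices = np.arange(n_samples)

# Stacking: every row falls into exactly one validation fold,
# so every row receives an out-of-fold prediction for the second layer
oof_covered = np.zeros(n_samples, dtype=bool)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kfold.split(indices):
    oof_covered[valid_idx] = True
stacking_meta_rows = int(oof_covered.sum())

# Blending: only the d2 half is predicted by the base learners,
# so only those rows are available to train the second layer
d1_idx, d2_idx = train_test_split(indices, test_size=0.5, random_state=0)
blending_meta_rows = len(d2_idx)

print(stacking_meta_rows, blending_meta_rows)  # 100 50
```

The price stacking pays for the extra rows is one round of base-model fitting per fold.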