[One-Week Algorithm Practice, Advanced] Task 3: Model Fusion (Stacking)

Import the packages used in this task:

import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
                            confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'

Preparing the Data

Importing the Data

Original dataset download link: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw

Note: this is a financial dataset (not raw data; it has already been processed). The task is to predict whether a loan user will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Here we import the dataset that already went through feature engineering in the previous article ([One-Week Algorithm Practice, Advanced] Task 2: Feature Engineering):

data_del = pd.read_csv('data_del.csv')
data_del.head()
| | abs | apply_score | consfin_avg_limit | historical_trans_amount | history_fail_fee | latest_one_month_fail | loans_overdue_count | loans_score | max_cumulative_consume_later_1_month | repayment_capability | trans_amount_3_month | trans_fail_top_count_enum_last_1_month | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.200665 | 0.124820 | -1.201348 | -0.255030 | -0.427773 | -0.337569 | -0.098210 | 0.144596 | -0.067183 | 0.020868 | -0.049208 | -0.346369 | 1.0 |
| 1 | -0.090524 | 1.497024 | 0.238640 | 0.215237 | -0.547614 | -0.080162 | -0.733973 | 1.509325 | -0.073494 | -0.034355 | -0.274805 | -0.868380 | 0.0 |
| 2 | -0.312623 | 1.516627 | -0.671941 | -0.675385 | -0.627508 | -0.080162 | -0.733973 | 1.476440 | -0.262821 | -0.171658 | -0.321773 | 0.697653 | 1.0 |
| 3 | 1.359842 | 0.360055 | 0.736282 | 0.790524 | 0.331218 | -0.337569 | 0.537552 | -0.019829 | 0.471049 | -0.237850 | 0.505738 | -0.346369 | 0.0 |
| 4 | -0.315531 | -0.698503 | 0.042759 | -0.522714 | 0.291271 | -0.337569 | 1.173315 | -1.055708 | -0.172665 | -0.144424 | -0.282697 | 0.697653 | 1.0 |

Splitting the Data

Using sklearn, split the dataset 7:3 into a training set and a test set, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(data_del.drop(['status'], axis=1).values, 
                                                    data_del['status'].values, test_size=0.3, 
                                                    random_state=2018)

Check the sizes of the resulting training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]

Model Fusion (Stacking)

[Figure: schematic of the stacking procedure]
The data is split into two parts: a training set and a test set.

  • Training set (Training Data in the figure)

Run K-fold cross-validation on the training set. Taking the five folds in the figure as an example, split the training set into 5 parts; in each round, one part serves as validation data and the rest as training data. After training a model on the training data, predict on the validation data; those predictions are the orange "Predict" blocks in the figure. Finally, the predictions from the five rounds are assembled into the orange "Predictions", which become the training set for the next-level model.

  • Test set (Testing Data in the figure)

In each cross-validation round, after training on the training data, the model predicts not only on the validation fold but also on the test set; each round's predictions are the green "Predict" blocks. Averaging the five rounds of predictions yields the test set for the next-level model (the green "Predictions").

def get_stacking_data(models, X_train, y_train, X_test, y_test, k=5):
    '''Build the training and test sets for the next-level model.
    models: list of base models
    X_train: current training data
    y_train: current training labels
    X_test: current test data
    y_test: current test labels (not used here; kept for interface symmetry)
    k: number of cross-validation folds
    return: next_train: training set for the next-level model
            next_test: test set for the next-level model
    '''
    kfold = KFold(n_splits=k, random_state=2018, shuffle=True)
    next_train = np.zeros((X_train.shape[0], len(models)))
    next_test = np.zeros((X_test.shape[0], len(models)))
    
    for j, model in enumerate(models):
        # collect this model's k rounds of test-set predictions, then average them
        next_test_temp = np.zeros((X_test.shape[0], k))
        for i, (train_index, val_index) in enumerate(kfold.split(X_train)):
            X_train_fold, y_train_fold = X_train[train_index], y_train[train_index]
            X_val = X_train[val_index]
            model.fit(X_train_fold, y_train_fold)
            # out-of-fold predictions fill this model's column of next_train
            next_train[val_index, j] = model.predict(X_val)
            next_test_temp[:, i] = model.predict(X_test)
        next_test[:, j] = np.mean(next_test_temp, axis=1)
    
    return next_train, next_test
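As a quick sanity check of the mechanics above: even with shuffling, KFold still partitions the row indices, so every training row lands in exactly one validation fold and next_train is filled with purely out-of-fold predictions. A minimal self-contained demonstration (the sizes here are arbitrary):

```python
import numpy as np
from sklearn.model_selection import KFold

# Count how often each of n rows appears in a validation fold.
n, k = 100, 5
kfold = KFold(n_splits=k, random_state=2018, shuffle=True)
covered = np.zeros(n, dtype=int)
for train_index, val_index in kfold.split(np.arange(n)):
    covered[val_index] += 1

# Each row is validated exactly once, so every entry of next_train
# is predicted by a model that never saw that row during training.
print(covered.min(), covered.max())  # -> 1 1
```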

Choosing the Models to Fuse

Evaluation results of the seven models under default parameters (evaluation code is in the previous article):

| | AUC | Accuracy | F1-score | Precision | Recall |
|---|---|---|---|---|---|
| Random Forest | train 99.92% / test 75.32% | train 98.44% / test 77.29% | train 96.78% / test 42.99% | train 94.36% / test 33.24% | train 99.33% / test 60.85% |
| GBDT | train 88.12% / test 79.54% | train 84.14% / test 79.08% | train 59.03% / test 49.19% | train 45.90% / test 39.31% | train 82.68% / test 65.70% |
| XGBoost | train 87.15% / test 79.72% | train 82.73% / test 79.37% | train 54.88% / test 48.42% | train 42.18% / test 37.57% | train 78.52% / test 68.06% |
| LightGBM | train 99.61% / test 78.80% | train 96.17% / test 78.26% | train 91.77% / test 47.67% | train 85.77% / test 38.44% | train 98.67% / test 62.74% |
| Logistic Regression | train 76.45% / test 78.48% | train 78.90% / test 78.18% | train 39.19% / test 39.59% | train 27.31% / test 27.75% | train 69.38% / test 69.06% |
| SVM | train 79.32% / test 74.56% | train 79.76% / test 78.03% | train 38.57% / test 34.00% | train 25.51% / test 21.97% | train 78.97% / test 75.25% |
| Decision Tree | train 100.00% / test 62.97% | train 100.00% / test 70.66% | train 100.00% / test 45.28% | train 100.00% / test 47.11% | train 100.00% / test 43.58% |

The four ensemble models (Random Forest, GBDT, XGBoost, LightGBM) and Logistic Regression clearly perform better. We therefore use Random Forest, GBDT, Logistic Regression, and LightGBM as the base models and XGBoost as the second-layer model. Default parameters are used for now.

rnd_clf = RandomForestClassifier(random_state=2018)
gbdt = GradientBoostingClassifier(random_state=2018)
xgb = XGBClassifier(random_state=2018)
lgbm = LGBMClassifier(random_state=2018)
log = LogisticRegression(random_state=2018, max_iter=1000)
svc = SVC(random_state=2018, probability=True)
tree = DecisionTreeClassifier(random_state=2018)
base_models = [rnd_clf, gbdt, lgbm, log]
next_train, next_test = get_stacking_data(base_models, X_train, y_train, X_test, y_test, k=10)

Training and Evaluating the Fused Model

stacking_model = XGBClassifier(random_state=2018)
stacking_model.fit(next_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)
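The tables here report AUC, Accuracy, F1-score, Precision, and Recall; the actual evaluation code lives in the previous article. As a rough sketch of how such metrics can be computed with sklearn (synthetic data and a stand-in LogisticRegression here, since the real inputs are the next_train/next_test meta-features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Synthetic stand-in data; in the article these would be next_train / next_test.
X, y = make_classification(n_samples=500, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2018)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_pred = clf.predict(X_te)               # hard labels for Accuracy / F1 / P / R
y_proba = clf.predict_proba(X_te)[:, 1]  # probabilities for AUC
metrics = {
    'AUC': roc_auc_score(y_te, y_proba),
    'Accuracy': accuracy_score(y_te, y_pred),
    'F1-score': f1_score(y_te, y_pred),
    'Precision': precision_score(y_te, y_pred),
    'Recall': recall_score(y_te, y_pred),
}
for name, value in metrics.items():
    print(f'{name}: {value:.2%}')
```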
| | AUC | Accuracy | F1-score | Precision | Recall |
|---|---|---|---|---|---|
| Random Forest | train 99.92% / test 75.32% | train 98.44% / test 77.29% | train 96.78% / test 42.99% | train 94.36% / test 33.24% | train 99.33% / test 60.85% |
| GBDT | train 88.12% / test 79.54% | train 84.14% / test 79.08% | train 59.03% / test 49.19% | train 45.90% / test 39.31% | train 82.68% / test 65.70% |
| XGBoost | train 87.15% / test 79.72% | train 82.73% / test 79.37% | train 54.88% / test 48.42% | train 42.18% / test 37.57% | train 78.52% / test 68.06% |
| LightGBM | train 99.61% / test 78.80% | train 96.17% / test 78.26% | train 91.77% / test 47.67% | train 85.77% / test 38.44% | train 98.67% / test 62.74% |
| Logistic Regression | train 76.45% / test 78.48% | train 78.90% / test 78.18% | train 39.19% / test 39.59% | train 27.31% / test 27.75% | train 69.38% / test 69.06% |
| SVM | train 79.32% / test 74.56% | train 79.76% / test 78.03% | train 38.57% / test 34.00% | train 25.51% / test 21.97% | train 78.97% / test 75.25% |
| Decision Tree | train 100.00% / test 62.97% | train 100.00% / test 70.66% | train 100.00% / test 45.28% | train 100.00% / test 47.11% | train 100.00% / test 43.58% |
| Fused (Stacking) model | train 64.89% / test 79.04% | train 78.58% / test 83.54% | train 39.06% / test 59.15% | train 27.56% / test 46.24% | train 66.98% / test 82.05% |

The ROC curve:
[Figure: ROC curve of the stacking model]
Comparing the ROC curves side by side:

[Figures: ROC curves on the training set (left) and the test set (right)]
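The ROC figures above did not survive extraction. A sketch of how such curves can be drawn with the roc_curve function already imported above (synthetic data and a stand-in classifier, not the article's actual models):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2018)
y_proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC curve: false positive rate vs. true positive rate over all thresholds
fpr, tpr, _ = roc_curve(y_te, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_te, y_proba):.2%}')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.savefig('roc.png')
```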

Summary

Compared with the other models, the fused model improves on every test-set metric except AUC, and that is with default parameters throughout. With some tuning, the results should improve further.
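One way to do that tuning is with the GridSearchCV already imported at the top. The sketch below uses a hypothetical parameter grid and a GradientBoostingClassifier on synthetic data as a stand-in for the XGBoost meta-model, so it stays self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=2018)

# Illustrative grid; real tuning would cover more parameters and values.
param_grid = {'n_estimators': [50, 100], 'max_depth': [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=2018),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```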

References

[1] https://blog.csdn.net/wstcjf/article/details/77989963, a detailed walkthrough of the stacking process
