2021 iFLYTEK Vehicle Loan Default Prediction Competition: Top-1 Solution

Original link: https://aistudio.baidu.com/aistudio/projectdetail/3795293

Reposted from AI Studio. Original article: 2021 iFLYTEK Vehicle Loan Default Prediction Competition Top-1 Solution - PaddlePaddle AI Studio

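Champion Solution for the Vehicle Loan Default Prediction Competition

This project is based on the Top-1 solution of the Vehicle Loan Default Prediction competition. Author: @望尼玛

Reposted with the original author's authorization.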

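📖 0 Project Background

I took part in the early stage of this competition in 2021 and stayed around tenth place. Later, because of a conflict with my summer internship and a lack of good ideas for improving the score, I reluctantly gave up.

After reading the champion's solution I found it inspiring, and I wanted to use it to summarize my own experience in the competition.

I have added material from some of my earlier projects as supplementary background, and I hope it helps readers interested in finance-related competitions.

The task is a classic binary classification problem, which makes it a good introductory tutorial for readers interested in data analysis and risk control.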

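🔋0.1 Competition Background

As regulatory policies enter a critical implementation phase, compliance limits on loan size have made the once-popular large loans gradually disappear, and small, diversified vehicle loans have become one of the main directions in which online lending platforms are transforming.

Vehicle loan assets offer a low entry barrier, small loan amounts, high liquidity, and short terms, but effective risk control remains one of the industry's main problems.

A domestic lending institution faces exactly this problem: its borrowers often pay late or refuse to repay, which keeps its non-performing loan ratio high.

To tackle it, the institution has released part of its loan data and invites participants to build a risk identification model that predicts which borrowers are likely to default (sensitive information has been anonymized).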

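🔋0.2 Competition Task

The data contains borrower information from the institution's actual business, with 53 customer-related fields; the loan_default field indicates whether the borrower defaulted.

The goal is to train a model on the training set and predict the value of loan_default for the test set,

that is, whether each borrower will default, so that lending risk can be reduced.

🔋0.3 Evaluation Metric

The competition is scored with the F1 score, with class 1 as the positive class. Note that the reference code uses average='macro', i.e. the unweighted mean of the per-class F1 scores.

Reference evaluation code: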

from sklearn.metrics import f1_score
  
y_pred = [0, 1, 1, 1, 0, 1]
y_true = [0, 1, 0, 1, 1, 1] 

score = f1_score(y_true, y_pred, average='macro')
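
To make the metric concrete, the following sketch recomputes the macro F1 of the same toy labels by hand, without sklearn. The helper names `f1_for_class` and `macro_f1` are illustrative and not part of the competition code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_pred = [0, 1, 1, 1, 0, 1]
y_true = [0, 1, 0, 1, 1, 1]

print(f1_for_class(y_true, y_pred, 1))  # 0.75 (precision 3/4, recall 3/4)
print(f1_for_class(y_true, y_pred, 0))  # 0.5  (precision 1/2, recall 1/2)
print(macro_f1(y_true, y_pred))         # 0.625
```

Because the macro average weights both classes equally, errors on the rarer class move the score as much as errors on the common class.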

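🥦 1 What This Project Offers

This project may be useful if you:

want to learn data science competitions but do not know where to start
want to understand how these techniques are applied to financial risk control
have joined a few competitions and want to pick up more score-boosting techniques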

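💡 2 Dataset

The competition data consists of a training set and a test set. The full dataset has more than 250,000 records and 52 feature fields.
150,000 records are sampled as the training set and 30,000 as the test set, and some fields are anonymized.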

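Feature field	Description
customer_id	Customer identifier
main_account_loan_no	Number of loans applied for under the main account
main_account_active_loan_no	Number of active loans under the main account
main_account_overdue_no	Number of overdue loans under the main account
main_account_outstanding_loan	Outstanding loan balance of the main account
main_account_sanction_loan	Total sanctioned amount across all main-account loans
main_account_disbursed_loan	Total disbursed amount across all main-account loans
sub_account_loan_no	Number of loans applied for under the secondary account
sub_account_active_loan_no	Number of active loans under the secondary account
sub_account_overdue_no	Number of overdue loans under the secondary account
sub_account_outstanding_loan	Outstanding loan balance of the secondary account
sub_account_sanction_loan	Total sanctioned amount across all secondary-account loans
sub_account_disbursed_loan	Total disbursed amount across all secondary-account loans
disbursed_amount	Disbursed loan amount
asset_cost	Asset cost
branch_id	Branch that issued the loan
supplier_id	Vehicle dealer through which the loan was issued
manufacturer_id	Vehicle manufacturer
year_of_birth	Customer's year of birth
disbursed_date	Loan disbursement date
area_id	Payment area
employee_code_id	Employee who recorded the payment
mobileno_flag	Whether a mobile number was provided
idcard_flag	Whether an ID card was provided
Driving_flag	Whether a driving licence was presented
passport_flag	Whether a passport was provided
credit_score	Credit score
main_account_monthly_payment	Monthly payment of the main account
sub_account_monthly_payment	Monthly payment of the secondary account
last_six_month_new_loan_no	Number of new loan applications by the customer in the past six months
last_six_month_defaulted_no	Number of defaults by the customer in the past six months
average_age	Average loan tenure
credit_history	Credit history
enquirie_no	Number of loan enquiries made by the customer
loan_to_asset_ratio	Loan-to-asset ratio
total_account_loan_no	Number of loans applied for across all accounts
main_account_inactive_loan_no	Number of inactive loans under the main account
sub_account_inactive_loan_no	Number of inactive loans under the secondary account
total_inactive_loan_no	Number of inactive loans across all accounts
total_overdue_no	Number of overdue events across all accounts
total_outstanding_loan	Total outstanding balance across all accounts
total_sanction_loan	Total sanctioned amount across all loans of all accounts
total_disbursed_loan	Total disbursed amount across all loans of all accounts
total_monthly_payment	Monthly payment across all accounts
outstanding_disburse_ratio	Total disbursed amount / total outstanding amount (ratio)
main_account_tenure	Number of repayment instalments of the main account
sub_account_tenure	Number of repayment instalments of the secondary account
disburse_to_sactioned_ratio	Disbursed amount / sanctioned amount (ratio)
active_to_inactive_act_ratio	Number of active loans / number of inactive loans (ratio)
Credit_level	Credit score (level)
employment_type	Employment type
age	Age
loan_default	1 = customer defaulted (overdue), 0 = customer did not default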

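💡 3 Approach

⚽ 3.1 The Original Author's Solution

The key to a data-mining competition like this one is abstracting useful features from an understanding of the data.

So when I started, I did not reach for fancy models; I built features by analyzing the data.

If you would rather not read the code further down, this section gives a brief overview of my whole solution:

Class distribution:

The two classes are split roughly 82:18, which in risk control actually counts as a fairly balanced dataset.

So I did not specifically work on class imbalance during the competition. I did try some oversampling, but it lowered the score rather than raising it.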

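Feature engineering:

First, I noticed early on that there are many ID-type features, so I built target encoding features on top of them. These simple features plus a tree model already scored 0.583, which kept me in the Top 10 during the early stage.

Next, from a business perspective, I built features such as the annual interest rate of the main and secondary accounts (the rate a bank offers often reflects its own assessment of the customer's credit).

From a data-distribution perspective, I binned some of the amount features; and from the perspective of feature usefulness and redundancy, I dropped features that carry no information, such as the loan date. At this point the score reached about 0.587.

Then, in one training run, I accidentally fed the customer ID into the model, and found that it seemed to improve performance.

My reasoning was that this must come from fraud being clustered:

fraud rings may concentrate on particular lending branches (where) or lending periods (when). On one hand, this clustering shows up in branch_id/supplier_id/manufacturer_id and similar fields;

on the other hand, customer_id itself can reflect clustering in time. Based on this, I built the neighbor-fraud feature, which brought the score to 0.589.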

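Model selection:

Early on I used LightGBM throughout and did not tune it carefully (I did not use tools such as hyperopt or optuna); it was all quite casual (just an unremarkable hand-tuning whiz).

Later I tried other models such as XGBoost, CatBoost, and TabNet, but CatBoost and TabNet did not perform well, so I did not dig further (I still had a day job, so my energy was limited; it was less slacking off at work to compete than staying up late to compete).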

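Threshold selection:

Because the task is scored with the F1 score, we have to set a threshold ourselves to decide which samples are predicted positive and which negative.

After trying several options, we settled on using the out-of-fold (OOF) predictions and choosing the threshold that performs best on them; this gave the best leaderboard result (an improvement in the third decimal place).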

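Ensembling strategy:

In the end I chose two models to blend, LightGBM and XGBoost (ha, very old-school, I know).

Blending the predicted probabilities with weights directly gave only mediocre results; converting each model's predictions to rank percentiles first and then taking the weighted blend worked better.

In terms of scores, the best single LightGBM reached 0.5892 and XGBoost about 0.5872; the best probability-weighted blend scored 0.59011, and the best rank-weighted blend scored 0.59038.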
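
The difference between the two blending schemes can be illustrated on toy numbers. This is only a sketch of the mechanics: the predictions and the equal weights are made up for illustration, and `pred_a`/`pred_b` are hypothetical column names.

```python
import pandas as pd

# Hypothetical predictions from two models on the same five samples.
df = pd.DataFrame({
    "pred_a": [0.10, 0.40, 0.35, 0.80, 0.20],  # spread over a wide range
    "pred_b": [0.02, 0.03, 0.05, 0.04, 0.01],  # squeezed into a narrow range
})
w_a, w_b = 0.5, 0.5  # illustrative weights

# Probability blending: model B barely influences the result,
# because its raw scores live on a much smaller scale.
df["prob_blend"] = w_a * df["pred_a"] + w_b * df["pred_b"]

# Rank blending: convert each model's scores to rank percentiles in (0, 1]
# first, so both models contribute on the same scale.
df["rank_a"] = df["pred_a"].rank(pct=True)
df["rank_b"] = df["pred_b"].rank(pct=True)
df["rank_blend"] = w_a * df["rank_a"] + w_b * df["rank_b"]

print(df)
```

Rank percentiles depend only on the ordering within each model, so a threshold chosen on blended ranks is expressed as a quantile rather than as a probability.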

That covers the main ideas and the solution. Text alone always feels a bit dry, so if you are interested in the code, read on. After all, talk is cheap :)

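⚽ 3.2 My Commentary

As the original author says, the core of a data competition is feature engineering.

Feature engineering means using a series of engineering techniques to extract better features from the raw data so as to improve model training.

There are many feature engineering methods in data competitions. I mention some of them below and point to where more material can be found.

The core idea of feature engineering is to construct features from an understanding of the data. Suppose, for example, that you are given many physical attributes of men and women and need to classify them by sex.

Beard length would surely be a key feature, and recognizing that comes from understanding the problem well.

So before constructing features, exploratory data analysis (EDA) helps you understand the distribution of the data.

Note that EDA has no fixed recipe; it helps the researcher understand the data through repeated quantitative analysis.

Basic indicators include the label distribution, per-label statistics of each attribute (mean, maximum, minimum, variance, quartiles), and the correlations between attributes.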
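
The basic indicators listed above can be computed with a few pandas calls. The sketch below uses a tiny made-up table; the column names follow the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy data with invented values; only the column names come from the dataset.
df = pd.DataFrame({
    "credit_score": [700, 650, 300, 720, 280, 690, 310, 640],
    "loan_to_asset_ratio": [0.60, 0.70, 0.90, 0.55, 0.85, 0.65, 0.88, 0.72],
    "loan_default": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Label distribution (share of each class)
label_dist = df["loan_default"].value_counts(normalize=True)

# Per-label statistics of every attribute
stats_by_label = df.groupby("loan_default").agg(["mean", "max", "min", "var"])

# Per-label quartiles of one attribute
quartiles = df.groupby("loan_default")["credit_score"].quantile([0.25, 0.5, 0.75])

# Pairwise correlations between attributes (and with the label)
corr = df.corr()

print(label_dist, stats_by_label, quartiles, corr, sep="\n\n")
```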

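Class distribution:

The author mentions class imbalance, which is one of the problems you meet most often in data competitions.

It can be handled on the data side:

Common methods include oversampling and undersampling, as well as newer algorithms such as SPE.

For concrete code, see my earlier projects.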
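
As a minimal sketch of the data-side methods, random undersampling and random oversampling can be written with pandas alone. The toy frame and its column `x` are made up for illustration; this is not the code from the projects mentioned above.

```python
import pandas as pd

# Toy frame: 8 negatives and 2 positives
df = pd.DataFrame({"x": range(10), "loan_default": [0] * 8 + [1] * 2})
majority = df[df["loan_default"] == 0]
minority = df[df["loan_default"] == 1]

# Random undersampling: shrink the majority class to the minority size
under = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

# Random oversampling: draw minority rows with replacement up to the majority size
over = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

print(under["loan_default"].value_counts().to_dict())
print(over["loan_default"].value_counts().to_dict())
```

Resampling should be applied to the training folds only; validation data should keep the original class ratio so that the measured score stays comparable to the leaderboard.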

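It can also be handled on the loss-function side:

Weight the classes according to their proportions, giving the rarer class a larger weight.

Use a loss function designed for imbalanced data, such as Focal Loss.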
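
Both ideas can be sketched in NumPy. The weighting rule below (n_samples / (n_classes * class_count)) is one common convention, and the focal loss follows the usual binary form FL = -alpha_t * (1 - p_t)^gamma * log(p_t); the function names and default values are illustrative assumptions.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss; gamma=0 and alpha=0.5 gives half the log loss."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-dependent weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))


# An 82:18 split similar to the one described for this competition
print(balanced_class_weights([0] * 82 + [1] * 18))

# Easy examples (p_t close to 1) are down-weighted as gamma grows
y = [1, 0, 1, 0]
p = [0.9, 0.1, 0.3, 0.6]
print(binary_focal_loss(y, p, gamma=0.0, alpha=0.5))
print(binary_focal_loss(y, p, gamma=2.0, alpha=0.5))
```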

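Feature engineering:

Feature engineering includes some standard routines, but everyone also constructs new features from their own understanding of the task and data. The original author has described how he constructed his features.

The project 【金融风控系列】_[2]_欺诈识别 (Financial Risk Control Series, Part 2: Fraud Detection) gives code for five feature encoding methods.

Frequency encoding: count how many times each value occurs
Label encoding: map the original values to a sequence of integers. It is similar in purpose to one-hot encoding, but pd.factorize maps the categories to integer codes 0, 1, 2, whereas pd.get_dummies() maps them to [1,0,0], [0,1,0], [0,0,1]
Statistical features: group by a variable with DataFrame.groupby, then compute per-group statistics with agg
Cross features: combine two columns into a new feature, then label-encode it
Unique-value features: after grouping, return the number of unique values of the target attribute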
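
The five encodings listed above can each be written in one or two pandas lines. This is a sketch on a made-up frame, not the code from the referenced project; the new column names are illustrative.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
df = pd.DataFrame({
    "branch_id": [1, 1, 2, 2, 2, 3],
    "supplier_id": [10, 11, 10, 10, 12, 11],
    "asset_cost": [50.0, 60.0, 55.0, 65.0, 70.0, 80.0],
})

# 1. Frequency encoding: how often each value occurs
df["branch_id_freq"] = df["branch_id"].map(df["branch_id"].value_counts())

# 2. Label encoding: integer codes 0, 1, 2, ... in order of first appearance
df["supplier_id_label"] = pd.factorize(df["supplier_id"])[0]

# 3. Statistical feature: per-group mean of a numeric column
df["branch_asset_cost_mean"] = df.groupby("branch_id")["asset_cost"].transform("mean")

# 4. Cross feature: combine two columns, then label-encode the combination
cross = df["branch_id"].astype(str) + "_" + df["supplier_id"].astype(str)
df["branch_supplier_cross"] = pd.factorize(cross)[0]

# 5. Unique-value feature: number of distinct suppliers per branch
df["branch_supplier_nunique"] = df.groupby("branch_id")["supplier_id"].transform("nunique")

print(df)
```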

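For more material, I recommend studying prize-winning solutions on the Kaggle competition platform.

Feature selection

Feature engineering in the previous step produces a large number of features, but not all of them help model training.

Model performance is the best criterion for selecting features, but when there are many features it is hard to try every combination.

Common approaches therefore are:

Select features whose distributions are similar in the training and test sets.

Suppose the distribution of a feature's values differs greatly between the training set and the test set,

for example, all boys in the training set have black hair, but all boys in the test set have white hair.

Then the hair-color attribute should be dropped from the model.
Select attributes whose distributions differ across labels.

For example, hair length is distributed very differently under the male and female labels,

so it is an effective feature for classification. Attributes can also be broken down further so that the new attributes are distributed differently across labels.

For instance, total annual spending may show no clear difference between men and women, while a month-by-month analysis reveals a clear difference in certain months.

In that case, those months can be extracted as new attributes.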

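Select attributes with high feature importance in a model.

This approach trains a tree model and then reads out its feature importances.

Keep the features with high importance as training features and drop those with low importance.

Alternatively, use SHAP values to explore model interpretability (a related article argues that black-box models are actually more interpretable than logistic regression),

and select features according to the magnitude of their SHAP values.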
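
For the first criterion (similar train/test distributions), one widely used check is the population stability index (PSI). The text does not name a specific statistic, so PSI is an assumption here, as are the function name `psi` and the synthetic data.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between two samples of one feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges taken from the quantiles of the training sample
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    inner = np.unique(np.quantile(expected, qs))
    # Share of each sample that falls into every bin
    e_cnt = np.bincount(np.searchsorted(inner, expected, side="right"),
                        minlength=len(inner) + 1)
    a_cnt = np.bincount(np.searchsorted(inner, actual, side="right"),
                        minlength=len(inner) + 1)
    e_frac = np.clip(e_cnt / len(expected), eps, None)
    a_frac = np.clip(a_cnt / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
train_feat = rng.normal(0.0, 1.0, 5000)    # feature values in the training set
test_same = rng.normal(0.0, 1.0, 5000)     # test set with the same distribution
test_shifted = rng.normal(1.5, 1.0, 5000)  # test set with a shifted distribution

print(psi(train_feat, test_same))     # close to zero
print(psi(train_feat, test_shifted))  # much larger
```

A common rule of thumb treats a PSI below about 0.1 as stable and above about 0.25 as a strong shift, but these cut-offs are conventions and should be checked against validation results.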

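Model selection:

The three models most commonly used in data competitions today are LightGBM, XGBoost, and CatBoost. TabNet, which the author mentions, is a more recent model.

Further reading: "From structure to performance: an overview of the similarities and differences between XGBoost, LightGBM, and CatBoost"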

LightGBM

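LightGBM (Light Gradient Boosting Machine) is a lightweight gradient boosting framework and another evolution of the GBDT model. It follows the same ensemble-learning approach as XGBoost and, compared with XGBoost, trains faster and uses less memory.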

https://zhuanlan.zhihu.com/p/343842417

XGBoost

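XGBoost (eXtreme Gradient Boosting) is an algorithm toolkit (including its engineering implementation) built on the boosting framework. It is strong in parallel computing efficiency, missing-value handling, and predictive performance.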

https://zhuanlan.zhihu.com/p/340223260

CatBoost

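CatBoost is a machine learning library open-sourced in 2017 by Yandex, the Russian search giant. The name combines Categorical Features and Gradient Boosting, and it is likewise a framework based on gradient-boosted decision trees.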

https://www.biaodianfu.com/catboost.html
https://zhuanlan.zhihu.com/p/346420728

TabNet

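TabNet is a network architecture from Google Cloud AI designed specifically for tabular data (the paper first appeared on arXiv in 2019). It combines ideas from tree models and neural networks.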

https://zhuanlan.zhihu.com/p/359959585

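Threshold selection:

Because the task is scored with the F1 score, a threshold is needed to decide which samples are predicted positive and which negative.

The original author's approach uses the out-of-fold (OOF) predictions and picks the threshold that performs best on them; this gave the best leaderboard result (an improvement in the third decimal place).

Specifically, the approach assumes that the training and test sets follow the same distribution, so the best threshold can be found by scanning candidate values on the training (validation) predictions.

That threshold is then applied to the test set.

Another idea is to choose the test-set threshold so that the share of test samples predicted positive is close to the share of positives in the training set.

Of course, which approach to use has to be confirmed by actual validation.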
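
The second idea, matching the predicted positive rate on the test set to the positive rate of the training labels, reduces to taking a quantile of the test scores. A minimal sketch with made-up numbers follows; the function name is hypothetical.

```python
import numpy as np


def threshold_matching_positive_rate(train_labels, test_scores):
    """Threshold that flags roughly the same share of positives as in train."""
    pos_rate = float(np.mean(train_labels))
    # Scores above the (1 - pos_rate) quantile are predicted positive
    return float(np.quantile(test_scores, 1.0 - pos_rate))


train_labels = np.array([0] * 8 + [1] * 2)  # 20% positives
test_scores = np.linspace(0.05, 0.95, 10)   # ten evenly spaced toy scores

thres = threshold_matching_positive_rate(train_labels, test_scores)
test_pred = (test_scores > thres).astype(int)

print(thres)
print(test_pred.mean())  # share of test samples predicted positive
```

This variant needs no labels for the test set, but it relies on the assumption that the class ratio is the same in both sets.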

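Ensembling strategy:

Model ensembling falls into three strategies: Bagging, Boosting, and Stacking.

Bagging: train multiple models independently, each somewhat different from the others, and combine their results into the final prediction;

Boosting: combine multiple models trained in sequence, each one trying to improve (boost) the overall result;

Stacking: train k models to obtain k sets of predictions, then feed those predictions to a new algorithm whose output is the final prediction of the ensemble;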

https://www.jianshu.com/p/6491340b3474

💡 4 Code Walkthrough

🔥 4.1 Feature Engineering

Target encoding / mean encoding. Note that, to prevent overfitting, the encoding must be computed fold by fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Features to target-encode:
TARGET_ENCODING_FETAS = ['employment_type','branch_id','supplier_id','manufacturer_id', 'area_id','employee_code_id', 'asset_cost_bin']
# Implementation:
def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target encoding features'''
    # Training set: out-of-fold encoding, so no row is encoded with its own label
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for train_index, val_index in kfold.split(train[encode_cols], train[target_col]):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            # Mean target per category, computed on the other folds only
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            # Map directly instead of writing a new column into the df_val slice
            tg_feats[val_index, idx] = df_val[col].map(target_mean_dict).values
    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]
    # Test set: encode with the category means of the full training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)
    return train, test

Annual interest rate, binning, and other features:

import pandas as pd

def gen_new_feats(train, test):
    '''Generate new features, such as interest-rate and binning features'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])
    # Step 2: feature engineering
    # Interest rate of the secondary account:
    # (monthly payment * tenure - sanctioned amount) / sanctioned amount
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure'] - data[
        'sub_account_sanction_loan']) / data['sub_account_sanction_loan']
    # Interest rate of the main account, computed the same way
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure'] - data[
        'main_account_sanction_loan']) / data['main_account_sanction_loan']
    # Bin some of the features
    # Equal-width binning
    loan_to_asset_ratio_labels = [i for i in range(10)]
    data['loan_to_asset_ratio_bin'] = pd.cut(data["loan_to_asset_ratio"], 10, labels=loan_to_asset_ratio_labels)
    # Equal-frequency binning
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10, labels=loan_to_asset_ratio_labels)
    # Custom binning
    amount_cols = [
                   'total_monthly_payment',
                   'main_account_sanction_loan',
                   'main_account_disbursed_loan',
                   'sub_account_sanction_loan',
                   'sub_account_disbursed_loan',
                   'main_account_monthly_payment',
                   'sub_account_monthly_payment',
                   'total_sanction_loan'
                ]
    amount_labels = [i for i in range(10)]
    for col in amount_cols:
        # The last edge is the column maximum, so every value falls into some bin
        total_monthly_payment_bin = [-1, 5000, 10000, 30000, 50000, 100000, 300000, 500000, 1000000, 3000000, data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], total_monthly_payment_bin, labels=amount_labels).astype(int)
    # Step 3: return the training and test sets with the new features
    # (test rows are the ones whose loan_default is missing)
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]
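
The three binning styles used above behave quite differently on skewed data. The following sketch applies them to a small made-up series with one outlier; the variable names are illustrative.

```python
import pandas as pd

# Toy amounts with one large outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width binning: the value range is cut into 3 intervals of equal width,
# so the outlier pushes almost every value into the first bin
equal_width = pd.cut(s, 3, labels=[0, 1, 2])

# Equal-frequency binning: each bin receives about the same number of rows
equal_freq = pd.qcut(s, 2, labels=[0, 1])

# Custom binning: explicit edges chosen from domain knowledge
custom = pd.cut(s, [-1, 5, 50, s.max()], labels=[0, 1, 2])

print(pd.DataFrame({
    "value": s,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
    "custom": custom,
}))
```

pd.cut requires the bin edges to increase monotonically, so using the column maximum as the last custom edge only works when that maximum is larger than the previous edge.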

Neighbor-fraud feature (the default rate among the 10 nearest customer IDs before and after each ID; other neighborhood sizes could be tried to find the best one, but my time was limited):

import os
import pandas as pd
from tqdm import tqdm
# save_pkl / load_pkl are helpers from the project's utils module

def gen_neighbor_feats(train, test):
    '''Generate neighbor-fraud features'''
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # This feature takes a long time to compute, so it is cached as a pkl file
        # Build the ID set and the ID-indexed frame once instead of on every iteration
        train_ids = set(train.customer_id.values.tolist())
        train_by_id = train.set_index('customer_id')
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            # 199706 is the upper ID bound hard-coded by the author.
            # Note: range(i + 1, i + 10) covers 9 following IDs, versus 10 preceding IDs.
            if i >= 10 and i < 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))
            # Keep only neighbors that exist in the training set (they have labels)
            customer_id_neighbors = [customer_id_neighbor for customer_id_neighbor in customer_id_neighbors if
                                     customer_id_neighbor in train_ids]
            neighbor_default_prob = train_by_id.loc[customer_id_neighbors].loan_default.mean()
            neighbor_default_probs.append(neighbor_default_prob)
        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')
    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')
    return train, test
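
Because the loop above is slow, a vectorized variant may be worth trying. The sketch below is not the author's implementation: it assumes unique customer IDs, uses a symmetric window of k IDs on each side, and the function name `neighbor_default_rate` is hypothetical.

```python
import pandas as pd


def neighbor_default_rate(train, k=10):
    """Mean label of training rows whose customer_id is within k of each ID.

    The ID's own label is excluded. IDs without any labelled neighbor get NaN.
    """
    max_id = int(train["customer_id"].max())
    # Labels on a dense ID axis 0..max_id; IDs missing from train become NaN
    labels = train.set_index("customer_id")["loan_default"].reindex(range(max_id + 1))
    filled = labels.fillna(0.0)
    present = labels.notna().astype(float)

    roll = dict(window=2 * k + 1, center=True, min_periods=1)
    # Sum and count over the centered window, minus the ID's own contribution
    neighbor_sum = filled.rolling(**roll).sum() - filled
    neighbor_cnt = present.rolling(**roll).sum() - present

    rate = neighbor_sum / neighbor_cnt.where(neighbor_cnt > 0)
    return (rate.rename("neighbor_default_prob")
                .rename_axis("customer_id")
                .reset_index())


# Toy example: customer_id 3 is missing from train (e.g. it belongs to the test set)
toy_train = pd.DataFrame({"customer_id": [0, 1, 2, 4, 5],
                          "loan_default": [1, 0, 1, 1, 0]})
toy_feats = neighbor_default_rate(toy_train, k=1)
print(toy_feats)
```

The result can be merged onto the training and test sets by customer_id in the same way as the cached table above. Its values will differ slightly from the original feature, since the original window covers 10 preceding but only 9 following IDs.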

In the end I kept only 47 features:

USED_FEATS = [
             'customer_id',
             'neighbor_default_prob',
             'disbursed_amount',
             'asset_cost',
             'branch_id',
             'supplier_id',
             'manufacturer_id',
             'area_id',
             'employee_code_id',
             'credit_score',
             'loan_to_asset_ratio',
             'year_of_birth',
             'age',
             'sub_Rate',
             'main_Rate',
             'loan_to_asset_ratio_bin',
             'asset_cost_bin',
             'employment_type_mean_target',
             'branch_id_mean_target',
             'supplier_id_mean_target',
             'manufacturer_id_mean_target',
             'area_id_mean_target',
             'employee_code_id_mean_target',
             'asset_cost_bin_mean_target',
             'credit_history',
             'average_age',
             'total_disbursed_loan',
             'main_account_disbursed_loan',
             'total_sanction_loan',
             'main_account_sanction_loan',
             'active_to_inactive_act_ratio',
             'total_outstanding_loan',
             'main_account_outstanding_loan',
             'Credit_level',
             'outstanding_disburse_ratio',
             'total_account_loan_no',
             'main_account_tenure',
             'main_account_loan_no',
             'main_account_monthly_payment',
             'total_monthly_payment',
             'main_account_active_loan_no',
             'main_account_inactive_loan_no',
             'sub_account_inactive_loan_no',
             'enquirie_no',
             'main_account_overdue_no',
             'total_overdue_no',
             'last_six_month_defaulted_no'
        ]

🔥 4.2 Model Training

LightGBM (10 folds work better):

import logging
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        # Fold indices are positional, so use .iloc for the labels as well
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)
        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 64,
            'learning_rate': 0.02,
            'min_data_in_leaf': 150,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.7,
            'n_jobs': -1,
            'seed': 1024
        }
        # Note: LightGBM 4.0 removed verbose_eval and early_stopping_rounds from lgb.train;
        # on those versions pass callbacks=[lgb.early_stopping(20), lgb.log_evaluation(50)] instead.
        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        valid_sets=[dtrain, dvalid],
                        verbose_eval=50,
                        early_stopping_rounds=20)
        # Out-of-fold predictions for the validation part of this fold
        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        # Test predictions are averaged over the folds
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)
    return gbms, oof_preds, test_preds

XGBoost

import logging
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

def train_xgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train xgboost with k-fold split'''
    gbms = []
    # Use the n_fold argument (the original hard-coded n_splits=10 here)
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        # Fold indices are positional, so use .iloc for the labels as well
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)
        dtest = xgb.DMatrix(X_test)
        params={
            'booster':'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': ['logloss', 'auc'],
            'max_depth': 8,
            'subsample':0.9,
            'min_child_weight': 10,
            'colsample_bytree':0.85,
            'lambda': 10,
            'eta': 0.02,
            'seed': 1024
        }
        watchlist = [(dtrain, 'train'), (dvalid, 'test')]
        gbm = xgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        evals=watchlist,
                        verbose_eval=50,
                        early_stopping_rounds=20)
        # best_iteration is zero-based and the range end is exclusive, hence the + 1
        best_range = (0, gbm.best_iteration + 1)
        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=best_range)
        test_preds += gbm.predict(dtest, iteration_range=best_range) / kfold.n_splits
        gbms.append(gbm)
    return gbms, oof_preds, test_preds

🔥 4.3 Model Blending and Threshold Selection

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score


def gen_submit_file(df_test, test_preds, thres, save_path):
    '''Binarize the test predictions with the threshold and save the submission file'''
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']].copy()
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit


def gen_thres_new(df_train, oof_preds):
    '''Search for the threshold that maximizes macro F1 on the OOF predictions'''
    df_train['oof_preds'] = oof_preds
    # Starting point: the quantile that yields the same positive rate as the labels
    quantile_point = df_train['loan_default'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)
    # Scan candidates within +/- 0.2 of the starting point in steps of 0.01
    _thresh = []
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['loan_default'], np.where(oof_preds > thres_item, 1, 0), average='macro')])
    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()
    best_thresh = _thresh[best_id][0]

    print("threshold: {}\nF1 on the training set (OOF): {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh


# Collect the OOF results (train and the oof_preds_* arrays come from the training step)
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb,
                           'loan_default': train['loan_default']
                          })
# Model blending: weighted average of rank percentiles
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']
# Find the best threshold on the blended OOF predictions
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])
df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})
df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']
# Write the submission file
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')

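🔥 4.4 Automated EDA

Here is a Python library for automated EDA: Pandas Profiling.

With a single line of code, Pandas Profiling can generate an EDA report that includes descriptive statistics, correlations, missing values, text analysis, and more.

Because the output is HTML and cannot be displayed here, only screenshots of the result are shown and the code is not demonstrated.

See GitHub for more details.

🔣 5 Model Training

Using the platform's online environment, we reproduced the original author's training run.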

In [ ]

%cd work/xunfei2021_car_loan_top1-main/code/

In [2]

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
@author: heiye
@time: 2021/9/20 13:03
"""

from utils import *
from gen_feats import *


def train_xgb(train, test, feat_cols, label_col, n_fold=10):
    '''Train XGBoost'''
    for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
        train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
        test[col] = test[col].apply(lambda x: 1 if x > 1 else x)

    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb_kfold(X_train, y_train, X_test, n_fold=n_fold)

    if not os.path.exists('../user_data/gbms_xgb.pkl'):
        save_pkl(gbms_xgb, '../user_data/gbms_xgb.pkl')

    return gbms_xgb, oof_preds_xgb, test_preds_xgb


def train_lgb(train, test, feat_cols, label_col, n_fold=10):
    '''Train LightGBM'''
    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb_kfold(X_train, y_train, X_test, n_fold=n_fold)

    if not os.path.exists('../user_data/gbms_lgb.pkl'):
        save_pkl(gbms_lgb, '../user_data/gbms_lgb.pkl')

    return gbms_lgb, oof_preds_lgb, test_preds_lgb


if __name__ == '__main__':
    # Load the raw datasets
    logging.info('data loading...')
    train = pd.read_csv('../xfdata/车辆贷款违约预测数据集/train.csv')
    test = pd.read_csv('../xfdata/车辆贷款违约预测数据集/test.csv')

    # Feature engineering
    logging.info('feature generating...')
    train, test = gen_new_feats(train, test)
    train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS, target_col='loan_default', n_fold=10)
    train, test = gen_neighbor_feats(train, test)

    train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
    test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
    train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
    test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)
    train['asset_cost_bin_mean_target'] = train['asset_cost_bin_mean_target'].astype(float)
    test['asset_cost_bin_mean_target'] = test['asset_cost_bin_mean_target'].astype(float)

    # Model training: XGBoost results differ slightly between Linux and macOS; the saved model files are authoritative
    gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb(train.copy(), test.copy(),
                                                        feat_cols=SAVE_FEATS,
                                                        label_col='loan_default')
    gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb(train, test,
                                                        feat_cols=SAVE_FEATS,
                                                        label_col='loan_default')
    xgb_thres = gen_thres_new(train, oof_preds_xgb)
    lgb_thres =  gen_thres_new(train, oof_preds_lgb)

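    # Collect the results (gen_thres_new reads the loan_default column)
    df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                               'oof_preds_xgb': oof_preds_xgb,
                               'oof_preds_lgb': oof_preds_lgb,
                               'loan_default': train['loan_default']})

    # Model blending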
    df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)
    df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
    df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']
    thres = gen_thres_new(df_oof_res, df_oof_res['preds'])

    '''
    df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                                'test_preds_xgb': test_preds_xgb,
                                'test_preds_lgb': test_preds_lgb})

    df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
    df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
    df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']

    # Write the submission file
    df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                                save_path='../prediction_result/result.csv')
    '''

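🐱 6 Project Summary

This project reposts and explains the champion solution of the vehicle loan default prediction competition.
Model training was carried out on the AI Studio platform.
Many references, as well as my own earlier projects, were consulted while completing it.
Readers interested in data competitions are welcome to fork, follow, and support the project!

Please note: this project is based on the Top-1 solution of the Vehicle Loan Default Prediction competition and is reposted with the authorization of its author, @望尼玛.

If you have any questions, feel free to leave a comment.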