Kaggle[1] - Loan Default Prediction - Imperial College London

Latest recommended article published on 2024-06-10 10:03:10

杨之之

Article link: https://blog.csdn.net/u011292007/article/details/35827445

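This Kaggle competition aims to predict loan default grades and is solved with a two-stage approach: first a binary classification of default, then a regression on the loss. Features are selected with forward feature selection, using models such as Logistic Regression and GBM. The final model includes GBM, and the post stresses the importance of feature engineering and two-stage modeling.

Summary generated by CSDN using AI.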

Competition page: http://www.kaggle.com/c/loan-default-prediction

This competition asks you to determine whether a loan will default, as well as the loss incurred if it does default. Unlike traditional finance-based approaches to this problem, where one distinguishes between good or bad counterparties in a binary way, we seek to anticipate and incorporate both the default and the severity of the losses that result. In doing so, we are building a bridge between traditional banking, where we are looking at reducing the consumption of economic capital, to an asset-management perspective, where we optimize on the risk to the financial investor.

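The original description is quoted above; here is a brief summary of the task.

The competition data is about grading loan defaults. The grade ranges from 0 to 100: 0 means no default, and 1-100 indicates the severity of the default.

The data has more than 780 features (many of them irrelevant). The competition provides a training set (with default labels) and a test set.

The goal is to predict the label of every sample in the test set (that is, each customer's default grade). The evaluation metric is MAE.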
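To make the metric concrete, here is a minimal sketch of MAE (mean absolute error) computed with NumPy. The function name and the grades are made up for illustration and are not taken from the competition data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted grades."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up grades: four loans without default (0) and one default of grade 10.
y_true = [0, 0, 0, 10, 0]
all_zero = [0, 0, 0, 0, 0]
print(mean_absolute_error(y_true, all_zero))  # 2.0
```

Predicting 0 for every loan misses only the single default, so the error is 10 spread over 5 samples.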

With the task clear, let's look at the approach.

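1. First, training on all 780 features together is not realistic, for two reasons: too many features make training too slow, and the large amount of irrelevant information makes the model inaccurate.

2. The work should be split into 2 steps. The first step decides whether the loan defaults (a binary classification problem). The second step applies only if the first step predicts a default: we then predict the default grade (a regression problem). In short, this is a two-stage setup, where classification decides whether there is a default and regression gives the grade after a default.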
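The two-stage idea can be sketched roughly as follows. This is an illustration under assumptions, not the author's code: it uses scikit-learn's gradient boosting models for both stages, a hypothetical 0.5 probability threshold, and synthetic data in place of the competition files.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def fit_two_stage(X, y):
    """Stage 1: default vs. no default. Stage 2: grade, fitted on defaulters only."""
    is_default = (y > 0).astype(int)
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
    clf.fit(X, is_default)
    reg = GradientBoostingRegressor(n_estimators=50, random_state=0)
    reg.fit(X[y > 0], y[y > 0])
    return clf, reg

def predict_two_stage(clf, reg, X, threshold=0.5):
    """Predict 0 unless stage 1 flags a default; then use the stage-2 grade."""
    p_default = clf.predict_proba(X)[:, 1]
    grade = np.clip(reg.predict(X), 1, 100)  # a default has grade 1-100
    return np.where(p_default >= threshold, grade, 0.0)

# Synthetic stand-in data (assumption): feature 0 drives default,
# feature 1 drives the grade among defaulters.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
defaulted = X[:, 0] > 1.0
y = np.where(defaulted, np.clip(10 + 5 * X[:, 1], 1, 100), 0.0)

clf, reg = fit_two_stage(X, y)
pred = predict_two_stage(clf, reg, X)
print("train MAE:", np.mean(np.abs(pred - y)))
```

Fitting the regressor only on defaulted rows keeps the large mass of zeros from pulling every predicted grade toward 0. The threshold is a tunable choice, since the metric is MAE rather than classification accuracy.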

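First, how should feature selection be done?

Recall from an earlier blog post that an attentive participant shared 2 golden features for this competition (they already give AUC > 0.9), so most people chose the forward step method.

Forward step means trying one additional feature at a time, computing AUC or F1 for each candidate, and adding the feature with the highest AUC to the set of selected features. Computing AUC requires a concrete model, such as GBM or LR.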
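A minimal sketch of forward selection scored by cross-validated AUC is shown below. The assumptions are mine: scikit-learn's `LogisticRegression` as the scoring model, 3-fold cross-validation, a hypothetical `start` argument for seeding the search with known good features (such as the golden features), and a stopping rule that halts when AUC no longer improves. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_auc(X, y, cv=3):
    """Mean cross-validated AUC of a logistic regression on the given columns."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

def forward_select(X, y, start=(), max_features=3):
    """Greedily add the feature that gives the highest AUC at each step."""
    selected = list(start)
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    best_auc = cv_auc(X[:, selected], y) if selected else 0.0
    while remaining and len(selected) < max_features:
        scores = {j: cv_auc(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_auc:  # assumed stopping rule: no improvement
            break
        best_auc = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_auc

# Synthetic data (assumption): only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 3] + 0.3 * rng.normal(size=300) > 0).astype(int)

selected, auc = forward_select(X, y, max_features=2)
print(sorted(selected), round(auc, 3))
```

Each step fits one model per remaining candidate, so the cost grows quickly with the number of features. Starting from a couple of strong features shortens the search.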

STAGE 1：

Let's see how others did feature selection.