Predicting Loan Defaults with Logistic Regression

Introduction

Traditionally, lending has rested on trust. Credit reports existed before 1989, the year the FICO Score was created according to myFICO.com, but the lending process was fairly subjective, and potential borrowers were often judged by how trustworthy their character seemed. Today, lenders can use tools like FICO Scores to quantify how trustworthy potential borrowers are, reducing the role of subjective judgment. All of this is done for one purpose: to determine how likely it is that a given borrower will default on a loan.

Predicting default is a significant part of lending because lenders must judge whether a loan will result in profit or loss. Normally, loans are profitable because of interest, but sometimes a borrower will default, which is both a betrayal of the lender’s trust and a hazard to the lender’s business. Thus, it is important that the lender can gauge the likelihood of a borrower defaulting before making the loan.

Given the large number of factors that might affect whether a borrower defaults, it may be infeasible to come up with good estimates heuristically or by hand. The goal of this project is to explore whether we can employ statistical and machine learning models to better predict the risk of borrower default. By analyzing variables that describe loans and the financial situations of their borrowers, we will look for key relationships between the chance of default, loan characteristics, and borrower behavior.

Data Description

For this project, we will use anonymized data from a lending company. The data contains historical information on the loans themselves and on the characteristics of the borrowers. Some feature names are also anonymized to protect sensitive information. Of the variables in the original data file, we will focus on the following:

Default: This binary variable represents whether or not the borrower defaulted on the loan. Default will be the focus of this project because we want to analyze how it relates to the other variables. The data set contains 1000 loans that were defaulted on and 2000 that were not. In reality, only around 7% of loans were defaulted on, but we upsample this group to better extract signals on what might lead to loan default.

Reason: This categorical variable represents the reason the loan was taken out. Reasons for taking out a loan have been coded as the following: for the purchase of a boat, for a business, for credit cards, for an event, for a holiday, for the purchase of a home, for medical bills, for home relocation, for home renovation, for the installation of solar panels, for transport, and for other reasons.

Amount: This continuous variable represents the amount of money that was taken out as the loan.

Annual Income: This continuous variable represents the amount of money that the borrower earned last year.

Interest: This variable represents the amount of interest charged on the loan.

Term: This variable represents the length of time the loan lasts. In this data set, loan terms are either 3 or 5 years.

Employment: This variable represents the length of time the borrower has been employed. In this data set, this variable is categorical, with levels running from < 1 year through 1 year up to 10+ years.

Credit Balance: This continuous variable represents the amount of money that the borrower spent on credit last year.

Credit Ratio: This continuous variable is the proportion of the borrower’s credit line that has been used. Values are expressed as percentages, that is, the ratio multiplied by 100. Although the credit used should not exceed the credit line, a few borrowers have credit ratios greater than 100.

v5 and v6 are anonymized continuous variables.
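
The upsampling described for the Default variable can be illustrated with a minimal sketch. The article does not say how the resampling was done, so this assumes one common approach: drawing defaulted loans with replacement and non-defaulted loans without replacement. The synthetic data and the column names `default` and `amount` are hypothetical stand-ins, not the author's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the lending data: roughly 7% of loans default.
# Column names are hypothetical, not taken from the original data file.
n = 10_000
loans = pd.DataFrame({
    "default": rng.random(n) < 0.07,
    "amount": rng.integers(1_000, 35_000, size=n),
})

defaulted = loans[loans["default"]]
repaid = loans[~loans["default"]]

# Draw 1000 defaulted loans WITH replacement (there are fewer than 1000
# of them) and 2000 non-defaulted loans WITHOUT replacement.
defaulted_sample = defaulted.sample(n=1000, replace=True, random_state=0)
repaid_sample = repaid.sample(n=2000, replace=False, random_state=0)

# Combine and shuffle so the two classes are interleaved.
balanced = (
    pd.concat([defaulted_sample, repaid_sample])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(loans["default"].mean())     # share of defaults before resampling
print(balanced["default"].mean())  # share of defaults after resampling
```

Because the resampled default share (one third) differs from the real one (about 7%), predicted probabilities from a model fit on such data would overstate the default rate unless they are recalibrated.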

Exploratory Data Analysis

Independent Variables

We will first look at the distributions of, and relationships between, some characteristics of the loan and of the borrower. This will help us determine which predictor variables may have interesting patterns and where we should be concerned about multicollinearity, which occurs when predictor variables are so strongly correlated with one another that the model’s coefficient estimates become unstable and hard to interpret.
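
As a rough illustration of how multicollinearity can be screened for, the sketch below computes a correlation matrix and variance inflation factors (VIFs) on synthetic data. It is not the author's analysis: the column names and the deliberately correlated `credit_balance` column are assumptions made for the example. It relies on the identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors (hypothetical names). credit_balance is built
# from annual_income on purpose so the two are strongly correlated.
income = rng.normal(60_000, 15_000, n)
X = pd.DataFrame({
    "annual_income": income,
    "credit_balance": 0.3 * income + rng.normal(0, 1_000, n),
    "amount": rng.normal(15_000, 5_000, n),
})

# Pairwise Pearson correlations between predictors.
corr = X.corr()

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=X.columns)

print(corr.round(2))
print(vif.round(1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is largely explained by the others; the exact cutoff is a judgment call.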

Figure 1 (left) & Figure 2 (right). All images are by author unless stated otherwise.

It is interesting to note that the values of interest rate and loan amount occur with widely varying frequency: some values appear over 30 times in the data set, while others appear only once (Figures 1 and 2). This is probably because some interest rates and amounts are more popular as parts of standard loan packages.
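
A frequency pattern like this can be checked by counting how often each distinct value occurs. The sketch below uses a tiny made-up series of interest rates, not the article's data, and the name `interest` is an assumption.

```python
import pandas as pd

# Made-up interest rates for illustration only.
interest = pd.Series([7.9, 7.9, 7.9, 11.5, 11.5, 13.2, 6.0], name="interest")

# Count occurrences of each distinct value, most frequent first.
counts = interest.value_counts()

# Values that occur exactly once in the series.
singletons = sorted(counts[counts == 1].index.tolist())

print(counts)
print(singletons)
```

On real data, the most frequent values at the top of `counts` would be the candidates for standard loan packages.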
