35. Loan Default Prediction

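I. Project Introduction

Background

Set against the background of personal credit in financial risk control, the task is to predict from a loan applicant's data whether the applicant is likely to default, and to use that prediction to decide whether to approve the loan. This is a typical classification problem.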

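Meaning of each column

            id Unique identifier assigned to the loan listing
            loanAmnt Loan amount
            term Loan term (years)
            interestRate Loan interest rate
            installment Installment payment amount
            grade Loan grade
            subGrade Loan sub-grade
            employmentTitle Job title
            employmentLength Length of employment (years)
            homeOwnership Home-ownership status provided by the borrower at registration
            annualIncome Annual income
            verificationStatus Verification status
            issueDate Month in which the loan was issued
            purpose Loan purpose category stated by the borrower in the application
            postCode First 3 digits of the postal code provided by the borrower in the application
            regionCode Region code
            dti Debt-to-income ratio
            delinquency_2years Number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
            ficoRangeLow Lower bound of the borrower's FICO range at loan issuance
            ficoRangeHigh Upper bound of the borrower's FICO range at loan issuance
            openAcc Number of open credit lines in the borrower's credit file
            pubRec Number of derogatory public records
            pubRecBankruptcies Number of public-record bankruptcies
            revolBal Total revolving credit balance
            revolUtil Revolving line utilization rate: the amount of credit the borrower is using relative to all available revolving credit
            totalAcc Total number of credit lines currently in the borrower's credit file
            initialListStatus Initial listing status of the loan
            applicationType Whether the loan is an individual application or a joint application with two co-borrowers
            earliesCreditLine Month in which the borrower's earliest reported credit line was opened
            title Loan title provided by the borrower
            policyCode Policy code: 1 = publicly available; 2 = new products not publicly available
            n-series anonymous features Anonymous features n0-n14, derived from counts of borrower behavior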

II. Data Preparation

Import the required libraries

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.model_selection import cross_val_score,train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

##### Remove the pandas limit on the number of columns displayed
pd.options.display.max_columns = None 

Load the data

train = pd.read_csv('../data/贷款违约预测/train.csv')

III. Data Analysis

3.1 Overview of the Data

train.shape
(800000, 47)
train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
train.describe()
id loanAmnt term interestRate installment employmentTitle homeOwnership annualIncome verificationStatus isDefault purpose postCode regionCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
count 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 8.000000e+05 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 799761.000000 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799595.000000 8.000000e+05 799469.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.0 759730.000000 759730.000000 759730.000000 759730.000000 766761.000000 759730.000000 759730.000000 759730.000000 759729.000000 759730.000000 766761.000000 730248.000000 759730.000000 759730.000000 759730.000000
mean 399999.500000 14416.818875 3.482745 13.238391 437.947723 72005.351714 0.614213 7.613391e+04 1.009683 0.199513 1.745982 258.535648 16.385758 18.284557 0.318239 696.204081 700.204226 11.598020 0.214915 0.134163 1.622871e+04 51.790734 24.998861 0.416953 0.019267 1754.113589 1.0 0.511932 3.642330 5.642648 5.642648 4.735641 8.107937 8.575994 8.282953 14.622488 5.592345 11.643896 0.000815 0.003384 0.089366 2.178606
std 230940.252015 8716.086178 0.855832 4.765757 261.460393 106585.640204 0.675749 6.894751e+04 0.782716 0.399634 2.367453 200.037446 11.036679 11.150155 0.880325 31.865995 31.866674 5.475286 0.606467 0.377471 2.245802e+04 24.516126 11.999201 0.493055 0.137464 7941.474040 0.0 1.333266 2.246825 3.302810 3.302810 2.949969 4.799210 7.400536 4.561689 8.124610 3.216184 5.484104 0.030075 0.062041 0.509069 1.844377
min 0.000000 500.000000 3.000000 5.310000 15.690000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 630.000000 634.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 199999.750000 8000.000000 3.000000 9.750000 248.450000 427.000000 0.000000 4.560000e+04 0.000000 0.000000 0.000000 103.000000 8.000000 11.790000 0.000000 670.000000 674.000000 8.000000 0.000000 0.000000 5.944000e+03 33.400000 16.000000 0.000000 0.000000 0.000000 1.0 0.000000 2.000000 3.000000 3.000000 3.000000 5.000000 4.000000 5.000000 9.000000 3.000000 8.000000 0.000000 0.000000 0.000000 1.000000
50% 399999.500000 12000.000000 3.000000 12.740000 375.135000 7755.000000 1.000000 6.500000e+04 1.000000 0.000000 0.000000 203.000000 14.000000 17.610000 0.000000 690.000000 694.000000 11.000000 0.000000 0.000000 1.113200e+04 52.100000 23.000000 0.000000 0.000000 1.000000 1.0 0.000000 3.000000 5.000000 5.000000 4.000000 7.000000 7.000000 7.000000 13.000000 5.000000 11.000000 0.000000 0.000000 0.000000 2.000000
75% 599999.250000 20000.000000 3.000000 15.990000 580.710000 117663.500000 1.000000 9.000000e+04 2.000000 0.000000 4.000000 395.000000 22.000000 24.060000 0.000000 710.000000 714.000000 14.000000 0.000000 0.000000 1.973400e+04 70.700000 32.000000 1.000000 0.000000 5.000000 1.0 0.000000 5.000000 7.000000 7.000000 6.000000 11.000000 11.000000 10.000000 19.000000 7.000000 14.000000 0.000000 0.000000 0.000000 3.000000
max 799999.000000 40000.000000 5.000000 30.990000 1715.420000 378351.000000 5.000000 1.099920e+07 2.000000 1.000000 13.000000 940.000000 50.000000 999.000000 39.000000 845.000000 850.000000 86.000000 86.000000 12.000000 2.904836e+06 892.300000 162.000000 1.000000 1.000000 61680.000000 1.0 51.000000 33.000000 63.000000 63.000000 49.000000 70.000000 132.000000 79.000000 128.000000 45.000000 82.000000 4.000000 4.000000 39.000000 30.000000
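
The `isDefault` mean of 0.199513 in the `train.describe()` output means that roughly 20% of the loans in the training set defaulted, so the two classes are imbalanced. A minimal sketch of checking this directly is given here; `toy` is a small made-up frame standing in for `train`, and its values are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({"isDefault": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class. For a 0/1 label the share of 1s equals the column mean.
class_share = toy["isDefault"].value_counts(normalize=True)
default_rate = toy["isDefault"].mean()

print(class_share)
print(f"default rate: {default_rate:.2%}")
```

Running the same two lines on `train` should reproduce the mean reported by `describe()`. With about one default in five loans, plain accuracy can be a misleading metric for this task.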
# Count the features that contain missing values
train.isnull().any().sum()
22
# Count the missing values per feature and visualize them
missing = train.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace = True)
missing.plot.bar();

[Figure: bar chart of missing-value counts per feature, sorted in ascending order (output_13_0.png)]
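
Raw counts are hard to judge without the row count next to them. According to `train.info()`, `n11` has 730248 non-null values out of 800000, which is about 8.7% missing. A minimal sketch of computing the missing ratio per column follows; `toy` is a made-up stand-in for `train`, and its values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    "loanAmnt": [8000.0, 12000.0, 20000.0, 5000.0],
    "employmentLength": ["2 years", None, "10+ years", None],
    "n11": [0.0, np.nan, 0.0, 0.0],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows.
missing_ratio = toy.isnull().mean()

# Keep only the columns that have gaps, largest ratio first.
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio)
```

Replacing `toy` with `train` gives the same ranking as the bar chart, expressed as fractions instead of counts.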

# Find the features that take only a single value in the training set
fea = [col for col in train.columns if train[col].nunique() <=1]
fea
['policyCode']
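
A column with a single value cannot help a model separate defaulters from non-defaulters. The source only lists `policyCode` at this point; dropping such columns is a common follow-up step and is shown here as an assumption, not as something the author does. `toy` is a made-up stand-in for `train`.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    "loanAmnt": [8000.0, 12000.0, 20000.0],
    "policyCode": [1.0, 1.0, 1.0],
    "term": [3, 5, 3],
})

# Same test as in the article: at most one distinct non-null value.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]

# Assumed follow-up step: drop the constant columns from a copy of the frame.
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```

Note that `nunique()` ignores missing values by default, so a column holding one value plus NaN is also reported as constant.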
# List which features are numerical and which are object-typed
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x:x not in numerical_fea,list(train.columns)))
print('There are {} numerical features: {}'.format(len(numerical_fea),numerical_fea))
print()
print('There are {} object-type features: {}'.format(len(category_fea),category_fea))
There are 42 numerical features: ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']

There are 5 object-type features: ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
# Split the numerical features into continuous and discrete variables
numerical_noserial_fea = []
numerical_serial_fea = []

for fea in numerical_fea:
    temp = train[fea].nunique()
    if temp <= 10:
        numerical_noserial_fea.append(fea)
        continue
    numerical_serial_fea.append(fea)
    
print('Continuous numerical features:',numerical_serial_fea)
print()
print('Discrete numerical features:',numerical_noserial_fea)
Continuous numerical features: ['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']

Discrete numerical features: ['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
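
The split depends only on the number of distinct values (at most 10 counts as discrete), so identifiers and encoded categories such as `id`, `purpose`, `postCode` and `regionCode` land in the continuous list even though they are not continuous measurements. A minimal sketch of the same heuristic as a reusable function follows. The function name `split_by_cardinality` and the `toy` frame are hypothetical, and the sketch selects columns with `include="number"` where the article uses `exclude=['object']`.

```python
import pandas as pd

def split_by_cardinality(df, threshold=10):
    """Split numeric columns into (continuous, discrete) by distinct-value count.

    Hypothetical helper: columns with at most `threshold` distinct values are
    treated as discrete, mirroring the rule used in the article.
    """
    continuous, discrete = [], []
    for col in df.select_dtypes(include="number").columns:
        if df[col].nunique() <= threshold:
            discrete.append(col)
        else:
            continuous.append(col)
    return continuous, discrete

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": range(12),
    "loanAmnt": [500.0 * (i + 1) for i in range(12)],
    "term": [3, 5] * 6,
    "grade": list("ABCABCABCABC"),
})

continuous, discrete = split_by_cardinality(toy)
print("continuous:", continuous)
print("discrete:", discrete)
```

In the toy frame `id` is classified as continuous purely because it has 12 distinct values, which is the same effect seen in the output for `train`. Reviewing the continuous list by hand before plotting distributions is therefore worthwhile.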

3.2 Analysis of Discrete Numerical Variables

for fea in numerical_noserial_fea:
    print('Discrete variable:',fea)
    print(train[fea].value_counts())
    print()
    print()
Discrete variable: term
3    606902
5    193098
Name: term, dtype: int64


Discrete variable: homeOwnership
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64


Discrete variable: verificationStatus
1    309810