一、项目介绍
背景
以金融风控中的个人信贷为背景,根据贷款申请人的数据信息预测其是否有违约的可能,以此判断是否通过此项贷款,这是一个典型的分类问题。
具体的列名含义
id 为贷款清单分配的唯一信用证标识
loanAmnt 贷款金额
term 贷款期限(year)
interestRate 贷款利率
installment 分期付款金额
grade 贷款等级
subGrade 贷款等级之子级
employmentTitle 就业职称
employmentLength 就业年限(年)
homeOwnership 借款人在登记时提供的房屋所有权状况
annualIncome 年收入
verificationStatus 验证状态
issueDate 贷款发放的月份
purpose 借款人在贷款申请时的贷款用途类别
postCode 借款人在贷款申请中提供的邮政编码的前3位数字
regionCode 地区编码
dti 债务收入比
delinquency_2years 借款人过去2年信用档案中逾期30天以上的违约事件数
ficoRangeLow 借款人在贷款发放时的fico所属的下限范围
ficoRangeHigh 借款人在贷款发放时的fico所属的上限范围
openAcc 借款人信用档案中未结信用额度的数量
pubRec 贬损公共记录的数量
pubRecBankruptcies 公开记录清除的数量
revolBal 信贷周转余额合计
revolUtil 循环额度利用率,或借款人使用的相对于所有可用循环信贷的信贷金额
totalAcc 借款人信用档案中当前的信用额度总数
initialListStatus 贷款的初始列表状态
applicationType 表明贷款是个人申请还是与两个共同借款人的联合申请
earliesCreditLine 借款人最早报告的信用额度开立的月份
title 借款人提供的贷款名称
policyCode 公开可用的策略代码=1新产品不公开可用的策略代码=2
n系列匿名特征 匿名特征n0-n14,为一些贷款人行为计数特征的处理
二、数据准备
导入相关库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.model_selection import cross_val_score,train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
##### 取消pandas最大列显示限制
pd.options.display.max_columns = None
获取数据
train = pd.read_csv('../data/贷款违约预测/train.csv')
三、数据分析
3.1 总体了解数据
train.shape
(800000, 47)
train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
train.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | title | policyCode | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 799761.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799595.000000 | 8.000000e+05 | 799469.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.0 | 759730.000000 | 759730.000000 | 759730.000000 | 759730.000000 | 766761.000000 | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | 1.745982 | 258.535648 | 16.385758 | 18.284557 | 0.318239 | 696.204081 | 700.204226 | 11.598020 | 0.214915 | 0.134163 | 1.622871e+04 | 51.790734 | 24.998861 | 0.416953 | 0.019267 | 1754.113589 | 1.0 | 0.511932 | 3.642330 | 5.642648 | 5.642648 | 4.735641 | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
std | 230940.252015 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | 2.367453 | 200.037446 | 11.036679 | 11.150155 | 0.880325 | 31.865995 | 31.866674 | 5.475286 | 0.606467 | 0.377471 | 2.245802e+04 | 24.516126 | 11.999201 | 0.493055 | 0.137464 | 7941.474040 | 0.0 | 1.333266 | 2.246825 | 3.302810 | 3.302810 | 2.949969 | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 630.000000 | 634.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 199999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.450000 | 427.000000 | 0.000000 | 4.560000e+04 | 0.000000 | 0.000000 | 0.000000 | 103.000000 | 8.000000 | 11.790000 | 0.000000 | 670.000000 | 674.000000 | 8.000000 | 0.000000 | 0.000000 | 5.944000e+03 | 33.400000 | 16.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 399999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.135000 | 7755.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | 0.000000 | 203.000000 | 14.000000 | 17.610000 | 0.000000 | 690.000000 | 694.000000 | 11.000000 | 0.000000 | 0.000000 | 1.113200e+04 | 52.100000 | 23.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 | 0.000000 | 3.000000 | 5.000000 | 5.000000 | 4.000000 | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 599999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.710000 | 117663.500000 | 1.000000 | 9.000000e+04 | 2.000000 | 0.000000 | 4.000000 | 395.000000 | 22.000000 | 24.060000 | 0.000000 | 710.000000 | 714.000000 | 14.000000 | 0.000000 | 0.000000 | 1.973400e+04 | 70.700000 | 32.000000 | 1.000000 | 0.000000 | 5.000000 | 1.0 | 0.000000 | 5.000000 | 7.000000 | 7.000000 | 6.000000 | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
max | 799999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378351.000000 | 5.000000 | 1.099920e+07 | 2.000000 | 1.000000 | 13.000000 | 940.000000 | 50.000000 | 999.000000 | 39.000000 | 845.000000 | 850.000000 | 86.000000 | 86.000000 | 12.000000 | 2.904836e+06 | 892.300000 | 162.000000 | 1.000000 | 1.000000 | 61680.000000 | 1.0 | 51.000000 | 33.000000 | 63.000000 | 63.000000 | 49.000000 | 70.000000 | 132.000000 | 79.000000 | 128.000000 | 45.000000 | 82.000000 | 4.000000 | 4.000000 | 39.000000 | 30.000000 |
# 查看数据集中特征缺失值的特征数
train.isnull().any().sum()
22
# 具体的查看缺失特征数量并可视化
missing = train.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace = True)
missing.plot.bar();
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-lUzXSsK9-1615121853776)(output_13_0.png)]
# 查看训练集测试集中特征属性只有一值的特征
fea = [col for col in train.columns if train[col].nunique() <=1]
fea
['policyCode']
# 查看特征的数值类型有哪些,对象类型有哪些
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x:x not in numerical_fea,list(train.columns)))
print('数值类型特征有{}个,分别为{}:'.format(len(numerical_fea),numerical_fea))
print()
print('对象类型特征有{}个,分别为{}:'.format(len(category_fea),category_fea))
数值类型特征有42个,分别为['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']:
对象类型特征有5个,分别为['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']:
# 划分数值型变量中的连续变量和离散型变量
numerical_noserial_fea = []
numerical_serial_fea = []
for fea in numerical_fea:
temp = train[fea].nunique()
if temp <= 10:
numerical_noserial_fea.append(fea)
continue
numerical_serial_fea.append(fea)
print('数值连续型变量特征有:',numerical_serial_fea)
print()
print('数值离散型变量特征有:',numerical_noserial_fea)
数值连续型变量特征有: ['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
数值离散型变量特征有: ['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
3.2 数值离散型变量分析
for fea in numerical_noserial_fea:
print('离散型变量:',fea)
print(train[fea].value_counts())
print()
print()
离散型变量: term
3 606902
5 193098
Name: term, dtype: int64
离散型变量: homeOwnership
0 395732
1 317660
2 86309
3 185
5 81
4 33
Name: homeOwnership, dtype: int64
离散型变量: verificationStatus
1 309810