零基础入门金融风控-贷款违约预测-数据分析

这篇博客介绍了金融风控中的贷款违约预测分析,通过数据总体了解、缺失值和唯一值分析、数据类型查看以及数据间相关性研究,为特征工程和建模打下基础。在代码实例中,涉及了数据导入、缺失值处理、特征类型检查以及数值特征的分布分析等步骤,旨在揭示数据背后的模式和关系。
摘要由CSDN通过智能技术生成

1.数据分析

       查看并初步了解数据,熟悉数据,为后续的特征工程做准备,主要目的如下:
       1.EDA价值主要在于熟悉整个数据的基本情况(取值类型、取值类别、取值范围、缺失值、异常值等),对数据集进行分析是否可以进一步进行建模分析;
       2.了解各变量间的相互关系、变量与预测值之间的存在关系;
       3.为特征工程做准备;

2.主要分析内容

2.1 数据总体了解:

       读取数据集并了解数据集大小,原始特征维度;
       通过info熟悉数据类型;
       粗略查看数据集中各特征基本统计量;

2.2 缺失值和唯一值:

       查看数据缺失值情况;
       查看唯一值特征情况;

2.3 深入数据-查看数据类型:

       类别型数据;
       数值型数据;
       离散数值型数据
       连续数值型数据

2.4 数据间相关关系:

       特征和特征之间关系;
       特征和目标变量之间关系;

2.5 用pandas_profiling生成数据报告

3.代码实例

3.1 导入相关库

# 导入相关库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_columns = None
pd.options.display.max_rows = None

%matplotlib inline
pd.set_option('display.max_colwidth', -1)

3.2 读取文件数据

# 读取数据
train_data = pd.read_csv('./RawData/train.csv')
test_data = pd.read_csv('./RawData/testA.csv')

3.3 初步了解

# 查看数据训练数据
train_data.head()
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership annualIncome verificationStatus issueDate isDefault purpose postCode regionCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType earliesCreditLine title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
0 0 35000.0 5 19.52 917.97 E E2 320.0 2 years 2 110000.0 2 2014-07-01 1 1 137.0 32 17.05 0.0 730.0 734.0 7.0 0.0 0.0 24178.0 48.9 27.0 0 0 Aug-2001 1.0 1.0 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 1 18000.0 5 18.49 461.90 D D2 219843.0 5 years 0 46000.0 2 2012-08-01 0 0 156.0 18 27.83 0.0 700.0 704.0 13.0 0.0 0.0 15096.0 38.9 18.0 1 0 May-2002 1723.0 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN
2 2 12000.0 5 16.99 298.17 D D3 31698.0 8 years 0 74000.0 2 2015-10-01 0 0 337.0 14 22.77 0.0 675.0 679.0 11.0 0.0 0.0 4606.0 51.8 27.0 0 0 May-2006 0.0 1.0 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0
3 3 11000.0 3 7.26 340.96 A A4 46854.0 10+ years 1 118000.0 1 2015-08-01 0 4 148.0 11 17.21 0.0 685.0 689.0 9.0 0.0 0.0 9948.0 52.6 28.0 1 0 May-1999 4.0 1.0 6.0 4.0 6.0 6.0 4.0 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0
4 4 3000.0 3 12.99 101.07 C C2 54.0 NaN 1 29000.0 2 2016-03-01 0 10 301.0 21 32.16 0.0 690.0 694.0 12.0 0.0 0.0 2942.0 32.0 27.0 0 0 Aug-1977 11.0 1.0 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0
# 查看训练数据字段
train_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
# 测试数据集字段
test_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'purpose',
       'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
       'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
       'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
       'earliesCreditLine', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3',
       'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
# 数据维度
print(train_data.shape)
print(test_data.shape)
(800000, 47)
(200000, 46)

       训练数据集比测试数据集多了一个变量isDefault,即为用户是否违约的标签,其余变量作为输入变量;
       数据总量:训练集-800000,测试集-200000

# 变量类型和空值
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id                    800000 non-null int64
loanAmnt              800000 non-null float64
term                  800000 non-null int64
interestRate          800000 non-null float64
installment           800000 non-null float64
grade                 800000 non-null object
subGrade              800000 non-null object
employmentTitle       799999 non-null float64
employmentLength      753201 non-null object
homeOwnership         800000 non-null int64
annualIncome          800000 non-null float64
verificationStatus    800000 non-null int64
issueDate             800000 non-null object
isDefault             800000 non-null int64
purpose               800000 non-null int64
postCode              799999 non-null float64
regionCode            800000 non-null int64
dti                   799761 non-null float64
delinquency_2years    800000 non-null float64
ficoRangeLow          800000 non-null float64
ficoRangeHigh         800000 non-null float64
openAcc               800000 non-null float64
pubRec                800000 non-null float64
pubRecBankruptcies    799595 non-null float64
revolBal              800000 non-null float64
revolUtil             799469 non-null float64
totalAcc              800000 non-null float64
initialListStatus     800000 non-null int64
applicationType       800000 non-null int64
earliesCreditLine     800000 non-null object
title                 799999 non-null float64
policyCode            800000 non-null float64
n0                    759730 non-null float64
n1                    759730 non-null float64
n2                    759730 non-null float64
n3                    759730 non-null float64
n4                    766761 non-null float64
n5                    759730 non-null float64
n6                    759730 non-null float64
n7                    759730 non-null float64
n8                    759729 non-null float64
n9                    759730 non-null float64
n10                   766761 non-null float64
n11                   730248 non-null float64
n12                   759730 non-null float64
n13                   759730 non-null float64
n14                   759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
训练数据集数据总量为800000,某些属性的数据为空
employmentTitle(就业职称):float64,1个空值;
employmentLength(就业年限(年)):object,非数值型<字符串>,多个空值;
postCode(借款人在贷款申请中提供的邮政编码的前3位数字):float64,1个空值;
dti(债务收入比):float64,多个空值;
pubRecBankruptcies(公开记录清除的数量):float64,多个空值;
revolUtil(循环额度利用率,或借款人使用的相对于所有可用循环信贷的信贷金额):float64,多个空值;
title(借款人提供的贷款名称):float64,1个空值;
n0-n14(计数特征):float64,均有大量空值;
同数据中也有一些非数值型数据:
grade(贷款等级)、subGrade(贷款等级之子级)、employmentLength(就业年限(年))、
issueDate(贷款发放的月份)、earliesCreditLine(借款人最早报告的信用额度开立的月份)
# 查看具体的空值变量及其数量
train_data.isnull().sum()
id                    0    
loanAmnt              0    
term                  0    
interestRate          0    
installment           0    
grade                 0    
subGrade              0    
employmentTitle       1    
employmentLength      46799
homeOwnership         0    
annualIncome          0    
verificationStatus    0    
issueDate             0    
isDefault             0    
purpose               0    
postCode              1    
regionCode            0    
dti                   239  
delinquency_2years    0    
ficoRangeLow          0    
ficoRangeHigh         0    
openAcc               0    
pubRec                0    
pubRecBankruptcies    405  
revolBal              0    
revolUtil             531  
totalAcc              0    
initialListStatus     0    
applicationType       0    
earliesCreditLine     0    
title                 1    
policyCode            0    
n0                    40270
n1                    40270
n2                    40270
n3                    40270
n4                    33239
n5                    40270
n6                    40270
n7                    40270
n8                    40271
n9                    40270
n10                   33239
n11                   69752
n12                   40270
n13                   40270
n14                   40270
dtype: int64

       大致查看数据中各变量的统计量

train_data.describe()
id loanAmnt term interestRate installment employmentTitle homeOwnership annualIncome verificationStatus isDefault purpose postCode regionCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
count 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 8.000000e+05 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 799761.000000 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799595.000000 8.000000e+05 799469.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.0 759730.000000 759730.000000 759730.000000 759730.000000 766761.000000 759730.000000 759730.000000 759730.000000 759729.000000 759730.000000 766761.000000 730248.000000 759730.000000 759730.000000 759730.000000
mean 399999.500000 14416.818875 3.482745 13.238391 437.947723 72005.351714 0.614213 7.613391e+04 1.009683 0.199513 1.745982 258.535648 16.385758 18.284557 0.318239 696.204081 700.204226 11.598020 0.214915 0.134163 1.622871e+04 51.790734 24.998861 0.416953 0.019267 1754.113589 1.0 0.511932 3.642330 5.642648 5.642648 4.735641 8.107937 8.575994 8.282953 14.622488 5.592345 11.643896 0.000815 0.003384 0.089366 2.178606
std 230940.252013 8716.086178 0.855832 4.765757 261.460393 106585.640204 0.675749 6.894751e+04 0.782716 0.399634 2.367453 200.037446 11.036679 11.150155 0.880325 31.865995 31.866674 5.475286 0.606467 0.377471 2.245802e+04 24.516126 11.999201 0.493055 0.137464 7941.474040 0.0 1.333266 2.246825 3.302810 3.302810 2.949969 4.799210 7.400536 4.561689 8.124610 3.216184 5.484104 0.030075 0.062041 0.509069 1.844377
min 0.000000 500.000000 3.000000 5.310000 15.690000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 630.000000 634.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2.000000 0.000000 0.000000 0.000000
  • 3
    点赞
  • 9
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值