Getting Started with Financial Risk Control: Loan Default Prediction — Data Analysis

Main goals

  • Get a basic picture of the dataset: missing values, outliers, dataset size, number of raw features, and so on
  • Understand how the variables relate to one another and to the target
  • Examine the distribution of each variable to prepare for feature engineering

Initialization

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')

# Data directory (absolute path)
INPUT_PATH = 'G:\\代码\\数据挖掘实践\\'

Reading the files

data_train = pd.read_csv(INPUT_PATH +'train.csv')
data_test_a = pd.read_csv(INPUT_PATH +'testA.csv')  

Chunked reading

When the data file is too large to load at once, it can be read in chunks. Chunked reading returns a TextFileReader, which is an iterable object.

# Set the chunksize parameter to control how many rows are read at a time
chunker = pd.read_csv(INPUT_PATH +"train.csv",chunksize=100000,iterator=True)
temp = chunker.get_chunk(200000) # read a block with the specified number of rows
len(temp)
200000
# Iterate over the remaining chunks
for item in chunker: # each item is a DataFrame with chunksize rows
    print(type(item))
    print(len(item))
    print(item['loanAmnt'].mean())
<class 'pandas.core.frame.DataFrame'>
100000
14405.10325
<class 'pandas.core.frame.DataFrame'>
100000
14449.50525
<class 'pandas.core.frame.DataFrame'>
100000
14368.34225
<class 'pandas.core.frame.DataFrame'>
100000
14422.87375
<class 'pandas.core.frame.DataFrame'>
100000
14426.4795
<class 'pandas.core.frame.DataFrame'>
100000
14417.336
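
Per-chunk statistics can also be combined into a global figure without ever loading the full file. A minimal sketch (reusing the same INPUT_PATH and train.csv as above) that accumulates the sum and row count of loanAmnt chunk by chunk to obtain the overall mean:

# Illustrative sketch: overall mean of 'loanAmnt' computed chunk by chunk
total_sum, total_cnt = 0.0, 0
for chunk in pd.read_csv(INPUT_PATH + "train.csv", chunksize=100000):
    total_sum += chunk['loanAmnt'].sum()
    total_cnt += len(chunk)
print('overall mean of loanAmnt:', total_sum / total_cnt)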

Statistical analysis

Check the number of samples and the number of raw features:

print(data_train.shape)
print(data_train.columns)
(800000, 47)
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

Check the data types and whether there are missing values:

data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

Look at the summary statistics:

data_train.describe()
(only the first five and the last of the 42 numeric columns are shown)
                   id       loanAmnt           term   interestRate    installment  ...            n14
count   800000.000000  800000.000000  800000.000000  800000.000000  800000.000000  ...  759730.000000
mean    399999.500000   14416.818875       3.482745      13.238391     437.947723  ...       2.178606
std     230940.252013    8716.086178       0.855832       4.765757     261.460393  ...       1.844377
min          0.000000     500.000000       3.000000       5.310000      15.690000  ...       0.000000
25%     199999.750000    8000.000000       3.000000       9.750000     248.450000  ...       1.000000
50%     399999.500000   12000.000000       3.000000      12.740000     375.135000  ...       2.000000
75%     599999.250000   20000.000000       3.000000      15.990000     580.710000  ...       3.000000
max     799999.000000   40000.000000       5.000000      30.990000    1715.420000  ...      30.000000

8 rows × 42 columns

Check which columns contain missing values:

print(f'{data_train.isnull().any().sum()} columns contain missing values') # .any() flags a column that has at least one missing value
22 columns contain missing values

Check whether any feature has a missing rate above 50%:

have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
        
fea_null_moreThanHalf
{}

Look at each feature with missing values and its missing rate:

missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

[Figure: bar chart of the missing rate of each feature that has missing values]

If only a few values are missing, filling them is usually the first choice; tree models such as LightGBM can also take NaNs directly and let the trees handle them during splitting. If a feature has too many missing values, consider dropping it instead.
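
A minimal sketch of the strategies just described: median fill for numerical columns, mode fill for categorical ones, and dropping columns whose missing rate exceeds a threshold (the 0.9 threshold below is purely illustrative; no column in this dataset actually reaches it):

# Illustrative missing-value handling, not the final feature-engineering code
df = data_train.copy()
miss_rate = df.isnull().sum() / len(df)
df = df.drop(columns=miss_rate[miss_rate > 0.9].index)  # drop columns missing more than 90%
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])   # categorical: fill with the mode
        else:
            df[col] = df[col].fillna(df[col].median())    # numerical: fill with the median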

Find features whose value is identical for every sample:

one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1] 
one_value_fea 
['policyCode']

Examine the feature types:

  • Features generally split into categorical features and numerical features.
  • A categorical feature is sometimes purely nominal and sometimes carries an order. For the levels A, B, C, … in 'grade', whether they are mere labels or A really is better than the rest has to be judged from the business context.
  • Numerical features can usually be fed into a model directly, but risk analysts often additionally bin them, convert them to WOE encodings, and build standard scorecards. In terms of model performance, binning mainly reduces the complexity of a variable and the impact of its noise, and strengthens the relationship between the variable and the target, making the model more stable (a minimal binning/WOE sketch follows the feature-type listing below).
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
print('Numerical features')
print(numerical_fea)
print('Categorical features')
print(category_fea)
Numerical features
['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
Categorical features
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
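
As promised above, here is a minimal, purely illustrative binning/WOE sketch: interestRate is cut into five equal-frequency bins with pd.qcut (both the bin count and the choice of feature are assumptions, not the author's), and each bin is encoded with WOE = ln((bad share in bin) / (good share in bin)):

# Illustrative WOE computation on an equal-frequency binning of 'interestRate'
bins = pd.qcut(data_train['interestRate'], q=5, duplicates='drop')
grouped = data_train.groupby(bins)['isDefault'].agg(['sum', 'count'])
grouped['bad'] = grouped['sum']                      # defaults in the bin
grouped['good'] = grouped['count'] - grouped['sum']  # non-defaults in the bin
bad_total, good_total = grouped['bad'].sum(), grouped['good'].sum()
grouped['woe'] = np.log((grouped['bad'] / bad_total) / (grouped['good'] / good_total))
print(grouped[['bad', 'good', 'woe']])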

Analysis of the numerical variables

  • Decide whether each numerical feature is discrete or continuous
# Split the numerical features into discrete and continuous ones
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10: # features with at most 10 distinct values are treated as discrete, the rest as continuous
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)

print('Continuous numerical features')
print(numerical_serial_fea)
print('Discrete numerical features')
print(numerical_noserial_fea)
Continuous numerical features
['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
Discrete numerical features
['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
  • Inspect the value counts of each discrete numerical feature
# Value counts of the discrete numerical features
print('term')
print(data_train['term'].value_counts())
print('\nhomeOwnership')
print(data_train['homeOwnership'].value_counts())
print('\nverificationStatus')
print(data_train['verificationStatus'].value_counts())
print('\ninitialListStatus')
print(data_train['initialListStatus'].value_counts())
print('\napplicationType')
print(data_train['applicationType'].value_counts())
print('\npolicyCode')
print(data_train['policyCode'].value_counts())
print('\nn11')
print(data_train['n11'].value_counts())
print('\nn12')
print(data_train['n12'].value_counts())
term
3    606902
5    193098
Name: term, dtype: int64

homeOwnership
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64

verificationStatus
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64

initialListStatus
0    466438
1    333562
Name: initialListStatus, dtype: int64

applicationType
0    784586
1     15414
Name: applicationType, dtype: int64

policyCode
1.0    800000
Name: policyCode, dtype: int64

n11
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64

n12
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64

Some features, such as 'policyCode', take the same value for every sample and therefore carry no information for the analysis. Others, such as 'n11' and 'n12', are extremely imbalanced, with one value accounting for almost all samples.
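
Building on that observation, one might also flag near-constant features, not only strictly constant ones. A small illustrative check (the 99.9% threshold is arbitrary):

# Illustrative: features whose most frequent value covers more than 99.9% of rows
near_constant_fea = [col for col in data_train.columns
                     if data_train[col].value_counts(normalize=True, dropna=False).iloc[0] > 0.999]
print(near_constant_fea)  # 'policyCode' would be flagged here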

  • Visualizing the continuous numerical variables
# Plot the distribution of every continuous numerical feature
f = pd.melt(data_train, value_vars=numerical_serial_fea) # pd.melt reshapes from wide to long format
g = sns.FacetGrid(f, col="variable", col_wrap=5, sharex=False, sharey=False)
# FacetGrid is a seaborn class: instantiate it with a DataFrame, then call .map to draw one panel per variable
g = g.map(sns.distplot, "value") # note: distplot is deprecated in newer seaborn; sns.histplot(..., kde=True) is the replacement

[Figure: FacetGrid of distribution plots, one panel per continuous numerical feature]

  • Look at the distribution of a single numerical variable and check whether it is roughly normal; if not, take its logarithm and check again.
  • Taking the log compresses data that varies over a wide range into a much narrower one and reduces skew, which serves a purpose similar to rescaling.
# Plot the distributions of loanAmnt and log(loanAmnt)
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)

[Figure: distribution of loanAmnt (left) and of log(loanAmnt) (right)]
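
To put a number on what the two histograms show, the skewness before and after the log transform can be compared; a minimal sketch using scipy.stats.skew (scipy is an extra dependency not imported elsewhere in this notebook):

# Skewness closer to 0 means a more symmetric, more nearly normal shape
from scipy.stats import skew
print('skew of loanAmnt:     ', skew(data_train['loanAmnt']))
print('skew of log(loanAmnt):', skew(np.log(data_train['loanAmnt'])))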

Analysis of the categorical features

print('grade')
print(data_train['grade'].value_counts()) 
print('\nsubGrade')
print(data_train['subGrade'].value_counts())
print('\nemploymentLength')
print(data_train['employmentLength'].value_counts())
print('\nissueDate')
print(data_train['issueDate'].value_counts())
print('\nearliesCreditLine')
print(data_train['earliesCreditLine'].value_counts())
print('\nisDefault')
print(data_train['isDefault'].value_counts())
grade
B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64

subGrade
C1    50763
B4    49516
B5    48965
B3    48600
C2    47068
C3    44751
C4    44272
B2    44227
B1    42382
C5    40264
A5    38045
A4    30928
D1    30538
D2    26528
A1    25909
D3    23410
A3    22655
A2    22124
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64

employmentLength
10+ years    262753
2 years       72358
< 1 year      64237
3 years       64152
1 year        52489
5 years       50102
4 years       47985
6 years       37254
8 years       36192
7 years       35407
9 years       30272
Name: employmentLength, dtype: int64

issueDate
2016-03-01    29066
2015-10-01    25525
2015-07-01    24496
2015-12-01    23245
2014-10-01    21461
              ...  
2007-08-01       23
2007-07-01       21
2008-09-01       19
2007-09-01        7
2007-06-01        1
Name: issueDate, Length: 139, dtype: int64

earliesCreditLine
Aug-2001    5567
Aug-2002    5403
Sep-2003    5403
Oct-2001    5258
Aug-2000    5246
            ... 
Feb-1960       1
Jul-1955       1
Oct-1957       1
Aug-1946       1
Aug-1958       1
Name: earliesCreditLine, Length: 720, dtype: int64

isDefault
0    640390
1    159610
Name: isDefault, dtype: int64
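
Since grade/subGrade are ordered and employmentLength is really a number of years stored as text, a natural follow-up (usually done later in feature engineering; the mapping below is only an illustrative sketch) is to turn them into integers:

# Illustrative ordinal encoding, not part of the original analysis
df = data_train.copy()
df['grade_num'] = df['grade'].map({g: i for i, g in enumerate('ABCDEFG')})
def emp_length_to_int(s):
    if pd.isnull(s):
        return np.nan
    if s == '10+ years':
        return 10
    if s == '< 1 year':
        return 0
    return int(s.split()[0])
df['employmentLength_num'] = df['employmentLength'].apply(emp_length_to_int)
print(df[['grade', 'grade_num', 'employmentLength', 'employmentLength_num']].head())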

Visualizing variable distributions

Distribution of a single variable

plt.figure(figsize=(8, 8))
sns.barplot(x=data_train["employmentLength"].value_counts(dropna=False)[:20],
            y=data_train["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()

[Figure: horizontal bar chart of employmentLength value counts]

Visualizing the distribution of a feature x under different values of y

  • Distribution of categorical variables under different values of y
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh',ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh',ax=ax4, title='Count of employmentLength non-fraud')
plt.show()

[Figure: counts of grade and employmentLength for defaulted (fraud) vs. non-defaulted loans]
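
The four panels above show raw counts; to see how strongly a categorical feature separates the two classes, it is often more direct to compute the default rate within each category. A minimal illustrative sketch:

# Default rate (mean of isDefault) within each grade and each employmentLength value
print(data_train.groupby('grade')['isDefault'].mean())
print(data_train.groupby('employmentLength')['isDefault'].mean())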

  • Distribution of continuous variables under different values of y
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
data_train.loc[data_train['isDefault'] == 1]['loanAmnt'] \
    .apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Fraud',
          color='r',
          xlim=(-3, 10),
          ax=ax1)
data_train.loc[data_train['isDefault'] == 0]['loanAmnt'] \
    .apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Not Fraud',
          color='b',
          xlim=(-3, 10),
          ax=ax2)

[Figure: histograms of log(loanAmnt) for defaulted vs. non-defaulted loans]

total = len(data_train)
total_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(12,5))
plt.subplot(121)  # 1 row, 2 columns; draw the first of the two subplots
plot_tr = sns.countplot(x='isDefault', data=data_train)
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x() + p.get_width() / 2.,
                 height + 3,
                 '{:1.2f}%'.format(height / total * 100),
                 ha="center", fontsize=15)

percent_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum()
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x() + p.get_width() / 2.,
                   height + 3,
                   '{:1.2f}%'.format(height / total_amt * 100),
                   ha="center", fontsize=15)

[Figure: isDefault counts (left) and total loanAmnt by isDefault (right), annotated with percentages]

Processing and inspecting the date columns

# Convert to datetime format
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x-startdate).dt.days

data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x-startdate).dt.days

plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
# The issueDateDT ranges of train and test overlap, so validating with a time-based split would not be a sound choice

[Figure: overlapping histograms of issueDateDT for the train and test sets]
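
Besides the day offset issueDateDT, calendar parts of issueDate can also be pulled out as candidate features; a minimal sketch (these derived columns are illustrative and not part of the original notebook):

# Illustrative: extract year and month from the already-parsed issueDate column
data_train['issueDate_year'] = data_train['issueDate'].dt.year
data_train['issueDate_month'] = data_train['issueDate'].dt.month
print(data_train[['issueDate', 'issueDate_year', 'issueDate_month']].head())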

Pivot table

# A pivot table can have several index levels; 'columns' is optional, and the aggregation function aggfunc is applied to the columns listed in 'values'
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
pivot 
(only the first five and last two of the 139 issueDateDT columns are shown)
             loanAmnt
issueDateDT      0        30       61       92       122   ...      4171       4201
grade
A                NaN  53650.0  42000.0  19500.0  34425.0   ...  2694025.0  2245625.0
B                NaN  13000.0  24000.0  32125.0   7025.0   ...  3922575.0  3257100.0
C                NaN  68750.0   8175.0  10000.0  61800.0   ...  2870050.0  2246250.0
D                NaN      NaN   5500.0   2850.0  28625.0   ...  2452375.0  1771750.0
E             7500.0      NaN  10000.0      NaN  17975.0   ...   883950.0   802425.0
F                NaN      NaN  31250.0   2125.0      NaN   ...    72300.0        NaN
G                NaN      NaN      NaN      NaN      NaN   ...    25100.0     1000.0

7 rows × 139 columns
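
A pivot this wide is easier to read as a heatmap; a minimal sketch (the log1p scaling is an illustrative choice to tame the wide value range):

# Illustrative: total loanAmnt per grade over time, shown as a heatmap
plt.figure(figsize=(16, 4))
sns.heatmap(np.log1p(pivot['loanAmnt']), cmap='viridis')
plt.title('log(1 + total loanAmnt) by grade and issueDateDT')
plt.show()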

Generating a data report with pandas_profiling

import pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("example.html")  
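
Note that pandas_profiling has since been renamed to ydata-profiling; with newer versions of the package, the equivalent call looks roughly like the sketch below (minimal=True skips the most expensive analyses, which helps on a dataset of this size):

# Equivalent report with the renamed package
from ydata_profiling import ProfileReport
pfr = ProfileReport(data_train, minimal=True)
pfr.to_file("example.html")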

References

For more on feature binning, see: 特征的分箱 (feature binning)
