天池项目笔记-金融风控-贷款违约预测 Task2

Task02_EDA

1.导入数据 load the data

train_data = pd.read_csv('./train.csv', sep = ',')
test_data = pd.read_csv('./testA.csv', sep = ',')

2.数据浏览 data overview

​ 在开始使用数据前先对其进行大概的浏览,包括数据的行列、缺失值、重复值与统计信息等。

train_data.head()
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnershipannualIncomeverificationStatusissueDateisDefaultpurposepostCoderegionCodedtidelinquency_2yearsficoRangeLowficoRangeHighopenAccpubRecpubRecBankruptciesrevolBalrevolUtiltotalAccinitialListStatusapplicationTypeearliesCreditLinetitlepolicyCoden0n1n2n2.1n4n5n6n7n8n9n10n11n12n13n14
0035000.0519.52917.97EE2320.02 years2110000.022014-07-0111137.03217.050.0730.0734.07.00.00.024178.048.927.000Aug-20011.01.00.02.02.02.04.09.08.04.012.02.07.00.00.00.02.0
1118000.0518.49461.90DD2219843.05 years046000.022012-08-0100156.01827.830.0700.0704.013.00.00.015096.038.918.010May-20021723.01.0NaNNaNNaNNaN10.0NaNNaNNaNNaNNaN13.0NaNNaNNaNNaN
2212000.0516.99298.17DD331698.08 years074000.022015-10-0100337.01422.770.0675.0679.011.00.00.04606.051.827.000May-20060.01.00.00.03.03.00.00.021.04.05.03.011.00.00.00.04.0
3311000.037.26340.96AA446854.010+ years1118000.012015-08-0104148.01117.210.0685.0689.09.00.00.09948.052.628.010May-19994.01.06.04.06.06.04.016.04.07.021.06.09.00.00.00.01.0
443000.0312.99101.07CC254.0NaN129000.022016-03-01010301.02132.160.0690.0694.012.00.00.02942.032.027.000Aug-197711.01.01.02.07.07.02.04.09.010.015.07.012.00.00.00.04.0
train_data.shape
test_data.shape

(200000, 48)

train_data.columns

Index([‘id’, ‘loanAmnt’, ‘term’, ‘interestRate’, ‘installment’, ‘grade’,
‘subGrade’, ‘employmentTitle’, ‘employmentLength’, ‘homeOwnership’,
‘annualIncome’, ‘verificationStatus’, ‘issueDate’, ‘isDefault’,
‘purpose’, ‘postCode’, ‘regionCode’, ‘dti’, ‘delinquency_2years’,
‘ficoRangeLow’, ‘ficoRangeHigh’, ‘openAcc’, ‘pubRec’,
‘pubRecBankruptcies’, ‘revolBal’, ‘revolUtil’, ‘totalAcc’,
‘initialListStatus’, ‘applicationType’, ‘earliesCreditLine’, ‘title’,
‘policyCode’, ‘n0’, ‘n1’, ‘n2’, ‘n2.1’, ‘n4’, ‘n5’, ‘n6’, ‘n7’, ‘n8’,
‘n9’, ‘n10’, ‘n11’, ‘n12’, ‘n13’, ‘n14’],
dtype=‘object’)

train_data.info()

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):

‘#’ Column Non-Null Count Dtype


0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n2.1 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

train_data['employmentTitle'].value_counts()

54.0 51149
38.0 12644
32.0 11543
184.0 6112
151.0 5193

311225.0 1
13438.0 1
311224.0 1
311223.0 1
269398.0 1
Name: employmentTitle, Length: 248683, dtype: int64

​ 对该特征来说,职业头衔被脱敏使用某种表征方法通过计数来保存了,也许使用了NLP的技术,如果是这样的话,可以考虑使用数据分箱的方法将其转换为类别型数据。

4.数据分布 data distribution

#首先初步划分数值型和类别变量
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14']      
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment', 
        'employmentTitle',  'annualIncome',  'issueDate', 
        'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
categorical_fea = ['grade','subGrade','employmentLength','homeOwnership','verificationStatus','purpose']
sample = train_data.sample(1000)
# 查看Numerical features的分布情况,离散还是连续
for feas in num_fea:
  plt.figure(figsize = (6,4))
  plt.title(feas)
  plt.scatter(sample['id'], sample[feas], s = 8)

​ 这里取1000个数据样本,观察初步划分的数值型变量中的数据分布,考虑是否进一步对特征种类进行划分,在此处展示部分特征数据的分布图。
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

#过滤数值型类别特征
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train_data,num_fea)

#每个数字特征得分布可视化
f = pd.melt(train_data, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

​ 此处根据每个数值类型特征的分布,将类别数大于10类的特征保留为数值类型,类别数小于10的保留为类别特征。
在这里插入图片描述

# 变量分布可视化
plt.figure(figsize=(8, 8))
sns.barplot(train_data["employmentLength"].value_counts(dropna=False)[:20],
            train_data["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()

在这里插入图片描述

# 正负样本比例
total = len(train_data)
total_amt = train_data.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(6,5))
plot_tr = sns.countplot(x='isDefault',data=train_data)#train data‘isDefault’这个特征每种类别的数量**
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 

在这里插入图片描述
​ 由于金融欺诈数据中,实际欺诈行为即正样本的数量应该会很少,存在样本不平衡的情况,这是应该在后续工作中克服的。

5.缺失值 missing values

​ 统计缺失值:

train_data.isnull().sum()

id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n2.1 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64

​ 对缺失值进行可视化:

sample = train_data.sample(1000)
msno.matrix(sample)
plt.show()
msno.bar(sample)
plt.show()
msno.heatmap(sample)
plt.show()

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

6.特征热力图

​ 通过热力图可以观察特征之间的相关关系,确定哪些是冗余数据。

plt.figure(figsize = (30,10))
sns.heatmap(train_data.corr(), annot = True, cmap = 'coolwarm')

在这里插入图片描述

​ 在完成对数据的探索性分析之后,下一步应该考虑如何预处理非数值类型的数据并构造新的特征工程。下一步会注意在日期类型数据、类别数据的encoding、数据分箱和特征聚合上的工作。根据新的特征确定新的特征列。

©️2020 CSDN 皮肤主题: 游动-白 设计师:上身试试 返回首页