1.数据分析
查看并初步了解数据,熟悉数据,为后续的特征工程做准备,主要目的如下:
1.EDA价值主要在于熟悉整个数据的基本情况(取值类型、取值类别、取值范围、缺失值、异常值等),对数据集进行分析是否可以进一步进行建模分析;
2.了解各变量间的相互关系、变量与预测值之间的存在关系;
3.为特征工程做准备;
2.主要分析内容
2.1 数据总体了解:
读取数据集并了解数据集大小,原始特征维度;
通过info熟悉数据类型;
粗略查看数据集中各特征基本统计量;
2.2 缺失值和唯一值:
查看数据缺失值情况;
查看唯一值特征情况;
2.3 深入数据-查看数据类型:
类别型数据;
数值型数据;
离散数值型数据
连续数值型数据
2.4 数据间相关关系:
特征和特征之间关系;
特征和目标变量之间关系;
2.5 用pandas_profiling生成数据报告
3.代码实例
3.1 导入相关库
# 导入相关库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
pd.options.display.max_rows = None
%matplotlib inline
pd.set_option('display.max_colwidth', -1)
3.2 读取文件数据
# 读取数据
train_data = pd.read_csv('./RawData/train.csv')
test_data = pd.read_csv('./RawData/testA.csv')
3.3 初步了解
# 查看数据训练数据
train_data.head()
id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | annualIncome | verificationStatus | issueDate | isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | earliesCreditLine | title | policyCode | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | E | E2 | 320.0 | 2 years | 2 | 110000.0 | 2 | 2014-07-01 | 1 | 1 | 137.0 | 32 | 17.05 | 0.0 | 730.0 | 734.0 | 7.0 | 0.0 | 0.0 | 24178.0 | 48.9 | 27.0 | 0 | 0 | Aug-2001 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 | 9.0 | 8.0 | 4.0 | 12.0 | 2.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | D | D2 | 219843.0 | 5 years | 0 | 46000.0 | 2 | 2012-08-01 | 0 | 0 | 156.0 | 18 | 27.83 | 0.0 | 700.0 | 704.0 | 13.0 | 0.0 | 0.0 | 15096.0 | 38.9 | 18.0 | 1 | 0 | May-2002 | 1723.0 | 1.0 | NaN | NaN | NaN | NaN | 10.0 | NaN | NaN | NaN | NaN | NaN | 13.0 | NaN | NaN | NaN | NaN |
2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | D | D3 | 31698.0 | 8 years | 0 | 74000.0 | 2 | 2015-10-01 | 0 | 0 | 337.0 | 14 | 22.77 | 0.0 | 675.0 | 679.0 | 11.0 | 0.0 | 0.0 | 4606.0 | 51.8 | 27.0 | 0 | 0 | May-2006 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 21.0 | 4.0 | 5.0 | 3.0 | 11.0 | 0.0 | 0.0 | 0.0 | 4.0 |
3 | 3 | 11000.0 | 3 | 7.26 | 340.96 | A | A4 | 46854.0 | 10+ years | 1 | 118000.0 | 1 | 2015-08-01 | 0 | 4 | 148.0 | 11 | 17.21 | 0.0 | 685.0 | 689.0 | 9.0 | 0.0 | 0.0 | 9948.0 | 52.6 | 28.0 | 1 | 0 | May-1999 | 4.0 | 1.0 | 6.0 | 4.0 | 6.0 | 6.0 | 4.0 | 16.0 | 4.0 | 7.0 | 21.0 | 6.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 4 | 3000.0 | 3 | 12.99 | 101.07 | C | C2 | 54.0 | NaN | 1 | 29000.0 | 2 | 2016-03-01 | 0 | 10 | 301.0 | 21 | 32.16 | 0.0 | 690.0 | 694.0 | 12.0 | 0.0 | 0.0 | 2942.0 | 32.0 | 27.0 | 0 | 0 | Aug-1977 | 11.0 | 1.0 | 1.0 | 2.0 | 7.0 | 7.0 | 2.0 | 4.0 | 9.0 | 10.0 | 15.0 | 7.0 | 12.0 | 0.0 | 0.0 | 0.0 | 4.0 |
# 查看训练数据字段
train_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
# 测试数据集字段
test_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'purpose',
'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
'earliesCreditLine', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3',
'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
# 数据维度
print(train_data.shape)
print(test_data.shape)
(800000, 47)
(200000, 46)
训练数据集比测试数据集多了一个变量isDefault,即为用户是否违约的标签,其余变量作为输入变量;
数据总量:训练集-800000,测试集-200000
# 变量类型和空值
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id 800000 non-null int64
loanAmnt 800000 non-null float64
term 800000 non-null int64
interestRate 800000 non-null float64
installment 800000 non-null float64
grade 800000 non-null object
subGrade 800000 non-null object
employmentTitle 799999 non-null float64
employmentLength 753201 non-null object
homeOwnership 800000 non-null int64
annualIncome 800000 non-null float64
verificationStatus 800000 non-null int64
issueDate 800000 non-null object
isDefault 800000 non-null int64
purpose 800000 non-null int64
postCode 799999 non-null float64
regionCode 800000 non-null int64
dti 799761 non-null float64
delinquency_2years 800000 non-null float64
ficoRangeLow 800000 non-null float64
ficoRangeHigh 800000 non-null float64
openAcc 800000 non-null float64
pubRec 800000 non-null float64
pubRecBankruptcies 799595 non-null float64
revolBal 800000 non-null float64
revolUtil 799469 non-null float64
totalAcc 800000 non-null float64
initialListStatus 800000 non-null int64
applicationType 800000 non-null int64
earliesCreditLine 800000 non-null object
title 799999 non-null float64
policyCode 800000 non-null float64
n0 759730 non-null float64
n1 759730 non-null float64
n2 759730 non-null float64
n3 759730 non-null float64
n4 766761 non-null float64
n5 759730 non-null float64
n6 759730 non-null float64
n7 759730 non-null float64
n8 759729 non-null float64
n9 759730 non-null float64
n10 766761 non-null float64
n11 730248 non-null float64
n12 759730 non-null float64
n13 759730 non-null float64
n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
训练数据集数据总量为800000,某些属性的数据为空
employmentTitle(就业职称):float64,1个空值;
employmentLength(就业年限(年)):object,非数值型<字符串>,多个空值;
postCode(借款人在贷款申请中提供的邮政编码的前3位数字):float64,1个空值;
dti(债务收入比):float64,多个空值;
pubRecBankruptcies(公开记录清除的数量):float64,多个空值;
revolUtil(循环额度利用率,或借款人使用的相对于所有可用循环信贷的信贷金额):float64,多个空值;
title(借款人提供的贷款名称):float64,1个空值;
n0-n14(计数特征):float64,均有大量空值;
同数据中也有一些非数值型数据:
grade(贷款等级)、subGrade(贷款等级之子级)、employmentLength(就业年限(年))、
issueDate(贷款发放的月份)、earliesCreditLine(借款人最早报告的信用额度开立的月份)
# 查看具体的空值变量及其数量
train_data.isnull().sum()
id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n3 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64
大致查看数据中各变量的统计量
train_data.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | title | policyCode | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 799761.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799595.000000 | 8.000000e+05 | 799469.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.0 | 759730.000000 | 759730.000000 | 759730.000000 | 759730.000000 | 766761.000000 | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | 1.745982 | 258.535648 | 16.385758 | 18.284557 | 0.318239 | 696.204081 | 700.204226 | 11.598020 | 0.214915 | 0.134163 | 1.622871e+04 | 51.790734 | 24.998861 | 0.416953 | 0.019267 | 1754.113589 | 1.0 | 0.511932 | 3.642330 | 5.642648 | 5.642648 | 4.735641 | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
std | 230940.252013 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | 2.367453 | 200.037446 | 11.036679 | 11.150155 | 0.880325 | 31.865995 | 31.866674 | 5.475286 | 0.606467 | 0.377471 | 2.245802e+04 | 24.516126 | 11.999201 | 0.493055 | 0.137464 | 7941.474040 | 0.0 | 1.333266 | 2.246825 | 3.302810 | 3.302810 | 2.949969 | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 630.000000 | 634.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |