task2:探索性数据分析(EDA)

目的

  1. EDA价值主要在于熟悉了解整个数据集的基本情况(缺失值,异常值),对数据集进行验证是否可以进行接 下来的机器学习或者深度学习建模.
  2. 了解变量间的相互关系、变量与预测值之间的存在关系。
  3. 为特征工程做准备
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_data=pd.read_csv('./数据/train.csv')
test_data=pd.read_csv('./数据/testA.csv')
import os
os.getcwd()
'C:\\Users\\gnzha\\工作\\工作\\bonc\\python\\学习资料\\202009datawhale资料'
train_data=pd.read_csv('./数据/train.csv')

1.1 数据理解

type(train_data)
pandas.core.frame.DataFrame
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n2.1                759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 48 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  200000 non-null  int64  
 1   loanAmnt            200000 non-null  float64
 2   term                200000 non-null  int64  
 3   interestRate        200000 non-null  float64
 4   installment         200000 non-null  float64
 5   grade               200000 non-null  object 
 6   subGrade            200000 non-null  object 
 7   employmentTitle     200000 non-null  float64
 8   employmentLength    188258 non-null  object 
 9   homeOwnership       200000 non-null  int64  
 10  annualIncome        200000 non-null  float64
 11  verificationStatus  200000 non-null  int64  
 12  issueDate           200000 non-null  object 
 13  purpose             200000 non-null  int64  
 14  postCode            200000 non-null  float64
 15  regionCode          200000 non-null  int64  
 16  dti                 199939 non-null  float64
 17  delinquency_2years  200000 non-null  float64
 18  ficoRangeLow        200000 non-null  float64
 19  ficoRangeHigh       200000 non-null  float64
 20  openAcc             200000 non-null  float64
 21  pubRec              200000 non-null  float64
 22  pubRecBankruptcies  199884 non-null  float64
 23  revolBal            200000 non-null  float64
 24  revolUtil           199873 non-null  float64
 25  totalAcc            200000 non-null  float64
 26  initialListStatus   200000 non-null  int64  
 27  applicationType     200000 non-null  int64  
 28  earliesCreditLine   200000 non-null  object 
 29  title               200000 non-null  float64
 30  policyCode          200000 non-null  float64
 31  n0                  189889 non-null  float64
 32  n1                  189889 non-null  float64
 33  n2                  189889 non-null  float64
 34  n2.1                189889 non-null  float64
 35  n2.2                189889 non-null  float64
 36  n2.3                189889 non-null  float64
 37  n4                  191606 non-null  float64
 38  n5                  189889 non-null  float64
 39  n6                  189889 non-null  float64
 40  n7                  189889 non-null  float64
 41  n8                  189889 non-null  float64
 42  n9                  189889 non-null  float64
 43  n10                 191606 non-null  float64
 44  n11                 182425 non-null  float64
 45  n12                 189889 non-null  float64
 46  n13                 189889 non-null  float64
 47  n14                 189889 non-null  float64
dtypes: float64(35), int64(8), object(5)
memory usage: 73.2+ MB
train_data.shape
(800000, 47)
test_data.shape
(200000, 48)
train_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
test_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'purpose',
       'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
       'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
       'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
       'earliesCreditLine', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n2.1',
       'n2.2', 'n2.3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12',
       'n13', 'n14'],
      dtype='object')
train_data.index
RangeIndex(start=0, stop=800000, step=1)
test_data.index
RangeIndex(start=0, stop=200000, step=1)
train_data.values
array([[0, 35000.0, 5, ..., 0.0, 0.0, 2.0],
       [1, 18000.0, 5, ..., nan, nan, nan],
       [2, 12000.0, 5, ..., 0.0, 0.0, 4.0],
       ...,
       [799997, 6000.0, 3, ..., 0.0, 1.0, 4.0],
       [799998, 19200.0, 3, ..., 0.0, 0.0, 5.0],
       [799999, 9000.0, 3, ..., 0.0, 0.0, 2.0]], dtype=object)
train_data.head()
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnership...n5n6n7n8n9n10n11n12n13n14
0035000.0519.52917.97EE2320.02 years2...9.08.04.012.02.07.00.00.00.02.0
1118000.0518.49461.90DD2219843.05 years0...NaNNaNNaNNaNNaN13.0NaNNaNNaNNaN
2212000.0516.99298.17DD331698.08 years0...0.021.04.05.03.011.00.00.00.04.0
3311000.037.26340.96AA446854.010+ years1...16.04.07.021.06.09.00.00.00.01.0
443000.0312.99101.07CC254.0NaN1...4.09.010.015.07.012.00.00.00.04.0

5 rows × 47 columns

train_data.dtypes
id                      int64
loanAmnt              float64
term                    int64
interestRate          float64
installment           float64
grade                  object
subGrade               object
employmentTitle       float64
employmentLength       object
homeOwnership           int64
annualIncome          float64
verificationStatus      int64
issueDate              object
isDefault               int64
purpose                 int64
postCode              float64
regionCode              int64
dti                   float64
delinquency_2years    float64
ficoRangeLow          float64
ficoRangeHigh         float64
openAcc               float64
pubRec                float64
pubRecBankruptcies    float64
revolBal              float64
revolUtil             float64
totalAcc              float64
initialListStatus       int64
applicationType         int64
earliesCreditLine      object
title                 float64
policyCode            float64
n0                    float64
n1                    float64
n2                    float64
n2.1                  float64
n4                    float64
n5                    float64
n6                    float64
n7                    float64
n8                    float64
n9                    float64
n10                   float64
n11                   float64
n12                   float64
n13                   float64
n14                   float64
dtype: object
train_data.describe()
idloanAmntterminterestRateinstallmentemploymentTitlehomeOwnershipannualIncomeverificationStatusisDefault...n5n6n7n8n9n10n11n12n13n14
count800000.000000800000.000000800000.000000800000.000000800000.000000799999.000000800000.0000008.000000e+05800000.000000800000.000000...759730.000000759730.000000759730.000000759729.000000759730.000000766761.000000730248.000000759730.000000759730.000000759730.000000
mean399999.50000014416.8188753.48274513.238391437.94772372005.3517140.6142137.613391e+041.0096830.199513...8.1079378.5759948.28295314.6224885.59234511.6438960.0008150.0033840.0893662.178606
std230940.2520158716.0861780.8558324.765757261.460393106585.6402040.6757496.894751e+040.7827160.399634...4.7992107.4005364.5616898.1246103.2161845.4841040.0300750.0620410.5090691.844377
min0.000000500.0000003.0000005.31000015.6900000.0000000.0000000.000000e+000.0000000.000000...0.0000000.0000000.0000001.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%199999.7500008000.0000003.0000009.750000248.450000427.0000000.0000004.560000e+040.0000000.000000...5.0000004.0000005.0000009.0000003.0000008.0000000.0000000.0000000.0000001.000000
50%399999.50000012000.0000003.00000012.740000375.1350007755.0000001.0000006.500000e+041.0000000.000000...7.0000007.0000007.00000013.0000005.00000011.0000000.0000000.0000000.0000002.000000
75%599999.25000020000.0000003.00000015.990000580.710000117663.5000001.0000009.000000e+042.0000000.000000...11.00000011.00000010.00000019.0000007.00000014.0000000.0000000.0000000.0000003.000000
max799999.00000040000.0000005.00000030.9900001715.420000378351.0000005.0000001.099920e+072.0000001.000000...70.000000132.00000079.000000128.00000045.00000082.0000004.0000004.00000039.00000030.000000

8 rows × 42 columns

train_data.head(3).append(train_data.tail(3))
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnership...n5n6n7n8n9n10n11n12n13n14
0035000.0519.52917.97EE2320.02 years2...9.08.04.012.02.07.00.00.00.02.0
1118000.0518.49461.90DD2219843.05 years0...NaNNaNNaNNaNNaN13.0NaNNaNNaNNaN
2212000.0516.99298.17DD331698.08 years0...0.021.04.05.03.011.00.00.00.04.0
7999977999976000.0313.33203.12CC32582.010+ years1...4.026.04.010.04.05.00.00.01.04.0
79999879999819200.036.92592.14AA4151.010+ years0...10.06.012.022.08.016.00.00.00.05.0
7999997999999000.0311.06294.91BB313.05 years0...3.04.04.08.03.07.00.00.00.02.0

6 rows × 47 columns

1.2 查看数据中缺失值、唯一值等

train_data.isnull()
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnership...n5n6n7n8n9n10n11n12n13n14
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...TrueTrueTrueTrueTrueFalseTrueTrueTrueTrue
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseTrueFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
..................................................................
799995FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
799996FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
799997FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
799998FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
799999FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse

800000 rows × 47 columns

train_data.isnull().sum()
id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n2.1                  40270
n4                    33239
n5                    40270
n6                    40270
n7                    40270
n8                    40271
n9                    40270
n10                   33239
n11                   69752
n12                   40270
n13                   40270
n14                   40270
dtype: int64
type(train_data.isnull().sum())
pandas.core.series.Series
train_data.isnull().sum().dtype
dtype('int64')
train_data.isnull().sum().shape
(47,)
train_data.isnull().sum().index
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
train_data.isnull().any()#any()函数用于判断给定的迭代参数iterable是否全部为false,是则返回false,如果有一个为true,则返回true
id                    False
loanAmnt              False
term                  False
interestRate          False
installment           False
grade                 False
subGrade              False
employmentTitle        True
employmentLength       True
homeOwnership         False
annualIncome          False
verificationStatus    False
issueDate             False
isDefault             False
purpose               False
postCode               True
regionCode            False
dti                    True
delinquency_2years    False
ficoRangeLow          False
ficoRangeHigh         False
openAcc               False
pubRec                False
pubRecBankruptcies     True
revolBal              False
revolUtil              True
totalAcc              False
initialListStatus     False
applicationType       False
earliesCreditLine     False
title                  True
policyCode            False
n0                     True
n1                     True
n2                     True
n2.1                   True
n4                     True
n5                     True
n6                     True
n7                     True
n8                     True
n9                     True
n10                    True
n11                    True
n12                    True
n13                    True
n14                    True
dtype: bool
train_data.isnull().any().sum()
22
train_data.isnull().any().values
array([False, False, False, False, False, False, False,  True,  True,
       False, False, False, False, False, False,  True, False,  True,
       False, False, False, False, False,  True, False,  True, False,
       False, False, False,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])
len(train_data)
800000
train_data.isnull().sum()/len(train_data)
id                    0.000000
loanAmnt              0.000000
term                  0.000000
interestRate          0.000000
installment           0.000000
grade                 0.000000
subGrade              0.000000
employmentTitle       0.000001
employmentLength      0.058499
homeOwnership         0.000000
annualIncome          0.000000
verificationStatus    0.000000
issueDate             0.000000
isDefault             0.000000
purpose               0.000000
postCode              0.000001
regionCode            0.000000
dti                   0.000299
delinquency_2years    0.000000
ficoRangeLow          0.000000
ficoRangeHigh         0.000000
openAcc               0.000000
pubRec                0.000000
pubRecBankruptcies    0.000506
revolBal              0.000000
revolUtil             0.000664
totalAcc              0.000000
initialListStatus     0.000000
applicationType       0.000000
earliesCreditLine     0.000000
title                 0.000001
policyCode            0.000000
n0                    0.050338
n1                    0.050338
n2                    0.050338
n2.1                  0.050338
n4                    0.041549
n5                    0.050338
n6                    0.050338
n7                    0.050338
n8                    0.050339
n9                    0.050338
n10                   0.041549
n11                   0.087190
n12                   0.050338
n13                   0.050338
n14                   0.050338
dtype: float64
(train_data.isnull().sum()/len(train_data)).to_dict()
{'id': 0.0,
 'loanAmnt': 0.0,
 'term': 0.0,
 'interestRate': 0.0,
 'installment': 0.0,
 'grade': 0.0,
 'subGrade': 0.0,
 'employmentTitle': 1.25e-06,
 'employmentLength': 0.05849875,
 'homeOwnership': 0.0,
 'annualIncome': 0.0,
 'verificationStatus': 0.0,
 'issueDate': 0.0,
 'isDefault': 0.0,
 'purpose': 0.0,
 'postCode': 1.25e-06,
 'regionCode': 0.0,
 'dti': 0.00029875,
 'delinquency_2years': 0.0,
 'ficoRangeLow': 0.0,
 'ficoRangeHigh': 0.0,
 'openAcc': 0.0,
 'pubRec': 0.0,
 'pubRecBankruptcies': 0.00050625,
 'revolBal': 0.0,
 'revolUtil': 0.00066375,
 'totalAcc': 0.0,
 'initialListStatus': 0.0,
 'applicationType': 0.0,
 'earliesCreditLine': 0.0,
 'title': 1.25e-06,
 'policyCode': 0.0,
 'n0': 0.0503375,
 'n1': 0.0503375,
 'n2': 0.0503375,
 'n2.1': 0.0503375,
 'n4': 0.04154875,
 'n5': 0.0503375,
 'n6': 0.0503375,
 'n7': 0.0503375,
 'n8': 0.05033875,
 'n9': 0.0503375,
 'n10': 0.04154875,
 'n11': 0.08719,
 'n12': 0.0503375,
 'n13': 0.0503375,
 'n14': 0.0503375}
#查看缺失值大于50%的特征
ratio=(train_data.isnull().sum()/len(train_data)).to_dict()
more_than_half={}
for key,value in ratio.items():
    if value>0.5:
        more_than_half[key]=value
print(more_than_half)
{}
missing=train_data.isnull().sum()/len(train_data)
missing=missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x12b0f98f6c8>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-DXqhifi6-1600443888491)(output_35_1.png)]

one_value_feature=[col for col in train_data.columns if train_data[col].nunique()<=1]
one_value_feature
['policyCode']
train_data['n8'].unique()
array([ 12.,  nan,   5.,  21.,  15.,   3.,  18.,  17.,  39.,  16.,  11.,
         9.,  32.,  14.,   7.,  22.,  27.,  19.,  10.,  31.,  37.,   8.,
        29.,  13.,   2.,   6.,  23.,  25.,  20.,  24.,  26.,  28.,  36.,
         4.,  30.,  47.,  42.,  44.,  35.,  33.,  57.,  34.,  41.,  38.,
        45.,  43.,  40.,  51.,  61.,  55.,  49.,  46.,  52.,  50.,  54.,
        81.,  67.,  74.,  62.,  48.,  56.,  69.,  72.,  66.,  58.,  65.,
        53.,  78.,  60.,  59.,   1.,  70.,  92.,  77.,  71.,  63.,  68.,
        64.,  75., 127., 103.,  88.,  86.,  79.,  73.,  90.,  76.,  87.,
       107., 105.,  82.,  80., 128.,  85.,  84., 101.,  89.,  96., 104.,
        93., 102.,  91.,  83.])
train_data['n8'].nunique()
102
print(f'There are {len(one_value_feature)} columns in train dataset with one unique value.')
There are 1 columns in train dataset with one unique value.

1.3 查看数据类型

  1. 特征一般都是由类别型特征和数值型特征组成
  2. 类别型特征有时具有非数值关系,有时也具有数值关系。比如‘grade’中的等级A,B,C等,是否只是单纯的 分类,还是A优于其他要结合业务判断。
  3. 数值型特征本是可以直接入模的,但往往风控人员要对其做分箱,转化为WOE编码进而做标准评分卡等操作。从模型效果上来看,特征分箱主要是为了降低变量的复杂性,减少变量噪音对模型的影响,提高自变量和因变量的相关度。从而使模型更加稳定。
train_data.dtypes
id                      int64
loanAmnt              float64
term                    int64
interestRate          float64
installment           float64
grade                  object
subGrade               object
employmentTitle       float64
employmentLength       object
homeOwnership           int64
annualIncome          float64
verificationStatus      int64
issueDate              object
isDefault               int64
purpose                 int64
postCode              float64
regionCode              int64
dti                   float64
delinquency_2years    float64
ficoRangeLow          float64
ficoRangeHigh         float64
openAcc               float64
pubRec                float64
pubRecBankruptcies    float64
revolBal              float64
revolUtil             float64
totalAcc              float64
initialListStatus       int64
applicationType         int64
earliesCreditLine      object
title                 float64
policyCode            float64
n0                    float64
n1                    float64
n2                    float64
n2.1                  float64
n4                    float64
n5                    float64
n6                    float64
n7                    float64
n8                    float64
n9                    float64
n10                   float64
n11                   float64
n12                   float64
n13                   float64
n14                   float64
dtype: object
train_data.head()
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnership...n5n6n7n8n9n10n11n12n13n14
0035000.0519.52917.97EE2320.02 years2...9.08.04.012.02.07.00.00.00.02.0
1118000.0518.49461.90DD2219843.05 years0...NaNNaNNaNNaNNaN13.0NaNNaNNaNNaN
2212000.0516.99298.17DD331698.08 years0...0.021.04.05.03.011.00.00.00.04.0
3311000.037.26340.96AA446854.010+ years1...16.04.07.021.06.09.00.00.00.01.0
443000.0312.99101.07CC254.0NaN1...4.09.010.015.07.012.00.00.00.04.0

5 rows × 47 columns

train_data.select_dtypes(exclude='object')
idloanAmntterminterestRateinstallmentemploymentTitlehomeOwnershipannualIncomeverificationStatusisDefault...n5n6n7n8n9n10n11n12n13n14
0035000.0519.52917.97320.02110000.021...9.08.04.012.02.07.00.00.00.02.0
1118000.0518.49461.90219843.0046000.020...NaNNaNNaNNaNNaN13.0NaNNaNNaNNaN
2212000.0516.99298.1731698.0074000.020...0.021.04.05.03.011.00.00.00.04.0
3311000.037.26340.9646854.01118000.010...16.04.07.021.06.09.00.00.00.01.0
443000.0312.99101.0754.0129000.020...4.09.010.015.07.012.00.00.00.04.0
..................................................................
79999579999525000.0314.49860.412659.0172000.000...6.02.012.013.010.014.00.00.00.03.0
79999679999617000.037.90531.9429205.0099000.020...15.016.02.019.02.07.00.00.00.00.0
7999977999976000.0313.33203.122582.0165000.021...4.026.04.010.04.05.00.00.01.04.0
79999879999819200.036.92592.14151.0096000.020...10.06.012.022.08.016.00.00.00.05.0
7999997999999000.0311.06294.9113.00120000.000...3.04.04.08.03.07.00.00.00.02.0

800000 rows × 42 columns

numerical_feature=list(train_data.select_dtypes(exclude='object').columns)
category_feature=list(filter(lambda x:x not in numerical_feature,list(train_data.columns)))
numerical_feature
['id',
 'loanAmnt',
 'term',
 'interestRate',
 'installment',
 'employmentTitle',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'isDefault',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'title',
 'policyCode',
 'n0',
 'n1',
 'n2',
 'n2.1',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14']
category_feature
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
train_data['grade']
0         E
1         D
2         D
3         A
4         C
         ..
799995    C
799996    A
799997    C
799998    A
799999    B
Name: grade, Length: 800000, dtype: object
train_data.grade
0         E
1         D
2         D
3         A
4         C
         ..
799995    C
799996    A
799997    C
799998    A
799999    B
Name: grade, Length: 800000, dtype: object
#划分数值型变量中的连续变量和分类变量
def get_numerical_serial_fea(data,feas):
    numerical_serial_feas=[]
    numerical_nserial_feas=[]
    for fea in feas:
        if data[fea].nunique()<=10:
            numerical_nserial_feas.append(fea)
        else:
            numerical_serial_feas.append(fea)
    return numerical_serial_feas,numerical_nserial_feas
x1,x2=get_numerical_serial_fea(train_data,numerical_feature)
x1# 数值连续型变量
['id',
 'loanAmnt',
 'interestRate',
 'installment',
 'employmentTitle',
 'annualIncome',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'title',
 'n0',
 'n1',
 'n2',
 'n2.1',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n13',
 'n14']
x2#数值类别型变量
['term',
 'homeOwnership',
 'verificationStatus',
 'isDefault',
 'initialListStatus',
 'applicationType',
 'policyCode',
 'n11',
 'n12']
train_data.term.value_counts()
3    606902
5    193098
Name: term, dtype: int64
train_data['homeOwnership'].value_counts()
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64
train_data['loanAmnt'].value_counts()
10000.0    59068
12000.0    43291
20000.0    41885
15000.0    41629
35000.0    30371
           ...  
35975.0        1
35925.0        1
35850.0        1
35775.0        1
725.0          1
Name: loanAmnt, Length: 1540, dtype: int64
#每个数字特征得分布可视化 
f = pd.melt(train_data, value_vars=x1) 
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False) 
g = g.map(sns.distplot, "value")
print(f)
  1. 查看某一个数值型变量的分布,查看变量是否符合正态分布,如果不符合正太分布的变量可以log化后再观察下是否符合正态分布。

  2. 如果想统一处理一批数据变标准化 必须把这些之前已经正态化的数据提出

plt.figure(figsize=(16,12))
plt.subplot(221)
sns.distplot(train_data['loanAmnt'])
plt.subplot(222)
sns.distplot(np.log(train_data['loanAmnt']))
<matplotlib.axes._subplots.AxesSubplot at 0x12b2c369e48>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-GoDWSAVY-1600443888498)(output_59_1.png)]

# 非数值类别型变量分析
category_feature
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
train_data.grade.value_counts()
B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64

1.4 变量分布可视化

plt.figure(figsize=(8, 8)) 
sns.barplot(train_data["employmentLength"].value_counts(dropna=False)[:20],
            train_data["employmentLength"].value_counts(dropna=False).keys()[:20]) 
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Ookc32IA-1600443888500)(output_63_0.png)]

import datetime
#转化成时间格式 
train_data['issueDate'] = pd.to_datetime(train_data['issueDate'],format='%Y-%m-%d') 
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d') 
train_data['issueDateDT'] = train_data['issueDate'].apply(lambda x: x-startdate).dt.days
plt.hist(train_data['issueDateDT'], label='train')
(array([   698.,   2104.,   7139.,  15255.,  43725., 105400., 191202.,
        256182., 131659.,  46636.]),
 array([   0. ,  420.1,  840.2, 1260.3, 1680.4, 2100.5, 2520.6, 2940.7,
        3360.8, 3780.9, 4201. ]),
 <a list of 10 Patch objects>)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-BGR76FMc-1600443888502)(output_65_1.png)]

#透视图 索引可以有多个,“columns(列)”是可选的,聚合函数aggfunc后是被应用到了变量“values”中你所列举的项目上。 
pivot = pd.pivot_table(train_data, index=['grade'], columns=['issueDateDT'], values= ['loanAmnt'], aggfunc=np.sum)
pivot
loanAmnt
issueDateDT0306192122153183214245274...3926395739874018404840794110414041714201
grade
ANaN53650.042000.019500.034425.063950.043500.0168825.085600.0101825.0...13093850.011757325.011945975.09144000.07977650.06888900.05109800.03919275.02694025.02245625.0
BNaN13000.024000.032125.07025.095750.0164300.0303175.0434425.0538450.0...16863100.017275175.016217500.011431350.08967750.07572725.04884600.04329400.03922575.03257100.0
CNaN68750.08175.010000.061800.052550.0175375.0151100.0243725.0393150.0...17502375.017471500.016111225.011973675.010184450.07765000.05354450.04552600.02870050.02246250.0
DNaNNaN5500.02850.028625.0NaN167975.0171325.0192900.0269325.0...11403075.010964150.010747675.07082050.07189625.05195700.03455175.03038500.02452375.01771750.0
E7500.0NaN10000.0NaN17975.01500.094375.0116450.042000.0139775.0...3983050.03410125.03107150.02341825.02225675.01643675.01091025.01131625.0883950.0802425.0
FNaNNaN31250.02125.0NaNNaNNaN49000.027000.043000.0...1074175.0868925.0761675.0685325.0665750.0685200.0316700.0315075.072300.0NaN
GNaNNaNNaNNaNNaNNaNNaN24625.0NaNNaN...56100.0243275.0224825.064050.0198575.0245825.053125.023750.025100.01000.0

7 rows × 139 columns

©️2020 CSDN 皮肤主题: 大白 设计师:CSDN官方博客 返回首页