风控-数据分析

最新推荐文章于 2024-08-03 18:07:58 发布

学机器学习的机械工程师

最新推荐文章于 2024-08-03 18:07:58 发布

阅读量1.7k

点赞数

分类专栏：机器学习实战文章标签：数据分析

本文链接：https://blog.csdn.net/abc360605383/article/details/108669276

版权

机器学习实战专栏收录该内容

5 篇文章 0 订阅

订阅专栏

数据总体了解：

读取数据集并了解数据集大小，原始特征维度；

通过info熟悉数据类型；

粗略查看数据集中各特征基本统计量；

缺失值和唯一值：

查看数据缺失值情况

查看唯一值特征情况

深入数据-查看数据类型

类别型数据

数值型数据

离散数值型数据

连续数值型数据

数据间相关关系

特征和特征之间关系

特征和目标变量之间关系

用pandas_profiling生成数据报告

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')

1.读取文件

#读取文件
data_train = pd.read_csv("./train.csv")
data_test_a = pd.read_csv('./testA.csv')

2.整体信息

#查看数据集样本个数和原始特征维度
data_test_a.shape

(200000, 48)

data_train.shape

(800000, 47)

data_train.columns

Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id                    800000 non-null int64
loanAmnt              800000 non-null float64
term                  800000 non-null int64
interestRate          800000 non-null float64
installment           800000 non-null float64
grade                 800000 non-null object
subGrade              800000 non-null object
employmentTitle       799999 non-null float64
employmentLength      753201 non-null object
homeOwnership         800000 non-null int64
annualIncome          800000 non-null float64
verificationStatus    800000 non-null int64
issueDate             800000 non-null object
isDefault             800000 non-null int64
purpose               800000 non-null int64
postCode              799999 non-null float64
regionCode            800000 non-null int64
dti                   799761 non-null float64
delinquency_2years    800000 non-null float64
ficoRangeLow          800000 non-null float64
ficoRangeHigh         800000 non-null float64
openAcc               800000 non-null float64
pubRec                800000 non-null float64
pubRecBankruptcies    799595 non-null float64
revolBal              800000 non-null float64
revolUtil             799469 non-null float64
totalAcc              800000 non-null float64
initialListStatus     800000 non-null int64
applicationType       800000 non-null int64
earliesCreditLine     800000 non-null object
title                 799999 non-null float64
policyCode            800000 non-null float64
n0                    759730 non-null float64
n1                    759730 non-null float64
n2                    759730 non-null float64
n2.1                  759730 non-null float64
n4                    766761 non-null float64
n5                    759730 non-null float64
n6                    759730 non-null float64
n7                    759730 non-null float64
n8                    759729 non-null float64
n9                    759730 non-null float64
n10                   766761 non-null float64
n11                   730248 non-null float64
n12                   759730 non-null float64
n13                   759730 non-null float64
n14                   759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

data_train.describe()

	id	loanAmnt	term	interestRate	installment	employmentTitle	homeOwnership	annualIncome	verificationStatus	isDefault	...	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
count	800000.000000	800000.000000	800000.000000	800000.000000	800000.000000	799999.000000	800000.000000	8.000000e+05	800000.000000	800000.000000	...	759730.000000	759730.000000	759730.000000	759729.000000	759730.000000	766761.000000	730248.000000	759730.000000	759730.000000	759730.000000
mean	399999.500000	14416.818875	3.482745	13.238391	437.947723	72005.351714	0.614213	7.613391e+04	1.009683	0.199513	...	8.107937	8.575994	8.282953	14.622488	5.592345	11.643896	0.000815	0.003384	0.089366	2.178606
std	230940.252013	8716.086178	0.855832	4.765757	261.460393	106585.640204	0.675749	6.894751e+04	0.782716	0.399634	...	4.799210	7.400536	4.561689	8.124610	3.216184	5.484104	0.030075	0.062041	0.509069	1.844377
min	0.000000	500.000000	3.000000	5.310000	15.690000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	...	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	199999.750000	8000.000000	3.000000	9.750000	248.450000	427.000000	0.000000	4.560000e+04	0.000000	0.000000	...	5.000000	4.000000	5.000000	9.000000	3.000000	8.000000	0.000000	0.000000	0.000000	1.000000
50%	399999.500000	12000.000000	3.000000	12.740000	375.135000	7755.000000	1.000000	6.500000e+04	1.000000	0.000000	...	7.000000	7.000000	7.000000	13.000000	5.000000	11.000000	0.000000	0.000000	0.000000	2.000000
75%	599999.250000	20000.000000	3.000000	15.990000	580.710000	117663.500000	1.000000	9.000000e+04	2.000000	0.000000	...	11.000000	11.000000	10.000000	19.000000	7.000000	14.000000	0.000000	0.000000	0.000000	3.000000
max	799999.000000	40000.000000	5.000000	30.990000	1715.420000	378351.000000	5.000000	1.099920e+07	2.000000	1.000000	...	70.000000	132.000000	79.000000	128.000000	45.000000	82.000000	4.000000	4.000000	39.000000	30.000000

8 rows × 42 columns

data_train.head(3).append(data_train.tail(3))

	id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	...	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	...	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	...	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	...	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
799997	799997	6000.0	3	13.33	203.12	C	C3	2582.0	10+ years	1	...	4.0	26.0	4.0	10.0	4.0	5.0	0.0	0.0	1.0	4.0
799998	799998	19200.0	3	6.92	592.14	A	A4	151.0	10+ years	0	...	10.0	6.0	12.0	22.0	8.0	16.0	0.0	0.0	0.0	5.0
799999	799999	9000.0	3	11.06	294.91	B	B3	13.0	5 years	0	...	3.0	4.0	4.0	8.0	3.0	7.0	0.0	0.0	0.0	2.0

6 rows × 47 columns

#查看数据集中特征缺失值，唯一值等
print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
#上面得到训练集有22列特征有缺失值，进一步查看缺失特征中缺失率大于50%的特征

There are 22 columns in train dataset with missing values.

have_null_feature_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_feature_dict.items():
    if value>0.5:
        fea_null_moreThanHalf[key] = value
fea_null_moreThanHalf

{}

have_null_feature_dict

{'id': 0.0,
 'loanAmnt': 0.0,
 'term': 0.0,
 'interestRate': 0.0,
 'installment': 0.0,
 'grade': 0.0,
 'subGrade': 0.0,
 'employmentTitle': 1.25e-06,
 'employmentLength': 0.05849875,
 'homeOwnership': 0.0,
 'annualIncome': 0.0,
 'verificationStatus': 0.0,
 'issueDate': 0.0,
 'isDefault': 0.0,
 'purpose': 0.0,
 'postCode': 1.25e-06,
 'regionCode': 0.0,
 'dti': 0.00029875,
 'delinquency_2years': 0.0,
 'ficoRangeLow': 0.0,
 'ficoRangeHigh': 0.0,
 'openAcc': 0.0,
 'pubRec': 0.0,
 'pubRecBankruptcies': 0.00050625,
 'revolBal': 0.0,
 'revolUtil': 0.00066375,
 'totalAcc': 0.0,
 'initialListStatus': 0.0,
 'applicationType': 0.0,
 'earliesCreditLine': 0.0,
 'title': 1.25e-06,
 'policyCode': 0.0,
 'n0': 0.0503375,
 'n1': 0.0503375,
 'n2': 0.0503375,
 'n2.1': 0.0503375,
 'n4': 0.04154875,
 'n5': 0.0503375,
 'n6': 0.0503375,
 'n7': 0.0503375,
 'n8': 0.05033875,
 'n9': 0.0503375,
 'n10': 0.04154875,
 'n11': 0.08719,
 'n12': 0.0503375,
 'n13': 0.0503375,
 'n14': 0.0503375}

#具体的查看缺失特征及缺失率
# nan可视化
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()

<AxesSubplot:>

在这里插入图片描述

纵向了解哪些列存在 “nan”, 并可以把nan的个数打印，主要的目的在于查看某一列nan存在的个数是否真的很大，如果nan存在的过多，说明这一列对label的影响几乎不起作用了，可以考虑删掉。如果缺失值很小一般可以选择填充。
另外可以横向比较，如果在数据集中，某些样本数据的大部分列都是缺失的且样本足够的情况下可以考虑删除。

Tips: 比赛大杀器lgb模型可以自动处理缺失值，Task4模型会具体学习模型了解模型哦！

#查看训练集测试集中特征属性只有一值的特征
one_value_fea = [col for col in data_train.columns if data_train[col].nunique()<=1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea

['policyCode']

one_value_fea_test

['policyCode']

总结：47列数据中有22列都缺少数据，这在现实世界中很正常。‘policyCode’具有一个唯一值（或全部缺失）。有很多连续变量和一些分类变量。

3.数值类型特征，对象类型特征

特征一般都是由类别型特征和数值型特征组成，而数值型特征又分为连续型和离散型。
类别型特征有时具有非数值关系，有时也具有数值关系。比如‘grade’中的等级A，B，C等，是否只是单纯的分类，还是A优于其他要结合业务判断
数值型特征本是可以直接入模的，但往往风控人员要对其做分箱，转化为WOE编码进而做标准评分卡等操作。从模型效果上来看，特征分箱主要是为了降低变量的复杂性，减少变量噪音对模型的影响，提高自变量和因变量的相关度。从而使模型更加稳定。

num_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x : x not in num_fea,list(data_train.columns)))

len(num_fea)

category_fea

['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

data_train.grade

0         E
1         D
2         D
3         A
4         C
5         A
6         A
7         C
8         C
9         B
10        B
11        E
12        D
13        B
14        A
15        B
16        D
17        B
18        E
19        E
20        C
21        C
22        D
23        C
24        A
25        A
26        B
27        C
28        A
29        C
         ..
799970    B
799971    A
799972    G
799973    D
799974    B
799975    E
799976    C
799977    C
799978    B
799979    C
799980    C
799981    C
799982    B
799983    B
799984    C
799985    D
799986    C
799987    B
799988    C
799989    D
799990    C
799991    B
799992    C
799993    A
799994    E
799995    C
799996    A
799997    C
799998    A
799999    B
Name: grade, Length: 800000, dtype: object

3.1数值型变量分析，数值型肯定是包括连续型变量和离散型变量的，找出来

划分数值型变量中的连续变量和离散型变量

#过滤数值型类别特征
def get_num_serialFea(data,feas):
    num_seralFea = []
    num_noseralFea =[]
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            num_noseralFea.append(fea)
            continue
        num_seralFea.append(fea)
    return num_seralFea,num_noseralFea
num_seralFea,num_noseralFea = get_num_serialFea(data_train,num_fea)

num_seralFea

['id',
 'loanAmnt',
 'interestRate',
 'installment',
 'employmentTitle',
 'annualIncome',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'title',
 'n0',
 'n1',
 'n2',
 'n2.1',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n13',
 'n14']

num_noseralFea

['term',
 'homeOwnership',
 'verificationStatus',
 'isDefault',
 'initialListStatus',
 'applicationType',
 'policyCode',
 'n11',
 'n12']

数值离散型变量分析

data_train['term'].value_counts()#离散型变量

3    606902
5    193098
Name: term, dtype: int64

data_train['homeOwnership'].value_counts()#离散型变量

0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64

data_train['verificationStatus'].value_counts()#离散型变量

1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64

data_train['initialListStatus'].value_counts()#离散型变量

0    466438
1    333562
Name: initialListStatus, dtype: int64

data_train['applicationType'].value_counts()#离散型变量

0    784586
1     15414
Name: applicationType, dtype: int64

data_train['policyCode'].value_counts()#离散型变量，无用，全部一个值

1.0    800000
Name: policyCode, dtype: int64

data_train['n11'].value_counts()#离散型变量，相差悬殊，用不用再分析

0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64

data_train['n12'].value_counts()#离散型变量，相差悬殊，用不用再分析

0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64

数值连续型变量分析

#每个数字特征得分布可视化
f = pd.melt(data_train,value_vars=num_seralFea)
g = sns.FacetGrid(f,col='variable',col_wrap=4,sharex=False,sharey=False)
g = g.map(sns.distplot,"value")

在这里插入图片描述

查看某一个数值型变量的分布，查看变量是否符合正态分布，如果不符合正太分布的变量可以log化后再观察下是否符合正态分布。
如果想统一处理一批数据变标准化必须把这些之前已经正态化的数据提出
正态化的原因：一些情况下正态非正态可以让模型更快的收敛，一些模型要求数据正态（eg. GMM、KNN）,保证数据不要过偏态即可，过于偏态可能会影响模型预测结果。

#Ploting Transaction Amount Values Distribution
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)

plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)

Text(0, 0.5, 'Probability')

在这里插入图片描述

非数值类别型变量分析

data_train['grade'].value_counts()

B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64

data_train['subGrade'].value_counts()

C1    50763
B4    49516
B5    48965
B3    48600
C2    47068
C3    44751
C4    44272
B2    44227
B1    42382
C5    40264
A5    38045
A4    30928
D1    30538
D2    26528
A1    25909
D3    23410
A3    22655
A2    22124
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64

data_train['employmentLength'].value_counts()

10+ years    262753
2 years       72358
< 1 year      64237
3 years       64152
1 year        52489
5 years       50102
4 years       47985
6 years       37254
8 years       36192
7 years       35407
9 years       30272
Name: employmentLength, dtype: int64

data_train['issueDate'].value_counts()

2016-03-01    29066
2015-10-01    25525
2015-07-01    24496
2015-12-01    23245
2014-10-01    21461
2016-02-01    20571
2015-11-01    19453
2015-01-01    19254
2015-04-01    18929
2015-08-01    18750
2015-05-01    17119
2016-01-01    16792
2014-07-01    16355
2015-06-01    15236
2015-09-01    14950
2016-04-01    14248
2014-11-01    13793
2015-03-01    13549
2016-08-01    13301
2015-02-01    12881
2016-07-01    12835
2016-06-01    12270
2016-12-01    11562
2016-10-01    11245
2016-11-01    11172
2014-05-01    10886
2014-04-01    10830
2016-05-01    10680
2014-08-01    10648
2016-09-01    10165
              ...  
2010-01-01      355
2009-10-01      305
2009-09-01      270
2009-08-01      231
2009-07-01      223
2009-06-01      191
2009-05-01      190
2009-04-01      166
2009-03-01      162
2009-02-01      160
2009-01-01      145
2008-12-01      134
2008-03-01      130
2008-11-01      113
2008-02-01      105
2008-04-01       92
2008-01-01       91
2008-10-01       62
2007-12-01       55
2008-07-01       52
2008-08-01       38
2008-05-01       38
2008-06-01       33
2007-10-01       26
2007-11-01       24
2007-08-01       23
2007-07-01       21
2008-09-01       19
2007-09-01        7
2007-06-01        1
Name: issueDate, Length: 139, dtype: int64

data_train['earliesCreditLine'].value_counts()

Aug-2001    5567
Sep-2003    5403
Aug-2002    5403
Oct-2001    5258
Aug-2000    5246
Sep-2004    5219
Sep-2002    5170
Aug-2003    5116
Oct-2000    5034
Oct-2002    5034
Oct-2003    4969
Aug-2004    4904
Nov-2000    4798
Sep-2001    4787
Sep-2000    4780
Nov-1999    4773
Oct-1999    4678
Oct-2004    4647
Sep-2005    4608
Jul-2003    4586
Nov-2001    4514
Aug-2005    4494
Jul-2001    4480
Aug-1999    4446
Sep-1999    4441
Dec-2001    4379
Jul-2002    4342
Aug-2006    4283
Mar-2001    4268
May-2001    4223
            ... 
Nov-1958       2
Aug-1950       2
May-1955       2
Jun-1952       2
Feb-1962       2
Jul-1951       2
Jul-1959       2
Mar-1962       1
Nov-1954       1
Oct-2015       1
Aug-1946       1
May-1957       1
Oct-1954       1
Mar-1957       1
May-1960       1
Mar-1958       1
Dec-1951       1
Jan-1946       1
Sep-1957       1
Dec-1960       1
Apr-1958       1
Jun-1958       1
Aug-1955       1
Feb-1960       1
Aug-1958       1
Jan-1944       1
Jul-1955       1
Nov-1953       1
Sep-1953       1
Oct-1957       1
Name: earliesCreditLine, Length: 720, dtype: int64

data_train['isDefault'].value_counts()

0    640390
1    159610
Name: isDefault, dtype: int64

总结：

上面我们用value_counts()等函数看了特征属性的分布，但是图表是概括原始信息最便捷的方式。
数无形时少直觉
同一份数据集，在不同的尺度刻画上显示出来的图形反映的规律是不一样的。python将数据转化成图表，但结论是否正确需要由你保证。

3.2变量分布可视化

单一变量分布可视化

plt.figure(figsize = (8,8))
sns.barplot(data_train['employmentLength'].value_counts(dropna = False)[:20],
           data_train['employmentLength'].value_counts(dropna = False).keys()[:20])
plt.show()

在这里插入图片描述

根绝y值不同可视化x某个特征的分布

首先查看类别型变量在不同y值上的分布

train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 8))

train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()

在这里插入图片描述

其次查看连续型变量在不同y值上的分布

fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
data_train.loc[data_train['isDefault'] == 1]['loanAmnt'].apply(np.log).plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Fraud',
          color='r',
          xlim=(-3, 10),
         ax= ax1)
data_train.loc[data_train['isDefault'] == 0]['loanAmnt'].apply(np.log).plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Not Fraud',
          color='b',
          xlim=(-3, 10),
         ax=ax2)

<AxesSubplot:title={'center':'Log Loan Amt - Not Fraud'}, ylabel='Frequency'>

在这里插入图片描述

total = len(data_train)
total_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(12,5))
plt.subplot(121)##1代表行，2代表列，所以一共有2个图，1代表此时绘制第一个图。
plot_tr = sns.countplot(x='isDefault',data=data_train)#data_train‘isDefault’这个特征每种类别的数量**
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
    
percent_amt = (data_train.groupby(['isDefault'])['loanAmnt'].sum())
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt',  dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt  \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total_amt * 100),
            ha="center", fontsize=15)

在这里插入图片描述

3.3时间格式数据处理及查看

#转化成时间格式  issueDateDT特征表示数据日期离数据集中日期最早的日期（2007-06-01）的天数
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x-startdate).dt.days
#转化成时间格式
data_test_a['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x-startdate).dt.days
plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
#train 和 test issueDateDT 日期有重叠 所以使用基于时间的分割进行验证是不明智的

在这里插入图片描述

3.4掌握透视图可以让我们更好的了解数据

#透视图 索引可以有多个，“columns（列）”是可选的，聚合函数aggfunc最后是被应用到了变量“values”中你所列举的项目上
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
pivot

	loanAmnt
issueDateDT	0	30	61	92	122	153	183	214	245	274	...	3926	3957	3987	4018	4048	4079	4110	4140	4171	4201
grade
A	NaN	53650.0	42000.0	19500.0	34425.0	63950.0	43500.0	168825.0	85600.0	101825.0	...	13093850.0	11757325.0	11945975.0	9144000.0	7977650.0	6888900.0	5109800.0	3919275.0	2694025.0	2245625.0
B	NaN	13000.0	24000.0	32125.0	7025.0	95750.0	164300.0	303175.0	434425.0	538450.0	...	16863100.0	17275175.0	16217500.0	11431350.0	8967750.0	7572725.0	4884600.0	4329400.0	3922575.0	3257100.0
C	NaN	68750.0	8175.0	10000.0	61800.0	52550.0	175375.0	151100.0	243725.0	393150.0	...	17502375.0	17471500.0	16111225.0	11973675.0	10184450.0	7765000.0	5354450.0	4552600.0	2870050.0	2246250.0
D	NaN	NaN	5500.0	2850.0	28625.0	NaN	167975.0	171325.0	192900.0	269325.0	...	11403075.0	10964150.0	10747675.0	7082050.0	7189625.0	5195700.0	3455175.0	3038500.0	2452375.0	1771750.0
E	7500.0	NaN	10000.0	NaN	17975.0	1500.0	94375.0	116450.0	42000.0	139775.0	...	3983050.0	3410125.0	3107150.0	2341825.0	2225675.0	1643675.0	1091025.0	1131625.0	883950.0	802425.0
F	NaN	NaN	31250.0	2125.0	NaN	NaN	NaN	49000.0	27000.0	43000.0	...	1074175.0	868925.0	761675.0	685325.0	665750.0	685200.0	316700.0	315075.0	72300.0	NaN
G	NaN	NaN	NaN	NaN	NaN	NaN	NaN	24625.0	NaN	NaN	...	56100.0	243275.0	224825.0	64050.0	198575.0	245825.0	53125.0	23750.0	25100.0	1000.0

7 rows × 139 columns

3.5用pandas_profiling生成数据报告

import pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")

学机器学习的机械工程师

关注

0
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录