task02 EDA


奔跑吧小男孩


Tags: machine learning, python

Article link: https://blog.csdn.net/tycon21/article/details/108674858


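Main tasks of EDA

The main value of EDA lies in getting familiar with the dataset, understanding it, and validating it to confirm that it can be used for the machine learning or deep learning work that follows.
Once the dataset is understood, the next step is to understand the relationships among the variables and between the variables and the prediction target.
EDA guides data science practitioners through the data processing and feature engineering steps, so that the structure of the dataset and the feature set make the subsequent prediction problem more reliable.
Complete the exploratory analysis of the data, summarize it with charts or text, and check in.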

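# Import the required libraries
import pandas as pd  # the most commonly used library; as natural as opening an Excel file
import numpy as np  # the classic numerical computing library
import matplotlib.pyplot as plt  # the classic plotting library
import seaborn as sn  # another plotting library, built on top of matplotlib and more convenient
import datetime as dt
import warnings
warnings.filterwarnings('ignore')  # installs a filter that suppresses all warnings; a handy trick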

# Load the data files
data_train = pd.read_csv(r'D:\datawhale\train.csv')
data_test_a = pd.read_csv(r'D:\datawhale\testA.csv')

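Overall understanding of the data:

Read the dataset and check its size and the original feature dimensions;
Use info() to get familiar with the data types;
Take a rough look at the basic statistics of each feature;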

# Check the dataset sizes
print(data_train.shape)
print(data_test_a.shape)
# The training set has 800,000 rows and 47 columns; test set A has 200,000 rows and 46 columns

(800000, 47)
(200000, 46)
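
The one-column difference between the two shapes can be identified directly instead of by eye. Here is a minimal sketch, using two tiny made-up frames in place of `data_train` and `data_test_a`; the toy columns and values are assumptions for illustration and not the real files.

```python
import pandas as pd

# Toy stand-ins for data_train / data_test_a (hypothetical values).
toy_train = pd.DataFrame({"id": [0, 1], "loanAmnt": [35000.0, 18000.0], "isDefault": [1, 0]})
toy_test = pd.DataFrame({"id": [2, 3], "loanAmnt": [12000.0, 11000.0]})

# Index.difference returns the labels present in one Index but not the other.
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
only_in_test = toy_test.columns.difference(toy_train.columns).tolist()
print(only_in_train)
print(only_in_test)
```

Running the same two `difference` calls on the real frames would show which field the test set lacks.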

# Next, see which fields are present
data_train.columns

Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

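The fields have the following meanings:

id: unique ID assigned to the loan listing
loanAmnt: loan amount
term: loan term (years)
interestRate: loan interest rate
installment: installment payment amount
grade: loan grade
subGrade: loan sub-grade
employmentTitle: employment title
employmentLength: length of employment (years)
homeOwnership: home ownership status provided by the borrower at registration
annualIncome: annual income
verificationStatus: verification status
issueDate: month in which the loan was issued
purpose: loan purpose category given by the borrower in the loan application
postCode: first 3 digits of the postal code provided by the borrower in the loan application
regionCode: region code
dti: debt-to-income ratio
delinquency_2years: number of delinquency incidents 30+ days past due in the borrower's credit file over the past 2 years
ficoRangeLow: lower bound of the FICO range the borrower belonged to at loan issuance
ficoRangeHigh: upper bound of the FICO range the borrower belonged to at loan issuance
openAcc: number of open credit lines in the borrower's credit file
pubRec: number of derogatory public records
pubRecBankruptcies: number of public record bankruptcies
revolBal: total revolving credit balance
revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
totalAcc: total number of credit lines currently in the borrower's credit file
initialListStatus: initial listing status of the loan
applicationType: whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine: month in which the borrower's earliest reported credit line was opened
title: loan title provided by the borrower
policyCode: publicly available policy_code=1; new products not publicly available policy_code=2
n0-n14: anonymous features, processed from counts of certain borrower behaviors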

# Then use info() to look at the details of each field
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

Some obvious data problems already show up here; for example, employmentTitle is missing one value.

# Use describe() to view summary statistics
data_train.describe()

	id	loanAmnt	term	interestRate	installment	employmentTitle	homeOwnership	annualIncome	verificationStatus	isDefault	...	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
count	800000.000000	800000.000000	800000.000000	800000.000000	800000.000000	799999.000000	800000.000000	8.000000e+05	800000.000000	800000.000000	...	759730.000000	759730.000000	759730.000000	759729.000000	759730.000000	766761.000000	730248.000000	759730.000000	759730.000000	759730.000000
mean	399999.500000	14416.818875	3.482745	13.238391	437.947723	72005.351714	0.614213	7.613391e+04	1.009683	0.199513	...	8.107937	8.575994	8.282953	14.622488	5.592345	11.643896	0.000815	0.003384	0.089366	2.178606
std	230940.252015	8716.086178	0.855832	4.765757	261.460393	106585.640204	0.675749	6.894751e+04	0.782716	0.399634	...	4.799210	7.400536	4.561689	8.124610	3.216184	5.484104	0.030075	0.062041	0.509069	1.844377
min	0.000000	500.000000	3.000000	5.310000	15.690000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	...	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	199999.750000	8000.000000	3.000000	9.750000	248.450000	427.000000	0.000000	4.560000e+04	0.000000	0.000000	...	5.000000	4.000000	5.000000	9.000000	3.000000	8.000000	0.000000	0.000000	0.000000	1.000000
50%	399999.500000	12000.000000	3.000000	12.740000	375.135000	7755.000000	1.000000	6.500000e+04	1.000000	0.000000	...	7.000000	7.000000	7.000000	13.000000	5.000000	11.000000	0.000000	0.000000	0.000000	2.000000
75%	599999.250000	20000.000000	3.000000	15.990000	580.710000	117663.500000	1.000000	9.000000e+04	2.000000	0.000000	...	11.000000	11.000000	10.000000	19.000000	7.000000	14.000000	0.000000	0.000000	0.000000	3.000000
max	799999.000000	40000.000000	5.000000	30.990000	1715.420000	378351.000000	5.000000	1.099920e+07	2.000000	1.000000	...	70.000000	132.000000	79.000000	128.000000	45.000000	82.000000	4.000000	4.000000	39.000000	30.000000

8 rows × 42 columns

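# We can also view the first N and last N rows together
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
pd.concat([data_train.head(5), data_train.tail(5)])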

	id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	...	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	...	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	...	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	...	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
3	3	11000.0	3	7.26	340.96	A	A4	46854.0	10+ years	1	...	16.0	4.0	7.0	21.0	6.0	9.0	0.0	0.0	0.0	1.0
4	4	3000.0	3	12.99	101.07	C	C2	54.0	NaN	1	...	4.0	9.0	10.0	15.0	7.0	12.0	0.0	0.0	0.0	4.0
799995	799995	25000.0	3	14.49	860.41	C	C4	2659.0	7 years	1	...	6.0	2.0	12.0	13.0	10.0	14.0	0.0	0.0	0.0	3.0
799996	799996	17000.0	3	7.90	531.94	A	A4	29205.0	10+ years	0	...	15.0	16.0	2.0	19.0	2.0	7.0	0.0	0.0	0.0	0.0
799997	799997	6000.0	3	13.33	203.12	C	C3	2582.0	10+ years	1	...	4.0	26.0	4.0	10.0	4.0	5.0	0.0	0.0	1.0	4.0
799998	799998	19200.0	3	6.92	592.14	A	A4	151.0	10+ years	0	...	10.0	6.0	12.0	22.0	8.0	16.0	0.0	0.0	0.0	5.0
799999	799999	9000.0	3	11.06	294.91	B	B3	13.0	5 years	0	...	3.0	4.0	4.0	8.0	3.0	7.0	0.0	0.0	0.0	2.0

10 rows × 47 columns

Missing values and unique values:

Check the missing values in the data
Check the features that have a single unique value

# Count how many fields have missing values
print(data_train.isnull().any().sum(), "fields have missing values")

22 fields have missing values

# Find the features with severe missingness (more than 50% missing)
col =[]
for c in data_train.columns:
    if data_train[c].isnull().sum() > data_train.shape[0]*0.5:
        col.append(c)
print(col)

[]

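No field has more than 50% missing values.
The tutorial identifies these fields differently from how I wrote it, but both approaches reach the same result, so it is worth studying: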

have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)

{}
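
Both loops above can also be written without an explicit loop. This is a minimal sketch on a made-up frame (the column names `a`, `b`, `c` and their values are hypothetical): `isnull().mean()` gives the fraction of missing values per column, and boolean indexing keeps the columns above the threshold.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): "b" is 75% missing, "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing ratio.
missing_ratio = toy.isnull().mean()

# Keep only the columns whose missing ratio exceeds 50%.
mostly_missing = missing_ratio[missing_ratio > 0.5].index.tolist()
print(mostly_missing)
```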

# Next, visualize the overall missingness of each feature
missing = data_train.isnull().sum()/len(data_train)
print(type(missing))  # written this way the result is a Series, which is convenient
missing = missing[missing > 0]
print(missing)

<class 'pandas.core.series.Series'>
employmentTitle       0.000001
employmentLength      0.058499
postCode              0.000001
dti                   0.000299
pubRecBankruptcies    0.000506
revolUtil             0.000664
title                 0.000001
n0                    0.050338
n1                    0.050338
n2                    0.050338
n3                    0.050338
n4                    0.041549
n5                    0.050338
n6                    0.050338
n7                    0.050338
n8                    0.050339
n9                    0.050338
n10                   0.041549
n11                   0.087190
n12                   0.050338
n13                   0.050338
n14                   0.050338
dtype: float64

# Now take a direct look at the fields that have missing values
missing.sort_values(inplace=True)
missing.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x27289588948>

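[Figure: bar chart of the missing-value ratio of each field that has missing values, sorted in ascending order]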

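This chart quickly shows which fields have missing values and how severe the missingness is. If the missing ratio is very high, consider dropping the field; if only a few values are missing, find a way to fill them.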
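
The drop-or-fill rule can be sketched as follows. The helper name `handle_missing`, the 50% threshold, and the fill strategy (median for numeric columns, most frequent value otherwise) are my own assumptions for illustration, and the toy frame is made up; the tutorial does not prescribe these choices.

```python
import numpy as np
import pandas as pd

def handle_missing(df, drop_threshold=0.5):
    """Drop columns whose missing ratio exceeds drop_threshold, then fill the rest.

    Numeric columns are filled with the median and other columns with the
    most frequent value. Both choices are assumptions made for illustration.
    """
    ratio = df.isnull().mean()
    kept = df.loc[:, ratio <= drop_threshold].copy()
    for col in kept.columns:
        if not kept[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(kept[col]):
            kept[col] = kept[col].fillna(kept[col].median())
        else:
            kept[col] = kept[col].fillna(kept[col].mode().iloc[0])
    return kept

# Toy frame with hypothetical values; "mostly_empty" is 75% missing.
toy = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0, 20.0],
    "employmentLength": ["2 years", None, "2 years", "5 years"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = handle_missing(toy)
print(cleaned)
```

In a real pipeline the fill values would normally be computed on the training set only and then reused for the test set, so that the two sets are treated consistently.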
Next, check which features have only one value.

# Find the features that have only one value
only_one_value_datatrain = [col for col in data_train.columns if data_train[col].nunique()<=1]
only_one_value_datatesta = [col for col in data_test_a.columns if data_test_a[col].nunique()<=1]
print(only_one_value_datatrain)
print(only_one_value_datatesta)

['policyCode']
['policyCode']

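Clearly, policyCode has only one value, so this feature is almost useless.
To sum up, borrowing a summary from the tutorial:
22 of the 47 columns have missing data, which is normal in the real world. 'policyCode' has a single unique value (or is entirely missing). There are many continuous variables and some categorical variables.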
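
Since a single-valued column carries no information, one natural follow-up is to drop it. A minimal sketch on a made-up frame (the values are hypothetical; only the `policyCode` name comes from the data above):

```python
import pandas as pd

# Toy stand-in for data_train (hypothetical values).
toy_train = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0],
    "policyCode": [1.0, 1.0, 1.0],
})

# nunique() ignores NaN by default, so an all-missing column
# (0 unique values) is also caught by the <= 1 test.
constant_cols = [c for c in toy_train.columns if toy_train[c].nunique() <= 1]
reduced = toy_train.drop(columns=constant_cols)
print(constant_cols)
print(reduced.columns.tolist())
```

The same list of columns should be dropped from the test set so that both sets keep identical features.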

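A closer look at the data: data types

Categorical data
Numerical data
Discrete numerical data
Continuous numerical data

Features generally consist of categorical features and numerical features, and numerical features are further divided into continuous and discrete ones.
Categorical features sometimes carry no numeric relationship and sometimes do. For example, whether the grades A, B, C, etc. in 'grade' are purely categories or whether A ranks above the others has to be judged from the business context.
Numerical features can be fed into a model directly, but risk-control practitioners often bin them and convert them to WOE encoding, for example to build a standard scorecard. In terms of model performance, binning mainly reduces the complexity of a variable, reduces the impact of noise in the variable on the model, and strengthens the relationship between the independent and dependent variables, which makes the model more stable.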
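
To make the binning and WOE idea concrete, here is a rough sketch. It assumes equal-frequency bins and the convention WOE = ln(share of non-defaults in the bin / share of defaults in the bin); some references use the inverse ratio, which only flips the sign. The helper name `woe_table`, the smoothing constant, and the toy data are all hypothetical and are not taken from the tutorial.

```python
import numpy as np
import pandas as pd

def woe_table(values, target, bins=3, eps=0.5):
    """Equal-frequency binning followed by a per-bin WOE computation.

    target is 1 for a default ("bad") and 0 otherwise ("good").
    eps is a small smoothing constant that avoids log(0) in bins
    containing no goods or no bads.
    """
    binned = pd.qcut(values, q=bins, duplicates="drop")
    grouped = pd.DataFrame({"bin": binned, "bad": target}).groupby("bin", observed=True)["bad"]
    bad = grouped.sum()
    good = grouped.count() - bad
    n_bins = len(bad)
    good_share = (good + eps) / (good.sum() + eps * n_bins)
    bad_share = (bad + eps) / (bad.sum() + eps * n_bins)
    return pd.DataFrame({"good": good, "bad": bad, "woe": np.log(good_share / bad_share)})

# Toy data (hypothetical): higher interest rates default more often.
rates = pd.Series([6, 7, 8, 9, 12, 13, 14, 15, 18, 19, 20, 21], dtype=float)
defaults = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
table = woe_table(rates, defaults, bins=3)
print(table)
```

With this sign convention, a bin with a lower WOE has a larger share of the defaults than of the non-defaults, so in the toy data the WOE falls as the interest rate rises.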

# List which features are numerical and which are categorical
numerical_fea = list(data_train.select_dtypes(exclude='object').columns)
category_fea = list(data_train.select_dtypes(include='object').columns)
print(numerical_fea)
print(category_fea)

['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

# Among the numerical features, separate the discrete ones from the continuous ones
def filter_numerical_fea(data,numerical_fea):
    series_fea = []    # continuous: more than 10 distinct values
    noseries_fea = []  # discrete: at most 10 distinct values
    for c in numerical_fea:
        if data[c].nunique()<=10:
            noseries_fea.append(c)
        else:
            series_fea.append(c)
    return series_fea,noseries_fea
series_fea,noseries_fea = filter_numerical_fea(data_train,numerical_fea)

# Look at the value counts of the discrete numerical variables:
for i in noseries_fea:
    print(data_train[i].value_counts())

3    606902
5    193098
Name: term, dtype: int64
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64
0    640390
1    159610
Name: isDefault, dtype: int64
0    466438
1    333562
Name: initialListStatus, dtype: int64
0    784586
1     15414
Name: applicationType, dtype: int64
1.0    800000
Name: policyCode, dtype: int64
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64

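# Next, look at the continuous features:
# Visualize the distribution of each numerical feature
# Note: sn.distplot is deprecated since seaborn 0.11; sn.histplot is its replacement
f = pd.melt(data_train, value_vars=series_fea)
g = sn.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sn.distplot, "value")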

Execution got stuck partway through here on my computer; this step seems to be very resource-intensive.
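
A likely reason is that melting 800,000 rows across every continuous column produces tens of millions of rows before anything is drawn. One workaround, sketched here under my own assumptions, is to plot a random sample instead of the full table. The helper names, the sample size of 10,000, and the toy data are hypothetical, and I have not timed this on the real dataset.

```python
import numpy as np
import pandas as pd

def sample_long_format(df, columns, n=10000, seed=0):
    """Sample at most n rows, then reshape the chosen columns to long format."""
    sampled = df.sample(n=min(n, len(df)), random_state=seed)
    return pd.melt(sampled, value_vars=columns).dropna(subset=["value"])

def plot_distributions(long_df):
    """Draw one histogram per variable (needs seaborn >= 0.11 for histplot)."""
    import seaborn as sn  # imported here so the sampling helper works without seaborn
    g = sn.FacetGrid(long_df, col="variable", col_wrap=2, sharex=False, sharey=False)
    return g.map(sn.histplot, "value")

# Toy stand-in for data_train (hypothetical random values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "loanAmnt": rng.uniform(500, 40000, 50000),
    "interestRate": rng.uniform(5, 31, 50000),
})
long_df = sample_long_format(toy, ["loanAmnt", "interestRate"], n=10000)
print(long_df.shape)
```

Calling `plot_distributions(long_df)` then draws the grid from the sampled rows only. A sample of this size is usually enough to see the shape of each distribution, though rare extreme values may not appear in it.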

Relationships within the data

Relationships between features
Relationships between features and the target variable
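
The original post stops at these two headings, so the following is only a sketch of one common starting point: a Pearson correlation matrix between numerical features, plus each feature's correlation with the target. The toy frame is synthetic (the default probability is made to rise with the interest rate purely for illustration), so its numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_train (hypothetical values).
rng = np.random.default_rng(0)
n = 1000
interest = rng.uniform(5, 31, n)
toy = pd.DataFrame({
    "interestRate": interest,
    "installment": rng.uniform(15, 1700, n),
    # Synthetic target: default becomes more likely as the rate rises.
    "isDefault": (rng.uniform(0, 1, n) < interest / 40).astype(int),
})

# Feature-to-feature relationships: pairwise Pearson correlation.
corr = toy.corr()

# Feature-to-target relationships: correlation with isDefault,
# ordered by absolute strength.
target_corr = corr["isDefault"].drop("isDefault")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.loc[order]
print(target_corr)
```

Pearson correlation only captures linear relationships, and with a binary target it is a rough screening tool rather than a measure of predictive power.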

Generating a data report with pandas_profiling
