Financial Risk Control Bootcamp: Task 2 Study Notes

Latest recommended article published on 2021-12-13 09:25:49

weixin_54525834

Article link: https://blog.csdn.net/weixin_54525834/article/details/116139237


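Task 2: Data Analysis

Learning objectives

Today's focus is the dataset itself: getting a basic picture of the data and of how the variables relate to and depend on one another, in preparation for the modeling that follows.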

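Contents

Overall understanding of the data:
- Read the dataset and check its size and the original feature dimensions;
- Use info() to get familiar with the data types;
- Take a rough look at the basic statistics of each feature;
Missing values and unique values:
- Check for missing values
- Check for features with a single unique value
A closer look at the data types:
- Categorical data
- Numerical data
  - Discrete numerical data
  - Continuous numerical data
Relationships in the data:
- Relationships between features
- Relationships between features and the target variable
Generate a data report with pandas_profiling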

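Code examples

1. Import the libraries needed to load and analyze the data. The tutorial also notes that if something fails, an alternative approach can be used.

Reading the files

The tutorial introduces the relevant usage and adds some supplementary tips, listed below.

If pandas raises an error when loading data from a relative path, try os.getcwd() to check the current working directory.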
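A minimal sketch of that check, assuming a hypothetical file name `train.csv` (the real file name is not given in these notes). It prints the directory that relative paths are resolved against and the absolute path pandas would end up looking at.

```python
import os
from pathlib import Path

# Directory that relative paths are resolved against.
cwd = os.getcwd()
print("Current working directory:", cwd)

# 'train.csv' is a hypothetical file name used only for illustration.
relative_path = Path("train.csv")
absolute_path = Path(cwd) / relative_path

print("A relative read would look for:", absolute_path)
print("File exists there:", absolute_path.exists())
```

If the file is not found there, either move it, change the working directory, or pass the absolute path to the reader.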
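Differences between TSV and CSV:
- As the names suggest, TSV uses the tab character (Tab, '\t') as the field delimiter, while CSV uses the half-width comma (',');
- Python support for TSV files: strictly speaking, Python's csv module could be called a dsv module, because it actually supports delimiter-separated values (DSV) files in general. The delimiter parameter defaults to a half-width comma, so by default the file is treated as CSV. With delimiter='\t', the file is treated as TSV.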
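The delimiter choice can be sketched as follows. The sample text is made up for illustration; the first half uses the standard csv module with delimiter='\t', and the second half shows the equivalent `sep` parameter of `pandas.read_csv`.

```python
import csv
import io

import pandas as pd

# Tiny made-up TSV sample; the values are illustrative only.
tsv_text = "id\tloanAmnt\tgrade\n0\t35000.0\tE\n1\t18000.0\tD\n"

# csv.reader defaults to delimiter=','; passing '\t' makes it parse TSV.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0])

# pandas exposes the same choice through the sep parameter of read_csv.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tsv.shape)
```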
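Reading part of a file (useful when the file is very large):
- The nrows parameter sets how many rows to read from the top of the file; nrows is an integer greater than or equal to 0.
- Read the file in chunks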
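Both options can be sketched with `pandas.read_csv`. The in-memory CSV below is made up so the example runs without the competition files; with a real file, the path would replace the `io.StringIO(...)` argument. Chunked reading here uses the `chunksize` parameter, which is one way to do it.

```python
import io

import pandas as pd

# Made-up CSV with 10 data rows, standing in for a large file.
csv_text = "id,loanAmnt\n" + "".join(
    f"{i},{1000.0 * (i + 1)}\n" for i in range(10)
)

# nrows: read only the first 3 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(head)

# chunksize: iterate over the file in pieces of 4 rows each.
total_rows = 0
chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

print(chunk_sizes, total_rows)
```

Each chunk is an ordinary DataFrame, so per-chunk statistics can be accumulated without holding the whole file in memory.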

Overall understanding

Check the number of samples and the original feature dimensions of the dataset
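A small sketch of this step on a made-up frame (`data_demo` is a stand-in for the real training data, which these notes load as `data_train`). `shape` gives (number of samples, number of features), and `columns` lists the feature names.

```python
import pandas as pd

# Made-up stand-in for the training data.
data_demo = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "loanAmnt": [35000.0, 18000.0, 12000.0],
        "grade": ["E", "D", "D"],
    }
)

# shape is a (rows, columns) tuple.
n_samples, n_features = data_demo.shape
print(f"samples: {n_samples}, features: {n_features}")

# Column names, in file order.
column_names = list(data_demo.columns)
print(column_names)
```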

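Look at the column names. The competition-understanding section already gave the meaning of each feature; they are repeated here for convenience:

id: unique identifier assigned to the loan listing
loanAmnt: loan amount
term: loan term (years)
interestRate: loan interest rate
installment: installment payment amount
grade: loan grade
subGrade: loan sub-grade
employmentTitle: employment title
employmentLength: length of employment (years)
homeOwnership: home ownership status provided by the borrower at registration
annualIncome: annual income
verificationStatus: verification status
issueDate: month in which the loan was issued
purpose: loan purpose category stated by the borrower in the loan application
postCode: first 3 digits of the postal code provided by the borrower in the loan application
regionCode: region code
dti: debt-to-income ratio
delinquency_2years: number of delinquency events more than 30 days past due in the borrower's credit file over the past 2 years
ficoRangeLow: lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh: upper bound of the borrower's FICO range at loan issuance
openAcc: number of open credit lines in the borrower's credit file
pubRec: number of derogatory public records
pubRecBankruptcies: number of public records cleared
revolBal: total revolving credit balance
revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
totalAcc: total number of credit lines currently in the borrower's credit file
initialListStatus: initial listing status of the loan
applicationType: indicates whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine: month in which the borrower's earliest reported credit line was opened
title: loan title provided by the borrower
policyCode: publicly available policy code = 1; new products not publicly available policy code = 2
n-series anonymous features: anonymous features n0-n14, derived from processing counts of borrower behavior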

Use info() to get familiar with the data types

data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  800000 non-null  int64
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object
 6   subGrade            800000 non-null  object
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object
 9   homeOwnership       800000 non-null  int64
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64
 12  issueDate           800000 non-null  object
 13  isDefault           800000 non-null  int64
 14  purpose             800000 non-null  int64
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64
 28  applicationType     800000 non-null  int64
 29  earliesCreditLine   800000 non-null  object
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

Take a rough overall look at some basic statistics for each feature of the dataset
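One common way to get these statistics is `DataFrame.describe()`; the notes do not show which call was used, so treat this as an assumption. The frame below is made up for illustration. By default, describe() on a frame with mixed types summarizes only the numeric columns.

```python
import pandas as pd

# Made-up stand-in for the training data.
data_demo = pd.DataFrame(
    {
        "loanAmnt": [35000.0, 18000.0, 12000.0, 11000.0],
        "interestRate": [19.52, 18.49, 16.99, 7.26],
        "grade": ["E", "D", "D", "A"],
    }
)

# count, mean, std, min, quartiles and max for each numeric column.
stats = data_demo.describe()
print(stats)

# The object column 'grade' is left out unless include='all' is passed.
print(list(stats.columns))
```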

Check missing values, unique values, and so on in the dataset's features

Check missing values

There are 22 columns in train dataset with missing values.

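The output above shows that 22 feature columns in the training set have missing values. Next, check which of those features have a missing rate above 50%.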

{}

The result is empty: no feature has a missing rate above 50%.
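The two checks above can be sketched like this. The frame is made up (unlike the real data, it deliberately contains one column above the 50% threshold so the filter has something to return), and the variable names are illustrative rather than the ones used in the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the training data.
data_demo = pd.DataFrame(
    {
        "loanAmnt": [35000.0, 18000.0, 12000.0, 11000.0],
        "employmentLength": ["2 years", None, "8 years", "10+ years"],
        "n11": [np.nan, np.nan, np.nan, 0.0],
    }
)

# Number of columns that contain at least one missing value.
n_cols_missing = int(data_demo.isnull().any().sum())
print(f"There are {n_cols_missing} columns in the demo dataset with missing values.")

# Missing rate per column, then keep only columns above 50%.
missing_rate = data_demo.isnull().mean()
mostly_missing = {col: rate for col, rate in missing_rate.items() if rate > 0.5}
print(mostly_missing)
```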

Look at the features with missing values and their missing rates in detail
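A sketch of how the per-feature missing rates could be prepared for such a view, on a made-up frame. Only the features that actually have missing values are kept, sorted from lowest to highest rate. The plotting call is left as a comment because it needs matplotlib.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the training data.
data_demo = pd.DataFrame(
    {
        "loanAmnt": [35000.0, 18000.0, 12000.0, 11000.0],
        "dti": [np.nan, np.nan, 17.05, 27.83],
        "revolUtil": [np.nan, 48.9, 38.9, 51.8],
    }
)

# Fraction of missing values per column.
missing = data_demo.isnull().mean()

# Keep only columns with missing values, lowest rate first.
missing = missing[missing > 0].sort_values()
print(missing)

# With matplotlib installed, missing.plot.bar() draws these rates as a bar chart.
```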

[Figure: plot of the missing rate of each feature with missing values]

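Looking column by column shows which columns contain NaN, and the NaN count can be printed for each. The main purpose is to check whether a column really has a large number of NaN values. If too many values are missing, the column contributes almost nothing to predicting the label and can be considered for removal. If only a few values are missing, filling them in is usually an option.
The comparison can also be made row by row: if some samples are missing most of their columns and there are enough samples, those rows can be considered for removal.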
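The row-wise check can be sketched as follows. The data is made up, and the 0.5 threshold is a hypothetical choice for illustration, not a value taken from the tutorial.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the training data; row 1 is mostly empty.
data_demo = pd.DataFrame(
    {
        "loanAmnt": [35000.0, np.nan, 12000.0],
        "dti": [17.05, np.nan, 22.77],
        "revolUtil": [48.9, np.nan, np.nan],
        "annualIncome": [110000.0, 46000.0, 74000.0],
    }
)

# Fraction of missing columns for each sample (axis=1 works across a row).
row_missing_rate = data_demo.isnull().mean(axis=1)
print(row_missing_rate)

# Hypothetical threshold: drop samples missing more than half their columns.
threshold = 0.5
kept = data_demo[row_missing_rate <= threshold]
print(kept)
```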

Tips: the lgb (LightGBM) model, a heavy hitter in competitions, handles missing values automatically. Task 4 covers the models in detail.

Find features that take only one value in the training set and the test set

Both sets return the same list:

['policyCode']

There are 1 columns in train dataset with one unique value.

There are 1 columns in test dataset with one unique value.
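A sketch of how such a check could be written, using made-up train and test frames. The helper name `one_value_features` is hypothetical. Note that `nunique()` ignores NaN by default, so a column holding one value plus missing entries is also reported.

```python
import pandas as pd


def one_value_features(df):
    """Return the columns of df that hold at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique() <= 1]


# Made-up stand-ins for the training and test data.
train_demo = pd.DataFrame(
    {"loanAmnt": [35000.0, 18000.0, 12000.0], "policyCode": [1.0, 1.0, 1.0]}
)
test_demo = pd.DataFrame(
    {"loanAmnt": [14000.0, 20000.0], "policyCode": [1.0, 1.0]}
)

one_value_train = one_value_features(train_demo)
one_value_test = one_value_features(test_demo)

print(one_value_train)
print(one_value_test)
print(f"There are {len(one_value_train)} columns in train dataset with one unique value.")
print(f"There are {len(one_value_test)} columns in test dataset with one unique value.")
```

A feature that never varies cannot help separate defaults from non-defaults, so it is a natural candidate to drop before modeling.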

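Reflections

1. Today's study gave me a picture of the dataset and of the relationships among the independent variables, in preparation for the modeling ahead.
2. I learned not only how to load the data but also how to fall back on other methods when problems come up, such as using os.getcwd(), and how to pick the method that fits the situation: TSV uses the tab character (Tab, '\t') as the field delimiter, while CSV uses the half-width comma (',').
3. Some of these methods are not limited to financial risk control; they can also be used in future data analysis work.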