贷款违约预测Task2数据分析

最新推荐文章于 2021-09-15 19:17:30 发布

奇魄

最新推荐文章于 2021-09-15 19:17:30 发布

阅读量535

点赞数

文章标签：可视化大数据 python

本文链接：https://blog.csdn.net/yangjia3369/article/details/108654516

版权

本文探讨了贷款违约预测任务中的数据分析过程，包括数据集的基本情况、缺失值和唯一值检查、数据类型的识别以及变量分布的可视化。通过pandas和相关库进行数据探索，了解特征与目标变量的关系，为后续的机器学习建模打下基础。

摘要由CSDN通过智能技术生成

Task2:数据分析

数据分析也叫数据的探索性分析(EDA)，也就是统计学里的描述性统计。

探索性分析目的：
1.EDA价值主要在于熟悉了解整个数据集的基本情况（缺失值，异常值），对数据集进行验证是否可以进行接下来的机器学习或者深度学习建模.
2.了解变量间的相互关系、变量与预测值之间的存在关系。
3.为特征工程做准备

项目地址：https://github.com/datawhalechina/team-learning-data-mining/tree/master/FinancialRiskControl

比赛地址：https://tianchi.aliyun.com/competition/entrance/531830/introduction

2.1 学习目标

学习如何对数据集整体概况进行分析，包括数据集的基本情况（缺失值，异常值）
学习了解变量间的相互关系、变量与预测值之间的存在关系
完成相应学习打卡任务

2.2 内容介绍

数据总体了解：
- 读取数据集并了解数据集大小，原始特征维度；
  通过info熟悉数据类型；
  粗略查看数据集中各特征基本统计量；
缺失值和唯一值：
查看数据缺失值情况
查看唯一值特征情况
深入数据-查看数据类型
类别型数据
数值型数据
离散数值型数据
连续数值型数据
数据间相关关系
特征和特征之间关系
特征和目标变量之间关系
用pandas_profiling生成数据报告

2.3 代码示例

2.3.1 导入数据分析及可视化过程需要的库

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')    #忽略粉红色的提示

python中需要用的库可以用pip工具进行安装，或者conda进行安装，如果安装了anaconda的话，基本的库会自带，其他高级库可以用以上两个工具在cmd窗口进行安装。

2.3.2 读取文件

data_train = pd.read_csv('./train.csv')
data_test_a = pd.read_csv('./testA.csv')

2.3.2.1 pandas读取文件常见问题

pandas可以读取常见数据存储格式的文件，包括excel、csv、sql等。
pandas读取数据时相对路径载入报错时，尝试使用os.getcwd()查看当前工作目录。
TSV与CSV的区别：
从名称上即可知道，TSV是用制表符（Tab,’\t’）作为字段值的分隔符；CSV是用半角逗号（’,’）作为字段值的分隔符；
Python对TSV文件的支持： Python的csv模块准确的讲应该叫做dsv模块，因为它实际上是支持范式的分隔符分隔值文件（DSV，delimiter-separated values）的。 delimiter参数值默认为半角逗号，即默认将被处理文件视为CSV。当delimiter=’\t’时，被处理文件就是TSV。
读取文件的部分（适用于文件特别大的场景）
通过nrows参数，来设置读取文件的前多少行，nrows是一个大于等于0的整数。
分块读取

data_train_sample = pd.read_csv("./train.csv",nrows=5)

#设置chunksize参数，来控制每次迭代数据的大小
chunker = pd.read_csv("./train.csv",chunksize=5)
for item in chunker:
    print(type(item))
    #<class 'pandas.core.frame.DataFrame'>
    print(len(item))
    #5

<class 'pandas.core.frame.DataFrame'>
5

注意：由于训练集数据非常多，迭代的时间可能非常长。

2.3.3总体了解

查看数据集的样本个数和原始特征维度

data_test_a.shape

(200000, 46)

测试集数据有200000条，特征属性有46个。

data_train.shape

(800000, 47)

训练集数据有800000条，特征属性有47个。

data_train.columns

Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

data_test_a.columns

Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'purpose',
       'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
       'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
       'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
       'earliesCreditLine', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3',
       'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

通过info()来熟悉数据类型

data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

由训练集数据的信息来看，存在缺失数据。employmentTitle、employmentLength、postCode、dti、pubRecBankruptcies、revolUtil 、title 、n0～n14都存在缺失值，但是缺失的数据量都不是很多。

统计描述