Exploratory Data Analysis (EDA)
1. What is EDA?
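Exploratory data analysis (EDA) was proposed in the 1960s by the American statistician John Tukey. It is an approach to data analysis that explores existing data (especially raw data obtained from surveys or observation) under as few prior assumptions as possible, using plots, tables, model fitting, and summary statistics to uncover the structure and regularities of the data.
In the era of big data, faced with all kinds of messy "dirty data", we are often at a loss about where to start in understanding the data at hand. This is where exploratory data analysis is most useful.
Wikipedia on EDA: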
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
2. Why EDA?
3. How EDA?
3.1 Basic setup
Package Preparation
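- pd for tabular data; np for numerical computation;
- matplotlib.pyplot/plt, seaborn/sns, missingno/msno for data visualization;
- os for path setup: the default working directory is wherever the code files are located;
- display the result of every expression in a cell, not only the last one;
- suppress warnings.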
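# basic packages
import numpy as np  # numerical computation
import pandas as pd  # tabular data
import matplotlib.pyplot as plt  # plotting (1)
# Render figures inline in Jupyter. The magic stays on its own line because a trailing comment would be parsed as an argument.
%matplotlib inline
import seaborn as sns  # plotting (2)
color = sns.color_palette()  # current seaborn color palette
sns.set(style="whitegrid", color_codes=True)  # white background with grid lines
sns.set(font_scale=1)  # note: this second call also resets the style to seaborn's default ("darkgrid")
import missingno as msno  # plotting (3): missing-value visualization
# basic settings
import warnings  # suppress warning output
warnings.filterwarnings("ignore")
import os  # working-directory handling
os.getcwd()  # current working directory
os.chdir('/Users/**/CreditScoring')  # set the working directory
os.getcwd()
from IPython.core.interactiveshell import InteractiveShell  # show the result of every expression in a cell (by default only the last one is shown)
InteractiveShell.ast_node_interactivity = "all"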
3.2 Reading the data
- Drop meaningless rows and columns, such as the row index;
- Rename the variables so they are easier to work with.
Hard credit information:
- Monthly income;
- Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards);
- Number of mortgage and real estate loans, including home equity lines of credit;
- Number of times the borrower has been 30-59 days past due, but no worse, in the last 2 years;
- Number of times the borrower has been 60-89 days past due, but no worse, in the last 2 years;
- Number of times the borrower has been 90 days or more past due;
- Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits;
- Monthly debt payments, alimony, and living costs divided by monthly gross income.
Soft credit information:
- Age of borrower in years;
- Number of dependents in family excluding themselves (spouse, children etc.).
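The two ratio features in the hard-information list are plain quotients. A minimal sketch with made-up numbers follows; the function names and the borrower's figures are hypothetical and are not taken from the dataset.

```python
def revolving_utilization(balance: float, credit_limit: float) -> float:
    """Unsecured revolving balance divided by the sum of credit limits."""
    return balance / credit_limit


def debt_ratio(monthly_costs: float, monthly_gross_income: float) -> float:
    """Monthly debt payments, alimony and living costs divided by monthly gross income."""
    return monthly_costs / monthly_gross_income


# Hypothetical borrower: 2,500 owed on cards with a 10,000 total limit,
# and 1,800 of monthly obligations on a 6,000 gross income.
utilization = revolving_utilization(2500, 10000)
ratio = debt_ratio(1800, 6000)
print(f"utilization={utilization:.2f}, debt ratio={ratio:.2f}")
# prints: utilization=0.25, debt ratio=0.30
```

Values well above 1 in either ratio are worth a closer look during exploration, since they may point to data-entry problems or a missing denominator rather than real behavior.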
# read data
trainset_orig = pd.read_csv(r'/Users/**/CreditScoring/dataset/cs-training.csv')
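trainset = trainset_orig.drop(trainset_orig.columns[0], axis=1)  # drop the first column (a row index that carries no information)
# Rename variables
# `states` holds Chinese display names (an alternative mapping, not applied below)
states={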
'SeriousDlqin2yrs':'是否坏客户',
'RevolvingUtilizationOfUnsecuredLines':'剩余可用额度的比例',
'age':'年龄',
'NumberOfTime30-59DaysPastDueNotWorse':'两年内逾期30-59天的笔数',
'DebtRatio':'月负债百分比',
'MonthlyIncome':'月收入',
'NumberOfOpenCreditLinesAndLoans':'普通信贷的笔数',
'NumberOfTimes90DaysLate':'贷款以来逾期90天的笔数',
'NumberRealEstateLoansOrLines':'固定资产贷款的笔数',
'NumberOfTime60-89DaysPastDueNotWorse':'两年内逾期60-89天的笔数',
'NumberOfDependents':'家属人数'}
# `states2` holds English display names (the mapping applied below)
states2={
'SeriousDlqin2yrs':'Serious Dlqin 2yrs',
'RevolvingUtilizationOfUnsecuredLines':'Revolving Utilization',
'age':'Age',
'NumberOfTime30-59DaysPastDueNotWorse':'# of 30-59 Days Past Due',
'DebtRatio':'Debt Ratio',
'MonthlyIncome':'Monthly Income',
'NumberOfOpenCreditLinesAndLoans':'# of Open Credit Lines And Loans',
'NumberOfTimes90DaysLate':'# of 90 Days Late',
'NumberRealEstateLoansOrLines':'# of Real Estate Loans/Lines',
'NumberOfTime60-89DaysPastDueNotWorse':'# of 60-89 Days Past Due',
'NumberOfDependents':'# of Dependents'}
trainset.rename(columns=states2,inplace=True)
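To see the drop-and-rename steps in isolation, here is a small sketch on a toy frame. The frame `raw`, its values, and the shortened `labels` mapping are hypothetical stand-ins for the real CSV and for `states2`.

```python
import pandas as pd

# Hypothetical miniature of the raw CSV: an unnamed index column plus two features.
raw = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "SeriousDlqin2yrs": [0, 1, 0],
    "age": [45, 40, 38],
})

# Drop the first column by position, as done for the real file.
demo = raw.drop(raw.columns[0], axis=1)

# Rename with a dict; keys that are absent from the frame (here "DebtRatio")
# are ignored under the default errors="ignore".
labels = {"SeriousDlqin2yrs": "Serious Dlqin 2yrs", "age": "Age", "DebtRatio": "Debt Ratio"}
demo = demo.rename(columns=labels)

print(list(demo.columns))
# prints: ['Serious Dlqin 2yrs', 'Age']
```

Dropping by position assumes the index really is the first column. Checking `raw.columns` before dropping guards against removing a feature by mistake.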
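3.3 Exploration and analysis
3.31 Data overview
- Whether the sample classes (y) are balanced
- Whether there are missing values (x): two plots. msno.matrix shows how missingness co-occurs across variables, and msno.bar shows how complete each variable is, from which the missing percentage follows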
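Both checks can also be done numerically before any plotting. The sketch below uses a toy frame; the data are hypothetical, and only the column names mirror the renamed training set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the training set.
toy = pd.DataFrame({
    "Serious Dlqin 2yrs": [0, 0, 0, 0, 1],
    "Monthly Income": [5000.0, np.nan, 3200.0, np.nan, 4100.0],
    "Age": [45, 40, 38, 30, 49],
})

# Share of each class in the target (y), in percent.
class_share = toy["Serious Dlqin 2yrs"].value_counts(normalize=True) * 100

# Share of missing values per column (x), in percent.
missing_share = toy.isna().mean() * 100

print(class_share)
print(missing_share)
```

In this toy frame the positive class makes up 20% of the rows and "Monthly Income" is 40% missing. The same two lines applied to `trainset` give the figures that the plots visualize.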
# Overview data
'Sample classes are imbalanced'
def add_freq():  # add data labels (percentages) to the bars; relies on a global Axes named `ax`
    ncount = len(trainset)  # number of samples
    ax2 = ax.twinx()
    ax2.yaxis.tick_left()
    ax.yaxis.tick_right()
    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')
    ax2.set_ylabel('Frequency [%]', fontsize=21)  # label of the left axis
    for p in ax.patches:
        x = p.get_bbox().get_points()[:, 0]
        y = p.get_bbox().get_points()[1, 1]
        ax.annotate('{:.1f}%'.format(100.*y/ncount), (x.mean(), y), ha='center', va='bottom', fontsize=17, color='k')
    ax2.set_ylim(0, 100)
    ax2.grid(False)  # turn off the grid on the twin axis