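1. The main value of EDA is getting familiar with the basic state of the whole dataset (missing values, outliers) and verifying whether it is suitable for the machine learning or deep learning modeling that follows.
2. Understand the relationships among the variables, and between the variables and the prediction target.
3. Prepare for feature engineering.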
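As a rough illustration of the first point, the sketch below summarizes missing values per column and counts outliers with the common 1.5 x IQR rule. The helper names `missing_summary` and `iqr_outlier_count` and the toy values are assumptions made for this example, and the IQR rule is only one of several possible outlier definitions.

```python
import pandas as pd

def missing_summary(df):
    """Return per-column missing count and ratio, highest ratio first."""
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_ratio": df.isnull().mean(),
    })
    return summary.sort_values("missing_ratio", ascending=False)

def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs are ignored."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "loanAmnt": [1000.0, 1200.0, 1100.0, 1150.0, 50000.0],
    "employmentLength": ["2 years", None, "8 years", None, "10+ years"],
})

print(missing_summary(toy))
print(iqr_outlier_count(toy["loanAmnt"]))
```

On the real data the same helpers could be called as `missing_summary(train_data)` once the file has been loaded.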
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import warnings
warnings.filterwarnings('ignore')
Read the files
train_data_file="./train.csv"
test_data_file="./testA.csv"
train_data = pd.read_csv(train_data_file)
test_data = pd.read_csv(test_data_file)
If the file is very large, you can read only part of it
data_train_sample = pd.read_csv("./train.csv",nrows=5)
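Besides `nrows`, `pd.read_csv` also accepts a `chunksize` argument, which returns an iterator of DataFrames so that a large file can be processed piece by piece. The sketch below uses a small in-memory CSV as a stand-in for the real file; the chunk size of 2 and the running-mean computation are only illustrative assumptions.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file such as ./train.csv
csv_text = "id,loanAmnt\n0,1000\n1,2000\n2,3000\n3,4000\n4,5000\n"

total_rows = 0
loan_sum = 0.0
# With chunksize set, read_csv yields one DataFrame per chunk
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    loan_sum += chunk["loanAmnt"].sum()

mean_loan = loan_sum / total_rows
print(total_rows, mean_loan)
```

This keeps only one chunk in memory at a time, at the cost of having to express each statistic as an accumulation over chunks.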
Check the number of samples and the number of original features
train_data.shape
(800000, 47)
test_data.shape
(200000, 48)
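The two shapes show that the train and test sets do not have the same number of columns, so it is worth checking which columns differ. A minimal sketch, assuming a hypothetical helper `column_diff` and made-up column names in the toy frames:

```python
import pandas as pd

def column_diff(train, test):
    """Return (columns only in train, columns only in test), each sorted."""
    only_train = sorted(set(train.columns) - set(test.columns))
    only_test = sorted(set(test.columns) - set(train.columns))
    return only_train, only_test

# Toy frames with hypothetical column names, standing in for train_data / test_data
toy_train = pd.DataFrame(columns=["id", "loanAmnt", "target"])
toy_test = pd.DataFrame(columns=["id", "loanAmnt", "extra_a", "extra_b"])

only_train, only_test = column_diff(toy_train, toy_test)
print("only in train:", only_train)
print("only in test:", only_test)
```

On the real data this would be called as `column_diff(train_data, test_data)`.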
View the column names and the summary information
train_data.columns
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
…
(The rest of the output is omitted because it is too long.)
# object-dtype variables -> objectList
# numeric variables with at most 10 distinct values (treated as categorical) -> classList
# continuous variables -> numericalList
objectList = []
classList = []
numericalList = []
for i in train_data.columns:
    if train_data[i].dtype == 'O':
        objectList.append(i)
for i in list(train_data.select_dtypes(exclude=['object']).columns):
    temp = train_data[i].unique()
    if len(temp) <= 10:
        classList.append(i)
    else:
        numericalList.append(i)
Inspect the categorical variables
for i in classList:
    print(i)
    print('-' * 30)
    print(train_data[i].value_counts())
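Inspect the numerical variables
dis_cols = 6
# enough rows to hold one subplot per numerical column
dist_rows = int(np.ceil(len(numericalList) / dis_cols))
plt.figure(figsize=(4 * dis_cols, 4 * dist_rows))
i = 1
for col in numericalList:
    ax = plt.subplot(dist_rows, dis_cols, i)
    # fill= replaces the deprecated shade= argument (seaborn 0.11+)
    ax = sns.kdeplot(train_data[col], color='red', fill=True)
    ax = sns.kdeplot(test_data[col], color='blue', fill=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Density")
    ax.legend(["train", "test"])
    i += 1
plt.show()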
Compare the train and test distributions, and drop the features whose distributions look bad.
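Judging distributions by eye can be backed up with a number. One option, not used in the original walkthrough, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs, where 0 means identical samples and 1 means no overlap at all. The sketch below implements it with NumPy only; the function name `ks_statistic` and the synthetic samples are assumptions for illustration.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute gap between the empirical CDFs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop NaNs, then sort so searchsorted can evaluate the empirical CDFs
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic samples standing in for a train column and a test column
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
similar = ks_statistic(base, rng.normal(size=1000))
shifted = ks_statistic(base, rng.normal(loc=1.0, size=1000))
print(similar, shifted)
```

A column could then be flagged for closer inspection when `ks_statistic(train_data[col], test_data[col])` is large; what counts as "large" is a judgment call that depends on the data.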
Inspect the object variables
'employmentLength', 'issueDate', and 'earliesCreditLine' are all date- or time-related variables
data_all=pd.concat([train_data,test_data])
data_all['employmentLength'].head()
0 2 years
1 5 years
2 8 years
3 10+ years
4 NaN
Name: employmentLength, dtype: object
data_all['issueDate'].head()
0 2014-07-01
1 2012-08-01
2 2015-10-01
3 2015-08-01
4 2016-03-01
Name: issueDate, dtype: object
data_all['earliesCreditLine'].head()
0 Aug-2001
1 May-2002
2 May-2006
3 May-1999
4 Aug-1977
Name: earliesCreditLine, dtype: object
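All three columns are stored as strings, so they need to be parsed before they can be used numerically. The sketch below shows one possible conversion on the five rows printed above. The new column name `employmentYears` is an assumption, and extracting the first run of digits is a simplification: a value such as "10+ years" becomes 10, and any "less than one year" style label, if present in the data, would need separate handling.

```python
import pandas as pd

# The five rows shown in the head() outputs above
toy = pd.DataFrame({
    "employmentLength": ["2 years", "5 years", "8 years", "10+ years", None],
    "issueDate": ["2014-07-01", "2012-08-01", "2015-10-01", "2015-08-01", "2016-03-01"],
    "earliesCreditLine": ["Aug-2001", "May-2002", "May-2006", "May-1999", "Aug-1977"],
})

# Pull out the first run of digits; missing values stay NaN
toy["employmentYears"] = pd.to_numeric(
    toy["employmentLength"].str.extract(r"(\d+)", expand=False)
)

# Parse the two date columns with explicit formats
# (%b expects English month abbreviations such as "Aug")
toy["issueDate"] = pd.to_datetime(toy["issueDate"], format="%Y-%m-%d")
toy["earliesCreditLine"] = pd.to_datetime(toy["earliesCreditLine"], format="%b-%Y")

print(toy.dtypes)
print(toy[["employmentYears", "issueDate", "earliesCreditLine"]])
```

Giving `format=` explicitly avoids relying on pandas to guess the date layout, which matters most for the month-year strings in 'earliesCreditLine'.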
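At this point we have looked through the distribution of the whole dataset. The next step is feature engineering on the data: removing outliers, filling in missing values, and converting the data into usable formats.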