1.EDA价值主要在于熟悉了解整个数据集的基本情况（缺失值，异常值），对数据集进行验证是否可以进行接下来的机器学习或者深度学习建模.

2.了解变量间的相互关系、变量与预测值之间的存在关系。

3.为特征工程做准备

## 代码示例

### 2.3.1 导入数据分析及可视化过程需要的库

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')


### 2.3.2 读取文件

data_train = pd.read_csv('./train.csv')

##### 读取文件的拓展知识

#设置chunksize参数，来控制每次迭代数据的大小
for item in chunker:
print(type(item))
#<class 'pandas.core.frame.DataFrame'>
print(len(item))
#5


### 2.3.3 总体了解

##### 查看数据集的样本个数和原始特征维度
data_test_a.shape


data_train.shape


data_train.columns


### 2.3.4 查看数据集中特征缺失值，唯一值等

print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')


have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
if value > 0.5:
fea_null_moreThanHalf[key] = value


one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]

one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]


print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')


There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.

### 2.3.5 查看特征的数值类型有哪些，对象类型有哪些

numerical_fea = list(data_train.select_dtypes(exclude=[‘object’]).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))

#### 数值型变量分析，数值型肯定是包括连续型变量和离散型变量的，找出来

#过滤数值型类别特征
def get_numerical_serial_fea(data,feas):
numerical_serial_fea = []
numerical_noserial_fea = []
for fea in feas:
temp = data[fea].nunique()
if temp <= 10:
numerical_noserial_fea.append(fea)
continue
numerical_serial_fea.append(fea)
return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)

#每个数字特征得分布可视化
f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")


12-22
03-06