数据探索分析-EDA

最新推荐文章于 2023-09-14 20:53:07 发布

Andy_shenzl

最新推荐文章于 2023-09-14 20:53:07 发布

阅读量2.1k

点赞数 3

分类专栏：数据可视化数据挖掘数据分析文章标签：数据探索

本文链接：https://blog.csdn.net/Andy_shenzl/article/details/105954315

版权

数据可视化同时被 3 个专栏收录

23 篇文章 3 订阅

订阅专栏

数据挖掘

21 篇文章 6 订阅

订阅专栏

python基础操作

17 篇文章 1 订阅

订阅专栏

一、什么是EDA

定义

在统计学中，探索性数据分析（EDA）是一种分析数据集以概括其主要特征的方法，通常使用可视化方法。可以使用或使用统计模型，但主要是EDA是为了了解数据在形式化建模或假设测试任务之外能告诉我们什么。探索性数据分析是John Tukey提拔的鼓励统计学家的研究数据，并尽可能提出假设，尽可能生成新的数据收集和实验。EDA不同于初始数据分析（IDA），它更集中于检查模型拟合和假设检验所需的假设，以及处理缺少的值，并根据需要进行变量转换。EDA包含IDA。

是指对已有的数据（特别是调查或观察得来的原始数据）在尽量少的先验假定下进行探索，通过作图、制表、方程拟合、计算特征量等手段探索数据的结构和规律的一种数据分析方法。特别是当我们对面对大数据时代到来的时候，各种杂乱的“脏数据”，往往不知所措，不知道从哪里开始了解目前拿到手上的数据时候，探索性数据分析就非常有效。探索性数据分析是上世纪六十年代提出，其方法有美国统计学家John Tukey提出的。

plan

Form hypotheses/develop investigation themes to explore：形成假设，确定主题去探索
Wrangle data：清理数据
Assess quality of data：评价数据质量
Profile data：数据报表
Explore each individual variable in the dataset：探索分析每个变量
Assess the relationship between each variable and the target：探索每个自变量与因变量之间的关系
Assess interactions between variables：探索每个自变量之间的相关性
Explore data across many dimensions：从不同的维度来分析数据

Throughout the entire analysis you want to:

Capture a list of hypotheses and questions that come up for further exploration：写出一系列你自己做的假设，然后接着做更深入的数据分析
Record things to watch out for/ be aware of in future analyses：记录下自己探索过程中更进一步的数据分析过程
Show intermediate results to colleagues to get a fresh perspective, feedback, domain knowledge. Don't do EDA in a bubble! Get feedback throughout especially from people removed from the problem and/or with relevant domain knowledge：把自己的中间的结果给自己的同行看看，让他们能够给你一些更有拓展性的反馈、或者意见。不要独自一个人做，国外的思维就是知道了什么就喜欢open to everybody，要走出去，多多交流，打开新的世界。

二、案例实战

1、整体步骤

2、实例-python演示

1>数据背景

目标名称：水的供应和用水是否与人均国内生产总值有关？（提出假设）

The dataset

We will be using the Food and Agriculture Organization (FAO) of the United Nation's AQUASTAT dataset.

From FAO:

粮农组织的三个主要目标是:

消除饥饿、粮食不安全和营养不良
消除贫困促进经济社会进步
自然资源的可持续管理和利用，包括土地、水、空气、气候和遗传资源，以造福今世后代。

为支持这些目标，《宪法》第1条要求粮农组织“收集、分析、解释和传播与营养、粮食和农业有关的信息”。因此，水温自动调节器开始，其目的是通过收集有助于联合国粮农组织的目标，与水资源相关的信息传播分析，用水和农业用水管理，对国家重点在非洲，亚洲，美国，拉丁美洲，加勒比海。

联合国粮农组织提供数据，元数据，报告国家概况，河流域概况，分析区域，图，表空间，数据，指导方针，和其他的在线工具:

水资源：内部、跨界、总
水的用途：按部门，按来源，废水
灌溉：地点、面积、类型、技术、作物
水坝：位置，高度，容量，表面积
与水有关的机构、政策和立法

http://www.fao.org/nr/water/aquastat/data/query/index.html

2>导入相关的包

%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

3>导入数据及数据概览

data = pd.read_csv('./data/aquastat/aquastat.csv.gzip', compression='gzip')

查看数据：

data.head()/data.tail()/data.describe()/data.shape/data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143280 entries, 0 to 143279
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   country        143280 non-null  object 
 1   region         143280 non-null  object 
 2   variable       143280 non-null  object 
 3   variable_full  143280 non-null  object 
 4   time_period    143280 non-null  object 
 5   year_measured  96411 non-null   float64
 6   value          96411 non-null   float64
dtypes: float64(2), object(5)
memory usage: 7.7+ MB

4>变量初探索

本数据集的变量不是按列呈现，不同的变量在variable里面

data[['variable','variable_full']].drop_duplicates()
#total_area 国土面积（1000公顷）
#arable_land 可耕作面积
#permanent_crop_area 多年生作物面积
#cultivated_area 耕地面积
#percent_cultivated 耕地面积占比
#total_pop 总人口
#rural_pop 农村人口
#urban_pop 城市人口
#gdp 国内生产总值
#gdp_per_capita 人均国内生产总值
#agg_to_gdp 农业，增加国内生产总值
#human_dev_index 人类发展指数
#gender_inequal_index 性别不平等指数
#percent_undernourished 营养不良患病率
#avg_annual_rain_depth 长期平均年降水量
#national_rainfall_index 全国降雨指数

可以简单看下数据情况

data.country.nunique()
#199
countries = data.country.unique()
data.time_period.nunique()
#12
time_periods = data.time_period.unique()
print(time_periods)
'''
['1958-1962' '1963-1967' '1968-1972' '1973-1977' '1978-1982' '1983-1987'
 '1988-1992' '1993-1997' '1998-2002' '2003-2007' '2008-2012' '2013-2017']
'''
mid_periods = range(1960,2017,5)
#看下确实值
data[data.variable=='total_area'].value.isnull().sum()

数据收集了12个时间段199个国家的不同维度的数据

5>数据切分

从时间维度

#从时间维度切分
def time_slice(df, time_period):

    # Only take data for time period of interest
    df = df[df.time_period==time_period]
    # Pivot table 
    df = df.pivot(index='country', columns='variable', values='value')   
    df.columns.name = time_period   
    return df

time_slice(data, time_periods[0]).head()

查看时间1958-1962年阶段的不同国家的各种维度的数据情况

从地域维度

#从地域维度
def country_slice(df, country):
    
    # Only take data for country of interest
    df = df[df.country==country] 

    # Pivot table 
    df = df.pivot(index='variable', columns='time_period', values='value')
    
    df.index.name = country
    return df
    
country_slice(data, countries[0]).head()

查看第一个国家【Afghanistan】不同时间节点的数据

从变量维度

#从变量维度
def variable_slice(df, variable):
    
    # Only data for that variable
    df = df[df.variable==variable]
    
    # Get variable for each country over the time periods 
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

variable_slice(data, 'total_pop').head()

查看total_pop变量的数据情况

多变量查看-2个

#多变量查看-2
def time_series(df, country, variable):
    
    # Only take data for country/variable combo 
    series = df[(df.country==country) & (df.variable==variable)]
    
    # Drop years with no data 
    series = series.dropna()[['year_measured', 'value']]
    
    # Change years to int and set as index 
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

time_series(data, 'Belarus', 'total_pop')

数据组合

data.region.unique()

案例中区域划分层次太多，建立字典，将区域进行进一步简单数据整合

simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America', 
    'World | Africa':'Africa',
    'World | Europe':'Europe', 
    'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
print(data.region.unique())

['Asia' 'North America' 'South America' 'Africa' 'Europe' 'Oceania']

我们可以提取其中某个区域进行查看

def subregion(data, region):
    return data[data.region==region]

6>数据质量评估

在试图了解数据中哪些信息之前，请确保您理解了数据代表什么和丢失了什么。

Overview

Basic things to do

分类：计数，区分计数，评估唯一值
数值：计数，最小，最大
抽查你熟悉的随机样品
切片和切块

Main questions

那里没有什么数据？
那里的数据对吗？
数据是按照你想象的方式生成的吗？

Helpful packages

missingno
pivottablejs
pandas_profiling

Example backlog

评估缺失数据在所有数据字段中的普遍性，评估其丢失是随机的还是系统的，并在缺少数据时确定模式
标识包含给定字段丢失数据的默认值。
确定质量评估抽样策略和初始EDA
datetime数据类型，保证格式的一致性和粒度的数据，并执行对数据的所有日期的检查.
在多个字段捕获相同或相似信息的情况下，了解它们之间的关系并评估最有效的字段使用。
评估每个字段数据类型
对于离散值类型，确保数据格式一致。
对于离散值类型，评估不同值和唯一百分比的数目，并对答案的类型进行正确检查。
对于连续数据类型，评估描述性统计，并对值进行检查。
了解时间戳和评估使用的分析之间的关系
按设备类型、操作系统、软件版本对数据进行切片，保证跨切片数据的一致性
对于设备或应用程序数据，确定版本发布日期，并评估这些日期前后格式或值的任何更改数据。

利用一个切片数据进行下面的分析

recent = time_slice(data, '2013-2017')
#利用missingno查看数据缺失情况
msno.matrix(recent, labels=True)

关于missingno详细解释可以查看https://blog.csdn.net/Andy_shenzl/article/details/81633356

我们还可以利用数据切分，从不同的维度进行查看

比如从区域维度

def subregion(data, region):
    return data[data.region == region]
north_america = subregion(data, 'North America')
#指数完整性
msno.matrix(msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending').T, inline=False)
#plt.title('Fraction of fields complete by country for North America \n \n');

为了更直观的看数据，可以利用folium来查看数据

# JSON with coordinates for country boundaries 
geo = r'world.json'

null_data = recent['agg_to_gdp'].notnull()*1
map = folium.Map(location=[48, -102], zoom_start=2)
map.choropleth(geo_data=geo, 
               data=null_data,
               columns=['country', 'agg_to_gdp'],
               key_on='feature.properties.name', reset=True,
               fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
               legend_name='Missing agricultural contribution to GDP data 2013-2017')
map

通过seaborn的热力图查看随着时间的每个国家对55个指标变量的重视程度变化趋势

fig, ax = plt.subplots(figsize=(16, 16));
sns.heatmap(data.groupby(['time_period','variable']).value.count().unstack().T , ax=ax);
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Variable');
plt.title('Number of countries with data reported for each variable over time');

可以利用pivottablejs做一个类似excel透视表的查看方式

pivottablejs.pivot_ui(time_slice(data, '2013-2017'),)

7>变量分析

pandas_profiling

PS：pandas版本有要求，最新版本不支持pandas_profiling

pandas_profiling.ProfileReport(time_slice(data, '2013-2017'))

具体可参考：https://blog.csdn.net/Andy_shenzl/article/details/81709409

8>数据峰度和偏度

Location: 均值，中位数，模式，四分位
Spread: 标准差、方差、范围、间距范围
Shape: 偏度、峰度

每个变量都会有Kurtosis和Skew，前面的是峰度，后面一个是偏度。峰度是描述总体中所有取值分布形态陡缓程度的统计量。这个统计量需要与正态分布相比较，峰度为3表示该总体数据分布与正态分布的陡缓程度相同；峰度大于3表示该总体数据分布与正态分布相比较为陡峭，为尖顶峰；峰度小于3表示该总体数据分布与正态分布相比较为平坦，为平顶峰。峰度的绝对值数值越大表示其分布形态的陡缓程度与正态分布的差异程度越大。

通常我们将峰度值减去3，也被称为超值峰度（Excess Kurtosis），这样正态分布的峰度值等于0，当峰度值>0，则表示该数据分布与正态分布相比较为高尖，当峰度值<0，则表示该数据分布与正态分布相比较为矮胖。

峰度的具体计算公式为：

偏度与峰度类似，它也是描述数据分布形态的统计量，其描述的是某总体取值分布的对称性。这个统计量同样需要与正态分布相比较，偏度为0表示其数据分布形态与正态分布的偏斜程度相同；偏度大于0表示其数据分布形态与正态分布相比为正偏或右偏，即有一条长尾巴拖在右边，数据右端有较多的极端值；偏度小于0表示其数据分布形态与正态分布相比为负偏或左偏，即有一条长尾拖在左边，数据左端有较多的极端值。偏度的绝对值数值越大表示其分布形态的偏斜程度越大。

偏度的具体计算公式为：

首先看一下简单数据描述

recent[['total_pop', 'urban_pop', 'rural_pop']].describe().astype(int)

发现一个异常值-98，说明数据有问题，类似的我们可以查看其他的异常数据情况

峰度值和偏度值利用scipy都做了统计发现这两个指标都能得出。我们发现结果的峰值和偏值都比较大，可以做一个对数转换，对数变换是数据变换的一种常用方式，数据变换的目的在于使数据的呈现方式接近我们所希望的前提假设，从而更好的进行统计推断。

recent[['total_pop', 'urban_pop', 'rural_pop']].apply(scipy.stats.skew)

2013-2017
total_pop    8.519379
urban_pop    8.545690
rural_pop    9.490029
dtype: float64

正态分布的偏度应为零。负偏度表示偏左，正偏表示右偏。

recent[['total_pop', 'urban_pop', 'rural_pop']].apply(scipy.stats.kurtosis)

2013-2017
total_pop    76.923725
urban_pop    85.499659
rural_pop    95.838930
dtype: float64

峰度也是一个正态分布和零只能是积极的。我们肯定有一些异常值！

单变量的变化来分析总的人口变量变化

fig, ax = plt.subplots(figsize=(12, 8))
ax.hist(recent.total_pop.values, bins=50);
ax.set_xlabel('Total population');
ax.set_ylabel('Number of countries');
ax.set_title('Distribution of population of countries 2013-2017');

数据明显不是高斯分布，所以我们进行一个对数变换

recent[['total_pop']].apply(np.log).apply(scipy.stats.skew)
#total_pop   -0.899063
recent[['total_pop']].apply(np.log).apply(scipy.stats.kurtosis)
#total_pop    1.086877

可以看到变换后的数据峰度和偏度都比较正常了

可以看到数据峰分布接近高斯分布了

8>数据增长变化可视化

看下每个国家（北美）人口增长情况

def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

plt.subplots(figsize=(12, 8))
with sns.color_palette(sns.diverging_palette(220, 280, s=85, l=25, n=23)):
    north_america = time_slice(subregion(data, 'North America'), '1958-1962').sort_values('total_pop').index.tolist()
    for country in north_america:
        plt.plot(time_series(data, country, 'total_pop'), label=country);
        plt.xlabel('Year');
        plt.ylabel('Population');
        plt.title('North American populations over time');
    plt.legend(loc=2,prop={'size':10});

发现下面的数据全部堆积在一起，看不出变化情况

plt.subplots(figsize=(12, 8))
with sns.color_palette(sns.diverging_palette(220, 280, s=85, l=25, n=23)):
    for country in north_america:
        ts = time_series(data, country, 'total_pop')
        ts['norm_pop'] = ts.total_pop/ts.total_pop.min()*100
        plt.plot(ts['norm_pop'], label=country);
        plt.xlabel('Year');
        plt.ylabel('Percent increase in population');
        plt.title('Percent increase in population from 1960 in North American countries');
    plt.legend(loc=2,prop={'size':10});

我们还可以换成热图的形式

north_america_pop = variable_slice(subregion(data, 'North America'), 'total_pop')
north_america_norm_pop = north_america_pop.div(north_america_pop.min(axis=1), axis=0)*100
north_america_norm_pop = north_america_norm_pop.loc[north_america]


fig, ax = plt.subplots(figsize=(16, 12));
sns.heatmap(north_america_norm_pop, ax=ax, cmap=sns.light_palette((214, 90, 60), input="husl", as_cmap=True));
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Country, ordered by population in 1960 (<- greatest to least ->)');
plt.title('Percent increase in population from 1960');

9>每个变量与目标变量之间的相关性

2013-2017年之间的seasonal_varality与GDP per captial之间的相关性，可以先用散点图来查看。

#seasonal_variability 季节变化（WRI）
# recent.drop('gdp_bin', axis=1).astype(float).plot(x='seasonal_variability',y='gdp_per_capita', kind='scatter');
plt.scatter(recent.seasonal_variability, recent.gdp_per_capita)
plt.xlabel('Seasonal variability');
plt.ylabel('GDP per capita ($USD/person)');

还可以用JointGrid画图，自动探索两个连续变量之间的相关性

sns.jointplot(x='total_bill', y='tip',  # 设置xy轴，显示columns名称
              data=tips,   # 设置数据
              color = 'k',   # 设置颜色
              s = 50, edgecolor="w",linewidth=1,
              # 设置散点大小、边缘线颜色及宽度(只针对scatter）
              kind = 'scatter',
              space = 0.2,  # 设置散点图和布局图的间距
              height= 7, ratio = 5,  # 散点图与布局图高度比，整型
              marginal_kws=dict(bins=20, rug=True)  # 设置柱状图箱数，是否设置rug
              )
plt.show()

svr = [recent.seasonal_variability.min(), recent.seasonal_variability.max()]
#[0.3, 4.6]
gdpr = [(recent.gdp_per_capita.min()), recent.gdp_per_capita.max()] 
#[276.0, 101911.0]
gdpbins = np.logspace(*np.log10(gdpr), 25)

g =sns.JointGrid(x="seasonal_variability", y="gdp_per_capita", data=recent, ylim=gdpr)
g.ax_marg_x.hist(recent.seasonal_variability, range=svr)
g.ax_marg_y.hist(recent.gdp_per_capita, range=gdpr, bins=gdpbins, orientation="horizontal")
g.plot_joint(plt.hexbin, gridsize=25)
ax = g.ax_joint
# ax.set_yscale('log')
g.fig.set_figheight(8)
g.fig.set_figwidth(9)

相关度量两个变量之间的*线性关系的强度。我们可以使用相关性来识别变量。

recent_corr = recent.corr().loc['gdp_per_capita'].drop(['gdp','gdp_per_capita'])
def conditional_bar(series, bar_colors=None, color_labels=None, figsize=(13,24),
                   xlabel=None, by=None, ylabel=None, title=None):
    fig, ax  = plt.subplots(figsize=figsize)
    if not bar_colors:
        bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
    plt.barh(range(len(series)),series.values, color=bar_colors)
    plt.xlabel('' if not xlabel else xlabel);
    plt.ylabel('' if not ylabel else ylabel)
    plt.yticks(range(len(series)), series.index.tolist())
    plt.title('' if not title else title);
    plt.ylim([-1,len(series)]);
    if color_labels:
        for col, lab in color_labels.items():
            plt.plot([], linestyle='',marker='s',c=col, label= lab);
        lines, labels = ax.get_legend_handles_labels();
        ax.legend(lines[-len(color_labels.keys()):], labels[-len(color_labels.keys()):], loc='upper right');
    plt.close()
    return fig

bar_colors = ['#0055A7' if x else '#2C3E4F' for x in list(recent_corr.values < 0)]
color_labels = {'#0055A7':'Negative correlation', '#2C3E4F':'Positive correlation'}

conditional_bar(recent_corr.apply(np.abs), bar_colors, color_labels,
               title='Magnitude of correlation with GDP per capita, 2013-2017',
               xlabel='|Correlation|')

10>异常值检测

使用boxplot来看数据分布

fig,axes=plt.subplots(1,2,sharey=False,figsize=(12,6))
sns.boxplot(y=recent['total_pop'],data=recent,palette="Set2",ax=axes[0])
sns.boxplot(y=recent['agg_to_gdp'],data=recent,palette="Set2",ax=axes[1])