Tools for Quick Exploratory Data Analysis (EDA)

Latest recommended article published on 2024-10-10 07:33:39

anonymox

Category column: Python analysis and mining libraries

Article link: https://blog.csdn.net/cathycheny/article/details/109553215

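This post introduces three tools:

pandas_profiling: suited to quickly generating an analysis of each individual variable
sweetviz: suited to analysis between datasets and against a target variable
pandasGUI: suited to in-depth analysis with manual drag-and-drop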

pandas_profiling

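pandas_profiling generates a detailed data analysis report with a single line of code. It integrates tightly with pandas, which makes it a good fit for the early data exploration stage and for producing result reports in batches.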

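import pandas as pd
import pandas_profiling as pp

# Load the data ('xxx.csv' is a placeholder path)
data = pd.read_csv('xxx.csv')
report = pp.ProfileReport(data, title='My Data Report', explorative=True)
report  # in a Jupyter notebook, this renders the report inline

# Write an HTML file (an absolute or a relative path both work)
report.to_file('report.html')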

sweetviz

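sweetviz outputs a fully self-contained HTML application.

Installation

pip install sweetviz

Generating a data report

sweetviz can show the distribution and summary statistics of each variable. It also lets you set a target variable and analyzes how each variable relates to that target.

sweetviz offers three ways to generate an HTML report: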

analyze(…)
compare(…)
compare_intra(…)

analyze()

Use analyze() to generate a report directly from a DataFrame.

analyze() parameters

analyze(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
            target_feat: str = None,
            feat_cfg: FeatureConfig = None,
            pairwise_analysis: str = 'auto'):

source: Either the data frame (as in the example) or a tuple containing the data frame and a name to show in the report, e.g. my_df or [my_df, "Training"].
target_feat: A string naming the feature to be marked as the "target". Only BOOLEAN and NUMERICAL features can be targets for now.
feat_cfg: A FeatureConfig object listing features to be skipped or forced to a certain type in the analysis. Each argument can be a single string or a list of strings. The parameters are skip, force_cat, force_num and force_text. The "force_" arguments override the built-in type detection. A FeatureConfig can be constructed as follows:
feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
pairwise_analysis: Correlations and other associations take quadratic time (n^2 in the number of features) to complete. The default setting ("auto") runs without warning until a dataset contains "association_auto_threshold" features. Past that threshold, you must explicitly pass pairwise_analysis="on" (or "off"), since processing that many features would take a long time. This parameter also covers the generation of the association graphs (based on Drazen Zaric's concept).
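To get a feel for why pairwise analysis becomes slow as features are added, the sketch below simply counts the unordered feature pairs that such an analysis has to visit. It does not call sweetviz; the helper name count_feature_pairs and the column names are made up for illustration.

```python
from itertools import combinations


def count_feature_pairs(columns):
    """Count the unordered pairs of features a pairwise analysis must visit."""
    return sum(1 for _ in combinations(columns, 2))


# Hypothetical column names, used only to compare the amount of work.
columns_small = [f"f{i}" for i in range(14)]
columns_large = [f"f{i}" for i in range(140)]

pairs_small = count_feature_pairs(columns_small)
pairs_large = count_feature_pairs(columns_large)
print(pairs_small, pairs_large)
```

Ten times as many features means roughly a hundred times as many pairs, which is why sweetviz asks for an explicit pairwise_analysis setting on wide datasets.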

import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')  # the titanic dataset bundled with seaborn
titanic.head()


import sweetviz as sv
my_report = sv.analyze(titanic, target_feat='survived')  # optionally choose a target feature
my_report.show_html('report.html')  # the name of the generated HTML report can be customized

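Reading the generated report

The top of the report shows a basic description of the dataset, for example:

The titanic dataset has 891 rows, 107 of which are duplicates, and takes up roughly 419 KB of memory.
The dataset has 14 features, of which 12 are categorical and 2 are numerical.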
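The same kind of summary can be approximated with plain pandas. The sketch below uses a tiny made-up DataFrame rather than the real titanic data, and it is not how sweetviz computes its numbers internally; it only shows which pandas calls give comparable figures.

```python
import pandas as pd

# A small made-up sample with titanic-like columns (not the real dataset).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0],
    "pclass": [3, 1, 1, 3, 3],
    "sex": ["male", "female", "female", "male", "male"],
})

n_rows = len(df)
# duplicated() flags every repeat of an earlier row; the first copy is not counted.
n_duplicates = int(df.duplicated().sum())
# deep=True also measures the Python strings stored in object columns.
memory_bytes = int(df.memory_usage(deep=True).sum())

print(n_rows, n_duplicates, memory_bytes)
```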

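About individual features:

Because survived was specified as the y variable, the report first shows frequency counts for survived. In the distribution of y, positive samples make up roughly 40% and negative samples roughly 60%, so the classes are not heavily imbalanced.
For an individual feature (pclass here), the bar chart shows the frequency distribution of the feature, and the line chart shows the share of survived=1 within each value, which is equivalent to: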
titanic.groupby(['pclass'])['survived'].sum()/titanic.groupby(['pclass']).size()
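Because survived only takes the values 0 and 1, the ratio above is simply the group mean. A minimal sketch on a made-up sample (not the real titanic data) shows that the two forms agree:

```python
import pandas as pd

# Made-up sample: 2 rows in class 1, 2 in class 2, 4 in class 3.
df = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Form used in the text: survivors per class divided by rows per class.
rate_by_ratio = df.groupby("pclass")["survived"].sum() / df.groupby("pclass").size()
# Shorter equivalent for a 0/1 column: the mean within each class.
rate_by_mean = df.groupby("pclass")["survived"].mean()

print(rate_by_mean)
```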

compare(): comparing two datasets

Use compare() to generate a comparison report. It takes two different datasets to compare (e.g. Test vs Training sets).

my_report = sv.compare([my_dataframe, "Training Data"], [test_df, "Test Data"], "Survived", feature_config)

# Example (df_train and df_test are assumed to be two DataFrames with the same columns)
my_report = sv.compare([df_train, "Training Data"], [df_test, "Test Data"], "survived")
my_report.show_html('report_compare.html')
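The example above does not show where df_train and df_test come from. One possible way to obtain them, assuming a single DataFrame with a unique index, is a random split with pandas. The helper split_train_test and the sample data are hypothetical and are not part of sweetviz.

```python
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=0):
    """Randomly split df into train and test parts (the index must be unique)."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)  # every row that was not sampled into train
    return train, test


# Made-up data with a titanic-like target column.
df = pd.DataFrame({"survived": [0, 1] * 5, "age": range(20, 30)})
df_train, df_test = split_train_test(df)
print(len(df_train), len(df_test))

# The two parts can then be passed to sweetviz as in the example above:
# import sweetviz as sv
# sv.compare([df_train, "Training Data"], [df_test, "Test Data"], "survived")
```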

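compare_intra(): comparing two subsets of the same dataset

With compare_intra() you pass in a single dataset, specify a condition that splits it into two subsets, and get an EDA report comparing those subsets.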

# The example below compares two groups of rows within the same dataset
my_report = sv.compare_intra(titanic, titanic['sex'] == 'male', ['male', 'female'], target_feat='survived')
my_report.show_html('report_compare_dim.html')
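The second argument is a boolean Series: rows where it is True form the first subset and the remaining rows form the second. The sketch below reproduces that split with plain pandas on made-up data; it only illustrates the idea and is not sweetviz's internal code.

```python
import pandas as pd

# Made-up sample with titanic-like columns.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "survived": [0, 1, 0, 1, 1],
})

condition = df["sex"] == "male"   # boolean Series, one value per row
subset_true = df[condition]       # rows where the condition holds ("male")
subset_false = df[~condition]     # all other rows ("female")

print(len(subset_true), len(subset_false))
```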

The generated report (screenshot not reproduced here) is not walked through again.

pandasGUI

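pandasGUI does not generate a report. Instead it opens the DataFrame in a GUI (graphical user interface) that you can use to analyze it in more detail.

The GUI supports many operations, such as filtering (Filters), summary statistics (Statistics), plotting variables against each other (Grapher), and reshaping the data (Reshaper). These are carried out by dragging items in the corresponding tabs as needed.

(Reshaping is done by creating a new pivot table or by melting the dataset.)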
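For readers who want the same reshaping in code, the sketch below shows the pandas counterparts, assuming the Reshaper's pivot and melt operations correspond to pandas pivot_table and melt. The data is made up for illustration.

```python
import pandas as pd

# Made-up sample with titanic-like columns.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "sex": ["male", "female", "male", "female"],
    "survived": [0, 1, 0, 1],
    "fare": [80.0, 90.0, 20.0, 25.0],
})

# Pivot: one row per pclass, one column per sex, mean fare in each cell.
pivoted = df.pivot_table(index="pclass", columns="sex", values="fare", aggfunc="mean")

# Melt: turn the survived and fare columns into (variable, value) rows.
melted = df.melt(
    id_vars=["pclass", "sex"],
    value_vars=["survived", "fare"],
    var_name="variable",
    value_name="value",
)

print(pivoted)
print(melted)
```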

Pros:
Drag-and-drop interaction, data filtering, quick plotting
Cons:
No complete statistics, and no report generation

Installation

pip install pandasgui

Code

from pandasgui import show
gui = show(titanic)  # open the GUI on the titanic DataFrame
