The Best Exploratory Data Analysis with Pandas Profiling

张_伟_杰

Original article: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583

Table of Contents

Introduction
Overview
Variables
Interactions
Correlations
Missing Values
Sample
Summary

Introduction

There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to use it for every project, and its ease of use has paid off repeatedly. EDA should come before any machine learning model is trained, and the developers of Pandas Profiling [2] have made it easy to view your dataset in a clean, readable format that also describes its contents well.

The Pandas Profiling report is an excellent EDA tool that offers the following: an overview, variables, interactions, correlations, missing values, and a sample of your data. I will use randomly generated data as an example of this useful tool.

Overview

Overview example. Screenshot by Author [3].

The overview tab in the report gives a quick glance at how many variables and observations you have, that is, the number of columns and rows. It also calculates how many cells are missing and what percentage of the whole dataframe they represent. Additionally, it points out duplicate rows and calculates their percentage. This tab is most similar to part of the describe function from Pandas, while providing a better user-interface (UI) experience.
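To make those overview figures concrete, here is a minimal sketch that computes similar numbers by hand with plain Pandas. It is not the library's own code; the dictionary keys and the fixed random seed are my assumptions, and the report's exact definitions may differ.

```python
import numpy as np
import pandas as pd

# Dummy data shaped like the article's example; the fixed seed is an assumption
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

n_rows, n_cols = df.shape
missing_cells = int(df.isna().sum().sum())   # NaN cells across the whole frame
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

overview = {
    "variables": n_cols,
    "observations": n_rows,
    "missing_cells": missing_cells,
    "missing_cells_pct": 100 * missing_cells / df.size,
    "duplicate_rows": duplicate_rows,
    "duplicate_rows_pct": 100 * duplicate_rows / n_rows,
}
for name, value in overview.items():
    print(f"{name}: {value}")
```

Doing this by hand for every project is exactly the repetitive work the report saves you.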

The overview is broken into dataset statistics and variable types. You can also refer to warnings and reproduction for more specific information on your data.

Next, I will discuss variables, which are also referred to as the columns or features of your dataframe.

Variables

For more granular descriptive statistics, the variables tab is the way to go. For each dataframe feature or variable, you can look at the distinct and missing counts as well as aggregations such as the mean, minimum, and maximum. You can also see the type of data you are working with (e.g., NUM). Not pictured is what appears when you click on ‘Toggle details’: the toggle reveals a whole set of additional statistics. The details include:

Statistics (quantile and descriptive)

Quantile statistics:

Minimum
5th percentile
Q1
Median
Q3
95th percentile
Maximum
Range
Interquartile range (IQR)

Descriptive statistics:

Standard deviation
Coefficient of variation (CV)
Kurtosis
Mean
Median Absolute Deviation (MAD)
Skewness
Sum
Variance
Monotonicity

These statistics overlap with the output of the describe function I see most Data Scientists using today; however, the report adds a few more and presents them in an easy-to-view format.
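For comparison, the statistics listed above can be approximated with plain Pandas. This is a rough sketch under my own assumptions: the series values are made up, the dictionary keys are mine, and the report may use slightly different conventions (for example for kurtosis or MAD) than the Pandas defaults shown here.

```python
import pandas as pd

# Hand-made values for illustration only (not the article's data)
s = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 15, 20], name="A")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
quantile_stats = {
    "minimum": s.min(),
    "5th_percentile": s.quantile(0.05),
    "q1": q1,
    "median": s.median(),
    "q3": q3,
    "95th_percentile": s.quantile(0.95),
    "maximum": s.max(),
    "range": s.max() - s.min(),
    "iqr": q3 - q1,
}
descriptive_stats = {
    "std": s.std(),                          # sample standard deviation (ddof=1)
    "cv": s.std() / s.mean(),                # coefficient of variation
    "kurtosis": s.kurt(),                    # excess kurtosis
    "mean": s.mean(),
    "mad": (s - s.median()).abs().median(),  # median absolute deviation
    "skewness": s.skew(),
    "sum": s.sum(),
    "variance": s.var(),
    "monotonic_increasing": s.is_monotonic_increasing,
}
print(pd.Series(quantile_stats))
print(pd.Series(descriptive_stats))
```

Calling s.describe() would return only the count, mean, standard deviation, minimum, quartiles, and maximum, which is why the report feels like a superset of it.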

Histograms

The histograms provide an easily digestible visual of your variables. You can expect to see the frequency of your variable on the y-axis and fixed-size bins (bins=15 is the default) on the x-axis.
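If you want to see what fixed-size binning does to your values, the sketch below bins a column with NumPy and prints a text histogram. The data, the seed, and the text rendering are my assumptions; the report draws a proper chart instead.

```python
import numpy as np

# Illustrative values only; the fixed seed is an assumption
rng = np.random.default_rng(0)
values = rng.integers(0, 200, size=100)

# 15 equal-width bins, the figure the article cites as the default
counts, edges = np.histogram(values, bins=15)

# One row per bin; the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.2f} to {right:7.2f}  {'#' * int(count)}")
```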

Common Values

The common values section lists the values that occur most often in your variable, along with the count and frequency of each.
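A table like this can be approximated with value_counts. This is a minimal sketch with made-up values; the column names count and frequency_pct are my own labels, not the report's.

```python
import pandas as pd

# Hand-made values for illustration only
s = pd.Series([3, 3, 3, 7, 7, 9, 1, 3, 7, 5], name="A")

counts = s.value_counts()  # sorted from most to least frequent
common_values = pd.DataFrame({
    "count": counts,
    "frequency_pct": 100 * counts / len(s),
})
print(common_values.head(10))
```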

Extreme Values

The extreme values section lists the smallest and largest values of your variable, along with the count and frequency of each.
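The same idea works for the extremes if you order the counts by value instead of by frequency. Again, this is only a sketch: the values are made up, and showing two values at each end is an arbitrary choice of mine.

```python
import pandas as pd

# Hand-made values for illustration only
s = pd.Series([3, 3, 3, 7, 7, 9, 1, 3, 7, 5], name="A")

# Count each distinct value, then order by the value itself
by_value = s.value_counts().sort_index()
extremes = pd.DataFrame({
    "count": by_value,
    "frequency_pct": 100 * by_value / len(s),
})

n = 2  # how many values to show at each end (an arbitrary choice here)
print(extremes.head(n))  # smallest values
print(extremes.tail(n))  # largest values
```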

Interactions

The interactions feature of the profiling report is unique in that you can choose which of your columns goes on the x-axis and which goes on the y-axis. For example, pictured above is variable A against variable A, which is why you see overlapping. You can easily switch to other variables or columns to get a different plot and an excellent representation of your data points.

Missing Values

As you can see from the plot above, the report also covers missing values. You can see how much of each variable is missing, shown as a count and as a matrix. It is a nice way to visualize your data before you build any models with it. Ideally, you want to see a plot like the one above, meaning you have no missing values.
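My random data has no gaps, so here is a small sketch with a hand-made dataframe that does. It shows the numbers behind the count and matrix views using plain Pandas; the frame and variable names are my assumptions, and the report renders these as charts.

```python
import numpy as np
import pandas as pd

# Small hand-made frame with gaps, for illustration only
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, 2.0, 5.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

missing_count = df.isna().sum()       # missing cells per column
missing_pct = 100 * df.isna().mean()  # share of each column that is missing
missing_matrix = df.isna()            # True where a cell is missing

print(missing_count)
print(missing_pct)
print(missing_matrix)
```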

Sample

Sample acts similarly to the head and tail functions, returning your dataframe’s first few rows or last few rows. In this example, you can see both the first rows and the last rows. I use this tab when I want a sense of where my data starts and where it ends. I recommend sorting or ranking the data first to get more benefit out of this tab, since you can then see the range of your data at a glance.
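Here is a quick sketch of that sorting tip in plain Pandas. Sorting by column A and showing 10 rows at each end are my assumptions for the example, as is the fixed seed.

```python
import numpy as np
import pandas as pd

# Dummy data shaped like the article's example; the fixed seed is an assumption
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

ordered = df.sort_values("A")  # sorting by column A is an arbitrary choice
first_rows = ordered.head(10)  # starts at the smallest value of A
last_rows = ordered.tail(10)   # ends at the largest value of A

print(first_rows)
print(last_rows)
```

If you profile the sorted dataframe rather than the raw one, the first and last rows of the sample show the low and high ends of that column.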

Summary

I hope this article provided you with some inspiration for your next exploratory data analysis. Being a Data Scientist can be overwhelming, and EDA is often forgotten or not practiced as much as model-building. With the Pandas Profiling report, you can perform EDA with minimal code and still get useful statistics and visualizations. That way, you can focus on the fun part of Data Science and Machine Learning: the modeling process.

To summarize, the main features of the Pandas Profiling report include the overview, variables, interactions, correlations, missing values, and a sample of your data.

Here is the code I used to install and import libraries, as well as to generate some dummy data for the example, and finally, the one line of code used to generate the Pandas Profile report based on your Pandas dataframe [10].

# install the library (run once)
# !pip install pandas_profiling
import numpy as np
import pandas as pd
import pandas_profiling  # importing this adds profile_report() to DataFrames

# create data: 15 rows of random integers in [0, 200), columns A-F
df = pd.DataFrame(np.random.randint(0, 200, size=(15, 6)), columns=list('ABCDEF'))

# run your report!
df.profile_report()
# I did get an error and had to reinstall matplotlib to fix

Please feel free to comment down below if you have any questions or have used this feature before. There is still some information I did not describe, but you can find more of it at the link I provided above.

Thank you for reading. I hope you enjoyed it!

Translated from: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583
