This is part 1 in a series of articles guiding the reader through an entire data science project.
I am a new writer on Medium and would truly appreciate constructive criticism in the comments below.
Overview
What is EDA anyway?
EDA, or Exploratory Data Analysis, is the process of understanding what data we have in our dataset before we start finding solutions to our problem. In other words, it is the act of analyzing the data without biased assumptions in order to preprocess the dataset effectively for modeling.
Why do we do EDA?
The main reasons we do EDA are to verify the data in the dataset, to check that the data makes sense in the context of the problem, and sometimes just to learn about the problem we are exploring.
What are the steps in EDA and how should I do each one?
- Descriptive Statistics: get a high-level understanding of your dataset
- Missing Values: come to terms with how bad your dataset is
- Distributions and Outliers: and why countries that insist on using different units make our jobs so much harder
- Correlations: and why sometimes even the most obvious patterns still require some investigating
A note on Pandas Profiling
Pandas Profiling is probably the easiest way to do EDA quickly (although there are many other alternatives, such as SweetViz). The downside of using Pandas Profiling is that it can be slow, because it produces a very in-depth analysis even when one is not needed.
I will describe below how I used Pandas Profiling to analyze the Diabetes Readmission Dataset on Kaggle (https://www.kaggle.com/friedrichschneider/diabetic-dataset-for-readmission/data).
To see the Pandas Profiling report, simply run the following:
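A minimal sketch, assuming the Kaggle CSV has been downloaded locally (the filename below is hypothetical) and that the pandas-profiling package is installed:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("diabetic_data.csv")  # hypothetical filename for the Kaggle CSV
profile = ProfileReport(df, title="Diabetes Readmission EDA")
profile.to_file("eda_report.html")     # or profile.to_notebook_iframe() inside Jupyter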
Descriptive Statistics
For this stage I like to look at just a few key points:
- I look at the count to see if I have a significant amount of missing values for each specific feature. If a certain feature has many missing values, I might want to discard it.
- I look at the unique values (for categorical features, pandas describe will show NaN here, but in Pandas Profiling we can see the distinct count). If a feature has only 1 unique value it will not help my model, so I discard it.
- I look at the ranges of the values. If the max or min of a feature is significantly different from the mean and from the 75% / 25% quartiles, I might want to look into this further to understand if these values make sense in their context. A quick way to run all three checks with plain pandas is sketched below.
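A minimal sketch of these checks, assuming `df` is the DataFrame loaded earlier:

print(df.describe(include="all"))  # count, unique, mean, min/max, and quartiles in one table
print(df.nunique())                # features with a single unique value can be dropped
print(df.isna().sum())             # explicit missing-value count per feature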
Missing Values
Almost every real-world dataset has missing values. There are many ways to deal with missing values; usually, the technique we use depends on the dataset and the context. Sometimes we can make educated guesses and/or impute the values. Instead of going through each method (there are many great Medium articles describing the different methods in depth; see this great article by Jun Wu), I will discuss how a value can actually be missing even though we are given one in the data, and one particular method that allows us to ignore these hidden missing values for the time being.
The diabetes dataset is a great example of missing values hidden within the data. If we look at the descriptive statistics we see zero missing values, but simply observing one of the features ("payer_code" in the figure above) shows that almost half of the samples have the category "?". These are hidden missing values.
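A one-liner sketch for surfacing these hidden missing values, assuming `df` is the DataFrame loaded earlier; it counts the "?" placeholder in every column:

print((df == "?").sum().sort_values(ascending=False))  # columns with the most "?" values first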
What should we do when half the samples have missing values? There is no one right answer (see Jun Wu's article). Many would say to simply exclude a feature with that many missing values from your model, as there is no way to impute them accurately.
But there is one method many data scientists miss out on. If you are using a decision-tree-based model (such as a GBM), then the tree can take a missing value as an input. Since all features will be turned into numeric values, we can just encode "?" as an extreme value far outside the range used in the dataset (such as 999,999); this way, all samples with missing values will split to one side of the tree at a node. If we find after modeling that this value is very important, we can come back to the EDA stage and try to understand (probably with the help of a domain expert) whether there is valuable information in the missing values of this specific feature. Some packages don't even require you to encode missing values; LightGBM, for example, performs this split automatically.
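A minimal sketch of this encoding, assuming `df` is the DataFrame loaded earlier and using "payer_code" as the example feature:

import numpy as np

# make the hidden "?" placeholders explicit NaNs
df["payer_code"] = df["payer_code"].replace("?", np.nan)

# Option 1: label-encode, then push the missing values far outside the data's
# range so a tree split isolates them on one side of a node (NaN codes as -1)
codes = df["payer_code"].astype("category").cat.codes
df["payer_code_enc"] = codes.replace(-1, 999_999)

# Option 2: keep the NaNs as-is; LightGBM handles missing values natively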
Duplicate Rows
Duplicate rows sometimes appear in datasets. They are very easy to remove (this is one solution, using the pandas built-in method):
df.drop_duplicates(inplace=True)  # keep only the first occurrence of each duplicated row
There is another type of duplicate row that you need to be wary of. Say you have a dataset on patients. You might have many rows for each patient, each representing, for example, a medication being taken. These are not duplicates. We will explore how to deal with this kind of duplicate row later in the series, when we explore 'Feature Engineering'.
Distributions and Outliers
The main reason to analyze the distributions and outliers in the dataset is to validate that the data is correct and makes sense. Another good reason to do this is to simplify the dataset.
Validating the Dataset
Let's say we plot a histogram of patients' heights and we observe the following:
Clearly there is some kind of problem with the data. Here we can guess (given the context) that 10% of the data was measured in feet and the rest in centimeters. We can then convert the rows where the height is less than 10 from feet to centimeters. Pretty simple.
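A minimal sketch of the conversion, assuming a numeric `height` column in which values under 10 were recorded in feet:

feet = df["height"] < 10
df.loc[feet, "height"] = df.loc[feet, "height"] * 30.48  # 1 foot = 30.48 cm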
What do we do in a more complicated example, such as the one below? Here, if we briefly look at the dataset and don't check each and every feature, we will miss that some patients' heights are recorded as high as 6 meters, which doesn't make sense (see Tallest People in the World). To fix this unit error, we must make some decisions about the cutoff: which heights were measured in feet and which in meters. Another option is to check whether there is a correlation between height and country, for example; we might find that all the feet measurements come from the US.
Outliers
Another important thing is to check for outliers. We can graph the different features either as box plots or as a function of another feature (typically the target variable, but not necessarily). There are many statistics for detecting outliers in the data, but often in EDA we can identify them very easily. In the example below, we can immediately identify the outliers (random data).
It is important to check outliers to understand whether they are errors in the dataset. This is a whole separate topic (see Natasha Sharma's excellent article on it), but a very important one for understanding whether there are errors in the dataset and deciding whether or not to keep the outliers.
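Plots like these are one-liners in pandas; a minimal sketch, assuming `df` is the DataFrame loaded earlier and that the two numeric columns exist in it:

import matplotlib.pyplot as plt

df.boxplot(column="num_lab_procedures")                        # points beyond the whiskers are outlier candidates
df.plot.scatter(x="num_lab_procedures", y="time_in_hospital")  # one feature as a function of another
plt.show()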
Simplifying the Dataset
Another really important reason to do EDA is that we might want to simplify our dataset, or even just identify where it can be simplified.
Perhaps we can group certain features in our dataset? Take the target variable "Readmission" in the diabetes patient dataset. If we plot the different variables, we find that readmission in under 30 days and readmission in over 30 days generally follow the same distribution across different features. If we merge them, we can balance our dataset and get better predictions.
If we check the distribution against different features, we find that this still holds; take, for example, the split across genders:
We can check this across other features as well, but here the conclusion seems to be that the dataset is very balanced, and we can probably combine 'readmitted' in over or under 30 days into a single class (sketched below).
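A minimal sketch of the merge, assuming the Kaggle target column `readmitted` with the values "<30", ">30", and "NO":

df["readmitted_binary"] = df["readmitted"].replace({"<30": "YES", ">30": "YES"})
print(df["readmitted_binary"].value_counts(normalize=True))  # check the new class balance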
Learn the Dataset
Another very important reason to visualize the distributions in your dataset is to learn what you actually have. Take the following population pyramid of 'Patient Numbers' by age and gender:
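A minimal sketch of how such a pyramid can be drawn, assuming `age` is already bucketed (as it is in the Kaggle data) and `gender` takes the values "Male" and "Female":

import matplotlib.pyplot as plt

counts = df.groupby(["age", "gender"]).size().unstack(fill_value=0)
fig, ax = plt.subplots()
ax.barh(counts.index, -counts["Male"], label="Male")    # negate one side so the bars mirror
ax.barh(counts.index, counts["Female"], label="Female")
ax.set_xlabel("Patient Numbers")
ax.legend()
plt.show()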
Understanding the distribution of age and gender in our dataset is essential to make sure we reduce the bias between groups as much as possible. Studies have found that many models are extremely biased because they were trained on only one gender or race (often men or white people, for example), so this is an extremely important step in the EDA.
Correlations
Often a lot of the emphasis in EDA is on correlations, and correlations are often really interesting, but not wholly useful on their own (see this article on interpreting basic correlations). A significant area of academic research is how to distinguish causation from correlation (for a brief intro, see this Khan Academy lesson), though often domain experts can verify that a correlation is indeed causation.
There are many ways to plot correlations, and different correlation methods to use. I will focus on three: Phi K, Cramer's V, and one-way analysis.
Phi K Correlation
Phi_K is a new correlation coefficient based on improvements to Pearson's test of independence of two variables (see the documentation for more info). Below is the Phi K correlation matrix from Pandas Profiling (one of several available correlation matrices):
We can very easily identify a correlation between 'Readmitted' (our target variable, the last row/column) and several other features, such as 'Admission Type', 'Discharge Disposition', 'Admission Source', 'Payer Code', and 'Number of Lab Procedures'. This should light a lightbulb for us, and we must dig deeper into each one to understand whether it makes sense in the context of the problem (probably: if you undergo more procedures, you likely have a more serious condition and so are more likely to be readmitted) and to help confirm conclusions that our model might reach later in the project.
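The same matrix can be computed outside of Pandas Profiling with the phik package; a minimal sketch, assuming it is installed and `df` is the DataFrame loaded earlier:

import phik  # noqa: F401 -- importing phik registers the .phik_matrix() accessor

matrix = df.phik_matrix()  # pairwise Phi K correlations, each between 0 and 1
print(matrix["readmitted"].sort_values(ascending=False))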
Cramer's V Correlation
Cramer's V is a great statistic for measuring the correlation between two variables. Usually, we are most interested in the correlation between a feature and the target variable; a sketch for computing it follows.
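Cramer's V is not built into pandas, but it is straightforward to compute from a contingency table; a minimal sketch using scipy, with the two diabetes columns as the example:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    confusion = pd.crosstab(x, y)          # contingency table of the two variables
    chi2 = chi2_contingency(confusion)[0]  # Pearson chi-squared statistic
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

print(cramers_v(df["discharge_disposition_id"], df["readmitted"]))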
Sometimes we can discover other interesting, and occasionally surprising, information from correlation diagrams. Take, for example, one interesting fact discovered by Lilach Goldshtein in the diabetes dataset. Let us look at the Cramer's V between 'discharge_disposition_id' (a categorical feature indicating the reason a patient was discharged) and 'readmitted' (our target variable: whether or not the patient was readmitted).
We note here that Discharge IDs 11, 19, 20, and 21 have no readmitted patients. STRANGE!
Let’s check what these IDs are:
These people were never readmitted because, sadly, they passed away.
This is a very obvious observation, and we probably didn't need to dig into data correlations to identify this fact, but such observations are often completely missed. Now a decision needs to be made regarding what to do with these samples: do we include them in the model or not? Probably not, but again, that is up to the data scientist. What is important at the EDA stage is that we find these occurrences.
One-Way Analysis
Honestly, if you do just one thing outlined in this entire article, do this. One-way analysis can pick up, in a single graph, many of the different observations I've touched on in this article.
The above graphs show the percentage of the dataset represented by each range and the median of the target variable in that range (this is not the diabetes dataset, but rather a dataset used for regression). On the left, we can see that most samples fall in the range 73.5–90.5 and that there is no linear correlation between the feature and the target. On the right, we can see that the feature is directly correlated with the target and that each group contains a good spread of samples.
The groups were chosen using a single Decision Tree to split optimally.
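A minimal sketch of this kind of one-way analysis, assuming a numeric feature `x` and a numeric target `y` (hypothetical column names, since the example above comes from a regression dataset):

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_leaf_nodes=6, min_samples_leaf=0.05)
tree.fit(df[["x"]], df["y"])
df["group"] = tree.apply(df[["x"]])  # leaf index = the optimal bin for each sample

one_way = df.groupby("group").agg(
    share=("y", lambda s: len(s) / len(df)),  # percentage of the dataset in each bin
    target_median=("y", "median"),            # median of the target within each bin
)
print(one_way)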
This is a great way to analyze the dataset. We can see the distribution of the samples for a specific feature, we can spot outliers if there are any (none in these examples), and we can identify missing values (either we encode them first as extreme numerical values, as described before, or, for a categorical feature, we will see the label "NaN", or "?" in the diabetes case).
Conclusion
As you have probably noticed by now, there is no one-size-fits-all approach to EDA. In this article, I decided not to dive too deep into how to do each part of the analysis (most of it can be done with simple pandas or Pandas Profiling methods), but rather to explain what we can learn from each step and why each step is important.
In real-world datasets there are almost always missing values, errors in the data, unbalanced data, and biased data. EDA is the first step in tackling a data science project: we learn what data we have and evaluate its validity.
I would like to thank Lilach Goldshtein for her excellent talk on EDA, which inspired this Medium article.
Stay tuned for the next steps in a classic data science project
Part 1 Exploratory Data Analysis (EDA) — Don't ask how, ask what.
Part 2 Preparing your Dataset for Modelling — Quickly and Easily
Part 3 Feature Engineering — 10X your model's abilities
Part 4 What is a GBM and how do I tune it?
Part 5 GBM Explainability — What can I actually use SHAP for?
(Hopefully) Part 6 How to actually get a dataset and a sample project