EDA: Exploratory Data Analysis to Get More Insight into the Data

Table of contents

  1. Introduction
  2. Libraries
  3. Importance of visualization
  4. Coding part
  5. Conclusion

Introduction

When I started my journey in machine learning, I came across lots of tutorial books, videos, and courses that taught me different types of machine learning algorithms and how they work. After completing those tutorials, I thought I knew machine learning, so I tried some simple projects. The results were not what I expected: the models did not reach the accuracy I was hoping for, and I did not know what went wrong. I searched the internet for ways to improve model accuracy and found plenty, but the most important step I had missed was performing EDA on my data.

To me, performing EDA on data is like watching reviews and unboxing videos before buying a laptop, phone, or tablet, or watching car reviews before buying a car: we want to know more about the product before we commit to it. That should give you some idea of why we perform EDA: to gain insight into the data before using it to train a model.

The dataset we are going to use here is the Wine Quality dataset. It contains 12 columns: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. The first 11 are input features, and quality, which represents the quality of the wine, is the target we need to predict.

Libraries

Python has a large number of libraries that make these tasks easier. To perform EDA on the data we are going to use the following libraries:

Pandas is built on top of NumPy and is used to load, handle, and manipulate the dataset.

NumPy is used to perform mathematical operations on data and to handle multi-dimensional arrays and matrices.

Matplotlib is a plotting library that allows us to visualize our data so that we can get more insight into it.

Seaborn is also a plotting library. It is built on top of Matplotlib and makes it easy to create attractive plots.

Importance of Visualization

To learn something well, it helps to visualize it. The better you can visualize something, the more insight you gain into it. Since our end goal is to get to know the data, we try to visualize it. Visualization also helps non-technical people get more insight into the data.

So, instead of reading the raw data in rows and columns, we visualize it with plots, charts, and other visualization tools, which makes information about the data much easier to take in.

Several types of plots are available, but we are going to use only a few of them here:

  • Countplot
  • Heatmap
  • Boxplot
  • Distplot
  • Pair plot

Other than these, there are plots such as the violin plot, swarm plot, and joint plot. They are not covered in detail here; we are going to use only the plots listed above.

Let's begin the coding part

In this section, we perform EDA on the dataset using the libraries mentioned earlier. Let's get started.

Importing libraries

We are going to import all the necessary libraries.

# Import all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Import the dataset

We are going to import the dataset using the read_csv() function of pandas. Since our dataset uses ';' as a separator, we pass it to the function through the sep parameter.

# Read the dataset and display any 5 samples out of it.
df = pd.read_csv('winequality-white.csv', sep=';')
df.sample(5)

df.sample(5) shows 5 random rows of the dataset, so that we can see which features it has and what kind of values they take.

The output of the above code would be as follows:

[Image by Author: output of df.sample(5)]
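
To see what the sep=';' argument does in isolation, here is a minimal sketch that reads a few semicolon-separated rows from an in-memory string instead of the real file. The rows and the name demo are made up for illustration; only the layout mirrors the wine file.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the wine file (illustration only, not real data).
raw = io.StringIO(
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    "7.0;0.27;8.8;6\n"
    "6.3;0.30;9.5;6\n"
    "8.1;0.28;10.1;5\n"
)

demo = pd.read_csv(raw, sep=";")   # sep=';' splits each line into separate columns

print(demo.shape)    # (rows, columns)
print(demo.dtypes)   # shows the type pandas inferred for each column
```

If the separator were left at its default (a comma), each line would end up in a single column, which is usually the first sign that sep needs to be set.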

Describe the dataset

The describe() method of pandas shows summary statistics for each numeric column of the dataset: count, mean, standard deviation, minimum, quartiles, and maximum.

# To get overall idea about the data
df.describe()

The output of the above code would be as follows:

[Image by Author: output of df.describe()]
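
One way to read the describe() table, sketched here on a small hypothetical frame rather than the wine data, is to transpose it and compare the mean with the median (the 50% row). A mean well above the median hints at a long right tail. The column mean_minus_median is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical values for illustration; not taken from the wine dataset.
demo = pd.DataFrame({
    "residual sugar": [1.2, 1.6, 2.0, 6.9, 20.7],
    "pH": [3.0, 3.1, 3.2, 3.3, 3.4],
})

summary = demo.describe().T          # one row per column is easier to scan
summary["mean_minus_median"] = summary["mean"] - summary["50%"]
print(summary[["mean", "50%", "min", "max", "mean_minus_median"]])
```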

To see all the columns that we have in the dataset, we can use the columns attribute of the DataFrame:

# List all the columns
df.columns

[Image by Author: output of df.columns]

Checking for missing values

There are several ways to handle missing values, but first we need to know which features have missing values and how many. To check for missing values, pandas has a method named isnull(). Summing its result per column gives the number of missing values in each feature.

# To check missing values
df.isnull().sum()

[Image by Author: output of df.isnull().sum()]

As we can see, there are no missing values in our dataset, so we can go ahead.
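
Since the wine data has no gaps, the check is easier to appreciate on a frame that does. The following is a minimal sketch on a toy frame with deliberately missing entries; the values and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (hypothetical values).
demo = pd.DataFrame({
    "pH": [3.0, np.nan, 3.2, 3.3],
    "alcohol": [8.8, 9.5, np.nan, np.nan],
    "quality": [6, 6, 5, 7],
})

missing_count = demo.isnull().sum()      # missing cells per column
missing_share = demo.isnull().mean()     # fraction of rows missing per column
columns_with_gaps = missing_count[missing_count > 0].index.tolist()

print(missing_count)
print(missing_share)
print(columns_with_gaps)
```

The share per column is often more useful than the raw count when deciding whether to fill in a feature or drop it.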

Correlation between the columns

Finding the correlation between all the features helps us identify highly correlated features, which are candidates for dropping. We can see the correlation between all the features using a heatmap as follows:

# To get an idea of how strongly the features are correlated with each other, we create a heatmap.
plt.figure(figsize=(15,15))
# color='b' is not a heatmap option; the colors come from the colormap (cmap), left at its default here.
sns.heatmap(df.corr(), annot=True)

[Image by Author: correlation heatmap]

As we can see in the heatmap, strongly positively correlated features are shown in lighter colors, and negatively correlated features are shown in darker colors. The features density and residual sugar are highly correlated, so we can drop one of them. Free sulfur dioxide and total sulfur dioxide are also highly correlated, so we can drop one of them as well, since keeping both features of a highly correlated pair adds little new information.
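
Reading pairs off a large heatmap by eye can be error-prone, so the same information can also be listed numerically. This is a rough sketch on synthetic data, not the author's method: the columns a, b, c and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is built from 'a', so that pair is strongly correlated by construction.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.8                      # assumed cutoff; choose one that suits your problem
corr = demo.corr().abs()
# Keep only the upper triangle so each pair is listed once and the diagonal is skipped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high = pairs[pairs > threshold]
print(high)
```

Taking the absolute value treats strong negative correlation the same as strong positive correlation, which is usually what matters when looking for redundant features.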

Checking the distribution of the data over the quality of wine

Here we check how often each quality value occurs. This tells us how the dataset is distributed across the classes, so that we can decide on a resampling strategy if needed.

We create the count plot as follows:

# To check how the data is distributed over the classes.
sns.countplot(x='quality', data = df)

The result of this is as follows:

[Image by Author: count plot of quality]

As we can see from the count plot, we have the most data for quality 6 and very little data for qualities 3 and 9.
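
The counts behind a count plot can also be obtained as numbers, which helps when deciding whether resampling is needed. Below is a small sketch using value_counts() on hypothetical labels; imbalance_ratio is a name introduced for this example only.

```python
import pandas as pd

# Hypothetical quality labels, for illustration only.
quality = pd.Series([6, 6, 6, 5, 5, 7, 6, 5, 7, 3], name="quality")

counts = quality.value_counts().sort_index()                 # rows per class
shares = quality.value_counts(normalize=True).sort_index()   # fraction per class
imbalance_ratio = counts.max() / counts.min()                # largest class vs smallest

print(counts)
print(shares.round(2))
print(imbalance_ratio)
```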

Detect outliers using a box plot

We can plot our data using a box plot to detect outliers as well as to see whether the data is skewed.

A box plot is a simple rectangular box that spans the 1st to 3rd quartiles (the 25th and 75th percentiles), with a line at the median. Whiskers extend from the box to the smallest and largest values that lie within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn as circles and treated as outliers. If the median is not in the middle of the box, the data is skewed: in a vertical box plot, the data is positively skewed if the median is closer to the bottom of the box and negatively skewed if it is closer to the top.
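
The rule a box plot uses to flag outliers can be written out directly. This is a minimal sketch of the 1.5 × IQR rule on hypothetical measurements; the array values and the fence variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical measurements; 9.0 is placed far from the rest on purpose.
values = np.array([2.9, 3.0, 3.1, 3.1, 3.2, 3.2, 3.3, 3.4, 9.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                        # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a box plot draws as individual markers.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers)
```

Applying the same computation per column gives an outlier count for each feature without having to inspect every subplot.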

We can create the box plot as follows:

# To check the outliers we use boxplots.
plt.figure(figsize=(10,15))

for i, col in enumerate(df.columns):
    plt.subplot(4, 3, i+1)
    df.boxplot(column=col)   # draws on the current subplot, with a grid by default

plt.tight_layout()           # adjust spacing once, after all subplots exist

The box plot for this dataset is as follows:

[Image by Author: box plots of each column]

Estimating the PDF for each feature

We can estimate the PDF (probability density function) that each feature follows using a dist plot as follows:

# To check the estimated PDF (Probability Density Function)
plt.figure(figsize=(20,16))

for i, col in enumerate(df.columns):
    plt.subplot(4, 3, i+1)
    # distplot is deprecated in recent seaborn releases; histplot with kde=True is the axes-level replacement.
    sns.histplot(df[col], color='b', kde=True, stat='density', label='data')
    plt.grid(True)

plt.tight_layout()

[Image by Author: estimated PDF of each column]

As we can see from the plot, pH approximately follows a normal distribution.
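
A visual impression of normality can be backed up with a rough numerical check. The sketch below uses the skew() and kurt() methods of pandas on synthetic samples, not the wine data; the column names and distribution parameters are assumptions chosen only to contrast a symmetric shape with a right-skewed one.

```python
import numpy as np
import pandas as pd

# Synthetic samples: one roughly symmetric, one right-skewed by construction.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=3.2, scale=0.15, size=2000),
    "right_skewed": rng.exponential(scale=5.0, size=2000),
})

print(demo.skew())   # near 0 suggests symmetry; clearly positive suggests a long right tail
print(demo.kurt())   # excess kurtosis; near 0 for a normal distribution
```

These statistics are only a hint, not a formal test, but they make it easy to rank features by how far they depart from a bell shape.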

Exploring the relationship between all the features

To explore the relationship between all the features, we can use a pair plot, which can be created as follows:

# To check the relationship among the numeric columns.
sns.pairplot(data=df, kind='scatter',diag_kind='kde')

The pair plot of the white wine quality dataset is as follows:

[Image by Author: pair plot]

Conclusion

We are at the end of the article. Of course, there is a lot more to EDA than I covered here. To sum up, it is really helpful to get to know your data before you use it to train a model.

You can find my notebook along with the dataset in my GitHub repo.

Don't hesitate to share your ideas in the comment section below.

Share this article if you found it useful.

Thank you for reading this article.

Translated from: https://towardsdatascience.com/eda-exploratory-data-analysis-to-get-more-insight-into-the-data-b2fb74dabb82
