Data visualisation with pandas
One of the most common pitfalls I observe among relatively junior data scientists and machine learning professionals is spending hours searching for the best algorithm for their project without first spending enough time understanding the data.
A structured way to approach data science and machine learning projects starts with the project objective. The same set of data points can yield meaningful information about several things. Depending on what we are looking for, we need to focus on a different aspect of the data. Once we are clear on the objective, we should start thinking about the data points we require to meet it. This lets us focus on the most pertinent information and ignore the data sets which may not be important.
In real life, data collected from several sources usually contains blank values, typos and other anomalies. It is vital to clean the data before any data analysis.
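As a rough sketch of what a first cleaning pass might look like, the snippet below counts the blank values per column and normalises inconsistent labels. The DataFrame, its column names and its values are made up for illustration and are not part of the datasets used later in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one blank value and inconsistent spellings
raw = pd.DataFrame({
    "city": ["London", "london ", "Paris", "PARIS", "Berlin"],
    "sales": [120.0, np.nan, 95.0, 101.0, 87.0],
})

# Count the blank (NaN) values in each column
missing_per_column = raw.isna().sum()

# Normalise the labels: strip stray spaces and use one capitalisation
cleaned = raw.copy()
cleaned["city"] = cleaned["city"].str.strip().str.title()

# Drop the rows that still contain blank values
cleaned = cleaned.dropna()

print(missing_per_column)
print(cleaned["city"].value_counts())
```

Whether to drop or fill blank values depends on the project objective; dropping is used here only because it is the shortest option to show.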
In this article, I will discuss five powerful data visualisation options which instantly provide a sense of the data characteristics. Performing an exploratory data analysis (EDA) conveys a lot about the data and the relationships among the features, even before any formal modelling or hypothesis testing.
In the next article of this series I discuss Advanced Visualisation for Exploratory data analysis (EDA).
Step 1 - We will import the packages pandas, matplotlib, seaborn and NumPy, which we are going to use for our analysis.
We also need scatter_matrix, autocorrelation_plot, lag_plot and parallel_coordinates from pandas.plotting.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from pandas.plotting import autocorrelation_plot
from pandas.plotting import parallel_coordinates
from pandas.plotting import lag_plot
Step 2 - The seaborn package comes with a few small built-in datasets. We will use the “mpg”, “tips” and “attention” datasets for our visualisations. The datasets are loaded with the load_dataset function in seaborn.
"""Download the datasets used in the program """
CarDatabase= sns.load_dataset("mpg")
MealDatabase= sns.load_dataset("tips")
AttentionDatabase= sns.load_dataset("attention")
Hexbin plots
We often use a scatter plot to get a quick grasp of the relationship between variables. It gives useful insight as long as the plot is not overcrowded with densely packed data points.
In the code below, we draw a scatter plot of the “acceleration” and “horsepower” columns in the “mpg” dataset.
plt.scatter(CarDatabase.acceleration ,CarDatabase.horsepower,marker="^")
plt.show()
Points are densely populated in the scatter plot, and it is a bit difficult to get meaningful information from it.
Hexbin plots are a very good alternative when the points in a scatter plot overlap. Instead of plotting each point individually, a hexbin plot divides the plane into hexagonal bins and colours each bin by the number of points that fall in it.
In the code below, we draw a hexbin plot of “acceleration” against “horsepower” from the same dataset.
CarDatabase.plot.hexbin(x='acceleration', y='horsepower', gridsize=10,cmap="YlGnBu")
plt.show()
The hexbin plot clearly shows where the acceleration and horsepower values are concentrated, and it suggests a negative relationship between the two variables. The size of the hexagons depends on the gridsize parameter: it sets the number of hexagons in the x-direction, so a larger value gives smaller hexagons.
Self-exploration: I would encourage you to alter the grid size parameter and observe the changes in the hexbin plot.
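One way to run that experiment is to draw the same data with several grid sizes side by side. The sketch below is only an illustration: it uses synthetic, negatively related data rather than the “mpg” dataset so that it runs without a download, and the variable names are my own.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, negatively related data standing in for two dataset columns
rng = np.random.default_rng(0)
x = rng.normal(15, 3, size=400)
y = 300 - 12 * x + rng.normal(0, 20, size=400)

grid_sizes = [5, 10, 30]
fig, axes = plt.subplots(1, len(grid_sizes), figsize=(12, 4))

hexagon_counts = {}
for ax, gridsize in zip(axes, grid_sizes):
    # gridsize is the number of hexagons in the x-direction
    hb = ax.hexbin(x, y, gridsize=gridsize, cmap="YlGnBu")
    ax.set_title(f"gridsize={gridsize}")
    fig.colorbar(hb, ax=ax, label="points per hexagon")
    # The returned collection holds one value per hexagon drawn
    hexagon_counts[gridsize] = len(hb.get_array())

fig.tight_layout()
print(hexagon_counts)
```

Call plt.show() afterwards to display the figure. A small grid size smooths the picture but hides detail, while a large one brings back the clutter of the scatter plot, so it is worth trying a few values.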
Heatmaps
Heatmaps are my personal favourite for viewing the correlation among different variables. Those of you who follow me on Medium may have noticed that I use them often.
In the code below, we calculate the pairwise correlation among all the numeric variables in the seaborn “mpg” dataset and plot it as a heatmap.
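# numeric_only=True skips the text columns ("origin", "name"); pandas 2.0 and later raise an error without it
sns.heatmap(CarDatabase.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()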
We can see that “cylinders” and “horsepower” are strongly positively correlated (as expected in a car), and that weight is negatively correlated with acceleration. With just a couple of lines of code we get a quick, indicative view of the relationships among all the variables.
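To make the numbers in the heatmap less of a black box, here is a minimal sketch that computes one Pearson correlation by hand and compares it with the matrix pandas returns. The toy DataFrame and the pearson helper are hypothetical and exist only for this illustration; the values are invented, not taken from “mpg”.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only
toy = pd.DataFrame({
    "weight": [2100.0, 2500.0, 3000.0, 3600.0, 4200.0],
    "acceleration": [19.0, 17.5, 15.0, 13.5, 11.0],
    "name": ["a", "b", "c", "d", "e"],  # text column, left out of the correlation
})

# Pairwise Pearson correlation of the numeric columns
corr_matrix = toy.corr(numeric_only=True)

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

manual = pearson(toy["weight"], toy["acceleration"])

print(corr_matrix)
print("manual weight/acceleration correlation:", round(manual, 4))
```

Each cell of the heatmap is one such coefficient between -1 and 1, which is why the diagonal is always 1 and the matrix is symmetric.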
Autocorrelation plot
Autocorrelation plots are a quick litmus test to ascertain whether the data points are random. If the data points follow a trend, one or more of the autocorrelations will be significantly non-zero. The dashed line in the plot shows the 99% confidence band, and the solid line shows the 95% band.
In the code below, we check whether the total_bill amount in the “tips” dataset is random.
autocorrelation_plot(MealDatabase.total_bill)
plt.show()
We can see that the autocorrelation stays very close to zero for all time lags, suggesting that the total_bill data points are random.
When we draw the autocorrelation plot for data points that follow a particular order, we can see that the autocorrelations are significantly non-zero.
data = pd.Series(np.arange(12,7000,16.3))
autocorrelation_plot(data)
plt.show()
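For readers who want to see what is being plotted, the following is a minimal sketch of the sample autocorrelation at a given lag and of a 99% band of the form z / sqrt(n). It assumes the usual textbook estimator and a normal approximation for the band; the function names are mine, and the random series is synthetic rather than the “tips” data.

```python
import numpy as np
import pandas as pd

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    data = np.asarray(series, dtype=float)
    n = len(data)
    mean = data.mean()
    c0 = np.sum((data - mean) ** 2) / n  # variance of the whole series
    ch = np.sum((data[: n - lag] - mean) * (data[lag:] - mean)) / n
    return ch / c0

def confidence_band(n, z=2.5758):
    """Half-width of the band around zero; z = 2.5758 is the 99% normal quantile."""
    return z / np.sqrt(n)

rng = np.random.default_rng(42)
random_series = pd.Series(rng.normal(size=244))      # no structure
trend_series = pd.Series(np.arange(12, 7000, 16.3))  # steadily increasing values

band_random = confidence_band(len(random_series))
band_trend = confidence_band(len(trend_series))
r1_random = autocorrelation(random_series, 1)
r1_trend = autocorrelation(trend_series, 1)

print(f"random: r(1) = {r1_random:.3f}, 99% band = +/-{band_random:.3f}")
print(f"trend:  r(1) = {r1_trend:.3f}, 99% band = +/-{band_trend:.3f}")
```

Values that stay inside the band are consistent with random data, while the trending series lands far outside it at lag 1.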
Lag Plots
Lag plots are also helpful to verify if the dataset is a random set of values or follows a certain trend.
When we draw the lag plot of the “total_bill” values from the “tips” dataset, it agrees with the autocorrelation plot: the points are scattered all over the place, which suggests random data.
lag_plot(MealDatabase.total_bill)
plt.show()
When we draw a lag plot of a non-random data series, as in the code below, we get a nice smooth line.
data = pd.Series(np.arange(-12*np.pi,300*np.pi,10))
lag_plot(data)
plt.show()
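A lag plot is simply a scatter plot of each value y(t) against the value that follows it, y(t + lag). The sketch below builds those pairs by hand to show why ordered data gives a line and random data gives a cloud. The lag_pairs helper is a hypothetical name of my own, and the random series is synthetic.

```python
import numpy as np
import pandas as pd

def lag_pairs(series, lag=1):
    """Return y(t) and y(t + lag) as two aligned arrays."""
    data = np.asarray(series, dtype=float)
    return data[:-lag], data[lag:]

rng = np.random.default_rng(7)
random_series = pd.Series(rng.normal(size=244))
ordered_series = pd.Series(np.arange(-12 * np.pi, 300 * np.pi, 10))

# Ordered data: every value sits one fixed step above its predecessor,
# so the pairs fall on a straight line
current, following = lag_pairs(ordered_series)
step = following - current
lag_corr_ordered = np.corrcoef(current, following)[0, 1]

# Random data: a value says little about the next one, so the pairs form a cloud
current_r, following_r = lag_pairs(random_series)
lag_corr_random = np.corrcoef(current_r, following_r)[0, 1]

print("ordered: correlation of y(t) with y(t+1) =", round(lag_corr_ordered, 4))
print("random:  correlation of y(t) with y(t+1) =", round(lag_corr_random, 4))
```

Plotting current against following with plt.scatter should give the same picture as lag_plot for a lag of 1.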
Parallel coordinates
It is always a challenge to wrap our heads around and visualise data with more than three dimensions. Parallel coordinates are very useful for plotting such higher-dimensional datasets. Each dimension is represented by a vertical line.
In parallel coordinates, “N” equally spaced vertical lines represent the “N” dimensions of the dataset. Each data point is drawn as a line whose vertex on the n-th axis sits at the point's n-th coordinate.
Confusing!
Let us consider a small sample dataset with five features for small and large widgets.
A vertical line represents each feature of the widget. A continuous series of line segments represents the feature values of each “small” or “large” widget.
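The widget example can be sketched in a few lines. The feature names and values below are invented for illustration, and the polyline helper is a hypothetical name of my own; it only lists the vertices that a parallel coordinates plot would connect for each widget.

```python
import pandas as pd

# Hypothetical widget measurements; the feature names and values are invented
widgets = pd.DataFrame({
    "size": ["small", "large"],
    "length": [2.0, 8.0],
    "width": [1.5, 6.0],
    "height": [1.0, 5.0],
    "weight": [0.5, 4.0],
    "price": [3.0, 9.0],
})

# Every column except the class label becomes one vertical axis
feature_columns = [c for c in widgets.columns if c != "size"]

def polyline(row):
    """Vertices (axis index, value) of the line drawn for one row."""
    return [(axis, row[name]) for axis, name in enumerate(feature_columns)]

lines = {row["size"]: polyline(row) for _, row in widgets.iterrows()}
for label, vertices in lines.items():
    print(label, vertices)
```

Passing the same frame to parallel_coordinates(widgets, "size") from pandas.plotting draws these polylines, with one colour per class.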
The code below plots the parallel coordinates for the “attention” dataset in seaborn. Note that points which cluster together appear as lines that run close together.
parallel_coordinates(AttentionDatabase,"attention",color=('#556270', '#C7F464'))
plt.show()
I hope you will start using these out-of-the-box plots for exploratory data analysis if you are not already using them. I would love to hear about your favourite visualisation plots for EDA.
Read my article on Advanced Visualisation for Exploratory data analysis (EDA) to learn more on this topic.
If you would like to learn a structured approach to identifying the appropriate independent variables to make accurate predictions, read my article “How to identify the right independent variables for Machine Learning Supervised”.
"""Full code"""import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from pandas.plotting import autocorrelation_plot
from pandas.plotting import parallel_coordinates
from pandas.plotting import lag_plot
# Load the datasets
CarDatabase = sns.load_dataset("mpg")
MealDatabase = sns.load_dataset("tips")
AttentionDatabase = sns.load_dataset("attention")
# Scatter plot
plt.scatter(CarDatabase.acceleration, CarDatabase.horsepower, marker="^")
plt.show()
# Hexbin plot
CarDatabase.plot.hexbin(x='acceleration', y='horsepower', gridsize=10, cmap="YlGnBu")
plt.show()
# Heatmap (numeric_only=True skips the text columns)
sns.heatmap(CarDatabase.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()
# Autocorrelation plots: random data, then ordered data
autocorrelation_plot(MealDatabase.total_bill)
plt.show()
data = pd.Series(np.arange(12, 7000, 16.3))
autocorrelation_plot(data)
plt.show()
# Lag plots: random data, then ordered data
lag_plot(MealDatabase.total_bill)
plt.show()
data = pd.Series(np.arange(-12*np.pi, 300*np.pi, 10))
lag_plot(data)
plt.show()
# Parallel coordinates
parallel_coordinates(AttentionDatabase, "attention", color=('#556270', '#C7F464'))
plt.show()