

Data visualization plays a very important role in data mining. Data scientists spend much of their time exploring data through visualization. To speed up this process, we need good documentation of all the plots.

Even plentiful resources cannot be turned into valuable goods without planning and architecture. I therefore hope this article gives you a well-organized overview of the plots and their documentation.

Content

  1. Introduction

  2. Know Your Data

  3. Distribution Plots
     a. Dist Plot
     b. Joint Plot
     c. Pair Plot
     d. Rug Plot

  4. Categorical Plots
     a. Bar Plot
     b. Count Plot
     c. Box Plot
     d. Violin Plot

  5. Advanced Plots
     a. Strip Plot
     b. Swarm Plot

  6. Matrix Plots
     a. Heat Map
     b. Cluster Map

  7. Grids
     a. Facet Grid

  8. Regression Plots


Introduction

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.


To install Seaborn, run either of the following from your command line.


pip install seaborn
conda install seaborn

To import seaborn you can run the following command.


import seaborn as sns

Know Your Data

The data set used in these plots is the famous Titanic data set (Fig. 1). From here on, the data set is represented by the variable ‘df’.

Fig. 1: Titanic Data set
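
The article does not show how ‘df’ is created. The sketch below is only an assumption about its shape: it uses the column names that appear in the code further down (Age, Fare, Sex, Pclass, Survived, PassengerId) and a handful of made-up rows in place of the real file, which you would normally read with pd.read_csv. It shows a few quick checks worth running before any plotting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real data set uses the same column names
# but has several hundred passengers.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 26.00],
})

# Quick checks before plotting: size, numeric columns, and missing values.
n_rows, n_cols = df.shape
numeric_columns = df.select_dtypes(include="number").columns.tolist()
missing_per_column = df.isnull().sum()

print(n_rows, n_cols)
print(numeric_columns)
print(missing_per_column)
```

Knowing which columns are numeric and which contain missing values tells you in advance which plots apply to which columns.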

Distribution Plots

These plots help us visualize the distribution of data. We can use them to understand the mean, median, range, variance, deviation, etc. of the data.

a. Distplot

  • Dist plot gives us the histogram of the selected continuous variable.

  • It is an example of a univariate analysis.

  • We can change the number of bins, i.e. the number of vertical bars in the histogram.

import seaborn as sns
# distplot is deprecated since seaborn 0.11; kde = False keeps the y-axis as counts
sns.distplot(x = df['Age'], bins = 10, kde = False)
Fig. 2: Distribution Plot for ‘Age’ of Passengers.
  • Here the x-axis is age and the y-axis shows frequency. For example, with bins = 10, around 50 passengers are aged 0 to 10.

b. Joint Plot

  • It is the combination of the distplot of two variables.

  • It is an example of bivariate analysis.

  • We additionally get a scatter plot of the two variables, which reflects their linear relationship. We can turn the scatter plot into a hexagonal plot, where a higher color intensity means a larger number of observations.
import seaborn as sns
# For Plot 1
sns.jointplot(x = df['Age'], y = df['Fare'], kind = 'scatter')
# For Plot 2
sns.jointplot(x = df['Age'], y = df['Fare'], kind = 'hex')
Fig. 3: Joint plots between ‘Age’ and ‘Fare’
  • We can see that there is no clear linear relationship between age and fare.

  • kind = ‘hex’ provides the hexagonal plot and kind = ‘reg’ provides a regression line on the graph.


c. Pair Plot

  • It takes all the numerical attributes of the data and plots a pairwise scatter plot for each pair of different variables, and a histogram where a variable is paired with itself.

import seaborn as sns
sns.pairplot(df)
Fig. 4: Pair Plot of the titanic Data set

d. Rug Plot

  • It draws a small dash for each observation along the axis instead of grouping observations into bins as distplot does.

  • It is an example of a univariate analysis.

import seaborn as sns
sns.rugplot(x = df['Age'])
Fig. 5: Rug Plot for ‘Age’ of Passengers

Categorical Plots

These plots help us understand the categorical variables. We can use them for both univariate and bivariate analysis.


a. Bar Plot

  • It is an example of bivariate analysis.

  • On the x-axis, we have a categorical variable and on the y-axis, we have a continuous variable.

import seaborn as sns
sns.barplot(x = df['Sex'], y = df['Fare'])
Fig. 6: Bar plot for ‘Fare’ and ‘Sex’
  • We can infer that the average fare is higher for females than males.
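
By default, the height of each bar in sns.barplot is the mean of the y variable within that category. The sketch below reproduces those numbers with pandas alone; the fares are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Made-up fares standing in for the 'Sex' and 'Fare' columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Fare": [7.25, 71.25, 8.00, 8.05, 53.25, 26.00],
})

# Mean fare per category: the values a default bar plot draws as bar heights.
mean_fare = df.groupby("Sex")["Fare"].mean()
print(mean_fare)
```

Checking the grouped means this way is a quick sanity check on what the bar plot appears to show.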


b. Count Plot

  • It counts the number of occurrences of each category of a categorical variable.

  • It is an example of a univariate analysis.

import seaborn as sns
sns.countplot(x = df['Survived'])  # plot 1; pass df['Pclass'] instead for plot 2
Fig. 7: Count Plot for Survived and ‘P-class’.

c. Box Plot

  • It is a five-point summary plot. It shows the maximum, minimum, median, first quartile, and third quartile of a continuous variable. It also reveals outliers.

  • We can plot it for a single continuous variable, or compare a continuous variable across the categories of a categorical variable.

import seaborn as sns
# For plot 1
sns.boxplot(y = df['Age'])
# For plot 2
sns.boxplot(y = df['Age'], x = df['Sex'])
Fig.8: a) Box plot of ‘Age’, b) Box plot of different categories in ‘sex’ for ‘Age’
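
The quantities a box plot draws can also be computed directly. This is a minimal sketch with made-up ages, assuming the common 1.5 × IQR whisker rule (the default in seaborn and matplotlib) to decide which points count as outliers.

```python
import pandas as pd

# Made-up ages; 80 is deliberately far from the rest.
ages = pd.Series([4, 18, 22, 25, 28, 30, 35, 40, 80], dtype=float)

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points further than 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

Here the quartiles are 22, 28 and 35, so the fences sit at 2.5 and 54.5 and only the age of 80 falls outside them.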

d. Violin Plot

  • It is similar to the box plot, but it also shows the shape of the distribution.

import seaborn as sns
sns.violinplot(y = df['Age'], x = df['Sex'])
Fig. 9: Violin Plot between ‘Age’ and ‘Sex’

Advanced Plots

As the name suggests, they are advanced because they combine distribution plots with categorical encodings.


a. Strip Plot

  • It’s a plot between a continuous variable and a categorical variable.

  • It draws a scatter plot, but additionally uses the categorical encoding of the categorical variable.

import seaborn as sns
sns.stripplot(y = df['Age'], x = df['Pclass'])
Fig.10: Strip Plot between ‘Age’ and ‘P-class’
  • We can observe that class 1 and class 2 have no children of around 10 years of age, and that passengers older than 60 are mostly in class 1.

  • Usually, these types of observations are used to impute missing values.


b. Swarm Plot

  • It is the combination of a strip plot and a violin plot.

  • Along with the number of data points, it also provides their respective distribution.

import seaborn as sns
sns.swarmplot(y = df['Age'], x = df['Pclass'])
Fig. 11: Swarm Plot between ‘Age’ and ‘P-class’

Matrix Plots

These are special types of plots that use two-dimensional matrix data for visualization. Patterns are hard to find in matrix data because of its large dimensions. Matrix plots make this easier by color coding the matrix data.

a. Heat Map

  • In the given raw dataset ‘df’, we have seven numeric variables. So, let us generate a correlation matrix between these seven variables.

Fig. 12: Correlation matrix
  • Reading every value is difficult even though there are only 49 of them. The difficulty grows as we move towards thousands of features.

    So, let us try to implement some color coding and see how easy the interpretation becomes.


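# keep only the numeric columns; pandas 2.0+ raises an error if corr() meets text columns
sns.heatmap(df.select_dtypes('number').corr(), annot = True, cmap = 'viridis')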
Fig. 13: Heat Map of the correlation matrix of the titanic data set.
  • The same matrix is now much easier to read.

  • Another common use of heat maps is to understand missing-value patterns. In Fig. 14, each yellow dash represents a missing value, which makes it much easier to spot where values are missing.
Fig. 14: Heat Map for missing values in titanic data.
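
The article shows no code for the correlation matrix (Fig. 12) or the missing-value heat map (Fig. 14). The following is one possible sketch, not necessarily how the figures were produced. It assumes a small made-up frame with the same column names, and draws both heat maps side by side.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the Titanic data; one Age value is missing.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 26.00],
})

# Correlation matrix of the numeric columns only.
corr = df.select_dtypes(include="number").corr()

# 1 marks a missing cell, 0 a present one.
missing = df.isnull().astype(int)

fig, (ax_corr, ax_missing) = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(corr, annot=True, cmap="viridis", ax=ax_corr)
# With viridis, the high value (missing) is drawn in yellow.
sns.heatmap(missing, cbar=False, yticklabels=False, cmap="viridis", ax=ax_missing)
```

In the second panel every row is a passenger and every column a variable, so a column with many yellow dashes is one that needs imputation or removal.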

b. Cluster Map

  • If we have matrix data and want to group features according to their similarity, cluster maps can help. First have a look at the heat map (Fig. 13) and then at the cluster map (Fig. 15).
sns.clustermap(df.select_dtypes('number').corr(), annot = True, cmap = 'viridis')
Fig. 15: Cluster map for correlation matrix of titanic data
  • The x-labels and y-labels are the same but ordered differently, because they are grouped according to their similarity.

  • The tree-like structure (dendrogram) at the top and left describes their degree of similarity.

  • Cluster maps use Hierarchical clustering to form different clusters.
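
To see how hierarchical clustering reorders rows, here is a small sketch that calls SciPy directly, using average linkage on Euclidean distances (the defaults of sns.clustermap). The four rows are made up: rows 0 and 2 are close to each other, as are rows 1 and 3.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

# Made-up matrix: rows 0 and 2 are similar, rows 1 and 3 are similar.
data = np.array([
    [0.0, 0.1],
    [5.0, 5.1],
    [0.2, 0.0],
    [5.2, 5.0],
])

# Build the hierarchy, then read off the row order along the dendrogram leaves.
tree = linkage(data, method="average", metric="euclidean")
order = leaves_list(tree).tolist()
print(order)
```

In the resulting order, similar rows sit next to each other, which is why the labels in a cluster map appear in a different order than in the plain heat map.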


Grids

Grid plots give us more control over visualizations and draw several related graphs with a single line of code.


a. Facet Grid

  • Suppose we want to plot the age distribution of males and females in all three ticket classes. That gives a total of 6 graphs.
sns.FacetGrid(df, col = 'Pclass', row = 'Sex').map(sns.distplot, 'Age')
Fig. 16: Distribution plot of ‘Age‘ for classes of ‘Sex’ and ‘P-class’
  • The Facet grids provide very clear graphs as per requirements.

  • sns.FacetGrid(col = ‘col’, row = ‘row’, data = data) provides an empty grid with one panel for every combination of the unique categories in col and row. We can then map different plots and shared variables onto it to show the variation across panels.

Regression Plot

This is a more advanced statistical plot that provides a scatter plot along with a linear fitting on the data.


sns.lmplot(x = 'Age', y = 'PassengerId', data = df, hue = 'Sex')
Disclaimer: There is no significance in regressing age on passenger ID. It is only for the purpose of understanding the visualization.

Fig. 17 displays the linear regression fitting between Passenger ID and Age for both males and females.
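
With hue set, lmplot fits a separate straight line for each category. As a rough sketch of what those lines are, the code below fits a least-squares line per group with np.polyfit. The columns x and y and their values are made up so that the expected slopes are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Made-up points lying exactly on y = 2x + 1 (male) and y = -x + 10 (female).
df = pd.DataFrame({
    "Sex": ["male"] * 4 + ["female"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "y": [3.0, 5.0, 7.0, 9.0, 9.0, 8.0, 7.0, 6.0],
})

# One least-squares line (degree 1 polynomial) per category.
fits = {}
for sex, group in df.groupby("Sex"):
    slope, intercept = np.polyfit(group["x"].to_numpy(), group["y"].to_numpy(), deg=1)
    fits[sex] = (slope, intercept)

print(fits)
```

The regression plot adds a confidence band around each line, which this sketch leaves out.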


Wrap Up

In this article, we have seen 14 different visualization techniques using seaborn.


I believe data visualization enhances our understanding of data and our ability to interpret it. It gives us better ways to represent data, impute missing values, identify outliers, detect anomalies, and a lot more.

Data analysts are like cops who need to interrogate data and extract information from it. Having the right tools for the job is essential. I therefore hope this article serves you as a tool for interrogating your data.

For the Guide for Exploratory data analysis, visit-


Happy Learning!






