Python Visualization Study (1): the seaborn Library (Advanced)

Auraros

Article link: https://blog.csdn.net/qq_43634001/article/details/88984145

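This post is a transcription of existing material, kept as study notes and shared for reference.

Visualizing data with categorical attributes (categorical data)

Many datasets contain a large amount of categorical data, such as the color of a flower or the rating of a film.

First, import the required packages and datasets.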

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
np.random.seed(sum(map(ord, "categorical")))
titanic = sns.load_dataset("titanic")
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")

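sns.set() sets the plot theme: background style, color palette, font, font scale, and so on.
sns.load_dataset() loads one of seaborn's example datasets by name, downloading it on first use (so it needs network access).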

titanic

—	survived	pclass	sex	age	…	deck	embark_town	alive	alone
0	0	3	male	22.0	…	NaN	Southampton	no	False
1	1	1	female	38.0	…	C	Cherbourg	yes	False
2	1	3	female	26.0	…	NaN	Southampton	yes	True
3	1	1	female	35.0	…	C	Southampton	yes	False
4	0	3	male	35.0	…	NaN	Southampton	no	True

tips

----	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

iris

–	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

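Other variables vs. categorical variables

When the data has two variables, one of which is categorical while the other may be categorical or of another type, the stripplot() function is commonly used to plot their relationship.

This situation typically arises when the label is continuous (as in a regression problem) and the features include categorical data[1].
For these problems, the most commonly used seaborn functions are stripplot and swarmplot.

1. The stripplot function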

sns.stripplot(x="day", y="total_bill", data=tips)
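
The idea behind a strip plot can be sketched without seaborn: each category gets an integer position on the x axis, and a small random horizontal offset (jitter) keeps points with similar values from hiding each other. The sketch below uses only the standard library. The helper name `jittered_positions`, the jitter width of 0.1, and some of the sample rows are assumptions for illustration, not seaborn's internals.

```python
import random

def jittered_positions(categories, order, width=0.1, seed=0):
    """Map each category label to its axis position plus a small random offset."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    index = {label: i for i, label in enumerate(order)}
    return [index[c] + rng.uniform(-width, width) for c in categories]

# A few (day, total_bill) pairs. The Sun rows come from the tips preview above;
# the Sat and Thur rows are made up for illustration.
days = ["Sun", "Sun", "Sun", "Sat", "Sat", "Thur"]
bills = [16.99, 10.34, 21.01, 20.65, 17.92, 27.20]

xs = jittered_positions(days, order=["Thur", "Fri", "Sat", "Sun"])
for x, day, bill in zip(xs, days, bills):
    print(f"{day:>4}  x={x:.3f}  y={bill}")
```

Every x value stays within 0.1 of its category position, so the categories remain visually separate while points inside a category no longer sit on a single vertical line.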

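2. The swarmplot function

The following function gives a clearer view of how the data is distributed, because it shifts points sideways so that they do not overlap.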

sns.swarmplot(x="day", y="total_bill", data=tips)

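Stripplot vs. Swarmplot
① The advantage of swarmplot is most evident when both variables are categorical[2]. The code below draws the two plots one above the other for comparison.
② The drawback of swarmplot is that it is very time-consuming, so it is not suitable when the dataset is very large[3].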

# Top: stripplot with jitter; bottom: swarmplot of the same data
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, figsize=(12, 8))
sns.stripplot(x="size", y="size", data=tips, jitter=True, ax=ax0)
sns.swarmplot(x="size", y="size", data=tips, ax=ax1)
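
To see why this is a hard case for a strip plot, it helps to count how many rows land on exactly the same coordinate when both axes are categorical. A minimal sketch, using only the `size` values from the five tips rows shown earlier:

```python
from collections import Counter

# Party sizes from the first five rows of tips shown earlier.
sizes = [2, 3, 3, 2, 4]

# With x="size" and y="size", every row is plotted at the point (size, size).
points = Counter((s, s) for s in sizes)

for point, n in sorted(points.items()):
    print(point, "is shared by", n, "rows")
```

Five rows collapse onto three distinct points. Jitter only scatters them at random, whereas a swarm plot places the points side by side so that each row stays visible.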

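Distribution of a feature within each category

A plot of the raw points may fail to show how values are distributed within a category. In the day vs. total_bill example above, it is often hard to tell on which day total_bill tends to be higher, especially when two categories have similar totals. Other plot types serve this purpose better.
In seaborn the most common ones are boxplot and violinplot.

1. The boxplot function

Box plots are probably familiar to most readers. A boxplot makes it easy to see a summary of the data, including the median and the quartiles (the line inside the box is the median, not the mean).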

sns.boxplot(x="day", y="total_bill", data=tips)
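
The numbers a box plot draws can be computed by hand. The sketch below uses the five total_bill values from the tips preview above and assumes the common convention that whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box; five rows are far too few for a real analysis and serve only as an illustration.

```python
import statistics

# total_bill from the first five rows of tips shown earlier.
bills = [16.99, 10.34, 21.01, 23.68, 24.59]

# method="inclusive" interpolates linearly between the sorted data points.
q1, median, q3 = statistics.quantiles(bills, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the most extreme data points within 1.5 * IQR of the box.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
whisker_low = min(v for v in bills if v >= low_fence)
whisker_high = max(v for v in bills if v <= high_fence)
outliers = [v for v in bills if v < low_fence or v > high_fence]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high, "outliers:", outliers)
print("mean (not drawn by default):", statistics.mean(bills))
```

The median and the mean differ for these values, which is why the center line of a box should not be read as an average.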

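A small tip: hue is nested within the x and y variables, so when a hue variable is used, each box is split by hue level and the pieces are shifted sideways ("dodged"); what was a single box becomes several boxes side by side.
Sometimes this shift is not wanted: when each x category contains only one hue level, the shift just leaves empty slots and makes the plot look bad. In that case, set dodge=False to cancel it[4].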

sns.boxplot(x="day", y="total_bill", hue="size", data=tips, dodge=False)

sns.boxplot(x="day", y="total_bill", hue="size", data=tips)

tips["weekend"] = tips["day"].isin(["Sat", "Sun"])
sns.boxplot(x="day", y="total_bill", hue="weekend", data=tips, dodge=False)
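
Whether dodge=False is appropriate can be checked before plotting: it is safe when no x category contains more than one hue level. The sketch below is a plain-Python check under that assumption; the helper `hue_levels_per_category` and the sample rows are hypothetical and only mirror the day and weekend columns.

```python
from collections import defaultdict

def hue_levels_per_category(rows, x, hue):
    """Collect the set of hue values that occur for each x category."""
    levels = defaultdict(set)
    for row in rows:
        levels[row[x]].add(row[hue])
    return dict(levels)

# Hypothetical rows mirroring the day and weekend columns.
rows = [{"day": d, "weekend": d in ("Sat", "Sun")}
        for d in ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun"]]

levels = hue_levels_per_category(rows, x="day", hue="weekend")
# dodge=False only makes sense when no category mixes several hue levels.
safe_to_disable_dodge = all(len(v) == 1 for v in levels.values())
print(levels, safe_to_disable_dodge)
```

With hue="size" the same check would fail, since each day contains parties of several sizes, and the boxes would be drawn on top of each other.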

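2. The violinplot function

This function combines a boxplot with a kernel density estimate (KDE). It looks polished and is widely used. The shape of each violin makes it easy to see how a feature (usually the label)[5] is distributed within each category.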

sns.violinplot(x="day", y="total_bill", hue="time", split=True, data=tips)
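
The outline of a violin is a kernel density estimate. A minimal sketch of a Gaussian KDE follows, written with the standard library only. The bandwidth of 3.0 is chosen by hand here as an assumption; seaborn selects a bandwidth with a rule of thumb by default, so the exact curve it draws will differ.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# total_bill from the first five rows of tips shown earlier.
bills = [16.99, 10.34, 21.01, 23.68, 24.59]
density = gaussian_kde(bills, bandwidth=3.0)  # bandwidth chosen by hand

# Evaluate on a grid; a violin mirrors this curve around the category axis.
step = 0.5
grid = [i * step for i in range(0, 81)]  # 0.0 .. 40.0
curve = [density(x) for x in grid]
area = sum(curve) * step  # Riemann sum, should be close to 1
print(f"area under the curve ~ {area:.3f}")
```

Each sample contributes one small bell curve, and the violin is their sum, which is why the plot is smooth even when there are few data points.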

Violinplot and Swarmplot
violinplot is in fact quite similar to swarmplot, but it lets the shape of the distribution be read at a glance, clearly and intuitively.
The two can be combined:

sns.violinplot(x="day", y="total_bill", data=tips, inner=None)
sns.swarmplot(x="day", y="total_bill", data=tips, color="w", alpha=.5)

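Summary statistics of categorical features

The techniques above for categorical features mainly show how the data is distributed, but something is still missing. Anyone who has done some data analysis knows that looking at the raw shape of the data is not enough; we also need tools that expose summary information (mainly statistics), such as the number of records in a category or its mean[6].

The barplot function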

sns.barplot(x="sex", y="survived", data=titanic)

sns.barplot(x="sex", y="survived", hue="class", data=titanic)
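
By default barplot draws the mean of y for each category. The sketch below reproduces that aggregation in plain Python on the five titanic rows shown earlier; with so few rows the resulting rates are only an illustration, not the real survival rates.

```python
from collections import defaultdict
from statistics import mean

# (sex, survived) from the five titanic rows shown earlier.
rows = [("male", 0), ("female", 1), ("female", 1), ("female", 1), ("male", 0)]

groups = defaultdict(list)
for sex, survived in rows:
    groups[sex].append(survived)

# For a 0/1 column the mean is the share of ones, i.e. the survival rate.
bar_heights = {sex: mean(values) for sex, values in groups.items()}
print(bar_heights)
```

Adding hue="class" simply makes the grouping key a pair (sex, class) instead of sex alone.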

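The countplot function

As seen above, barplot by default plots the mean on the vertical axis. This is very useful for binary outcomes, because the mean is then the probability of a 1. To judge whether that probability is statistically meaningful, however, the probability alone is not enough; the number of observations matters too[7]. This is where countplot comes in. countplot cannot take x and y at the same time, so to count one categorical variable broken down by another, split it with hue. For example: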

sns.countplot(hue="sex", x="survived", data=titanic, palette="Greens_d")
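
What countplot draws is a plain tally: one bar per combination of x value and hue level, with the bar height equal to the number of rows. A minimal sketch on the five titanic rows shown earlier:

```python
from collections import Counter

# (survived, sex) from the five titanic rows shown earlier.
rows = [(0, "male"), (1, "female"), (1, "female"), (1, "female"), (0, "male")]

# One bar per (x value, hue level) pair; the bar height is the row count.
counts = Counter(rows)
for (survived, sex), n in sorted(counts.items()):
    print(f"survived={survived} sex={sex}: {n}")
```

Combinations that never occur, such as (1, "male") in these five rows, have a count of zero and therefore no visible bar.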

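The pointplot function

Another commonly used function with a similar purpose is pointplot. Like barplot, it plots an estimate of y (here the survival probability) for each category. It is richer in that it also connects the points that share the same hue level, giving a curve that shows how the estimate changes across categories.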

sns.pointplot(x="sex", y="survived", hue="class", data=titanic)

sns.pointplot(x="sex", y="survived", hue="class", data=titanic,
              palette={"First": "g", "Second": "m", "Third":'b'},
              markers=["^", "o","+"], linestyles=["-", "--",""])
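
The vertical bars that pointplot and barplot draw around each estimate are, by default, bootstrap confidence intervals. The sketch below shows the percentile bootstrap for a mean in plain Python. The helper `bootstrap_ci`, the number of resamples, and the sample values are assumptions for illustration, so the interval will not match seaborn's output exactly.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, recompute the mean, and sort the results.
    boots = sorted(mean(rng.choices(values, k=len(values)))
                   for _ in range(n_boot))
    lo_idx = round((1 - level) / 2 * n_boot)
    hi_idx = round((1 + level) / 2 * n_boot) - 1
    return boots[lo_idx], boots[hi_idx]

# Hypothetical 0/1 survival flags for one group (not taken from titanic).
survived = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

point = mean(survived)
low, high = bootstrap_ci(survived)
print(f"estimate={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A small group gives a wide interval, which is the visual cue that a survival rate based on few passengers should be read with caution.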

# orient="h" draws the boxes horizontally
sns.boxplot(data=iris, orient="h")

f, ax = plt.subplots(figsize=(7, 3))
sns.countplot(y="deck", data=titanic, color="c")

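Summary: this section covered several visualization techniques for examining the relationship between a categorical variable and other variables (usually the label). You have probably met them already in shared kernels or other open-source data analyses. The main functions involved are: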

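Relationship between a categorical variable and another variable (continuous or categorical): stripplot, swarmplot (swarmplot is typically used when the points within a category overlap heavily)
Distribution of another variable within each category: boxplot, violinplot
Statistics of another variable within each category: barplot, countplot, pointplot
Combined interfaces: factorplot (renamed catplot in seaborn 0.9) and PairGrid.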