Python Data Visualization (Part 3)

ma_no_lo

Published 2024-05-21 18:31:08

Article link: https://blog.csdn.net/ma_no_lo/article/details/139098742

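Drawing Statistical Charts

I. Discriminant analysis diagram in machine learning

Discriminant analysis builds a discriminant function from training samples and uses it to predict which class a new sample belongs to; it is a classic classification method in machine learning. Because the discriminant function is applied to a whole set of new samples at once, showing the resulting class assignments visually is especially useful. The code below shows how to visualize the class assignments predicted by a discriminant function.

(1) Code example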

import matplotlib.pyplot as plt 
import numpy as np 
fig,ax = plt.subplots() 
 
num = 50 
 
# new sample 
sample = 10*np.random.rand(num,2) 
var1 = sample[:,0] 
var2 = sample[:,1] 
 
# threshold value 
td = 12 
 
# discriminant function 
df = 2*var1+var2 
 
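# category 1: keep samples with df < td, mask the rest
cates11 = np.ma.masked_where(df>=td,var1)
cates12 = np.ma.masked_where(df>=td,var2)

# category 2: keep samples with df > td, mask the rest
cates21 = np.ma.masked_where(df<=td,var1)
cates22 = np.ma.masked_where(df<=td,var2)

# masked entries are skipped, so each call draws only one category
ax.scatter(var1,var2,s=cates11*50,marker="s",c=cates11)
ax.scatter(var1,var2,s=cates21*50,marker="o",c=cates21)

# decision boundary: 2*var1 + var2 = td
ax.plot(var1,-2*var1+td,lw=1,color="b",alpha=0.65)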
 
ax.axis([-1,11,-1,11]) 
 
plt.show()

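(2) Code walkthrough

<1> Generate the new sample data sample, which contains two factors, var1 and var2.

<2> Compare the value of the discriminant function “df = 2*var1+var2” with the threshold “td = 12” to decide which class each sample point belongs to.

<3> Call “ax.scatter(var1,var2,s=cates11*50,marker="s",c=cates11)” and “ax.scatter(var1,var2,s=cates21*50,marker="o",c=cates21)”, passing the masked arrays as the values of the parameters s and c, so that the classification result for the new sample is displayed clearly.

<4> Call the instance method plot() to draw the line where the discriminant function equals the threshold, and adjust the transparency of the line.

Note: to display the classification result clearly, the data must be masked with the function masked_where() so that each scatter call shows only the samples of one class. masked_where() belongs to the ma subpackage of NumPy and is called as numpy.ma.masked_where(). Its call signature is masked_where(condition,a), and the parameters have the following meanings.

condition: the condition that decides which elements of the array are masked.
a: the array to be masked.

Every element at a position where condition evaluates to “True” is masked. Masked data remain in the array; they are only displayed as “--”, while the elements that do not satisfy the condition are stored and shown with their original values.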
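To see what masked_where() does to the data itself, here is a minimal sketch on a small hand-made array. The array contents and the names values and masked are illustrative assumptions and do not come from the script above.

```python
import numpy as np

# illustrative data (an assumption, not taken from the original script)
values = np.array([1.0, 5.0, 3.0, 8.0])

# mask every element for which the condition is True
masked = np.ma.masked_where(values >= 5, values)

print(masked)               # masked elements are displayed as "--"
print(masked.mask)          # boolean mask: True where the condition held
print(masked.data)          # the underlying data are still stored
print(masked.compressed())  # only the unmasked elements
```

The mask and the data are kept side by side, which is why the scatter calls can receive the full-length arrays and still draw one class at a time.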

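II. Date-based time series plot

A time series plot usually places date-type data on the x-axis and the values for those dates on the y-axis. In matplotlib, a date-based time series plot can be drawn either with the pyplot API function plot_date() or with the Axes instance method plot_date().

(1) Code example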

import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

fig, ax = plt.subplots()

months = mdates.MonthLocator()  # a Locator instance

dateFmt = mdates.DateFormatter("%m/%d/%y")  # a Formatter instance

# format the ticks 
ax.xaxis.set_major_formatter(dateFmt)
ax.xaxis.set_minor_locator(months)
# set appearance parameters for ticks,ticklabels,and gridlines
ax.tick_params(axis="both", direction="out", labelsize=10)

date1 = datetime.date(2008, 4, 17)
date2 = datetime.date(2017, 5, 4)
delta = datetime.timedelta(days=5)
dates = mdates.drange(date1, date2, delta)

y = np.random.normal(100, 15, len(dates))

ax.plot_date(dates, y, "b-", alpha=0.7)

fig.autofmt_xdate()

plt.show()

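(2) Code walkthrough

<1> Call “mdates.MonthLocator()” to obtain a Locator instance of the date tick locator class MonthLocator, and assign it to the variable months.

<2> Call “mdates.DateFormatter("%m/%d/%y")”, which returns a Formatter instance of the date tick formatter class DateFormatter, and assign it to the variable dateFmt.

<3> Call “ax.xaxis.set_major_formatter(dateFmt)” and “ax.xaxis.set_minor_locator(months)” to set the label format of the major ticks and the positions of the minor ticks.

<4> Call “ax.tick_params(axis="both",direction="out",labelsize=10)” to set whether the ticks point inward or outward relative to the spines, and the size of the tick labels.

<5> Call the function drange(), which returns an array of dates computed from the start date, end date, and date interval. The start date date1 and the end date date2 are instances of the class date, and the interval delta is an instance of the class timedelta.

<6> Call the instance method plot_date() to draw the date-based time series line chart. Its arguments have the following meanings.

dates: if the parameter xdate is True, dates is interpreted as matplotlib dates.
y: the y-axis values corresponding to dates.
"b-": the line style and color of the line chart.
xdate: defaults to True, in which case the x-axis is interpreted as matplotlib dates.
alpha: the transparency of the line color.

<7> At the end of the code, call the instance method autofmt_xdate() to rotate the tick labels on the x-axis of the bottom subplot and to adjust the distance between the subplot edge and the bottom of the figure.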
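The pieces used in the walkthrough can be inspected on their own. The sketch below is illustrative and uses a shorter date range than the script above (an assumption, chosen so the output stays small). It also draws the line with plot() after calling xaxis_date(), which to my knowledge is what plot_date() does internally; newer matplotlib releases discourage plot_date(), so this form may be the safer one to rely on.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

date1 = datetime.date(2008, 4, 17)
date2 = datetime.date(2008, 5, 20)   # shorter range than the original script
delta = datetime.timedelta(days=5)

# drange() returns matplotlib date numbers (floats); the end date is excluded
dates = mdates.drange(date1, date2, delta)

# convert a date number back to a calendar date, and format it as a tick label
first_day = mdates.num2date(dates[0]).date()
label = mdates.DateFormatter("%m/%d/%y")(dates[0])
print(len(dates), first_day, label)

y = np.random.normal(100, 15, len(dates))

fig, ax = plt.subplots()
ax.xaxis_date()                       # treat x values as matplotlib dates
ax.plot(dates, y, "b-", alpha=0.7)    # plot() instead of plot_date()
fig.autofmt_xdate()
```

Call plt.show() at the end (without the Agg line) to display the figure.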

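III. Adding a probability density curve to a histogram

A histogram alone can describe the distribution of quantitative data. Adding a probability density curve to the histogram makes the shape of the distribution stand out more clearly.

(1) Code example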

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

mpl.rcParams["font.sans-serif"] = ["FangSong"]
mpl.rcParams["axes.unicode_minus"] = False

mu = 60.0
sigma = 2.0
x = mu + sigma * np.random.randn(500)

bins = 50

fig, ax = plt.subplots(1, 1)

n, bins, patches = ax.hist(x,
                           bins,
                           density=True,
                           histtype="bar",
                           facecolor="cornflowerblue",
                           edgecolor="white",
                           alpha=0.75)

y = ((1 / (np.power(2 * np.pi, 0.5) * sigma)) *
     np.exp(-0.5 * np.power((bins - mu) / sigma, 2)))

ax.plot(bins, y, color="orange", ls="--", lw=2)

ax.grid(ls=":", lw=1, color="gray", alpha=0.2)

ax.text(54, 0.2,
        r"$y=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$",
        {"color": "black", "fontsize": 20})

ax.set_xlabel("体重")
ax.set_ylabel("概率密度")
ax.set_title(r"体重的直方图: $\mu=60.0$, $\sigma=2.0$", fontsize=16)

plt.show()

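(2) Code walkthrough

<1> Call “mpl.rcParams["font.sans-serif"]=["FangSong"]” to set the Chinese font to FangSong.

<2> Call the instance method hist() to draw the histogram and assign its return values to the variables n, bins, and patches. Note: older code uses the parameter normed to make the y-axis show probability density. In newer matplotlib versions normed has been replaced by density (density was introduced in the 2.x series and normed was later removed), and passing both parameters at once raises an error.

<3> The call “np.random.randn(500)” returns a sample of size 500 from the standard normal distribution, that is, the normal distribution with mean 0 and standard deviation 1, which is what any normal distribution becomes after standardization. The statement “x = mu + sigma * np.random.randn(500)” rescales the sample to mean mu and standard deviation sigma.

<4> Set the number of bins to 50.

<5> Call “y = ((1/(np.power(2*np.pi,0.5)*sigma))*np.exp(-0.5*np.power((bins-mu)/sigma,2)))” to compute the probability density at each bin edge in the array bins. Then call the instance method plot() to draw the line through bins and y, which is the probability density curve.

<6> Use the instance method text() to add text to the plotting area. The text is written in the form “r"$...$"”, so it is rendered with mathtext.

<7> Use the instance methods set_xlabel(), set_ylabel(), and set_title() to add Chinese text to the plotting area; the title again uses mathtext for the formula part.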
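As a quick check on what density=True means, the sketch below uses np.histogram, which applies the same kind of density normalization, and confirms that the bar areas sum to 1. The fixed random seed and the names heights, edges, area, and pdf are assumptions introduced for illustration.

```python
import numpy as np

mu, sigma = 60.0, 2.0
rng = np.random.default_rng(0)              # fixed seed (assumption) for repeatability
x = mu + sigma * rng.standard_normal(500)   # rescaled standard normal sample

# density=True scales bar heights so that the total bar area equals 1
heights, edges = np.histogram(x, bins=50, density=True)
area = np.sum(heights * np.diff(edges))

# normal probability density evaluated at the 51 bin edges
pdf = (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * ((edges - mu) / sigma) ** 2)

print(area)       # total area under the histogram
print(pdf.max())  # never exceeds the peak density 1 / (sigma * sqrt(2 * pi))
```

Because both the bars and the curve are on a density scale, they can share one y-axis, which is why the curve overlays the histogram directly.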

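Additional notes

Besides adding a probability density curve to the histogram, we can shade an integration region under the curve to represent the probability that the value falls within the given interval.

(1) Import the class Polygon from the module patches; this class draws arbitrary polygons.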

from matplotlib.patches import Polygon

(2) Define the integration region.

integ_x = np.linspace(mu-2*sigma,mu+2*sigma,1000) 
integ_y = ((1/(np.power(2*np.pi,0.5)*sigma))* 
 np.exp(-0.5*np.power((integ_x-mu)/sigma,2))) 
area = [(mu-2*sigma,0),*zip(integ_x,integ_y),(mu+2*sigma,0)]

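(3) Draw the integration region. Setting closed=False means the polygon is not treated as a closed shape, so its outline does not join the last point back to the first.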

poly = Polygon(area,facecolor="gray",edgecolor="k",alpha=0.6,closed=False) 
ax.add_patch(poly)

(4) Add a plain annotation (without an arrow) whose content is the integral expression.

plt.text(0.45,0.2, 
 r"$\int_{\mu-2\sigma}^{\mu+2\sigma} y\mathrm{d}x$", 
 fontsize=20, 
 transform=ax.transAxes)

(5) Adding the Python code above to the original script and running the modified script produces the result shown in Figure 3-4.
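The shaded region covers the interval from mu-2*sigma to mu+2*sigma, so its area should be close to the familiar value of about 0.9545 for a normal distribution. The sketch below checks this numerically with a hand-written trapezoidal rule; the names prob and exact are illustrative assumptions.

```python
import math

import numpy as np

mu, sigma = 60.0, 2.0

# same integration grid as in step (2)
integ_x = np.linspace(mu - 2 * sigma, mu + 2 * sigma, 1000)
integ_y = ((1 / (np.power(2 * np.pi, 0.5) * sigma)) *
           np.exp(-0.5 * np.power((integ_x - mu) / sigma, 2)))

# trapezoidal rule written out explicitly
prob = float(np.sum((integ_y[1:] + integ_y[:-1]) / 2 * np.diff(integ_x)))

# closed form for P(|X - mu| < 2*sigma) of a normal variable
exact = math.erf(2 / math.sqrt(2))

print(prob, exact)
```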

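IV. Nesting sub-axes inside a plotting area

A figure is not limited to a single plotting area: we can also nest sub-axes inside a plotting area to show several plots together on one canvas.

(1) Code example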

import matplotlib.pyplot as plt
import numpy as np

mu = 75.0
sigma = 15.0
bins = 20
x = np.linspace(1, 100, 200)
y = np.random.normal(mu, sigma, 200)
fig, ax = plt.subplots()

# the main axes
ax.plot(x, y, ls="-", lw=2, color="steelblue")
ax.set_ylim(10, 170)

# this is an inset axes over the main axes
plt.axes([0.2, 0.6, 0.2, 0.2], facecolor="k")
count, bins, patches = plt.hist(y, bins, color="cornflowerblue")
plt.ylim(0, 28)
plt.xticks([])
plt.yticks([])

# this is an inset axes over the inset axes
plt.axes([0.21, 0.72, 0.05, 0.05])
y1 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(- (bins - mu) ** 2 / (2 * sigma ** 2))
plt.plot(bins, y1, ls="-", color="r")
plt.xticks([])
plt.yticks([])

# this is another inset axes over the main axes
plt.axes([0.65, 0.6, 0.2, 0.2], facecolor="k")
count, bins, patches = plt.hist(y, bins, color="cornflowerblue", density=True,
                                cumulative=True, histtype="step")
plt.ylim(0, 1.0)
plt.xticks([])
plt.yticks([])

# this is another inset axes over another inset axes
plt.axes([0.66, 0.72, 0.05, 0.05])
y2 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(- (bins - mu) ** 2 / (2 * sigma ** 2))
y2 = y2.cumsum()
y2 = y2 / y2[-1]
plt.plot(bins, y2, ls="-", color="r")
plt.xticks([])
plt.yticks([])

plt.show()

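(2) Code walkthrough

<1> Call “ax.plot(x,y,ls="-",lw=2,color="steelblue")” to draw the line chart in the main plotting area. The argument y is an array of 200 samples from a normal distribution with mean 75 and standard deviation 15.

<2> Nest the first sub-axes inside the main plotting area by calling “plt.axes([0.2,0.6,0.2,0.2],facecolor="k")”. The list “[0.2,0.6,0.2,0.2]” gives the position and size of the sub-axes as [left,bottom,width,height] in figure coordinates normalized to the range 0~1; the parameter facecolor sets the background color of the sub-axes, which is white by default. The histogram “plt.hist(y,bins,color="cornflowerblue")” is then drawn on the sub-axes at that position.

<3> On top of this sub-axes, call “plt.axes([0.21,0.72,0.05,0.05])” to create a further, smaller sub-axes, which achieves the nesting.

<4> On this nested sub-axes, call “plt.plot(bins,y1,ls="-",color="r")” to draw the probability density curve, and call “plt.xticks([])” and “plt.yticks([])” to remove the axis ticks. In the same way, call “plt.axes([0.65,0.6,0.2,0.2],facecolor="k")” and “plt.axes([0.66,0.72,0.05,0.05])” to create the other two sub-axes, completing the chain of nested sub-axes.

<5> On these two sub-axes, use “plt.hist(y,bins,color="cornflowerblue",density=True,cumulative=True,histtype="step")” to draw the cumulative step histogram and “plt.plot(bins,y2,ls="-",color="r")” to draw the distribution function curve. The overall idea of the code is therefore: first nest sub-axes inside the main plotting area, then nest smaller sub-axes inside those, and draw a statistical chart on each one to produce a combined display.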
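The rectangle passed to plt.axes() can be checked directly. The sketch below uses fig.add_axes(), the object-oriented counterpart of plt.axes() for a given figure, and reads back the position of the new axes; the variable names inset and bounds are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# [left, bottom, width, height] as fractions of the figure size
inset = fig.add_axes([0.2, 0.6, 0.2, 0.2], facecolor="k")
inset.set_xticks([])
inset.set_yticks([])

bounds = inset.get_position().bounds
print(bounds)                     # position of the nested axes in figure fractions
print(ax.get_position().bounds)   # position of the main axes, for comparison
```

Since the rectangle is relative to the figure rather than to the main axes, a "nested" axes is simply a second axes drawn on top; adjusting the main axes position does not move it.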

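V. Setting general date tick rules

We have already covered how to draw a date-based time series plot. To customize the date interval between the ticks on the x-axis, use the rrule tick locator, which handles general date tick rules. The code below shows how to apply it.

(1) Code example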

import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

fig, ax = plt.subplots()

# tick every 2nd easter
rule = mdates.rrulewrapper(mdates.YEARLY, byeaster=0, interval=2)
loc = mdates.RRuleLocator(rule)  # a Locator instance

dateFmt = mdates.DateFormatter("%m/%d/%y")  # a Formatter instance

# format the ticks
ax.xaxis.set_major_locator(loc)
ax.xaxis.set_major_formatter(dateFmt)

# set appearance parameters for ticks,ticklabels,and gridlines
ax.tick_params(axis="both", direction="out", labelsize=10)

date1 = datetime.date(2004, 5, 17)
date2 = datetime.date(2016, 6, 4)
delta = datetime.timedelta(days=5)
dates = mdates.drange(date1, date2, delta)
y = np.random.normal(120, 12, len(dates))

ax.plot_date(dates, y, "b-", alpha=0.7)

fig.autofmt_xdate()

plt.show()

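(2) Code walkthrough

<1> matplotlib.dates.rrulewrapper is a simple wrapper built on the class rrule from the rrule module of the dateutil package; it makes it possible to define arbitrary tick rules. The constructor parameters of the class rrule used here have the following meanings.

freq: one of YEARLY, MONTHLY, WEEKLY, DAILY, HOURLY, MINUTELY, or SECONDLY; the constant YEARLY has the value 0.
interval: the spacing between occurrences, in units of freq. With freq set to YEARLY and interval set to 2, the rule fires every two years.
byeaster: the offset in days from Easter Sunday. Passing 0 yields the date of Easter Sunday itself.

<2> The class RRuleLocator is the date tick locator that uses the wrapper rrulewrapper. Passing the instance loc to “ax.xaxis.set_major_locator(loc)” sets the positions of the major ticks on the x-axis.

<3> The meaning and usage of the rest of the code are not repeated here.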
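To see which dates such a rule produces, the underlying dateutil class can be used directly. In the sketch below, the dtstart and until bounds and the name easters are assumptions chosen to roughly match the date range of the plot; the rule itself uses the same freq, byeaster, and interval values as the script.

```python
import datetime

from dateutil.rrule import rrule, YEARLY

# every 2nd year, on Easter Sunday itself (offset 0)
easters = list(rrule(YEARLY, byeaster=0, interval=2,
                     dtstart=datetime.datetime(2004, 1, 1),    # assumed start
                     until=datetime.datetime(2016, 12, 31)))   # assumed end

for d in easters:
    print(d.date(), d.strftime("%A"))   # each date falls on a Sunday
```

These are the dates at which RRuleLocator would place major ticks for a comparable view range.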