5 Quick and Easy Ways to Visualize Data with Python Code

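Data visualization is an important part of data work. In the early stages of a project, we often do a lot of exploratory analysis to get a deeper understanding of the data. Visualizations make a dataset clearer and easier to understand, especially when it is high-dimensional.

Matplotlib is a popular Python library that makes it easy to create data visualizations. However, setting up the data, parameters, figures, and plotting calls every time you start a new project quickly becomes tedious and messy.

In this article we look at 5 data visualizations and write some quick and easy functions for them with Python's Matplotlib.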

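Scatter Plots

Scatter plots are great for showing the relationship between two variables, because you can see the raw distribution of the data directly. You can also color-code groups of data to see how the groups relate to each other, as in the first figure below. Want to visualize the relationship among three variables? No problem! Just use another parameter, such as point size, to encode the third variable, as in the second figure below.

Now for the code. We first import Matplotlib's pyplot under the alias "plt". We call plt.subplots() to create a new figure. We pass the X-axis and Y-axis data into the function and hand them to ax.scatter() to draw the scatter plot. We can also set the point size, point color, and alpha transparency. You can even put the Y axis on a logarithmic scale. Finally, the title and axis labels are set for the figure. As you can see, a single function builds the scatter plot end to end!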

import matplotlib.pyplot as plt
import numpy as np

def scatterplot(x_data, y_data, x_label="", y_label="", title="", color = "r", yscale_log=False):
    # Create the plot object
    _, ax = plt.subplots()

    # Plot the data, set the size (s), color and transparency (alpha)
    # of the points
    ax.scatter(x_data, y_data, s = 10, color = color, alpha = 0.75)

    # Optionally put the y-axis on a logarithmic scale
    if yscale_log:
        ax.set_yscale('log')

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
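To make the color-coding and point-size ideas concrete, here is a minimal sketch on synthetic data. The arrays `x`, `y`, `third`, and `in_group` are invented for illustration and are not from the original article; the `Agg` backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
third = rng.uniform(1.0, 10.0, size=100)  # hypothetical third variable
in_group = x > 0                          # hypothetical two-way grouping

fig, ax = plt.subplots()
# One scatter call per group gives each group its own color and legend entry.
# The s argument sets marker area, so scaling it by `third` encodes that variable.
ax.scatter(x[in_group], y[in_group], s=10 * third[in_group],
           color="r", alpha=0.75, label="x > 0")
ax.scatter(x[~in_group], y[~in_group], s=10 * third[~in_group],
           color="b", alpha=0.75, label="x <= 0")
ax.set_title("Color encodes the group, size encodes a third variable")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(loc="best")
```

Call plt.show() or fig.savefig(...) afterwards to view the result, depending on your environment.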

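Line Plots

Line plots are best used when you can clearly see that one variable varies strongly with another, that is, when the two have high covariance. Take the figure below as an example: we can clearly see that the percentage for every major varies a great deal over time. Plotting these as a scatter plot would be extremely cluttered and make the data hard to understand. A line plot suits this situation well because it gives a quick summary of how the two variables (percentage and time) vary together. Once again, we can group by color coding.

Here is the code for the line plot. It is quite similar to the scatter plot code above, with only minor changes to the variables: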

def lineplot(x_data, y_data, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()

    # Plot the best fit line, set the linewidth (lw), color and
    # transparency (alpha) of the line
    ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
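As a rough illustration of grouping by color on a line plot, the sketch below draws two made-up series on the same axes and prints each one's covariance with time. The group names and numbers are hypothetical, chosen only so that one series rises and the other falls.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up yearly percentages for two hypothetical groups
years = np.arange(2000, 2011)
series = {
    "Group A": 10.0 + 2.0 * (years - 2000),  # rises steadily
    "Group B": 40.0 - 1.5 * (years - 2000),  # falls steadily
}

fig, ax = plt.subplots()
for name, values in series.items():
    # Each call to ax.plot adds one line; Matplotlib's color cycle
    # gives every line a different color
    ax.plot(years, values, lw=2, label=name)
    # Off-diagonal entry of the 2x2 covariance matrix of (years, values)
    print(name, "covariance with time:", np.cov(years, values)[0, 1])

ax.set_title("Share by year")
ax.set_xlabel("Year")
ax.set_ylabel("Percentage")
ax.legend(loc="best")
```

A positive covariance means the series tends to rise with time and a negative one means it tends to fall; both show up as a clear slope in the line plot.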

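Histograms

Histograms are useful for viewing (or discovering) the distribution of data points. Take the frequency vs. IQ histogram below as an example: we can clearly see the concentration toward the center and where the median lies. We can also see that the data follow a Gaussian distribution. Using bars (rather than, say, scatter points) gives a clear view of the relative difference between the frequencies of the bins. Binning (that is, discretizing the values) helps us see the bigger picture; if we plotted every data point without binning, the visualization would contain a lot of noise and it would be hard to see what is really going on.

The code for a histogram in Matplotlib is shown below. There are two parameters to note. First, n_bins controls how many bins the histogram uses. More bins give finer detail, but they may also introduce noise and pull us away from the bigger picture. Second, cumulative is a Boolean that selects whether the histogram is cumulative or not, which is essentially choosing between the probability density function (PDF) and the cumulative distribution function (CDF).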

def histogram(data, n_bins, cumulative=False, x_label = "", y_label = "", title = ""):
    _, ax = plt.subplots()

    # ax.hist takes the number of bins through its `bins` keyword
    ax.hist(data, bins = n_bins, cumulative = cumulative, color = '#539caf')

    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
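To see what the cumulative flag changes, the following sketch draws the same synthetic sample twice, once as a regular histogram and once as a cumulative one, and keeps the counts that ax.hist returns. The IQ-like sample (mean 100, standard deviation 15) is an assumption made for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, IQ-like sample invented for illustration
rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=15, size=1000)

fig, (ax_left, ax_right) = plt.subplots(1, 2)

# ax.hist returns the per-bin counts, the bin edges, and the drawn patches
counts, edges, _ = ax_left.hist(data, bins=20, color="#539caf")
cum_counts, _, _ = ax_right.hist(data, bins=20, cumulative=True, color="#539caf")

ax_left.set_title("Frequency")
ax_right.set_title("Cumulative frequency")

# Each cumulative bar is the running total of the regular bars up to that bin
print(np.allclose(np.cumsum(counts), cum_counts))
```

With raw counts these are the frequency analogues of the PDF and CDF. Passing density=True to ax.hist normalizes the heights, so the cumulative version then ends at 1.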

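Suppose we want to compare the distributions of two variables in our data. One might think we would have to make two separate histograms and put them side by side to compare them. But there is a better way: we can overlay the histograms with varying transparency. In the figure below, the Uniform distribution is drawn with a transparency of 0.5 so that we can see the distribution behind it. This lets us view the difference between the two distributions directly on the same figure.

There are a few things to set up in code for overlaid histograms. First, we set the horizontal range to accommodate both variable distributions. From this range and the desired number of bins, we can then compute the width of each bin. Finally, we plot the two histograms on the same figure, with one of them slightly more transparent.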

# Overlay 2 histograms to compare them
def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
    # Set the bounds for the bins so that the two distributions are fairly compared
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
    binwidth = (data_range[1] - data_range[0]) / max_nbins

    if n_bins == 0:
        # Default: equal-width bin edges shared by both datasets
        bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
    else:
        bins = n_bins

    # Create the plot
    _, ax = plt.subplots()
    ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
    ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)

    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'best')
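The key step when overlaying histograms is giving both of them the same bin edges. Below is a small sketch of that step on synthetic Normal and Uniform samples (both invented for illustration). It builds the edges with np.linspace, a substitute for the np.arange call used in this article that makes it easy to guarantee the last edge lands exactly on the maximum.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples invented for illustration
rng = np.random.default_rng(2)
normal_data = rng.normal(loc=0.0, scale=1.0, size=1000)
uniform_data = rng.uniform(low=-3.0, high=3.0, size=1000)

# Shared range covering both samples, split into 20 equal-width bins
low = min(normal_data.min(), uniform_data.min())
high = max(normal_data.max(), uniform_data.max())
edges = np.linspace(low, high, 20 + 1)  # 20 bins need 21 edges

fig, ax = plt.subplots()
_, edges_normal, _ = ax.hist(normal_data, bins=edges, color="#539caf",
                             alpha=1, label="Normal")
_, edges_uniform, _ = ax.hist(uniform_data, bins=edges, color="#7663b0",
                              alpha=0.5, label="Uniform")
ax.legend(loc="best")

# Both histograms were binned on exactly the same edges
print(np.array_equal(edges_normal, edges_uniform))
```

Because the edges are identical, bars at the same position count values from the same interval, which is what makes the two distributions directly comparable.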

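Bar Plots

Bar plots are most effective when you want to visualize categorical data with few categories (probably fewer than 10). With too many categories, the bars become cluttered and the figure is hard to understand.

Bar plots suit categorical data because the size of each bar makes the differences between categories easy to see, and the categories themselves are easy to separate and color-code. We will look at three types of bar plot: regular, grouped, and stacked.

The first figure below is a regular bar plot. In the barplot() function, x_data holds the tick labels on the X axis and y_data holds the bar heights on the Y axis. The error bar is an extra line centered on each bar that can be drawn to show the standard deviation.

Grouped bar plots let us compare multiple categorical variables. In the second figure below, the first variable we compare is how the scores vary by group. We also compare the genders themselves through color coding. In the code, y_data_list is now a list of lists, where each sublist represents a different group. We then loop through the groups, and for each one we draw a bar at every tick on the X axis, color-coding each group.

Stacked bar plots are great for visualizing the categorical make-up of different variables. The third figure below is a stacked bar plot comparing server load from day to day. With the color-coded stacks, we can easily see which servers carry the most load on each day and how their loads compare with the other servers across all days. We again loop through each group, except this time we draw the new bars on top of the old ones rather than beside them.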

def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Draw bars, position them in the center of the tick mark on the x-axis
    ax.bar(x_data, y_data, color = '#539caf', align = 'center')

    # Draw error bars to show standard deviation, set ls to 'none'
    # to remove line between points
    ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 2, capthick = 2)

    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
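For a sense of where the bar heights and error values might come from, the sketch below computes a mean and a sample standard deviation per category from made-up measurements and draws them in one call. It passes yerr straight to ax.bar, which is an alternative to a separate ax.errorbar call; the category names and numbers are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up measurements for three hypothetical categories
samples = {
    "A": [4.1, 5.0, 4.6, 5.3],
    "B": [6.2, 5.8, 6.9, 6.4],
    "C": [3.0, 3.9, 3.4, 3.5],
}
names = list(samples)
means = [np.mean(values) for values in samples.values()]
# ddof=1 gives the sample standard deviation
stds = [np.std(values, ddof=1) for values in samples.values()]

fig, ax = plt.subplots()
# yerr draws an error bar of +/- one standard deviation on each bar;
# capsize adds small horizontal caps to the error bars
ax.bar(names, means, yerr=stds, capsize=4, color="#539caf", ecolor="#297083")
ax.set_xlabel("Category")
ax.set_ylabel("Mean value")
ax.set_title("Means with standard deviation")
```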

def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Running total of the heights already drawn at each x position
    bottom = np.zeros(len(x_data))

    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # The bottom of each bar is the top of everything stacked before it
        ax.bar(x_data, y_data_list[i], color = colors[i], bottom = bottom, align = 'center', label = y_data_names[i])
        bottom = bottom + np.asarray(y_data_list[i])

    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
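When there are three or more categories, each bar has to start at the sum of all the categories beneath it, not just the one drawn immediately before. The sketch below computes those starting heights with np.cumsum; the server names and load numbers are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

days = ["Mon", "Tue", "Wed"]
# Made-up loads: one row per hypothetical server, one column per day
loads = np.array([
    [3.0, 4.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 1.0],
])
names = ["server 1", "server 2", "server 3"]

# Running total down the rows, minus the row itself, gives the height
# at which each row's bars must start
bottoms = np.cumsum(loads, axis=0) - loads

fig, ax = plt.subplots()
for row, bottom, name in zip(loads, bottoms, names):
    ax.bar(days, row, bottom=bottom, label=name)
ax.set_ylabel("Load")
ax.legend(loc="upper right")
```

Here the first server starts at zero, the second starts on top of the first, and the third starts on top of the first two combined.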

def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Total width for all bars at one x location
    total_width = 0.8

    # Width of each individual bar
    ind_width = total_width / len(y_data_list)

    # Left edge of each bar relative to the x tick mark, so that the
    # whole cluster is centered about the tick
    alteration = np.arange(-(total_width/2), total_width/2, ind_width)

    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones; align = 'edge' makes the
        # offset the bar's left edge (x_data must be numeric)
        ax.bar(np.asarray(x_data) + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width, align = 'edge')

    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
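The only subtle part of a grouped bar plot is the horizontal offset of each bar. As a sketch of one way to compute it, the example below derives offsets for bar centers so that every cluster is symmetric about its tick. The scores, group names, and tick labels are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(3)             # numeric tick positions
scores = [[70, 82, 91],      # made-up values for hypothetical group 1
          [65, 88, 79]]      # made-up values for hypothetical group 2
names = ["group 1", "group 2"]

total_width = 0.8                        # width of one whole cluster
ind_width = total_width / len(scores)    # width of one bar
n = len(scores)
# Offsets of the bar centers, symmetric about zero
offsets = (np.arange(n) - (n - 1) / 2) * ind_width

fig, ax = plt.subplots()
for values, offset, name in zip(scores, offsets, names):
    # Bars are center-aligned by default, so x + offset is each bar's center
    ax.bar(x + offset, values, width=ind_width, label=name)

ax.set_xticks(x)
ax.set_xticklabels(["low", "mid", "high"])
ax.legend(loc="upper right")
```

With two groups and a total width of 0.8, the bar centers sit 0.2 to the left and 0.2 to the right of each tick, so the cluster spans 0.4 on either side.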

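Box Plots

We saw earlier that histograms are great for visualizing the distribution of a variable. But what if we need more information than that? Perhaps we want a clearer view of the spread of the data. Perhaps the median is quite different from the mean, which suggests there are many outliers. Or perhaps the distribution is heavily skewed, with many values concentrated on one side.

That is where box plots come in. A box plot gives us all of the information above. The bottom and top of the box are always the first and third quartiles (that is, the 25th and 75th percentiles of the data), and the band inside the box is always the second quartile (the median). The whiskers (the dashed lines extending from the ends of the box) show the range of the data.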
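To connect the picture to the numbers, here is a minimal sketch that computes the quartiles by hand with np.percentile and compares the resulting outliers with the points Matplotlib draws beyond the whiskers. It assumes Matplotlib's default whisker rule (whis=1.5, meaning whiskers reach at most 1.5 times the interquartile range past the box), and the ten data values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Box edges and the median line
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Under the default rule, points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

fig, ax = plt.subplots()
result = ax.boxplot(data)
# The "fliers" entry holds the individually drawn outlier points
drawn_fliers = result["fliers"][0].get_ydata()

print("quartiles:", q1, median, q3)
print("outliers computed by hand:", outliers)
print("outliers drawn by Matplotlib:", drawn_fliers)
```

Note how the single extreme value pulls the mean well above the median, which is exactly the situation where a box plot is more informative than a histogram alone.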

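Since a box plot is drawn for each group or variable, it is easy to set up. x_data is a list of the groups or variables. The Matplotlib function boxplot() makes a box plot for each column of y_data, or for each vector in y_data if it is a sequence, so each value in x_data corresponds to one column or vector in y_data. All that remains is to set the aesthetics of the plot.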

def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Draw boxplots, specifying desired style
    ax.boxplot(y_data
               # patch_artist must be True to control box fill
               , patch_artist = True
               # Properties of median line
               , medianprops = {'color': median_color}
               # Properties of box
               , boxprops = {'color': base_color, 'facecolor': base_color}
               # Properties of whiskers
               , whiskerprops = {'color': base_color}