A Roundup of Python Visualization Tools and Charts

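As the saying goes, "a picture is worth a thousand words." Data visualization is a core part of data science: when facing large volumes of data, it is hard to make sense of complex data without charts that present it intuitively. A good chart surfaces the important information hidden in the data and helps reveal its value for business and decision-making.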

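Common visualization libraries

1. Matplotlib

Matplotlib is a widely used Python data visualization library that integrates closely with Pandas, which makes data analysis and plotting convenient. It supports many chart types, such as line charts, scatter plots, bar charts, and histograms. Its main strength is ease of use: if your visualization needs are not too complex, a few lines of code are enough.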


import matplotlib.pyplot as plt
import numpy as np

# make data
np.random.seed(1)
x = 4 + np.random.normal(0, 1.5, 200)

# draw the histogram
plt.hist(x)
plt.show()

2. Seaborn

Seaborn is a visualization library built on top of matplotlib. Its main strength is that concise code can produce complex, good-looking charts.
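As a rough illustration of that conciseness, here is a minimal sketch. The data is synthetic and the column names (x, y, group) are made up for this example; only sns.scatterplot() itself is part of seaborn.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (hypothetical): two groups with different linear trends.
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "x": np.tile(np.linspace(0, 10, n), 2),
    "group": np.repeat(["a", "b"], n),
})
slope = df["group"].map({"a": 1.0, "b": 2.0})
df["y"] = slope * df["x"] + rng.normal(0, 1, len(df))

# One call draws both groups, colors them by `group`, and adds a legend.
ax = sns.scatterplot(data=df, x="x", y="y", hue="group")
ax.set_title("Scatter plot colored by group")
```

Calling plt.show() from matplotlib.pyplot afterwards displays the figure.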

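3. Plotly

Plotly is an open-source, interactive, browser-based graphing library for Python. Its main strength is interactive charts: it offers more than 30 chart types, including some that most libraries lack, such as contour plots, dendrograms, and 3D charts.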

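Common visualization charts

An effective chart should:

  • Convey the right and necessary information without distorting facts.

  • Be simple in design.

  • Express the information elegantly rather than obscure it.

  • Not overload the reader with information.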

Selva Prabhakaran

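The rest of this article systematically collects the most useful charts in data visualization, organized into 7 groups by visualization purpose: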

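I. Correlation

  1. Scatter plot

  2. Bubble plot

  3. Scatter plot with trend line

  4. Jittering with stripplot

  5. Counts plot

  6. Marginal histogram

  7. Marginal boxplot

  8. Correlogram (correlation heatmap)

  9. Pair plot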

II. Deviation

  1. Diverging bars

  2. Diverging texts

  3. Diverging dot plot

  4. Diverging lollipop chart with markers

  5. Area chart

III. Ranking

  1. Ordered bar chart

  2. Lollipop chart

  3. Dot plot

  4. Slope chart

  5. Dumbbell plot

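IV. Distribution

  1. Histogram for a continuous variable

  2. Histogram for a categorical variable

  3. Density plot

  4. Density curves with histogram

  5. Joy plot (overlapping density curves)

  6. Distributed dot plot

  7. Box plot

  8. Dot + box plot

  9. Violin plot

  10. Population pyramid

  11. Categorical plots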

V. Composition

  1. Waffle chart

  2. Pie chart

  3. Treemap

  4. Bar chart

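VI. Change

  1. Time series plot

  2. Time series with annotated peaks and troughs

  3. Autocorrelation plot

  4. Cross-correlation plot

  5. Time series decomposition plot

  6. Multiple time series

  7. Dual-axis plot (secondary Y axis)

  8. Time series with error bands

  9. Stacked area chart

  10. Unstacked area chart

  11. Calendar heat map

  12. Seasonal plot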

VII. Groups

  1. Dendrogram

  2. Cluster plot

  3. Andrews curves

  4. Parallel coordinates

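The code in this section uses matplotlib (together with seaborn, which builds on it). You can also choose any other visualization library, such as plotly, to produce the same visualizations.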
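The snippets in this section use the names pd, np, plt, and sns without importing them. A minimal setup sketch, assuming the conventional aliases (the article does not show its own setup code, and the font-size default is only a suggestion), could look like this:

```python
# Assumed common setup for the snippets in this section (conventional aliases).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Optional, assumed default: a slightly larger base font for all figures.
plt.rcParams.update({"font.size": 12})
```

Run this once before any of the examples; the datasets are then loaded over the network by pd.read_csv() or sns.load_dataset().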

I. Correlation

Correlation plots visualize the relationship between two or more variables, that is, how one variable changes with respect to another.

1. Scatter plot

A scatter plot is the classic, fundamental plot for studying the relationship between two variables. If the data contains multiple groups, you may want to draw each group in a different color. In matplotlib, you can do this conveniently with plt.scatter().

# Import dataset
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# Prepare Data
# Create as many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Draw Plot for Each Category
plt.figure(figsize=(16,10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    # wrap the RGBA tuple in a list so it is read as one color for all points
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category==category, :],
                s=20, c=[colors[i]], label=str(category))

# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)
plt.show()


2. Bubble plot

Sometimes you want to show a group of points inside a boundary to emphasize their importance. In this example, you take the records to be encircled from the dataframe and pass them to encircle(), defined in the code below.

from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")

# Step 1: Prepare Data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# As many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Step 2: Draw Scatterplot with unique color for each category
fig = plt.figure(figsize=(16,10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :],
                s='dot_size', c=[colors[i]], label=str(category),
                edgecolors='black', linewidths=.5)

# Step 3: Encircling
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x, y, ax=None, **kw):
    # draw the convex hull of the points (x, y) as a polygon patch
    if not ax: ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select data to be encircled
midwest_encircle_data = midwest.loc[midwest.state=='IN', :]

# Draw polygon surrounding vertices
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()

3. Scatter plot with trend line

If you want to understand how two variables change with respect to each other, the line of best fit is the way to go. The plot below shows how the line of best fit differs among the groups in the data. To disable the grouping and draw a single line of best fit for the whole dataset, remove the hue='cyl' parameter from the sns.lmplot() call below.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]

# Plot (robust=True requires the statsmodels package)
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                     height=7, aspect=1.6, robust=True, palette='tab10',
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()

Each regression line in its own column

Alternatively, you can show the line of best fit for each group in its own column. You can do this by setting col=groupingcolumn in the sns.lmplot() call.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]

# Each line in its own column
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy",
                     data=df_select,
                     height=7,
                     robust=True,
                     palette='Set1',
                     col="cyl",
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.show()

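4. Jittering with stripplot

Often multiple data points have exactly the same X and Y values. As a result, the points are drawn on top of each other and hide one another. To avoid this, jitter the points slightly so you can see them all. seaborn's stripplot() makes this convenient.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Draw Stripplot (recent seaborn versions require x and y as keyword arguments)
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()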
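To make the idea of jitter concrete, here is a small NumPy-only sketch. It is an illustration of the principle, not seaborn's actual implementation, and the data values are invented: each x coordinate is shifted by a small random offset so that identical points no longer coincide.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 60 points, but only 3 distinct (x, y) positions.
x = np.repeat([1.0, 2.0, 3.0], 20)
y = np.repeat([10.0, 20.0, 30.0], 20)

# Jitter: shift each x by a uniform random offset in [-0.25, 0.25].
jitter = 0.25
x_jittered = x + rng.uniform(-jitter, jitter, size=x.shape)

distinct_before = len(set(zip(x, y)))
distinct_after = len(set(zip(x_jittered, y)))
print(distinct_before, distinct_after)
```

Before jittering only three markers would be visible; afterwards the overlapping points spread out horizontally while staying within 0.25 of their true position.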

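5. Counts plot

Another way to avoid overlapping points is to increase the size of a dot according to how many points lie at that spot. The larger the dot, the more points are concentrated there.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy','cty']).size().reset_index(name='counts')

# Draw counts plot
# Recent seaborn versions expect a scalar stripplot size, so plain matplotlib
# scatter is used here; the marker area grows with the count.
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.scatter(df_counts.cty, df_counts.hwy, s=(df_counts.counts*2)**2)
ax.set(xlabel='cty', ylabel='hwy')

# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()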
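The counting step behind this plot can be checked on a toy table. The sketch below uses invented (cty, hwy) values and an arbitrary scaling factor of 40; it only shows how groupby().size() turns repeated pairs into counts that can drive marker size.

```python
import pandas as pd

# Hypothetical toy data: (cty, hwy) pairs, some repeated.
df = pd.DataFrame({
    "cty": [18, 18, 18, 21, 21, 16],
    "hwy": [29, 29, 29, 30, 30, 25],
})

# Count how many rows share each (hwy, cty) pair.
df_counts = df.groupby(["hwy", "cty"]).size().reset_index(name="counts")

# Marker area to pass as `s=` to a scatter call; the factor 40 is arbitrary.
marker_area = df_counts["counts"] * 40
print(df_counts)
```

The pair that occurs three times gets a marker three times the area of the pair that occurs once.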

6. Marginal histogram

A marginal histogram has histograms along the X and Y axes. It visualizes the relationship between X and Y together with the univariate distribution of each. This plot is often used in exploratory data analysis (EDA).

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16,10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# histogram in the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()

# histogram on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()

7. Marginal boxplot

A marginal boxplot serves a similar purpose to a marginal histogram. However, the boxplots help pinpoint the median, 25th percentile, and 75th percentile of X and Y.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16,10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Add a boxplot in each margin (y= gives a vertical box, x= a horizontal one)
sns.boxplot(y=df.hwy, ax=ax_right)
sns.boxplot(x=df.displ, ax=ax_bottom)

# Decorations ------------------
# Remove x axis name for the boxplot
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')

# Main Title, Xlabel and YLabel
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')

# Set font size of different components
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()
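The three statistics a box marks can also be computed directly. The following sketch uses a made-up series of nine values and pandas' default linear interpolation; the 1.5 × IQR fences are the usual default convention for whiskers in matplotlib and seaborn.

```python
import pandas as pd

# Hypothetical sample: the integers 1..9.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th percentile, median, and 75th percentile (the box edges and center line).
q25, median, q75 = values.quantile([0.25, 0.5, 0.75])

# Interquartile range and the conventional 1.5 * IQR fences: whiskers reach
# the most extreme data points that still lie inside these fences.
iqr = q75 - q25
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr
print(q25, median, q75, lower_fence, upper_fence)
```

Points outside the fences are the ones a boxplot draws individually as outliers.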

8. Correlogram (correlation heatmap)

A correlogram shows the correlation metric between every possible pair of numeric variables in a given dataframe (or 2D array).

# Import Dataset
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

# Correlation matrix of the numeric columns only (text columns are skipped)
corr = df.corr(numeric_only=True)

# Plot
plt.figure(figsize=(12,10), dpi=80)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True)

# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
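What the heatmap colors encode can be seen on a tiny table. This sketch uses invented columns: y rises in lockstep with x, z falls in lockstep with x, and a text column is excluded through numeric_only=True (available in pandas 1.5 and later).

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],        # increases perfectly with x
    "z": [4, 3, 2, 1],        # decreases perfectly with x
    "label": list("abcd"),    # non-numeric, left out of the matrix
})

# Pairwise Pearson correlation of the numeric columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

The x/y entry comes out at +1 and the x/z entry at -1, which are the two ends of the RdYlGn color scale used above.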

9. Pair plot

The pair plot is a favorite in exploratory analysis for understanding the relationship between every possible pair of numeric variables. It is a must-have tool for bivariate analysis.

# Load Dataset
df = sns.load_dataset('iris')

# Plot (pairplot creates its own figure)
sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

# Load Dataset
df = sns.load_dataset('iris')

# Plot with a regression line in each panel
sns.pairplot(df, kind="reg", hue="species")
plt.show()

II. Deviation

10. Diverging bars

If you want to see how items vary on a single metric and visualize the order and magnitude of that variation, diverging bars are a great tool. They help quickly differentiate the performance of groups in the data, are intuitive, and convey the point immediately.

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
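The diverging charts in this group rely on the same normalization step: the metric is converted to a z-score, so bars left of zero are below the mean and bars right of zero are above it. A minimal sketch on three made-up values:

```python
import pandas as pd

# Hypothetical mileage values.
mpg = pd.Series([10.0, 20.0, 30.0])

# z-score: distance from the mean in units of the sample standard deviation
# (pandas uses ddof=1 by default).
z = (mpg - mpg.mean()) / mpg.std()

# Same coloring rule as in the chart code: below average red, otherwise green.
colors = ["red" if v < 0 else "green" for v in z]
print(z.tolist(), colors)
```

Here the mean is 20 and the sample standard deviation is 10, so the three values map to -1, 0, and 1.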

11. Diverging texts

Diverging texts are similar to diverging bars, and are preferred when you want to show the value of each item within the chart in a neat, presentable way.

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,14), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center', fontdict={'color':'red' if x < 0 else 'green', 'size':14})

# Decorations
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()

12. Diverging dot plot

The diverging dot plot is also similar to diverging bars. However, compared with diverging bars, the absence of bars reduces the contrast and disparity between groups.

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,16), dpi=80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color':'white'})

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()

13. Diverging lollipop chart with markers

A lollipop chart with markers provides a flexible way to visualize divergence: it puts emphasis on any significant data points you want to draw attention to and annotates the chart with appropriate reasoning.

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = 'black'

# color fiat differently
df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

plt.figure(figsize=(14,16), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)

# Annotate
plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',
             fontsize=15, ha='center', va='center',
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')

# Add Patches
p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)

# Decorate
plt.title('Diverging Lollipop of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()

14. Area chart

By coloring the area between the axis and the line, the area chart emphasizes not only the peaks and troughs but also the duration of the highs and lows. The longer the highs last, the larger the area under the line.

import numpy as np
import pandas as pd

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
# month-over-month change of the savings rate, in percent
y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)

# Annotate
plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')

# Decorations
xtickvals = [str(m)[:3].upper() + "-" + str(y) for y, m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment':'center', 'verticalalignment':'center_baseline'})
plt.ylim(-35, 35)
plt.xlim(1, 100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
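The quantity being shaded is a period-over-period percentage change. As a small check on invented numbers, the diff()/shift() expression from the code above gives the same result as pandas' built-in pct_change():

```python
import pandas as pd

# Hypothetical series: up 10% and then down 10%.
s = pd.Series([10.0, 11.0, 9.9])

# Same expression as y_returns above: change divided by the previous value.
returns_manual = (s.diff().fillna(0) / s.shift(1)).fillna(0) * 100

# Equivalent built-in helper.
returns_builtin = s.pct_change().fillna(0) * 100

print(returns_manual.round(2).tolist())
```

The first element is 0 because there is no previous value to compare against; positive entries are filled green and negative entries red in the chart.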

III. Ranking

15. Ordered bar chart

An ordered bar chart conveys the rank order of the items effectively. Adding the metric value above each bar also lets the reader get precise information from the chart itself. It is a classic way to visualize items based on counts or any given metric.

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# mean city mileage per manufacturer
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)

# Annotate Text
for i, cty in enumerate(df.cty):
    ax.text(i, cty+0.5, round(cty, 1), horizontalalignment='center')

# Title, Label, Ticks and Ylim
ax.set_title('Bar Chart for City Mileage', fontdict={'size':22})
ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)

# Add patches to color the X axis labels
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
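The ranking charts in this group share one data-preparation step: aggregate the metric per item, then sort. A minimal sketch on an invented five-row table (the manufacturer names and mileage values are made up):

```python
import pandas as pd

# Hypothetical raw data: one row per car model.
df_raw = pd.DataFrame({
    "manufacturer": ["audi", "audi", "ford", "ford", "honda"],
    "cty": [18, 20, 11, 13, 24],
})

# Mean city mileage per manufacturer, sorted ascending so the best is last.
ranked = (
    df_raw.groupby("manufacturer")["cty"]
    .mean()
    .sort_values()
    .reset_index()
)
print(ranked)
```

After reset_index() the row index 0, 1, 2 doubles as the bar position, which is why the chart code can pass df.index straight to vlines() and xticks().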

16. Lollipop chart

The lollipop chart serves a similar purpose to the ordered bar chart in a visually pleasing way.

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# mean city mileage per manufacturer
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Lollipop Chart for City Mileage', fontdict={'size':22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment':'right', 'size':12})
ax.set_ylim(0, 30)

# Annotate
for row in df.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2), horizontalalignment='center', verticalalignment='bottom', fontsize=14)

plt.show()

17. Dot plot

The dot plot conveys the rank order of the items. Because the points are aligned along a horizontal axis, it is easier to see how far apart they are.

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# mean city mileage per manufacturer
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Dot Plot for City Mileage', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment':'right'})
ax.set_xlim(10, 27)
plt.show()

18. Slope chart

A slope chart is best suited for comparing the "before" and "after" positions of a given person or item.

import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")

left_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1-y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]

# draw line
# https://stackoverflow.com/questions/36470343/how-to-draw-a-line-with-matplotlib/36479941
def newline(p1, p2, color='black'):
    # red if the value fell between p1 and p2, green otherwise
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='red' if p1[1]-p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1, 1, figsize=(14,14), dpi=80)

# Vertical Lines
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)

# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1, p1], [3, p2])
    ax.text(1-0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right', verticalalignment='center', fontdict={'size':14})
    ax.text(3+0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left', verticalalignment='center', fontdict={'size':14})

# 'Before' and 'After' Annotations
ax.text(1-0.05, 13000, 'BEFORE', horizontalalignment='right', verticalalignment='center', fontdict={'size':18, 'weight':700})
ax.text(3+0.05, 13000, 'AFTER', horizontalalignment='left', verticalalignment='center', fontdict={'size':18, 'weight':700})

# Decoration
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size':22})
ax.set(xlim=(0, 4), ylim=(0, 14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1, 3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize=12)

# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()

19. Dumbbell plot

A dumbbell plot conveys the "before" and "after" positions of various items along with their rank order. It is very useful if you want to visualize the effect of a particular project or initiative on different objects.

import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)

# Func to draw line segment
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='skyblue')
    ax.add_line(l)
    return l

# Figure and Axes
fig, ax = plt.subplots(1, 1, figsize=(14,14), facecolor='#f7f7f7', dpi=80)

# Vertical Lines
ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)

# Line Segments
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])

# Decoration
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbbell Chart: Pct Change - 2013 vs 2014", fontdict={'size':22})
ax.set(xlim=(0, .25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05, .1, .15, .20])
# labels match the tick positions set above
ax.set_xticklabels(['5%', '10%', '15%', '20%'])
plt.show()

IV. Distribution

20. Histogram for a continuous variable

A histogram shows the frequency distribution of a given variable. The representation below groups the frequency bars by a categorical variable, giving better insight into the continuous variable and the categorical variable in tandem.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data: one list of x_var values per group
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [grp[x_var].values.tolist() for _, grp in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi=80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b, 1) for b in bins[::3]])
plt.show()

21. Histogram for a categorical variable

The histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can visualize the distribution in connection with another categorical variable represented by the colors.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data: one list of x_var values per group
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [grp[x_var].values.tolist() for _, grp in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi=80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, df[x_var].nunique(), stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
# The categorical axis already carries the manufacturer names, so only rotate
# them; passing len(bins) tick positions with one label fewer raises an error
# in recent matplotlib versions.
plt.xticks(rotation=90, horizontalalignment='left')
plt.show()


22. Density plot

Density plots are a commonly used tool for visualizing the distribution of a continuous variable. By grouping them by the "response" variable, you can inspect the relationship between X and Y. The case below is for illustration, showing how the distribution of city mileage varies with the number of cylinders.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot (fill=True replaces the deprecated shade=True)
plt.figure(figsize=(16,10), dpi=80)
sns.kdeplot(df.loc[df['cyl']==4, "cty"], fill=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl']==5, "cty"], fill=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl']==6, "cty"], fill=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl']==8, "cty"], fill=True, color="orange", label="Cyl=8", alpha=.7)

# Decoration
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()

23. Density curves with histogram

A density curve with a histogram brings together the information conveyed by the two plots, so you can have both in a single figure instead of two.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
# sns.distplot is deprecated; histplot with kde=True and stat="density"
# draws the same combination of histogram and density curve.
plt.figure(figsize=(13,10), dpi=80)
sns.histplot(df.loc[df['class']=='compact', "cty"], color="dodgerblue", label="Compact", kde=True, stat="density", alpha=.7, line_kws={'linewidth':3})
sns.histplot(df.loc[df['class']=='suv', "cty"], color="orange", label="SUV", kde=True, stat="density", alpha=.7, line_kws={'linewidth':3})
sns.histplot(df.loc[df['class']=='minivan', "cty"], color="g", label="minivan", kde=True, stat="density", alpha=.7, line_kws={'linewidth':3})
plt.ylim(0, 0.35)

# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()

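24. Joy plot (overlapping density curves)

A joy plot lets the density curves of different groups overlap, which is a good way to visualize the distributions of a large number of groups relative to each other. It is pleasing to the eye and conveys the right information clearly. It can be built with the joypy package, which is based on matplotlib.

# !pip install joypy
import joypy

# Import Data
mpg = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot (joyplot creates its own figure)
fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14,10))

# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()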

25. Distributed dot plot

The distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the more data points are concentrated in that region. Coloring the median differently makes the real position of each group immediately apparent.

import matplotlib.patches as mpatches

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
cyl_colors = {4:'tab:red', 5:'tab:green', 6:'tab:blue', 8:'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)

# Mean and Median city mileage by make
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw.groupby('manufacturer')[['cty']].median()

# Draw horizontal lines
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')

# Draw the Dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer==make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index==make, :], s=75, c='firebrick')

# Annotate (raw string so the mathtext spacing commands are not read as escapes)
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size':12}, color='firebrick')

# Decorations
red_patch = plt.plot([], [], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment':'right'}, alpha=0.7)
ax.set_xlim(1, 40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()

26. Box plot

The box plot is a great way to visualize a distribution, keeping the median, 25th percentile, 75th percentile, and outliers in view. However, you need to be careful when interpreting the size of the boxes, which can give a distorted impression of the number of points in that group. Manually annotating the number of observations in each box helps overcome this drawback.

For example, the first two boxes on the left are the same size even though they contain 5 and 47 observations respectively. It is therefore necessary to write the number of observations in each group.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, notch=False)

# Add N Obs inside boxplot (optional)
def add_n_obs(df, group_col, y):
    medians_dict = {grp[0]: grp[1][y].median() for grp in df.groupby(group_col)}
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    n_obs = df.groupby(group_col)[y].size().values
    for (x, xticklabel), n_ob in zip(enumerate(xticklabels), n_obs):
        plt.text(x, medians_dict[xticklabel]*1.01, "#obs : "+str(n_ob), horizontalalignment='center', fontdict={'size':14}, color='white')

add_n_obs(df, group_col='class', y='hwy')

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.ylim(10, 40)
plt.show()

27. Dot + box plot

The dot + box plot conveys similar information to a grouped boxplot. In addition, the dots give a sense of how many data points lie in each group.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, hue='cyl')
sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)

# Separator line between neighboring classes
for i in range(len(df['class'].unique())-1):
    plt.vlines(i+.5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.legend(title='Cylinders')
plt.show()

28. Violin plot

The violin plot is a visually pleasing alternative to the box plot. The shape or area of the violin depends on the number of observations it holds. However, violin plots can be harder to read and are not commonly used in professional settings.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')

# Decoration
plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.show()

29. Population pyramid

A population pyramid can be used to show the distribution of groups ordered by volume. It can also show the stage-by-stage filtering of a population, as below, where it shows how many people pass through each stage of a marketing funnel.

# Read data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
group_col = 'Gender'
order_of_bars = df.Stage.unique()[::-1]
colors = [plt.cm.Spectral(i/float(len(df[group_col].unique())-1)) for i in range(len(df[group_col].unique()))]

for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group, :], order=order_of_bars, color=c, label=group)

# Decorations
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
plt.legend()
plt.show()

30. Categorical plots

The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.

# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
g = sns.catplot(x="alive", col="deck", col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count", height=3.5, aspect=.8,
                palette='tab20')

plt.show()

# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
sns.catplot(x="age", y="embark_town",
            hue="sex", col="class",
            data=titanic[titanic.embark_town.notnull()],
            orient="h", height=5, aspect=1, palette="tab10",
            kind="violin", dodge=True, cut=0, bw=.2)

V. Composition

31. Waffle Chart

A waffle chart can be created with the pywaffle package and is used to show the composition of groups within a larger population.

#! pip install pywaffle
# Reference: https://stackoverflow.com/questions/41400136/how-to-do-waffle-charts-in-python-square-piechart
from pywaffle import Waffle

# Import
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')
n_categories = df.shape[0]
colors = [plt.cm.inferno_r(i/float(n_categories)) for i in range(n_categories)]

# Draw Plot and Decorate
fig = plt.figure(
    FigureClass=Waffle,
    plots={
        '111': {
            'values': df['counts'],
            # index=False so that n[0] is the class name and n[1] is its count
            'labels': ["{0} ({1})".format(n[0], n[1]) for n in df[['class','counts']].itertuples(index=False)],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12},
            'title': {'label': '# Vehicles by Class', 'loc': 'center', 'fontsize': 18}
        },
    },
    rows=7,
    colors=colors,
    figsize=(16,9)
)
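
When only `rows` is given, a waffle grid needs enough columns to hold one block per counted unit. The sketch below shows that arithmetic in plain Python. `waffle_grid` is a hypothetical helper written for illustration and is not part of pywaffle, which does its own layout; the counts are made up.

```python
import math

def waffle_grid(counts, rows):
    """Return (columns, blocks) for a grid with one block per counted unit.

    Hypothetical helper for illustration; pywaffle does its own layout.
    """
    total = sum(counts.values())
    columns = math.ceil(total / rows)  # enough columns to fit every block
    # Expand each category into as many blocks as it has units
    blocks = [name for name, n in counts.items() for _ in range(n)]
    return columns, blocks

counts = {"compact": 12, "suv": 20, "pickup": 9}  # made-up counts
columns, blocks = waffle_grid(counts, rows=7)
print(columns, len(blocks))
```

The share of blocks carrying a category's color is what the reader perceives as that category's share of the whole.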

#! pip install pywaffle
from pywaffle import Waffle

# Import
# df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
# By Class Data
df_class = df_raw.groupby('class').size().reset_index(name='counts_class')
n_categories = df_class.shape[0]
colors_class = [plt.cm.Set3(i/float(n_categories)) for i in range(n_categories)]

# By Cylinders Data
df_cyl = df_raw.groupby('cyl').size().reset_index(name='counts_cyl')
n_categories = df_cyl.shape[0]
colors_cyl = [plt.cm.Spectral(i/float(n_categories)) for i in range(n_categories)]

# By Make Data
df_make = df_raw.groupby('manufacturer').size().reset_index(name='counts_make')
n_categories = df_make.shape[0]
colors_make = [plt.cm.tab20b(i/float(n_categories)) for i in range(n_categories)]

# Draw Plot and Decorate
fig = plt.figure(
    FigureClass=Waffle,
    plots={
        '311': {
            'values': df_class['counts_class'],
            'labels': ["{1}".format(n[0], n[1]) for n in df_class[['class','counts_class']].itertuples()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12, 'title': 'Class'},
            'title': {'label': '# Vehicles by Class', 'loc': 'center', 'fontsize': 18},
            'colors': colors_class
        },
        '312': {
            'values': df_cyl['counts_cyl'],
            'labels': ["{1}".format(n[0], n[1]) for n in df_cyl[['cyl','counts_cyl']].itertuples()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12, 'title': 'Cyl'},
            'title': {'label': '# Vehicles by Cyl', 'loc': 'center', 'fontsize': 18},
            'colors': colors_cyl
        },
        '313': {
            'values': df_make['counts_make'],
            'labels': ["{1}".format(n[0], n[1]) for n in df_make[['manufacturer','counts_make']].itertuples()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12, 'title': 'Manufacturer'},
            'title': {'label': '# Vehicles by Make', 'loc': 'center', 'fontsize': 18},
            'colors': colors_make
        }
    },
    rows=9,
    figsize=(16,14)
)

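32. Pie Chart

A pie chart is a classic way to show the composition of groups. However, it is generally discouraged nowadays because the areas of the slices can be misleading. So if you do use a pie chart, it is strongly recommended to write out the percentage or count for each slice explicitly.

# Import
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size()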
  
# Make the plot with pandas  
df.plot(kind='pie', subplots=True, figsize=(8,8), dpi=80)  
plt.title("Pie Chart of Vehicle Class - Bad")  
plt.ylabel("")  
plt.show()

# Import
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')

# Draw Plot
fig, ax = plt.subplots(figsize=(12,7), subplot_kw=dict(aspect="equal"), dpi=80)
data = df['counts']
categories = df['class']
explode = [0,0,0,0,0,0.1,0]

# Format each wedge label as "percent (count)"; round so the count is not truncated
def func(pct, allvals):
    absolute = int(round(pct/100.*np.sum(allvals)))
    return "{:.1f}% ({:d} )".format(pct, absolute)

wedges, texts, autotexts = ax.pie(data,
                                  autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"),
                                  colors=plt.cm.Dark2.colors,
                                  startangle=140,
                                  explode=explode)
  
# Decoration  
ax.legend(wedges, categories, title="Vehicle Class", loc="center left", bbox_to_anchor=(1,0,0.5,1))  
plt.setp(autotexts, size=10, weight=700)  
ax.set_title("Class of Vehicles: Pie Chart")  
plt.show()
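
matplotlib passes only the percentage of each wedge to the `autopct` callback, so the count has to be recovered from it. Below is a small sketch of that conversion on made-up class counts. It rounds instead of truncating, since a percentage that has gone through floating-point division may land just below the whole number. The helper names `pct_to_count` and `wedge_label` are illustrative.

```python
def pct_to_count(pct, values):
    """Recover the absolute count of a wedge from its percentage."""
    return int(round(pct / 100.0 * sum(values)))

def wedge_label(pct, values):
    """Build a 'percent (count)' label for one wedge."""
    return "{:.1f}% ({:d})".format(pct, pct_to_count(pct, values))

counts = [5, 47, 41, 11, 33, 35, 62]  # illustrative counts
total = sum(counts)
labels = [wedge_label(100.0 * c / total, counts) for c in counts]
print(labels[0])
```

Printing both numbers on each slice follows the advice above: the reader does not have to judge areas by eye.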

33. Treemap

A treemap is similar to a pie chart, but it does a better job of showing each group's contribution without misleading the reader.

# pip install squarify
import squarify

# Import Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')
labels = df.apply(lambda x: str(x['class']) + "\n (" + str(x['counts']) + ")", axis=1)
sizes = df['counts'].values.tolist()
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
  
# Draw Plot  
plt.figure(figsize=(12,8), dpi=80)  
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)  
  
# Decorate  
plt.title('Treemap of Vehicle Class')
plt.axis('off')  
plt.show()

34. Bar Chart

A bar chart is a classic way to visualize items by count or by any given metric. In the chart below, I used a different color for each item, but you would typically pick one color for all items unless you are coloring them by group. The color names are stored in all_colors in the code below. You can change the bar colors by setting the color parameter of plt.bar().

import random
import matplotlib.colors as mcolors

# Import Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('manufacturer').size().reset_index(name='counts')
n = df['manufacturer'].nunique()
all_colors = list(mcolors.cnames.keys())
random.seed(100)
c = random.choices(all_colors, k=n)

# Plot Bars
plt.figure(figsize=(16,10), dpi=80)
plt.bar(df['manufacturer'], df['counts'], color=c, width=.5)
# Write the count above each bar
for i, val in enumerate(df['counts'].values):
    plt.text(i, val, float(val), horizontalalignment='center', verticalalignment='bottom', fontdict={'fontweight':500,'size':12})

# Decoration
plt.xticks(rotation=60, horizontalalignment='right')
plt.title("Number of Vehicles by Manufacturers", fontsize=22)
plt.ylabel('# Vehicles')  
plt.ylim(0,45)  
plt.show()

VI. Change

35. Time Series Plot

A time series plot is used to visualize how a given metric changes over time. Here you can see how air passenger traffic changed between 1949 and 1960.

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
  
# Draw Plot  
plt.figure(figsize=(16,10), dpi=80)  
plt.plot('date','traffic', data=df, color='tab:red')  
  
# Decoration  
plt.ylim(50,750)
xtick_location = df.index.tolist()[::12]
xtick_labels = [x[-4:] for x in df.date.tolist()[::12]]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=0, fontsize=12, horizontalalignment='center', alpha=.7)
plt.yticks(fontsize=12, alpha=.7)
plt.title("Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.grid(axis='both', alpha=.3)

# Remove borders
plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.3)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.3)
plt.show()

36. Time Series with Peaks and Troughs Annotated

The time series below plots all the peaks and troughs and annotates the occurrence of selected special events.

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Get the Peaks and Troughs
data = df['traffic'].values
doublediff = np.diff(np.sign(np.diff(data)))
peak_locations = np.where(doublediff == -2)[0] + 1

doublediff2 = np.diff(np.sign(np.diff(-1*data)))
trough_locations = np.where(doublediff2 == -2)[0] + 1
  
# Draw Plot  
plt.figure(figsize=(16,10), dpi=80)  
plt.plot('date','traffic', data=df, color='tab:blue', label='Air Traffic')  
plt.scatter(df.date[peak_locations], df.traffic[peak_locations], marker=mpl.markers.CARETUPBASE, color='tab:green', s=100, label='Peaks')  
plt.scatter(df.date[trough_locations], df.traffic[trough_locations], marker=mpl.markers.CARETDOWNBASE, color='tab:red', s=100, label='Troughs')  
  
# Annotate  
for t, p in zip(trough_locations[1::5], peak_locations[::3]):
    plt.text(df.date[p], df.traffic[p]+15, df.date[p], horizontalalignment='center', color='darkgreen')
    plt.text(df.date[t], df.traffic[t]-35, df.date[t], horizontalalignment='center', color='darkred')

# Decoration
plt.ylim(50,750)
xtick_location = df.index.tolist()[::6]
xtick_labels = df.date.tolist()[::6]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=90, fontsize=12, alpha=.7)
plt.title("Peak and Troughs of Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.yticks(fontsize=12, alpha=.7)  
  
# Lighten borders  
plt.gca().spines["top"].set_alpha(.0)  
plt.gca().spines["bottom"].set_alpha(.3)  
plt.gca().spines["right"].set_alpha(.0)  
plt.gca().spines["left"].set_alpha(.3)  
  
plt.legend(loc='upper left')  
plt.grid(axis='y', alpha=.3)  
plt.show()
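
The peak and trough detection above relies on sign changes of the slope. The sketch below walks through the same idea on a tiny made-up series. `find_peaks_and_troughs` is an illustrative helper name, and flat stretches (slope sign 0) are not treated as turning points.

```python
import numpy as np

def find_peaks_and_troughs(values):
    """Return index arrays of local maxima and local minima."""
    values = np.asarray(values)
    # Sign of the slope between neighbours: +1 rising, -1 falling
    slope_sign = np.sign(np.diff(values))
    turning = np.diff(slope_sign)
    # A change from +1 to -1 (difference of -2) marks a peak
    peaks = np.where(turning == -2)[0] + 1
    # A change from -1 to +1 (difference of +2) marks a trough
    troughs = np.where(turning == 2)[0] + 1
    return peaks, troughs

series = [1, 3, 2, 5, 4, 6, 1]
peaks, troughs = find_peaks_and_troughs(series)
print(peaks, troughs)
```

The `+ 1` shifts the index back onto the original series, because each call to `np.diff` shortens the array by one element.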

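37. Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plots

The ACF plot shows the correlation of a time series with its own lags. Each vertical line (on the autocorrelation plot) represents the correlation between the series and one of its lags, starting from lag 0. The blue shaded region in the plot marks the significance bounds. Lags that extend beyond the blue region are significant.

So how should this be interpreted?

For AirPassengers, we see that up to 14 lags extend beyond the blue region and are therefore significant. Because the data is monthly, this means that air passenger traffic from up to 14 months back carries information about the traffic seen today.

PACF, on the other hand, shows the autocorrelation of any given lag (of the time series) against the current series, with the contributions of the intermediate lags removed.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')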
  
# Draw Plot  
fig,(ax1, ax2)= plt.subplots(1,2,figsize=(16,6), dpi=80)  
plot_acf(df.traffic.tolist(), ax=ax1, lags=50)  
plot_pacf(df.traffic.tolist(), ax=ax2, lags=20)  
  
# Decorate  
# lighten the borders  
ax1.spines["top"].set_alpha(.3); ax2.spines["top"].set_alpha(.3)  
ax1.spines["bottom"].set_alpha(.3); ax2.spines["bottom"].set_alpha(.3)  
ax1.spines["right"].set_alpha(.3); ax2.spines["right"].set_alpha(.3)  
ax1.spines["left"].set_alpha(.3); ax2.spines["left"].set_alpha(.3)  
  
# font size of tick labels  
ax1.tick_params(axis='both', labelsize=12)  
ax2.tick_params(axis='both', labelsize=12)  
plt.show()
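
To make the ACF less of a black box, here is a minimal NumPy sketch of the sample autocorrelation on a synthetic series with a 12-step cycle. `sample_acf` is an illustrative helper, and the 1.96/sqrt(n) bound is the simple white-noise approximation, which is not necessarily the band that `plot_acf` draws.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[: len(x) - k] * x[k:]) / denom
                     for k in range(nlags + 1)])

# Synthetic monthly series with a 12-step cycle (not the AirPassengers data)
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12)

acf = sample_acf(series, nlags=24)
bound = 1.96 / np.sqrt(len(series))  # rough white-noise significance bound
significant_lags = np.where(np.abs(acf[1:]) > bound)[0] + 1
print(acf[12], acf[6], bound)
```

The correlation is strongly positive at lag 12 (one full cycle) and strongly negative at lag 6 (half a cycle), which is the pattern a seasonal series produces in an ACF plot.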

38. Cross-Correlation Plot

A cross-correlation plot shows how two time series are correlated with each other at different lags.

import statsmodels.tsa.stattools as stattools

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/mortality.csv')
x = df['mdeaths']
y = df['fdeaths']

# Compute Cross Correlations
ccs = stattools.ccf(x, y)[:100]
nlags = len(ccs)

# Compute the Significance level
# ref: https://stats.stackexchange.com/questions/3115/cross-correlation-significance-in-r/3128#3128
conf_level = 2 / np.sqrt(nlags)
  
# Draw Plot  
plt.figure(figsize=(12,7), dpi=80)  
  
plt.hlines(0, xmin=0, xmax=100, color='gray')# 0 axis  
plt.hlines(conf_level, xmin=0, xmax=100, color='gray')  
plt.hlines(-conf_level, xmin=0, xmax=100, color='gray')  
  
plt.bar(x=np.arange(len(ccs)), height=ccs, width=.3)  
  
# Decoration  
plt.title(r'$Cross\; Correlation\; Plot:\; mdeaths\; vs\; fdeaths$', fontsize=22)
plt.xlim(0,len(ccs))  
plt.show()
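
As a rough illustration of what a cross-correlation at lag k measures, the sketch below correlates one series with a copy of itself delayed by three steps. `lagged_corr` is a hypothetical helper. Its normalization and lag direction may differ from `stattools.ccf`, so treat it as a sketch of the idea and not as a re-implementation.

```python
import numpy as np

def lagged_corr(x, y, k):
    """Pearson correlation between x[t+k] and y[t] for a lag k >= 0."""
    n = len(x)
    return np.corrcoef(x[k:], y[: n - k])[0, 1]

rng = np.random.default_rng(0)
y = rng.normal(size=200)
x = np.zeros_like(y)
x[3:] = y[:-3]  # x repeats y three steps later

corrs = np.array([lagged_corr(x, y, k) for k in range(10)])
best_lag = int(np.argmax(corrs))
conf_level = 2 / np.sqrt(len(y))  # 2 / sqrt(n) rule of thumb
print(best_lag, conf_level)
```

The correlation peaks at the lag that matches the delay, and bars that stay inside the ±conf_level lines are usually read as not significant.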

39. Time Series Decomposition Plot

A time series decomposition plot shows the time series broken down into trend, seasonal and residual components.

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
dates = pd.DatetimeIndex([parse(d).strftime('%Y-%m-01') for d in df['date']])
df.set_index(dates, inplace=True)

# Decompose
result = seasonal_decompose(df['traffic'], model='multiplicative')
  
# Plot  
plt.rcParams.update({'figure.figsize':(10,10)})  
result.plot().suptitle('Time Series Decomposition of Air Passengers')  
plt.show()
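
Classical decomposition typically estimates the trend with a centered moving average over one full season. The sketch below shows that step alone on a synthetic additive series. `centered_moving_average` is an illustrative helper and not the statsmodels routine. In a multiplicative model the series would then be divided by this trend, where an additive model subtracts it.

```python
import numpy as np

def centered_moving_average(x, period=12):
    """2 x period centered moving average; returns len(x) - period values."""
    weights = np.ones(period + 1)
    weights[0] = weights[-1] = 0.5  # half weight on the two end points
    weights /= period
    return np.convolve(x, weights, mode="valid")

# Synthetic additive series: linear trend plus a zero-mean 12-step pattern
t = np.arange(72)
seasonal = np.tile(np.array([-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2]), 6)
series = 100 + 0.5 * t + seasonal

trend = centered_moving_average(series, period=12)
print(trend[:3])
```

Because the window always spans exactly one season, the seasonal pattern averages out and only the trend remains. The first and last six points have no estimate.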

40. Multiple Time Series

You can plot multiple time series that measure the same quantity on the same chart, as shown below.

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/mortality.csv')

# Define the upper limit, lower limit, interval of Y axis and colors
y_LL = 100
y_UL = int(df.iloc[:, 1:].max().max()*1.1)
y_interval = 400
mycolors = ['tab:red','tab:blue','tab:green','tab:orange']

# Draw Plot and Annotate
fig, ax = plt.subplots(1,1,figsize=(16,9), dpi=80)

columns = df.columns[1:]
for i, column in enumerate(columns):
    plt.plot(df.date.values, df[column].values, lw=1.5, color=mycolors[i])
    plt.text(df.shape[0]+1, df[column].values[-1], column, fontsize=14, color=mycolors[i])

# Draw Tick lines
for y in range(y_LL, y_UL, y_interval):
    plt.hlines(y, xmin=0, xmax=71, colors='black', alpha=0.3, linestyles="--", lw=0.5)

# Decorations
plt.tick_params(axis="both", which="both", bottom=False, top=False,
                labelbottom=True, left=False, right=False, labelleft=True)

# Lighten borders
plt.gca().spines["top"].set_alpha(.3)  
plt.gca().spines["bottom"].set_alpha(.3)  
plt.gca().spines["right"].set_alpha(.3)  
plt.gca().spines["left"].set_alpha(.3)  
  
plt.title('Number of Deaths from Lung Diseases in the UK (1974-1979)', fontsize=22)  
plt.yticks(range(y_LL, y_UL, y_interval), [str(y) for y in range(y_LL, y_UL, y_interval)], fontsize=12)
plt.xticks(range(0, df.shape[0], 12), df.date.values[::12], horizontalalignment='left', fontsize=12)
plt.ylim(y_LL, y_UL)
plt.xlim(-2, 80)
plt.show()

41. Plotting with a Secondary Y Axis

If you want to show two time series that measure two different quantities at the same points in time, you can plot the second series against a secondary Y axis on the right.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")
x = df['date']
y1 = df['psavert']
y2 = df['unemploy']

# Plot Line1 (Left Y Axis)
fig, ax1 = plt.subplots(1,1,figsize=(16,9), dpi=80)
ax1.plot(x, y1, color='tab:red')

# Plot Line2 (Right Y Axis)
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.plot(x, y2, color='tab:blue')  
  
# Decorations  
# ax1 (left Y axis)  
ax1.set_xlabel('Year', fontsize=20)  
ax1.tick_params(axis='x', rotation=0, labelsize=12)  
ax1.set_ylabel('Personal Savings Rate', color='tab:red', fontsize=20)  
ax1.tick_params(axis='y', rotation=0, labelcolor='tab:red')  
ax1.grid(alpha=.4)  
  
# ax2 (right Y axis)  
ax2.set_ylabel("# Unemployed (1000's)", color='tab:blue', fontsize=20)  
ax2.tick_params(axis='y', labelcolor='tab:blue')  
ax2.set_xticks(np.arange(0, len(x),60))  
ax2.set_xticklabels(x[::60], rotation=90, fontdict={'fontsize':10})  
ax2.set_title("Personal Savings Rate vs Unemployed: Plotting in Secondary Y Axis", fontsize=22)  
fig.tight_layout()  
plt.show()

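42. Time Series with Error Bands

If you have a time series dataset with multiple observations for each time point (date/timestamp), you can construct a time series with error bands. Below you can see two examples: one based on orders arriving at various times of the day, and another on the number of orders arriving over a period of 45 days.

In this approach, the mean number of orders is shown by the white line, and a 95% confidence band is computed and drawn around the mean.

from scipy.stats import sem

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/user_orders_hourofday.csv")
df_mean = df.groupby('order_hour_of_day').quantity.mean()
df_se = df.groupby('order_hour_of_day').quantity.apply(sem).mul(1.96)

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.ylabel("# Orders", fontsize=16)
x = df_mean.index
plt.plot(x, df_mean, color="white", lw=2)
plt.fill_between(x, df_mean - df_se, df_mean + df_se, color="#3F5D7D")

# Decorations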
# Lighten borders  
plt.gca().spines["top"].set_alpha(0)  
plt.gca().spines["bottom"].set_alpha(1)  
plt.gca().spines["right"].set_alpha(0)  
plt.gca().spines["left"].set_alpha(1)  
plt.xticks(x[::2],[str(d)for d in x[::2]], fontsize=12)  
plt.title("User Orders by Hour of Day (95% confidence)", fontsize=22)  
plt.xlabel("Hour of Day")  
  
s, e = plt.gca().get_xlim()  
plt.xlim(s, e)  
  
# Draw Horizontal Tick lines    
for y in range(8,20,2):
    plt.hlines(y, xmin=s, xmax=e, colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()

# Data Source: https://www.kaggle.com/olistbr/brazilian-ecommerce#olist_orders_dataset.csv
from dateutil.parser import parse
from scipy.stats import sem

# Import Data
df_raw = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/orders_45d.csv',
                     parse_dates=['purchase_time','purchase_date'])

# Prepare Data: Daily Mean and SE Bands
df_mean = df_raw.groupby('purchase_date').quantity.mean()
df_se = df_raw.groupby('purchase_date').quantity.apply(sem).mul(1.96)

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.ylabel("# Daily Orders", fontsize=16)
x = [d.date().strftime('%Y-%m-%d') for d in df_mean.index]
plt.plot(x, df_mean, color="white", lw=2)
plt.fill_between(x, df_mean - df_se, df_mean + df_se, color="#3F5D7D")

# Decorations
# Lighten borders  
plt.gca().spines["top"].set_alpha(0)  
plt.gca().spines["bottom"].set_alpha(1)  
plt.gca().spines["right"].set_alpha(0)  
plt.gca().spines["left"].set_alpha(1)  
plt.xticks(x[::6],[str(d)for d in x[::6]], fontsize=12)  
plt.title("Daily Order Quantity of Brazilian Retail with Error Bands (95% confidence)", fontsize=20)  
  
# Axis limits  
s, e = plt.gca().get_xlim()  
plt.xlim(s, e-2)  
plt.ylim(4,10)  
  
# Draw Horizontal Tick lines    
for y in range(5,10,1):
    plt.hlines(y, xmin=s, xmax=e, colors='black', alpha=0.5, linestyles="--", lw=0.5)
  
plt.show()
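
The band in both examples is the group mean plus or minus 1.96 times the standard error of the mean. A minimal sketch of that calculation using only the standard library follows. The quantities are made up, and `mean_and_band` is an illustrative helper name.

```python
import math
import statistics

def mean_and_band(values, z=1.96):
    """Mean of a group and the half-width of its z * SEM band."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, z * sem

# Made-up order quantities keyed by hour of day
orders_by_hour = {9: [2, 4, 6], 10: [10, 10, 10], 11: [5, 7, 9, 11]}
bands = {hour: mean_and_band(q) for hour, q in orders_by_hour.items()}
for hour, (mean, half_width) in bands.items():
    print(hour, mean - half_width, mean, mean + half_width)
```

Groups with few or widely spread observations get a wider band, which is why the band in such plots tends to widen where data is sparse.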

43. Stacked Area Chart

A stacked area chart gives a visual representation of how much each of several time series contributes, so that they can be compared with each other easily.

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/nightvisitors.csv')

# Decide Colors
mycolors = ['tab:red','tab:blue','tab:green','tab:orange','tab:brown','tab:grey','tab:pink','tab:olive']

# Draw Plot and Annotate
fig, ax = plt.subplots(1,1,figsize=(16,9), dpi=80)
columns = df.columns[1:]

# Prepare data
x = df['yearmon'].values.tolist()
# Stacking order of the columns; the labels follow the same order so the legend matches the layers
order = [0, 2, 4, 6, 7, 5, 1, 3]
y = np.vstack([df[columns[i]].values for i in order])
labs = [columns[i] for i in order]

# Plot for each column
ax = plt.gca()
ax.stackplot(x, y, labels=labs, colors=mycolors, alpha=0.8)  
  
# Decorations  
ax.set_title('Night Visitors in Australian Regions', fontsize=18)  
ax.set(ylim=[0,100000])  
ax.legend(fontsize=10, ncol=4)  
plt.xticks(x[::5], fontsize=10, horizontalalignment='center')  
plt.yticks(np.arange(10000,100000,20000), fontsize=10)  
plt.xlim(x[0], x[-1])  
  
# Lighten borders  
plt.gca().spines["top"].set_alpha(0)  
plt.gca().spines["bottom"].set_alpha(.3)  
plt.gca().spines["right"].set_alpha(0)  
plt.gca().spines["left"].set_alpha(.3)  
  
plt.show()
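
In a stacked area chart with a zero baseline, each layer sits on top of the running total of the layers below it. A small NumPy sketch of that bookkeeping, on made-up numbers, may make the picture easier to read.

```python
import numpy as np

# Made-up visitor counts: one row per region, one column per period
y = np.array([
    [10, 12, 14],
    [5, 5, 6],
    [8, 9, 7],
])

tops = np.cumsum(y, axis=0)  # upper edge of each layer
bottoms = np.vstack([np.zeros(y.shape[1]), tops[:-1]])  # lower edge of each layer
thickness = tops - bottoms  # equals the original values

print(tops[-1])
```

Only the bottom layer shares a flat baseline, so the thickness of the upper layers has to be judged between two moving edges. This is why the stacking order matters for readability.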

44. Unstacked Area Chart

An unstacked area chart is used to visualize the progress (ups and downs) of two or more series relative to each other. In the chart below, you can clearly see how the personal savings rate comes down as the median duration of unemployment increases. The unstacked area chart brings out this phenomenon nicely.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")

# Prepare Data
x = df['date'].values.tolist()
y1 = df['psavert'].values.tolist()
y2 = df['uempmed'].values.tolist()
mycolors = ['tab:red','tab:blue','tab:green','tab:orange','tab:brown','tab:grey','tab:pink','tab:olive']
columns = ['psavert','uempmed']

# Draw Plot
fig, ax = plt.subplots(1,1, figsize=(16,9), dpi=80)
# y1 holds psavert (columns[0]) and y2 holds uempmed (columns[1])
ax.fill_between(x, y1=y1, y2=0, label=columns[0], alpha=0.5, color=mycolors[1], linewidth=2)
ax.fill_between(x, y1=y2, y2=0, label=columns[1], alpha=0.5, color=mycolors[0], linewidth=2)
  
# Decorations  
ax.set_title('Personal Savings Rate vs Median Duration of Unemployment', fontsize=18)  
ax.set(ylim=[0,30])  
ax.legend(loc='best', fontsize=12)  
plt.xticks(x[::50], fontsize=10, horizontalalignment='center')  
plt.yticks(np.arange(2.5,30.0,2.5), fontsize=10)  
plt.xlim(-10, x[-1])  
  
# Draw Tick lines    
for y in np.arange(2.5,30.0,2.5):
    plt.hlines(y, xmin=0, xmax=len(x), colors='black', alpha=0.3, linestyles="--", lw=0.5)
  
# Lighten borders  
plt.gca().spines["top"].set_alpha(0)  
plt.gca().spines["bottom"].set_alpha(.3)  
plt.gca().spines["right"].set_alpha(0)  
plt.gca().spines["left"].set_alpha(.3)  
plt.show()

45. Calendar Heat Map

A calendar heat map is an alternative, and less preferred, option for visualizing time-based data compared to a time series plot. Though visually appealing, the numeric values are not very evident. It is, however, effective at picturing extreme values and holiday effects.

import matplotlib as mpl
import calmap

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv", parse_dates=['date'])
df.set_index('date', inplace=True)

# Plot
plt.figure(figsize=(16,10), dpi=80)
# Select the rows for 2014 by label
calmap.calendarplot(df.loc['2014', 'VIX.Close'], fig_kws={'figsize':(16,10)}, yearlabel_kws={'color':'black','fontsize':14}, subplot_kws={'title':'Yahoo Stock Prices'})
plt.show()

46. Seasonal Plot

A seasonal plot can be used to compare how the time series performed at the same point in previous seasons (year / month / week etc).

from dateutil.parser import parse

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Prepare data
df['year'] = [parse(d).year for d in df.date]
df['month'] = [parse(d).strftime('%b') for d in df.date]
years = df['year'].unique()

# Draw Plot
mycolors = ['tab:red','tab:blue','tab:green','tab:orange','tab:brown','tab:grey','tab:pink','tab:olive','deeppink','steelblue','firebrick','mediumseagreen']
plt.figure(figsize=(16,10), dpi=80)

# One line per year, labelled with the year at its right end
for i, y in enumerate(years):
    plt.plot('month', 'traffic', data=df.loc[df.year==y, :], color=mycolors[i], label=y)
    plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'traffic'][-1:].values[0], y, fontsize=12, color=mycolors[i])
  
# Decoration  
plt.ylim(50,750)  
plt.xlim(-0.3,11)  
plt.ylabel('$Air Traffic$')  
plt.yticks(fontsize=12, alpha=.7)  
plt.title("Monthly Seasonal Plot: Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.grid(axis='y', alpha=.3)

# Remove borders
plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.5)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.5)
# plt.legend(loc='upper right', ncol=2, fontsize=12)
plt.show()
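
A seasonal plot is essentially the monthly series cut into one row per year and drawn row by row. The sketch below does that reshaping with NumPy on synthetic values. It assumes the series starts in January and contains only complete years, and `first_year` is an assumed value for the illustration.

```python
import numpy as np

# Synthetic monthly values for 3 years, oldest first (not the AirPassengers data)
values = np.arange(36) + 100
first_year = 1949  # assumed start year for the illustration

# Row i holds year first_year + i; column j holds month j (0 = January)
by_year = values.reshape(-1, 12)
years = first_year + np.arange(by_year.shape[0])

for year, row in zip(years, by_year):
    print(year, row[0], row[-1])
```

Each row then becomes one line on the chart, so values in the same column (the same month) line up vertically across years.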

VII. Groups

47. Dendrogram

A dendrogram groups similar points together based on a given distance metric and organizes them in tree-like links based on their similarity.

import scipy.cluster.hierarchy as shc

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.title("USArrests Dendrogram", fontsize=22)
dend = shc.dendrogram(shc.linkage(df[['Murder','Assault','UrbanPop','Rape']], method='ward'), labels=df.State.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()

48. Cluster Plot

A cluster plot can be used to demarcate points that belong to the same cluster. Below is a representative example that groups the US states into 5 clusters based on the USArrests dataset. This cluster plot uses the 'Murder' and 'Assault' columns as the X and Y axes. Alternatively, you can use the first two principal components as the X and Y axes.

from sklearn.cluster import AgglomerativeClustering
from scipy.spatial import ConvexHull

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

# Agglomerative Clustering (ward linkage works on Euclidean distances)
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(df[['Murder','Assault','UrbanPop','Rape']])

# Plot
plt.figure(figsize=(14,10), dpi=80)
plt.scatter(df['Murder'], df['Assault'], c=cluster.labels_, cmap='tab10')

# Encircle: draw the convex hull of a group of points as a filled polygon
def encircle(x, y, ax=None, **kw):
    if ax is None:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)
  
# Draw polygon surrounding vertices      
encircle(df.loc[cluster.labels_ ==0,'Murder'], df.loc[cluster.labels_ ==0,'Assault'], ec="k", fc="gold", alpha=0.2, linewidth=0)  
encircle(df.loc[cluster.labels_ ==1,'Murder'], df.loc[cluster.labels_ ==1,'Assault'], ec="k", fc="tab:blue", alpha=0.2, linewidth=0)  
encircle(df.loc[cluster.labels_ ==2,'Murder'], df.loc[cluster.labels_ ==2,'Assault'], ec="k", fc="tab:red", alpha=0.2, linewidth=0)  
encircle(df.loc[cluster.labels_ ==3,'Murder'], df.loc[cluster.labels_ ==3,'Assault'], ec="k", fc="tab:green", alpha=0.2, linewidth=0)  
encircle(df.loc[cluster.labels_ ==4,'Murder'], df.loc[cluster.labels_ ==4,'Assault'], ec="k", fc="tab:orange", alpha=0.2, linewidth=0)  
  
# Decorations  
plt.xlabel('Murder'); plt.xticks(fontsize=12)  
plt.ylabel('Assault'); plt.yticks(fontsize=12)  
plt.title('Agglomerative Clustering of USArrests (5 Groups)', fontsize=22)  
plt.show()
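
The polygons around each cluster are convex hulls, that is, the smallest convex outline containing all of a cluster's points. To show what the hull vertices are, here is a plain-Python sketch using Andrew's monotone chain algorithm. This is a swapped-in technique chosen for readability and not how scipy's ConvexHull works internally; the points are made up.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop the last point while it would make a clockwise or straight turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]

# Made-up points; (2, 2) lies inside the square and is not a hull vertex
cluster_points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
hull = convex_hull(cluster_points)
print(hull)
```

Only the outer points end up as vertices, which is why a single outlier in a cluster can stretch its polygon noticeably.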

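49. Andrews Curves

Andrews curves help visualize whether there is an inherent grouping of the numeric features based on a given grouping. If the features (columns in the dataset) do not help discriminate the group (cyl), then the lines will not be well separated, as you see below.

from pandas.plotting import andrews_curves

# Import
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")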
df.drop(['cars','carname'], axis=1, inplace=True)  
  
# Plot  
plt.figure(figsize=(12,9), dpi=80)  
andrews_curves(df,'cyl', colormap='Set1')  
  
# Lighten borders  
plt.gca().spines["top"].set_alpha(0)  
plt.gca().spines["bottom"].set_alpha(.3)  
plt.gca().spines["right"].set_alpha(0)  
plt.gca().spines["left"].set_alpha(.3)  
  
plt.title('Andrews Curves of mtcars', fontsize=22)  
plt.xlim(-3,3)  
plt.grid(alpha=0.3)  
plt.xticks(fontsize=12)  
plt.yticks(fontsize=12)  
plt.show()
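
Each row of the dataset becomes one curve by using its feature values as the coefficients of a Fourier series. The sketch below evaluates the standard Andrews formula with NumPy. `andrews_curve` is an illustrative helper, the feature values are made up, and the pandas implementation may differ in detail.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ..."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2))
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        wave = np.sin if i % 2 == 1 else np.cos  # sin for odd i, cos for even i
        result += coeff * wave(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([21.0, 6.0, 160.0, 110.0], t)  # made-up feature values
print(curve.shape)
```

Features with large numeric ranges dominate the shape of the curve, so scaling the columns first can change how well the groups separate.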

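50. Parallel Coordinates

Parallel coordinates help visualize whether a feature helps to segregate the groups effectively. If a segregation is achieved, that feature is likely to be very useful in predicting that group.

from pandas.plotting import parallel_coordinates

# Import Data
df_final = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/diamonds_filter.csv")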
  
# Plot  
plt.figure(figsize=(12,9), dpi=80)  
parallel_coordinates(df_final,'cut', colormap='Dark2')  
  
# Lighten borders  
plt.gca().spines["top"].set_alpha(0)  
plt.gca().spines["bottom"].set_alpha(.3)  
plt.gca().spines["right"].set_alpha(0)  
plt.gca().spines["left"].set_alpha(.3)  
  
plt.title('Parallel Coordinates of Diamonds', fontsize=22)
plt.grid(alpha=0.3)  
plt.xticks(fontsize=12)  
plt.yticks(fontsize=12)  
plt.show()

