Top 50 Classic Python Visualization Examples, Part 1 (with source code, worth bookmarking)

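Many readers want to learn Python in order to move into data analysis, but while learning or working they often forget exactly how a particular chart is drawn. Xing Ge (行哥) is therefore sharing matplotlib and seaborn cheat sheets for readers to consult while plotting. If you need the PDF version, it can be downloaded from Xing Ge's free Planet (星球) community.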

[Figure: matplotlib cheat sheet]

[Figure: seaborn cheat sheet]

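Some readers have also said that, when visualizing data, they are sometimes unsure which chart to use. The top 50 visualization examples below are offered as a menu to choose from.

0 Initial setup

1. Correlation
1.1 Scatter plot
1.2 Bubble plot
1.3 Scatter plot with line of best fit
1.4 Jittered strip plot
1.5 Counts plot
1.6 Marginal histogram
1.7 Marginal boxplot
1.8 Correlogram
1.9 Pairwise plot

2. Deviation
2.1 Diverging bars
2.2 Diverging texts
2.3 Diverging dot plot
2.4 Diverging lollipop chart with markers
2.5 Area chart

3. Ranking
3.1 Ordered bar chart
3.2 Lollipop chart
3.3 Dot plot
3.4 Slope chart
3.5 Dumbbell plot

4. Distribution
4.1 Histogram for a continuous variable
4.2 Histogram for a categorical variable
4.3 Density plot
4.4 Density curves with histogram
4.5 Joy Plot
4.6 Distributed dot plot
4.7 Box plot
4.8 Dot + box plot
4.9 Violin plot
4.10 Population pyramid
4.11 Categorical plots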

0 Initial setup

# !pip install brewer2mpl

import numpy as np

import pandas as pd

import matplotlib as mpl

import matplotlib.pyplot as plt

import seaborn as sns

import warnings; warnings.filterwarnings(action='once')

large = 22; med = 16; small = 12

# 'axes.titlesize' appeared twice in the original dict; the later value (med) is the one that took effect
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}

plt.rcParams.update(params)

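plt.style.use('seaborn-whitegrid')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'

sns.set_style("white")

# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline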

# mac font

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

# windows font

# plt.rcParams['font.sans-serif'] = ['SimHei']

# Version

print(mpl.__version__) #> 3.0.0

print(sns.__version__) #> 0.9.0

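1. Correlation

Correlation plots visualize the relationship between two or more variables, that is, how one variable changes with respect to another.

1.1 Scatter plot

The scatter plot is the classic, fundamental chart for studying the relationship between two variables. If the data contains several groups, you may want to draw each group in a different color. In matplotlib this is easy to do, as the code below shows by calling plt.scatter once per category.

# Import dataset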

midwest = pd.read_csv("data/midwest_filter.csv")

# Prepare Data

# Create as many colors as there are unique midwest['category']

categories = np.unique(midwest['category'])

colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

print(colors)

# Draw Plot for Each Category

plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category==category, :],
                s=30, c=colors[i], label=str(category))

# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=16); plt.yticks(fontsize=16)

plt.title("Scatterplot: Midwest Area vs. Population", fontsize=22)

plt.legend(fontsize=12)

plt.show()
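
The color list in the example divides by len(categories)-1, which would fail if the data held only one category. Here is a minimal sketch of the same even spacing in plain Python, with a guard for that case. The helper name colormap_positions is hypothetical and is not part of matplotlib.

```python
def colormap_positions(n):
    """Return n evenly spaced positions in [0, 1] for sampling a colormap.

    Hypothetical helper: mirrors the i / (n - 1) spacing used in the
    scatter plot code, but avoids dividing by zero for a single category.
    """
    if n < 1:
        raise ValueError("need at least one category")
    if n == 1:
        return [0.0]
    return [i / (n - 1) for i in range(n)]


positions = colormap_positions(3)
print(positions)  # three positions: start, middle and end of the colormap
```

Each position can then be passed to a colormap such as plt.cm.tab10, as the example code does.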

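1.2 Bubble plot

Sometimes you want to show a group of points inside a boundary to emphasize their importance. In this example, you take the records that should be encircled from the dataframe and pass them to the encircle() function defined in the code below.

from matplotlib import patches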

from scipy.spatial import ConvexHull

import warnings; warnings.simplefilter('ignore')

sns.set_style("white")

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

# Step 1: Prepare Data

midwest = pd.read_csv("data/midwest_filter.csv")

# As many colors as there are unique midwest['category']

categories = np.unique(midwest['category'])

colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Step 2: Draw Scatterplot with unique color for each category

fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :], s='dot_size', c=colors[i], label=str(category), edgecolors='black', linewidths=.5)

# Step 3: Encircling

# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot

def encircle(x,y, ax=None, **kw):
    if not ax: ax=plt.gca()
    p = np.c_[x,y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices,:], **kw)
    ax.add_patch(poly)

# Select data to be encircled

midwest_encircle_data = midwest.loc[midwest.state=='IN', :]

# Draw polygon surrounding vertices

encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)

encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations

plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)

plt.title("Bubble Plot", fontsize=22)

plt.legend(fontsize=12)

plt.show()

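1.3 Scatter plot with line of best fit

If you want to understand how two variables change with respect to each other, fitting a line of best fit is the way to go.

# Import Data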

df = pd.read_csv("data/mpg_ggplot2.csv")

df_select = df.loc[df.cyl.isin([4,8]), :]

# Plot

sns.set_style("white")

gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,

height=7, aspect=1.6, robust=True, palette='tab10',

scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations

gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))

plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)

plt.show()

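1.4 Jittered strip plot

Often several data points share exactly the same X and Y values. They are then drawn on top of each other and hide one another. To avoid this, jitter the points slightly so that each one can be seen.

# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)

# x and y are passed by keyword, which newer seaborn releases require
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)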

# Decorations

plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)

plt.show()
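
To make the idea of jitter concrete, here is a minimal sketch in plain Python that adds a small random offset to each value. seaborn applies its own jitter internally, so this only illustrates the principle. The function name jitter, the width, and the sample values are assumptions made for the example.

```python
import random


def jitter(values, width=0.25, seed=0):
    """Return a copy of values with uniform noise in [-width, width] added.

    Illustrative helper only; it is not how seaborn implements jitter.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [v + rng.uniform(-width, width) for v in values]


# Five points, three of which would be drawn exactly on top of each other
points = [18, 18, 18, 21, 21]
jittered = jitter(points)
print(jittered)
```

Every jittered value stays within 0.25 of its original position, so the overall pattern is preserved while identical points become distinguishable.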

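1.5 Counts plot

Another way to avoid overlapping points is to scale the size of each dot by the number of points that fall on that spot. The larger the dot, the greater the concentration of points around it.

# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)

# size is passed as an array here; this worked with seaborn 0.9.0 (the version printed above) and may not work in newer releases
sns.stripplot(x=df_counts.cty, y=df_counts.hwy, size=df_counts.counts*2, ax=ax)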

# Decorations

plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)

plt.show()
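
The counting step behind this plot can be sketched without pandas. The example below counts how often each (cty, hwy) pair occurs and derives a marker size from the count, mirroring the counts*2 scaling in the code. The sample pairs are made up for illustration.

```python
from collections import Counter

# Made-up (cty, hwy) pairs; the first pair occurs three times
pairs = [(18, 29), (21, 29), (18, 29), (20, 31), (18, 29)]

# Count how many observations share each coordinate
counts = Counter(pairs)

# Scale marker size with the count, as the plot does with counts*2
sizes = {pair: n * 2 for pair, n in counts.items()}

for pair, size in sizes.items():
    print(pair, "count:", counts[pair], "marker size:", size)
```

This is the same aggregation that groupby(['hwy', 'cty']).size() performs on the dataframe.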

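1.6 Marginal histogram

A marginal histogram has a histogram along each of the X and Y axis variables. It is used to visualize the relationship between X and Y together with the univariate distributions of X and Y. This plot is often used in exploratory data analysis (EDA).

# Import Data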

df = pd.read_csv("data/mpg_ggplot2.csv")

# Create Fig and gridspec

fig = plt.figure(figsize=(16, 10), dpi= 80)

grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes

ax_main = fig.add_subplot(grid[:-1, :-1])

ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])

ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax

ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# histogram on the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')

ax_bottom.invert_yaxis()

# histogram on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations

ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')

ax_main.title.set_fontsize(20)

for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()

ax_main.set_xticklabels(xlabels)

plt.show()

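1.7 Marginal boxplot

A marginal boxplot serves a similar purpose to the marginal histogram. The boxplot, however, helps pinpoint the median and the 25th and 75th percentiles of X and Y.

# Import Data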

df = pd.read_csv("data/mpg_ggplot2.csv")

# Create Fig and gridspec

fig = plt.figure(figsize=(16, 10), dpi= 80)

grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes

ax_main = fig.add_subplot(grid[:-1, :-1])

ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])

ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax

ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Add a graph in each part

# The data is passed by keyword so the orientation is explicit: y= gives a vertical box, x= a horizontal one
sns.boxplot(y=df.hwy, ax=ax_right)

sns.boxplot(x=df.displ, ax=ax_bottom)

# Decorations ------------------

# Remove x axis name for the boxplot

ax_bottom.set(xlabel='')

ax_right.set(ylabel='')

# Main Title, Xlabel and YLabel

ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')

# Set font size of different components

ax_main.title.set_fontsize(20)

for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()

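1.8 Correlogram

A correlogram is used to see, at a glance, the correlation metric between all possible pairs of numeric variables in a given dataframe (or 2D array).

# Import Dataset
df = pd.read_csv("data/mtcars.csv")

# Plot
plt.figure(figsize=(12,10), dpi= 80)

# Keep numeric columns only, so text columns do not break corr() on newer pandas
corr = df.select_dtypes(include='number').corr()

sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True)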

# Decorations

plt.title('Correlogram of mtcars', fontsize=22)

plt.xticks(fontsize=12)

plt.yticks(fontsize=12)

plt.show()
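
Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a rough sketch of that calculation for a single pair of variables, the plain-Python function below follows the textbook formula. The function name pearson and the sample values are illustrative assumptions, not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: mileage tends to fall as weight rises
mileage = [21.0, 22.8, 18.7, 14.3, 30.4]
weight = [2.62, 2.32, 3.44, 3.57, 1.51]

r = pearson(mileage, weight)
print(round(r, 3))
```

A value near -1 or +1 corresponds to the strongly colored cells at either end of the RdYlGn colormap, and values near 0 land in the middle.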

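1.9 Pairwise plot

The pairwise plot is a favorite for understanding the relationship between all possible pairs of numeric variables in an analysis. It is a must-have tool for bivariate analysis.

# Load Dataset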

df = sns.load_dataset('iris')

# Plot

plt.figure(figsize=(10,8), dpi= 80)

sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))

plt.show()

# Load Dataset

df = sns.load_dataset('iris')

# Plot

plt.figure(figsize=(10,8), dpi= 80)

sns.pairplot(df, kind="reg", hue="species")

plt.show()

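2. Deviation

2.1 Diverging bars

If you want to see how items vary on a single metric and visualize the order and amount of that variation, diverging bars are a great tool. They help quickly differentiate the performance of groups in the data, are quite intuitive, and convey the point instantly.

# Prepare Data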

df = pd.read_csv("data/mtcars.csv")

x = df.loc[:, ['mpg']]

df['mpg_z'] = (x - x.mean())/x.std()

df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]

df.sort_values('mpg_z', inplace=True)

df.reset_index(inplace=True)

# Draw plot

plt.figure(figsize=(14,10), dpi= 80)

plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations

plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')

plt.yticks(df.index, df.cars, fontsize=12)

plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})

plt.grid(linestyle='--', alpha=0.5)

plt.show()
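
The bars are driven by a z-score: each mileage value minus the mean, divided by the standard deviation. Here is a minimal sketch of that normalization and the red/green color rule in plain Python. The helper name z_scores and the sample values are assumptions made for the example.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1), like pandas .std()
    return [(v - mu) / sigma for v in values]


# Made-up mileage values
mpg = [21.0, 22.8, 18.7, 14.3, 30.4]

z = z_scores(mpg)

# Same rule as the plot: below-average items are red, the rest green
colors = ['red' if v < 0 else 'green' for v in z]

for value, score, color in zip(mpg, z, colors):
    print(value, round(score, 2), color)
```

Because the scores are centered on zero, sorting by them puts below-average items on one side of the axis and above-average items on the other, which is what makes the bars diverge.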

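2.2 Diverging texts

Diverging texts are similar to diverging bars, and are preferred when you want to show the value of each item within the chart in a neat and presentable way.

# Prepare Data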

df = pd.read_csv("data/mtcars.csv")

x = df.loc[:, ['mpg']]

df['mpg_z'] = (x - x.mean())/x.std()

df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]

df.sort_values('mpg_z', inplace=True)

df.reset_index(inplace=True)

# Draw plot

plt.figure(figsize=(14,14), dpi= 80)

plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)

for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center', fontdict={'color':'red' if x < 0 else 'green', 'size':14})

# Decorations

plt.yticks(df.index, df.cars, fontsize=12)

plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})

plt.grid(linestyle='--', alpha=0.5)

plt.xlim(-2.5, 2.5)

plt.show()

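2.3 Diverging dot plot

The diverging dot plot is also similar to diverging bars. Compared with diverging bars, however, the absence of bars reduces the contrast and disparity between the groups.

# Prepare Data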

df = pd.read_csv("data/mtcars.csv")

x = df.loc[:, ['mpg']]

df['mpg_z'] = (x - x.mean())/x.std()

df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]

df.sort_values('mpg_z', inplace=True)

df.reset_index(inplace=True)

# Draw plot

plt.figure(figsize=(14,16), dpi= 80)

plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)

for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color':'white'})

# Decorations

# Lighten borders

plt.gca().spines["top"].set_alpha(.3)

plt.gca().spines["bottom"].set_alpha(.3)

plt.gca().spines["right"].set_alpha(.3)

plt.gca().spines["left"].set_alpha(.3)

plt.yticks(df.index, df.cars)

plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})

plt.xlabel('$Mileage$')

plt.grid(linestyle='--', alpha=0.5)

plt.xlim(-2.5, 2.5)

plt.show()

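2.4 Diverging lollipop chart with markers

A lollipop chart with markers provides a flexible way of visualizing divergence: it emphasizes the significant data points you want to draw attention to and lets you annotate the chart with the reasoning where appropriate.

# Prepare Data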

df = pd.read_csv("data/mtcars.csv")

x = df.loc[:, ['mpg']]

df['mpg_z'] = (x - x.mean())/x.std()

df['colors'] = 'black'

# color fiat differently

df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'

df.sort_values('mpg_z', inplace=True)

df.reset_index(inplace=True)

# Draw plot

import matplotlib.patches as patches

plt.figure(figsize=(14,16), dpi= 80)

plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)

plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)

plt.yticks(df.index, df.cars)

plt.xticks(fontsize=12)

# Annotate

plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',

fontsize=15, ha='center', va='center',

bbox=dict(boxstyle='square', fc='firebrick'),

arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')

# Add Patches

p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')

p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')

plt.gca().add_patch(p1)

plt.gca().add_patch(p2)

# Decorate

plt.title('Diverging Lollipop Chart of Car Mileage', fontdict={'size':20})

plt.grid(linestyle='--', alpha=0.5)

plt.show()

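2.5 Area chart

By coloring the area between the axis and the line, the area chart emphasizes not just the peaks and troughs but also how long the highs and lows last. The longer the highs last, the larger the area under the line.

import numpy as np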

import pandas as pd

# Prepare Data

df = pd.read_csv("data/economics.csv", parse_dates=['date']).head(100)

x = np.arange(df.shape[0])

y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100

# Plot

plt.figure(figsize=(16,10), dpi= 80)

plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)

plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)

# Annotate

plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),

bbox=dict(boxstyle='square', fc='firebrick'),

arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')

# Decorations

xtickvals = [str(m)[:3].upper()+"-"+str(y) for y,m in zip(df.date.dt.year, df.date.dt.month_name())]

plt.gca().set_xticks(x[::6])

plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})

plt.ylim(-35,35)

plt.xlim(1,100)

plt.title("Month Economics Return %", fontsize=22)

plt.ylabel('Monthly returns %')

plt.grid(alpha=0.5)

plt.show()
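
The y_returns series in the code is a month-over-month percentage change: the difference from the previous value divided by the previous value, times 100, with 0 for the first entry. A minimal sketch of the same arithmetic in plain Python follows. The function name pct_returns and the sample values are illustrative assumptions.

```python
def pct_returns(series):
    """Percentage change from each value to the next; the first entry is 0."""
    out = [0.0]
    for prev, cur in zip(series, series[1:]):
        out.append((cur - prev) / prev * 100)
    return out


# Made-up savings-rate values
psavert = [12.5, 12.5, 11.7, 12.5]

returns = pct_returns(psavert)
print([round(r, 2) for r in returns])
```

Positive entries end up in the green fill_between region and negative entries in the red one.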

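3. Ranking

3.1 Ordered bar chart

The ordered bar chart conveys the rank order of the items effectively. Adding the value of the metric above each bar also lets the reader take the precise figure from the chart itself.

# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")

# Average city mileage per manufacturer; selecting 'cty' before mean() avoids averaging the text column
df = df_raw.groupby('manufacturer')[['cty']].mean()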

df.sort_values('cty', inplace=True)

df.reset_index(inplace=True)

# Draw plot

import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi= 80)

ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)

# Annotate Text

for i, cty in enumerate(df.cty):
    ax.text(i, cty+0.5, round(cty, 1), horizontalalignment='center')

# Title, Label, Ticks and Ylim
# The plotted column is cty (city mileage), so the title says City rather than Highway
ax.set_title('Bar Chart for City Mileage', fontdict={'size':22})

ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))

plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)

# Add patches to color the X axis labels

p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)

p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)

fig.add_artist(p1)

fig.add_artist(p2)

plt.show()
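
The data preparation for the ranking charts boils down to "average per group, then sort". As a rough sketch, the plain-Python version below does the same with a dictionary. The manufacturer names and mileage figures are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up (manufacturer, city mileage) rows
rows = [('audi', 18), ('audi', 21), ('honda', 28), ('honda', 24), ('jeep', 14)]

# Group the mileage values by manufacturer
by_make = defaultdict(list)
for make, cty in rows:
    by_make[make].append(cty)

# Average each group, then sort ascending by the average
ranked = sorted(((make, mean(vals)) for make, vals in by_make.items()),
                key=lambda pair: pair[1])

for position, (make, avg) in enumerate(ranked):
    print(position, make, avg)
```

The position in the sorted list plays the role of df.index in the plotting code, which is why the bars come out in rank order.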

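3.2 Lollipop chart

The lollipop chart is visually pleasing and serves a similar purpose to the ordered bar chart.

# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")

# Average city mileage per manufacturer; selecting 'cty' before mean() avoids averaging the text column
df = df_raw.groupby('manufacturer')[['cty']].mean()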

df.sort_values('cty', inplace=True)

df.reset_index(inplace=True)

# Draw plot

fig, ax = plt.subplots(figsize=(16,10), dpi= 80)

ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)

ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim

ax.set_title('Lollipop Chart for City Mileage', fontdict={'size':22})

ax.set_ylabel('Miles Per Gallon')

ax.set_xticks(df.index)

ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size':12})

ax.set_ylim(0, 30)

# Annotate

for row in df.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2), horizontalalignment= 'center', verticalalignment='bottom', fontsize=14)

plt.show()

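3.3 Dot plot

The dot plot conveys the rank order of the items. Since the points are aligned along the horizontal axis, it is also easier to see how far apart they are.

# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")

# Average city mileage per manufacturer; selecting 'cty' before mean() avoids averaging the text column
df = df_raw.groupby('manufacturer')[['cty']].mean()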

df.sort_values('cty', inplace=True)

df.reset_index(inplace=True)

# Draw plot

fig, ax = plt.subplots(figsize=(16,10), dpi= 80)

ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')

ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim

ax.set_title('Dot Plot for City Mileage', fontdict={'size':22})

ax.set_xlabel('Miles Per Gallon')

ax.set_yticks(df.index)

ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})

ax.set_xlim(10, 27)

plt.show()

3.4 Slope chart

The slope chart is best suited to comparing the 'before' and 'after' positions of a given person or item.

import matplotlib.lines as mlines

# Import Data

df = pd.read_csv("data/gdppercap.csv")

left_label = [str(c) + ', '+ str(round(y)) for c, y in zip(df.continent, df['1952'])]

right_label = [str(c) + ', '+ str(round(y)) for c, y in zip(df.continent, df['1957'])]

klass = ['red' if (y1-y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]

# draw line

# https://stackoverflow.com/questions/36470343/how-to-draw-a-line-with-matplotlib/36479941

def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0],p2[0]], [p1[1],p2[1]], color='red' if p1[1]-p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1,1,figsize=(14,14), dpi= 80)

# Vertical Lines

ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# Points

ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)

ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)

# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1,p1], [3,p2])
    ax.text(1-0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right', verticalalignment='center', fontdict={'size':14})
    ax.text(3+0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left', verticalalignment='center', fontdict={'size':14})

# 'Before' and 'After' Annotations

ax.text(1-0.05, 13000, 'BEFORE', horizontalalignment='right', verticalalignment='center', fontdict={'size':18, 'weight':700})

ax.text(3+0.05, 13000, 'AFTER', horizontalalignment='left', verticalalignment='center', fontdict={'size':18, 'weight':700})

# Decoration

ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size':22})

ax.set(xlim=(0,4), ylim=(0,14000), ylabel='Mean GDP Per Capita')

ax.set_xticks([1,3])

ax.set_xticklabels(["1952", "1957"])

plt.yticks(np.arange(500, 13000, 2000), fontsize=12)

# Lighten borders

plt.gca().spines["top"].set_alpha(.0)

plt.gca().spines["bottom"].set_alpha(.0)

plt.gca().spines["right"].set_alpha(.0)

plt.gca().spines["left"].set_alpha(.0)

plt.show()

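3.5 Dumbbell plot

The dumbbell plot conveys the 'before' and 'after' positions of the items along with their rank order. It is very useful for visualizing the effect of a particular project or initiative on different objects.

import matplotlib.lines as mlines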

# Import Data

df = pd.read_csv("data/health.csv")

df.sort_values('pct_2014', inplace=True)

df.reset_index(inplace=True)

# Func to draw line segment

def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0],p2[0]], [p1[1],p2[1]], color='skyblue')
    ax.add_line(l)
    return l

# Figure and Axes

fig, ax = plt.subplots(1,1,figsize=(14,14), facecolor='#f7f7f7', dpi= 80)

# Vertical Lines

ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

# Points

ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)

ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)

# Line Segments

for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])

# Decoration

ax.set_facecolor('#f7f7f7')

ax.set_title("Dumbbell Chart: Pct Change - 2013 vs 2014", fontdict={'size':22})

ax.set(xlim=(0,.25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')

ax.set_xticks([.05, .1, .15, .20])

# Labels match the tick positions set above
ax.set_xticklabels(['5%', '10%', '15%', '20%'])

plt.show()

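4. Distribution

4.1 Histogram for a continuous variable

A histogram shows the frequency distribution of a given variable. The plot below groups the frequency bars by a categorical variable, which gives a deeper insight into the continuous variable and the categorical variable together.

# Import Data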

df = pd.read_csv("data/mpg_ggplot2.csv")

# Prepare data

x_var = 'displ'

groupby_var = 'class'

df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)

vals = [df[x_var].values.tolist() for i, df in df_agg]

# Draw

plt.figure(figsize=(16,9), dpi= 80)

colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]

n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])

# Decoration

plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})

plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)

plt.xlabel(x_var)

plt.ylabel("Frequency")

plt.ylim(0, 25)

plt.xticks(ticks=bins[::3], labels=[round(b,1) for b in bins[::3]])

plt.show()

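4.2 Histogram for a categorical variable

The histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can relate the distribution to another categorical variable represented by the colors.

# Import Data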

df = pd.read_csv("data/mpg_ggplot2.csv")

# Prepare data

x_var = 'manufacturer'

groupby_var = 'class'

df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)

vals = [df[x_var].values.tolist() for i, df in df_agg]

# Draw

plt.figure(figsize=(16,9), dpi= 80)

colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]

n, bins, patches = plt.hist(vals, df[x_var].unique().__len__(), stacked=True, density=False, color=colors[:len(vals)])

# Decoration

plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})

plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)

plt.xlabel(x_var)

plt.ylabel("Frequency")

plt.ylim(0, 40)

plt.xticks(ticks=bins, labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')

plt.show()

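4.3 Density plot

Density plots are a common tool for visualizing the distribution of a continuous variable. By grouping them by the 'response' variable, you can inspect the relationship between X and Y. The case below is for illustration only and shows how the distribution of city mileage varies with the number of cylinders.

# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot (newer seaborn releases use fill=True in place of shade=True)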

plt.figure(figsize=(16,10), dpi= 80)

sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], shade=True, color="g", label="Cyl=4", alpha=.7)

sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], shade=True, color="deeppink", label="Cyl=5", alpha=.7)

sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], shade=True, color="dodgerblue", label="Cyl=6", alpha=.7)

sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], shade=True, color="orange", label="Cyl=8", alpha=.7)

# Decoration

plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)

plt.legend()

plt.show()

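4.4 Density curves with histogram

Density curves with a histogram bring together the information conveyed by the two plots, so you can have both in a single figure instead of two.

# Import Data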

df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot

plt.figure(figsize=(13,10), dpi= 80)

sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})

sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})

sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})

plt.ylim(0, 0.35)

# Decoration

plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)

plt.legend()

plt.show()

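4.5 Joy Plot

A joy plot lets the density curves of different groups overlap, which is a good way to visualize the distribution of a large number of groups relative to each other. It is pleasing to the eye and conveys the right information clearly.

# !pip install joypy
import joypy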

# Import Data

mpg = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot

plt.figure(figsize=(16,10), dpi= 80)

fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14,10))

# Decoration

plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)

plt.show()

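4.6 Distributed dot plot

The distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the higher the concentration of data points in that region. Coloring the median differently makes the actual position of each group immediately apparent.

import matplotlib.patches as mpatches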

# Prepare Data

df_raw = pd.read_csv("data/mpg_ggplot2.csv")

cyl_colors = {4:'tab:red', 5:'tab:green', 6:'tab:blue', 8:'tab:orange'}

df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)

# Mean and Median city mileage by make
# Selecting 'cty' before aggregating avoids averaging the text column
df = df_raw.groupby('manufacturer')[['cty']].mean()

df.sort_values('cty', ascending=False, inplace=True)

df.reset_index(inplace=True)

df_median = df_raw.groupby('manufacturer')[['cty']].median()

# Draw horizontal lines

fig, ax = plt.subplots(figsize=(16,10), dpi= 80)

ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')

# Draw the Dots

for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer==make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index==make, :], s=75, c='firebrick')

# Annotate
# Raw string so the mathtext backslashes are not treated as Python escape sequences
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size':12}, color='firebrick')

# Decorations

red_patch = plt.plot([],[], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")

plt.legend(handles=red_patch)

ax.set_title('Distribution of City Mileage by Make', fontdict={'size':22})

ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)

ax.set_yticks(df.index)

ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'}, alpha=0.7)

ax.set_xlim(1, 40)

plt.xticks(alpha=0.7)

plt.gca().spines["top"].set_visible(False)

plt.gca().spines["bottom"].set_visible(False)

plt.gca().spines["right"].set_visible(False)

plt.gca().spines["left"].set_visible(False)

plt.grid(axis='both', alpha=.4, linewidth=.1)

plt.show()

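4.7 Box plot

Box plots are a great way to visualize a distribution while keeping the median, the 25th and 75th percentiles, and the outliers in view. However, you need to be careful when interpreting the size of the boxes, which can misrepresent the number of points contained in each group. Manually adding the number of observations to each box helps overcome this drawback.

# Import Data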

df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot

plt.figure(figsize=(13,10), dpi= 80)

sns.boxplot(x='class', y='hwy', data=df, notch=False)

# Add N Obs inside boxplot (optional)

def add_n_obs(df,group_col,y):
    medians_dict = {grp[0]:grp[1][y].median() for grp in df.groupby(group_col)}
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    n_obs = df.groupby(group_col)[y].size().values
    for (x, xticklabel), n_ob in zip(enumerate(xticklabels), n_obs):
        plt.text(x, medians_dict[xticklabel]*1.01, "#obs : "+str(n_ob), horizontalalignment='center', fontdict={'size':14}, color='white')

add_n_obs(df,group_col='class',y='hwy')

# Decoration

plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)

plt.ylim(10, 40)

plt.show()
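
To show what a box summarizes, here is a minimal sketch that computes the quartiles of a small sample and flags outliers with the common 1.5 x IQR rule. Quartile conventions differ between libraries, so these numbers are not guaranteed to match what seaborn draws. The sample values are made up for illustration.

```python
from statistics import median, quantiles

# Made-up highway mileage values with one unusually high reading
hwy = [12, 17, 20, 23, 24, 26, 27, 29, 31, 44]

# 25th, 50th and 75th percentiles (linear interpolation between data points)
q1, q2, q3 = quantiles(hwy, n=4, method='inclusive')

# Interquartile range: the height of the box
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are conventionally drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in hwy if v < lower_fence or v > upper_fence]

print("box:", q1, "to", q3, "median:", q2)
print("outliers:", outliers, "n_obs:", len(hwy))
```

Nothing in the quartiles reveals how many observations went into them, which is why the example annotates each box with its observation count.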

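4.8 Dot + box plot

The dot + box plot conveys similar information to a box plot split by group. In addition, the dots give a sense of how many data points lie within each group.

# Import Data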

df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot

plt.figure(figsize=(13,10), dpi= 80)

sns.boxplot(x='class', y='hwy', data=df, hue='cyl')

sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)

for i in range(len(df['class'].unique())-1):
    plt.vlines(i+.5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

# Decoration

plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)

plt.legend(title='Cylinders')

plt.show()

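4.9 Violin plot

The violin plot is a visual alternative to the box plot. The shape or area of a violin depends on the number of observations it holds. However, violin plots can be harder to read and are not commonly used in professional settings.

# Import Data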

df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot

plt.figure(figsize=(13,10), dpi= 80)

sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')

# Decoration

plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)

plt.show()

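4.10 Population pyramid

A population pyramid can be used to show the distribution of groups ordered by volume. It can also be used to show the stage-by-stage filtering of a population, as it is below to show how many people pass through each stage of a marketing funnel.

# Read data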

df = pd.read_csv("data/email_campaign_funnel.csv")

# Draw Plot

plt.figure(figsize=(13,10), dpi= 80)

group_col = 'Gender'

order_of_bars = df.Stage.unique()[::-1]

colors = [plt.cm.Spectral(i/float(len(df[group_col].unique())-1)) for i in range(len(df[group_col].unique()))]

for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group, :], order=order_of_bars, color=c, label=group)

# Decorations

plt.xlabel("$Users$")

plt.ylabel("Stage of Purchase")

plt.yticks(fontsize=12)

plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)

plt.legend()

plt.show()

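4.11 Categorical plots

The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.

# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
# The original fig.suptitle('sf') call is dropped: fig is not defined in this snippet
g = sns.catplot(x="alive", col="deck", col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count", height=3.5, aspect=.8,
                palette='tab20')

plt.show()

# Load Dataset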

titanic = sns.load_dataset("titanic")

# Plot

sns.catplot(x="age", y="embark_town",

hue="sex", col="class",

data=titanic[titanic.embark_town.notnull()],

orient="h", height=5, aspect=1, palette="tab10",

kind="violin", dodge=True, cut=0, bw=.2)

Reference: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/#10.-Diverging-Bars
