Pandas Introductory Practice 3 - Data Visualization

python收藏家

Published on 2023-04-17 12:51:31

Tags: python, data analysis

Article link: https://blog.csdn.net/qq_42034590/article/details/129474606

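The human brain excels at finding patterns in visual representations of data, so in this section we will learn how to visualize data using pandas along with the Matplotlib and Seaborn libraries for additional features. We will create a variety of visualizations that help us better understand our data.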

Plotting with pandas

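We can create a variety of visualizations using the plot() method. In this section, we will take a brief tour of some of this functionality, which uses Matplotlib under the hood.
Once again, we will be working with the TSA traveler throughput data that we cleaned up in the previous section: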

import pandas as pd

tsa_melted_holiday_travel = pd.read_csv(
    '../data/tsa_melted_holiday_travel.csv', 
    parse_dates=True, index_col='date'
)
tsa_melted_holiday_travel.head()
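
The examples that follow rely on the date column having been parsed into a DatetimeIndex, which is what allows selections such as .loc['2020']. The following is a minimal sketch of that behavior; the CSV text and its values are made up for illustration, and only the column names mirror the ones used in this article:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file (values are made up)
csv_text = """date,year,travelers,holiday
2020-01-01,2020,2000000,New Year's Day
2020-01-02,2020,2100000,
2020-01-03,2020,2200000,
"""

# parse_dates=True asks pandas to parse the index column as dates
df = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col='date')

# with a DatetimeIndex, partial date strings select whole periods
january = df.loc['2020-01']
print(type(df.index).__name__, january.shape)
```

If the index were left as plain strings, the partial-string selections used throughout the rest of this section would not behave this way.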

To embed SVG-format plots in the notebook, we configure the Matplotlib plotting backend to generate SVG output (first argument) with custom metadata (second argument):

import matplotlib_inline
from utils import mpl_svg_config

matplotlib_inline.backend_inline.set_matplotlib_formats(
    'svg', # output images using SVG format
    **mpl_svg_config('section-3') # optional: configure metadata
)

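Note: The second argument is optional. It is used here to make the SVG output reproducible by setting the hashsalt along with some metadata, which Matplotlib will use when generating any SVG output (see utils.py for more details). Without this argument, different runs of the same plotting code will generate plots that are visually identical but differ at the HTML level due to different IDs, metadata, etc.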

Line plots
The plot() method generates line plots for all numeric columns by default:

tsa_melted_holiday_travel.drop(columns='year').loc['2020'].assign(
    **{
        '7D MA': lambda x: x.travelers.rolling('7D').mean(),
        'YTD mean': lambda x: x.travelers.expanding().mean()
      }
).plot(title='2020 TSA Traveler Throughput', ylabel='travelers', alpha=0.8)
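
To make the two helper columns above more concrete, here is a small sketch of what a time-based rolling window and an expanding window compute. The values are made up, and a 2-day window is used instead of 7 days purely to keep the arithmetic easy to check by hand:

```python
import pandas as pd

# Made-up daily values, purely for illustration
travelers = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.date_range('2020-01-01', periods=4, freq='D'),
    name='travelers',
)

# time-based window: each value averages the current day and the day before
rolling_mean = travelers.rolling('2D').mean()

# expanding window: each value averages everything seen so far
expanding_mean = travelers.expanding().mean()

print(pd.DataFrame({'2D MA': rolling_mean, 'running mean': expanding_mean}))
```

The rolling mean follows recent changes, while the expanding mean smooths over the whole history, which is why the two lines diverge in the plot.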

Bar plots
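For our next example, we will plot vertical bars to compare monthly TSA traveler throughput across years. Let's start by creating a pivot table with the information we need: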

# sum daily travelers by month (rows) and year (columns)
plot_data = (
    tsa_melted_holiday_travel.loc['2019':'2021-04']
    .assign(month=lambda x: x.index.month)
    .pivot_table(index='month', columns='year', values='travelers', aggfunc='sum')
)
plot_data.head()
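
As a sketch of what this reshaping does, the following runs the same assign and pivot_table steps on a tiny made-up dataset, small enough that the monthly sums can be verified by hand (the values are invented and the names long_df and wide_df are only for illustration):

```python
import pandas as pd

# Made-up long-format data: one row per day
long_df = pd.DataFrame(
    {
        'year': [2019, 2019, 2019, 2020, 2020, 2020],
        'travelers': [1, 2, 3, 4, 5, 6],
    },
    index=pd.to_datetime([
        '2019-01-01', '2019-01-02', '2019-02-01',
        '2020-01-01', '2020-02-01', '2020-02-02',
    ]).rename('date'),
)

# months become rows, years become columns, and daily values are summed
wide_df = (
    long_df
    .assign(month=lambda x: x.index.month)
    .pivot_table(index='month', columns='year', values='travelers', aggfunc='sum')
)
print(wide_df)
```

Each column of the result becomes one group of bars when the table is passed to plot().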

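Pandas offers other plot types via the kind parameter, so we specify kind='bar' when calling the plot() method. Then, we use the Axes object returned by plot() to further format the visualization: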

import calendar
from matplotlib import ticker

ax = plot_data.plot(
    kind='bar', rot=0, xlabel='', ylabel='travelers',
    figsize=(8, 1.5), title='TSA Monthly Traveler Throughput'
)

# use month abbreviations for the ticks on the x-axis
ax.set_xticklabels(calendar.month_abbr[1:])

# show y-axis labels in millions instead of scientific notation
ax.yaxis.set_major_formatter(ticker.EngFormatter())

# customize the legend
ax.legend(title='', loc='center', bbox_to_anchor=(0.5, -0.3), ncols=3, frameon=False)
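
Note that calendar.month_abbr[1:] supplies twelve labels, so it fits only when the pivot table has a row for every month. The sketch below, on made-up data covering just three months, derives the labels from the index instead; the non-interactive Agg backend is an assumption made so that the code runs without a display:

```python
import calendar

import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so this runs anywhere
import pandas as pd
from matplotlib import ticker

# Made-up pivot table with only three months of data
wide = pd.DataFrame(
    {2019: [3e6, 5e6, 4e6], 2020: [4e6, 1e6, 2e6]},
    index=pd.Index([1, 2, 3], name='month'),
)

ax = wide.plot(kind='bar', rot=0, xlabel='', ylabel='travelers')

# build one label per row from the month numbers in the index
ax.set_xticklabels([calendar.month_abbr[month] for month in wide.index])

# show y-axis ticks with engineering prefixes rather than an exponent
ax.yaxis.set_major_formatter(ticker.EngFormatter())
```

Since plot() returns a regular Matplotlib Axes object, any other Axes method can be applied in the same way.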

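Plotting distributions
Now let's compare the distribution of daily TSA traveler throughput across years. We will create a subplot for each year containing both a histogram and a kernel density estimate (KDE) of the distribution. So far, pandas has generated the Figure and Axes objects for both examples, but we can build custom layouts ourselves using Matplotlib's plt.subplots() function. First, we need to import the pyplot module: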

import matplotlib.pyplot as plt

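While pandas lets us specify that we want subplots and their layout (with the subplots and layout parameters, respectively), using Matplotlib to create the subplots directly gives us additional flexibility: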

# define the subplot layout
fig, axes = plt.subplots(3, 1, sharex=True, sharey=True, figsize=(6, 4))

for year, ax in zip(tsa_melted_holiday_travel.year.unique(), axes):
    plot_data = tsa_melted_holiday_travel.loc[str(year)].travelers
    plot_data.plot(kind='hist', legend=False, density=True, alpha=0.8, ax=ax)
    plot_data.plot(kind='kde', legend=False, color='blue', ax=ax)
    ax.set(title=f'{year} TSA Traveler Throughput', xlabel='travelers')

fig.tight_layout() # handle overlaps
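
For comparison, here is a sketch of the pandas-only route using the subplots and layout parameters. It assumes wide-format data with one column per subplot; the random values and column names are made up, and the Agg backend is an assumption so the code runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so this runs anywhere
import numpy as np
import pandas as pd

# Made-up wide-format data: one column per year
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(50, 3)), columns=['2019', '2020', '2021'])

# pandas creates the Figure and one Axes per column in a 3x1 grid
axes = wide.plot(
    kind='hist', subplots=True, layout=(3, 1), sharex=True,
    figsize=(6, 4), alpha=0.8, legend=False
)

# the result is a 2D array of Axes matching the requested layout
for column, ax in zip(wide.columns, axes.ravel()):
    ax.set_title(column)
```

This is shorter, but layering a second plot type such as the KDE onto each subplot is easier when we hold the Axes objects ourselves, as in the Matplotlib version above.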

Plotting with Seaborn

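The Seaborn library provides the means to easily visualize long-format data without first pivoting it. In addition, it offers some additional plot types, once again building on top of Matplotlib. Here, we will look at a few examples of visualizations we can create with Seaborn.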

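Visualizing long-format data
With Seaborn, we can specify plot colors according to the values of a column with the hue parameter. When working with functions that generate subplots, we can also specify how to split the subplots by the values of a long-format column with the col and row parameters. Here, we revisit the comparison of the distribution of TSA traveler throughput across years: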

import seaborn as sns

sns.displot(
    data=tsa_melted_holiday_travel, x='travelers', col='year', kde=True, height=2.5
)

Heatmaps
We can also use Seaborn to visualize pivot tables as heatmaps:

# sum daily travelers by month (rows) and year (columns)
data = (
    tsa_melted_holiday_travel.loc['2019':'2021-04']
    .assign(month=lambda x: x.index.month)
    .pivot_table(index='month', columns='year', values='travelers', aggfunc='sum')
)

data

ax = sns.heatmap(data=data / 1e6, cmap='Blues', annot=True, fmt='.1f')
_ = ax.set_yticklabels(calendar.month_abbr[1:], rotation=0)
_ = ax.set_title('Total TSA Traveler Throughput (in millions)')
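
To illustrate how the scaling and the fmt argument shape the cell annotations, here is a sketch on a made-up 2x2 pivot table (the values are invented, and the Agg backend is an assumption so the code runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so this runs anywhere
import pandas as pd
import seaborn as sns

# Made-up pivot table: months as rows, years as columns
wide = pd.DataFrame(
    {2019: [3.0e6, 5.0e6], 2020: [4.0e6, 1.5e6]},
    index=pd.Index([1, 2], name='month'),
)

# dividing by 1e6 and using fmt='.1f' keeps the annotations short
ax = sns.heatmap(data=wide / 1e6, cmap='Blues', annot=True, fmt='.1f')

# each cell annotation is a Matplotlib Text object on the Axes
labels = [text.get_text() for text in ax.texts]
print(labels)
```

Without the division, the annotations would show the raw values formatted to one decimal place, which is much harder to read at this cell size.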

Customizing plots with Matplotlib

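In this final section, we will discuss how to use Matplotlib to customize plots. Since there is a lot of functionality available, we will only cover how to add shaded regions and annotations here.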

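Adding shaded regions
When looking at a plot of TSA traveler throughput over time, it is helpful to indicate the periods during which there was holiday travel. We can do so with the axvspan() method: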

plot_data = tsa_melted_holiday_travel.loc['2019-05':'2019-11']
ax = plot_data.travelers.plot(
    title='TSA Traveler Throughput', ylabel='travelers', figsize=(9, 2)
)
ax.yaxis.set_major_formatter(ticker.EngFormatter())

# collect the holiday ranges (start and end dates)
holiday_ranges = plot_data.dropna().reset_index()\
    .groupby('holiday').agg({'date': ['min', 'max']})

# create shaded regions for each holiday in the plot
for start_date, end_date in holiday_ranges.to_numpy():
    ax.axvspan(start_date, end_date, color='gray', alpha=0.2)
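
The holiday_ranges table is the key step here: it reduces the daily rows to one start and end date per holiday. Below is a sketch of the same idea on made-up data with two hypothetical holidays ("Holiday A" and "Holiday B"); it selects the date column before aggregating, a slight variation on the code above, and the Agg backend is an assumption so the code runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so this runs anywhere
import pandas as pd

# Made-up daily data with two hypothetical holiday periods
df = pd.DataFrame(
    {
        'travelers': range(14),
        'holiday': [None] * 2 + ['Holiday A'] * 3 + [None] * 4
                   + ['Holiday B'] * 2 + [None] * 3,
    },
    index=pd.date_range('2019-07-01', periods=14, freq='D').rename('date'),
)

ax = df.travelers.plot(ylabel='travelers', figsize=(9, 2))

# one row per holiday holding its first and last date
holiday_ranges = df.dropna().reset_index().groupby('holiday')['date'].agg(['min', 'max'])

# shade the span between each holiday's start and end date
for start_date, end_date in holiday_ranges.itertuples(index=False):
    ax.axvspan(start_date, end_date, color='gray', alpha=0.2)
```

Rows without a holiday are dropped before grouping, so only the holiday periods end up shaded.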

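Adding annotations
We can use the annotate() method to add annotations to the plot. Here, we point out the day in 2019 with the highest TSA traveler throughput, which was the day after Thanksgiving: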

plot_data = tsa_melted_holiday_travel.loc['2019']
ax = plot_data.travelers.plot(
    title='TSA Traveler Throughput', ylabel='travelers', figsize=(9, 2)
)
ax.yaxis.set_major_formatter(ticker.EngFormatter())

# highest throughput
max_throughput_date = plot_data.travelers.idxmax()
max_throughput = plot_data.travelers.max()
_ = ax.annotate(
    f'{max_throughput_date:%b %d}\n({max_throughput / 1e6:.2f} M)',
    xy=(max_throughput_date, max_throughput),
    xytext=(max_throughput_date - pd.Timedelta(days=25), max_throughput * 0.92),
    arrowprops={'arrowstyle': '->'}, ha='center'
)