Kaggle Course: Data Visualization

Latest recommended article published on 2024-08-03 23:04:48

迷途小书童问天



Original link: https://www.kaggle.com/learn/data-visualization


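This article covers Kaggle's data visualization micro-course, which introduces how to visualize data with seaborn. Starting from Hello Seaborn, it walks through setting up the coding environment, loading data, examining data, and drawing line charts, bar charts, heatmaps, scatter plots, and distribution plots. The course suits readers with no programming experience who want to pick up data visualization quickly.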

This article is translated from the official Kaggle website, https://www.kaggle.com/learn/data-visualization, and is for reference only.

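1. Hello, Seaborn

1.1 Welcome to Data Visualization!

In this hands-on micro-course, you'll learn how to take your data visualizations to the next level with seaborn, a powerful but easy-to-use data visualization tool. To use seaborn, you'll also learn a bit of Python coding. That said:

the micro-course is aimed at those with no prior programming experience
each chart uses short and simple code, which makes seaborn faster and easier to use than many other data visualization tools (such as Excel)

So, if you've never written a line of code, and you want to learn the bare minimum needed to start making more attractive plots quickly, you're in the right place. To preview the charts you'll learn to make, see the figure below:
[Figure: examples of the charts covered in this course]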

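1.2 Your coding environment

Take a moment now to scroll quickly through this page. You'll notice that there are several different types of information, including:

text, like the text you're reading right now
code, which is contained inside a gray box called a code cell
code output, the printed result of running the code

In this section, we have already run all of the code. Soon, you'll use a notebook to write and run your own code.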

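1.3 Set up the notebook

At the top of every notebook exercise, you need to run a few lines of code to set up your coding environment. It's not important to understand these lines of code now, so we won't go into the details. (Note: the code prints "Setup Complete" when it finishes running.)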

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

 Setup Complete

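1.4 Load the data

In this notebook, we'll work with a dataset of historical FIFA rankings for six countries: Argentina (ARG), Brazil (BRA), Spain (ESP), France (FRA), Germany (GER), and Italy (ITA). The dataset is stored as a CSV file (short for comma-separated values file). Opening the CSV file in Excel shows one row for each date and one column for each country.
[Figure: the FIFA rankings CSV file opened in Excel]

To load the data, we use two steps:

specify the location (filepath) where the dataset can be accessed
load the dataset from that filepath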

# Path of the file to read
fifa_filepath = "../input/fifa.csv"

# Read the file into a variable fifa_data
fifa_data = pd.read_csv(fifa_filepath, index_col="Date", parse_dates=True)
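
To make the two arguments passed to pd.read_csv more concrete, here is a minimal sketch that reads a tiny CSV from memory instead of from ../input/fifa.csv. The names sample_csv and sample_data are hypothetical, and the two rows are a small excerpt of the FIFA values, used only for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a CSV file (two rows, two countries)
sample_csv = io.StringIO(
    "Date,ARG,BRA\n"
    "1993-08-08,5.0,8.0\n"
    "1993-09-23,12.0,1.0\n"
)

# index_col="Date" uses the Date column as the row labels;
# parse_dates=True asks pandas to parse those labels as dates
sample_data = pd.read_csv(sample_csv, index_col="Date", parse_dates=True)

print(type(sample_data.index))
print(list(sample_data.columns))
```

With the dates stored as the index, only the country columns remain as data columns, which is presumably why a later call such as sns.lineplot(data=...) can draw one line per country against time.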


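1.5 Examine the data

Now, we'll take a quick look at the dataset in fifa_data to make sure that it loaded properly.

We print the first five rows of the dataset with the head() method: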

# Print the first 5 rows of the data
fifa_data.head()

             ARG  BRA   ESP   FRA  GER  ITA
Date
1993-08-08   5.0  8.0  13.0  12.0  1.0  2.0
1993-09-23  12.0  1.0  14.0   7.0  5.0  2.0
1993-10-22   9.0  1.0   7.0  14.0  4.0  3.0
1993-11-19   9.0  4.0   7.0  15.0  3.0  1.0
1993-12-23   8.0  3.0   5.0  15.0  1.0  2.0

1.6 Plot the data

For a quick preview of what you'll learn, the code below generates a line chart.

# Set the width and height of the figure
plt.figure(figsize=(16,6))

# Line chart showing how FIFA rankings evolved over time 
sns.lineplot(data=fifa_data)

<matplotlib.axes._subplots.AxesSubplot at 0x7f85ec3769d0>

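[Figure: line chart of FIFA rankings over time for the six countries]
This code does not need to make sense yet; you'll learn more about it in the lessons that follow.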

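2. Line Charts

Now that you're familiar with the coding environment, it's time to learn how to make your own charts.

In this section, you'll learn just enough Python to create line charts. Then, in the exercise that follows, you'll put your new skills to work on a real-world dataset.

2.1 Set up the notebook

We begin by setting up the coding environment.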

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete

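2.2 Select a dataset

The dataset for this section tracks global daily streams on the music streaming service Spotify. We focus on five popular songs from 2017 and 2018:

“Shape of You”, by Ed Sheeran
“Despacito”, by Luis Fonsi
“Something Just Like This”, by The Chainsmokers and Coldplay
“HUMBLE.”, by Kendrick Lamar
“Unforgettable”, by French Montana

Notice that the first date in the data is January 6, 2017, which corresponds to the release date of "Shape of You". The data also shows that "Shape of You" was streamed 12,287,078 times globally on the day of its release. The other songs have missing values in the first row because they were not released until later.

2.3 Load the data

As you learned in the previous section, we load the dataset with the pd.read_csv command.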

# Path of the file to read
spotify_filepath = "../input/spotify.csv"

# Read the file into a variable spotify_data
spotify_data = pd.read_csv(spotify_filepath, index_col="Date", parse_dates=True)

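The end result of running the code above is that we can now access the dataset through spotify_data.

2.4 Examine the data

We can print the first five rows of the dataset with the head command.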

# Print the first 5 rows of the data
spotify_data.head()

            Shape of You  Despacito  Something Just Like This  HUMBLE.  Unforgettable
Date
2017-01-06      12287078        NaN                       NaN      NaN            NaN
2017-01-07      13190270        NaN                       NaN      NaN            NaN
2017-01-08      13099919        NaN                       NaN      NaN            NaN
2017-01-09      14506351        NaN                       NaN      NaN            NaN
2017-01-10      14275628        NaN                       NaN      NaN            NaN

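Check now that the first five rows match the dataset as it appeared in the earlier image (the view in Excel).

Empty entries appear as NaN, which is short for "Not a Number".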
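
Missing entries can also be counted rather than read off the printed rows. The sketch below is only an illustration on a hypothetical three-row miniature of the Spotify table: the name mini and the single "Despacito" value of 100.0 are made up, while the "Shape of You" numbers are copied from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Spotify table; the Despacito value 100.0 is made up
mini = pd.DataFrame(
    {
        "Shape of You": [12287078, 13190270, 13099919],
        "Despacito": [np.nan, np.nan, 100.0],
    },
    index=pd.to_datetime(["2017-01-06", "2017-01-07", "2017-01-08"]),
)

# Number of missing (NaN) entries in each column
missing_per_column = mini.isna().sum()

# First date on which "Despacito" has a value
first_valid = mini["Despacito"].first_valid_index()

print(missing_per_column)
print(first_valid)
print(mini.dtypes)
```

By default, pandas stores a column that contains NaN as floating point, which is why such columns print with a trailing .0 while a fully populated integer column does not.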

We can also look at the last five rows of the data with the tail command.

# Print the last five rows of the data
spotify_data.tail()

            Shape of You  Despacito  Something Just Like This    HUMBLE.  Unforgettable
Date
2018-01-05       4492978  3450315.0                 2408365.0  2685857.0      2869783.0
2018-01-06       4416476  3394284.0                 2188035.0  2559044.0      2743748.0
2018-01-07       4009104  3020789.0                 1908129.0  2350985.0      2441045.0
2018-01-08       4135505  2755266.0                 2023251.0  2523265.0      2622693.0
2018-01-09       4168506  2791601.0                 2058016.0  2727678.0      2627334.0

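Thankfully, everything looks about right, so we can proceed to plotting the data.

2.5 Plot the data

Now that the dataset is loaded into the notebook, we need only one line of code to make a line chart.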

# Line chart showing daily global streams of each song 
sns.lineplot(data=spotify_data)

<matplotlib.axes._subplots.AxesSubplot at 0x7ff15932bbd0>

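[Figure: line chart of daily global streams for the five songs]
As you can see above, the line of code is relatively short and has two main components:

sns.lineplot tells the notebook that we want to create a line chart
data=spotify_data selects the dataset that will be used to create the chart

Note that you will always use this same format when you create a line chart, and the only thing that changes with a new dataset is the name of the dataset. So, if you were working with a dataset named financial_data, the line of code would look like this: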

sns.lineplot(data=financial_data)

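Sometimes there are additional details we'd like to modify, such as the size of the figure and the title of the chart. Each of these options can be set with a single line of code.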

# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of each song 
sns.lineplot(data=spotify_data)

<matplotlib.axes._subplots.AxesSubplot at 0x7ff15805d310>

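[Figure: the same line chart at 14 by 6 inches, with the title "Daily Global Streams of Popular Songs in 2017-2018"]
The first line of code sets the size of the figure to 14 inches wide by 6 inches high. To set the size of any figure, you need only copy the same line of code. Then, if you'd like to use a custom size, change the values 14 and 6 to the desired width and height.

The second line of code sets the title of the figure. Note that the title must always be enclosed in quotation marks.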
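
To check that these two settings behave as described, here is a small sketch with a different size and a placeholder title. The values 10 and 4 and the title text are arbitrary choices for illustration, and the "Agg" backend is again only an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Custom size: 10 inches wide by 4 inches high
fig = plt.figure(figsize=(10, 4))

# The title is a string, so it sits inside quotation marks
plt.title("My Custom Title")

# Read the settings back from the figure and its axes
width, height = fig.get_size_inches()
title_text = plt.gca().get_title()
print(width, height, title_text)
```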

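2.6 Plot a subset of the data

So far, you've learned how to plot a line for every column in the dataset. In this subsection, you'll learn how to plot a subset of the columns.

We begin by printing the names of all columns. This takes one line of code and can be adapted to any dataset by swapping out the name of the dataset.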

list(spotify_data.columns)

['Shape of You',
 'Despacito',
 'Something Just Like This',
 'HUMBLE.',
 'Unforgettable']

In the next code cell, we plot the lines corresponding to the first two columns in the dataset.

# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'], label="Despacito")

# Add label for horizontal axis
plt.xlabel("Date")

Text(0.5, 0, 'Date')

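[Figure: line chart of daily global streams for "Shape of You" and "Despacito", with the horizontal axis labeled "Date"]
The first two lines of code set the size and the title of the figure, as before.

Each of the next two lines adds a line to the line chart. For instance, consider the first one, which adds the line for "Shape of You":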

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

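This line of code looks very similar to the one we used to plot every column in the dataset, but it has a few key differences:

Instead of setting data=spotify_data, we set data=spotify_data['Shape of You']. In general, to plot only a single column, we use this format: the name of the column goes in single quotes, enclosed in square brackets. (To make sure you specify the column name correctly, you can print the list of all column names first.)
We also add label="Shape of You" so that the line appears in the legend with its corresponding label.

The final line of code modifies the label for the horizontal axis (the x-axis); the desired label is placed in quotation marks.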