Pandas in Practice: Series Methods

Author: 煉心_

Categories: Python, Pandas. Tags: python, pandas

Article link: https://blog.csdn.net/gangchengzhong/article/details/113813121

This article covers the following topics:

1. Importing CSV datasets

2. Sorting Series values

3. Modifying the original Series

4. Counting Series values

5. The apply method

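Before looking at Series methods, we need some real-world datasets. This article uses the following three CSV files:

- pokemon.csv, a list of more than 800 Pokémon, the creatures from Nintendo's popular franchise

- google_stocks.csv, Google's daily stock price in US dollars from its market debut in August 2004 through October 2019

- revolutionary_war.csv, a record of battles during the American Revolutionary War; because some battles have no definite start date or did not take place on US territory, this dataset contains missing values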

1. Importing CSV datasets

First launch Jupyter Notebook, then import the pandas library:

In  [1]: import pandas as pd

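pandas can import more than a dozen different file types. Each file format has an associated import function, and the names of these functions all start with read_. The examples below use read_csv. Its first parameter, filepath_or_buffer, is the path to the file and must include the file extension. A relative path is resolved against the current working directory, which by default is the directory that contains the Jupyter Notebook.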

In  [2]: # The two lines below are equivalent
         pd.read_csv(filepath_or_buffer = "pokemon.csv")
         pd.read_csv("pokemon.csv")
Out [2]:          Pokemon             Type
           0    Bulbasaur   Grass / Poison
           1      Ivysaur   Grass / Poison
           2     Venusaur   Grass / Poison
           3   Charmander             Fire
           4   Charmeleon             Fire
         ...          ...              ...
         804    Stakataka     Rock / Steel
         805  Blacephalon     Fire / Ghost
         806      Zeraora         Electric
         807       Meltan            Steel
         808     Melmetal            Steel
         809 rows × 2 columns

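Note that this output is formatted differently from a Series. What read_csv returns is a pandas DataFrame, a two-dimensional data structure that supports multiple rows and columns. No matter how many columns the dataset contains, pandas stores imported data in a DataFrame by default.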
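
The behavior described above can be checked with a minimal sketch. It assumes a small in-memory stand-in for pokemon.csv (the five rows are copied from the output shown earlier, and io.StringIO replaces the file on disk so the snippet runs anywhere); the variable names are illustrative only.

```python
import io

import pandas as pd

# Stand-in for pokemon.csv: the first five rows from the output above.
CSV_TEXT = """Pokemon,Type
Bulbasaur,Grass / Poison
Ivysaur,Grass / Poison
Venusaur,Grass / Poison
Charmander,Fire
Charmeleon,Fire
"""

# Passing the source by keyword or by position gives the same result.
by_keyword = pd.read_csv(filepath_or_buffer=io.StringIO(CSV_TEXT))
by_position = pd.read_csv(io.StringIO(CSV_TEXT))

print(type(by_keyword).__name__)       # DataFrame
print(by_keyword.shape)                # (5, 2)
print(by_keyword.equals(by_position))  # True
```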

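To store the data in a Series instead, we need two more read_csv parameters: index_col and squeeze. The dataset has two columns (Pokemon and Type), but a Series holds only one column of values, so we use index_col to name the column that should become the index. The column name is case-sensitive: the string must match the column name in the dataset exactly.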

In  [3]: pd.read_csv("pokemon.csv", index_col = "Pokemon")
Out [3]:                          Type
             Pokemon                   
           Bulbasaur    Grass / Poison
             Ivysaur    Grass / Poison
            Venusaur    Grass / Poison
          Charmander              Fire
          Charmeleon              Fire
                 ...               ...
           Stakataka      Rock / Steel
         Blacephalon      Fire / Ghost
             Zeraora          Electric
              Meltan             Steel
            Melmetal             Steel
         809 rows × 1 columns
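
As a small illustration of index_col, here is a sketch that again assumes an in-memory stand-in for pokemon.csv with five rows taken from the output above. It also uses the integer form index_col=0, which this article does not cover, to show that selecting the index column by position gives the same result as selecting it by name.

```python
import io

import pandas as pd

# Stand-in for pokemon.csv (assumption: five sample rows, not the full file).
CSV_TEXT = """Pokemon,Type
Bulbasaur,Grass / Poison
Ivysaur,Grass / Poison
Venusaur,Grass / Poison
Charmander,Fire
Charmeleon,Fire
"""

by_name = pd.read_csv(io.StringIO(CSV_TEXT), index_col="Pokemon")
by_pos = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

print(type(by_name).__name__)  # DataFrame: one column is still a DataFrame
print(by_name.shape)           # (5, 1)
print(by_name.index.name)      # Pokemon
print(by_name.equals(by_pos))  # True
```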

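pandas still stores the data in a DataFrame by default; after all, a multi-column container can hold a single column. For simplicity, it is usually preferable to keep data in the smallest container that fits. To force the data into a Series, set the squeeze parameter to True. Note that this parameter is only available in pandas versions before 2.0: it was deprecated in pandas 1.4 and removed in 2.0, where calling .squeeze("columns") on the DataFrame returned by read_csv has the same effect.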

In  [4]: pd.read_csv("pokemon.csv", index_col = "Pokemon", squeeze = True)
Out [4]: Pokemon
         Bulbasaur    Grass / Poison
         Ivysaur      Grass / Poison
         Venusaur     Grass / Poison
         Charmander             Fire
         Charmeleon             Fire
                           ...
         Stakataka      Rock / Steel
         Blacephalon    Fire / Ghost
         Zeraora            Electric
         Meltan                Steel
         Melmetal              Steel
         Name: Type, Length: 809, dtype: object
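
On pandas 2.0 and later, where read_csv no longer accepts squeeze, the same result can be obtained with the DataFrame.squeeze method. The sketch below assumes a five-row in-memory stand-in for pokemon.csv; with the real file the Series would have 809 values.

```python
import io

import pandas as pd

# Stand-in for pokemon.csv (assumption: five sample rows, not the full file).
CSV_TEXT = """Pokemon,Type
Bulbasaur,Grass / Poison
Ivysaur,Grass / Poison
Venusaur,Grass / Poison
Charmander,Fire
Charmeleon,Fire
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), index_col="Pokemon")

# Collapse the single remaining column into a Series.
pokemon = df.squeeze("columns")

print(type(pokemon).__name__)  # Series
print(pokemon.name)            # Type
print(len(pokemon))            # 5
print(pokemon["Charmander"])   # Fire
```

Passing "columns" restricts the squeeze to the column axis, so a file with only one data row would still come back as a Series rather than being reduced to a scalar.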

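The output reveals some important information about the dataset. It shows the name of the remaining column (Type) and the number of values in the Series (809). The line dtype: object refers to the general-purpose dtype that pandas uses here to store strings. We can now assign the Series to a variable so that it can be reused throughout the Notebook: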

In  [5]: pokemon = pd.read_csv("pokemon.csv", index_col = "Pokemon", squeeze = True)

To check whether the values or the index of the Series contain any NaN, use the hasnans attribute:

In  [6]: pokemon.hasnans
Out [6]: False

In  [7]: pokemon.index.hasnans
Out [7]: False
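
Since pokemon.csv has no missing values, both checks return False. To show the other outcome, here is a small sketch with two hand-built Series (made-up examples, not part of any of the datasets), one of which contains a NaN value:

```python
import numpy as np
import pandas as pd

# Two tiny hand-built Series; the second one has a missing value.
complete = pd.Series(["Fire", "Steel"], index=["Charmander", "Meltan"])
with_gap = pd.Series(["Fire", np.nan], index=["Charmander", "Meltan"])

print(complete.hasnans)        # False
print(with_gap.hasnans)        # True
print(with_gap.index.hasnans)  # False: the index labels are all present
print(with_gap.isna().sum())   # 1: number of missing values
```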

The other two datasets are a little more complex. Let's start with google_stocks.csv, which contains a Date column whose values are in YYYY-MM-DD format:

In  [8]: pd.read_csv("google_stocks.csv").head()
Out [8]:           Date   Close
         0   2004-08-19   49.98
         1   2004-08-20   53.95
         2   2004-08-23   54.50
         3   2004-08-24   52.24
         4   2004-08-25   52.80

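When importing a plain-text file such as a CSV, pandas infers the most sensible data type for each column. When it comes to dates, it plays it safe and imports them as strings. We can, however, use the parse_dates parameter to explicitly convert the values in the Date column to datetime objects. pandas recognizes many different string formats for dates, including the YYYY-MM-DD format used here.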

In  [9]: google = pd.read_csv("google_stocks.csv",
                              index_col = "Date",
                              parse_dates = ["Date"],
                              squeeze = True)
         google.head()
Out [9]: Date
         2004-08-19    49.98
         2004-08-20    53.95
         2004-08-23    54.50
         2004-08-24    52.24
         2004-08-25    52.80
         Name: Close, dtype: float64
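
To make the effect of parse_dates visible, the sketch below reads the same data twice, once without and once with the parameter. It assumes an in-memory stand-in for google_stocks.csv (three rows copied from the output above) and uses .squeeze("columns") instead of the squeeze parameter so that it also runs on pandas 2.0 and later.

```python
import io

import pandas as pd

# Stand-in for google_stocks.csv (assumption: three sample rows only).
CSV_TEXT = """Date,Close
2004-08-19,49.98
2004-08-20,53.95
2004-08-23,54.50
"""

# Without parse_dates the index labels stay plain strings.
raw = pd.read_csv(io.StringIO(CSV_TEXT), index_col="Date").squeeze("columns")

# With parse_dates the index labels become Timestamp objects.
parsed = pd.read_csv(
    io.StringIO(CSV_TEXT),
    index_col="Date",
    parse_dates=["Date"],
).squeeze("columns")

print(type(raw.index[0]).__name__)     # str
print(type(parsed.index[0]).__name__)  # Timestamp
print(parsed.index[0].year)            # 2004
print(parsed.loc[pd.Timestamp("2004-08-20")])  # 53.95
```

With a datetime index, date components such as the year or the day of the week become directly available, which is not the case when the dates are stored as strings.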

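The last dataset is revolutionary_war.csv. To preview its structure, we can call the tail method on the DataFrame that read_csv returns by default, which shows the last five rows of the dataset. As we can see, the State column contains missing values, represented by NaN: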

In  [10]: pd.read_csv("revolutionary_war.csv").tail()
Out [10]:                           Battle   Start Date      S
