A Concise Polars Tutorial, Part 5: What Is a Polars DataFrame? (Part 1)

sosogod

Last modified: 2024-08-12 15:02:07

First published: 2024-08-10 16:34:13

Article link: https://blog.csdn.net/sosogod/article/details/141088162

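Learning objectives

In this lesson we take a high-level look at the Polars DataFrame and learn:

how to access important metadata
how Polars uses Apache Arrow to store data
what happens when we modify a DataFrame

Here is an example DataFrame (the CSV file used in the examples, the "Titanic survivors dataset", is freely available for download):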

import polars as pl
import numpy as np

# Path to the Titanic CSV file, relative to the working directory
csv_file = "../data/titanic.csv"
df = pl.read_csv(csv_file)
# Show the first three rows
df.head(3)

shape: (3, 15)
┌──────────┬────────┬────────┬──────┬───┬──────┬─────────────┬───────┬───────┐
│ survived ┆ pclass ┆ sex    ┆ age  ┆ … ┆ deck ┆ embark_town ┆ alive ┆ alone │
│ ---      ┆ ---    ┆ ---    ┆ ---  ┆   ┆ ---  ┆ ---         ┆ ---   ┆ ---   │
│ i64      ┆ i64    ┆ str    ┆ f64  ┆   ┆ str  ┆ str         ┆ str   ┆ bool  │
╞══════════╪════════╪════════╪══════╪═══╪══════╪═════════════╪═══════╪═══════╡
│ 0        ┆ 3      ┆ male   ┆ 22.0 ┆ … ┆ null ┆ Southampton ┆ no    ┆ false │
│ 1        ┆ 1      ┆ female ┆ 38.0 ┆ … ┆ C    ┆ Cherbourg   ┆ yes   ┆ false │
│ 1        ┆ 3      ┆ female ┆ 26.0 ┆ … ┆ null ┆ Southampton ┆ yes   ┆ true  │
└──────────┴────────┴────────┴──────┴───┴──────┴─────────────┴───────┴───────┘

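The Polars DataFrame

A Polars DataFrame:

is a tabular dataset whose columns are held in memory in the Apache Arrow columnar format
has a height and a width (it is a two-dimensional tabular data structure)
has unique string column names
has a data type (dtype) for each column
has methods for transforming the data stored in those Arrow columns

We can read the DataFrame's width (number of columns) and height (number of rows) from two attributes: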

df.width
15

df.height
891

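Data type schema

Every column in a DataFrame has a data type called a dtype.

The .schema attribute gives us an ordered mapping from column names to dtypes. Recent Polars versions display it as a Schema object, which behaves like an OrderedDict.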

df.schema

Schema({'survived': Int64, 'pclass': Int64, 'sex': String, 'age': Float64, 'sibsp': Int64, 'parch': Int64, 'fare': Float64, 'embarked': String, 'class': String, 'who': String, 'adult_male': Boolean, 'deck': String, 'embark_town': String, 'alive': String, 'alone': Boolean})

There is also a dtypes attribute (similar to the one in Pandas). However, it returns a plain list of dtypes without the column names.

df.dtypes

[Int64, Int64, String, Float64, Int64, Int64, Float64, String, String, String, Boolean, String, String, String, Boolean]

A Series also has a dtype attribute:

df['age'].dtype

Float64

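Supertypes

Polars DataFrames support many data types (dtypes). They are similar to those in Pandas but have some particularities of their own.

We can group the Polars dtypes as follows:

integers: integer types such as pl.Int8, pl.Int16
floats: floating-point types such as pl.Float32, pl.Float64
string: the string type pl.String (pl.Utf8 is the older alias for the same type)
boolean: the Boolean type pl.Boolean
datetime: date and time types such as pl.Datetime, pl.Date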

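Polars also has the concept of supertypes. Supertypes come into play when we try to perform an operation that involves columns of different dtypes. If the dtypes of those columns have a supertype, all the columns are cast to that type before the operation is carried out.

A supertype is defined for a given pair of dtypes; there is no single universal supertype.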

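The concept of supertypes

A supertype is not a separate abstract dtype. It is the concrete dtype that can represent the values of both input dtypes. A Series always has exactly one dtype, so when integers and floating-point numbers are combined, the values are first cast to their common supertype.
The input dtypes play the role of subtypes. For example, the supertype of Int32 and Int64 is Int64, and the supertype of Int64 and Float64 is Float64. Integer and floating-point dtypes are often grouped informally as numeric types.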

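What supertypes are used for

Type casting:
- When Series of different dtypes need to be brought to the same dtype, Polars casts them according to the supertype rules.
- For example, combining an Int64 Series with a Float64 Series gives a Float64 result, because Float64 is the supertype of that pair.
Type inference:
- In some operations Polars has to decide the dtype of the output Series. It infers the result dtype from the supertype rules.
Flexibility:
- Because the common dtype is resolved automatically, columns with compatible but different dtypes can be combined without casting each one by hand.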