polars_learn

一条闲鱼_mytube

Last modified 2024-04-10 16:14:59


First published 2024-04-10 16:11:22

Article link: https://blog.csdn.net/weixin_38805083/article/details/137599598


Polars expressions

Group-by and aggregation

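import polars as pl

# Load the iris data set; read_csv already returns a DataFrame
data_pl = pl.read_csv("./data.csv")
# Keep rows with sepal_length > 5, then sum every remaining column per species
data_pl = data_pl.filter(pl.col("sepal_length") > 5).group_by("species").agg(pl.all().sum())
print(data_pl)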

shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘


Note: `groupby` is deprecated and has been renamed to `group_by`; the old spelling triggers a DeprecationWarning.

Lazy execution example
Switching from eager to lazy execution is simple: add .lazy() to the existing call chain and finish it with .collect().


result = pl.read_csv('data.csv').lazy().filter(pl.col('sepal_length') > 5).group_by('species')\
    .agg(pl.col('*').sum())\
    .collect()

print(result)

shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘

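3.1 Contexts

Expressions can be evaluated in the following contexts:

Selection: df.select([…])
Group-by aggregation: df.group_by(…).agg([…])
Horizontal stacking (hstack) or adding columns: df.with_columns([…])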

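Syntactic sugar
Even in eager mode you are using the Polars lazy API under the hood. The eager call

df.group_by("foo").agg([pl.col("bar").sum()])

is shorthand for

(df.lazy().group_by("foo").agg([pl.col("bar").sum()])).collect()

This lets Polars push the expressions down to the query engine, which can then apply optimizations and caching.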

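The select context
Selection operates on columns.
Expressions in the select context must return Series, and those Series must all have the same length or have length 1.
A Series of length 1 is broadcast: its single value fills an entire column of the DataFrame.
select may produce new columns: aggregation results, combinations of expressions, or constants.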

# Assumes df has a numeric column "nrs" and a string column "names"
out = df.select([
    pl.sum("nrs"),
    pl.col("names").sort(),
    pl.col("names").first().alias("first name"),
    (pl.mean("nrs") * 10).alias("10xnrs"),
])
print(out)

Adding columns:

with_columns()

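The group_by context
Expressions in the group_by context operate on groups, so they can return results of any length (each group may have a different number of members).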

out = df.group_by("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs within each group
        pl.col("random").count().alias("count"),  # number of values in each group
        # sum of random over the rows where names is not null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().name.suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)

3.3 Grouping

"Split-apply-combine" is the core pattern behind grouping in Polars.
The hashing done in the split phase is multithreaded and lock-free.
Diagram: https://raw.githubusercontent.com/pola-rs/polars-static/master/docs/split-apply-combine.svg

Polars Expressions

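Polars implements a very powerful expression syntax that is available in both its lazy and its eager API.

Count the values in each group:
Short form: pl.count("party")
Long form: pl.col("party").count()
Collect the gender values of each group into a list:
Long form: pl.col("gender") (inside agg a bare column is gathered into a list per group; old releases wrote pl.col("gender").list())
Get the first last_name of each group:
Short form: pl.first("last_name")
Long form: pl.col("last_name").first()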

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .group_by("first_name")
    .agg(
        [
            pl.len().alias("count"),  # rows per group
            pl.col("gender"),  # gathered into a list per group
            pl.first("last_name"),
        ]
    )
    .sort("count", descending=True)
    .limit(5)
)

df = q.collect()

Next we want to know how many Pro and Anti members each state has

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .group_by("state")
    .agg(
        [
            (pl.col("party") == "Anti-Administration").sum().alias("anti"),
            (pl.col("party") == "Pro-Administration").sum().alias("pro"),
        ]
    )
    .sort("pro", descending=True)
    .limit(5)
)

df = q.collect()

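Filtering
We want to compute a mean per group, but not over all values, and we do not want to filter the DataFrame itself because we still need those rows for other operations later.

Plain Python helper functions can build the expressions. They add almost no runtime overhead, because they only construct an expression that Polars then evaluates.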

from datetime import date

import polars as pl

from .dataset import dataset


def compute_age() -> pl.Expr:
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()


def avg_birthday(gender: str) -> pl.Expr:
    return compute_age().filter(pl.col("gender") == gender).mean().alias(f"avg {gender} birthday")


q = (
    dataset.lazy()
    .group_by(["state"])
    .agg(
        [
            avg_birthday("M"),
            avg_birthday("F"),
            (pl.col("gender") == "M").sum().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
    .limit(5)
)

df = q.collect()

Sorting
A DataFrame is often sorted so that a group-by operation sees the rows of each group in a particular order

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday", descending=True)
    .group_by(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            pl.col("gender").sort_by("first_name").first().alias("gender"),
        ]
    )
    .limit(5)
)
df = q.collect()

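The examples above show that complex queries can be built by combining expressions, while avoiding the performance cost of custom Python functions (the interpreter and the GIL).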

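3.4 Folds

fold performs best when it works column-wise: it makes good use of the memory layout of the data and usually benefits from vectorized operations.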


df=pl.DataFrame(
       {
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)
out = df.select(
    pl.fold(acc=pl.lit(0), function=lambda acc, x: acc + x, exprs=pl.col("*")).alias("sum"),
)
print(out)

shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 11  │
│ 22  │
│ 33  │
└─────┘

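The function f(acc, x) -> acc is applied repeatedly: it combines the accumulator acc with each column x in turn, and the final acc becomes the result column. Because the function works on whole columns, it benefits from caching and vectorization.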

Conditionals

fold is a concise way to apply a predicate to every column of a DataFrame

df=pl.DataFrame(
       {
        "a": [1, 2, 3],
        "b": [0, 1, 2],
    }
)
out=df.filter(pl.fold(acc=pl.lit(True),function=lambda acc,x:acc & x,exprs=pl.col('*')>1))
print(out)

shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 2   │
└─────┴─────┘

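fold and strings

fold can concatenate strings, but because the operation materializes intermediate results it has O(n^2) time complexity. The concat_str expression is recommended instead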

import polars as pl
df = pl.DataFrame(
    {
        "a": ["a", "b", "c"],
        "b": [1, 2, 3],
    }
)

out = df.select(
    [
        pl.concat_str(["a", "b"]),
    ]
)
print(out)

shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ str │
╞═════╡
│ a1  │
│ b2  │
│ c3  │
└─────┘

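3.5 Custom functions

map

map passes the Series backing an expression to the function as is.
map follows the same rules in the select and group_by contexts.

This means the Series represents a whole column of the DataFrame.
In the group_by context that column has not been split into groups yet.

Use cases for map are limited. It exists mainly for performance, and it can easily produce incorrect results.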

df = pl.DataFrame(
    {
        "keys": ["a", "a", "b"],
        "values": [10, 7, 1],
    }
)
out=df.group_by('keys',maintain_order=True).agg(
    [
        pl.col('values').map_batches(lambda s: s.shift()).alias('shift_map'),
        pl.col('values').shift().alias('shift_expression'),
     
     ]
)

print(out)

shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map  ┆ shift_expression │
│ ---  ┆ ---        ┆ ---              │
│ str  ┆ list[i64]  ┆ list[i64]        │
╞══════╪════════════╪══════════════════╡
│ a    ┆ [null, 10] ┆ [null, 10]       │
│ b    ┆ [null]     ┆ [null]           │
└──────┴────────────┴──────────────────┘


Note: `map` is deprecated and has been renamed to `map_batches`; the old name triggers a DeprecationWarning.

apply

out = df.group_by("keys", maintain_order=True).agg(
    [
        pl.col("values").map_elements(lambda s: s.shift()).alias("shift_map"),
        pl.col("values").shift().alias("shift_expression"),
    ]
)
print(out)

shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map  ┆ shift_expression │
│ ---  ┆ ---        ┆ ---              │
│ str  ┆ list[i64]  ┆ list[i64]        │
╞══════╪════════════╪══════════════════╡
│ a    ┆ [null, 10] ┆ [null, 10]       │
│ b    ┆ [null]     ┆ [null]           │
└──────┴────────────┴──────────────────┘


Note: `groupby` and `apply` are deprecated; they have been renamed to `group_by` and `map_elements`.

Combining multiple columns

df=pl.DataFrame(
    [
    {"keys": "a", "values": 10},
    {"keys": "a", "values": 7},
    {"keys": "b", "values": 1},
]

)

out = df.select(
    [
        pl.struct(["keys", "values"]).map_elements(lambda x: len(x["keys"]) + x["values"]).alias("solution_apply"),
        (pl.col("keys").str.len_bytes() + pl.col("values")).alias("solution_expr"),
    ]
)
print(out)

shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ ---            ┆ ---           │
│ i64            ┆ i64           │
╞════════════════╪═══════════════╡
│ 11             ┆ 11            │
│ 8              ┆ 8             │
│ 2              ┆ 2             │
└────────────────┴───────────────┘


Note: `apply` has been renamed to `map_elements`, and `str.lengths` to `str.len_bytes`; the old names trigger DeprecationWarnings.

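Python types map to Polars data types as follows:

int -> Int64
float -> Float64
bool -> Boolean
str -> Utf8
list[tp] -> List(tp) (the inner type is inferred by the same rules)
dict[str, tp] -> Struct (each field type is inferred by the same rules)
Any -> Object (avoid this whenever possible)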

3.6 Window functions

Window functions let you run grouped aggregations inside the select context

import polars as pl

# Load a CSV file with Pokémon data
df = pl.read_csv(
    "./pokemon.csv"
)
print(df)

shape: (163, 13)
┌─────┬───────────────────────┬─────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1  ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---     ┆ ---    ┆   ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ str     ┆ str    ┆   ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪═════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass   ┆ Poison ┆ … ┆ 65      ┆ 45    ┆ 1          ┆ false     │
│ 2   ┆ Ivysaur               ┆ Grass   ┆ Poison ┆ … ┆ 80      ┆ 60    ┆ 1          ┆ false     │
│ 3   ┆ Venusaur              ┆ Grass   ┆ Poison ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 3   ┆ VenusaurMega Venusaur ┆ Grass   ┆ Poison ┆ … ┆ 120     ┆ 80    ┆ 1          ┆ false     │
│ 4   ┆ Charmander            ┆ Fire    ┆ null   ┆ … ┆ 50      ┆ 65    ┆ 1          ┆ false     │
│ …   ┆ …                     ┆ …       ┆ …      ┆ … ┆ …       ┆ …     ┆ …          ┆ …         │
│ 146 ┆ Moltres               ┆ Fire    ┆ Flying ┆ … ┆ 85      ┆ 90    ┆ 1          ┆ true      │
│ 147 ┆ Dratini               ┆ Dragon  ┆ null   ┆ … ┆ 50      ┆ 50    ┆ 1          ┆ false     │
│ 148 ┆ Dragonair             ┆ Dragon  ┆ null   ┆ … ┆ 70      ┆ 70    ┆ 1          ┆ false     │
│ 149 ┆ Dragonite             ┆ Dragon  ┆ Flying ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 150 ┆ Mewtwo                ┆ Psychic ┆ null   ┆ … ┆ 90      ┆ 130   ┆ 1          ┆ true      │
└─────┴───────────────────────┴─────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘

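Group-by aggregations in a selection

A window function always returns a DataFrame with the same number of rows as the original.
Several group-by operations can run in parallel within a single query.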


out = df.select(
    [
        "Type 1",
        "Type 2",
        pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
        pl.col("Defense").mean().over(["Type 1", "Type 2"]).alias("avg_defense_by_type_combination"),
        pl.col("Attack").mean().alias("avg_attack"),
    ]
)
out

shape: (163, 5)

Type 1	Type 2	avg_attack_by_type	avg_defense_by_type_combination	avg_attack
str	str	f64	f64	f64
"Grass"	"Poison"	72.923077	67.8	75.349693
"Grass"	"Poison"	72.923077	67.8	75.349693
"Grass"	"Poison"	72.923077	67.8	75.349693
"Grass"	"Poison"	72.923077	67.8	75.349693
"Fire"	null	88.642857	58.3	75.349693
"Fire"	null	88.642857	58.3	75.349693
"Fire"	"Flying"	88.642857	82.0	75.349693
"Fire"	"Dragon"	88.642857	111.0	75.349693
"Fire"	"Flying"	88.642857	82.0	75.349693
"Water"	null	74.193548	74.526316	75.349693
"Water"	null	74.193548	74.526316	75.349693
"Water"	null	74.193548	74.526316	75.349693
…	…	…	…	…
"Rock"	"Water"	87.5	105.0	75.349693
"Rock"	"Water"	87.5	105.0	75.349693
"Rock"	"Flying"	87.5	75.0	75.349693
"Rock"	"Flying"	87.5	75.0	75.349693
"Normal"	null	70.625	59.846154	75.349693
"Ice"	"Flying"	67.5	100.0	75.349693
"Electric"	"Flying"	62.0	85.0	75.349693
"Fire"	"Flying"	88.642857	82.0	75.349693
"Dragon"	null	94.0	55.0	75.349693
"Dragon"	null	94.0	55.0	75.349693
"Dragon"	"Flying"	94.0	95.0	75.349693
"Psychic"	null	53.875	51.428571	75.349693

Operations per group

Window functions can do more than aggregate: they can also apply other operations within each group

filtered=df.filter(pl.col('Type 2')=='Psychic').select(
    ['Name','Type 1','Speed']
)
print(filtered)

shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Slowpoke            ┆ Water  ┆ 15    │
│ Slowbro             ┆ Water  ┆ 30    │
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
│ Exeggcute           ┆ Grass  ┆ 40    │
│ Exeggutor           ┆ Grass  ┆ 55    │
│ Starmie             ┆ Water  ┆ 115   │
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

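Note that the rows of group Water in column Type 1 are not contiguous: there are two Grass rows in between. Within each group the Pokémon are also sorted by Speed in ascending order. The window expression below sorts Name and Speed in descending order within each Type 1 group instead. Each column is sorted independently, so a Name is not guaranteed to stay next to its original Speed (in the output, Slowpoke ends up with 30 and Slowbro with 15).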

filtered=filtered.with_columns(
    [pl.col(['Name','Speed']).sort(descending=True).over('Type 1')]
)
print(filtered)

shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Starmie             ┆ Water  ┆ 115   │
│ Slowpoke            ┆ Water  ┆ 30    │
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
│ Exeggutor           ┆ Grass  ┆ 55    │
│ Exeggcute           ┆ Grass  ┆ 40    │
│ Slowbro             ┆ Water  ┆ 15    │
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

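group_by -> marks the groups for an aggregation and returns a DataFrame with one row per group
over -> marks the groups an expression is computed within, without changing the shape of the original DataFrame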

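# Aggregate within each group and broadcast the result to the group's rows
# Output type: -> Int32
pl.sum("foo").over("groups")

# Sum x within each group, then multiply by the group's y values
# Output type: -> Int32
(pl.col("x").sum() * pl.col("y")).over("groups")

# Sum x within each group, multiply by the group's y values,
# and gather each group's result into a list
# Output type: -> List(Int32)
# (old releases wrote .list().over("groups") here)
(pl.col("x").sum() * pl.col("y")).over("groups", mapping_strategy="join")

# Same computation, but the per-group results are written out one group after another
# instead of being mapped back to the original row positions
# (old releases wrote .list().over("groups").flatten()).
# This is the fastest option when the rows are already ordered by group:
(pl.col("x").sum() * pl.col("y")).over("groups", mapping_strategy="explode")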

3.7 NumPy universal functions

Polars expressions support NumPy universal functions (ufuncs)

import polars as pl
import numpy as np

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

out = df.select(
    [
        np.log(pl.all()).name.suffix("_log"),  # natural log of every column of df
    ]
)
print(out)

shape: (3, 2)
┌──────────┬──────────┐
│ a_log    ┆ b_log    │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 1.386294 │
│ 0.693147 ┆ 1.609438 │
│ 1.098612 ┆ 1.791759 │
└──────────┴──────────┘


Note: `suffix` is deprecated and has moved to `name.suffix`.

3.8 Examples

4 Indexing

A Polars DataFrame has no index, so indexing behaves consistently

Numeric

axis 0: row
axis 1: column

Numeric + strings

axis 0: row (numbers only)
axis 1: column (numbers and strings)

Strings only

axis 0: column
axis 1: error

Expressions

All expressions are evaluated in parallel

axis 0: column
axis 1: column
…
axis n: column

Comparison with Pandas

Operation	pandas	polars
Select a row	`df.iloc[2]`	`df[2, :]`
Select several rows by index	`df.iloc[[2, 5, 6]]`	`df[[2, 5, 6], :]`
Select a slice of rows	`df.iloc[2:6]`	`df[2:6, :]`
Select rows with a boolean mask	`df.iloc[[True, True, False]]`	`df[[True, True, False]]`
Select rows by a predicate	`df.loc[df["A"] > 3]`	`df[df["A"] > 3]`
Select a slice of columns	`df.iloc[:, 1:3]`	`df[:, 1:3]`
Select a slice of columns by name	`df.loc[:, "A":"Z"]`	`df[:, "A":"Z"]`
Select a single value (scalar)	`df.loc[2, "A"]`	`df[2, "A"]`
Select a single value (scalar)	`df.iloc[2, 1]`	`df[2, 1]`
Select a single value (Series or DataFrame)	`df.loc[2, ["A"]]`	`df[2, ["A"]]`
Select a single value (Series or DataFrame)	`df.iloc[2, [1]]`	-

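Expressions

# select is used because the results are shorter than the frame; aliases avoid duplicate column names
df.select([
    pl.col("A").head(5),  # first 5 values of "A"
    pl.col("B").tail(5).reverse().alias("B_tail_reversed"),  # last 5 values of "B", reversed
    pl.col("B").filter(pl.col("B") > 5).head(5).alias("B_filtered"),  # first 5 values of "B" that satisfy the predicate
    pl.sum("A").over("B").head(5).alias("A_sum_over_B"),  # sum of "A" within each "B" group, first 5 values
])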

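5. Data types

Polars is built entirely on Arrow data types and backed by Arrow memory arrays, which makes data processing cache-efficient and supports inter-process communication.
Most data types follow the Arrow implementation exactly, except Utf8 (actually LargeUtf8), Categorical, and Object (limited support).

Int8: 8-bit signed integer.
Int16: 16-bit signed integer.
Int32: 32-bit signed integer.
Int64: 64-bit signed integer.
UInt8: 8-bit unsigned integer.
UInt16: 16-bit unsigned integer.
UInt32: 32-bit unsigned integer.
UInt64: 64-bit unsigned integer.
Float32: 32-bit floating point number.
Float64: 64-bit floating point number.
Boolean: boolean, bit-packed.
Utf8: string data (internally Arrow LargeUtf8).
List: a list array holds a child array with the list values and an offsets array (internally Arrow LargeList).
Date: calendar date, stored as the number of days since the UNIX epoch in a signed 32-bit integer.
Datetime: date and time, stored as a signed 64-bit integer counting time units since the UNIX epoch. The unit can be nanoseconds, microseconds, or milliseconds; the outputs in this article show datetime[μs].
Duration: a time difference, created when subtracting Date/Datetime values.
Time: time of day, stored as nanoseconds since midnight.
Object: a data type with limited support that can hold any value.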

6. Polars vs Pandas

Column operations

Pandas

The following statements run sequentially

df["a"] = df["b"] * 10
df["c"] = df["b"] * 100

Polars

The following expressions run in parallel

df.with_columns([
    (pl.col("b") * 10).alias("a"),
    (pl.col("b") * 100).alias("c"),
])

Column operations based on a predicate

Pandas

df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]

Polars

df.with_columns(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .otherwise(pl.col("a")).alias("a")
)

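The Polars approach is cleaner: the data in the original DataFrame is not modified, and the mask is not computed twice as it is in Pandas.
Pandas can also avoid modifying the original DataFrame at this step, but that requires a temporary variable.
Polars can evaluate the branches of if -> then -> otherwise in parallel, which pays off as the branches become more expensive to compute.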

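Pandas transform

Pandas

df = pd.DataFrame({
    "c": [1, 1, 1, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"]
})

df["size"] = df.groupby("c")["type"].transform(len)

With Pandas you first group by column "c", take the "type" column, compute the length of each group, and finally attach the result to the original DataFrame.
The result is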

c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4

Polars

df.select([
    pl.all(),
    pl.col("type").count().over("c").alias("size")
])

shape: (7, 3)
┌─────┬──────┬──────┐
│ c   ┆ type ┆ size │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ u32  │
╞═════╪══════╪══════╡
│ 1   ┆ m    ┆ 3    │
│ 1   ┆ n    ┆ 3    │
│ 1   ┆ o    ┆ 3    │
│ 2   ┆ m    ┆ 4    │
│ 2   ┆ m    ┆ 4    │
│ 2   ┆ n    ┆ 4    │
│ 2   ┆ n    ┆ 4    │
└─────┴──────┴──────┘

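All operations fit in a single statement, so combining several window functions, even over different groups, is possible.
Polars caches window expressions that are applied over the same group, so putting several of them in one select is both convenient and elegant.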

df.select([
    pl.all(),
    pl.col("c").count().over("c").alias("size"),
    pl.col("c").sum().over("type").alias("sum"),
    pl.col("c").reverse().over("c").flatten().alias("reverse_type")
])

Result:

shape: (7, 5)
┌─────┬──────┬──────┬─────┬──────────────┐
│ c   ┆ type ┆ size ┆ sum ┆ reverse_type │
│ --- ┆ ---  ┆ ---  ┆ --- ┆ ---          │
│ i64 ┆ str  ┆ u32  ┆ i64 ┆ i64          │
╞═════╪══════╪══════╪═════╪══════════════╡
│ 1   ┆ m    ┆ 3    ┆ 5   ┆ 2            │
│ 1   ┆ n    ┆ 3    ┆ 5   ┆ 2            │
│ 1   ┆ o    ┆ 3    ┆ 1   ┆ 2            │
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 2            │
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 1            │
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
└─────┴──────┴──────┴─────┴──────────────┘

8. Time series

Upsampling: upsampling is effectively a left join of a date range with your data set, followed by filling in the missing data

from datetime import datetime
df = pl.DataFrame(
    {
        "time": pl.datetime_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3), interval="30m", eager=True),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
        "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)
out1 = df.upsample(time_column="time", every="15m").fill_null(strategy="forward")

out2 = df.upsample(time_column="time", every="15m").interpolate().fill_null(strategy="forward")

Note: creating Datetime ranges with `date_range` is deprecated; use `datetime_range` instead.

out1

shape: (13, 3)

time	groups	values
datetime[μs]	str	f64
2021-12-16 00:00:00	"a"	1.0
2021-12-16 00:15:00	"a"	1.0
2021-12-16 00:30:00	"a"	2.0
2021-12-16 00:45:00	"a"	2.0
2021-12-16 01:00:00	"a"	3.0
2021-12-16 01:15:00	"a"	3.0
2021-12-16 01:30:00	"b"	4.0
2021-12-16 01:45:00	"b"	4.0
2021-12-16 02:00:00	"b"	5.0
2021-12-16 02:15:00	"b"	5.0
2021-12-16 02:30:00	"a"	6.0
2021-12-16 02:45:00	"a"	6.0
2021-12-16 03:00:00	"a"	7.0

out2

shape: (13, 3)

time	groups	values
datetime[μs]	str	f64
2021-12-16 00:00:00	"a"	1.0
2021-12-16 00:15:00	"a"	1.5
2021-12-16 00:30:00	"a"	2.0
2021-12-16 00:45:00	"a"	2.5
2021-12-16 01:00:00	"a"	3.0
2021-12-16 01:15:00	"a"	3.5
2021-12-16 01:30:00	"b"	4.0
2021-12-16 01:45:00	"b"	4.5
2021-12-16 02:00:00	"b"	5.0
2021-12-16 02:15:00	"b"	5.5
2021-12-16 02:30:00	"a"	6.0
2021-12-16 02:45:00	"a"	6.5
2021-12-16 03:00:00	"a"	7.0

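Downsampling

Polars treats downsampling as a special case of the group_by operation, so the expression API offers two additional entry points into the group_by context: the dynamic and the rolling group-by, both covered in the following sections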

df = pl.DataFrame(
    {
        "time": pl.datetime_range(
            start=datetime(2021, 12, 16),
            end=datetime(2021, 12, 16, 3),
            interval="30m",
            eager=True,
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
    }
)
df

shape: (7, 2)

time	groups
datetime[μs]	str
2021-12-16 00:00:00	"a"
2021-12-16 00:30:00	"a"
2021-12-16 01:00:00	"a"
2021-12-16 01:30:00	"b"
2021-12-16 02:00:00	"b"
2021-12-16 02:30:00	"a"
2021-12-16 03:00:00	"a"

out = df.group_by_dynamic(
    "time",
    every="1h",
    closed="both",
    by="groups",
    include_boundaries=True,
).agg([pl.len()])
out

shape: (7, 5)

groups	_lower_boundary	_upper_boundary	time	len
str	datetime[μs]	datetime[μs]	datetime[μs]	u32
"a"	2021-12-15 23:00:00	2021-12-16 00:00:00	2021-12-15 23:00:00	1
"a"	2021-12-16 00:00:00	2021-12-16 01:00:00	2021-12-16 00:00:00	3
"a"	2021-12-16 01:00:00	2021-12-16 02:00:00	2021-12-16 01:00:00	1
"a"	2021-12-16 02:00:00	2021-12-16 03:00:00	2021-12-16 02:00:00	2
"a"	2021-12-16 03:00:00	2021-12-16 04:00:00	2021-12-16 03:00:00	1
"b"	2021-12-16 01:00:00	2021-12-16 02:00:00	2021-12-16 01:00:00	2
"b"	2021-12-16 02:00:00	2021-12-16 03:00:00	2021-12-16 02:00:00	1

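Dynamic group-by

The example below computes, for each month:

the number of days until the end of the month
the number of days in the month

# Time axis: one row per day from start to end, in a column named "time"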
df = pl.DataFrame(
    {
        "time": pl.datetime_range(
            start=datetime(2021, 1, 1),
            end=datetime(2021, 12, 31),
            interval="1d",
            eager=True,
        ),
    }
)

out = (
    df.group_by_dynamic("time", every="1mo", period="1mo", closed="left")
    .agg(
        [
            pl.col("time").cum_count().reverse().head(3).alias("day/eom"),
            ((pl.col("time") - pl.col("time").first()).last().dt.total_days() + 1).alias("days_in_month"),
        ]
    )
    .explode("day/eom")
)
print(out)

shape: (36, 3)
┌─────────────────────┬─────────┬───────────────┐
│ time                ┆ day/eom ┆ days_in_month │
│ ---                 ┆ ---     ┆ ---           │
│ datetime[μs]        ┆ u32     ┆ i64           │
╞═════════════════════╪═════════╪═══════════════╡
│ 2021-01-01 00:00:00 ┆ 31      ┆ 31            │
│ 2021-01-01 00:00:00 ┆ 30      ┆ 31            │
│ 2021-01-01 00:00:00 ┆ 29      ┆ 31            │
│ 2021-02-01 00:00:00 ┆ 28      ┆ 28            │
│ 2021-02-01 00:00:00 ┆ 27      ┆ 28            │
│ …                   ┆ …       ┆ …             │
│ 2021-11-01 00:00:00 ┆ 29      ┆ 30            │
│ 2021-11-01 00:00:00 ┆ 28      ┆ 30            │
│ 2021-12-01 00:00:00 ┆ 31      ┆ 31            │
│ 2021-12-01 00:00:00 ┆ 30      ┆ 31            │
│ 2021-12-01 00:00:00 ┆ 29      ┆ 31            │
└─────────────────────┴─────────┴───────────────┘

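A window is defined by the following parameters:

every: the interval between window starts
period: the duration of a window
offset: shifts the start of the windows

every does not have to equal period, which makes it possible to create groups very flexibly: they can overlap, or leave gaps between them.

every: 1 day -> "1d"
period: 1 day -> "1d"

The windows are adjacent and have the same length

|--|
   |--|
      |--|

every: 1 day -> "1d"
period: 2 days -> "2d"

The windows overlap by 1 day

|----|
   |----|
      |----|

every: 2 days -> "2d"
period: 1 day -> "1d"

The windows leave gaps between them; data that falls into a gap does not belong to any group

|--|
      |--|
            |--|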

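Rolling group-by

A rolling group-by is another entry point into the group_by context.
Unlike group_by_dynamic, its windows are not laid out on a fixed grid defined by every and period: in a rolling group-by the windows are not fixed,
but are anchored at the values in index_column.
Because the windows are always determined by the values in a DataFrame column, the number of groups always equals the number of rows of the original DataFrame.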

Combining a dynamic or rolling group-by with a regular group-by

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.datetime_range(
            start=datetime(2021, 12, 16),
            end=datetime(2021, 12, 16, 3),
            interval="30m",
            eager=True,
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
    }
)

# Dynamic group-by, additionally grouped by the "groups" column
out = df.group_by_dynamic(
    "time",
    every="1h",
    closed="both",
    by="groups",
    include_boundaries=True,
).agg([pl.len()])
print(df)

shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
│ 2021-12-16 00:30:00 ┆ a      │
│ 2021-12-16 01:00:00 ┆ a      │
│ 2021-12-16 01:30:00 ┆ b      │
│ 2021-12-16 02:00:00 ┆ b      │
│ 2021-12-16 02:30:00 ┆ a      │
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘

9. Usage

IO

Loading a CSV file with Polars is faster than loading it with Pandas.
To end up with a Pandas DataFrame, just run pl.read_csv("", rechunk=False).to_pandas()

CSV files

Reading and writing

df = pl.read_csv("path.csv")
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_csv("path.csv")

Scanning

# scan_csv returns a LazyFrame; nothing is read until collect() is called
df = pl.scan_csv("path.csv")

Parquet files

df = pl.read_parquet("path.parquet")

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_parquet("path.parquet")

df = pl.scan_parquet("path.parquet")

Working with multiple files

Polars can handle multiple files in different ways, depending on your needs and memory pressure

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})

for i in range(5):
    df.write_csv(f"my_many_files_{i}.csv")

Reading into a single DataFrame

df = pl.read_csv("my_many_files_*.csv")
print(df)

shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ …   ┆ …    │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
└─────┴──────┘

pl.scan_csv("my_many_files_*.csv").show_graph()


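Reading and processing in parallel

If your files do not have to end up in a single table, you can build a query plan per file and execute them in parallel on the Polars thread pool. The query plans are embarrassingly parallel and need no communication with each other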

import polars as pl
import glob

queries = []
for file in glob.glob("my_many_files_*.csv"):
    q = pl.scan_csv(file).group_by("bar").agg([pl.len(), pl.sum("foo")])
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)

[shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
│ spam ┆ 1   ┆ 3   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ ham  ┆ 1   ┆ 2   │
│ spam ┆ 1   ┆ 3   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
│ spam ┆ 1   ┆ 3   │
└──────┴─────┴─────┘]

Reading from a database

Supported sources include MySQL, Postgres, SQLite, Redshift, and ClickHouse

import polars as pl

conn = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

# older releases called this pl.read_sql; the default engine needs the connectorx package
pl.read_database_uri(query, conn)

Reading from S3

import polars as pl
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
bucket = "<YOUR_BUCKET>"
path = "<YOUR_PATH>"

dataset = pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs)
df = pl.from_arrow(dataset.read())

Interacting with Postgres

Reading

import polars as pl

conn = "postgresql://username:password@server:port/database"
query = "SELECT * FROM foo"

# older releases called this pl.read_sql; the default engine needs the connectorx package
pl.read_database_uri(query, conn)

Writing

pip install psycopg2-binary

from psycopg2 import sql
import psycopg2.extras
import polars as pl

# Assume a DataFrame whose columns hold floats, integers, strings, and dates
df = pl.read_parquet("somefile.parquet")

# df.rows() returns plain Python values (date and datetime objects included),
# so no manual date conversion is needed before handing the rows to psycopg2

# Build SQL identifiers for the column names
# so that they are quoted safely in the statement
columns = sql.SQL(",").join(sql.Identifier(name) for name in df.columns)

# Build one placeholder per column; the values are filled in at execution time
values = sql.SQL(",").join([sql.Placeholder() for _ in df.columns])

table_id = "mytable"

# Prepare the INSERT statement: table name, column list, placeholders
insert_stmt = sql.SQL("INSERT INTO {} ({}) VALUES ({});").format(
    sql.Identifier(table_id), columns, values
)

# Open the database connection (connection parameters are omitted here)
conn = psycopg2.connect()
cur = conn.cursor()

# Execute the INSERT statement for every row
psycopg2.extras.execute_batch(cur, insert_stmt, df.rows())
conn.commit()

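### Interoperability

To convert a Polars DataFrame or Series to Arrow, call .to_arrow(). Likewise, data in Arrow format can be imported with pl.from_arrow().

Polars Series support NumPy universal functions (ufuncs). Calling element-wise functions such as np.exp(), np.cos(), or np.divide() has essentially no overhead.

Missing values in Polars are tracked in a separate bitmask that NumPy cannot see. This can cause window functions or np.convolve() to return flawed or incomplete results.

To convert a Polars Series to a NumPy array, call .to_numpy(). During the conversion missing values are replaced with np.nan.

If the Series has no missing values, or those values are no longer needed after the conversion, .view() can be used instead; it produces a zero-copy NumPy array of the data.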

Data

Strings

import polars as pl

df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})

df = df.with_columns(pl.col("shakespeare").str.len_bytes().alias("letter_count"))
df

shape: (6, 2)

shakespeare	letter_count
str	u32
"All"	3
"that"	4
"glitters"	8
"is"	2
"not"	3
"gold"	4

Below is a regular-expression pattern that filters the articles "the" and "a" out of a sentence, ignoring case

import polars as pl

df = pl.DataFrame({"a": "The man that ate a whole cake".split(" ")})

df = df.filter(pl.col("a").str.contains(r"(?i)^the$|^a$").not_())
df

shape: (5, 1)

a
str
"man"
"that"
"ate"
"whole"
"cake"

Timestamps

import polars as pl

dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04"], "index": [1, 2, 3]})

q = dataset.lazy().with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))

df = q.collect()
df

shape: (3, 2)

date	index
date	i64
2020-01-02	1
2020-01-03	2
2020-01-04	3

DataFrames

import polars as pl

# Create a Polars DataFrame
df = pl.DataFrame({
    'a': [1, 2, 3, None, 5],
    'b': ['foo', 'ham', 'spam', 'egg', None],
    'c': [0.37454, 0.950714, 0.731994, 0.598658, 0.156019],
    'd': ['a', 'b', 'c', 'd', 'e'],
})

# Display the table
print(df)

shape: (5, 4)
┌──────┬──────┬──────────┬─────┐
│ a    ┆ b    ┆ c        ┆ d   │
│ ---  ┆ ---  ┆ ---      ┆ --- │
│ i64  ┆ str  ┆ f64      ┆ str │
╞══════╪══════╪══════════╪═════╡
│ 1    ┆ foo  ┆ 0.37454  ┆ a   │
│ 2    ┆ ham  ┆ 0.950714 ┆ b   │
│ 3    ┆ spam ┆ 0.731994 ┆ c   │
│ null ┆ egg  ┆ 0.598658 ┆ d   │
│ 5    ┆ null ┆ 0.156019 ┆ e   │
└──────┴──────┴──────────┴─────┘

# Recommended way to select columns
out=df.select(["a", "b"])
# This form also works
out = df[["a", "b"]]
print(out)

shape: (5, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ foo  │
│ 2    ┆ ham  │
│ 3    ┆ spam │
│ null ┆ egg  │
│ 5    ┆ null │
└──────┴──────┘

## Select rows
df[0:2]

shape: (2, 4)

a	b	c	d
i64	str	f64	str
1	"foo"	0.37454	"a"
2	"ham"	0.950714	"b"

## Add rows
import polars as pl

# Create a DataFrame instance
df = pl.DataFrame({
    "a": [1, 2, 3, None, 5],
    "b": ["foo", "ham", "spam", "egg", None],
    "c": [0.37454, 0.950714, 0.731994, 0.598658, 0.156019],
    "d": ["a", "b", "c", "d", "e"]
})

# The new row of data
new_row = pl.DataFrame({
    "a": [6],
    "b": ["new_row"],
    "c": [0.123],
    "d": ["f"]
})

# Use concat to stack the two DataFrames
df = pl.concat([df, new_row])

print(df)

shape: (6, 4)
┌──────┬─────────┬──────────┬─────┐
│ a    ┆ b       ┆ c        ┆ d   │
│ ---  ┆ ---     ┆ ---      ┆ --- │
│ i64  ┆ str     ┆ f64      ┆ str │
╞══════╪═════════╪══════════╪═════╡
│ 1    ┆ foo     ┆ 0.37454  ┆ a   │
│ 2    ┆ ham     ┆ 0.950714 ┆ b   │
│ 3    ┆ spam    ┆ 0.731994 ┆ c   │
│ null ┆ egg     ┆ 0.598658 ┆ d   │
│ 5    ┆ null    ┆ 0.156019 ┆ e   │
│ 6    ┆ new_row ┆ 0.123    ┆ f   │
└──────┴─────────┴──────────┴─────┘

## Add a column
import polars as pl

# Create a DataFrame instance
df = pl.DataFrame({
    "a": [1, 2, 3, None, 5],
    "b": ["foo", "ham", "spam", "egg", None],
    "c": [0.37454, 0.950714, 0.731994, 0.598658, 0.156019],
    "d": ["a", "b", "c", "d", "e"]
})

# The new column of data
new_column_data = [4, 12, 33, None, 51]  # the values for the column you want to add

# Add the new column with the with_columns method
df = df.with_columns([pl.Series(new_column_data).alias('e')])

print(df)

shape: (5, 5)
┌──────┬──────┬──────────┬─────┬──────┐
│ a    ┆ b    ┆ c        ┆ d   ┆ e    │
│ ---  ┆ ---  ┆ ---      ┆ --- ┆ ---  │
│ i64  ┆ str  ┆ f64      ┆ str ┆ i64  │
╞══════╪══════╪══════════╪═════╪══════╡
│ 1    ┆ foo  ┆ 0.37454  ┆ a   ┆ 4    │
│ 2    ┆ ham  ┆ 0.950714 ┆ b   ┆ 12   │
│ 3    ┆ spam ┆ 0.731994 ┆ c   ┆ 33   │
│ null ┆ egg  ┆ 0.598658 ┆ d   ┆ null │
│ 5    ┆ null ┆ 0.156019 ┆ e   ┆ 51   │
└──────┴──────┴──────────┴─────┴──────┘

## Type casting
out = df.with_columns(pl.col("a").cast(float))
print(out)

shape: (5, 5)
┌──────┬──────┬──────────┬─────┬──────┐
│ a    ┆ b    ┆ c        ┆ d   ┆ e    │
│ ---  ┆ ---  ┆ ---      ┆ --- ┆ ---  │
│ f64  ┆ str  ┆ f64      ┆ str ┆ i64  │
╞══════╪══════╪══════════╪═════╪══════╡
│ 1.0  ┆ foo  ┆ 0.37454  ┆ a   ┆ 4    │
│ 2.0  ┆ ham  ┆ 0.950714 ┆ b   ┆ 12   │
│ 3.0  ┆ spam ┆ 0.731994 ┆ c   ┆ 33   │
│ null ┆ egg  ┆ 0.598658 ┆ d   ┆ null │
│ 5.0  ┆ null ┆ 0.156019 ┆ e   ┆ 51   │
└──────┴──────┴──────────┴─────┴──────┘

## Rename
import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 3, None, 5],
        "b": ["foo", "ham", "spam", "egg", None],
        "c": np.random.rand(5),
        "d": ["a", "b", "c", "d", "e"],
    }
)

df.columns = ["banana", "orange", "apple", "grapefruit"]  # rename the columns
df

shape: (5, 4)

banana	orange	apple	grapefruit
i64	str	f64	str
1	"foo"	0.366423	"a"
2	"ham"	0.487346	"b"
3	"spam"	0.573355	"c"
null	"egg"	0.990123	"d"
5	null	0.665936	"e"

## Drop columns
import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 3, None, 5],
        "b": ["foo", "ham", "spam", "egg", None],
        "c": np.random.rand(5),
        "d": ["a", "b", "c", "d", "e"],
    }
)
df.drop('d')
df.drop(['b','c'])
out= df.select(pl.all().exclude(['a','b']))
out

shape: (5, 2)

c	d
f64	str
0.328839	"a"
0.989829	"b"
0.677381	"c"
0.049378	"d"
0.414995	"e"

df = pl.DataFrame(
    {
        "a": [1, 2, 3, None, 5],
        "b": ["foo", "ham", "spam", "egg", None],
        "c": np.random.rand(5),
        "d": ["a", "b", "c", "d", "e"],
    }
)

out=df.drop_nulls()
out

shape: (3, 4)

a	b	c	d
i64	str	f64	str
1	"foo"	0.977671	"a"
2	"ham"	0.956296	"b"
3	"spam"	0.967253	"c"

## Fill missing values
df = pl.DataFrame(
    {
        "a": [1, 2, 3, None, 5],
        "b": ["foo", "ham", "spam", "egg", None],
        "c": np.random.rand(5),
        "d": ["a", "b", "c", "d", "e"],
    }
)

## Forward fill (returns a new DataFrame; df itself is not modified)
out=df.fill_null(strategy='forward')
out

shape: (5, 4)

a	b	c	d
i64	str	f64	str
1	"foo"	0.365143	"a"
2	"ham"	0.576743	"b"
3	"spam"	0.709904	"c"
3	"egg"	0.720322	"d"
5	"egg"	0.104406	"e"

## Get all column names
df.columns

['a', 'b', 'c', 'd']

df.null_count()

shape: (1, 4)

a	b	c	d
u32	u32	u32	u32
1	1	0	0

df.sort('a',descending=True)

shape: (5, 4)

a	b	c	d
i64	str	f64	str
null	"egg"	0.720322	"d"
5	null	0.104406	"e"
3	"spam"	0.709904	"c"
2	"ham"	0.576743	"b"
1	"foo"	0.365143	"a"

df.to_numpy()

array([[1.0, 'foo', 0.3651426421860232, 'a'],
       [2.0, 'ham', 0.5767431907967668, 'b'],
       [3.0, 'spam', 0.7099041984185189, 'c'],
       [nan, 'egg', 0.7203224331236571, 'd'],
       [5.0, None, 0.10440627936761548, 'e']], dtype=object)

df.to_pandas()

	a	b	c	d
0	1.0	foo	0.365143	a
1	2.0	ham	0.576743	b
2	3.0	spam	0.709904	c
3	NaN	egg	0.720322	d
4	5.0	None	0.104406	e

Grouping

Call .group_by() first (named .groupby() in older releases), followed by .agg().
Inside .agg() you can run any number of aggregations over any number of columns.

import polars as pl

q = (
    pl.scan_csv("data/reddit.csv")
    .group_by("comment_karma")  # groupby is deprecated and was renamed to group_by
    .agg([pl.col("name").n_unique().alias("unique_names"), pl.max("link_karma")])
    .sort(by="unique_names", descending=True)  # descending order (formerly reverse=True)
)

df = q.fetch()

Aggregation

Use .select() or .with_columns() (.with_column() in older releases)

Filtering

Eager

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [None, "b", "c"]})

out = df.filter(pl.col("a") > 2)  # filter returns a new DataFrame; df itself is unchanged
out

shape: (1, 2)

a	b
i64	str
3	"c"


df = pl.DataFrame({"a": [1, 2, 3], "b": [None, "b", "c"]})

out = df.lazy().filter(pl.col("a") > 2).collect() # lazy filter
out

shape: (1, 2)

a	b
i64	str
3	"c"

Joins

import polars as pl

df_a = pl.DataFrame({"a": [1, 2, 1, 1], "b": ["a", "b", "c", "c"], "c": [0, 1, 2, 3]})

df_b = pl.DataFrame({"foo": [1, 1, 1], "bar": ["a", "c", "c"], "ham": ["let", "var", "const"]})
print(df_a)

shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ 0   │
│ 2   ┆ b   ┆ 1   │
│ 1   ┆ c   ┆ 2   │
│ 1   ┆ c   ┆ 3   │
└─────┴─────┴─────┘

## Eager
out= df_a.join(df_b,left_on=['a','b'],right_on=['foo','bar'],how='left')
out

shape: (6, 4)

a	b	c	ham
i64	str	i64	str
1	"a"	0	"let"
2	"b"	1	null
1	"c"	2	"var"
1	"c"	2	"const"
1	"c"	3	"var"
1	"c"	3	"const"

## Lazy

q = df_a.lazy().join(df_b.lazy(), left_on="a", right_on="foo", how="outer")
out = q.collect()
print(out)

shape: (10, 6)
┌─────┬─────┬─────┬──────┬──────┬───────┐
│ a   ┆ b   ┆ c   ┆ foo  ┆ bar  ┆ ham   │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ str ┆ i64 ┆ i64  ┆ str  ┆ str   │
╞═════╪═════╪═════╪══════╪══════╪═══════╡
│ 1   ┆ a   ┆ 0   ┆ 1    ┆ a    ┆ let   │
│ 1   ┆ a   ┆ 0   ┆ 1    ┆ c    ┆ var   │
│ 1   ┆ a   ┆ 0   ┆ 1    ┆ c    ┆ const │
│ 2   ┆ b   ┆ 1   ┆ null ┆ null ┆ null  │
│ 1   ┆ c   ┆ 2   ┆ 1    ┆ a    ┆ let   │
│ 1   ┆ c   ┆ 2   ┆ 1    ┆ c    ┆ var   │
│ 1   ┆ c   ┆ 2   ┆ 1    ┆ c    ┆ const │
│ 1   ┆ c   ┆ 3   ┆ 1    ┆ a    ┆ let   │
│ 1   ┆ c   ┆ 3   ┆ 1    ┆ c    ┆ var   │
│ 1   ┆ c   ┆ 3   ┆ 1    ┆ c    ┆ const │
└─────┴─────┴─────┴──────┴──────┴───────┘

Reshaping

A reshaping (melt) operation unpivots a wide-format DataFrame into long format

import polars as pl

df = pl.DataFrame({"A": ["a", "b", "a"], "B": [1, 3, 5], "C": [10, 11, 12], "D": [2, 4, 6]})
print(df)

shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ C   ┆ D   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 10  ┆ 2   │
│ b   ┆ 3   ┆ 11  ┆ 4   │
│ a   ┆ 5   ┆ 12  ┆ 6   │
└─────┴─────┴─────┴─────┘

out = df.melt(id_vars=["A", "B"], value_vars=["C", "D"])
print(out)

shape: (6, 4)
┌─────┬─────┬──────────┬───────┐
│ A   ┆ B   ┆ variable ┆ value │
│ --- ┆ --- ┆ ---      ┆ ---   │
│ str ┆ i64 ┆ str      ┆ i64   │
╞═════╪═════╪══════════╪═══════╡
│ a   ┆ 1   ┆ C        ┆ 10    │
│ b   ┆ 3   ┆ C        ┆ 11    │
│ a   ┆ 5   ┆ C        ┆ 12    │
│ a   ┆ 1   ┆ D        ┆ 2     │
│ b   ┆ 3   ┆ D        ┆ 4     │
│ a   ┆ 5   ┆ D        ┆ 6     │
└─────┴─────┴──────────┴───────┘

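Pivoting

A pivot operation consists of a grouping on one or more columns (these become the new y-axis), the column to be pivoted (it becomes the new x-axis), and an aggregation:

first: first item
sum: sum
min: minimum
max: maximum
mean: mean
median: median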

import polars as pl

# Build the DataFrame
df = pl.DataFrame(
    {
        "foo": ["A", "A", "B", "B", "C"],
        "N": [1, 2, 2, 4, 2],
        "bar": ["k", "l", "m", "n", "o"],
    }
)
print(df)

shape: (5, 3)
┌─────┬─────┬─────┐
│ foo ┆ N   ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A   ┆ 1   ┆ k   │
│ A   ┆ 2   ┆ l   │
│ B   ┆ 2   ┆ m   │
│ B   ┆ 4   ┆ n   │
│ C   ┆ 2   ┆ o   │
└─────┴─────┴─────┘

## Eager

out = df.pivot(
    index="foo",
    columns="bar",
    values="N",
)
out

shape: (3, 6)

foo	k	l	m	n	o
str	i64	i64	i64	i64	i64
"A"	1	2	null	null	null
"B"	null	null	2	4	null
"C"	null	null	null	null	2

## Lazy

q = (
    df.lazy()
    .collect()
    .pivot(
        index="foo",
        columns="bar",
        values="N",
    )
    .lazy()
)
out = q.collect()
print(out)

shape: (3, 6)
┌─────┬──────┬──────┬──────┬──────┬──────┐
│ foo ┆ k    ┆ l    ┆ m    ┆ n    ┆ o    │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╪══════╪══════╡
│ A   ┆ 1    ┆ 2    ┆ null ┆ null ┆ null │
│ B   ┆ null ┆ null ┆ 2    ┆ 4    ┆ null │
│ C   ┆ null ┆ null ┆ null ┆ null ┆ 2    │
└─────┴──────┴──────┴──────┴──────┴──────┘

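Sorting

Polars supports sorting behaviour similar to other DataFrame libraries: sorting by one or more columns, each in its own (possibly different) order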

import numpy as np
import polars as pl

df = pl.DataFrame({"a": np.arange(1, 4), "b": ["a", "a", "b"]})  # np.arange(1, 4) generates the integers in [1, 4)
print(df)

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ a   │
│ 3   ┆ b   │
└─────┴─────┘

## Eager
out = df.sort(["b", "a"], descending=[True, False])  # sort by both columns: "b" descending, "a" ascending
print(out)

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3   ┆ b   │
│ 1   ┆ a   │
│ 2   ┆ a   │
└─────┴─────┘

## Lazy
import polars as pl

q = df.lazy().sort(pl.col("a"), descending=True)  # lazy sort on column "a"
out = q.collect()
print(out)

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3   ┆ b   │
│ 2   ┆ a   │
│ 1   ┆ a   │
└─────┴─────┘


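Conditional application

We can use the .when()/.then()/.otherwise() expressions.

when - takes a predicate expression
then - the expression used when the predicate is True
otherwise - the expression used when the predicate is False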

import numpy as np
import polars as pl

# Build the DataFrame
df = pl.DataFrame({"range": np.arange(10), "left": ["foo"] * 10, "right": ["bar"] * 10})
df.head()

shape: (5, 3)

range	left	right
i64	str	str
0	"foo"	"bar"
1	"foo"	"bar"
2	"foo"	"bar"
3	"foo"	"bar"
4	"foo"	"bar"


q = df.lazy().with_columns(
    pl.when(pl.col("range") >= 5).then(pl.col("left")).otherwise(pl.col("right")).alias("foo_or_bar")  # .alias names the added column
)

df = q.collect()
print(df)

shape: (10, 4)
┌───────┬──────┬───────┬────────────┐
│ range ┆ left ┆ right ┆ foo_or_bar │
│ ---   ┆ ---  ┆ ---   ┆ ---        │
│ i64   ┆ str  ┆ str   ┆ str        │
╞═══════╪══════╪═══════╪════════════╡
│ 0     ┆ foo  ┆ bar   ┆ bar        │
│ 1     ┆ foo  ┆ bar   ┆ bar        │
│ 2     ┆ foo  ┆ bar   ┆ bar        │
│ 3     ┆ foo  ┆ bar   ┆ bar        │
│ 4     ┆ foo  ┆ bar   ┆ bar        │
│ 5     ┆ foo  ┆ bar   ┆ foo        │
│ 6     ┆ foo  ┆ bar   ┆ foo        │
│ 7     ┆ foo  ┆ bar   ┆ foo        │
│ 8     ┆ foo  ┆ bar   ┆ foo        │
│ 9     ┆ foo  ┆ bar   ┆ foo        │
└───────┴──────┴───────┴────────────┘

Custom functions

import polars as pl

my_map = {1: "foo", 2: "bar", 3: "ham", 4: "spam", 5: "eggs"}

s = pl.Series("a", [1, 2, 3, 4, 5])  # build the Series
s = s.apply(lambda x: my_map[x])  # map every element through a lambda
s

/tmp/ipykernel_3199148/1850177654.py:6: DeprecationWarning: `apply` is deprecated. It has been renamed to `map_elements`.
  s = s.apply(lambda x: my_map[x])  # map every element through a lambda
/tmp/ipykernel_3199148/1850177654.py:6: PolarsInefficientMapWarning: 
Series.map_elements is significantly slower than the native series API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - s.map_elements(lambda x: ...)
with this one instead:
  + s.replace(my_map)

  s = s.apply(lambda x: my_map[x])  # map every element through a lambda

shape: (5,)

a
str
"foo"
"bar"
"ham"
"spam"
"eggs"

Window functions

import polars as pl

dataset = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

q = dataset.lazy().with_columns(
    [
        pl.sum("A").over("fruits").alias("fruit_sum_A"),  # sum "A" within each "fruits" group and store it as a new column
        pl.first("B").over("fruits").alias("fruit_first_B"),
        pl.max("B").over("cars").alias("cars_max_B"),
    ]
)

df = q.collect()
df

shape: (5, 7)

A	fruits	B	cars	fruit_sum_A	fruit_first_B	cars_max_B
i64	str	i64	str	i64	i64	i64
1	"banana"	5	"beetle"	8	5	5
2	"banana"	4	"audi"	8	5	4
3	"apple"	3	"beetle"	7	3	5
4	"apple"	2	"beetle"	7	3	5
5	"banana"	1	"beetle"	8	5	5

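10. Performance

Strings

If we need to reorder an Arrow UTF8 array, we have to move all the bytes of the string values, which can be very expensive when the strings are large.
For a Vec<String>, we only need to swap pointers, moving just 8 bytes of data per element, which is cheap.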
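
The difference between the two layouts can be sketched in plain Python. This is only a toy model under stated assumptions: a bytes buffer plus an offsets list stands in for the Arrow UTF8 layout, and a list of Python string references stands in for the pointer-based layout. The helper `reorder_contiguous` is hypothetical and written only for this illustration.

```python
# Toy model of a contiguous layout: all string bytes in one buffer, plus offsets
values = ["pear", "fig", "banana"]
data = b"".join(v.encode("utf-8") for v in values)
offsets = [0]
for v in values:
    offsets.append(offsets[-1] + len(v.encode("utf-8")))


def reorder_contiguous(data, offsets, order):
    """Reordering copies every string's bytes into a fresh buffer."""
    new_data = bytearray()
    new_offsets = [0]
    for i in order:
        new_data += data[offsets[i]:offsets[i + 1]]
        new_offsets.append(len(new_data))
    return bytes(new_data), new_offsets


# Indices that sort the strings alphabetically
order = sorted(range(len(values)), key=lambda i: values[i])
new_data, new_offsets = reorder_contiguous(data, offsets, order)

# Toy model of a pointer layout: reordering only permutes references
refs = list(values)
sorted_refs = [refs[i] for i in order]

print(new_data, new_offsets)
print(sorted_refs)
```

In the first case the work grows with the total number of bytes, in the second only with the number of elements.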
