polars自学—官方文档：https://docs.pola.rs/user-guide/getting-started/

最新推荐文章于 2024-07-12 19:06:42 发布

jack.06wen

最新推荐文章于 2024-07-12 19:06:42 发布

阅读量791

点赞数 29

文章标签： python 数据分析

本文链接：https://blog.csdn.net/m0_59075153/article/details/139636828

版权

开始简介

数据读写

polars数据读写与pandas类似

存储 df.write_csv(“docs/data/output.csv”)

读取 df_csv = pl.read_csv(“docs/data/output.csv”)

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "date": [
            datetime(2025, 1, 1),
            datetime(2025, 1, 2),
            datetime(2025, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0],
        "string": ["a", "b", "c"],
    }
)

df

shape: (3, 4)

integer	date	float	string
i64	datetime[μs]	f64	str
1	2025-01-01 00:00:00	4.0	"a"
2	2025-01-02 00:00:00	5.0	"b"
3	2025-01-03 00:00:00	6.0	"c"

Expressions

polars中最核心的部分Expressions，Expressions提供了一个模块结构，在该结构内，你可以使用并不断叠加简单的concepts（另外一个核心的概念），最终实现复杂的查询。
在polars中，主要由以下四个基本的模块结构(也称contexts)：

select
filter
group_by
with_columns
select
为了选择某列，首先需要定义对应的数据集dataframe，其次要明确需要的列

# col('*')表示选择所有列, 与pl.all()相同
print(df.select(pl.col("*")))

print(df.select(pl.all()))

# 选择特定列
print(df.select(pl.col('float','date')))

shape: (3, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 4.0   ┆ a      │
│ 2       ┆ 2025-01-02 00:00:00 ┆ 5.0   ┆ b      │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 6.0   ┆ c      │
└─────────┴─────────────────────┴───────┴────────┘
shape: (3, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 4.0   ┆ a      │
│ 2       ┆ 2025-01-02 00:00:00 ┆ 5.0   ┆ b      │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 6.0   ┆ c      │
└─────────┴─────────────────────┴───────┴────────┘
shape: (3, 2)
┌───────┬─────────────────────┐
│ float ┆ date                │
│ ---   ┆ ---                 │
│ f64   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 4.0   ┆ 2025-01-01 00:00:00 │
│ 5.0   ┆ 2025-01-02 00:00:00 │
│ 6.0   ┆ 2025-01-03 00:00:00 │
└───────┴─────────────────────┘

filter

# 通过日期筛选
print(df.filter(pl.col('date').is_between(datetime(2025, 1, 1), datetime(2025, 1, 2))))

#通过数值筛选
print(df.filter(pl.col('float').is_between(5, 6) ))

shape: (2, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 4.0   ┆ a      │
│ 2       ┆ 2025-01-02 00:00:00 ┆ 5.0   ┆ b      │
└─────────┴─────────────────────┴───────┴────────┘
shape: (2, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 2       ┆ 2025-01-02 00:00:00 ┆ 5.0   ┆ b      │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 6.0   ┆ c      │
└─────────┴─────────────────────┴───────┴────────┘

select和fliter返回的dataframe均为筛选后的，其一般不会新增列，group_by和 with_columns能对原始数据的列进行替换或添加

with_column

print(df.with_columns(pl.col('float').sum().alias('new_folat'), (pl.col('string')+'add').alias('string+add')))
# 使用alias创建新列，否则替换原列
print(df.with_columns(pl.col('float').sum(), (pl.col('string')+'add')))

shape: (3, 6)
┌─────────┬─────────────────────┬───────┬────────┬───────────┬────────────┐
│ integer ┆ date                ┆ float ┆ string ┆ new_folat ┆ string+add │
│ ---     ┆ ---                 ┆ ---   ┆ ---    ┆ ---       ┆ ---        │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    ┆ f64       ┆ str        │
╞═════════╪═════════════════════╪═══════╪════════╪═══════════╪════════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 4.0   ┆ a      ┆ 15.0      ┆ aadd       │
│ 2       ┆ 2025-01-02 00:00:00 ┆ 5.0   ┆ b      ┆ 15.0      ┆ badd       │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 6.0   ┆ c      ┆ 15.0      ┆ cadd       │
└─────────┴─────────────────────┴───────┴────────┴───────────┴────────────┘
shape: (3, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 15.0  ┆ aadd   │
│ 2       ┆ 2025-01-02 00:00:00 ┆ 15.0  ┆ badd   │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 15.0  ┆ cadd   │
└─────────┴─────────────────────┴───────┴────────┘

group_by

df2 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)

df2.head()

shape: (5, 2)

x	y
i64	str
0	"A"
1	"A"
2	"A"
3	"B"
4	"B"

df2.group_by(['y'], maintain_order=True).mean()

shape: (4, 2)

y	x
str	f64
"A"	1.0
"B"	3.5
"C"	5.0
"X"	6.5

print(df2.group_by('y', maintain_order=True).agg(pl.col('*').mean().alias('mean'), pl.col('*').count().alias('count'),))

shape: (4, 3)
┌─────┬──────┬───────┐
│ y   ┆ mean ┆ count │
│ --- ┆ ---  ┆ ---   │
│ str ┆ f64  ┆ u32   │
╞═════╪══════╪═══════╡
│ A   ┆ 1.0  ┆ 3     │
│ B   ┆ 3.5  ┆ 2     │
│ C   ┆ 5.0  ┆ 1     │
│ X   ┆ 6.5  ┆ 2     │
└─────┴──────┴───────┘

以上4种结构不仅可以单独使用，还可以相互配合以实现更强大的查询需求

print( df.with_columns((pl.col('float')*6).alias('float*6')).select(pl.all().exclude('string')))

shape: (3, 4)
┌─────────┬─────────────────────┬───────┬─────────┐
│ integer ┆ date                ┆ float ┆ float*6 │
│ ---     ┆ ---                 ┆ ---   ┆ ---     │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ f64     │
╞═════════╪═════════════════════╪═══════╪═════════╡
│ 1       ┆ 2025-01-01 00:00:00 ┆ 4.0   ┆ 24.0    │
│ 2       ┆ 2025-01-02 00:00:00 ┆ 5.0   ┆ 30.0    │
│ 3       ┆ 2025-01-03 00:00:00 ┆ 6.0   ┆ 36.0    │
└─────────┴─────────────────────┴───────┴─────────┘

合并数据

import numpy as np
df3 = pl.DataFrame(
    {
        "a": range(8),
        "b": np.random.rand(8),
        "d": [1, 2.0, float("nan"), float("nan"), 0, -5, -42, None],
    }
)

df4 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)

print(df3.head(), df4.head())

shape: (5, 3)
┌─────┬──────────┬─────┐
│ a   ┆ b        ┆ d   │
│ --- ┆ ---      ┆ --- │
│ i64 ┆ f64      ┆ f64 │
╞═════╪══════════╪═════╡
│ 0   ┆ 0.411314 ┆ 1.0 │
│ 1   ┆ 0.984068 ┆ 2.0 │
│ 2   ┆ 0.169014 ┆ NaN │
│ 3   ┆ 0.712731 ┆ NaN │
│ 4   ┆ 0.248682 ┆ 0.0 │
└─────┴──────────┴─────┘ shape: (5, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ A   │
│ 1   ┆ A   │
│ 2   ┆ A   │
│ 3   ┆ B   │
│ 4   ┆ B   │
└─────┴─────┘

df5 = df3.join(df4, left_on="a", right_on="x")
df5

shape: (8, 4)

a	b	d	y
i64	f64	f64	str
0	0.411314	1.0	"A"
1	0.984068	2.0	"A"
2	0.169014	NaN	"A"
3	0.712731	NaN	"B"
4	0.248682	0.0	"B"
5	0.921465	-5.0	"C"
6	0.516578	-42.0	"X"
7	0.145339	null	"X"

df3.hstack(df4)

shape: (8, 5)

a	b	d	x	y
i64	f64	f64	i64	str
0	0.411314	1.0	0	"A"
1	0.984068	2.0	1	"A"
2	0.169014	NaN	2	"A"
3	0.712731	NaN	3	"B"
4	0.248682	0.0	4	"B"
5	0.921465	-5.0	5	"C"
6	0.516578	-42.0	6	"X"
7	0.145339	null	7	"X"

Concepts - Polars API 的核心概念介绍

数据类型

常见的数据类型都包括在polars中，详见：https://docs.pola.rs/user-guide/concepts/data-types/overview/
在使用中，可以采用.cast(dtyps)的方法，转换对应的数据类型

数据结构

与pandas一样，主要包括 Series 和 DataFrame

# Series
# 1维数据
import polars as pl

pl.Series('num',range(6))

shape: (6,)

num
i64
0
1
2
3
4
5

DataFrame

二维数据，与pandas中DataFrame概念相同，可以通过读取、创建、变换等方式得到。使用pl.DataFrame(),与df.to_pandas(),可以和pandas的DataFrame相互变换.

from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3, 4, 5],
        "date": [
            datetime(2022, 1, 1),
            datetime(2022, 1, 2),
            datetime(2022, 1, 3),
            datetime(2022, 1, 4),
            datetime(2022, 1, 5),
        ],
        "float": [4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

print(df)

print(type(df.to_pandas()))

shape: (5, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘
<class 'pandas.core.frame.DataFrame'>

数据浏览

与pandas相同，polars包含.head(), .tail(), .sample(), .describe()等方法初步观看数据，同时支持[:,:]切片查询

# 查看表头
print(df.head(2))
# 查看表尾
print(df.tail(3))
# 随机取样
print(df.sample(1))
# 总体分析
print(df.describe())
# 切片查询
print(df[:,1:3])

shape: (2, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
└─────────┴─────────────────────┴───────┘
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘
shape: (1, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘
shape: (9, 4)
┌────────────┬──────────┬─────────────────────┬──────────┐
│ statistic  ┆ integer  ┆ date                ┆ float    │
│ ---        ┆ ---      ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ str                 ┆ f64      │
╞════════════╪══════════╪═════════════════════╪══════════╡
│ count      ┆ 5.0      ┆ 5                   ┆ 5.0      │
│ null_count ┆ 0.0      ┆ 0                   ┆ 0.0      │
│ mean       ┆ 3.0      ┆ 2022-01-03 00:00:00 ┆ 6.0      │
│ std        ┆ 1.581139 ┆ null                ┆ 1.581139 │
│ min        ┆ 1.0      ┆ 2022-01-01 00:00:00 ┆ 4.0      │
│ 25%        ┆ 2.0      ┆ 2022-01-02 00:00:00 ┆ 5.0      │
│ 50%        ┆ 3.0      ┆ 2022-01-03 00:00:00 ┆ 6.0      │
│ 75%        ┆ 4.0      ┆ 2022-01-04 00:00:00 ┆ 7.0      │
│ max        ┆ 5.0      ┆ 2022-01-05 00:00:00 ┆ 8.0      │
└────────────┴──────────┴─────────────────────┴──────────┘
shape: (5, 2)
┌─────────────────────┬───────┐
│ date                ┆ float │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ f64   │
╞═════════════════════╪═══════╡
│ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2022-01-02 00:00:00 ┆ 5.0   │
│ 2022-01-03 00:00:00 ┆ 6.0   │
│ 2022-01-04 00:00:00 ┆ 7.0   │
│ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────────────────┴───────┘

Contexts 查询背景

polars的两个核心组件是背景（Contexts）和表达式（Expressions）
polars中主要的三种查询背景分别为：

选择：df.select(), df.with_columns()
筛选: df.filter()
分类/聚合: df.group_by().agg()

# slect
# polars在其expression中支持广播机制，在expression中，如果不使用alias从新命名，默认替换原数据，因此需要注意列名重复问题
print(df)
print(df.select(
    pl.col('date').dt.day(),
    pl.col('float')/pl.col('float').sum(),
    (pl.col('float')*10).alias('float*10')
    ))

shape: (5, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘
shape: (5, 3)
┌──────┬──────────┬──────────┐
│ date ┆ float    ┆ float*10 │
│ ---  ┆ ---      ┆ ---      │
│ i8   ┆ f64      ┆ f64      │
╞══════╪══════════╪══════════╡
│ 1    ┆ 0.133333 ┆ 40.0     │
│ 2    ┆ 0.166667 ┆ 50.0     │
│ 3    ┆ 0.2      ┆ 60.0     │
│ 4    ┆ 0.233333 ┆ 70.0     │
│ 5    ┆ 0.266667 ┆ 80.0     │
└──────┴──────────┴──────────┘

# with_columns与select不同之处在于，with_columns会保留其他列，select仅保留选择对象
print(df.select(
    pl.col('date').dt.day(),
    pl.col('float')/pl.col('float').sum(),
    (pl.col('float')*10).alias('float*10')
    ))

print(df.with_columns(
    pl.col('date').dt.day(),
    pl.col('float')/pl.col('float').sum(),
    (pl.col('float')*10).alias('float*10')
    ))

shape: (5, 3)
┌──────┬──────────┬──────────┐
│ date ┆ float    ┆ float*10 │
│ ---  ┆ ---      ┆ ---      │
│ i8   ┆ f64      ┆ f64      │
╞══════╪══════════╪══════════╡
│ 1    ┆ 0.133333 ┆ 40.0     │
│ 2    ┆ 0.166667 ┆ 50.0     │
│ 3    ┆ 0.2      ┆ 60.0     │
│ 4    ┆ 0.233333 ┆ 70.0     │
│ 5    ┆ 0.266667 ┆ 80.0     │
└──────┴──────────┴──────────┘
shape: (5, 4)
┌─────────┬──────┬──────────┬──────────┐
│ integer ┆ date ┆ float    ┆ float*10 │
│ ---     ┆ ---  ┆ ---      ┆ ---      │
│ i64     ┆ i8   ┆ f64      ┆ f64      │
╞═════════╪══════╪══════════╪══════════╡
│ 1       ┆ 1    ┆ 0.133333 ┆ 40.0     │
│ 2       ┆ 2    ┆ 0.166667 ┆ 50.0     │
│ 3       ┆ 3    ┆ 0.2      ┆ 60.0     │
│ 4       ┆ 4    ┆ 0.233333 ┆ 70.0     │
│ 5       ┆ 5    ┆ 0.266667 ┆ 80.0     │
└─────────┴──────┴──────────┴──────────┘

# filter 支持单个或多个筛选
df.filter((pl.col('integer')>2) & (pl.col('float')<7))

shape: (1, 3)

integer	date	float
i64	datetime[μs]	f64
3	2022-01-03 00:00:00	6.0

# group_by/aggregation
# 分类与聚合
df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": [2, 1, 2, 4, 5],
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

# group_by的背景下，可以多次使用expression，生成多个新列
print(df.group_by('groups').agg(
    pl.col('nrs'),
    pl.col('nrs').mean().alias('nrs_mean'),
    pl.col('nrs').count().alias('nrs_count'),
    pl.col('nrs').is_null().sum().alias('nrs_na_sum'),
    pl.col('names')
))

shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ i64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 1    ┆ foo   ┆ 2      ┆ A      │
│ 2    ┆ ham   ┆ 1      ┆ A      │
│ 3    ┆ spam  ┆ 2      ┆ B      │
│ null ┆ egg   ┆ 4      ┆ C      │
│ 5    ┆ null  ┆ 5      ┆ B      │
└──────┴───────┴────────┴────────┘
shape: (3, 6)
┌────────┬───────────┬──────────┬───────────┬────────────┬────────────────┐
│ groups ┆ nrs       ┆ nrs_mean ┆ nrs_count ┆ nrs_na_sum ┆ names          │
│ ---    ┆ ---       ┆ ---      ┆ ---       ┆ ---        ┆ ---            │
│ str    ┆ list[i64] ┆ f64      ┆ u32       ┆ u32        ┆ list[str]      │
╞════════╪═══════════╪══════════╪═══════════╪════════════╪════════════════╡
│ C      ┆ [null]    ┆ null     ┆ 0         ┆ 1          ┆ ["egg"]        │
│ A      ┆ [1, 2]    ┆ 1.5      ┆ 2         ┆ 0          ┆ ["foo", "ham"] │
│ B      ┆ [3, 5]    ┆ 4.0      ┆ 2         ┆ 0          ┆ ["spam", null] │
└────────┴───────────┴──────────┴───────────┴────────────┴────────────────┘

Expressions 查询表达

expression是polars的核心内容，拥有者强大的选择查询能力，以下为一个expression示例

pl.col(‘nrs’).sum()

该expression示意：1.选择‘nrs’列；

2.对列进行求和

df.select(pl.col('random').sort().head(3), pl.col('nrs').filter(pl.col('groups')=='A').sum())

shape: (3, 2)

random	nrs
i64	i64
1	3
2	3
2	3

Lazy API 与 Streaming API

文档中特别强调该两接口，在大规模数据加载与处理过程中，使用该接口能很好的加快速度，lazy接口能避免重复的执行查询，优化查询路径，从而加快速度

# lazy
# 原始方法
df = pl.read_csv("docs/data/iris.csv")
df_small = df.filter(pl.col("sepal_length") > 5)
df_agg = df_small.group_by("species").agg(pl.col("sepal_width").mean())

#lazy查询，注意，其用.scan_调用lazy接口，在处理dataframe时，可使用.lazy()调用
q = (
    pl.scan_csv("docs/data/iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .group_by("species")
    .agg(pl.col("sepal_width").mean())
)

df = q.collect(streaming=True) # 使用streaming调用streaming API