[WIP] Polars Study Notes 01: polars schema

Article link: https://blog.csdn.net/qq_45065669/article/details/134462550

Polars Study Notes 01

Why Start Learning Polars

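While competing on Kaggle recently (Enefit, Optiver, WritingQuality), I needed to do heavy feature engineering on fairly large datasets, and many public notebooks used polars to process that data efficiently. Because I had always used pandas and never learned polars, I had trouble following those notebooks, so I decided to study polars properly and record what I learn along the way.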

Some Differences Between polars and pandas (A Few Examples)

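polars reads files faster than pandas 1.x (NumPy backend). pandas 2.x can optionally use a pyarrow backend, which brings its read speed close to that of polars.
groupby is faster: in the db-benchmark published by H2O, for a groupby on 50 GB of data, polars 0.19.8 took 32 s while pandas 2.1.1 took 773 s (about 24 times as long).
join is faster: in the same db-benchmark, for a join on 5 GB of data, polars 0.19.8 took 14 s while pandas 2.1.1 took 265 s (about 19 times as long).
polars runs operations in parallel across multiple threads automatically, whereas pandas needs extra work to use multiple threads.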

Creating a Series and a DataFrame in polars

First, import both polars and pandas (pandas is used here to compare usage):

import numpy as np
import polars as pl # 0.19.12
import pandas as pd # 2.1.3, for comparison with polars usage

These notes use polars 0.19.12 and pandas 2.1.3.
Creating a Series in polars is similar to pandas. The polars.Series API is shown below:

class polars.Series(
	name: str | ArrayLike | None = None, # name of the Series
	values: ArrayLike | None = None, # data for the Series
	dtype: PolarsDataType | None = None, # data type
	*,
	strict: bool = True,
	nan_to_null: bool = False,
	dtype_if_empty: PolarsDataType | None = None,
)

Creating a Series with 10,000 values looks very similar in polars and pandas:

# with polars
series_pl = pl.Series('pl_series', np.random.rand(10000,)) 

# with pandas
series_pd = pd.Series(np.random.rand(10000,), name='pd_series')

Creating a DataFrame in polars is fairly flexible:

class polars.DataFrame(
	data: FrameInitTypes | None = None, # data for the DataFrame
	schema: SchemaDefinition | None = None, # schema (column names and dtypes) of the DataFrame
	*,
	schema_overrides: SchemaDict | None = None,
	orient: Orientation | None = None, # how to interpret two-dimensional data
	infer_schema_length: int | None = 100,
	nan_to_null: bool = False,
)

The schema parameter defines the structure of the DataFrame. According to the API documentation, it accepts the following forms:

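A dict of the form {'name1': dtype, 'name2': dtype, …, 'nameN': dtype}, which sets both the column names and each column's data type; if a dtype is None, polars infers it from the data.
A list of the form ['name1', 'name2', …, 'nameN'], which sets only the column names; polars infers the dtypes.
A list of tuples of the form [('name1', dtype), ('name2', dtype), …, ('nameN', dtype)], which also sets both the column names and the dtypes.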

The following examples create DataFrames with polars and explain how several commonly used parameters work.

# create a DataFrame with polars by passing dict data directly
df_pl = pl.DataFrame({
    'col1':np.random.randint(0, 100, 10000,)
    , 'col2':np.random.rand(10000,)
})

# create a DataFrame with pandas
df_pd = pd.DataFrame({
    'col1':np.random.randint(0, 100, 10000,)
    , 'col2':np.random.rand(10000,)
})

Using schema

df_pl = pl.DataFrame([
    np.random.randint(0, 100, 10000,)
    , np.random.rand(10000,)
], schema=['col1', 'col2']) # pass a list of column names as the schema
print(df_pl)
'''
shape: (10_000, 2)
┌──────┬──────────┐
│ col1 ┆ col2     │
│ ---  ┆ ---      │
│ i32  ┆ f64      │     -> dtypes inferred by polars by default
╞══════╪══════════╡
│ 85   ┆ 0.703093 │
│ 84   ┆ 0.951594 │
│ 37   ┆ 0.829924 │
│ 92   ┆ 0.485065 │
│ …    ┆ …        │
│ 58   ┆ 0.548643 │
│ 18   ┆ 0.187007 │
│ 35   ┆ 0.018052 │
│ 94   ┆ 0.480516 │
└──────┴──────────┘
'''

df_pl = pl.DataFrame([
    np.random.randint(0, 100, 10000,)
    , np.random.rand(10000,)
], schema={'col1':pl.Int16, 'col2':pl.Float32}) # specify dtypes with a dict
print(df_pl)
'''
shape: (10_000, 2)
┌──────┬──────────┐
│ col1 ┆ col2     │
│ ---  ┆ ---      │
│ i16  ┆ f32      │     -> dtypes set by the dict-style schema
╞══════╪══════════╡
│ 66   ┆ 0.721416 │
│ 76   ┆ 0.737047 │
│ 1    ┆ 0.797954 │
│ 54   ┆ 0.334464 │
│ …    ┆ …        │
│ 32   ┆ 0.181006 │
│ 91   ┆ 0.606713 │
│ 84   ┆ 0.680303 │
│ 27   ┆ 0.228896 │
└──────┴──────────┘
'''

Using orient

df_pl = pl.DataFrame([
    np.random.randint(0, 100, 10000,)
    , np.random.rand(10000,)
], orient='col')  # interpret the 2D data column-wise: the 2 x 10000 input is treated as 10000 x 2 (each row of the input becomes a column)
print(df_pl)

'''
shape: (10_000, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i32      ┆ f64      │
╞══════════╪══════════╡
│ 38       ┆ 0.762026 │
│ 53       ┆ 0.820637 │
│ 72       ┆ 0.543824 │
│ 97       ┆ 0.004582 │
│ …        ┆ …        │
│ 99       ┆ 0.616119 │
│ 63       ┆ 0.634567 │
│ 9        ┆ 0.184565 │
│ 23       ┆ 0.251811 │
└──────────┴──────────┘
'''

df_pl = pl.DataFrame([
    np.random.randint(0, 100, 10000,)
    , np.random.rand(10000,)
], orient='row')  # the 2 x 10000 input is interpreted as 2 x 10000: each row of the input stays a row
print(df_pl)
'''
shape: (2, 10_000)
┌──────────┬──────────┬──────────┬──────────┬───┬─────────────┬─────────────┬─────────────┬─────────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 ┆ … ┆ column_9996 ┆ column_9997 ┆ column_9998 ┆ column_9999 │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆   ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│ f64      ┆ f64      ┆ f64      ┆ f64      ┆   ┆ f64         ┆ f64         ┆ f64         ┆ f64         │
╞══════════╪══════════╪══════════╪══════════╪═══╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 68.0     ┆ 1.0      ┆ 63.0     ┆ 58.0     ┆ … ┆ 7.0         ┆ 9.0         ┆ 7.0         ┆ 97.0        │
│ 0.554154 ┆ 0.799863 ┆ 0.721035 ┆ 0.570688 ┆ … ┆ 0.653323    ┆ 0.316002    ┆ 0.969565    ┆ 0.632796    │
└──────────┴──────────┴──────────┴──────────┴───┴─────────────┴─────────────┴─────────────┴─────────────┘
'''