From Pandas to Polars, Part 46: Reading and Writing S3 Data with Polars

sosogod

Last modified: 2024-08-08 09:00:28

Category column: High-Speed Data Processing: Polars Revealed. Tags: pandas, python

First published: 2024-08-08 08:59:59

Article link: https://blog.csdn.net/sosogod/article/details/141015899

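In this post we will see how to read and write data in CSV or Parquet files on S3 with Polars. We will also look at how to filter a file on S3 before downloading it, to reduce the amount of data transferred across the network.

Writing files to S3

We will create a simple DataFrame with three columns and write it to a CSV file and a Parquet file on S3 using the s3fs library. s3fs lets you read and write files on S3 with a syntax similar to working on the local file system.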

import polars as pl
import s3fs

bucket_name = "my_bucket"
csv_key = "test_write.csv"
parquet_key = "test_write.parquet"

# s3fs resolves AWS credentials itself (for example from the environment)
fs = s3fs.S3FileSystem()
df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)
# Open each S3 object in binary write mode and let Polars write into it
with fs.open(f"{bucket_name}/{csv_key}", mode="wb") as f:
    df.write_csv(f)
with fs.open(f"{bucket_name}/{parquet_key}", mode="wb") as f:
    df.write_parquet(f)

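If you have the choice, I recommend Parquet: the files are smaller, the data types (dtypes) are preserved, and subsequent reads are faster.

Reading files from S3

We can read the files from S3 with Polars' pl.read_csv and pl.read_parquet functions.

df_csv = pl.read_csv(f"s3://{bucket_name}/{csv_key}")
df_parquet = pl.read_parquet(f"s3://{bucket_name}/{parquet_key}")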

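Internally, Polars uses fsspec to read the remote file into an in-memory buffer and then reads the data from that buffer into a DataFrame. This is fast, but it does mean the whole file is read into memory. That is fine for small files, but for large files it can be slow and memory-hungry.

However, reading the whole file is wasteful when we only want a subset of the rows. For Parquet files, we can scan the file on S3 and read only the rows we need.

Scanning files on S3 with query optimization

For Parquet files, we can scan the file on S3 and build a lazy query. The Polars query optimizer applies the following optimizations:

Predicate pushdown, which means that any conditions used to filter rows are applied when reading from S3
Projection pushdown, which means that if only a subset of columns is needed, only those columns are read from S3.

We can do this with pl.scan_parquet. It may also need some options specific to the cloud storage provider. Polars first tries to get these from environment variables, but we can override them with the storage_options argument.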

import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "eu-west-1",
}
df = (
    # Scan the file on S3
    pl.scan_parquet(source, storage_options=storage_options)
    # Apply a filter condition
    .filter(pl.col("id") > 100)
    # Select only the columns we need
    .select("id", "value")
    # Collect the data
    .collect()
)
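
The storage_options dictionary above hard-codes its values. As a rough sketch of the "environment first, explicit override second" idea, the helper below collects the usual AWS environment variables and then applies any explicit overrides. The name build_storage_options is hypothetical and not part of Polars; the dictionary keys are the ones used in the example above.

```python
import os


def build_storage_options(overrides=None, environ=None):
    """Collect AWS settings from the environment, then apply explicit overrides.

    Hypothetical helper for illustration; not part of Polars.
    """
    environ = os.environ if environ is None else environ
    # Map the option keys used in this article to the usual AWS variable names.
    env_names = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "aws_region": "AWS_REGION",
    }
    options = {
        key: environ[name] for key, name in env_names.items() if name in environ
    }
    # Explicit values win over anything found in the environment.
    options.update(overrides or {})
    return options


if __name__ == "__main__":
    # Print only the option names so that no secret ends up in the output.
    print(sorted(build_storage_options({"aws_region": "eu-west-1"})))
```

The resulting dictionary can be passed as storage_options=... to pl.scan_parquet in the same way as the literal dictionary above.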

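With scan_parquet, Polars reads the Parquet file asynchronously, using the Rust object_store library under the hood.

Applying a filter to a CSV file

At the time of writing (October 2023), Polars does not support scanning CSV files on S3. In that case, we can use the boto3 library to apply a filter condition on S3 before the file is returned.

The key boto3 method is select_object_content. It lets us apply a SQL filter to a file on S3 before downloading it. It also requires some extra information about how the file is serialized (whether it is CSV or Parquet, whether it is compressed, and so on) and how the downloaded data should be serialized.

In this example we filter the CSV file on S3 and return only the rows where the "ham" column equals "a". I show the full code first and then walk through each part.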

import io

import boto3
import polars as pl

bucket_name = "my_bucket"
key = "test_write.csv"

# Create a boto3 client to interface with S3
s3 = boto3.client("s3")

# Define the SQL statement to filter the CSV data
sql_expression = "SELECT * FROM s3object s WHERE ham = 'a'"

# Use SelectObjectContent to filter the CSV data before downloading it
s3_object = s3.select_object_content(
    Bucket=bucket_name,
    Key=key,
    ExpressionType="SQL",
    Expression=sql_expression,
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# Create a reusable StringIO object
output = io.StringIO()

# Iterate over the filtered CSV data and write it to the StringIO object
for event in s3_object["Payload"]:
    if "Records" in event:
        records = event["Records"]["Payload"].decode("utf-8")
        output.write(records)

# Rewind the StringIO object to the beginning
output.seek(0)
# S3 Select's CSV output has no header row, so name the columns explicitly
df = pl.read_csv(output, has_header=False, new_columns=["foo", "bar", "ham"])

We start by creating the SQL query string:

sql_expression = "SELECT * FROM s3object s WHERE ham = 'a'"

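This uses the S3 Select SQL syntax, which differs in some ways from how you would write a query in a database such as PostgreSQL.

The s3object keyword refers to the object on S3.
s is an alias for that object.
The WHERE clause is the filter condition.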
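
One of those differences is that S3 Select reads CSV fields as strings, so a numeric comparison generally needs a CAST. The sketch below builds such an expression and also selects a subset of columns through the alias. The helper name build_select_expression and the chosen columns are illustrative assumptions, not something from boto3 or the original example.

```python
def build_select_expression(columns, where=None, alias="s"):
    """Build an S3 Select statement over a single object (hypothetical helper)."""
    # Qualify each column with the object alias, e.g. s.foo; fall back to *.
    projection = ", ".join(f"{alias}.{col}" for col in columns) if columns else "*"
    sql = f"SELECT {projection} FROM s3object {alias}"
    if where:
        sql += f" WHERE {where}"
    return sql


# CSV fields arrive as strings, so cast before comparing numerically.
sql_expression = build_select_expression(
    ["foo", "ham"], where="CAST(s.foo AS INT) > 2"
)
print(sql_expression)
```

The helper does plain string formatting, so it should only be fed column names and conditions you control, never untrusted input.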

Note that you will, of course, be charged for the compute used to filter the data.

Next we create s3_object by calling s3.select_object_content, which takes several arguments:

s3_object = s3.select_object_content(
    Bucket=bucket_name,
    Key=key,
    ExpressionType="SQL",
    Expression=sql_expression,
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

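In the InputSerialization argument we pass a dictionary that describes how to read our CSV file. Here we tell it that the file has a header row and is not compressed.

In the OutputSerialization argument we pass a dictionary that describes how the returned data should be serialized, in this case as CSV. Unfortunately, at the time of writing the only output serialization options are CSV and JSON, so even if your input is a Parquet file you cannot get Parquet back.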
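
For the JSON option, OutputSerialization can be set to {"JSON": {"RecordDelimiter": "\n"}}, which returns one JSON object per line. As a minimal sketch, the function below parses such a payload with only the standard library. The sample bytes are made up for illustration (they assume the example CSV, with values returned as strings) rather than captured from S3, and parse_json_lines is a hypothetical name.

```python
import json


def parse_json_lines(payload: bytes) -> list:
    """Turn newline-delimited JSON bytes into a list of row dictionaries."""
    text = payload.decode("utf-8")
    # Skip blank lines such as the one after the final record delimiter.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical payload, shaped like S3 Select JSON output for the example CSV.
sample_payload = b'{"foo":"1","bar":"6","ham":"a"}\n'
rows = parse_json_lines(sample_payload)
print(rows)
```

A list of row dictionaries like this can be handed to pl.DataFrame(rows); the column names then come from the JSON keys.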

Then we use Python's built-in io library to create a StringIO object that holds the data returned by s3.select_object_content as we pull it from S3.

# Create the StringIO object that collects the filtered data
output = io.StringIO()

# Iterate over the filtered CSV data and write it to the StringIO object
for event in s3_object["Payload"]:
    if "Records" in event:
        # Decode the bytes of each chunk to a string
        records = event["Records"]["Payload"].decode("utf-8")

        # Write the string to the StringIO object
        output.write(records)

# Rewind the StringIO object to the beginning
output.seek(0)

With this StringIO object we can now create a Polars DataFrame.
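
One caveat with decoding each chunk on its own: the event stream may be split at arbitrary byte positions, so a multi-byte UTF-8 character could straddle two Records events and make .decode("utf-8") fail. The sketch below uses an incremental decoder from the standard library to handle that case. The function name collect_records and the fake events are assumptions for illustration; the events only mimic the shape of the boto3 response used above.

```python
import codecs
import io


def collect_records(events) -> io.StringIO:
    """Concatenate the Records payloads of an S3 Select event stream into text."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    output = io.StringIO()
    for event in events:
        if "Records" in event:
            # The incremental decoder buffers any incomplete trailing bytes.
            output.write(decoder.decode(event["Records"]["Payload"]))
    # Flush the decoder; this raises if the stream ended mid-character.
    output.write(decoder.decode(b"", final=True))
    # Rewind so the caller can read from the start.
    output.seek(0)
    return output


# Hypothetical events: the two-byte character "é" is split across two chunks.
encoded = "1,6,café\n".encode("utf-8")
fake_events = [
    {"Records": {"Payload": encoded[:-2]}},
    {"Stats": {}},
    {"Records": {"Payload": encoded[-2:]}},
]
print(collect_records(fake_events).getvalue())
```

With a real response, collect_records(s3_object["Payload"]) would return a rewound StringIO that can be passed to pl.read_csv.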