15. Data Reshaping, Wrangling, and Time Series Analysis


play7


License: CC 4.0 BY-SA

Column: Polars in Practice: Efficient Data Processing. Tags: data reshaping, time series analysis, Polars

Article link: https://blog.csdn.net/play7/article/details/151096333


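Data Reshaping and Wrangling

Getting Ready

We work with the academic_df DataFrame introduced earlier.

Applying Reshaping Techniques

Partition a DataFrame by column values: the .partition_by() method splits a DataFrame into several independent DataFrames, one per distinct value of the chosen column.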

academic_df.partition_by('academic_year')

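This call partitions the DataFrame by the values of the academic_year column.

Transpose a DataFrame: the .transpose() method swaps the rows and columns of a DataFrame.

academic_df.transpose(include_header=True)

Reshape column values into arrays: the .reshape() method converts the values of a column into an array of the given shape.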

(
    academic_df
    .select(
        pl.col('academic_year', 'students').reshape((1, 5))
    )
)

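How It Works

.partition_by(): returns a list of DataFrames by default, or a dictionary when as_dict=True is passed; several columns can be given to partition on their combined values.
.transpose(): the header_name and column_names parameters name the header column and the new columns.
.reshape(): converts column values into arrays; the target shape is given as a tuple whose first value is the number of rows and whose second value is the number of elements in each array.

Notes

.partition_by() and .transpose() are available only on DataFrames, whereas .reshape() is available only on expressions (columns) and Series.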

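Time Series Analysis

Technical Requirements

Install Polars:

pip install polars

Import Polars:

import polars as pl

Install hvplot:

pip install hvplot

Download the data set: the examples use a historical hourly weather data set, which can be downloaded from the GitHub repository.
Read the data set into a LazyFrame:

lf = pl.scan_csv('../data/toronto_weather.csv')

Look at the first five rows:

lf.head().collect()

Convert the temperature unit from kelvin to degrees Celsius: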

lf = lf.with_columns(pl.col('temperature') - 273.15)
lf.head().collect()

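Working with Dates and Times

Steps

Convert data types while reading: if the date, datetime, or time columns in the source are stored as strings, the try_parse_dates parameter converts them at read time.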

lf_date_parsed = pl.scan_csv('../data/toronto_weather.csv', try_parse_dates=True)
lf_date_parsed.head().collect()

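On a LazyFrame, the column data types can be inspected with .collect_schema() (a DataFrame exposes the .schema and .dtypes attributes directly):

lf_date_parsed.collect_schema()
lf_date_parsed.collect_schema().dtypes()

Convert data types after reading: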

lf = lf.with_columns(pl.col('datetime').str.to_datetime())
lf.head().collect()

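Extract date features: derive features such as the year, month, day, and time from the datetime column (other accessors such as .dt.week() follow the same pattern).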

(
    lf
    .select(
        'datetime',
        pl.col('datetime').dt.year().alias('year'),
        pl.col('datetime').dt.month().alias('month'),
        pl.col('datetime').dt.day().alias('day'),
        pl.col('datetime').dt.time().alias('time')
    )
    .head().collect()
)

Filter by date and time attributes:

from datetime import datetime
filtered_lf = (
    lf
    .filter(
        pl.col('datetime').dt.date().is_between(
            datetime(2017, 1, 1), datetime(2017, 12, 31)
        ),
        pl.col('datetime').dt.hour() < 12
    )
)
filtered_lf.head().collect()

Check the filter result by counting the distinct years and hours that remain:

(
    filtered_lf
    .select(
        pl.col('datetime').dt.year().unique()
        .implode()
        .list.len()
        .alias('year_cnt'),
        pl.col('datetime').dt.hour().unique()
        .implode()
        .list.len()
        .alias('hour_cnt')
    )
    .head()
    .collect()
)

Convert and replace time zones:

time_zones_lf = (
    lf
    .select(
        'datetime',
        pl.col('datetime').dt.replace_time_zone('America/Toronto')
        .alias('replaced_time_zone_toronto'),
        pl.col('datetime').dt.convert_time_zone('America/Toronto')
        .alias('converted_time_zone_toronto')
    )
)
time_zones_lf.head().collect()

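How It Works

The dt namespace: provides many methods for working with date and time attributes.
Data type conversion: types can be converted either while reading or afterwards, for example by turning a string column into a datetime column with .str.to_datetime() or the more general .str.strptime().
Specifying date ranges: .is_between() is a convenient way to express a date range, and Python datetime objects can be used directly when filtering on date and time attributes.
Time zone handling: .dt.replace_time_zone() sets or replaces the time zone of a column while keeping the wall-clock values unchanged, whereas .dt.convert_time_zone() keeps the underlying instant and changes the displayed wall-clock values to the target time zone.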

There's More

Use the pl.duration() function to shift date and time values:

(
    lf
    .select(
        'datetime',
        (pl.col('datetime') - pl.duration(weeks=5)).alias('minus_5weeks'),
        (pl.col('datetime') + pl.duration(milliseconds=5)).alias('plus_5ms')
    )
    .head()
    .collect()
)

Applying Rolling Window Calculations

Steps

Compute a rolling mean: use .rolling_mean() to compute the rolling average of the temperature.

(
    lf
    .select(
        'datetime',
        'temperature',
        pl.col('temperature').rolling_mean(3).alias('3hr_rolling_avg')
    )
    .head()
    .collect()
)

Aggregate the temperature by day:

daily_avg_temperature_lf = (
    lf
    .select(
        pl.col('datetime').dt.date().alias('date'),
        'temperature'
    )
    .group_by('date', maintain_order=True)
    .agg(
        pl.col('temperature').mean().alias('daily_avg_temp')
    )
)
daily_avg_temperature_lf.head().collect()

Compute several rolling aggregations:

(
    daily_avg_temperature_lf
    .select(
        'date',
        'daily_avg_temp',
        pl.col('daily_avg_temp').rolling_mean(3).alias('3day_rolling_avg'),
        pl.col('daily_avg_temp').rolling_min(3).alias('3day_rolling_min'),
        pl.col('daily_avg_temp').rolling_max(3).alias('3day_rolling_max')
    )
    .head()
    .collect()
)

Use the .rolling() expression method:

(
    daily_avg_temperature_lf
    .set_sorted('date')
    .select(
        'date',
        'daily_avg_temp',
        pl.col('daily_avg_temp').rolling_mean(3).alias('3day_rolling_avg'),
        pl.col('daily_avg_temp').rolling_mean(
            window_size=3,
            min_periods=1
        )
        .alias('3day_rolling_avg2'),
        pl.col('daily_avg_temp').mean().rolling(
            index_column='date',
            period='3d',
            closed='right'
        )
        .alias('3day_rolling_avg3')
    )
    .head(10)
    .collect()
)

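Use the DataFrame-level .rolling() method: if the result contains only rolling calculations, this method lets Polars compute the window boundaries once and reuse them for every aggregation.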

(
    daily_avg_temperature_lf
    .set_sorted('date')
    .rolling(
        'date',
        period='3d'
    )
    .agg(
        pl.col('daily_avg_temp'),
        pl.col('daily_avg_temp').mean().alias('3day_rolling_avg'),
        pl.col('daily_avg_temp').min().alias('3day_rolling_min'),
        pl.col('daily_avg_temp').max().alias('3day_rolling_max'),
    )
    .head(10)
    .collect()
)

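Visualize the data: use the built-in plotting functionality of Polars to plot the daily average temperature together with its 60-day rolling average.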

(
    daily_avg_temperature_lf
    .select(
        'date',
        'daily_avg_temp',
        pl.col('daily_avg_temp').rolling_mean(60).alias('60day_rolling_avg')
    )
    .collect()
    .plot.line(
        x='date',
        y=['daily_avg_temp', '60day_rolling_avg'],
        color=['skyblue', 'gray'],
        width=800,
        height=400
    )
    .opts(legend_position='bottom_right')
)

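How It Works

Built-in methods: .rolling_mean(), .rolling_min(), and .rolling_max() perform rolling calculations over a fixed number of rows; further methods such as .rolling_var() and .rolling_std() are also available.
The .rolling() expression method: more flexible, because any expression can be evaluated over the rolling window.
The DataFrame method: the DataFrame-level .rolling() suits cases where the result consists only of rolling calculations, and it can be more efficient there.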

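Notes

Rolling calculations: .rolling() requires the time column to be sorted at both the DataFrame/LazyFrame level and the expression level. Use .sort() to sort the data, or .set_sorted() to mark a column that is already in order; .set_sorted() does not reorder anything.
Visualization: the built-in plotting used here is based on the hvplot library, which must be installed. From Polars 1.6 onward, .plot is backed by Altair instead, and the hvplot API is reached through import hvplot.polars and the .hvplot accessor.

Custom Rolling Calculations

The .rolling_map() method accepts a user-defined function for custom rolling calculations:

def get_range(nums):
    # Range of the window: largest value minus smallest value
    return max(nums) - min(nums)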

(
    daily_avg_temperature_lf
    .with_columns(
        pl.col('daily_avg_temp').rolling_map(get_range, window_size=3).alias('3day_rolling_range')
    )
    .head()
    .collect()
)

Visualize the result of the custom rolling calculation:

(
    daily_avg_temperature_lf
    .with_columns(
        pl.col('daily_avg_temp').rolling_map(get_range, window_size=3).alias('3day_rolling_range')
    )
    .collect()
    .plot.line(
        x='date',
        y=['daily_avg_temp', '3day_rolling_range'],
        color=['skyblue', 'gray'],
        width=800,
        height=400
    )
    .opts(legend_position='bottom_right')
)

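Notes

.rolling_map() is slow because it calls a Python function for every window; use it only when the logic cannot be expressed with the built-in methods.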

Flowchart

graph LR
    A[Read the data set] --> B[Preprocess the data]
    B --> C[Work with dates and times]
    C --> D[Apply rolling window calculations]
    D --> E[Visualize the data]

Summary Table

Operation type	Method	Example code
Data reshaping	`.partition_by()`	`academic_df.partition_by('academic_year')`
Data reshaping	`.transpose()`	`academic_df.transpose(include_header=True)`
Data reshaping	`.reshape()`	`academic_df.select(pl.col('academic_year', 'students').reshape((1, 5)))`
Time series handling	`try_parse_dates`	`pl.scan_csv('../data/toronto_weather.csv', try_parse_dates=True)`
Time series handling	`.str.to_datetime()`	`lf.with_columns(pl.col('datetime').str.to_datetime())`
Rolling calculation	`.rolling_mean()`	`pl.col('temperature').rolling_mean(3).alias('3hr_rolling_avg')`
Rolling calculation	`.rolling()`	`pl.col('daily_avg_temp').mean().rolling(index_column='date', period='3d', closed='right')`
Custom rolling calculation	`.rolling_map()`	`pl.col('daily_avg_temp').rolling_map(get_range, window_size=3).alias('3day_rolling_range')`

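Resampling

Resampling is an important technique in time series analysis: it converts a time series from one frequency to another. The steps below show how to resample with Polars.

Steps

Resample by hour: group the data by hour and compute the average temperature for each hour.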

hourly_resampled_lf = (
    lf
    .select(
        pl.col('datetime').dt.truncate('1h').alias('hour'),
        'temperature'
    )
    .group_by('hour', maintain_order=True)
    .agg(
        pl.col('temperature').mean().alias('hourly_avg_temp')
    )
)
hourly_resampled_lf.head().collect()

Resample by day: group the data by day and compute the average temperature for each day.

daily_resampled_lf = (
    lf
    .select(
        pl.col('datetime').dt.truncate('1d').alias('day'),
        'temperature'
    )
    .group_by('day', maintain_order=True)
    .agg(
        pl.col('temperature').mean().alias('daily_avg_temp')
    )
)
daily_resampled_lf.head().collect()

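How It Works

dt.truncate(): truncates a datetime column to the start of the given interval, such as the hour or the day.
group_by() and agg(): used together, they group the rows by the truncated time column and compute an aggregate for each group.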

Time Series Forecasting

Time series forecasting predicts future values from historical data. This section shows how to forecast with the functime library.

Steps

Install functime:

pip install functime

Prepare the data:

# Assume we use the data resampled by day
forecast_data = daily_resampled_lf.collect()

Run the forecast:

from functime.forecasting import AutoARIMA

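# Create the forecasting model
# (check that the installed functime version provides AutoARIMA with this interface)
model = AutoARIMA()

# Fit the model and forecast
forecast = model.fit_predict(forecast_data, h=30)  # forecast the next 30 days
print(forecast)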

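How It Works

AutoARIMA: an approach that selects the ARIMA parameters automatically, choosing the orders that best fit the characteristics of the data.
fit_predict(): fits the model and produces the forecast; the h parameter sets the number of future time steps to predict.

Summary

Time series analysis is valuable for both data analysis and forecasting. With Polars, date and time data can be handled conveniently, including rolling window calculations and resampling. Combined with the functime library, it also supports time series forecasting to inform future decisions.

Summary Table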

Operation type	Method	Example code
Resampling	`dt.truncate()`	`pl.col('datetime').dt.truncate('1h').alias('hour')`
Resampling	`group_by()` and `agg()`	`lf.group_by('hour', maintain_order=True).agg(pl.col('temperature').mean().alias('hourly_avg_temp'))`
Time series forecasting	`AutoARIMA`	`model = AutoARIMA(); forecast = model.fit_predict(forecast_data, h=30)`

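Flowchart

graph LR
    A[Read the data set] --> B[Preprocess the data]
    B --> C[Work with dates and times]
    C --> D[Apply rolling window calculations]
    D --> E[Resample]
    E --> F[Forecast the time series]
    F --> G[Visualize the results]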

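Notes

Make sure the time column is sorted before rolling calculations and resampling; otherwise the results may be wrong.
When forecasting with functime, choose the model and parameters to suit the characteristics of the data.
For custom rolling calculations, .rolling_map() is slow, so prefer the built-in methods where possible.

With the operations above, time series data can be understood and processed more effectively in practical applications.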