Pandas time series: sampling frequency, rolling windows, and resampling

黄昏中起飞的猫头鹰

Last modified 2023-02-15 18:32:12

First published 2023-02-15 18:27:13

Article link: https://blog.csdn.net/qq_20163065/article/details/129048074

Pandas time series: rolling windows by sampling frequency

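1. The rolling() window function

Total sales over the last 7 days

Example: a Series with a sorted but non-contiguous datetime index records each day's sales; compute the total sales within the last 7 days.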

Build the sales data

import numpy as np
import pandas as pd
np.random.seed(999)  # set the random seed
s = pd.Series(np.random.randint(0,10,365),index=pd.date_range('20220101','20221231'))  # simulate 365 days of sales data
s = s.sample(250,random_state=999).sort_index()  # keep 250 samples, sorted by the datetime index
s.head(8)

Output:

2022-01-01    0
2022-01-02    5
2022-01-04    8
2022-01-05    1
2022-01-07    3
2022-01-08    0
2022-01-09    5
2022-01-11    8
dtype: int32

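Passing a time offset such as "7D" to rolling() makes the window span calendar time instead of a fixed number of rows, so missing dates are handled correctly.

s.rolling("7D").sum()  # total sales over the last 7 days

Output: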

2022-01-01     0.0
2022-01-02     5.0
2022-01-04    13.0
2022-01-05    14.0
2022-01-07    17.0
              ... 
2022-12-18    20.0
2022-12-22    23.0
2022-12-25    12.0
2022-12-28    20.0
2022-12-30    14.0
Length: 250, dtype: float64
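
To make the difference between a row-count window and a time-offset window concrete, here is a minimal sketch on a small hand-made series. The data and the names sales, by_rows, and by_time are assumptions for illustration only and are not part of the original example.

```python
import pandas as pd

# Hypothetical daily sales with 2022-01-03 and 2022-01-06 missing
idx = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04",
                      "2022-01-05", "2022-01-07", "2022-01-08"])
sales = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Window of the last 3 rows, regardless of how far apart the dates are
by_rows = sales.rolling(3, min_periods=1).sum()

# Window of the last 3 calendar days, i.e. the interval (t - 3 days, t]
by_time = sales.rolling("3D").sum()

print(pd.DataFrame({"by_rows": by_rows, "by_time": by_time}))
```

On 2022-01-04 the row-based window still includes 2022-01-01, while the time-based window has already dropped it, so the two sums diverge from that row onward.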

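rolling() only accepts fixed-frequency offsets for time-based windows, so it cannot directly compute, for example, total sales over 7 business days. Combining several functions gets the same result.

s1 = s[~s.index.to_series().dt.dayofweek.isin([5,6])]  # keep business days only (5 = Saturday, 6 = Sunday)
s2 = s1.rolling(7,min_periods=1).sum()  # sum over the last 7 business-day records; min_periods=1 yields a value at every row
result = s2.reindex(s.index).ffill()  # restore the original index; weekend rows take the previous business day's value
result

Output (the first two rows are NaN because 2022-01-01 and 2022-01-02 are weekend days with no earlier business day to fill from):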

2022-01-01     NaN
2022-01-02     NaN
2022-01-04     8.0
2022-01-05     9.0
2022-01-07    12.0
              ... 
2022-12-18    39.0
2022-12-22    40.0
2022-12-25    40.0
2022-12-28    40.0
2022-12-30    40.0
Length: 250, dtype: float64
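
The filter, roll, reindex, and forward-fill steps can be traced by hand on a one-week series. This is only a sketch: the values, the 3-row window, and the names sales, weekdays, rolled, and result are illustrative assumptions.

```python
import pandas as pd

# Hypothetical week: 2022-01-05 is a Wednesday, 01-08 and 01-09 are the weekend
idx = pd.date_range("2022-01-05", "2022-01-11")
sales = pd.Series([1, 2, 3, 4, 5, 6, 7], index=idx)

# Step 1: keep Monday (0) through Friday (4) only
weekdays = sales[sales.index.dayofweek < 5]

# Step 2: sum over the last 3 business-day rows
rolled = weekdays.rolling(3, min_periods=1).sum()

# Step 3: restore the full index; weekend rows inherit Friday's value
result = rolled.reindex(sales.index).ffill()

print(result)
```

Saturday and Sunday both show Friday's total of 6, and Monday's window covers Thursday, Friday, and Monday (2 + 3 + 6 = 11), skipping the weekend sales entirely.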

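2. shift() and diff()

shift(), diff(), and pct_change() are a group of window-like functions. They share the parameter periods=n (default 1) and return, respectively, the value of the n-th previous element, the difference from the n-th previous element, and the growth rate relative to the n-th previous element. A negative n performs the same operation in the opposite direction.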
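
As a quick illustration of how the three functions relate, the following sketch applies each of them to a tiny series. The series s_demo and its values are made up for this example.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 40, 80])

shifted = s_demo.shift(1)       # value of the previous element: NaN, 10, 20, 40
delta = s_demo.diff(1)          # difference from the previous element: NaN, 10, 20, 40
growth = s_demo.pct_change(1)   # growth rate vs. the previous element: NaN, 1.0, 1.0, 1.0

# A negative period looks at the following element instead
ahead = s_demo.shift(-1)        # 20, 40, 80, NaN

print(pd.DataFrame({"shift": shifted, "diff": delta,
                    "pct_change": growth, "shift(-1)": ahead}))
```

Here diff(1) equals s_demo - s_demo.shift(1), and pct_change(1) equals that difference divided by s_demo.shift(1).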

shift()

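Besides taking the value of the n-th previous element, shift() can also shift the time index itself when given the freq parameter. In the example below the index moves forward by 50 days while the values stay unchanged, so each date shows the sales recorded 50 days earlier.

s.shift(freq="50D").head()

Output: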

2022-02-20    0
2022-02-21    5
2022-02-23    8
2022-02-24    1
2022-02-26    3
dtype: int32
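
The two modes of shift() are easy to confuse, so here is a small side-by-side sketch. The three-row series s_demo is an assumption chosen to keep the result readable.

```python
import pandas as pd

idx = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04"])
s_demo = pd.Series([0, 5, 8], index=idx)

# periods only: the index stays put and the values move down one row
by_rows = s_demo.shift(1)

# freq given: the values stay put and every index label moves 2 days later
by_time = s_demo.shift(freq="2D")

print(by_rows)
print(by_time)
```

With periods alone the first row becomes NaN; with freq no value is lost, because only the labels change.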

diff()

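diff() is also a special window-like function, but it does not support the freq parameter. One use of diff() is to inspect the time intervals between recorded timestamps.
Example: check whether a sensor's sampling frequency stays at one record every five minutes.

s = pd.read_csv(path,parse_dates=['Record']).iloc[:,0]  # read the sensor records from the file; path is the location of the CSV file
s.head()

Output: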

0   2021-09-01 08:00:00
1   2021-09-01 08:05:00
2   2021-09-01 08:10:00
3   2021-09-01 08:15:00
4   2021-09-01 08:19:00
Name: Record, dtype: datetime64[ns]

s1 = s.diff(1).dt.total_seconds().sort_values(ascending=False)  # interval to the previous record in seconds, largest first
s1.head()

Output:

134     5520.0
447     2940.0
504     2460.0
402      360.0
1491     360.0
Name: Record, dtype: float64

# locate the start time of each gap (the record just before it)
error = s1.iloc[:3]
s[error.index-1]

Output:

133   2021-09-01 19:03:00
446   2021-09-02 22:35:00
503   2021-09-03 04:04:00
Name: Record, dtype: datetime64[ns]

# locate the end time of each gap
s[error.index]

Output:

134   2021-09-01 20:35:00
447   2021-09-02 23:24:00
504   2021-09-03 04:45:00
Name: Record, dtype: datetime64[ns]
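
Since the sensor file itself is not available here, the same gap check can be sketched on synthetic timestamps. The data below is made up, and filtering with a 5-minute threshold (rather than sorting and taking the top three, as above) is an alternative chosen for this sketch, not the approach used in the original.

```python
import pandas as pd

# Hypothetical records: 5-minute sampling with one 30-minute gap
ts = pd.Series(pd.to_datetime([
    "2021-09-01 08:00", "2021-09-01 08:05", "2021-09-01 08:10",
    "2021-09-01 08:40", "2021-09-01 08:45",
]))

gaps = ts.diff()                              # interval to the previous record
late = gaps[gaps > pd.Timedelta(minutes=5)]   # intervals longer than expected

# One row per gap: last record before it, first record after it, and its length
report = pd.DataFrame({
    "gap_start": ts.shift(1).loc[late.index],
    "gap_end": ts.loc[late.index],
    "seconds": late.dt.total_seconds(),
})
print(report)
```

Using ts.shift(1) for the gap start avoids the index arithmetic of error.index-1, which relies on the index being consecutive integers.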

Resampling

resample()

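Resampling is a special kind of grouping for time series: resample() groups rows into time bins, for example 00:00-08:00, 08:00-16:00, and 16:00-24:00 of each day.

Build a time-indexed Series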

np.random.seed(0)
idx = pd.date_range('20220901','20220902 23:59:59',freq="90min")
s = pd.Series(np.random.rand(idx.shape[0]),index=idx)
s.head()

Output:

2022-09-01 00:00:00    0.548814
2022-09-01 01:30:00    0.715189
2022-09-01 03:00:00    0.602763
2022-09-01 04:30:00    0.544883
2022-09-01 06:00:00    0.423655
Freq: 90T, dtype: float64

# group the data into 8-hour bins and sum each bin
s.resample('8H',origin='start',closed='left',label='left').sum()

Output:

2022-09-01 00:00:00    3.481198
2022-09-01 08:00:00    3.468190
2022-09-01 16:00:00    2.180701
2022-09-02 00:00:00    4.278784
2022-09-02 08:00:00    2.143557
2022-09-02 16:00:00    2.919968
Freq: 8H, dtype: float64

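Key resample parameters

origin: the point from which bins are aligned. The default 'start_day' aligns bins to midnight of the first day; 'start' aligns them to the first timestamp in the series; 'end' or 'end_day' bins backward from the end of the series.
closed: which side of each bin is closed. The default is 'left' (left-closed, right-open) for most frequencies; with backward binning ('end' or 'end_day') the default becomes 'right' (left-open, right-closed).
label: which bin edge labels the aggregated result. The default is 'left' for most frequencies, so each result is indexed by the left edge of its bin.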
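
The effect of origin is easiest to see on a series that does not start at midnight. The sketch below uses a made-up series starting at 03:00 and 4-hour bins (written as "240min"); the names s_demo, by_day, by_start, and by_end are illustrative assumptions.

```python
import pandas as pd

# Hypothetical series: one value every 2 hours from 03:00 to 13:00
idx = pd.date_range("2022-09-01 03:00", periods=6, freq="120min")
s_demo = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Default origin='start_day': bins are aligned to midnight (00:00, 04:00, ...)
by_day = s_demo.resample("240min").sum()

# origin='start': bins are aligned to the first timestamp (03:00, 07:00, 11:00)
by_start = s_demo.resample("240min", origin="start").sum()

# origin='end': bins are built backward from the last timestamp (13:00);
# closed and label then default to 'right'
by_end = s_demo.resample("240min", origin="end").sum()

print(by_day)
print(by_start)
print(by_end)
```

With the default origin the first and last bins are only partly filled, whereas 'start' and 'end' both yield three full bins, labeled by their left and right edges respectively.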