【python】无规律时间步长时序数据转为固定步长

最新推荐文章于 2024-09-16 19:01:19 发布

撼沧

最新推荐文章于 2024-09-16 19:01:19 发布

阅读量1.8k

点赞数 2

分类专栏： python 时间序列预测

本文链接：https://blog.csdn.net/qq_32832803/article/details/118895137

版权

python 同时被 2 个专栏收录

9 篇文章

订阅专栏

时间序列预测

2 篇文章

订阅专栏

写在前面

日常可能会遇到时间步长无规律的数据，需要转化为固定时间步长，此时需要进行重采样或插值。
在这里插入图片描述
例子
在选择好时间间隔后，可以用pandas的resample来操作。

import pandas as pd
from pandas.tseries.offsets import Hour,Second
import numpy as np
import datetime
data = pd.read_csv('fuel.csv')
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d %H:%M:%S')
# frequency = 1
# time_range = pd.date_range(data['date'][0], data['date']
#                            [data.shape[0]-1]+frequency*Hour(), freq='%sH' % frequency)
data.set_index('date', inplace=True)  # 把时间列作为索引
#ticks = data.iloc[:]
bars = data.resample('h').mean().dropna().reset_index()  # 切分
print(bars)

参考：CSDN专家-HGJ 在问答“怎么统一多个时间序列时间间隔不同的问题”中的回答[2021-07-10 15:18] https://ask.csdn.net/questions/7471005

主要函数介绍
Pandas中的resample，重新采样，是对原样本重新处理的一个方法，是一个对常规时间序列数据重新采样和频率转换的便捷的方法。重新取样时间序列数据。对象必须具有类似datetime的索引(DatetimeIndex、PeriodIndex或TimedeltaIndex)，或将类似datetime的值传递给on或level关键字。

DataFrame.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)

参数详解：

参数	说明
rule	表示目标转换的偏移量字符串或对象
freq	表示重采样频率，例如‘M’、‘5min’，Second(15)
how=‘mean’	用于产生聚合值的函数名或数组函数，例如‘mean’、‘ohlc’、np.max等，默认是‘mean’，其他常用的值由：‘first’、‘last’、‘median’、‘max’、‘min’
axis=0	哪个轴用于向上采样或向下采样。对于序列，这将默认为0，即沿着行。必须是DatetimeIndex, TimedeltaIndex或PeriodIndex。默认是纵轴，横轴设置axis=1
fill_method = None	升采样时如何插值，比如‘ffill’、‘bfill’等
closed = ‘right’	在降采样时，各时间段的哪一段是闭合的，‘right’或‘left’，默认‘right’
label= ‘right’	在降采样时，如何设置聚合值的标签，例如，9：30-9：35会被标记成9：30还是9：35,默认9：35
convention = None	当重采样时期时，将低频率转换到高频率所采用的约定（start或end）。默认‘end’
kind = None	聚合到时期（‘period’）或时间戳（‘timestamp’），默认聚合到时间序列的索引类型
loffset = None	调整重新取样的时间标签
base	对于平均细分1天的频率，为累计间隔的“起源”。例如，对于“5min”频率，基数可以从0到4。默认值为0。
on	对于数据流，要使用列而不是索引进行重采样。列必须与日期时间类似。
level	对于多索引，用于重采样的级别(名称或数字)。级别必须与日期时间相似
origin	要调整分组的时间戳。起始时区必须与索引的时区匹配。如果没有使用时间戳，也支持以下值:epoch:原点是1970-01-01’；start ': origin是timeseries的第一个值；“start_day”:起源是timeseries午夜的第一天；
offset	加到原点的偏移时间增量

参考：[Pandas中resample方法详解——风雪云侠] https://blog.csdn.net/weixin_40426830/article/details/111512471

由于时间不均，在重采样的过程中可能会遇到重采样后某一时间步长的值为空，我们上面例子中dropna()为删掉NA值。Resample()提供了两种方法：‘ffill’和‘bfill’。

上采样和填充值
上采样是下采样的相反操作。它将时间序列数据重新采样到一个更小的时间框架。例如，从小时到分钟，从年到天。结果将增加行数，并且附加的行值默认为NaN。内置的方法ffill()和bfill()通常用于执行前向填充或后向填充来替代NaN。

df.resample('Q').ffill()

在这里插入图片描述
按季度重新采样一年并向后填充值。向后填充方法bfill()将使用下一个已知值来替换NaN。

df.resample('Q').bfill()

在这里插入图片描述
参考：[使用Pandas的resample函数处理时间序列数据的技巧——deephub] https://deephub.blog.csdn.net/article/details/109542622

另外一种可行方案是利用插值法填补NAN，例如：

bars = data.resample('h').mean().interpolate('linear').reset_index()  # 线性插值

函数介绍

Resampler.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', downcast=None, **kwargs)

Interpolate values according to different methods.

method	{‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’} ‘linear’: ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes. default ‘time’: interpolation works on daily and higher resolution data to interpolate given length of interval ‘index’, ‘values’: use the actual numerical values of the index ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’ is passed to scipy.interpolate.interp1d. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method=’polynomial’, order=4). These use the actual numerical values of the index. ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ are all wrappers around the scipy interpolation methods of similar names. These use the actual numerical values of the index. For more information on their behavior, see the scipy documentation and tutorial documentation ‘from_derivatives’ refers to BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18 New in version 0.18.1: Added support for the ‘akima’ method Added interpolate method ‘from_derivatives’ which replaces ‘piecewise_polynomial’ in scipy 0.18; backwards-compatible with scipy < 0.18
axis	{0, 1}, default 0 0: fill column-by-column 1: fill row-by-row
limit	int, default None. Maximum number of consecutive NaNs to fill. Must be greater than 0.
limit_direction	{‘forward’, ‘backward’, ‘both’}, default ‘forward’ If limit is specified, consecutive NaNs will be filled in this direction. New in version 0.17.0.
inplace	bool, default False Update the NDFrame in place if possible.
downcast	optional, ‘infer’ or None, defaults to None Downcast dtypes if possible.
kwargs	keyword arguments to pass on to the interpolating function.