pandas.DataFrame.resample: resampling data

寒霜211

Article link: https://blog.csdn.net/weixin_42744500/article/details/89348737

DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)

Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or datetime-like values must be passed to the on or level keyword.

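Parameters:

rule : str
The offset string or object representing the target conversion.

how : str
Method for down/re-sampling; defaults to 'mean' for downsampling.

Deprecated since version 0.18.0: the new syntax is .resample(...).mean(), or .resample(...).apply(<func>)

axis : {0 or 'index', 1 or 'columns'}, default 0
Which axis to use for up- or down-sampling. For a Series this defaults to 0, i.e. along the rows. The axis must be a DatetimeIndex, TimedeltaIndex, or PeriodIndex.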

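fill_method : str, default None
Filling method for upsampling.

Deprecated since version 0.18.0: the new syntax is .resample(...).<func>(), e.g. .resample(...).pad()

closed : {'right', 'left'}, default None
Which side of the bin interval is closed. The default is 'left' for all frequency offsets except 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W', which all default to 'right'.

label : {'right', 'left'}, default None
Which bin edge to label the bucket with. The default is 'left' for all frequency offsets except 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W', which all default to 'right'.

convention : {'start', 'end', 's', 'e'}, default 'start'
For PeriodIndex only; controls whether to use the start or the end of rule.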

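kind : {'timestamp', 'period'}, optional, default None
Pass 'timestamp' to convert the resulting index to a DatetimeIndex, or 'period' to convert it to a PeriodIndex. By default the input representation is retained.

loffset : timedelta, default None
Adjust the resampled time labels.

limit : int, default None
Maximum size of the gap to fill when reindexing with fill_method.

Deprecated since version 0.18.0.

base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the aggregated intervals. For example, for a '5min' frequency, base can range from 0 through 4. Defaults to 0.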

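on : str, optional
For a DataFrame, the column to use instead of the index for resampling. The column must be datetime-like.

New in version 0.19.0.

level : str or int, optional
For a MultiIndex, the level (name or number) to use for resampling. The level must be datetime-like.

New in version 0.19.0.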

Returns:
Resampler object
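
The Resampler object does no work by itself: the aggregation or fill method called on it takes the place of the deprecated how and fill_method arguments. A minimal sketch of that pattern, assuming a pandas version that accepts 'min' as the minute alias (the variable names here are only illustrative):

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# resample() only defines the 3-minute bins; nothing is aggregated yet.
resampler = series.resample("3min")

# Instead of how="mean", call the aggregation on the Resampler.
means = resampler.mean()

# Several aggregations can be requested at once with agg().
summary = resampler.agg(["sum", "mean"])
print(summary)
```

The same Resampler can be reused for several aggregations, as done here with mean() and agg().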
See also
groupby
Group by mapping, function, label, or list of labels.
Series.resample
Resample a Series.
DataFrame.resample
Resample a DataFrame.
Notes

http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html#pandas.DataFrame.resample

To learn more about the offset strings, see the "offset aliases" section of the pandas time series user guide.
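
A note on the base parameter described above: as far as I know, pandas 1.1 added the origin and offset keywords for the same purpose, and pandas 2.0 removed base and loffset. The sketch below assumes a pandas version that has offset; it shifts the 5-minute bin edges by 2 minutes, which is what base=2 expresses in the signature documented here.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Default bin edges fall on 00:00, 00:05, 00:10, ...
default_bins = series.resample("5min").sum()

# With offset="2min" the edges fall on 23:57, 00:02, 00:07, ...
shifted_bins = series.resample("5min", offset="2min").sum()

print(default_bins)
print(shifted_bins)
```

The first shifted bin starts on the previous day (1999-12-31 23:57) because the first timestamp, 00:00, lies before the 00:02 edge.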

Examples

Start by creating a series with 9 one-minute timestamps.

>>> import numpy as np
>>> import pandas as pd
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3-minute bins and sum the values of the timestamps falling into each bin.

>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
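
To see what the binning does, the same sums can be reproduced by hand: round each timestamp down to its 3-minute boundary and group on the result. This is only an illustrative sketch (it uses the 'min' alias, which I assume your pandas version accepts), not how resample is implemented internally.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# The built-in way: left-closed, left-labelled 3-minute bins.
resampled = series.resample("3min").sum()

# The manual way: floor every timestamp to its bin start, then group.
bin_start = series.index.floor("3min")
by_hand = series.groupby(bin_start).sum()

print(by_hand)
```

Both give 3, 12 and 21 for the bins starting at 00:00, 00:03 and 00:06.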

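Downsample the series into 3-minute bins as above, but label each bin using the right edge instead of the left. Note that the value in the bucket used as the label is not included in the bucket it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket labelled 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example after this one.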

>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3-minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
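
closed and label are independent: closed decides which bin a boundary timestamp belongs to, label only decides which edge names the bin. The sketch below (illustrative, using the 'min' alias) checks the right-closed result by rounding timestamps up, and then keeps closed='right' while labelling with the left edge.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Right-closed bins named by their right edge.
right_right = series.resample("3min", closed="right", label="right").sum()

# Manual check: rounding each timestamp up gives the same grouping.
by_ceiling = series.groupby(series.index.ceil("3min")).sum()

# Same bin membership, but each bin is now named by its left edge.
right_left = series.resample("3min", closed="right", label="left").sum()

print(right_left)
```

The sums are identical in all three results; only the index of right_left differs, starting at 1999-12-31 23:57:00.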

Upsample the series into 30-second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30-second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
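
As far as I know, newer pandas releases (2.0 and later) no longer provide pad() on the Resampler; ffill() performs the same forward fill. Its limit argument covers what the deprecated limit parameter of resample used to do. A small sketch with made-up data:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=3, freq="min")
series = pd.Series([0, 1, 2], index=index)

# Forward-fill every new 20-second row from the last observation.
filled = series.resample("20s").ffill()

# limit=1 fills at most one new row after each observation;
# the remaining gaps stay NaN.
limited = series.resample("20s").ffill(limit=1)

print(limited)
```

Because some rows stay NaN, the limited result is converted to float64.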

Upsample the series into 30-second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply.

>>> def custom_resampler(array_like):
...     return np.sum(array_like) + 5
...
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
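
The function passed to apply receives the values of one bin at a time and should return a single value. As another illustration, here is a sketch that computes the spread (max minus min) inside each 3-minute bin; the data and the helper name value_range are made up for this example.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series([4, 1, 7, 3, 3, 8, 2, 9, 5], index=index)


def value_range(values):
    """Spread between the largest and smallest value in one bin."""
    return values.max() - values.min()


spread = series.resample("3min").apply(value_range)
print(spread)
```

The bins hold [4, 1, 7], [3, 3, 8] and [2, 9, 5], so the spreads are 6, 5 and 7.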

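For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or the end of rule.

Resample a year by quarter using the 'start' convention. Values are assigned to the first quarter of the period.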

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
...                                             freq='A',
...                                             periods=2))
>>> s
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using the 'end' convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
...                                                   freq='Q',
...                                                   periods=4))
>>> q
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column to resample on instead of the index.

>>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df = pd.DataFrame(d)
>>> df['week_starting'] = pd.date_range('01/01/2018',
...                                     periods=8,
...                                     freq='W')
>>> df
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0
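
Passing on is equivalent to first moving that column into the index. The sketch below checks this on the same data. Note two assumptions of mine: it uses 'MS' (month start) instead of 'M', because the spelling of the month-end alias differs between pandas versions, so the labels are the first day of each month; and it builds the weekly dates with freq='7D' starting from 2018-01-07.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10, 11, 9, 13, 14, 18, 17, 19],
    "volume": [50, 60, 40, 100, 50, 100, 40, 50],
    "week_starting": pd.date_range("2018-01-07", periods=8, freq="7D"),
})

# Resample on a column ...
via_on = df.resample("MS", on="week_starting").mean()

# ... or set that column as the index first; the result is the same.
via_index = df.set_index("week_starting").resample("MS").mean()

print(via_on)
```

The monthly means match the output above (10.75 and 62.5 for January, 17.0 and 60.0 for February).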

For a DataFrame with a MultiIndex, the keyword level can be used to specify the level on which the resampling takes place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df2 = pd.DataFrame(d2,
...                    index=pd.MultiIndex.from_product([days,
...                                                     ['morning',
...                                                      'afternoon']]
...                                                     ))
>>> df2
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90
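
level also accepts a level name, and the same grouping can be written with groupby and pd.Grouper. A sketch on the same data, where the level names 'day' and 'session' and the 2-day frequency are my own additions for illustration:

```python
import pandas as pd

days = pd.date_range("2000-01-01", periods=4, freq="D")
index = pd.MultiIndex.from_product(
    [days, ["morning", "afternoon"]], names=["day", "session"]
)
df2 = pd.DataFrame(
    {
        "price": [10, 11, 9, 13, 14, 18, 17, 19],
        "volume": [50, 60, 40, 100, 50, 100, 40, 50],
    },
    index=index,
)

# Resample on the datetime level, referring to it by name.
by_level = df2.resample("2D", level="day").sum()

# The same bins expressed through groupby with a Grouper.
by_grouper = df2.groupby(pd.Grouper(level="day", freq="2D")).sum()

print(by_level)
```

Each 2-day bin sums four rows: price 43 and 68, volume 250 and 240.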
