Python time series: resampling and frequency conversion
Resampling and Frequency Conversion
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
# 'M' means month end here; pandas 2.2 and later spell this alias 'ME'
ts.resample('M').mean()
# kind='period' labels each bin with a monthly Period (deprecated in pandas 2.2)
ts.resample('M', kind='period').mean()
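The monthly aggregation above depends on frequency aliases and a `kind` argument whose spelling varies between pandas releases. As a rough cross-check, the sketch below computes the same kind of period-labelled monthly mean by grouping on `ts.index.to_period('M')`. The deterministic values (`np.arange`) are an assumption made here so the result is easy to verify by hand; they are not the random data used in the original.

```python
import numpy as np
import pandas as pd

# Same 100-day index as above, but with deterministic values (an assumption
# for illustration) so the monthly means can be checked by hand.
rng = pd.date_range("2000-01-01", periods=100, freq="D")
ts = pd.Series(np.arange(100, dtype=float), index=rng)

# Group each daily observation by the calendar month it falls in.
monthly = ts.groupby(ts.index.to_period("M")).mean()

print([str(p) for p in monthly.index])  # ['2000-01', '2000-02', '2000-03', '2000-04']
print(monthly.tolist())                 # [15.0, 45.0, 75.0, 95.0]
```

The four groups hold 31, 29, 31 and 9 days respectively, since 2000 is a leap year and the series stops on April 9.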
Downsampling
import pandas as pd
import numpy as np
rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12), index=rng)
ts
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32
Suppose you want to aggregate this data into five-minute chunks by summing each group:
ts.resample('5min', closed='right').sum()
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32
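The frequency you pass defines the bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value falls in the 00:00 to 00:05 interval; passing closed='right' closes the interval on the right instead: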
ts.resample('5min', closed='right').sum()
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32
ts.resample('5min', closed='right', label='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32
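To make the effect of `closed` concrete, here is a small sketch that sums the same twelve one-minute values with both settings. It rebuilds `ts` with the `'min'` alias instead of `'T'`, which is only a spelling choice made here.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# closed='left': each bin is [start, start + 5min)
left = ts.resample("5min", closed="left").sum()
# closed='right': each bin is (start, start + 5min]
right = ts.resample("5min", closed="right").sum()

print(left.tolist())   # [10, 35, 21]
print(right.tolist())  # [0, 15, 40, 11]
print(right.index[0])  # 1999-12-31 23:55:00
```

With `closed='right'` the 00:00 observation belongs to the bin that ends at 00:00, which is why an extra bin starting at 23:55 on the previous day appears.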
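The two results above differ only in how the bins are labelled. A minimal sketch of that point, using the same twelve-value series (rebuilt here so the block is self-contained):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

left_lab = ts.resample("5min", closed="right", label="left").sum()
right_lab = ts.resample("5min", closed="right", label="right").sum()

# The aggregated values are identical; only the index moves by one bin width.
same_values = left_lab.tolist() == right_lab.tolist()
shift = right_lab.index[0] - left_lab.index[0]

print(same_values)  # True
print(shift)        # 0 days 00:05:00
```

In other words, `closed` decides which observations land in a bin, while `label` only decides which edge of the bin is used as its timestamp.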
Finally, you might want to shift the result index by some amount, for example subtracting one second from the right edge:
# The loffset argument was removed in pandas 2.0; shift the index after aggregating
result = ts.resample('5min', closed='right', label='right').sum()
result.index = result.index - pd.Timedelta('1s')
result
1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32
Open-High-Low-Close (OHLC) resampling
ts.resample('5min').ohlc()
                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11
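As a sanity check on what `ohlc()` computes, the sketch below rebuilds the same four columns from the generic aggregations `first`, `max`, `min` and `last`. The column renaming is done by hand here and is only for comparison.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

ohlc = ts.resample("5min").ohlc()

# open = first value in the bin, high = max, low = min, close = last value
manual = ts.resample("5min").agg(["first", "max", "min", "last"])
manual.columns = ["open", "high", "low", "close"]

print((ohlc.values == manual.values).all())  # True
```

Computing all four statistics in one pass is the convenience `ohlc()` offers; the manual version is mainly useful when you need a different mix of aggregates.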
Upsampling and Interpolation
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame
            Colorado     Texas  New York      Ohio
2000-01-05 -0.149378 -0.509131 -1.183238  0.278487
2000-01-12 -0.600444  3.604595  1.125173 -1.316800
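When you aggregate this data, each group contains only one value, so the gaps show up as missing values. The asfreq method converts to the higher frequency without any aggregation: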
df_daily = frame.resample('D').asfreq()
df_daily
            Colorado     Texas  New York      Ohio
2000-01-05 -0.149378 -0.509131 -1.183238  0.278487
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.600444  3.604595  1.125173 -1.316800
Suppose you want to fill the non-Wednesdays with the preceding weekly value. Resampling offers fill methods for this, such as ffill:
frame.resample('D').ffill()
            Colorado     Texas  New York      Ohio
2000-01-05 -0.149378 -0.509131 -1.183238  0.278487
2000-01-06 -0.149378 -0.509131 -1.183238  0.278487
2000-01-07 -0.149378 -0.509131 -1.183238  0.278487
2000-01-08 -0.149378 -0.509131 -1.183238  0.278487
2000-01-09 -0.149378 -0.509131 -1.183238  0.278487
2000-01-10 -0.149378 -0.509131 -1.183238  0.278487
2000-01-11 -0.149378 -0.509131 -1.183238  0.278487
2000-01-12 -0.600444  3.604595  1.125173 -1.316800
You can also fill only a certain number of periods forward, to limit how far an observed value continues to be used:
frame.resample('D').ffill(limit=2)
            Colorado     Texas  New York      Ohio
2000-01-05 -0.149378 -0.509131 -1.183238  0.278487
2000-01-06 -0.149378 -0.509131 -1.183238  0.278487
2000-01-07 -0.149378 -0.509131 -1.183238  0.278487
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.600444  3.604595  1.125173 -1.316800
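The three upsampling variants shown so far can be compared on a tiny deterministic frame. The sketch below uses a single column with made-up values 1.0 and 2.0 (an assumption for illustration, not the random data above) and also calls `interpolate()`, which the original does not show but which matches the section heading.

```python
import pandas as pd

# Two weekly (Wednesday) observations with easy-to-check values.
idx = pd.date_range("2000-01-05", periods=2, freq="W-WED")
weekly = pd.DataFrame({"Colorado": [1.0, 2.0]}, index=idx)

daily = weekly.resample("D").asfreq()          # no filling: gaps become NaN
capped = weekly.resample("D").ffill(limit=2)   # carry each value at most 2 days
linear = weekly.resample("D").interpolate()    # straight line between observations

print(len(daily))                            # 8
print(int(daily["Colorado"].isna().sum()))   # 6
print(int(capped["Colorado"].isna().sum()))  # 4
print(int(linear["Colorado"].isna().sum()))  # 0
```

Whether to forward-fill or interpolate depends on what the values mean: carrying a value forward suits states that persist, while linear interpolation assumes the quantity changes smoothly between observations.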
Note that the new date index need not overlap with the old one at all:
frame.resample('W-THU').ffill()
            Colorado     Texas  New York      Ohio
2000-01-06 -0.149378 -0.509131 -1.183238  0.278487
2000-01-13 -0.600444  3.604595  1.125173 -1.316800
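One way to read the Thursday result is as a forward-filled lookup: each new label takes the most recent observation at or before it. The sketch below checks that reading against `reindex(..., method='ffill')`, using made-up values 1.0 and 2.0 in place of the random data.

```python
import pandas as pd

# Weekly observations on Wednesdays (illustrative values).
idx = pd.date_range("2000-01-05", periods=2, freq="W-WED")
weekly = pd.DataFrame({"Colorado": [1.0, 2.0]}, index=idx)

# Upsample onto Thursdays, none of which appear in the original index.
thursdays = weekly.resample("W-THU").ffill()

# The same values via an explicit forward-filled reindex.
target = pd.date_range("2000-01-06", periods=2, freq="W-THU")
reindexed = weekly.reindex(target, method="ffill")

print(thursdays["Colorado"].tolist())  # [1.0, 2.0]
print(reindexed["Colorado"].tolist())  # [1.0, 2.0]
```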
Resampling with Periods
frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001',
                                           freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]
annual_frame = frame.resample('A-DEC').mean()
annual_frame
      Colorado     Texas  New York      Ohio
2000 -0.125607  0.239264  0.159813 -0.175475
2001 -0.053293 -0.190562 -0.136178 -0.115108
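Downsampling monthly periods to calendar years amounts to averaging the twelve months of each year. As a rough cross-check that does not rely on `resample` accepting a `PeriodIndex`, the sketch below groups by `frame.index.year`. The single column of values 0 to 23 is an assumption chosen so the means are easy to verify.

```python
import numpy as np
import pandas as pd

idx = pd.period_range("2000-01", "2001-12", freq="M")
frame = pd.DataFrame({"Colorado": np.arange(24, dtype=float)}, index=idx)

# Average the twelve monthly values that fall in each calendar year.
annual = frame.groupby(frame.index.year).mean()

print(annual.index.tolist())        # [2000, 2001]
print(annual["Colorado"].tolist())  # [5.5, 17.5]
```

Note that this variant returns plain integer years as the index rather than annual `Period` labels.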
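Upsampling is a bit trickier, because you must decide which end of each interval in the new frequency receives the values before resampling. The convention argument controls this; it defaults to 'start' but can also be 'end':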
annual_frame.resample('Q-DEC').ffill()
annual_frame.resample('Q-DEC', convention='end').ffill()
        Colorado     Texas  New York      Ohio
2000Q4 -0.125607  0.239264  0.159813 -0.175475
2001Q1 -0.125607  0.239264  0.159813 -0.175475
2001Q2 -0.125607  0.239264  0.159813 -0.175475
2001Q3 -0.125607  0.239264  0.159813 -0.175475
2001Q4 -0.053293 -0.190562 -0.136178 -0.115108
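The start/end choice can be seen on a single period. The sketch below converts the year 2000 to quarters with `Period.asfreq`, whose `how` argument plays the same role as `convention` above. It assumes the `'Y'` alias for annual periods, which is used here instead of `'A-DEC'` because the latter is deprecated in newer pandas.

```python
import pandas as pd

year = pd.Period("2000", freq="Y")  # calendar year ending in December

# Place the annual value at the first or the last quarter of the year.
first_q = year.asfreq("Q-DEC", how="start")
last_q = year.asfreq("Q-DEC", how="end")

print(first_q)  # 2000Q1
print(last_q)   # 2000Q4
```

This is why the `convention='end'` result above begins at 2000Q4: the value for 2000 is anchored at that year's last quarter and then filled forward until the 2001 value arrives at 2001Q4.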
Because periods refer to time intervals, the rules for upsampling and downsampling are stricter.
annual_frame.resample('Q-MAR').ffill()
        Colorado     Texas  New York      Ohio
2000Q4 -0.125607  0.239264  0.159813 -0.175475
2001Q1 -0.125607  0.239264  0.159813 -0.175475
2001Q2 -0.125607  0.239264  0.159813 -0.175475
2001Q3 -0.125607  0.239264  0.159813 -0.175475
2001Q4 -0.053293 -0.190562 -0.136178 -0.115108
2002Q1 -0.053293 -0.190562 -0.136178 -0.115108
2002Q2 -0.053293 -0.190562 -0.136178 -0.115108
2002Q3 -0.053293 -0.190562 -0.136178 -0.115108
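The quarter labels in the Q-MAR result look shifted because that frequency describes fiscal years ending in March. A small sketch, mapping the first and last months of calendar year 2000 to Q-MAR quarters, shows where the 2000Q4 to 2001Q3 range comes from:

```python
import pandas as pd

# Fiscal year 2000 under Q-MAR runs from April 1999 through March 2000.
jan = pd.Period("2000-01", freq="M").asfreq("Q-MAR")
dec = pd.Period("2000-12", freq="M").asfreq("Q-MAR")

print(jan)  # 2000Q4
print(dec)  # 2001Q3
```

Calendar year 2000 therefore spans 2000Q4 through 2001Q3 in Q-MAR terms, and 2001 spans 2001Q4 through 2002Q3, which matches the index of the result above.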