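I. Creating Time Series
bdate_range is similar to date_range. On top of its built-in business-day frequency, it also accepts the weekmask and holidays parameters.
Its freq has the special custom options 'C', 'CBM' and 'CBMS', which are meant to be used together with weekmask and holidays.
For example, suppose we want to keep only Monday, Tuesday and Thursday among the weekdays and also drop a set of holidays: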
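# imports used by the snippets in these notes
import numpy as np
import pandas as pd
weekmask = 'Mon Tue Thu'
holidays = [pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
# holidays covers 2020-01-07 through 2020-01-12; these dates are dropped even where weekmask would keep them
print(pd.bdate_range(start='2020-1-1',end='2020-1-15',freq='C',weekmask=weekmask,holidays=holidays))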
'''
DatetimeIndex(['2020-01-02', '2020-01-06', '2020-01-13', '2020-01-14'], dtype='datetime64[ns]', freq='C')
'''
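As the output shows, weekmask selects which days of the week are kept, while holidays lists the dates to remove. A date listed in holidays is dropped even if it falls on a weekday that weekmask keeps.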
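The same calendar can also be written as an offset object. The following is a minimal sketch, assuming that a pd.offsets.CustomBusinessDay instance is passed to pd.date_range as freq; the names custom_day and kept are illustrative only.

```python
import pandas as pd

# Keep Monday, Tuesday and Thursday; treat 2020-01-07 .. 2020-01-12 as holidays.
holidays = [pd.Timestamp(2020, 1, day) for day in range(7, 13)]
custom_day = pd.offsets.CustomBusinessDay(weekmask='Mon Tue Thu', holidays=holidays)

# date_range accepts the offset object directly as its frequency.
idx = pd.date_range(start='2020-01-01', end='2020-01-15', freq=custom_day)
kept = list(idx.strftime('%Y-%m-%d'))
print(kept)
```

Jan 7 (Tuesday) and Jan 9 (Thursday) pass the weekmask but are removed because they are listed in holidays.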
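DateOffset objects
Difference between DateOffset and Timedelta
Timedelta is an absolute time difference: regardless of winter or summer time, adding or subtracting 1 day always means exactly 24 hours.
DateOffset is a relative time difference: whether the day lasts 23, 24 or 25 hours, adding or subtracting 1 day lands on the same wall-clock time on the target day.
For example, in the Europe/Helsinki time zone used below, clocks moved forward one hour at 03:00:00 local time on 2020-03-29 (to 04:00:00), which started summer time. In the UK the same change happened at 01:00:00 local time, when clocks moved to 02:00:00.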
ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
print(ts + pd.Timedelta(days=1))
print(ts + pd.DateOffset(days=1))
'''
2020-03-30 02:00:00+03:00
2020-03-30 01:00:00+03:00
'''
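Timedelta adds exactly 24 elapsed hours, so across the clock change the wall-clock time moves from 01:00 to 02:00.
DateOffset keeps the wall-clock time at 01:00 on the next day, which here corresponds to only 23 elapsed hours.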
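To make the difference concrete, the elapsed time of each result can be measured directly. This is a small sketch that reuses the timestamp from the example above; the variable names are illustrative.

```python
import pandas as pd

ts = pd.Timestamp('2020-03-29 01:00:00', tz='Europe/Helsinki')

absolute = ts + pd.Timedelta(days=1)    # exactly 24 elapsed hours
relative = ts + pd.DateOffset(days=1)   # same wall-clock time on the next day

# Subtracting tz-aware timestamps gives the real elapsed time.
elapsed_absolute = absolute - ts
elapsed_relative = relative - ts
print(elapsed_absolute, elapsed_relative)
```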
Adding or subtracting a span of time
The optional parameters of DateOffset include years/months/weeks/days/hours/minutes/seconds
print(pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2))
'''
2019-12-18 00:20:00
'''
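Calendar units such as months and years have no fixed length, which is where DateOffset differs most from a fixed duration. The sketch below, with illustrative variable names, shows how month-end and leap-day dates are handled.

```python
import pandas as pd

# Adding one month to Jan 31 lands on the last valid day of February.
end_of_january = pd.Timestamp('2020-01-31')
one_month_later = end_of_january + pd.DateOffset(months=1)

# Adding one year to a leap day lands on Feb 28 of the following year.
leap_day = pd.Timestamp('2020-02-29')
one_year_later = leap_day + pd.DateOffset(years=1)

print(one_month_later, one_year_later)
```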
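II. Resampling
1. Basic operations on the resample object
The sampling frequency is usually given as one of the offset alias strings mentioned above. In pandas 2.2 and later, the aliases 'S' and 'T' used in this section are deprecated in favor of 's' and 'min'.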
df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
columns=['A', 'B', 'C'])
print(df_r)
r = df_r.resample('3min')
print(r.sum())
'''
A B C
2020-01-01 00:00:00 21.871583 18.603125 -11.916000
2020-01-01 00:03:00 21.430003 -26.923329 -0.848451
2020-01-01 00:06:00 18.780852 3.067451 -2.614459
2020-01-01 00:09:00 -4.071431 1.847722 21.474951
2020-01-01 00:12:00 -3.105468 11.821253 -10.143737
2020-01-01 00:15:00 -1.277825 4.847213 0.936945
'''
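Because the data above is random, the binning is easier to see on a deterministic series. The following sketch assumes a series of ones sampled every second, so each bin's sum is simply the number of seconds that fall into it.

```python
import numpy as np
import pandas as pd

# 1000 one-second observations, each equal to 1.0.
index = pd.date_range('2020-01-01', periods=1000, freq='s')
ones = pd.Series(np.ones(1000), index=index)

# Each 3-minute bin holds 180 seconds; the last bin holds the remaining 100.
sums = ones.resample('3min').sum()
print(sums)
```

This also explains why the table above has six rows: 1000 seconds span five full 3-minute bins plus one partial bin.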
2. Aggregation on resampled data
r = df_r.resample('3T')
print(r['A'].mean())
'''
2020-01-01 00:00:00 -0.041984
2020-01-01 00:03:00 -0.015345
2020-01-01 00:06:00 -0.002229
2020-01-01 00:09:00 -0.104800
2020-01-01 00:12:00 0.010224
2020-01-01 00:15:00 -0.018361
Freq: 3T, Name: A, dtype: float64
'''
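Different columns can be aggregated with different functions in one call. This is a hedged sketch using agg with a dictionary on made-up deterministic data; the column contents are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Six minutes of one-second data: A counts upward, B is constant.
index = pd.date_range('2020-01-01', periods=360, freq='s')
df = pd.DataFrame({'A': np.arange(360.0), 'B': np.ones(360)}, index=index)

# Mean of A and sum of B within each 3-minute bin.
result = df.resample('3min').agg({'A': 'mean', 'B': 'sum'})
print(result)
```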
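III. Window Functions
1. Rolling
The rolling method defines a window. Like a groupby object, it performs no computation by itself and has to be combined with an aggregation function to produce a result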
s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
print(s.head())
print(s.rolling(window=50))
print(s.rolling(window=50).mean().head())
'''
Rolling [window=50,center=False,axis=0]
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 NaN
2020-01-04 NaN
2020-01-05 NaN
Freq: D, dtype: float64
'''
The min_periods parameter is the minimum number of non-missing observations a window must contain in order to produce a value
print(s.rolling(window=50,min_periods=3).mean().head())
'''
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 -1.122268
2020-01-04 -0.924381
2020-01-05 -0.968945
Freq: D, dtype: float64
'''
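The effect of min_periods is easiest to see on a tiny series. A minimal sketch with illustrative names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a value is produced only once the window holds 3 observations.
strict = s.rolling(window=3).mean()

# min_periods=1: partial windows at the start are averaged as well.
relaxed = s.rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(relaxed.tolist())
```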
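When aggregating with apply, just remember that the function receives the data of one window as a Series and must return a scalar. The example below computes the coefficient of variation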
print(s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head())
'''
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 -1.257953
2020-01-04 -1.318450
2020-01-05 -1.094140
Freq: D, dtype: float64
'''
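The same statistic can be cross-checked against the built-in rolling std and mean. In this sketch, raw=True is an optional choice that passes each window as a NumPy array instead of a Series, and ddof=1 is set explicitly so the result matches the sample standard deviation that pandas uses; the helper name is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=10.0, scale=1.0, size=100))

def coefficient_of_variation(window):
    # window is a NumPy array because raw=True is used below.
    return window.std(ddof=1) / window.mean()

via_apply = s.rolling(window=20).apply(coefficient_of_variation, raw=True)
via_builtin = s.rolling(window=20).std() / s.rolling(window=20).mean()

print(np.allclose(via_apply.dropna(), via_builtin.dropna()))
```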
Time-based rolling
print(s.rolling('15D').mean().head())
'''
2020-01-01 -1.825596
2020-01-02 -1.032927
2020-01-03 -1.076893
2020-01-04 -0.892363
2020-01-05 -0.895791
Freq: D, dtype: float64
'''
The optional closed parameter takes 'right' (the default), 'left', 'both' or 'neither' and determines which endpoints of the window are included
print(s.rolling('15D', closed='right').sum().head())
'''
2020-01-01 -0.077588
2020-01-02 -0.245890
2020-01-03 -0.377603
2020-01-04 -1.024685
2020-01-05 -0.571623
Freq: D, dtype: float64
'''
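The four settings can be compared side by side on a small daily series. This is an illustrative sketch with a 2-day window, where the window for a timestamp t spans from t minus 2 days to t.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.date_range('2020-01-01', periods=4, freq='D'))

# Rolling sum over a 2-day window for every endpoint rule.
sums = {closed: s.rolling('2D', closed=closed).sum()
        for closed in ('right', 'left', 'both', 'neither')}

for closed, values in sums.items():
    print(closed, values.tolist())
```

With 'left' and 'neither' the current observation is excluded, so the first window is empty and its result is NaN.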
2. Expanding
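A plain expanding call is equivalent to rolling(window=len(s), min_periods=1); it computes a cumulative statistic over the series. The two printed outputs below do not match each other, presumably because s was regenerated with different random values between the runs; on the same s both calls return the same values.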
print(s.rolling(window=len(s),min_periods=1).sum().head())
'''
2020-01-01 -0.350793
2020-01-02 -1.364236
2020-01-03 -2.225010
2020-01-04 -1.150104
2020-01-05 -0.270128
Freq: D, dtype: float64
'''
print(s.expanding().sum().head())
'''
2020-01-01 -0.120957
2020-01-02 -0.449439
2020-01-03 -0.123427
2020-01-04 0.718118
2020-01-05 2.795019
Freq: D, dtype: float64
'''
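The equivalence can be checked on a single seeded series. A minimal sketch, assuming a fixed seed of 42 so that all three computations see the same data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.standard_normal(1000),
              index=pd.date_range('2020-01-01', periods=1000))

via_expanding = s.expanding().sum()
via_rolling = s.rolling(window=len(s), min_periods=1).sum()
via_cumsum = s.cumsum()  # the cumulative sum is the same computation

print(np.allclose(via_expanding, via_rolling))
print(np.allclose(via_expanding, via_cumsum))
```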
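Questions and Exercises
[Question 1] How can frames be added to a date_range in bulk, or how can the timestamp density be increased over a given period?
Adjust periods, or adjust freq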
print(pd.date_range(start='2020/1/1',end='2020/1/10',periods=3))
print(pd.date_range(start='2020/1/1',end='2020/1/10',periods=7))
'''
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
'2020-01-10 00:00:00'],
dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-02 12:00:00',
'2020-01-04 00:00:00', '2020-01-05 12:00:00',
'2020-01-07 00:00:00', '2020-01-08 12:00:00',
'2020-01-10 00:00:00'],
dtype='datetime64[ns]', freq=None)
'''
print(pd.date_range(start='2020/1/1',end='2020/1/10',freq='D'))
print(pd.date_range(start='2020/1/1',end='2020/1/10',freq='S'))
'''
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10'],
dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:00:01',
'2020-01-01 00:00:02', '2020-01-01 00:00:03',
'2020-01-01 00:00:04', '2020-01-01 00:00:05',
'2020-01-01 00:00:06', '2020-01-01 00:00:07',
'2020-01-01 00:00:08', '2020-01-01 00:00:09',
...
'2020-01-09 23:59:51', '2020-01-09 23:59:52',
'2020-01-09 23:59:53', '2020-01-09 23:59:54',
'2020-01-09 23:59:55', '2020-01-09 23:59:56',
'2020-01-09 23:59:57', '2020-01-09 23:59:58',
'2020-01-09 23:59:59', '2020-01-10 00:00:00'],
dtype='datetime64[ns]', length=777601, freq='S')
'''
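To raise the density only inside one sub-period, one possible approach is to build a second, finer range and merge it into the coarse one. The dates and the 6-hour step below are assumptions chosen for illustration.

```python
import pandas as pd

# Coarse daily index for the whole period.
base = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')

# Finer 6-hour index for the sub-period that needs more detail.
dense = pd.date_range(start='2020-01-03', end='2020-01-04', freq='6h')

# union merges both, removes duplicates and keeps the result sorted.
combined = base.union(dense)
print(combined)
```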
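[Question 2] How can the precision of Timestamps be increased in bulk?
Timestamp precision goes far beyond days and can be as fine as nanoseconds (ns), for example: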
print(pd.to_datetime('2020/1/1 00:00:00.123456789'))
'''
2020-01-01 00:00:00.123456789
'''
[Question 3] For time points that fall outside the range that can be handled, is there really no way to process them?
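The notes leave this question unanswered. One possible approach, offered here as a sketch rather than as the intended answer, is to represent such dates as Period objects, which are not limited to the range of nanosecond-resolution timestamps (roughly the years 1677 to 2262). The dates below are arbitrary examples.

```python
import pandas as pd

# Bounds of nanosecond-resolution timestamps.
print(pd.Timestamp.min, pd.Timestamp.max)

# A Period with daily frequency can hold a date far outside those bounds.
old_day = pd.Period('1215-06-15', freq='D')
next_day = old_day + 1

# period_range builds an index of such periods.
old_days = pd.period_range('1215-06-15', '1215-06-19', freq='D')
print(old_day, next_day, len(old_days))
```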
[Question 4] Given a set of non-consecutive dates, how can we quickly find the dates that lie between its maximum and minimum dates but do not appear in the set?
time = pd.date_range(start='2020/12/20', end='2020/12/31', periods=3)
print(time)
print(time.max(), time.min())
# build the full daily range between the minimum and maximum, then keep the dates that are not in time
time1 = pd.date_range(start=str(time.min()), end=str(time.max()), freq='D')
print(time1[~time1.isin(time)])
'''
DatetimeIndex(['2020-12-20 00:00:00', '2020-12-25 12:00:00',
'2020-12-31 00:00:00'],
dtype='datetime64[ns]', freq=None)
2020-12-31 00:00:00 2020-12-20 00:00:00
DatetimeIndex(['2020-12-21', '2020-12-22', '2020-12-23', '2020-12-24',
'2020-12-25', '2020-12-26', '2020-12-27', '2020-12-28',
'2020-12-29', '2020-12-30'],
dtype='datetime64[ns]', freq='D')
'''
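An alternative to the boolean mask is a set difference on the index. The following sketch uses a small hypothetical set of dates and Index.difference; the variable names are illustrative.

```python
import pandas as pd

# A hypothetical set of non-consecutive dates.
dates = pd.to_datetime(['2020-12-20', '2020-12-23', '2020-12-31'])

# Every calendar day between the earliest and latest date.
full_range = pd.date_range(start=dates.min(), end=dates.max(), freq='D')

# Days in the full range that are absent from the original set.
missing = full_range.difference(dates)
print(missing)
```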