pandas学习-第九章-时序数据

时序数据的创建

import pandas as pd 
import numpy as np

四类时间变量

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Ywu2mJqc-1593446125517)(attachment:image.png)]

时间点的创立

to_datetime方法

pd.to_datetime('2020.1.1')
pd.to_datetime('2020 1.1')
pd.to_datetime('2020 1 1')
pd.to_datetime('2020 1-1')
pd.to_datetime('2020-1 1')
pd.to_datetime('2020-1-1')
pd.to_datetime('2020/1/1')
pd.to_datetime('1.1.2020')
pd.to_datetime('1.1 2020')
pd.to_datetime('1 1 2020')
pd.to_datetime('1 1-2020')
pd.to_datetime('1-1 2020')
pd.to_datetime('1-1-2020')
pd.to_datetime('1/1/2020')
pd.to_datetime('20200101')
pd.to_datetime('2020.0101')
Timestamp('2020-01-01 00:00:00')
  • 也可以使用format进行匹配
pd.to_datetime('2020`1`1',format='%Y`%m`%d')
Timestamp('2020-01-01 00:00:00')
  • 列表也可以准换为时间索引
pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))
2020-01-01    0
2020-01-02    1
dtype: int64
type(pd.to_datetime(['2020/1/1','2020/1/2']))
pandas.core.indexes.datetimes.DatetimeIndex
  • 对于表格而言,如果列已经按时间顺序排好,则利用to_datetime可自动转换
data=pd.DataFrame({'year':[2020,2020],'month':[1,1],'day':[1,2]})
pd.to_datetime(data)
0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]

时间精度与范围

  • Timestamp的精度远远不止day,可以最小到纳秒ns
pd.to_datetime('2020/1/1 00:00:00.123456789')
Timestamp('2020-01-01 00:00:00.123456789')

date_range方法

  • start/end/periods(时间点个数)/freq(间隔方法)是该方法最重要的参数,给定了其中的3个,剩下的一个就会被确定
pd.date_range(start='2020/1/1',end='2020/1/10',periods=5)
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-03 06:00:00',
               '2020-01-05 12:00:00', '2020-01-07 18:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')
pd.date_range(start='2020/1/1',periods=4,freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'], dtype='datetime64[ns]', freq='D')
pd.date_range(start='2020/1/1',periods=4,freq='T')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00', '2020-01-01 00:03:00'],
              dtype='datetime64[ns]', freq='T')
pd.date_range(start='2020/1/1',periods=4,freq='M') #一直到月末
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30'], dtype='datetime64[ns]', freq='M')
pd.date_range(start='2020/1/1',periods=4,freq='BYS')
DatetimeIndex(['2020-01-01', '2021-01-01', '2022-01-03', '2023-01-02'], dtype='datetime64[ns]', freq='BAS-JAN')

bdate_range方法

  • bdate_range是一个类似与date_range的方法,特点在于可以在自带的工作日间隔设置上,再选择weekmask参数和holidays参数

  • 它的freq中有一个特殊的’C’/‘CBM’/'CBMS’选项,表示定制,需要联合weekmask参数和holidays参数使用

  • 例如现在需要将工作日中的周一、周二、周五3天保留,并将部分holidays剔除

weekmask='Mon Tue Fri'
holidays=[pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
pd.bdate_range(start='2020/1/1',end='2020/1/20',freq='C',weekmask=weekmask
              ,holidays=holidays)
DatetimeIndex(['2020-01-03', '2020-01-06', '2020-01-13', '2020-01-14',
               '2020-01-17', '2020-01-20'],
              dtype='datetime64[ns]', freq='C')

DateOffset对象

DataOffset与Timedelta的区别

  • Timedelta绝对时间差的特点指无论是冬令时还是夏令时,增减1day都只计算24小时
  • DataOffset相对时间差指,无论一天是23\24\25小时,增减1day都与当天相同的时间保持一致
ts=pd.Timestamp('2020/1/1 02:00:00',tz='Europe/Helsinki')
ts+pd.Timedelta(days=2)
Timestamp('2020-01-03 02:00:00+0200', tz='Europe/Helsinki')
ts+pd.DateOffset(days=2)
Timestamp('2020-01-03 02:00:00+0200', tz='Europe/Helsinki')

增减一段时间

  • ateOffset的可选参数包括years/months/weeks/days/hours/minutes/seconds
pd.Timestamp('2020/1/1')+pd.DateOffset(minutes=20)-pd.DateOffset(weeks=2)
Timestamp('2019-12-18 00:20:00')

各类常用offset对象

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-lGDtzKoy-1593446125519)(attachment:image.png)]

pd.Timestamp('2020/1/1')+pd.offsets.Week(2)
Timestamp('2020-01-15 00:00:00')
pd.Timestamp('2020/01/01') + pd.offsets.BQuarterBegin(1)
Timestamp('2020-03-02 00:00:00')

序列的offset操作

  • 利用apply函数
pd.Series(pd.offsets.BYearBegin(3).apply(i) for i in pd.date_range('20200101',periods=3,freq='Y'))
0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]
  • 直接使用对象加减
pd.date_range('2020/1/1',periods=3,freq='Y')+pd.offsets.BYearBegin(3)
DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq='A-DEC')
pd.Series(pd.offsets.CDay(3,weekmask='Wed Fri').apply(i)
                                  for i in pd.date_range('20200105',periods=3,freq='D'))
0   2020-01-15
1   2020-01-15
2   2020-01-15
dtype: datetime64[ns]
pd.date_range('20200105',periods=3,freq='D')
DatetimeIndex(['2020-01-05', '2020-01-06', '2020-01-07'], dtype='datetime64[ns]', freq='D')

时序的索引及属性

索引切片

rng=pd.date_range('2020','2021',freq='W')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts.head()
2020-01-05   -0.243773
2020-01-12    0.821945
2020-01-19    1.234821
2020-01-26   -0.345294
2020-02-02    0.209237
Freq: W-SUN, dtype: float64
ts['2020/1/26']
-0.3452940253341838

重采样

resample对象的基本操作

  • 采样频率一般设置为上面提到的offset字符
data_r=pd.DataFrame(np.random.randn(1000,3),index=pd.date_range('1/1/2020',freq='S',periods=1000),columns=['A','B','C'])
data_r
ABC
2020-01-01 00:00:00-0.099414-0.1929171.066336
2020-01-01 00:00:012.194989-0.119435-0.337781
2020-01-01 00:00:02-1.2786290.5751460.029983
2020-01-01 00:00:031.5754720.060964-1.603794
2020-01-01 00:00:040.7493231.8656360.146967
............
2020-01-01 00:16:350.695445-0.070806-0.513009
2020-01-01 00:16:360.204610-1.123102-0.036155
2020-01-01 00:16:370.0831282.1431661.597349
2020-01-01 00:16:38-0.829676-0.948882-1.235877
2020-01-01 00:16:390.2549030.2462311.211078

1000 rows × 3 columns

r=data_r.resample('3min')
r
<pandas.core.resample.DatetimeIndexResampler object at 0x000001ED3C8A6148>
r.sum()#每3分钟采样一次
ABC
2020-01-01 00:00:0019.188984-4.21782421.772775
2020-01-01 00:03:00-2.906431-6.085502-8.279348
2020-01-01 00:06:002.325963-7.10687214.082200
2020-01-01 00:09:00-14.9551322.491868-11.081364
2020-01-01 00:12:00-1.06433619.65809015.028689
2020-01-01 00:15:006.039573-3.221933-1.965375
data_r2=pd.DataFrame(np.random.randn(200,3)
                     ,index=pd.date_range('2020/1/1',freq='D'
                                         ,periods=200),columns=['A','B','C'])
r=data_r2.resample('CBMS')
r.sum()
ABC
2020-01-01-1.040047-2.7998461.522516
2020-02-03-0.6470100.8280693.202404
2020-03-022.118657-10.868392-1.208521
2020-04-011.772132-0.6109016.605194
2020-05-019.0234141.690500-0.139112
2020-06-019.4319741.707741-4.933881
2020-07-01-9.0960840.48494710.837215

采样聚合

r=data_r.resample('3T')
r['A'].mean()
2020-01-01 00:00:00    0.106605
2020-01-01 00:03:00   -0.016147
2020-01-01 00:06:00    0.012922
2020-01-01 00:09:00   -0.083084
2020-01-01 00:12:00   -0.005913
2020-01-01 00:15:00    0.060396
Freq: 3T, Name: A, dtype: float64
r['A'].agg([np.sum,np.mean,np.std])
summeanstd
2020-01-01 00:00:0019.1889840.1066050.993201
2020-01-01 00:03:00-2.906431-0.0161471.014105
2020-01-01 00:06:002.3259630.0129220.974205
2020-01-01 00:09:00-14.955132-0.0830841.013231
2020-01-01 00:12:00-1.064336-0.0059131.054681
2020-01-01 00:15:006.0395730.0603960.870126
  • 类似的,可以使用函数/lambda表达式
r.agg({'A':np.sum,'B':lambda x:max(x)-min(x)})
AB
2020-01-01 00:00:0019.1889844.719555
2020-01-01 00:03:00-2.9064315.455282
2020-01-01 00:06:002.3259635.365256
2020-01-01 00:09:00-14.9551325.278425
2020-01-01 00:12:00-1.0643365.027111
2020-01-01 00:15:006.0395734.847176

采样组的迭代

  • 采样组的迭代和groupby迭代完全类似,对于每一个组都可以分别做相应操作
small=pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
                                                 , '2020-01-01 00:31:00','2020-01-01 01:00:00'
                                                 ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
for name,group in resampled:
    print('group:',name)
    print('-'*27)
    print(group,end='\n\n')
group: 2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

group: 2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

group: 2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

group: 2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64

窗口函数

s=pd.Series(np.random.randn(1000),index=pd.date_range('2020/1/1',periods=1000))
s.head()
2020-01-01    1.329339
2020-01-02   -0.578495
2020-01-03    0.144955
2020-01-04   -0.388870
2020-01-05   -1.020317
Freq: D, dtype: float64

Rolling

  • 所谓rolling方法,就是规定一个窗口,它和groupby对象一样,本身不会进行操作,需要配合聚合函数才能计算结果
s.rolling(window=50)
Rolling [window=50,center=False,axis=0]
s.rolling(window=50).mean()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22    0.100815
2022-09-23    0.135331
2022-09-24    0.136454
2022-09-25    0.175283
2022-09-26    0.169613
Freq: D, Length: 1000, dtype: float64
  • min_periods参数是指需要的非缺失数据点数量阀值
s.rolling(window=50,min_periods=3).mean().head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.298600
2020-01-04    0.126733
2020-01-05   -0.102677
Freq: D, dtype: float64
  • count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr都是常用的聚合函数
s.rolling(window=50,min_periods=3).std()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.963152
2020-01-04    0.858251
2020-01-05    0.903101
                ...   
2022-09-22    0.872299
2022-09-23    0.907054
2022-09-24    0.906309
2022-09-25    0.889243
2022-09-26    0.893184
Freq: D, Length: 1000, dtype: float64

rolling的apply聚合

  • 使用apply聚合时,只需记住传入的是window大小的Series,输出的必须是标量即可,比如如下计算变异系数
s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    3.225560
2020-01-04    6.772149
2020-01-05   -8.795515
Freq: D, dtype: float64

基于时间的rolling

s.rolling('15D').max().head()
2020-01-01    1.329339
2020-01-02    1.329339
2020-01-03    1.329339
2020-01-04    1.329339
2020-01-05    1.329339
Freq: D, dtype: float64

Expanding

expanding函数

  • 普通的expanding函数等价与rolling(window=len(s),min_periods=1),是对序列的累计计算
s.rolling(window=len(s),min_periods=1).sum()
2020-01-01     1.329339
2020-01-02     0.750845
2020-01-03     0.895800
2020-01-04     0.506930
2020-01-05    -0.513387
                ...    
2022-09-22   -45.244308
2022-09-23   -43.385423
2022-09-24   -43.810607
2022-09-25   -43.070710
2022-09-26   -43.646916
Freq: D, Length: 1000, dtype: float64
s.expanding().sum().head()
2020-01-01    1.329339
2020-01-02    0.750845
2020-01-03    0.895800
2020-01-04    0.506930
2020-01-05   -0.513387
Freq: D, dtype: float64

特殊函数

shift/diff/pct_change都是涉及到了元素关系

  1. shift是指序列索引不变,但值向后移动
  2. diff是指前后元素的差,period参数表示间隔,默认为1,并且可以为负
  3. pct_change是值前后元素的变化百分比,period参数与diff类似
s.head(10)
2020-01-01    1.329339
2020-01-02   -0.578495
2020-01-03    0.144955
2020-01-04   -0.388870
2020-01-05   -1.020317
2020-01-06   -0.751869
2020-01-07    0.678827
2020-01-08   -0.957797
2020-01-09    0.578865
2020-01-10   -0.775079
Freq: D, dtype: float64
s.shift(2).head(5)
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    1.329339
2020-01-04   -0.578495
2020-01-05    0.144955
Freq: D, dtype: float64
s.shift(3).head(5)
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04    1.329339
2020-01-05   -0.578495
Freq: D, dtype: float64
s.diff(3).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04   -1.718209
2020-01-05   -0.441823
Freq: D, dtype: float64
s.diff(2).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -1.184384
2020-01-04    0.189624
2020-01-05   -1.165273
Freq: D, dtype: float64
s.pct_change(3).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04   -1.292529
2020-01-05    0.763746
Freq: D, dtype: float64

问题与练习

问题

如何对date_range进行批量加帧操作或对某一时间段加大时间戳密度?

  1. 调整freq参数
  2. 增加periods个数

因为这个是确定3个参数,主要还是在时间点个数与间隔上进行处理

如何批量增加TimeStamp的精度

import time
import datetime

t = time.time()

print (t)                       #原始时间数据
print (int(t))                  #秒级时间戳
print (int(round(t * 1000)))    #毫秒级时间戳
print (int(round(t * 1000000))) #微秒级时间戳
1593443476.5105758
1593443476
1593443476511
1593443476510576

对于超出处理时间的时间点,是否真的完全没有处理方法?

网上搜了下,没有看到相关方法,自己也不会处理超过这个的数据

给定一组非连续的日期,怎么快速找出位于其最大日期和最小日期之间,且没有出现在该组日期中的日期?

s.index.max()
Timestamp('2022-09-26 00:00:00', freq='D')
s.index.min()
Timestamp('2020-01-01 00:00:00', freq='D')

第三个不会

练习

现有一份关于某超市牛奶销售额的时间序列数据,请完成下列问题:

1. 销售额出现最大值的是星期几?

data_sale=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\time_series_one.csv',parse_dates=['日期'])
data_sale
日期销售额
02017-02-172154
12017-02-182095
22017-02-193459
32017-02-202198
42017-02-212413
.........
9952019-11-093022
9962019-11-102961
9972019-11-113984
9982019-11-122799
9992019-11-132941

1000 rows × 2 columns

data_sale.shape
(1000, 2)
data_sale['销售额'].max()
4333
data_sale.loc[data_sale['销售额']==4333]
日期销售额
9262019/9/14333
  • 周日

2. 计算除去春节、国庆、五一节假日的月度销售总额

#先输入假期
holiday = pd.date_range(start='20170501', end='20170503').append(
          pd.date_range(start='20171001', end='20171007')).append(
          pd.date_range(start='20180215', end='20180221')).append(
          pd.date_range(start='20180501', end='20180503')).append(
          pd.date_range(start='20181001', end='20181007')).append(
          pd.date_range(start='20190204', end='20190224')).append(
          pd.date_range(start='20190501', end='20190503')).append(
          pd.date_range(start='20191001', end='20191007'))
result_1 = data_sale[~data_sale['日期'].isin(holiday)].set_index('日期').resample('MS').sum()
result_1
销售额
日期
2017-02-0131740
2017-03-0180000
2017-04-0174734
2017-05-0176237
2017-06-0180750
2017-07-0183107
2017-08-0185715
2017-09-0179604
2017-10-0161197
2017-11-0178877
2017-12-0184900
2018-01-0185869
2018-02-0161838
2018-03-0188339
2018-04-0182011
2018-05-0176932
2018-06-0185307
2018-07-0182316
2018-08-0185917
2018-09-0182740
2018-10-0167018
2018-11-0181333
2018-12-0191381
2019-01-0189407
2019-02-0121219
2019-03-0187565
2019-04-0190202
2019-05-0185665
2019-06-0190301
2019-07-0190902
2019-08-0193664
2019-09-0189077
2019-10-0172099
2019-11-0138423

3. 按季度计算周末(周六和周日)的销量总额

result_2=data_sale[data_sale['日期'].dt.dayofweek.isin([5,6])].set_index('日期').resample('QS').sum()
result_2
销售额
日期
2017-01-0132894
2017-04-0166692
2017-07-0169099
2017-10-0170384
2018-01-0174671
2018-04-0169950
2018-07-0174245
2018-10-0174699
2019-01-0177835
2019-04-0177042
2019-07-0176276
2019-10-0135994

总结

1. 对于时序数据对于自己十分重要,这个是需要重点掌握的
2. 课后习题难度真的有点太大了。自己完全是不能完成了!
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值