ch11 Time Series

11.1 Date and Time Data Types and Tools

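  • The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime, time, and calendar modules are the main ones we will use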
from datetime import datetime
now = datetime.now()
now
datetime.datetime(2018, 12, 25, 9, 25, 16, 517966)
now.year, now.month, now.day
(2018, 12, 25)
  • datetime stores both the date and the time down to the microsecond. timedelta represents the difference between two datetime objects:
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta
datetime.timedelta(926, 56700)
delta.days
926
delta.seconds
56700
  • You can add (or subtract) a timedelta, or a multiple of one, to a datetime object; this yields a new, shifted object:
from datetime import timedelta
start = datetime(2011,1,7)
start + timedelta(12,20)  # days, seconds
datetime.datetime(2011, 1, 19, 0, 0, 20)
start - 2 * timedelta(12)
datetime.datetime(2010, 12, 14, 0, 0)
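The argument order of timedelta is easy to misread, so here is a minimal sketch using only the standard library. The variable names shifted, delta, and total are illustrative and not part of the original notes; the point is that the first two positional arguments are days and seconds, and that days and seconds are separate fields of the result.

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 7)

# Positional arguments are (days, seconds); keywords make the intent explicit.
shifted = start + timedelta(days=12, seconds=20)

# Subtracting two datetime objects yields a timedelta.
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# days and seconds are stored separately; total_seconds() combines them.
total = delta.total_seconds()
print(shifted, delta.days, delta.seconds, total)
```

Using keyword arguments avoids having to remember the positional order.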

The main data types in the datetime module are date, time, datetime, timedelta, and tzinfo.

Converting Between Strings and datetime

  • datetime objects and pandas Timestamp objects (introduced shortly) can be formatted as strings with str or with the strftime method, which takes a format string:
stamp = datetime(2011,1,3)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')
'2011-01-03'
  • datetime.strptime uses the same format codes to convert strings into dates:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
datetime.datetime(2011, 1, 3, 0, 0)
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(value,'%m/%d/%Y') for value in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
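Because strftime and strptime share the same format codes, a round trip through a string should give back the original value. The sketch below illustrates this with the standard library only; the names fmt, text, parsed, and mismatch_raised are illustrative. It also shows that a string that does not match the format raises ValueError.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3)
fmt = '%Y-%m-%d'

# Format to a string, then parse it back with the same format codes.
text = stamp.strftime(fmt)
parsed = datetime.strptime(text, fmt)

# A string that does not match the format raises ValueError.
try:
    datetime.strptime('7/6/2011', fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(text, parsed, mismatch_raised)
```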
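  • datetime.strptime is the best way to parse a date when the format is known. Writing a format specification every time is tedious, however, so you can use the parser.parse method of the third-party dateutil package instead (it is installed automatically with pandas):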
from dateutil.parser import parse
parse('2011-01-03')
datetime.datetime(2011, 1, 3, 0, 0)
parse("Jan 31, 1997 10:45 PM")
datetime.datetime(1997, 1, 31, 22, 45)
  • In many international locales the day commonly appears before the month; pass dayfirst=True to indicate this
parse("6/12/2011",dayfirst=True)
datetime.datetime(2011, 12, 6, 0, 0)
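To make the effect of dayfirst concrete, the following sketch parses the same ambiguous string both ways. It assumes dateutil is importable (it is a pandas dependency); the names month_first and day_first are illustrative.

```python
from datetime import datetime
from dateutil.parser import parse

ambiguous = "6/12/2011"

# Default behavior reads the string as month/day/year.
month_first = parse(ambiguous)

# dayfirst=True reads it as day/month/year.
day_first = parse(ambiguous, dayfirst=True)

print(month_first, day_first)
```

When the format of the input is known in advance, an explicit strptime format is less ambiguous than relying on the parser's guess.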
  • pandas is generally oriented toward working with arrays of dates:
import pandas as pd
datestrs = ["2011-07-06 12:00:00","2011-08-06 00:00:00"]
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
# It also handles missing values
idx = pd.to_datetime(datestrs + [None])
idx
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
idx[2]  # NaT (Not a Time) is pandas's null value for timestamp data.
NaT
pd.isnull(idx)
array([False, False,  True])
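As a small follow-up sketch, the NaT entries can be detected and dropped with the usual pandas null-handling tools. The names mask and valid are illustrative and not from the original notes.

```python
import pandas as pd

datestrs = ["2011-07-06 12:00:00", "2011-08-06 00:00:00"]

# None is converted to NaT, the null value for timestamps.
idx = pd.to_datetime(datestrs + [None])

# isna marks the NaT entry; dropna removes it.
mask = pd.isna(idx)
valid = idx.dropna()

print(mask, valid)
```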

11.2 Time Series Basics

  • The most basic kind of time series in pandas is a Series indexed by timestamps
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02    0.319960
2011-01-05    1.431469
2011-01-07   -1.651676
2011-01-08   -1.302452
2011-01-10   -0.284987
2011-01-12    0.565406
dtype: float64
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
# datetime objects can also be used as slice bounds
ts[datetime(2011,1,7) :]
2011-01-07   -1.651676
2011-01-08   -1.302452
2011-01-10   -0.284987
2011-01-12    0.565406
dtype: float64
ts + ts[::2]
2011-01-02    0.639919
2011-01-05         NaN
2011-01-07   -3.303352
2011-01-08         NaN
2011-01-10   -0.569973
2011-01-12         NaN
dtype: float64
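The NaN values in ts + ts[::2] come from index alignment: arithmetic between differently indexed time series aligns on the dates, and dates missing from either operand produce NaN. The sketch below reproduces this with deterministic values (np.arange instead of random numbers, an assumption made so the result is predictable) and uses iloc for the positional slice.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# Every second element: 2011-01-02, 2011-01-07, 2011-01-10.
every_other = ts.iloc[::2]

# Addition aligns on the index; dates missing from every_other become NaN.
total = ts + every_other
print(total)
```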
# pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:
ts.index.dtype
dtype('<M8[ns]')
# Scalar values from a DatetimeIndex are pandas Timestamp objects:
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')

Indexing, Selection, Subsetting

  • When you select data by label, a time series behaves like any other pandas.Series
stamp = ts.index[2]
ts[stamp]
-1.6516760222106173
# As a convenience, you can also pass a string that can be interpreted as a date
ts['1/10/2011']
-0.2849867054501697
ts['20110110']
-0.2849867054501697
  • For longer time series, passing just a year, or a year and month, is enough to select a slice of the data
longer_ts = pd.Series(np.random.randn(1000), index = pd.date_range('1/1/2000', periods=1000))
longer_ts
2000-01-01    0.323078
2000-01-02   -0.192916
2000-01-03    0.161027
2000-01-04    1.042233
2000-01-05    1.344387
2000-01-06   -0.764185
2000-01-07   -0.141419
2000-01-08    0.297445
2000-01-09    0.623654
2000-01-10    0.584203
2000-01-11    0.087188
2000-01-12   -0.110279
2000-01-13    0.209217
2000-01-14   -0.915065
2000-01-15   -0.713069
2000-01-16   -0.836166
2000-01-17    0.295419
2000-01-18    0.288559
2000-01-19   -0.084119
2000-01-20   -0.413960
2000-01-21   -0.120220
2000-01-22    0.453401
2000-01-23   -2.301278
2000-01-24   -0.253605
2000-01-25   -1.404243
2000-01-26    1.409910
2000-01-27    0.959088
2000-01-28   -2.079919
2000-01-29   -1.176011
2000-01-30   -0.356094
                ...   
2002-08-28   -1.037953
2002-08-29    0.936959
2002-08-30   -0.991882
2002-08-31   -1.012418
2002-09-01   -0.333391
2002-09-02   -0.562380
2002-09-03   -1.936792
2002-09-04    0.086965
2002-09-05   -0.751722
2002-09-06    0.874634
2002-09-07   -0.694940
2002-09-08   -1.155072
2002-09-09   -0.266088
2002-09-10   -0.412032
2002-09-11    0.032159
2002-09-12   -0.569722
2002-09-13   -0.769999
2002-09-14   -0.540141
2002-09-15    0.380193
2002-09-16   -0.834590
2002-09-17   -0.105814
2002-09-18   -0.509613
2002-09-19   -0.464820
2002-09-20   -0.369378
2002-09-21   -0.588090
2002-09-22   -1.452517
2002-09-23    1.517069
2002-09-24   -0.177512
2002-09-25   -1.207979
2002-09-26    0.575119
Freq: D, Length: 1000, dtype: float64
longer_ts['2001']
2001-01-01   -0.710368
2001-01-02   -0.493213
2001-01-03    0.011035
2001-01-04   -0.188882
2001-01-05   -0.275450
2001-01-06   -1.397614
2001-01-07   -0.050230
2001-01-08    0.995234
2001-01-09    0.144589
2001-01-10    1.399901
2001-01-11   -0.230674
2001-01-12   -0.921200
2001-01-13   -0.125920
2001-01-14   -0.398851
2001-01-15   -1.369030
2001-01-16   -1.083224
2001-01-17    1.703383
2001-01-18    1.481350
2001-01-19    0.721221
2001-01-20   -0.555076
2001-01-21   -0.164058
2001-01-22    0.616386
2001-01-23   -0.614457
2001-01-24    0.624650
2001-01-25   -0.141876
2001-01-26    0.491621
2001-01-27    0.434586
2001-01-28    0.030046
2001-01-29    1.141433
2001-01-30    2.319519
                ...   
2001-12-02   -1.371908
2001-12-03   -0.947667
2001-12-04   -1.169943
2001-12-05    3.115463
2001-12-06   -0.796079
2001-12-07   -0.287574
2001-12-08   -0.775596
2001-12-09    0.473937
2001-12-10    0.353532
2001-12-11   -1.696697
2001-12-12   -0.250758
2001-12-13   -0.395799
2001-12-14   -0.565465
2001-12-15    0.035062
2001-12-16    0.086432
2001-12-17    0.069176
2001-12-18   -0.834662
2001-12-19    0.415141
2001-12-20   -0.433074
2001-12-21    0.731880
2001-12-22   -0.831124
2001-12-23    0.194700
2001-12-24   -0.051128
2001-12-25   -0.379829
2001-12-26   -1.756667
2001-12-27   -0.581870
2001-12-28    1.144978
2001-12-29    1.232212
2001-12-30   -1.354103
2001-12-31   -0.929930
Freq: D, Length: 365, dtype: float64
longer_ts['2001-05']
2001-05-01   -1.271589
2001-05-02   -0.351115
2001-05-03   -0.895262
2001-05-04   -0.713803
2001-05-05   -0.572470
2001-05-06    0.388224
2001-05-07   -0.415884
2001-05-08   -0.149180
2001-05-09   -1.331999
2001-05-10    0.417673
2001-05-11   -0.633069
2001-05-12    1.277451
2001-05-13    0.350078
2001-05-14   -0.477254
2001-05-15    0.331342
2001-05-16   -0.844850
2001-05-17    1.931488
2001-05-18   -0.291305
2001-05-19    0.066933
2001-05-20    0.516700
2001-05-21   -0.472930
2001-05-22   -1.264003
2001-05-23   -0.222774
2001-05-24   -0.633053
2001-05-25   -1.627209
2001-05-26    0.206001
2001-05-27    0.929017
2001-05-28   -0.386632
2001-05-29    1.769678
2001-05-30   -0.250572
2001-05-31   -0.815622
Freq: D, dtype: float64
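The year and year-month selections above can be checked on a deterministic series. This is a minimal sketch that uses np.arange in place of random data and the explicit .loc accessor (both are choices made here, not taken from the original notes).

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2000-01-01, with values 0..999.
longer_ts = pd.Series(np.arange(1000),
                      index=pd.date_range('2000-01-01', periods=1000))

# A bare year selects every timestamp in that year.
year_2001 = longer_ts.loc['2001']

# A year-month string selects every timestamp in that month.
may_2001 = longer_ts.loc['2001-05']

print(len(year_2001), len(may_2001))
```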
  • Range queries
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02    0.372253
2011-01-05   -0.746129
2011-01-07   -0.702319
2011-01-08    0.140512
2011-01-10    0.248298
2011-01-12    0.392128
dtype: float64
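ts['1/6/2011':'1/11/2011']  # Slicing this way produces a view on the original time series: no data is copied, and modifying the slice is reflected in the original data (under pandas's copy-on-write mode, such modifications no longer propagate).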
2011-01-07   -0.702319
2011-01-08    0.140512
2011-01-10    0.248298
dtype: float64
# An equivalent instance method, truncate, slices a time series between two dates
ts.truncate(after='1/9/2011')  # the series up to and including January 9
2011-01-02    0.372253
2011-01-05   -0.746129
2011-01-07   -0.702319
2011-01-08    0.140512
dtype: float64
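The sketch below puts the two range-query styles side by side on deterministic data: string slice bounds do not need to be present in the index, and truncate(after=...) gives the same result as an open-ended slice. The names window and head are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series(np.arange(6.0), index=index)

# Neither bound is in the index; both ends are inclusive.
window = ts.loc['2011-01-06':'2011-01-11']

# truncate drops everything after the given date.
head = ts.truncate(after='2011-01-09')

print(window)
print(head)
```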
dates = pd.date_range('1/1/2000', periods=100, freq = 'W-WED')
long_df = pd.DataFrame(np.random.randn(100,4), index = dates, columns = ['Colorado', 'Texas', 'New York', 'Ohio'])
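long_df.loc['5-2001']  # use .loc for row selection; recent pandas no longer accepts a date string in plain [] on a DataFrame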
            Colorado     Texas  New York      Ohio
2001-05-02  0.228299  0.916194  1.478840 -0.715889
2001-05-09  0.401658  1.397341  0.505877  1.921401
2001-05-16 -1.370897 -0.493776  0.300839 -0.520820
2001-05-23 -0.565485  0.820914  0.056647  0.890600
2001-05-30  0.092271 -0.752676  0.585210  0.873675

Time Series with Duplicate Indices

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000','1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
dup_ts.index.is_unique
False
dup_ts['1/2/2000']
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
dup_ts['1/3/2000']
4
# To aggregate data that has non-unique timestamps, one option is to use groupby and pass level=0:
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
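As a compact recap of the duplicate-index workflow, this sketch checks for duplicates and then collapses them with groupby(level=0). The names means and counts are illustrative. Note that current pandas releases return floating-point means here, which may differ from the integer output shown above.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(np.arange(5), index=dates)

# level=0 groups on the (only) index level, i.e. on the timestamps.
grouped = dup_ts.groupby(level=0)
means = grouped.mean()
counts = grouped.count()

print(dup_ts.index.is_unique, means.index.is_unique)
print(means)
print(counts)
```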

Date Ranges, Frequencies, and Shifting

ts
2011-01-02    0.372253
2011-01-05   -0.746129
2011-01-07   -0.702319
2011-01-08    0.140512
2011-01-10    0.248298
2011-01-12    0.392128
dtype: float64
resampler = ts.resample('D')
resampler
DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
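The resample('D') call above only creates a lazy Resampler object; nothing is computed until an aggregation is applied. As a hedged sketch of what it is for, the code below converts an irregular series to a fixed daily frequency, with deterministic values substituted for the random data. Days with no observation come out as NaN.

```python
import numpy as np
import pandas as pd

index = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series(np.arange(6.0), index=index)

# One row per calendar day from the first to the last timestamp;
# days without data have no values to average and become NaN.
daily = ts.resample('D').mean()

print(daily)
```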
  • Generating date ranges
  • pandas.date_range generates a DatetimeIndex of a given length at a given frequency; by default, date_range produces daily timestamps.
index = pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
  • If you pass only a start date or only an end date, you must also pass the number of periods to generate:
pd.date_range(end='2012-06-01',periods=20)
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
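'''To generate a date index made up of the last business day of each month,
pass the "BM" frequency (business end of month; Table 11-4 lists the frequencies;
newer pandas releases deprecate this alias in favor of "BME").
Only dates that fall on or inside the date interval and match the frequency are included:'''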
pd.date_range('2000-01-01','2000-12-01',freq='BM')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
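String aliases for frequencies have changed between pandas versions, so one version-independent way to write the same request is to pass the offset object directly. The sketch below uses pd.offsets.BMonthEnd for this; the variable name bm is illustrative.

```python
import pandas as pd

# BMonthEnd is the offset object behind the business-month-end alias.
bm = pd.date_range('2000-01-01', '2000-12-01',
                   freq=pd.offsets.BMonthEnd())

# Every generated date is a weekday (Monday=0 ... Friday=4).
all_weekdays = bool((bm.dayofweek < 5).all())

print(bm)
print(all_weekdays)
```

December is absent because its last business day falls after the 2000-12-01 end bound.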
  • Basic time series frequencies


pd.date_range('2012-05-02 12:56:31',periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')
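# date_range keeps the time of day of the start timestamp by default; normalize=True snaps the generated timestamps to midnight
pd.date_range('2012-05-02 12:56:31',periods=5, normalize=True)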
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')
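The following sketch compares the two calls above and shows that normalizing at construction time matches calling normalize() on an existing index. The names raw and normalized are illustrative.

```python
import pandas as pd

# Without normalize, the time of day of the start timestamp is preserved.
raw = pd.date_range('2012-05-02 12:56:31', periods=5)

# With normalize=True, every timestamp is set to midnight.
normalized = pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)

# DatetimeIndex.normalize() does the same to an existing index.
same = bool((raw.normalize() == normalized).all())

print(raw[0], normalized[0], same)
```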

Frequencies and Date Offsets

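  • Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, such as "M" for monthly or "H" for hourly. For each base frequency there is a corresponding object known as a date offset.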
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour
<Hour>
# Pass an integer to define a multiple of the offset
four_hours = Hour(4)
four_hours
<4 * Hours>
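# There is usually no need to create these objects explicitly; a string alias such as "H" or "4H" works instead. Putting an integer before the base frequency creates a multiple: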
pd.date_range('2000-01-01','2000-01-03 23:59', freq='4h')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')
# Offsets can be combined with addition
Hour(2) + Minute(30)
<150 * Minutes>
# You can also pass a frequency string such as "1h30min"
pd.date_range('2001-01-01',periods=10, freq='1h30min')
DatetimeIndex(['2001-01-01 00:00:00', '2001-01-01 01:30:00',
               '2001-01-01 03:00:00', '2001-01-01 04:30:00',
               '2001-01-01 06:00:00', '2001-01-01 07:30:00',
               '2001-01-01 09:00:00', '2001-01-01 10:30:00',
               '2001-01-01 12:00:00', '2001-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
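Offset objects can be passed to date_range in place of a string alias, which sidesteps differences in alias spelling between pandas versions. A minimal sketch, with the illustrative name combined for the summed offset:

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Adding two fixed-length offsets gives a single equivalent offset.
combined = Hour(1) + Minute(30)

# The combined offset can be used directly as a frequency.
idx = pd.date_range('2001-01-01', periods=4, freq=combined)

print(combined)
print(idx)
```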
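  • Some frequencies describe points in time that are not evenly spaced. For example, "M" (calendar month end) and "BM" (last business day of the month) depend on the number of days in the month and, in the latter case, on whether the month ends on a weekend. For lack of a better term, I call these anchored offsets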
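To see the uneven spacing directly, the sketch below generates calendar month ends with the pd.offsets.MonthEnd offset object and measures the gap between consecutive dates. The names month_ends and gaps are illustrative; 2000 is a leap year, so February ends on the 29th.

```python
import pandas as pd

# Four calendar month ends starting from January 2000.
month_ends = pd.date_range('2000-01-01', periods=4,
                           freq=pd.offsets.MonthEnd())

# Gap in days between consecutive month ends; the values differ
# because months have different lengths.
gaps = month_ends.to_series().diff().dropna().dt.days.tolist()

print(month_ends)
print(gaps)
```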