Financial Time Series Analysis and Processing
Module imports
from datetime import datetime
Date formats in Python: datetime objects and related conversions
now = datetime.now()
print(now)
print(type(now))
# 2023-02-26 13:40:46.803721
# <class 'datetime.datetime'>
print('{}年{}月{}日'.format(now.year, now.month, now.day))
# 2023年2月26日
delta = datetime.now() - datetime(2023, 1, 1)
# datetime.timedelta(days=56, seconds=49554, microseconds=961194)
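Subtracting two datetime objects, as above, yields a timedelta. The following is a minimal sketch of how its parts can be read; the fixed dates are illustrative assumptions chosen so the result is reproducible, not values from the original run.

```python
from datetime import datetime, timedelta

# Fixed example dates (assumed for illustration, so the result is reproducible)
start = datetime(2023, 1, 1)
end = datetime(2023, 2, 26, 13, 40, 46)

delta = end - start                 # subtracting two datetimes gives a timedelta
days = delta.days                   # whole days in the difference
seconds = delta.seconds             # leftover seconds within the final partial day
one_week_later = start + timedelta(weeks=1)  # a timedelta can be added to a datetime

print(days, seconds, delta.total_seconds())
print(one_week_later)
```

Note that `delta.seconds` only holds the remainder within the last day; `delta.total_seconds()` gives the full duration.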
Converting a datetime to a string
dt_time = datetime(2023,2,26)
str_time = str(dt_time)
print(type(dt_time))
print(type(str_time))
# <class 'datetime.datetime'>
# <class 'str'>
str_time2 = dt_time.strftime('%d/%m/%y')
print(str_time2)
print(type(str_time2))
# 26/02/23
# <class 'str'>
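Three ways to convert a string to a datetime
- Method 1:
datetime.strptime()
The "p" stands for parse.
The format string must match the date string exactly; otherwise a ValueError is raised.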
dt_str = '2017-06-18'
dt_time = datetime.strptime(dt_str, '%Y-%m-%d')
print(type(dt_time))
print(dt_time)
# <class 'datetime.datetime'>
# 2017-06-18 00:00:00
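To illustrate the strict format matching, here is a small sketch. The helper name `try_parse` is hypothetical (it is not part of the original notes or of the standard library); it simply wraps `datetime.strptime()` and returns None on a mismatch.

```python
from datetime import datetime

def try_parse(text, fmt):
    """Hypothetical helper: return the parsed datetime, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # strptime raises ValueError when the string and the format disagree
        return None

ok = try_parse('2017-06-18', '%Y-%m-%d')
bad = try_parse('2017/06/18', '%Y-%m-%d')   # separators differ, so parsing fails

print(ok)
print(bad)
```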
- Method 2:
dateutil.parser
from dateutil.parser import parse
dt_str2 = '01-06-2017'
dt_time2 = parse(dt_str2) # read month-first by default; parse(dt_str2, dayfirst=True) would give 2017-06-01
print(type(dt_time2))
print(dt_time2)
# <class 'datetime.datetime'>
# 2017-01-06 00:00:00
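The effect of `dayfirst` on an ambiguous string can be sketched as follows, assuming `python-dateutil` is installed (it is normally pulled in with pandas).

```python
from dateutil.parser import parse

text = '01-06-2017'
month_first = parse(text)                 # default: interpreted as month-day-year
day_first = parse(text, dayfirst=True)    # interpreted as day-month-year

print(month_first)
print(day_first)
```

For data from a known source, stating the convention explicitly is safer than relying on the default.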
- Method 3:
pd.to_datetime()
import pandas as pd
str_time = pd.Series(['2017/06/18', '2017/06/19', '2017-06-20', '2017-06-21'], name='Course_time')
dt_time = pd.to_datetime(str_time) # the separators are mixed; pandas 2.x may need format='mixed' here
dt_time
# Result:
0 2017-06-18
1 2017-06-19
2 2017-06-20
3 2017-06-21
Name: Course_time, dtype: datetime64[ns]
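When a column may contain malformed entries, one option is to pass an explicit `format` together with `errors='coerce'`, which turns unparseable values into NaT instead of raising. A minimal sketch (the sample strings are made up for illustration):

```python
import pandas as pd

# Illustrative input: two valid dates and one invalid entry
raw = pd.Series(['2017-06-18', '2017-06-19', 'not a date'], name='Course_time')

# errors='coerce' replaces values that do not match the format with NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
n_bad = int(parsed.isna().sum())

print(parsed)
print(n_bad)
```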
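Time formats in pandas
- Timestamp: the most basic pandas date/time object. It is highly compatible with Python's datetime: pd.to_datetime() converts a value to a Timestamp, and Timestamp.to_pydatetime() converts back.
- DatetimeIndex: the pandas time-based index type.
- pd.date_range(): generates a DatetimeIndex of a given length. Pass a start and an end date, or a single date plus a periods argument. Both endpoints are included.
- Period: unlike a timestamp, a period represents a span of time. In use it is much the same: the pd.Period constructor still takes a timestamp-like value plus a freq argument.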
import pandas as pd
import numpy as np
- DatetimeIndex
Convert Python datetime objects into a pandas DatetimeIndex:
dates = [datetime(2016, 8, 1), datetime(2016, 8, 2)]
dates = pd.DatetimeIndex(dates)
dates
# DatetimeIndex(['2016-08-01', '2016-08-02'], dtype='datetime64[ns]', freq=None)
If a list of datetime objects is passed as the index when creating a Series, pandas converts it to a DatetimeIndex automatically
dates = [datetime(2016, 8, 1), datetime(2016, 8, 2)]
df = pd.Series(np.random.randn(2), index = dates)
df
# Result:
2016-08-01 -0.612734
2016-08-02 0.172384
dtype: float64
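One practical benefit of a DatetimeIndex is selection by date string. The sketch below uses made-up values and adds a third date in another month (an assumption for illustration) so that the selection has something to exclude.

```python
from datetime import datetime
import pandas as pd

dates = [datetime(2016, 8, 1), datetime(2016, 8, 2), datetime(2016, 9, 1)]
ts = pd.Series([1.0, 2.0, 3.0], index=dates)

# The list of datetimes was converted to a DatetimeIndex automatically
is_dt_index = isinstance(ts.index, pd.DatetimeIndex)

# Partial-string selection: every row that falls in August 2016
august = ts.loc['2016-08']

print(is_dt_index)
print(august)
```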
- pd.Timestamp()
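Point-in-time data: a Timestamp represents a single moment. It is the pandas type for the most basic kind of time series data, which associates values with points in time
date1 = datetime(2016, 12, 1, 12, 45, 30) # create a datetime.datetime
date2 = '2017-12-21' # create a string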
t1 = pd.Timestamp(date1)
t2 = pd.Timestamp(date2)
print(t1, type(t1))
print(t2)
print(pd.Timestamp('2017-12-21 15:00:22'))
# Result:
2016-12-01 12:45:30 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-12-21 00:00:00
2017-12-21 15:00:22
# pd.Timestamp() builds pandas point-in-time data directly
# the resulting type is pandas Timestamp
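The compatibility between Timestamp and Python's datetime can be checked with a short sketch like this one, which converts in both directions.

```python
from datetime import datetime
import pandas as pd

py_dt = datetime(2016, 12, 1, 12, 45, 30)

ts = pd.Timestamp(py_dt)                 # datetime → Timestamp
back = ts.to_pydatetime()                # Timestamp → datetime
is_subclass = isinstance(ts, datetime)   # Timestamp is a subclass of datetime
fields = (ts.year, ts.month, ts.day, ts.hour)  # same attribute names as datetime

print(ts, back, is_subclass, fields)
```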
- pd.date_range()
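- Generates a DatetimeIndex, i.e. a range of dates
Three ways to generate one: (1) start + end; (2) start + periods; (3) end + periods
Default frequency: calendar day
freq = 'D'
pd.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, **kwargs)
start: start time
end: end time
periods: number of periods (timestamps) to generate
freq: frequency; pd.date_range() defaults to calendar days, pd.bdate_range() defaults to business days
tz: time zone
normalize: snap the start/end values to midnight before generating the range
name: name of the resulting index
closed: None (the default) includes both endpoints; 'left' excludes the right endpoint; 'right' excludes the left endpoint. From pandas 1.4 this argument is replaced by inclusive ('both', 'left', 'right', 'neither'), and closed was removed in pandas 2.0
Frequencies:
B: business day
H: hourly
T/MIN: minutely
S: secondly
L: millisecond (one thousandth of a second)
U: microsecond (one millionth of a second). Many more aliases exist; look them up as needed. Recent pandas releases prefer the lowercase aliases h, min, s, ms and us.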
pd.date_range()
'''Three ways to generate a range'''
rng1 = pd.date_range('1/1/2017', '1/10/2017', normalize=True)
rng2 = pd.date_range(start='1/1/2017', periods=10)
rng3 = pd.date_range(end='1/30/2017 15:00:00', periods=10) # includes hour, minute and second
print(rng1, type(rng1))
print(rng2)
print(rng3)
# Result:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10'],
dtype='datetime64[ns]', freq='D')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10'],
dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-01-21 15:00:00', '2017-01-22 15:00:00',
'2017-01-23 15:00:00', '2017-01-24 15:00:00',
'2017-01-25 15:00:00', '2017-01-26 15:00:00',
'2017-01-27 15:00:00', '2017-01-28 15:00:00',
'2017-01-29 15:00:00', '2017-01-30 15:00:00'],
dtype='datetime64[ns]', freq='D')
'''name and normalize'''
rng4 = pd.date_range(start='1/1/2017 15:30', periods=10, name='hello world!', normalize=True)
print(rng4)
# Result:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10'],
dtype='datetime64[ns]', name='hello world!', freq='D')
# normalize: snaps the times to midnight (the result is 00:00:00, not 15:30:00)
# name: name of the index
'''closed'''
print(pd.date_range('20170101','20170104')) # the compact form 20170101 is parsed too
print(pd.date_range('20170101','20170104',closed = 'right')) # pandas 1.4+: inclusive='right'
print(pd.date_range('20170101','20170104',closed = 'left')) # pandas 1.4+: inclusive='left'
# Result:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq='D')
# closed: None (the default) includes both endpoints; 'left' excludes the right endpoint; 'right' excludes the left endpoint
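On pandas 1.4 or later, the same endpoint control is expressed through the `inclusive` argument. The sketch below assumes such a version is installed; on older releases the `closed` spelling shown above applies instead.

```python
import pandas as pd

# Assumes pandas >= 1.4, where `inclusive` supersedes `closed` in date_range
both = pd.date_range('20170101', '20170104', inclusive='both')    # keep both endpoints
right = pd.date_range('20170101', '20170104', inclusive='right')  # drop the left endpoint
left = pd.date_range('20170101', '20170104', inclusive='left')    # drop the right endpoint

print(both)
print(right)
print(left)
```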
'''Convert directly to a list; the elements are Timestamps'''
print(list(pd.date_range(start = '1/1/2017', periods = 10)))
# Result:
[Timestamp('2017-01-01 00:00:00', freq='D'), Timestamp('2017-01-02 00:00:00', freq='D'), Timestamp('2017-01-03 00:00:00', freq='D'), Timestamp('2017-01-04 00:00:00', freq='D'), Timestamp('2017-01-05 00:00:00', freq='D'), Timestamp('2017-01-06 00:00:00', freq='D'), Timestamp('2017-01-07 00:00:00', freq='D'), Timestamp('2017-01-08 00:00:00', freq='D'), Timestamp('2017-01-09 00:00:00', freq='D'), Timestamp('2017-01-10 00:00:00', freq='D')]
pd.bdate_range()
The default frequency is business days
freq = 'B'
print(pd.bdate_range('20170101', '20170107'))
# Result:
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
'2017-01-06'],
dtype='datetime64[ns]', freq='B')
.asfreq()
Frequency conversion
ts = pd.Series(np.random.rand(4),
index=pd.date_range('20170101', '20170104'))
print(ts)
print(ts.asfreq('4H', method='ffill'))
# change the frequency, here from D to 4H
# method: fill mode; None leaves new rows empty, ffill uses the previous value, bfill uses the next value
# Result:
2017-01-01 0.357889
2017-01-02 0.150000
2017-01-03 0.726897
2017-01-04 0.068152
Freq: D, dtype: float64
2017-01-01 00:00:00 0.357889
2017-01-01 04:00:00 0.357889
2017-01-01 08:00:00 0.357889
2017-01-01 12:00:00 0.357889
2017-01-01 16:00:00 0.357889
2017-01-01 20:00:00 0.357889
2017-01-02 00:00:00 0.150000
2017-01-02 04:00:00 0.150000
2017-01-02 08:00:00 0.150000
2017-01-02 12:00:00 0.150000
2017-01-02 16:00:00 0.150000
2017-01-02 20:00:00 0.150000
2017-01-03 00:00:00 0.726897
2017-01-03 04:00:00 0.726897
2017-01-03 08:00:00 0.726897
2017-01-03 12:00:00 0.726897
2017-01-03 16:00:00 0.726897
2017-01-03 20:00:00 0.726897
2017-01-04 00:00:00 0.068152
Freq: 4H, dtype: float64
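For contrast, a small sketch of what `.asfreq()` returns without a fill method: the rows created by the finer frequency are NaN. The values and the '12h' frequency are chosen for illustration only.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0], index=pd.date_range('2017-01-01', periods=2, freq='D'))

plain = ts.asfreq('12h')                   # the new 12:00 row has no source value, so it is NaN
filled = ts.asfreq('12h', method='ffill')  # the new row copies the previous value

print(plain)
print(filled)
```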
.shift():
Date ranges: leading/lagging data
ts = pd.Series(np.random.rand(4),
index=pd.date_range('20170101', '20170104'))
print(ts)
print(ts.shift(2))
print(ts.shift(-2))
print('------')
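# positive: values move later (lag); negative: values move earlier (lead)
per = ts/ts.shift(1) - 1
print(per)
print('------')
# percentage change: each timestamp's value relative to the previous timestamp's value
print(ts.shift(2, freq='D'))
print(ts.shift(2, freq='T'))
# with the freq argument, the timestamps are shifted rather than the values
# Result: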
2017-01-01 0.514884
2017-01-02 0.406579
2017-01-03 0.651076
2017-01-04 0.249117
Freq: D, dtype: float64
2017-01-01 NaN
2017-01-02 NaN
2017-01-03 0.514884
2017-01-04 0.406579
Freq: D, dtype: float64
2017-01-01 0.651076
2017-01-02 0.249117
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64
------
2017-01-01 NaN
2017-01-02 -0.210349
2017-01-03 0.601353
2017-01-04 -0.617376
Freq: D, dtype: float64
------
2017-01-03 0.514884
2017-01-04 0.406579
2017-01-05 0.651076
2017-01-06 0.249117
Freq: D, dtype: float64
2017-01-01 00:02:00 0.514884
2017-01-02 00:02:00 0.406579
2017-01-03 00:02:00 0.651076
2017-01-04 00:02:00 0.249117
Freq: D, dtype: float64
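The `ts/ts.shift(1) - 1` pattern is the usual simple-return calculation for prices. As a sketch with made-up prices, it can be compared against the built-in `pct_change()`, which should give the same numbers.

```python
import numpy as np
import pandas as pd

# Illustrative prices (not from the original notes)
prices = pd.Series([100.0, 110.0, 99.0],
                   index=pd.date_range('2017-01-01', periods=3))

manual = prices / prices.shift(1) - 1   # simple return via shift
builtin = prices.pct_change()           # pandas shortcut for the same calculation

print(manual)
print(builtin)
```

The first entry is NaN in both cases because there is no previous observation to compare with.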
- pd.Period()
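pd.Period() parameters: a timestamp-like value + a freq argument → freq gives the length of the period, and the timestamp-like value locates the period on the time axis
p = pd.Period('2017', freq='M')
print(p, type(p))
# creates a monthly Period; '2017' resolves to 2017-01
print(p + 1)
print(p - 2)
print(pd.Period('2012', freq='A-DEC') - 1)
# Result: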
2017-01 <class 'pandas._libs.tslibs.period.Period'>
2017-02
2016-11
2011
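# Period('2012', freq = 'A-DEC') can be thought of as a cursor that moves across a sequence of year-long spans
# A Timestamp is a single point in time; a Period is a span of time. When used as an index, the two behave much alike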
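That a Period covers a span rather than a point can be seen from its `start_time` and `end_time` attributes. A minimal sketch:

```python
import pandas as pd

p = pd.Period('2017-01', freq='M')

start = p.start_time                  # first instant of the month
end_day = p.end_time.normalize()      # last day of the month, time of day stripped
next_p = p + 1                        # arithmetic moves by whole periods

# A timestamp inside the month falls between the period's bounds
contains = p.start_time <= pd.Timestamp('2017-01-15') <= p.end_time

print(start, end_day, next_p, contains)
```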
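- Differences between pd.period_range() and pd.date_range():
- One yields a DatetimeIndex (shown to the day) whose elements are Timestamps; the other yields a PeriodIndex (shown to the month here) whose elements are Periods
- Some methods behave differently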
- pd.period_range()
prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
print(prng, type(prng))
print(prng[0], type(prng[0]))
# Result:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
'2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
'2012-01'],
dtype='period[M]') <class 'pandas.core.indexes.period.PeriodIndex'>
2011-01 <class 'pandas._libs.tslibs.period.Period'>
# the index type is PeriodIndex; each element is a Period
ts = pd.Series(np.random.rand(len(prng)), index=prng)
print(ts, type(ts))
print(ts.index)
# Result:
2011-01 0.618567
2011-02 0.840224
2011-03 0.644018
2011-04 0.843022
2011-05 0.913075
2011-06 0.703065
2011-07 0.889372
2011-08 0.102980
2011-09 0.051866
2011-10 0.874365
2011-11 0.500441
2011-12 0.133804
2012-01 0.411256
Freq: M, dtype: float64 <class 'pandas.core.series.Series'>
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
'2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
'2012-01'],
dtype='period[M]')
- .asfreq()
p = pd.Period('2017','A-DEC')
print(p)
print(p.asfreq('M', how = 'start')) # how = 's' also works
print(p.asfreq('D', how = 'end')) # how = 'e' also works
# Period.asfreq(freq, how=...) converts a period to another frequency; how picks the start or the end of the original span
# Result:
2017
2017-01
2017-12-31
prng = pd.period_range('2017','2018',freq = 'M')
ts1 = pd.Series(np.random.rand(len(prng)), index = prng)
ts2 = pd.Series(np.random.rand(len(prng)), index = prng.asfreq('D', how = 'start'))
print(ts1.head(),len(ts1))
print(ts2.head(),len(ts2))
# asfreq can also convert the PeriodIndex of a time series
# Result:
2017-01 0.679288
2017-02 0.114730
2017-03 0.347454
2017-04 0.623199
2017-05 0.947733
Freq: M, dtype: float64 13
2017-01-01 0.332454
2017-02-01 0.463674
2017-03-01 0.805149
2017-04-01 0.336653
2017-05-01 0.964991
Freq: D, dtype: float64 13
- .to_period() and .to_timestamp() (methods of Series/DataFrame and their indexes, not top-level pd functions)
rng = pd.date_range('2017/1/1', periods=10, freq='M')
prng = pd.period_range('2017', '2018', freq='M')
ts1 = pd.Series(np.random.rand(len(rng)), index=rng)
print(ts1.head())
print(ts1.to_period().head())
# month-end timestamps converted to monthly periods
# Result:
2017-01-31 0.659452
2017-02-28 0.446721
2017-03-31 0.365186
2017-04-30 0.122854
2017-05-31 0.346285
Freq: M, dtype: float64
2017-01 0.659452
2017-02 0.446721
2017-03 0.365186
2017-04 0.122854
2017-05 0.346285
Freq: M, dtype: float64
ts2 = pd.Series(np.random.rand(len(prng)), index=prng)
print(ts2.head())
print(ts2.to_timestamp().head())
# monthly periods converted to the first day of each month
# Result:
2017-01 0.535997
2017-02 0.251049
2017-03 0.355468
2017-04 0.826570
2017-05 0.002247
Freq: M, dtype: float64
2017-01-01 0.535997
2017-02-01 0.251049
2017-03-01 0.355468
2017-04-01 0.826570
2017-05-01 0.002247
Freq: MS, dtype: float64
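The two conversions can be chained as a round trip. This sketch uses month-start timestamps ('MS') and made-up values, so converting to periods and back should return the original index.

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=3, freq='MS')   # month-start timestamps
ts = pd.Series([1.0, 2.0, 3.0], index=idx)

as_periods = ts.to_period('M')       # DatetimeIndex → PeriodIndex
back = as_periods.to_timestamp()     # PeriodIndex → DatetimeIndex (start of each period)

print(as_periods)
print(back)
```

Starting from month-end timestamps instead, the round trip would not restore the original dates, since `.to_timestamp()` defaults to the start of each period.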
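Time series - resampling - resample()
Resampling converts a time series from one frequency to another, aggregating or filling values along the way
Downsampling: high-frequency data → low-frequency data, e.g. daily data converted to monthly data
Upsampling: low-frequency data → high-frequency data, e.g. yearly data converted to monthly data
Aggregation methods:
- mean() → mean
- max() → maximum
- min() → minimum
- median() → median
- first() → first value
- last() → last value
- ohlc() → OHLC resampling
- OHLC: a common aggregation for financial time series → open, high (maximum), low (minimum), close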
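As a sketch of the OHLC aggregation listed above, daily prices can be resampled into two-day bars. The prices are made up for illustration.

```python
import pandas as pd

# Illustrative daily prices (not from the original notes)
prices = pd.Series([10.0, 12.0, 11.0, 13.0],
                   index=pd.date_range('2017-01-01', periods=4, freq='D'))

# Each 2-day bin is summarised as open (first), high (max), low (min), close (last)
bars = prices.resample('2D').ohlc()

print(bars)
```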
Resampling: .resample()
rng = pd.date_range('20170101', periods=12)
ts = pd.Series(np.arange(12), index=rng)
print(ts)
print('------')
ts_re = ts.resample('5D')
ts_re2 = ts.resample('5D').sum()
print(ts_re, '\n', type(ts_re))
print('------')
print(ts_re2, '\n', type(ts_re2))
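# ts.resample('5D'): returns a resampler object with the frequency set to 5 days
# ts.resample('5D').sum(): returns a new aggregated Series, here aggregated by summing
# freq: the resampling frequency → ts.resample('5D')
# .sum(): the aggregation method
# Result: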
2017-01-01 0
2017-01-02 1
2017-01-03 2
2017-01-04 3
2017-01-05 4
2017-01-06 5
2017-01-07 6
2017-01-08 7
2017-01-09 8
2017-01-10 9
2017-01-11 10
2017-01-12 11
Freq: D, dtype: int64
------
DatetimeIndexResampler [freq=<5 * Days>, axis=0, closed=left, label=left, convention=start, origin=start_day]
<class 'pandas.core.resample.DatetimeIndexResampler'>
------
2017-01-01 10
2017-01-06 35
2017-01-11 21
Freq: 5D, dtype: int64
<class 'pandas.core.series.Series'>
Downsampling
rng = pd.date_range('20170101', periods=12)
ts = pd.Series(np.arange(1, 13), index=rng)
print(ts, '\n')
print('-----')
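print(ts.resample('5D').sum(), '→ default\n')
print(ts.resample('5D', closed='left').sum(), '→ left\n')
print(ts.resample('5D', closed='right').sum(), '→ right\n')
print('-----')
# closed: which end of each interval is closed (inclusive); for '5D' the default is 'left' (left-closed, right-open)
# in detail: the values are 1-12; resampling by 5D gives → [1,2,3,4,5],[6,7,8,9,10],[11,12]
# left: the left edge of each interval is closed → [1,2,3,4,5],[6,7,8,9,10],[11,12]
# right: the right edge of each interval is closed → [1],[2,3,4,5,6],[7,8,9,10,11],[12]
print(ts.resample('5D', label='left').sum(), '→ leftlabel\n')
print(ts.resample('5D', label='right').sum(), '→ rightlabel\n')
# label: which bin edge is used as the index of each aggregated value; here the default is the left edge
# the binning itself uses the default closed setting
# Result: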
2017-01-01 1
2017-01-02 2
2017-01-03 3
2017-01-04 4
2017-01-05 5
2017-01-06 6
2017-01-07 7
2017-01-08 8
2017-01-09 9
2017-01-10 10
2017-01-11 11
2017-01-12 12
Freq: D, dtype: int64
-----
2017-01-01 15
2017-01-06 40
2017-01-11 23
Freq: 5D, dtype: int64 → default
2017-01-01 15
2017-01-06 40
2017-01-11 23
Freq: 5D, dtype: int64 → left
2016-12-27 1
2017-01-01 20
2017-01-06 45
2017-01-11 12
Freq: 5D, dtype: int64 → right
-----
2017-01-01 15
2017-01-06 40
2017-01-11 23
Freq: 5D, dtype: int64 → leftlabel
2017-01-06 15
2017-01-11 40
2017-01-16 23
Freq: 5D, dtype: int64 → rightlabel
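The effect of `closed` is easier to follow on a smaller series. The sketch below uses six made-up daily values and 3-day bins, so the sums can be checked by hand.

```python
import pandas as pd

ts = pd.Series(range(1, 7), index=pd.date_range('2017-01-01', periods=6, freq='D'))

# Left-closed bins: [01-01, 01-04) → 1+2+3, [01-04, 01-07) → 4+5+6
left_sums = ts.resample('3D', closed='left').sum()

# Right-closed bins: (12-29, 01-01] → 1, (01-01, 01-04] → 2+3+4, (01-04, 01-07] → 5+6
right_sums = ts.resample('3D', closed='right').sum()

print(left_sums)
print(right_sums)
```

With `closed='right'` the first observation sits on the right edge of a bin that starts before the data, which is why an extra leading row appears.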
Upsampling and interpolation
rng = pd.date_range('2017/1/1 0:0:0', periods=5, freq='H')
ts = pd.DataFrame(np.arange(15).reshape(5, 3),
index=rng,
columns=['a', 'b', 'c'])
print(ts)
print(ts.resample('15T').asfreq())
print(ts.resample('15T').ffill())
print(ts.resample('15T').bfill())
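# going from low to high frequency is mostly a question of how to fill the new rows
# .asfreq(): no filling; new rows are NaN
# .ffill(): forward fill (carry the previous value forward)
# .bfill(): backward fill (use the next value)
# .interpolate(): fill by linear interpolation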
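The `.interpolate()` option is listed above but not demonstrated. A minimal sketch with made-up daily values upsampled to 12-hour steps:

```python
import pandas as pd

ts = pd.Series([0.0, 2.0, 4.0],
               index=pd.date_range('2017-01-01', periods=3, freq='D'))

# Upsample to 12-hour steps; the new midday rows are filled by linear interpolation
upsampled = ts.resample('12h').interpolate()

print(upsampled)
```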
Resampling periods - Period
prng = pd.period_range('2016','2017',freq = 'M')
ts = pd.Series(np.arange(len(prng)), index = prng)
print(ts)
print(ts.resample('3M').sum()) # downsampling
print(ts.resample('15D').ffill()) # upsampling
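If resampling directly on a PeriodIndex warns or behaves unexpectedly in your pandas version, one alternative is to convert to timestamps first and resample the resulting DatetimeIndex. This is a sketch under that assumption, using a quarterly-start frequency ('QS') and made-up values rather than the exact call above.

```python
import pandas as pd

prng = pd.period_range('2016-01', '2016-12', freq='M')
ts = pd.Series(range(len(prng)), index=prng)

# Convert periods to month-start timestamps, then downsample to quarters
quarterly = ts.to_timestamp().resample('QS').sum()

print(quarterly)
```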