Time Series Data Analysis

小张不咕咕

Column: Pandas入门教程 (Pandas Beginner Tutorial). Tags: data analysis, data mining, python, pandas

Article link: https://blog.csdn.net/Asrfb0416/article/details/124388767

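Learning objectives

Understand what a time series is and what ARIMA is
Master the basic time series operations
Master periods and resampling
Become familiar with sliding windows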

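1. Basic time series operations

Question: what is a time series?

Answer: a time series is a sequence of values observed at multiple points in time. The observations can occur at regular or irregular intervals.

Time series data mainly comes in the following forms:

Timestamp: a specific moment, such as right now.
Period: a span such as the year 2020 or October 2020.
Interval: a range defined by a start timestamp and an end timestamp.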
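As a minimal sketch, the three kinds of values above map onto pandas types roughly as follows. The specific dates are illustrative assumptions and do not come from the original article.

```python
import pandas as pd

# A timestamp: one specific moment.
ts = pd.Timestamp("2020-10-01 08:30")

# A period: the whole of October 2020.
per = pd.Period("2020-10", freq="M")

# An interval: a range bounded by a start and an end timestamp.
iv = pd.Interval(pd.Timestamp("2020-10-01"), pd.Timestamp("2020-10-31"), closed="both")

print(per.start_time)   # first instant covered by the period
print(ts in iv)         # whether the timestamp falls inside the interval
print(iv.length)        # right endpoint minus left endpoint
```

A Period knows its own start and end, while an Interval is simply a pair of endpoints, so the two are not interchangeable.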

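1.1 Creating a time series

1. In pandas, a timestamp is represented by a Timestamp object, which is the pandas counterpart of (and a subclass of) Python's datetime.datetime.

It is highly compatible with datetime, and the to_datetime() function converts a datetime or a date string directly into a Timestamp object.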

import pandas as pd
from datetime import datetime
import numpy as np
pd.to_datetime('20200828')         # parse a date string into a Timestamp

Timestamp('2020-08-28 00:00:00')

When a list of several dates is passed in, pandas returns a DatetimeIndex object instead.

# pass in several date strings
date_index = pd.to_datetime(['20200820', '20200828', '20200908'])
date_index

DatetimeIndex(['2020-08-20', '2020-08-28', '2020-09-08'],
              dtype='datetime64[ns]', freq=None)

To take out the first timestamp:

date_index[0]   # the first timestamp

Timestamp('2020-08-20 00:00:00')
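Real-world date strings are often ambiguous or dirty. The sketch below shows two options of to_datetime() that help in those cases: an explicit format and errors="coerce". The sample strings are made up for illustration.

```python
import pandas as pd

# An explicit format removes ambiguity between day-first and month-first dates.
explicit = pd.to_datetime("28/08/2020", format="%d/%m/%Y")

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(["2020-08-20", "2020-08-28", "not a date"], errors="coerce")

print(explicit)
print(parsed)
```

Coercing is convenient for exploration, but it can hide bad input, so counting the NaT values afterwards is usually worthwhile.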

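2. In pandas, the most basic kind of time series is a Series indexed by timestamps.

# create a Series with a timestamp index
date_ser = pd.Series([11, 22, 33], index=date_index)
date_ser

2020-08-20    11
2020-08-28    22
2020-09-08    33
dtype: int64

A list of datetime objects can also be passed as the index argument, which likewise creates a Series with a timestamp index.

# use a list of datetime objects as the index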
date_list = [datetime(2020, 1, 1), datetime(2020, 1, 15),
             datetime(2020, 2, 20), datetime(2020, 4, 1),
             datetime(2020, 5, 5), datetime(2020, 6, 1)]
time_se = pd.Series(np.arange(6), index=date_list)
time_se

2020-01-01    0
2020-01-15    1
2020-02-20    2
2020-04-01    3
2020-05-05    4
2020-06-01    5
dtype: int32

3. The same approach gives a DataFrame a timestamp index.

data_demo = [[11, 22, 33], [44, 55, 66], 
             [77, 88, 99], [12, 23, 34]]
date_list = [datetime(2020, 1, 23), datetime(2020, 2, 15),
             datetime(2020, 5, 22), datetime(2020, 3, 30)]
time_df = pd.DataFrame(data_demo, index=date_list)
time_df

	0	1	2
2020-01-23	11	22	33
2020-02-15	44	55	66
2020-05-22	77	88	99
2020-03-30	12	23	34

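1.2 Selecting subsets through a timestamp index

# a list of date strings to use as the index
# (pandas 2.0+ infers a single format from the first element, so a list that
#  mixes formats like this one needs pd.to_datetime(date_list, format='mixed'))
date_list = ['2017/05/30', '2019/02/01',
             '2017.6.1', '2018.4.1',
             '2019.6.1', '2020.1.23']
# convert the date strings to a DatetimeIndex
date_index = pd.to_datetime(date_list)
# create a Series indexed by the DatetimeIndex
date_se = pd.Series(np.arange(6), index=date_index)
date_se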

2017-05-30    0
2019-02-01    1
2017-06-01    2
2018-04-01    3
2019-06-01    4
2020-01-23    5
dtype: int32

Common ways to select a subset:

1. Use a positional index to get a specific value (the simplest way).

# get the value at position 3
date_se.iloc[3]

3

2. Use a date built with datetime to get the matching value.

date_time = datetime(2017, 6, 1)
date_se[date_time]

2

3. Index with a date string directly. Any format that pandas can parse works.

date_se['20170530']

2017-05-30    0
dtype: int32

date_se['2018-04-01']

2018-04-01    3
dtype: int32

date_se['2020/01/23']

2020-01-23    5
dtype: int32

date_se['6/1/2019']

2019-06-01    4
dtype: int32

4. Index with just a year or a month to get all the data for that year or month.

date_se['2017']  # all data from 2017

2017-05-30    0
2017-06-01    2
dtype: int32
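Partial-string selection also works through .loc, and on a sorted index it extends to label ranges. The sketch below rebuilds a Series with the same dates and values as date_se (written in one consistent date format, which is an assumption made so that it runs on any pandas version) and sorts it first.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-05-30", "2019-02-01", "2017-06-01",
                        "2018-04-01", "2019-06-01", "2020-01-23"])
date_se = pd.Series(np.arange(6), index=dates).sort_index()

in_2019 = date_se.loc["2019"]                     # every row in 2019
in_june_2017 = date_se.loc["2017-06"]             # every row in June 2017
span = date_se.loc["2018-01-01":"2019-03-31"]     # both endpoints are inclusive

print(in_2019)
print(in_june_2017)
print(span)
```

Sorting matters for the range case: slicing between two dates that are not in the index is only reliable when the index is monotonic.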

5. Use the truncate() method to cut down a Series or DataFrame.

truncate(before=None, after=None, axis=None, copy=True)

# drop the data before 2018-1-1
sorted_se = date_se.sort_index()
sorted_se.truncate(before='2018-1-1')

2018-04-01    3
2019-02-01    1
2019-06-01    4
2020-01-23    5
dtype: int32

# drop the data after 2018-7-31
sorted_se.truncate(after='2018-7-31')

2017-05-30    0
2017-06-01    2
2018-04-01    3
dtype: int32

Parameters:

before: drop all rows before this index value.
after: drop all rows after this index value.
axis: the axis to truncate along; defaults to the row index.
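The following sketch, on a small made-up daily Series, shows that truncate() with both bounds behaves like an inclusive .loc slice, and that it refuses to run on an unsorted index (which is why the example above calls sort_index() first).

```python
import pandas as pd

s = pd.Series(range(6), index=pd.date_range("2020-01-01", periods=6, freq="D"))

# Keep 2020-01-02 through 2020-01-04; both bounds are inclusive.
middle = s.truncate(before="2020-01-02", after="2020-01-04")
same = s.loc["2020-01-02":"2020-01-04"]

# truncate() checks that the index is sorted and raises otherwise.
shuffled = s.iloc[[3, 0, 5, 1, 4, 2]]
try:
    shuffled.truncate(before="2020-01-02")
    raised = False
except ValueError:
    raised = True

print(middle)
print(raised)
```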

2. Fixed-frequency time series

2.1 Creating a fixed-frequency time series

The date_range() function provided by pandas generates a DatetimeIndex with a fixed frequency.

Parameters:

start: the start date; defaults to None.
end: the end date; defaults to None.
periods: the number of timestamps to generate.
freq: the spacing between timestamps.

Note:

Of the four parameters start, end, periods, and freq, exactly three must be determined. When freq is omitted it falls back to 'D', so at least two of start, end, and periods must be passed; otherwise an error is raised.

1. When date_range() is called with only a start date (start) and an end date (end), the timestamps are generated by day, that is, freq is D.

# create a DatetimeIndex from only a start date and an end date
pd.date_range('2020/08/10', '2020/08/20')

DatetimeIndex(['2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13',
               '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17',
               '2020-08-18', '2020-08-19', '2020-08-20'],
              dtype='datetime64[ns]', freq='D')

2. If only a start date or only an end date is passed, the periods parameter must also say how many timestamps to generate.

# create a DatetimeIndex from start and periods
pd.date_range(start='2020/08/10', periods=5)

DatetimeIndex(['2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13',
               '2020-08-14'],
              dtype='datetime64[ns]', freq='D')

# create a DatetimeIndex from end and periods
pd.date_range(end='2020/08/10', periods=5)

DatetimeIndex(['2020-08-06', '2020-08-07', '2020-08-08', '2020-08-09',
               '2020-08-10'],
              dtype='datetime64[ns]', freq='D')

3. To make every timestamp fall on a Sunday, one per week, set freq to "W-SUN" when creating the DatetimeIndex.

dates_index = pd.date_range('2020-01-01',         # start date
                            periods=5,            # number of timestamps
                            freq='W-SUN')         # frequency
dates_index

DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02'],
              dtype='datetime64[ns]', freq='W-SUN')

ser_data = [12, 56, 89, 99, 31]
pd.Series(ser_data, dates_index)

2020-01-05 12
2020-01-12 56
2020-01-19 89
2020-01-26 99
2020-02-02 31
Freq: W-SUN, dtype: int64

4. If the dates carry time-of-day information and you want timestamps normalized to midnight of each day, set the normalize parameter to True.

# create a DatetimeIndex with a start date, a count, the default frequency, and a time zone
pd.date_range(start='2020/8/1 12:13:30', periods=5,
              tz='Asia/Hong_Kong')

DatetimeIndex(['2020-08-01 12:13:30+08:00', '2020-08-02 12:13:30+08:00',
               '2020-08-03 12:13:30+08:00', '2020-08-04 12:13:30+08:00',
               '2020-08-05 12:13:30+08:00'],
              dtype='datetime64[ns, Asia/Hong_Kong]', freq='D')

# normalize the timestamps to midnight
pd.date_range(start='2020/8/1 12:13:30', periods=5,
              normalize=True, tz='Asia/Hong_Kong')

DatetimeIndex(['2020-08-01 00:00:00+08:00', '2020-08-02 00:00:00+08:00',
               '2020-08-03 00:00:00+08:00', '2020-08-04 00:00:00+08:00',
               '2020-08-05 00:00:00+08:00'],
              dtype='datetime64[ns, Asia/Hong_Kong]', freq='D')
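To make the parameter rule for date_range() concrete, here is a small sketch of three cases: all of start, end, and periods (pandas spaces the timestamps evenly), start with periods and freq, and an underspecified call that fails. The dates are illustrative.

```python
import pandas as pd

# start + end + periods: pandas spaces the timestamps evenly.
even = pd.date_range(start="2020-08-10", end="2020-08-20", periods=3)

# start + periods + freq: step forward from the start date.
every_2_days = pd.date_range(start="2020-08-10", periods=3, freq="2D")

# Only start is given: not enough information, so pandas raises.
try:
    pd.date_range(start="2020-08-10")
    raised = False
except ValueError:
    raised = True

print(even)
print(every_2_days)
print(raised)
```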

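2.2 Frequencies and offsets of a time series

1. By default the generated timestamps are spaced by day, that is, the frequency is "D".

"D" is a base frequency, referred to by a string alias; "D" is the alias for "day".
A frequency consists of a base frequency and a multiplier; for example, "5D" means every 5 days.

Commonly used base frequency aliases include D (calendar day), B (business day), W (week), M (month end), and H (hour). Note that pandas 2.2 renamed some aliases, for example M to ME and H to h.

pd.date_range(start='2020/2/1', end='2020/2/28', freq='5D')

DatetimeIndex(['2020-02-01', '2020-02-06', '2020-02-11', '2020-02-16',
               '2020-02-21', '2020-02-26'],
              dtype='datetime64[ns]', freq='5D')

2. Each base frequency also has a corresponding date offset object, DateOffset. To create a DateOffset object, first import it from the pandas.tseries.offsets module.

from pandas.tseries.offsets import DateOffset, Week, Hour
DateOffset(months=4, days=5)

<DateOffset: days=5, months=4>

3. Offsets can also be built from the offset types provided by the offsets module.

For example, an offset of 14 days and 10 hours equals two weeks plus ten hours. Weeks are expressed with the Week type and hours with the Hour type, and the two can be joined with a plus sign.

Week(2) + Hour(10)

Timedelta('14 days 10:00:00')

# build a date offset and use it as the frequency
date_offset = Week(2) + Hour(10)
pd.date_range('2020/3/1', '2020/3/31', freq=date_offset)

DatetimeIndex(['2020-03-01 00:00:00', '2020-03-15 10:00:00',
               '2020-03-29 20:00:00'],
              dtype='datetime64[ns]', freq='346H')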
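Offsets are not only useful as frequencies; they can also be added to a single timestamp. The sketch below contrasts a calendar-aware DateOffset with a fixed-length Timedelta and uses MonthEnd, another class from pandas.tseries.offsets that the article does not cover. The dates are chosen for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset, MonthEnd

day = pd.Timestamp("2020-01-31")

plus_month = day + DateOffset(months=1)        # calendar-aware: lands on the last valid day of February
plus_30_days = day + pd.Timedelta(days=30)     # fixed duration of exactly 30 days
month_end = pd.Timestamp("2020-01-15") + MonthEnd(1)   # roll forward to the end of the month

print(plus_month)
print(plus_30_days)
print(month_end)
```

The first two results differ because "one month" has no fixed length, which is the main reason to prefer offsets over plain durations for calendar logic.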

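2.3 Shifting a time series

Shifting means moving the data forward or backward along the time axis.

1. pandas objects provide a shift() method that moves the data forward or backward while leaving the index unchanged.

shift(periods=1, freq=None, axis=0)

periods: how far to shift. It can be positive or negative; the default is 1, meaning one step.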

date_index = pd.date_range('2020/01/01', periods=5)
time_ser = pd.Series(np.arange(5) + 1, index=date_index)
time_ser

2020-01-01 1
2020-01-02 2
2020-01-03 3
2020-01-04 4
2020-01-05 5
Freq: D, dtype: int32

# shift the data one step later
time_ser.shift(1)

2020-01-01 NaN
2020-01-02 1.0
2020-01-03 2.0
2020-01-04 3.0
2020-01-05 4.0
Freq: D, dtype: float64

# shift the data one step earlier
time_ser.shift(-1)

2020-01-01 2.0
2020-01-02 3.0
2020-01-03 4.0
2020-01-04 5.0
2020-01-05 NaN
Freq: D, dtype: float64
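The shift() signature also has a freq parameter, which the examples above do not use. As a sketch on the same kind of data: without freq the values move and the index stays, while with freq the index moves and the values stay. Subtracting a shifted copy is a common way to get step-to-step changes.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=5, freq="D")
s = pd.Series(np.arange(5) + 1, index=idx)

moved_data = s.shift(1)              # values move; the first slot becomes NaN
moved_index = s.shift(1, freq="D")   # index moves by one day; values are untouched

# Day-over-day change: today's value minus yesterday's.
change = s - s.shift(1)

print(moved_data)
print(moved_index)
print(change)
```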

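3. Periods and period arithmetic

3.1 Creating Period objects

1. The Period class represents a standard span of time, such as a particular year, month, day, or hour.

Creating a Period is simple: pass a date to the constructor as a string or an integer.

# create a Period covering 2020-01-01 through 2020-12-31
pd.Period(2020)

Period('2020', 'A-DEC')

# the whole month from 2019-06-01 through 2019-06-30
period = pd.Period('2019/6')
period

Period('2019-06', 'M')

2. Period objects can take part in arithmetic.

a. Adding or subtracting an integer shifts the Period by that many units of its frequency.

period + 1   # add an integer to a Period

Period('2019-07', 'M')

period - 5    # subtract an integer from a Period

Period('2019-01', 'M')

b. Subtracting two Period objects that share a frequency gives the number of units between them.

# create another Period with the same frequency as period
other_period = pd.Period(201401, freq='M')
period - other_period

<65 * MonthEnds>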
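A short sketch of what a Period carries beyond its label: the first and last instants it spans, and the count of units when two periods are subtracted. The attribute names start_time, end_time, and n are part of pandas; the example values mirror the ones above.

```python
import pandas as pd

june = pd.Period("2019-06", freq="M")

print(june.start_time, june.end_time)   # the instants the period spans

next_month = june + 1
gap = june - pd.Period("2014-01", freq="M")   # an offset object; .n holds the count

print(next_month)
print(gap.n)
```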

3. To create several Period objects at a fixed spacing, use the period_range() function.

period_index = pd.period_range('2014.1.8', '2014.5.31', freq='M')
period_index

PeriodIndex(['2014-01', '2014-02', '2014-03', '2014-04', '2014-05'], dtype='period[M]')

The example returns a PeriodIndex, an index made up of a sequence of Period objects (Period 1, Period 2, ..., Period n).

4. Besides the approach above, a PeriodIndex can also be created by passing a list of date strings directly to the PeriodIndex constructor.

str_list = ['2012', '2013', '2014']
pd.PeriodIndex(str_list, freq='A-DEC')

PeriodIndex(['2012', '2013', '2014'], dtype='period[A-DEC]')

A PeriodIndex can be used as the index of a Series or DataFrame:

period_ser = pd.Series(np.arange(5), period_index)
period_ser

2014-01 0
2014-02 1
2014-03 2
2014-04 3
2014-05 4
Freq: M, dtype: int32

Note: a DatetimeIndex is an index structure for a series of points in time, whereas a PeriodIndex is an index structure for a series of time spans.
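The two index types can be converted into each other. The sketch below uses to_period() and to_timestamp() on a small made-up Series; the dates and values are assumptions for illustration.

```python
import pandas as pd

stamps = pd.to_datetime(["2014-01-08", "2014-02-20", "2014-03-31"])
by_point = pd.Series([10, 20, 30], index=stamps)

by_month = by_point.to_period("M")   # DatetimeIndex -> PeriodIndex
back = by_month.to_timestamp()       # PeriodIndex -> DatetimeIndex, at each period's start

print(by_month)
print(back)
```

The round trip is lossy: once the day of month has been folded into a monthly period, to_timestamp() can only return the start (or end) of that month.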

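3.2 Converting the frequency of a period

1. pandas provides an asfreq() method for converting the frequency of a period.

asfreq(freq, method=None, how=None, normalize=False, fill_value=None)

Parameters:

freq: the target frequency.
how: either start or end; the default is end.
normalize: whether to reset the time index to midnight.
fill_value: the value used to fill missing entries.

The full signature above is that of Series.asfreq() and DataFrame.asfreq(); a single Period object accepts only freq and how.

# create a Period object
period = pd.Period('2019', freq='A-DEC')
period.asfreq('M', how='start')

Period('2019-01', 'M')

period.asfreq('M', how='end')

Period('2019-12', 'M')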
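As a further sketch, the same conversion works down to daily frequency and on a whole PeriodIndex at once. The alias "Y" is used here for yearly periods instead of 'A-DEC'; that choice is an assumption made so the snippet avoids the alias that newer pandas versions deprecate.

```python
import pandas as pd

year = pd.Period("2019", freq="Y")

first_month = year.asfreq("M", how="start")   # first month of the year
last_day = year.asfreq("D", how="end")        # last day of the year

# The same conversion applied to a whole PeriodIndex.
years = pd.period_range("2018", "2020", freq="Y")
last_months = years.asfreq("M", how="end")

print(first_month, last_day)
print(last_months)
```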

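4. Resampling

4.1 The resample() method

1. resample() in pandas is a convenient method for resampling regular time series data and converting its frequency.

resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, ...)

Parameters:

rule: a string or DateOffset giving the target resampling frequency.
fill_method: how to interpolate when upsampling. Like how, this parameter belongs to older pandas versions and has since been removed; call a fill method such as ffill() on the resampled result instead.
closed: which side of each bin is closed when downsampling.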

date_index = pd.date_range('2019.7.8', periods=30)
time_ser = pd.Series(np.arange(30), index=date_index)
time_ser

2019-07-08 0
2019-07-09 1
2019-07-10 2
2019-07-11 3
2019-07-12 4
2019-07-13 5
2019-07-14 6
2019-07-15 7
2019-07-16 8
2019-07-17 9
2019-07-18 10
2019-07-19 11
2019-07-20 12
2019-07-21 13
2019-07-22 14
2019-07-23 15
2019-07-24 16
2019-07-25 17
2019-07-26 18
2019-07-27 19
2019-07-28 20
2019-07-29 21
2019-07-30 22
2019-07-31 23
2019-08-01 24
2019-08-02 25
2019-08-03 26
2019-08-04 27
2019-08-05 28
2019-08-06 29
Freq: D, dtype: int32

For example, resample the data to a weekly frequency anchored on Mondays.

Note: the how parameter is no longer recommended; use the newer form ".resample(...).mean()" to compute the mean.

time_ser.resample('W-MON').mean()

2019-07-08 0.0
2019-07-15 4.0
2019-07-22 11.0
2019-07-29 18.0
2019-08-05 25.0
2019-08-12 29.0
Freq: W-MON, dtype: float64

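2. Passing closed='left' when resampling makes each sampling bin closed on the left and open on the right.

In other words, within each bin the timestamp at the start is included and the timestamp at the end is excluded.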

time_ser.resample('W-MON', closed='left').mean()

2019-07-15 3.0
2019-07-22 10.0
2019-07-29 17.0
2019-08-05 24.0
2019-08-12 28.5
Freq: W-MON, dtype: float64
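To see which days end up in which bin, the sketch below resamples a day counter and reports the smallest and largest value per bin. It uses 14 days instead of 30 to keep the output short; that reduction is an assumption for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2019-07-08", periods=14, freq="D")   # starts on a Monday
s = pd.Series(np.arange(14), index=idx)

# Default for W-MON: bins are closed on the right, so each Monday ends a bin.
right = s.resample("W-MON").agg(["min", "max"])

# closed='left': each Monday starts a bin instead.
left = s.resample("W-MON", closed="left").agg(["min", "max"])

print(right)
print(left)
```

With the default, the first Monday sits alone in its own bin; with closed='left' it opens a full seven-day bin, which is why the two results have a different number of rows.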

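4.2 Downsampling

1. Downsampling makes the time granularity coarser, so the amount of data shrinks. So that no timestamp's data goes unused, aggregate the values in each bin with a built-in method.

For example, stock data is commonly resampled to OHLC: the opening price, highest price, lowest price, and closing price.

pandas provides a dedicated ohlc() method for this.

date_index = pd.date_range('2020/06/01', periods=30)
shares_data = np.random.rand(30)
time_ser = pd.Series(shares_data, index=date_index)
time_ser.resample('7D').ohlc()        # OHLC resampling

2. Downsampling is effectively another form of grouping: it groups the time series by date and then applies an aggregation to each group to produce one result.

# downsampling implemented with groupby (grouped by week number)
time_ser.groupby(lambda x: x.week).mean()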

23 0.399464
24 0.452200
25 0.351563
26 0.372442
27 0.326700
dtype: float64
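Because the example above uses random data, its numbers cannot be checked by hand. The sketch below repeats the idea on a deterministic series (an assumption for illustration) and compares resample() with an equivalent groupby on the ISO week number. It relies on DatetimeIndex.isocalendar(), which requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-06-01", periods=14, freq="D")   # starts on a Monday
prices = pd.Series(np.arange(14, dtype=float), index=idx)

# Open, high, low, and close for each 7-day bin.
bars = prices.resample("7D").ohlc()

# Weekly mean two ways: by resampling and by grouping on the ISO week number.
by_resample = prices.resample("W-SUN").mean()
by_groupby = prices.groupby(prices.index.isocalendar().week).mean()

print(bars)
print(by_resample)
print(by_groupby)
```

The two weekly means agree in value; they differ only in their labels (a week-ending date versus a week number).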

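4.3 Upsampling

1. Upsampling makes the time granularity finer, so the amount of data grows, and some timestamps will very likely have no corresponding data.

The usual remedy is to fill in the gaps, in one of the following ways:

Use ffill(limit) or bfill(limit) to fill with the value before or after the gap; limit caps the number of consecutive values filled.
Use fillna(method='ffill') or fillna(method='bfill'); ffill fills with the value before the NaN and bfill with the value after it. Newer pandas versions deprecate the method argument in favor of calling ffill() or bfill() directly.
Use interpolate() to fill in the data with an interpolation algorithm.

data_demo = np.array([['101', '210', '150'], ['330', '460', '580']])
date_index = pd.date_range('2020/06/10', periods=2, freq='W-SUN')
time_df = pd.DataFrame(data_demo, index=date_index,
                       columns=['Product A', 'Product B', 'Product C'])
time_df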

time_df.resample('D').asfreq()

time_df.resample('D').ffill()
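The sketch below compares the fill strategies side by side. It uses a numeric Series rather than the string-valued DataFrame above, since interpolation needs numbers; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-06-14", periods=2, freq="W-SUN")   # two consecutive Sundays
weekly = pd.Series([100.0, 170.0], index=idx)

daily = weekly.resample("D").asfreq()          # new rows appear as NaN
filled = weekly.resample("D").ffill(limit=2)   # carry the last value forward, at most 2 rows
linear = weekly.resample("D").interpolate()    # straight line between the known points

print(daily)
print(filled)
print(linear)
```

Forward filling assumes the value holds until the next observation, while linear interpolation assumes a steady change between observations; which one is appropriate depends on what the data measures.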

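5. Statistics: sliding windows

5.1 What is a sliding window

1. A sliding window frames a time series with a window of a specified length and computes statistics over the data inside the frame.

It is like a slider of fixed length moving along a ruler: each time it moves one unit, it reports the data under the slider.

2. The idea is fairly abstract, so here is an example.

A branch store recorded daily sales for all of 2017, and the general manager now wants to spot-check sales on August 28 (the Qixi Festival). Looking at that day alone gives an isolated figure that says little about overall sales around that date.

To make the figure more representative, widen the single point into an interval that contains it and judge by the data in that interval.

For example, take the data from August 24 through September 2 and use the mean over that interval as the spot-check result.

That interval is the window. Its length is 10 units, and because the data is recorded daily, the statistic is a 10-day average, which is more reasonable and better reflects the Qixi promotion as a whole.

3. A moving window slides toward one end, not a whole interval at a time but one unit at a time.

For example, after sliding one unit to the right, the window covers 2017-08-25 through 2017-09-03.

Each move shifts the window by exactly one unit, and the window length stays at 10 units until it reaches the end of the series.

As a result, statistics computed through a sliding window are smoother, and the values fluctuate within a narrower range.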

year_data = np.random.randn(365)
date_index = pd.date_range('2017-01-01', '2017-12-31', freq='D')
ser = pd.Series(year_data, date_index)
ser.head()

2017-01-01 0.877804
2017-01-02 -0.491671
2017-01-03 0.703513
2017-01-04 -1.013152
2017-01-05 0.290298
Freq: D, dtype: float64

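5.2 The rolling window method

pandas provides a window method, rolling().

rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)

Parameters:

window: the size of the window.
min_periods: the minimum number of observations each window must contain to produce a value.
center: whether to place the window label at the center of the window.
win_type: the type of window.
closed: which ends of the window interval are closed.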

roll_window = ser.rolling(window=10)
roll_window

Rolling [window=10,center=False,axis=0,method=single]

roll_window.mean()

2017-01-01 NaN
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 NaN
…
2017-12-27 -0.235698
2017-12-28 0.081969
2017-12-29 0.047098
2017-12-30 -0.085800
2017-12-31 -0.236803
Freq: D, Length: 365, dtype: float64
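The window from section 5.1 can be checked against rolling() directly. The sketch below uses a deterministic series instead of random data (an assumption, so that the comparison is reproducible) and confirms that the rolling value labelled 2017-09-02 is the mean of the ten days from 08-24 through 09-02. It also shows min_periods, which controls the leading NaN values seen above.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", "2017-12-31", freq="D")
ser = pd.Series(np.arange(365, dtype=float), index=idx)

rolled = ser.rolling(window=10).mean()

# By default each window is labelled by its last day.
from_rolling = rolled.loc[pd.Timestamp("2017-09-02")]
manual = ser.loc["2017-08-24":"2017-09-02"].mean()

# min_periods=1 lets the first, incomplete windows produce values too.
relaxed = ser.rolling(window=10, min_periods=1).mean()

print(from_rolling, manual)
print(rolled.isna().sum(), relaxed.isna().sum())
```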

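import matplotlib.pyplot as plt
# Jupyter notebook magic; drop this line when running as a plain script
%matplotlib inline
ser.plot(style='y--')                         # raw daily values
ser_window = ser.rolling(window=10).mean()    # 10-day rolling mean
ser_window.plot(style='b')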

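6. Time series model: ARIMA

Question: what is the ARIMA model?

Answer: ARIMA stands for AutoRegressive Integrated Moving Average. It is a common statistical model for time series forecasting, written ARIMA(p,d,q).

The ARIMA model is made up of three parts: the AR, I, and MA models.

The ARIMA(p,d,q) model can be written as:

$$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\varepsilon_t$$

where $L$ is the lag operator ($L X_t = X_{t-1}$), $\phi_i$ are the autoregressive coefficients, $\theta_i$ are the moving-average coefficients, and $\varepsilon_t$ is the error term.

Parameters:

p: the number of lags of the series itself used in the model, that is, the number of autoregressive terms.
d: how many times the series must be differenced to become stationary, that is, the order of differencing.
q: the number of lags of the forecast error used in the model, that is, the number of moving-average terms.

The basic idea of ARIMA: treat the data sequence that the forecast target forms over time as a random sequence and describe it approximately with a mathematical model. Once the model has been identified, future values can be predicted from the past and present values of the series.

The basic steps for building an ARIMA model are:

Step 1: obtain the observed time series data.
Step 2: plot the time series and check whether it is stationary. If it is not, difference it d times until it becomes stationary.
Step 3: from the stationary series, compute the autocorrelation function (ACF) and the partial autocorrelation function (PACF) to find the best orders p and q.
Step 4: build the ARIMA model from the d, q, and p obtained above, then check the model.

Note: a time series is called stationary if its mean shows no systematic change (no trend), its variance shows no systematic change, and periodic variation has been strictly removed.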
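Step 2 above, differencing until the series is stationary, can be sketched with pandas alone. The simulated random walk with drift below is an assumption for illustration and is not data from the article. Fitting the actual ARIMA model would normally be done with a statistics library and is outside this sketch.

```python
import numpy as np
import pandas as pd

# Simulate a trending, non-stationary series: a random walk with drift.
rng = np.random.default_rng(0)
steps = 0.5 + rng.normal(size=500)
walk = pd.Series(np.cumsum(steps),
                 index=pd.date_range("2020-01-01", periods=500, freq="D"))

# First-order differencing (d = 1) removes the trend.
diffed = walk.diff().dropna()

# Lag-1 autocorrelation: high for the trending series, much lower after differencing.
print(walk.autocorr(lag=1))
print(diffed.autocorr(lag=1))
```

A lag-1 autocorrelation that stays very high is one informal sign that a series still needs differencing; the ACF and PACF from Step 3 are then read off the differenced series.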

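7. Summary

This chapter covered how pandas handles time series: creating time series, timestamp indexing and slicing, fixed-frequency time series, periods and period arithmetic, resampling, sliding windows, and a time series model.
After working through it, readers should have a handle on the common techniques for processing time series data and be able to apply them flexibly.
Tomorrow will be better!!!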
