Python机器学习之时间序列2（新手自学记录）

蛋肠加蛋不加香菜

已于 2024-04-27 17:51:42 修改

阅读量790

点赞数 10

分类专栏： python机器学习文章标签： python 机器学习开发语言

于 2024-04-27 17:46:37 首次发布

本文链接：https://blog.csdn.net/weixin_46714992/article/details/138251062

版权

python机器学习专栏收录该内容

4 篇文章 0 订阅

订阅专栏

（每日碎碎念：天天对着电脑变成秃头宅女了o(╥﹏╥)o）

本文共2篇，第一篇链接：Python机器学习之时间序列Time series（新手自学记录）-CSDN博客文章浏览阅读576次，点赞8次，收藏13次。本文共两个案例，用到移动平均、指数平均（简单指数平均、霍尔特线性、霍尔特winter）。需要用到的工具库：Statmodels是一个库，它提供了用于估计许多不同统计模型的类和函数，以及用于执行统计测试和统计数据探索。https://blog.csdn.net/weixin_46714992/article/details/138246662学会：

1.使用boxcox变换时间序列数据。
2.使用ACF和PACF确定参数。
3.使用ARIMA模型。
4.对时间序列预测模型进行性能比较。

toolbox:

KPSS test: from statsmodels.tsa.stattools import kpss

statsmodels.tsa.stattools.kpss - statsmodels 0.14.1

boxcox transformation: from scipy.stats import boxcox

scipy.stats.boxcox — SciPy v1.13.0 Manual

ACF and PACF: from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

statsmodels.graphics.tsaplots.plot_pacf - statsmodels 0.14.1

statsmodels.graphics.tsaplots.plot_acf - statsmodels 0.14.1

ARIMA: from statsmodels.tsa.arima.model import ARIMA

statsmodels.tsa.arima.model.ARIMA - statsmodels 0.14.1

In [1]: 要用到的库

import pandas as pd
import numpy as np
from sklearn import metrics
from scipy.stats import boxcox
from statsmodels.tsa.stattools import kpss
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matplotlib import pyplot as plt
%matplotlib inline

1. Heater demand 加热器需求案例1

In [2]: 导入数据

heater = pd.read_csv('heater.csv')
heater.head()

Out[2]:

	year_month	demand
0	2004-01	27
1	2004-02	18
2	2004-03	14
3	2004-04	13
4	2004-05	13

In [3]: 绘图

plt.figure(figsize = (20,8))
plt.plot(heater['year_month'], heater['demand'])
plt.title('Monthly heater demand')

Out[3]:

Text(0.5, 1.0, 'Monthly heater demand')

由上图观察，时间序列是否平稳？

不平稳。数据具有增加的趋势和变化，以及有季节性变化的模式。接下来用KPSS测试来确认。

In [4]:

kpss_stat, pval, lags, crit = kpss(heater['demand'], regression = 'c', nlags = 'auto')
print('p Value to the KPSS test is: ',pval)

p Value to the KPSS test is:  0.01

由于p值低于0.05，可以认为时间序列在95%的置信水平下是非平稳的。

（a）应用Box-Cox来稳定方差

In [5]: Box-Cox变换可以将非正态分布的数据转化为近似正态分布的形式，从而改善数据分布，提高准确性。此处不详细展开，csdn有很多coxbox专门讲解的博客可以学习。

#if parameter lmbda is none, the model will find the optimal parameter
BCheater, lmbda = boxcox(heater['demand']) 
print(BCheater[:10])
print(lmbda)

[1.33645176 1.28437989 1.24430878 1.23115726 1.23115726 1.23115726
 1.23115726 1.24430878 1.25598388 1.2992875 ]
-0.6645210955312427

In [6]: 调整后的分布情况图

heater['demand_BC'] = BCheater
plt.figure(figsize = (20,8))
plt.plot(heater['year_month'], heater['demand_BC'])
plt.title('Monthly heater demand after Box-Cox transformation')

Out[6]:

Text(0.5, 1.0, 'Monthly heater demand after Box-Cox transformation')

方差逐渐增加的问题得到了缓解，时间序列数据呈现增加趋势和季节性。

(b) 用月均值作减法减少季节性影响

In [7]: 将 heater 数据框中的 year_month 列转换为日期时间格式，并提取出月份信息，将其存储在一个新的列 month 中

heater['month'] = pd.to_datetime(heater['year_month'], format = '%Y-%m').dt.month
print(heater.head())

  year_month  demand  demand_BC  month
0    2004-01      27   1.336452      1
1    2004-02      18   1.284380      2
2    2004-03      14   1.244309      3
3    2004-04      13   1.231157      4
4    2004-05      13   1.231157      5

In [8]: 按照月份（提取每年的对应月份数据）对 heater 数据框进行分组，生成 heater_by_month ，然后计算每个月份的 demand_BC 列的平均值

heater_by_month = heater.groupby('month')['demand_BC'].mean().reset_index()
print(heater_by_month.head(12))

    month  demand_BC
0       1   1.354655
1       2   1.323694
2       3   1.292226
3       4   1.272576
4       5   1.265195
5       6   1.259130
6       7   1.248622
7       8   1.253515
8       9   1.281419
9      10   1.332302
10     11   1.354482
11     12   1.362085

In [9]: 将 heater_by_month 数据框中的列名 demand_BC 更改为 demand_month_avg

heater_by_month = heater_by_month.rename(columns = {'demand_BC':'demand_month_avg'})
print(heater_by_month.head(12))

    month  demand_month_avg
0       1          1.354655
1       2          1.323694
2       3          1.292226
3       4          1.272576
4       5          1.265195
5       6          1.259130
6       7          1.248622
7       8          1.253515
8       9          1.281419
9      10          1.332302
10     11          1.354482
11     12          1.362085

In [10]: 将 heater 数据框与 heater_by_month 数据框合并，使用 month 列进行连接。这将在 heater 数据框中添加一个新列 demand_month_avg，该列包含每个月份的平均 demand_BC 值。

heater = heater.merge(heater_by_month, on = 'month')
print(heater.head())

  year_month  demand  demand_BC  month  demand_month_avg
0    2004-01      27   1.336452      1          1.354655
1    2005-01      27   1.336452      1          1.354655
2    2006-01      21   1.305845      1          1.354655
3    2007-01      25   1.327616      1          1.354655
4    2008-01      26   1.332175      1          1.354655

In [11]: 按照 year_month 列的值对 heater 数据框进行排序，以升序排列。

# sort data by year_month in ascending order
heater = heater.sort_values('year_month')
print(heater.head())

   year_month  demand  demand_BC  month  demand_month_avg
0     2004-01      27   1.336452      1          1.354655
17    2004-02      18   1.284380      2          1.323694
34    2004-03      14   1.244309      3          1.292226
51    2004-04      13   1.231157      4          1.272576
68    2004-05      13   1.231157      5          1.265195

In [12]: 创建了一个新的列 demand_de_season，其中包含了每个月的 demand_BC 减去该月的平均 demand_BC 值后的结果，用于表示去季节性的需求。绘制了heater 数据框中 year_month 列和 demand_de_season 列的折线图，显示去季节性的需求。

heater['demand_de_season'] = heater['demand_BC'] - heater['demand_month_avg']
plt.figure(figsize = (20,8))
plt.plot(heater['year_month'], heater['demand_de_season'])
plt.title('Monthly heater demand after Box-Cox transformation and de-seasonality')

Out[12]:

Text(0.5, 1.0, 'Monthly heater demand after Box-Cox transformation and de-seasonality')

上图去掉了季节性因素。时间序列数据呈递增趋势。

(c) 应用一阶差分来消除数据的趋势。

In [13]: 创建一阶差分列，一阶差分是当前值与前一个值之间的差值，用于表示序列的变化情况。

heater['demand_diff1'] = heater['demand_de_season'].diff()
print(heater.head())

   year_month  demand  demand_BC  month  demand_month_avg  demand_de_season  \
0     2004-01      27   1.336452      1          1.354655         -0.018203   
17    2004-02      18   1.284380      2          1.323694         -0.039314   
34    2004-03      14   1.244309      3          1.292226         -0.047917   
51    2004-04      13   1.231157      4          1.272576         -0.041419   
68    2004-05      13   1.231157      5          1.265195         -0.034038   

    demand_diff1  
0            NaN  
17     -0.021111  
34     -0.008603  
51      0.006498  
68      0.007382

In [14]: 绘图看看

plt.figure(figsize = (20,8))
plt.plot(heater['year_month'], heater['demand_diff1'])
plt.title('Monthly heater demand after Box-Cox transformation, de-seasonality, and de-trend')

Out[14]:

Text(0.5, 1.0, 'Monthly heater demand after Box-Cox transformation, de-seasonality, and de-trend')

趋势图看起来好很多，接下来用KPSS测试来确认。

In [15]:

_, pval, __, ___ = kpss(heater['demand_diff1'][1:], regression = 'c', nlags = 'auto') # do not include the first NaN value
print('p Value to the KPSS test is: ',pval)

p Value to the KPSS test is:  0.1

由于p值大于0.05，可以认为时间序列在95%的置信水平上是平稳的。

接下来准备应用ARMA模型。

(a) 利用PACF图求出自回归(AR)的阶数(即p)。

In [16]: plot_pacf的函数通常用于绘制偏自相关函数（Partial Autocorrelation Function，PACF）的图表，PACF图表显示了每个滞后阶数的序列与其自身滞后值之间的偏自相关性。

plot_pacf(heater['demand_diff1'][1:], title = 'PACF plot');

设p=2，模型越简单越好。

(b) 使用ACF图找出移动平均线(MA)的顺序(即q)。

In [17]: 画ACF图，ACF图表显示了每个滞后阶数的序列与其自身滞后值之间的自相关性。

plot_acf(heater['demand_diff1'][1:], title = 'ACF plot');

设q = 1。接下来训练模型是ARMA(2,1)。

In [18]: 将 heater 数据集中除了最后12个数据点之外的所有数据分配给了训练集 train。

# first split our data into training and test. let's take the last 12 data points as our test data.
train = heater[:-12]
test = heater[-12:]
train.shape, test.shape

Out[18]:

((186, 7), (12, 7))

In [19]: 绘制训练集和测试集折线图，以便可视化数据的分布和趋势。

# plot training and testing data
plt.figure(figsize = (20,8))
plt.plot(train['year_month'], train['demand_diff1'], label = 'Train')
plt.plot(test['year_month'], test['demand_diff1'], label = 'Test')
plt.legend(loc = 'upper left')
plt.title('Monthly heater demand')

Out[19]:

Text(0.5, 1.0, 'Monthly heater demand')

In [20]: 使用ARIMA模型来进行时间序列预测。创建了一个ARIMA模型对象，其中 train['demand_diff1'] 是训练数据的一阶差分序列，而 (2,0,1) 是ARIMA模型的参数，分别表示自回归项的阶数(p)，差分的阶数(d)，移动平均项的阶数(q)。

# 2,0,1 ARIMA Model
model = ARIMA(train['demand_diff1'], order=(2,0,1))
model_fit = model.fit()
y_predict = model_fit.forecast(12)

In [21]: 绘制训练集测试集及用ARIMA模型预测结果的折线图。

# Plot
plt.figure(figsize = (20,8))
plt.plot(train['year_month'], train['demand_diff1'], label = 'Train')
plt.plot(test['year_month'], test['demand_diff1'], label = 'Test')
plt.plot(test['year_month'], y_predict, label='forecast')
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=18)
plt.show()

再尝试一下淡季需求的ARIMA(2,1,1)模型。

In [22]: 使用ARIMA模型来进行时间序列预测。创建了一个ARIMA模型对象，ARIMA模型的参数是（2,1,1）。

model = ARIMA(train['demand_de_season'], order=(2,1,1)) 
model_fit = model.fit()
y_predict = model_fit.forecast(12)

In [23]: 绘制训练集测试集及用ARIMA模型预测结果的折线图。

plt.figure(figsize = (20,8))
plt.plot(train['year_month'], train['demand_de_season'], label = 'Train')
plt.plot(test['year_month'], test['demand_de_season'], label = 'Test')
plt.plot(test['year_month'], y_predict, label='forecast')
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=18)
plt.show()

试试Box-Cox转换需求的季节性ARIMA。非季节部分与之前相同，即p = 2, d = 1, q = 1。

因为这个季节是以月为单位的。对于季节性的AR (P)、I (D)和MA (Q)，取滞后期= 12的差值。例如，今年1月的需求减去去年1月的需求。

In [24]:

# plot the demand after differencing with lag = 12
plt.figure(figsize = (20,8))
plt.plot(heater['year_month'], heater['demand_BC'].diff(12))
plt.show()

折线图没有受趋势影响，所以我们不进一步差分，即D = 0。
用pacf求季节P P = 1。

In [25]:

print(heater['demand_BC'].diff(12).head(n=20))
plot_pacf(heater['demand_BC'].diff(12)[12:]);

0           NaN
17          NaN
34          NaN
51          NaN
68          NaN
85          NaN
102         NaN
118         NaN
134         NaN
150         NaN
166         NaN
182         NaN
1      0.000000
18    -0.008535
35     0.011675
52     0.013152
69     0.000000
86     0.000000
103   -0.014951
119   -0.013152
Name: demand_BC, dtype: float64

用acf求季节性Q值，Q = 3。

In [26]:

plot_acf(heater['demand_BC'].diff(12)[12:]);

接下来应用季节性ARIMA模型。

In [27]:

model = ARIMA(train['demand_BC'], order=(2,1,1), seasonal_order = (1,0,3,12))
model_fit = model.fit()
y_predict = model_fit.forecast(12)

In [28]:

plt.figure(figsize = (20,8))
plt.plot(train['year_month'], train['demand_BC'], label = 'Train')
plt.plot(test['year_month'], test['demand_BC'], label = 'Test')
plt.plot(test['year_month'], y_predict, label='forecast')
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=18)
plt.show()

In [29]:

rmse_sarima = metrics.mean_squared_error(y_pred=y_predict,
                                       y_true=test['demand_BC'], squared = False)
print(rmse_sarima)

0.010115957627675597

In [30]:

from statsmodels.tsa.exponential_smoothing.ets import ETSModel
hw = ETSModel(train['demand_BC'], trend='add', seasonal = 'add', seasonal_periods = 12)
hw_fit = hw.fit()
y_predict = hw_fit.forecast(12)

In [31]:

plt.figure(figsize = (20,8))
plt.plot(train['year_month'], train['demand_BC'], label = 'Train')
plt.plot(test['year_month'], test['demand_BC'], label = 'Test')
plt.plot(test['year_month'], y_predict, label='forecast')
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=18)
plt.show()

In [32]:

rmse_hw = metrics.mean_squared_error(y_pred=y_predict,
                                       y_true=test['demand_BC'], squared = False)
print(rmse_hw)