Huang S, Wang D, Wu X, et al. DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting[C]//Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019: 2129-2132.
Paper link:
https://doi.org/10.1145/3357384.3358132
Code link (PyTorch):
https://github.com/bighuang624/DSANet
Motivation
- Traditional methods fail to capture complicated nonlinear dependencies between time steps and between multiple time series.
- Recurrent neural networks and attention mechanisms have been used to model periodic temporal patterns across multiple time steps. However, these models do not fit well to time series with dynamic-period or nonperiodic patterns.
Dual Self-Attention Network (DSANet):
Highly efficient multivariate time series forecasting, especially for dynamic-period or nonperiodic series.
Model
- Global Temporal Convolution:
Extracts time-invariant patterns across all time steps of each univariate series.
- Local Temporal Convolution:
Time steps with a shorter relative distance have a larger impact on each other.
Focuses on modeling such local temporal patterns.
- Self-Attention Module:
Exploits the strong feature-extraction capability of self-attention networks.
Captures the dependencies between different series.
Scaled dot-product self-attention (as in the Transformer):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Position-wise feed-forward:
$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$
- Autoregressive Component:
Because both the convolutional and self-attention components are nonlinear, the scale of the network output is not sensitive to the scale of the input.
The classical AR model is therefore added as a linear component.
- Generation of Prediction:
A dense layer first combines the outputs of the two self-attention modules;
the final prediction is then obtained by summing this self-attention-based prediction and the AR prediction (a PyTorch sketch of the full forward pass follows this list).
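To make the data flow concrete, here is a minimal PyTorch sketch of the architecture described above. It is a simplification under assumptions, not the authors' implementation: single-head, single-layer self-attention and a single output step, and the names (`DSANetSketch`, `SelfAttentionBlock`) and sizes (`n_kernels=32`, `local_kernel=3`, `d_ff=128`) are illustrative choices; see the linked repository for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionBlock(nn.Module):
    """Scaled dot-product self-attention across series + position-wise FFN."""
    def __init__(self, d_model, d_ff=128, dropout=0.1):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, n_series, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1) @ v
        x = self.norm1(x + self.drop(attn))      # residual + layer norm
        return self.norm2(x + self.drop(self.ffn(x)))


class DSANetSketch(nn.Module):
    """Predicts x_{t+h} for all series; the lead time h only affects
    how input/target windows are built, not the architecture."""
    def __init__(self, n_series, window, n_kernels=32, local_kernel=3):
        super().__init__()
        # Global branch: kernel spans the whole window -> one value per filter per series.
        self.global_conv = nn.Conv2d(1, n_kernels, (window, 1))
        # Local branch: short kernel, then max-pool over the time axis.
        self.local_conv = nn.Conv2d(1, n_kernels, (local_kernel, 1))
        self.attn_g = SelfAttentionBlock(n_kernels)
        self.attn_l = SelfAttentionBlock(n_kernels)
        self.dense = nn.Linear(2 * n_kernels, 1)  # combine both attention outputs
        self.ar = nn.Linear(window, 1)            # linear AR component

    def forward(self, x):                         # x: (batch, window, n_series)
        inp = x.unsqueeze(1)                                 # (b, 1, T, D)
        g = F.relu(self.global_conv(inp)).squeeze(2)         # (b, K, D)
        l = F.relu(self.local_conv(inp)).max(dim=2).values   # (b, K, D)
        g = self.attn_g(g.transpose(1, 2))                   # (b, D, K)
        l = self.attn_l(l.transpose(1, 2))                   # (b, D, K)
        nonlinear = self.dense(torch.cat([g, l], dim=-1))    # (b, D, 1)
        linear = self.ar(x.transpose(1, 2))                  # (b, D, 1)
        return (nonlinear + linear).squeeze(-1)              # (b, D)
```

A tensor of shape (batch, window, n_series) goes in; a (batch, n_series) prediction at lead time h comes out, with the nonlinear dual-branch output and the linear AR output summed at the end.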
Experiment
Dataset: daily revenue of five gas stations of a gas station service company, from 2015/12/1 to 2018/12/1.
The stations are geographically close, so a complex mix of mutual promotion and competition exists among their revenues.
Split: training (60%), validation (20%), and test (20%).
Optimization: mini-batch stochastic gradient descent (SGD) with the Adam optimizer; the loss function is MSE.
Dropout rate: 0.1.
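A minimal training-loop sketch matching the stated setup (mini-batch gradient descent via Adam, MSE loss; dropout 0.1 lives inside the model). It reuses the hypothetical `DSANetSketch` from above; the batch size, learning rate, epoch count, window size, and random stand-in data are all assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = DSANetSketch(n_series=5, window=32)           # sizes are assumptions
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr is an assumption
loss_fn = torch.nn.MSELoss()

# Random stand-in for windowed (input, target-at-horizon-h) pairs.
X = torch.randn(512, 32, 5)
Y = torch.randn(512, 5)
loader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

for epoch in range(10):
    for xb, yb in loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
```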
Metrics: root relative squared error (RRSE), mean absolute error (MAE), and empirical correlation coefficient (CORR).
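Since RRSE and CORR are less standard than MAE, here is a NumPy sketch of all three, assuming `y_true` and `y_pred` are arrays of shape (n_samples, n_series); the exact reduction follows the usual LSTNet-style definitions (an assumption about the paper's precise computation).

```python
import numpy as np

def rrse(y_true, y_pred):
    # Root relative squared error: squared error normalized by the deviation
    # of the ground truth from its mean (a scale-free variant of RMSE).
    num = np.sum((y_pred - y_true) ** 2)
    den = np.sum((y_true - y_true.mean()) ** 2)
    return np.sqrt(num / den)

def mae(y_true, y_pred):
    return np.abs(y_pred - y_true).mean()

def corr(y_true, y_pred):
    # Empirical correlation coefficient: Pearson correlation per series
    # (across samples), averaged over series.
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0)) + 1e-12
    return (num / den).mean()
```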
(1) The best result on each window-horizon pair is obtained by the complete DSANet, showing that every component contributes to the effectiveness and robustness of the whole model;
(2) The performance of DSAwoAR drops significantly, showing that the AR component plays a crucial role; the reason is that AR is generally robust to scale changes in the data, as noted in [10];
(3) DSAwoGlobal and DSAwoLocal also suffer a performance loss, but a smaller one than removing the AR component. This is because the features learned by the two convolutional branches overlap: when one branch is removed, some of the lost features can be recovered from the other branch.
A detailed walkthrough of the PyTorch implementation (in Chinese) is available at: https://blog.csdn.net/itnerd/article/details/106266829