WaveForM: Graph Enhanced Wavelet Learning for Long Sequence Forecasting of Multivariate Time Series

Abstract

Multivariate time series (MTS) analysis and forecasting are crucial in many real-world applications, such as smart traffic management and weather forecasting. However, most existing work either focuses on short sequence forecasting or makes predictions predominantly with time domain features, which is not effective at removing noise with irregular frequencies in MTS.

Therefore, we propose WAVEFORM, an end-to-end graph enhanced Wavelet learning framework for long sequence FORecasting of MTS. WaveForM first utilizes Discrete Wavelet Transform (DWT) to represent MTS in the wavelet domain, which captures both frequency and time domain features with a sound theoretical basis.

To enable effective learning in the wavelet domain, we further propose a graph constructor, which learns a global graph to represent the relationships between MTS variables, and graph-enhanced prediction modules, which utilize dilated convolution and graph convolution to capture the correlations between time series and predict the wavelet coefficients at different levels.

Extensive experiments on five real-world forecasting datasets show that our model can achieve considerable performance improvement over different prediction lengths against the most competitive baseline of each dataset.

Introduction

Multiple interconnected streams of data, also known as multivariate time series (MTS), have a pervasive presence in real-world applications. Examples of MTS include the recorded traffic flows from various roadway sensors and the weather observations from multiple weather stations over time. Multivariate time series forecasting, which makes predictions based on historical MTS observations, has attracted extensive interest as it can integrate multiple sources of observations to provide a global view of applications and help make meaningful and accurate application-wide predictions. For example, to predict the future power consumption of a household, it is beneficial to consider and integrate the usage observations of multiple sectors, such as kitchen, laundry, and the average current intensity of the household.

Early solutions (Box and Jenkins 1970), which utilize statistical models, generally assume linear dependencies among variables, thus failing to capture complex non-linear patterns, which frequently occur in MTS.

In recent work, researchers have proposed a series of graph neural network (GNN)-based models to capture interconnections and interdependencies (also known as spatial dependencies) among MTS, owing to GNNs' strength in modeling complex structures of graph data.

For example, STGCN (Yu, Yin, and Zhu 2017) skillfully utilizes both graph convolution and gated causal convolution to tackle the MTS prediction problems in the traffic domain. Graph Multi-attention Network (GMAN) (Zheng et al. 2020) extends STGCN with an encoder-decoder architecture and incorporates attention mechanisms to better capture spatial-temporal relations in traffic data. The use of GNN in STGCN and GMAN relies on the assumption that prior knowledge of stable relationships among variables is available, and such knowledge is represented in a pre-defined graph structure. MTGNN (Wu et al. 2020) focuses on learning and recovering the latent dependencies (graph structure) among variables for tasks without explicitly defined graph structures by using a graph learning module, leading to better interpretability and performance for MTS forecasting tasks.

However, the existing work still overlooks long sequence forecasting (LSF) of MTS, which uses a given length of MTS to predict longer future sequences. LSF of MTS is crucial for facilitating long-term planning and offering early warning in various real-world applications.

However, predicting long sequences is challenging as long-term MTS are often composed of more entangled temporal patterns than short-term ones, and overlooking this may lead to unreliable discoveries of temporal dependencies (Wu et al. 2021).

Recently, transformer-based architectures have proven their effectiveness in modeling sequential data owing to the use of self-attention mechanisms (Zaheer et al. 2020), empowering MTS forecasting models for long-term prediction (Wen et al. 2022). However, these models frequently suffer from high computational cost in LSF. The existing transformer-based LSF approaches mainly focus on developing sparse self-attention schemas to improve model efficiency, inevitably sacrificing the rate of information utilization and resulting in a bottleneck for MTS LSF.

In comparison, this paper proposes a novel solution for effective long sequence forecasting of MTS.

In practice, MTS can be analyzed in the time domain, which studies how signals change over time, and/or the frequency domain, which studies signals from the perspective of their frequencies. However, we noticed that most existing work with deep learning models centers on extracting and utilizing time-domain features from MTS, leaving frequency domain analysis generally unattended.

Although some models, such as Autoformer (Wu et al. 2021) and FEDFormer (Zhou et al. 2022), utilize time-frequency transformations, such as the Fourier transform, they mainly aim to reduce the time complexity of the Transformer models rather than fully exploit the rich features in the frequency domain. Autoformer and FEDFormer demonstrated their effectiveness in leveraging additional frequency domain features with the Fourier transform or discrete wavelet transform. However, they generally utilize the extracted frequency domain features merely as a complement to the representations in the time domain and feed the combined/concatenated features to deep learning models for forecasting.

We argue that such a simple combination of features from two different domains cannot provide clear and sufficient information to the deep learning models and diminishes the effect of features in the frequency domain. It lacks a theoretical basis and may even introduce noise to the models, leading to sub-optimal performance.

Therefore, we propose to model MTS in the “wavelet domain” to effectively capture and exploit wavelet domain features, leveraging the capability of Discrete Wavelet Transform (DWT) (Shensa et al. 1992) which captures both the frequency-domain and time-domain features of MTS in a theoretically guaranteed framework.

More specifically, we utilize DWT to decompose MTS into different frequency bands (wavelets) with different resolutions, which are represented as wavelet coefficients, to enable more sophisticated time series feature extraction. Thereafter, we propose to adapt a Graph-enhanced Prediction module (GP) to model the changes of the wavelet coefficients with the same resolution over time. The graph convolution in GP is used to tackle the inter-dependencies among variables. Thus, the interdependency relationship can be captured at different resolutions in the wavelet domain. More importantly, we inject the same/global graph structure across all GP modules, indicating that variables in different views in the wavelet domain share the same basic message-passing behavior and avoid model overfitting. Once we obtain the predicted wavelet coefficients, we utilize Inverse Discrete Wavelet Transform (IDWT) to enable supervised learning in the training set.

PS: "Wavelet coefficients with the same resolution" refers to coefficients from the same frequency band produced during the DWT. In the DWT, the original time series is decomposed into wavelet coefficients of different frequency bands, each corresponding to a certain frequency range; within each band, the coefficients share the same resolution.
The resolution of wavelet coefficients refers to how finely the transform can represent and distinguish signal components in different frequency ranges.

Note that the global graph in the framework is learned end-to-end from data, which leads to a better interpretation of the inter-dependencies among variables.

The novel contributions of this research are as follows:
We propose a DWT-based end-to-end framework that transforms MTS into a wavelet domain for MTS long sequence prediction tasks. Owing to the features of DWT, our model is capable of fully exploiting the inherent features of MTS in both frequency and time domains.

We propose a global graph constructor to extract global information on the interrelationships among variables in the wavelet domain, preventing the framework training from overfitting.

We conducted comprehensive experiments on long sequence forecasting tasks in MTS, and the results show that our model consistently outperforms the state-of-the-art models for LSF tasks by a large margin.

Related Work

MTS forecasting can be considered a typical seq2seq task, and various deep sequence models have been proposed. DeepAR (Salinas et al. 2020) combines the idea of autoregression with recurrent neural networks (RNNs) to model the probability distribution of sequences. Besides RNN models, convolutional neural networks (CNNs) are also used for MTS forecasting. For example, Graph WaveNet (Wu et al. 2019) utilizes dilated causal convolution to force the model to focus only on historical information and expands the receptive field to obtain a broader range of periodic and tendency patterns. However, most of the existing models are not designed for LSF of MTS.

Transformer-based models for MTS prediction have received increasing attention (Wen et al. 2022), with two strands of research along this line. One strand, such as LogTrans (Li et al. 2019) and Autoformer (Wu et al. 2021), focuses on developing sparse attention mechanisms to replace the original attention mechanism, which has been recognized as the computational bottleneck for long sequence MTS predictions due to its O(L²) complexity in both time and space. Another strand, such as Informer (Zhou et al. 2021) and Pyraformer (Liu et al. 2021), focuses on reducing the computational complexity by improving the attention mechanism at the decomposed structural level, introducing different resolution representations of the original sequences through convolution operators and/or the Fourier transform to obtain the time dependence of the original sequences at different scales. However, such resolutions are solely or mostly produced in the time domain, and their purpose is to reduce the sequence length and thus improve computational efficiency. Therefore, the frequency domain information is not fully exploited, as it is used only as a supplement or as a means of reducing computational complexity.

Spatial-temporal GNNs have also been proposed for MTS forecasting tasks. They model each variate in MTS as a graph node and then represent the interdependencies between nodes with a latent graph. The features of each node are obtained by mainly considering the temporal dependency within each time series. Specifically, Graph WaveNet (Wu et al. 2019) designs a self-adaptive matrix to reveal the spatial dependencies with node embeddings. MTGNN (Wu et al. 2020) and GTS (Shang, Chen, and Bi 2021) extend Graph WaveNet by jointly learning the latent graph and the spatial-temporal GNN in an end-to-end framework with more sophisticated designs.

Autoformer (Wu et al. 2021) and FEDFormer (Zhou et al. 2022) utilized the Fourier transform to extract frequency domain features, which are then simply concatenated with time domain features for further processing in deep learning models. However, as we have argued before, such a simple mixture of features from completely different domains lacks a general theoretical guide for cross-domain concatenation/combination in deep learning.

Thus, this paper proposes to center the analysis on the wavelet domain, which theoretically reflects both time and frequency features, to better exploit the complex patterns in MTS.

3 Methodology

This section explains the details of the proposed WAVEFORM, the overview of which is illustrated in Fig. 1. WAVEFORM is a multi-resolution analysis (MRA) model based on discrete wavelet transforms, and it forecasts MTS in the wavelet domain.

WAVEFORM consists of three main components: discrete wavelet transform (DWT) module, global graph constructor (GGC), and graph-enhanced prediction (GP) modules.

As an MRA model, WAVEFORM relies on the scaling and translation of the DWT module to obtain the detail coefficients (cDi) and approximate coefficients (cAi) of different levels (i = 1, 2, . . . ) in the wavelet domain. The GP modules utilize dilated convolution and graph convolution to capture the correlations between time series and predict the wavelet coefficients at different levels, and all these modules share the same graph that is learned from GGC. With the use of an inverse DWT module, the framework is trained end-to-end.


The technical details of each component are presented in the rest of this section.

Figure 1: The WAVEFORM framework. The input MTS is decomposed into different coefficients in the wavelet domain, which are then fed into separate GP modules that share the same global graph for prediction. The learnable global graph is generated by two embedding layers and shared by all GP modules. The outputs of the GP modules are reconstructed into the time domain with the inverse DWT (IDWT) module.

3.1 Problem Definition

An MTS is denoted as X = [x_1⊺; x_2⊺; ···; x_N⊺], where X ∈ R^{N×T} represents an N-variate time series. x_i ∈ R^T represents the time series of the i-th variable, which consists of sequential recordings at T timestamps.

For an MTS forecasting task, we set an observation window H for the historical time series and a forecasting window P for the prediction. Accordingly, for each time step t, its historical series is H_t = X_{t−H+1:t} and its forecasting series is P_t = X_{t+1:t+P}.

Specifically, to be considered as a long sequence MTS forecasting task, H ≪ P.

Given a historical series H_t, the goal is to learn a mapping function f that is capable of accurately predicting the next P time steps, P̂_t = f(H_t, Θ), where Θ is the learnable parameter set.
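As a concrete illustration of this windowing, the sketch below builds (H_t, P_t) pairs from an N × T array; the function name, the example sizes (H = 96, P = 192), and the random data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def make_windows(X, H=96, P=192):
    """Slice an N x T multivariate series into (history, target) pairs,
    i.e. H_t = X[:, t-H+1 : t+1] and P_t = X[:, t+1 : t+P+1] for every valid t."""
    N, T = X.shape
    pairs = []
    for t in range(H - 1, T - P):
        hist = X[:, t - H + 1: t + 1]     # H_t, shape (N, H)
        target = X[:, t + 1: t + P + 1]   # P_t, shape (N, P)
        pairs.append((hist, target))
    return pairs

X = np.random.randn(7, 1000)                              # e.g. 7 variables, 1000 timestamps
pairs = make_windows(X)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)   # 713 (7, 96) (7, 192)
```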

3.2 Discrete Wavelet Transform Module and Its Inverse Version

The DWT module transforms an input MTS into its corresponding multi-scale frequency representations.

DWT is generally used to decompose input signals into a set of wavelets, which captures both the frequency and time features of the original signals and enables the following prediction modules to make predictions in parallel.

As depicted in Fig. 1, DWT can be performed multiple times, and each DWT uses a high-pass filter h and a low-pass filter g to decompose a time series signal x into different resolutions. The outputs of the high-pass and low-pass filters at layer l are denoted as cD_l and cA_l, respectively: cD_l, cA_l = DWT(cA_{l−1}), where l indicates the l-th decomposition and cA_0 = x. Specifically, we have

Note: A high-pass filter passes high-frequency components and blocks low-frequency ones; in the DWT it captures the short-term variations or details of the signal and produces the detail coefficients (cD). A low-pass filter passes low-frequency components and blocks high-frequency ones; in the DWT it captures the long-term trend of the signal and produces the approximate coefficients (cA). "Decomposing into different resolutions" means splitting a signal into parts of different granularity, each corresponding to a different frequency range or scale.

cD_l[n] = Σ_{m=0}^{M−1} cA_{l−1}[m] · h[s·n − m],   cA_l[n] = Σ_{m=0}^{M−1} cA_{l−1}[m] · g[s·n − m],

where M represents the length of cA_{l−1} after decomposing (l−1) times and s = 2 represents the scale. One feature of DWT is that after passing through h and g (namely, h and g perform the convolution operation (∗) with cA_{l−1}, respectively), only half the number of samples characterizes the original signal cA_{l−1} owing to the double scale. Therefore, according to Nyquist's rule, we can remove half of the samples with downsampling while keeping the original information. Besides, the selection of h and g depends on the form of the wavelet basis. In theory, once the wavelet basis is determined, the forms of h and g are determined as well. The detail coefficients cD depict the short-term trend of the series and carry the signal nuances, while the approximate coefficients cA describe the signal's long-term trend, which characterizes its identity. In addition, the frequency resolution of the original signal increases as the decomposition goes deeper.

Note: Nyquist's rule is a basic principle in signal processing: to fully recover the original information of a signal, the sampling rate must be at least twice the signal's highest frequency.
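To make the filtering and downsampling concrete, here is a minimal sketch of one DWT level with Haar analysis filters in NumPy; the filter normalization, boundary handling, and the toy signal are simplifying assumptions for illustration only.

```python
import numpy as np

# Orthonormal Haar analysis filters (an illustrative choice of wavelet basis):
# the low-pass g keeps the local trend, the high-pass h keeps the local detail.
g = np.array([1.0, 1.0]) / np.sqrt(2.0)
h = np.array([1.0, -1.0]) / np.sqrt(2.0)

def dwt_level(cA_prev):
    """One decomposition level: convolve with h and g, then keep every second
    sample (dyadic downsampling), which halves the length as described above."""
    cD = np.convolve(cA_prev, h)[1::2]
    cA = np.convolve(cA_prev, g)[1::2]
    return cD, cA

x = np.sin(np.linspace(0, 8 * np.pi, 96)) + 0.1 * np.random.randn(96)  # toy series, H = 96
cD1, cA1 = dwt_level(x)      # level-1 coefficients, 48 samples each
cD2, cA2 = dwt_level(cA1)    # level-2 coefficients, 24 samples each
print(len(x), len(cD1), len(cD2))   # 96 48 24
```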

After l layers of decomposition, for each x_i, the DWT module outputs a set of l + 1 coefficients p^(i) = {cD_1^(i), cD_2^(i), ..., cD_l^(i), cA_l^(i)}. Different levels of DWT represent different resolutions of the original signal.


Let C = {C_1, C_2, ..., C_l, C_{l+1}} represent the layered wavelet coefficients, where each layer contains each variable's corresponding coefficients, denoted as follows:

C_j = [cD_j^(1); cD_j^(2); ···; cD_j^(N)] ∈ R^{N×H_j} for j = 1, ..., l, and C_{l+1} = [cA_l^(1); cA_l^(2); ···; cA_l^(N)] ∈ R^{N×H_l},

where H_j = H/2^j, H is the length of the input MTS, and N denotes the number of variables.

Note that after the following graph-enhanced modules output the i-th variable's coefficients for the future P time steps, denoted as P̂^(i) = {ĉD_1^(i), ĉD_2^(i), ..., ĉD_l^(i), ĉA_l^(i)}, we apply the Inverse Discrete Wavelet Transform (IDWT) to reconstruct their corresponding sequence in the time domain. The process can be formulated as follows:

ĉA_{l−1}[n] = Σ_k ĉA_l[k] · g′[n − 2k] + Σ_k ĉD_l[k] · h′[n − 2k], applied recursively from the deepest level down to level 1,

PS: This paper utilizes the Haar wavelet (Pattanaik and Bouatouch 1995) for simplicity and l = 3.

where h′ and g′ are the synthesis versions of h and g. When using the Haar wavelet (Pattanaik and Bouatouch 1995), h′ = −h and g′ = g. Then, x̂_i = ĉA_0 is the reconstructed time series of the i-th variable in the time domain.
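The multi-level decomposition and its inverse can be sketched with the PyWavelets library (a stand-in for illustration; the paper implements DWT/IDWT inside the model), using the Haar basis and l = 3 mentioned above. Here we simply decompose a series and reconstruct it as a sanity check; in WAVEFORM, the GP modules would first replace the coefficients with their predicted values before the inverse transform.

```python
import numpy as np
import pywt

x = np.random.randn(96)                       # one variable of the input MTS (H = 96)
coeffs = pywt.wavedec(x, 'haar', level=3)     # [cA3, cD3, cD2, cD1]
print([len(c) for c in coeffs])               # [12, 12, 24, 48]

# IDWT maps wavelet-domain coefficients back to the time domain; with unchanged
# coefficients the reconstruction is exact (perfect reconstruction property).
x_rec = pywt.waverec(coeffs, 'haar')
print(np.allclose(x, x_rec))                  # True
```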

3.3 Global Graph Constructor (GGC)

After obtaining the wavelet coefficients at different scales, the model intends to forecast the coefficient changes over time in the wavelet domain. Although the wavelet coefficients at different layers reflect the time series at different frequency subbands, we assume, without loss of generality, that the variables share the same basic interaction structure at different resolutions. Using a global graph rather than learning graphs in each GP module also avoids overfitting and saves memory. The GGC module learns a global graph to represent the relationships between variables.


Note: "Global graph" means that a single shared graph structure describing the relationships between variables is used throughout the model.

For most real-world tasks, as we do not have proper prior knowledge of what the graph looks like, we propose to utilize a graph constructor to learn the global graph, which in turn guides the graph-enhanced prediction modules for more sophisticated feature extraction. Following (Wu et al. 2020), we use two independent and learnable embedding layers, E1 and E2, to learn two embedding representations for each node after assigning each node/variable an integer scalar. Let N = {1, 2, ..., N} denote the index set of the nodes/variables; then E1 = E1(N) ∈ R^{N×d} and E2 = E2(N) ∈ R^{N×d}, where E1 and E2 denote the variable representations obtained from the two layers. Then, the adjacency matrix A can be defined as follows:

A = ReLU(tanh(α(E1·E2⊺ − E2·E1⊺)))    (8)

where A ∈ R^{N×N} and α is the hyper-parameter for the activation function. It is worth noting that Eq.(8) regularizes the adjacency matrix A to a uni-directional acyclic graph so that the influence between the nodes is uni-directional. This is more consistent with the hypothesis widely adopted in MTS analysis that the influences between variables are not mutual. To further reduce the computational cost of the following graph convolution, we can simply set a threshold and filter out the linkages with weights smaller than the threshold to make the graph sparse.
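The sketch below shows one way to implement such a global graph constructor in PyTorch, following the MTGNN-style formulation that the paper builds on; the class name, the optional top-k sparsification (a stand-in for the threshold mentioned above), and the default hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalGraphConstructor(nn.Module):
    """Learns one global adjacency matrix from two node embedding tables."""
    def __init__(self, num_nodes, dim, alpha=3.0, k=None):
        super().__init__()
        self.emb1 = nn.Embedding(num_nodes, dim)   # E1
        self.emb2 = nn.Embedding(num_nodes, dim)   # E2
        self.alpha = alpha                         # activation hyper-parameter
        self.k = k                                 # optional top-k sparsification

    def forward(self, idx):
        e1 = torch.tanh(self.alpha * self.emb1(idx))
        e2 = torch.tanh(self.alpha * self.emb2(idx))
        # Anti-symmetric score: if A[i, j] > 0 then A[j, i] = 0 after the ReLU,
        # which keeps the influence between any two nodes uni-directional.
        a = torch.mm(e1, e2.t()) - torch.mm(e2, e1.t())
        adj = torch.relu(torch.tanh(self.alpha * a))
        if self.k is not None:
            mask = torch.zeros_like(adj)
            mask.scatter_(1, adj.topk(self.k, dim=1).indices, 1.0)
            adj = adj * mask                       # keep the k strongest outgoing edges per node
        return adj

gc = GlobalGraphConstructor(num_nodes=7, dim=16, k=3)
A = gc(torch.arange(7))   # (7, 7) non-negative adjacency shared by all GP modules
```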


3.4 Graph-Enhanced Prediction Modules

Given the learnable adjacency matrix of the variables, we build Graph-enhanced Prediction (GP) modules to methodically exploit the graphical information for predictions.

A GP module consists of three main components: a) learning the multi-scale representation that incorporates the wavelet information via dilated convolution, b) aggregating neighborhood messages via graph convolution, and c) generating the final representations by combining skip connection layers.


3.4.1 Dilated Convolution Component

Following MTGNN (Wu et al. 2020), we pass the input through stacked 1D dilated convolutions, which filter the wavelet coefficients to incorporate the wavelet information. In general, the standard convolution layer is ill-suited for long sequence forecasting since it requires many layers or large filters to increase the receptive field, both of which result in a substantial increase in model complexity.


Alternatively, the dilated convolution (Yu and Koltun 2016), which is known to stem from wavelet decomposition, can capture long-term information and more complex patterns without sacrificing computational efficiency.

To better predict the changes of the time series signal in the wavelet domain, with the assumption that the wavelet coefficients contain latent patterns and are by no means the best raw signal for the following graph convolution component, we further utilize multiple dilated convolution filters with different kernel sizes to capture respective features for the wavelet coefficients at each level of resolution. The output representations of these filters are activated by a sigmoid function and then concatenated to obtain the final representations of the stacked dilated convolution module. Given an input z and G filters f_1, f_2, ···, f_G, the dilated convolution module has the following form:

Z = concat(σ(z ⊗̄ f_1), σ(z ⊗̄ f_2), ..., σ(z ⊗̄ f_G)),

where ⊗̄ denotes the dilated convolution operator and σ is the sigmoid function.

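A minimal PyTorch sketch of this multi-kernel dilated convolution is given below; the kernel sizes, the single dilation factor, the sigmoid-then-concatenate ordering, and the (batch, channel, node, time) tensor layout are assumptions modeled on the MTGNN-style dilated inception layer rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class DilatedInception(nn.Module):
    """Several 1D dilated filters with different kernel sizes run in parallel over the
    time axis; each branch is passed through a sigmoid and the branches are concatenated."""
    def __init__(self, c_in, c_out, kernel_sizes=(2, 3, 6, 7), dilation=2):
        super().__init__()
        assert c_out % len(kernel_sizes) == 0
        self.filters = nn.ModuleList([
            nn.Conv2d(c_in, c_out // len(kernel_sizes), kernel_size=(1, k), dilation=(1, dilation))
            for k in kernel_sizes
        ])

    def forward(self, z):
        outs = [torch.sigmoid(f(z)) for f in self.filters]
        t = min(o.size(-1) for o in outs)          # branches shrink differently along time
        return torch.cat([o[..., -t:] for o in outs], dim=1)

layer = DilatedInception(c_in=32, c_out=32)
z = torch.randn(8, 32, 7, 96)   # (batch, channels, nodes, time)
print(layer(z).shape)           # torch.Size([8, 32, 7, 84])
```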

3.4.2 Graph Convolution Component

The purpose of the graph convolution module is to aggregate each node's information with its neighbors' information to capture the global dependencies among different variables. It is widely known that vanilla GCNs are susceptible to over-smoothing issues due to the simplification of convolution as a neighborhood averaging operator, resulting in limited distinguishable representations of the nodes (Li, Han, and Wu 2018; Abu-El-Haija et al. 2019; Huang et al. 2020).

To mitigate the issue, we utilize the MixHop layer proposed by (Abu-El-Haija et al. 2019; Wu et al. 2020) to capture complex relationships of neighbors at various hops instead of simply aggregating information from immediate neighbors.

Concretely, the graph convolution includes two main steps: i) the message propagating step (Eq.(10)), and ii) the message aggregating step (Eq.(11)). These two steps recursively pass the local information to the nodes in the global graph structure. Given the adjacency matrix A, the process of K-layer MixHop can be formulated as follows:


H^(k) = β·H_in + (1 − β)·Ã·H^(k−1),    (10)

H_out = Σ_{k=1}^{K} H^(k)·W^(k),    (11)

where H^(1) = H_in, H_in is the representation output from the previous layer, Ã = D^{−1}(A + I) is the normalized adjacency matrix with D_ii = 1 + Σ_j A_ij, and β is a hyperparameter that controls the proportion of information retained from the previous representation, which helps to alleviate the over-smoothing problem. Following (Wu et al. 2020), we use two MixHop layers to obtain exhaustive information by processing the inflow and outflow information passed through the nodes separately.

Eventually, given the dilated convolution component's output Z, the process of the graph convolution component can be described as H_out = MixHop_1(Z, Ã) + MixHop_2(Z, Ã⊺).

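The sketch below implements the two-step mix-hop propagation in PyTorch for a single graph; the batch/time dimensions are dropped, the hops are aggregated by concatenation plus a linear projection (equivalent to per-hop weight matrices), and all names and defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixProp(nn.Module):
    """K propagation steps that each mix the input back in (Eq. 10), followed by a
    linear aggregation over all hops (Eq. 11)."""
    def __init__(self, c_in, c_out, K=2, beta=0.05):
        super().__init__()
        self.K, self.beta = K, beta
        self.proj = nn.Linear(c_in * (K + 1), c_out)

    def forward(self, h_in, adj):
        # h_in: (num_nodes, c_in), adj: (num_nodes, num_nodes) learned adjacency
        deg = 1.0 + adj.sum(dim=1, keepdim=True)              # D_ii = 1 + sum_j A_ij
        adj_norm = (adj + torch.eye(adj.size(0))) / deg       # A~ = D^-1 (A + I)
        h, hops = h_in, [h_in]
        for _ in range(self.K):
            h = self.beta * h_in + (1.0 - self.beta) * adj_norm @ h   # propagation, Eq. (10)
            hops.append(h)
        return self.proj(torch.cat(hops, dim=-1))             # aggregation, Eq. (11)

# Two MixProp layers over A and A^T handle outflow and inflow information separately:
z, A = torch.randn(7, 16), torch.rand(7, 7)
h_out = MixProp(16, 32)(z, A) + MixProp(16, 32)(z, A.t())
print(h_out.shape)   # torch.Size([7, 32])
```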

3.4.3 Skip Connection and Output

A naive combination of the dilated convolution component and the graph convolution component is shown to be prone to gradient vanishing issues. The proposed GP module uses skip connections to improve its representational capability by preserving the original information. Given the wavelet coefficients C ∈ R^{N×L_w}, we first initialize two factors:

Y_0 = W_0 ∗ C,   Yskip_0 = Wskip_0 ∗ C,


where W_0 is a 1 × 1 convolution kernel for the convolution module in GP, and Wskip_0 is a 1 × L convolution kernel for a skip connection layer. Then we take the adjacency matrix A and these two factors as the input and pass them through K stacked GP modules:

(Y_k, Yskip_k) = GP_k(Y_{k−1}, Yskip_{k−1}, A),   k = 1, 2, ..., K.

In this process, the skip-output of the previous GP module, denoted Yskip_{k−1}, joins the output of the current dilated convolution module, denoted Z_k, to form Yskip_k:

Yskip_k = τ · Yskip_{k−1} + (1 − τ) · (Wskip_k ∗ Z_k),

where τ is a hyperparameter that controls the balance. Similarly, the other output factor of the previous GP module, Y_{k−1}, joins the skip-output of the current GP module, Yskip_k, to form Y_k:

Y_k = Y_{k−1} + Yskip_k.

After passing through all K stacked GP modules, we obtain the final output representation as the prediction of the wavelet coefficients. It is worth noting that in this process, wavelet coefficients at different scales are predicted separately by different GP modules while sharing the same global graph adjacency matrix.


4 Experiments

4.1 Datasets and Settings

We use widely adopted datasets in our experiments: Electricity (Wu et al. 2021), Traffic (Lai et al. 2018), Weather (Wu et al. 2021), and Solar-Energy (Lai et al. 2018). Each dataset is split chronologically, with 70% for training, 20% for validation, and 10% for testing. Following (Wu et al. 2019, 2020), we set the input sequence length (I) to 96 to predict the next 96, 192, 336, and 720 steps (O), and we use the mean absolute error (MAE) and mean squared error (MSE) to evaluate the long sequence forecasting performance of WAVEFORM and the baselines.

A more detailed description of the datasets, evaluation metrics, and experimental settings is provided in the appendix. The code is available at https://github.com/alanyoungCN/WaveForM.

4.2 Comparison Models

We compared WAVEFORM with the general sequence modeling approaches, including LSTM (Hochreiter and Schmidhuber 1997) and Transformer (Vaswani et al. 2017), and SOTA MTS forecasting models, including Graph WaveNet (Wu et al. 2019), Informer (Zhou et al. 2021), Autoformer (Wu et al. 2021), and MTGNN (Wu et al. 2020). The details of the baseline models can be found in Introduction & Related Work.

4.3 Main Results

The experimental results are shown in Table 1. For each method, we repeat the run three times with different seeds and report the average results. On all datasets, our model consistently outperforms the SOTA models on both MSE and MAE (lower is better), while none of the existing models can consistently serve as the second-best model across all datasets.

For each dataset, our model achieves roughly 15-20% performance improvement over different prediction lengths against the most competitive baseline. We attribute this significant improvement to the use of multi-level signals in the wavelet domain and the global graph.

Overall, for datasets with a relatively small number of nodes/variables, such as Solar-Energy and Weather, graph-based MTS models (including Graph WaveNet, MTGNN, and our WAVEFORM) perform better than Transformer-based models, which demonstrates the capability of graph-based modeling and GNNs in capturing the interdependencies between variables. We also observe that for datasets with a large number of nodes/variables, such as Traffic, graph-based models other than WAVEFORM perform worse than the other models.

Table 1: Comparison of the baselines and our model on different datasets and prediction lengths. We use 96, 192, 336, and 720 as the prediction lengths and 96 as the input length in all cases. Each case is repeated three times and the average is reported as the final result. Lower MSE and MAE indicate higher forecasting accuracy. Bold text indicates the best results.

Note that the Traffic dataset has the fewest records but the largest number of nodes, making it the most challenging task. We believe the existing graph-based methods underfit on the Traffic dataset to some extent because of the need to model such a large graph. In contrast, WAVEFORM can discover more features from the signals/data through its more sophisticated design in the wavelet domain, so that the end-to-end model can be trained better and capture the complex interrelationships among a large number of variables.

In addition, our global graph modeling can be regarded as further "fine-tuning" the interdependencies among variables at multiple levels through the multiple GP modules, which leads to better performance. Although Autoformer also exploits frequency-domain features, as we have argued, its inferior performance may be due to the inappropriate mixed use of features from different domains.

We further experiment on the Temperature dataset (Grigsby, Wang, and Qi 2021) to evaluate the models' performance on even longer sequence forecasting, with the results given in Table 2. Only the two most competitive models from the previous experiments, Autoformer and MTGNN, are included in this comparison. We can see that when the prediction length is extended from 720 to 1260 steps, the performance of Autoformer drops sharply, while MTGNN performs poorly under this setting. We believe that because the Temperature dataset has a very small number of nodes (only 6) but the largest number of records among all datasets, the graph used in MTGNN may overfit the interdependencies among the MTS. In contrast, WAVEFORM only experiences a slight degradation when dealing with ultra-long sequence forecasting and outperforms the other models by as much as 300%.

4.4 Ablation Study

We conduct an ablation study on the Electricity dataset to evaluate the effectiveness of the different modules in WAVEFORM. The variants of WAVEFORM include:

• WAVEFORM w/o GGC: remove the global graph constructor module from WAVEFORM and apply a separate graph constructor in each layer of the GP modules.

• WAVEFORM w/ single GP: after passing through the final high-pass and low-pass filters, concatenate the wavelet coefficients into a single sequence in the order of decomposition and then use only one GP module for prediction. The output of the GP module is then manually split into the wavelet coefficients of different levels for the IDWT.

• WAVEFORM w/o GP: remove the graph-enhanced prediction (GP) modules from WAVEFORM and use multiple affine transformations as a replacement.

The experimental results are given in Table 3. Compared with the results in Table 1, WAVEFORM and all of its variants outperform all the other compared models, demonstrating the effectiveness of the skillful use of wavelet-domain features. Table 3 also shows that the GGC module plays an important role in effectively providing global information across the different wavelet coefficients, significantly improving the forecasting performance. In addition, WAVEFORM w/ single GP performs worse, indicating that the coefficients at different scales obtained with the wavelet transform are best processed separately, which also guarantees the validity of applying the IDWT.

4.5 Wavelet-Domain Observations

This section uses the DWT to interpret the performance from the perspective of the wavelet domain. We use a 2-level DWT to transform the inputs into wavelet coefficients cD1, cD2, and cA2, which represent the correlations between the inputs and the wavelet function over time. Specifically, we transform the ground truth, Autoformer's predictions, and WAVEFORM's predictions on the Electricity dataset into wavelet-domain coefficients for comparison.
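A sketch of this comparison procedure is shown below, using PyWavelets as a stand-in and MAE as the per-level error measure (the exact error metric behind Fig. 2 is an assumption); it decomposes both the ground truth and a prediction into cA2, cD2, and cD1 and reports the error at each level.

```python
import numpy as np
import pywt

def wavelet_domain_errors(y_true, y_pred, level=2):
    """Transform both series with a 2-level Haar DWT and measure the per-level error."""
    c_true = pywt.wavedec(y_true, 'haar', level=level)   # [cA2, cD2, cD1]
    c_pred = pywt.wavedec(y_pred, 'haar', level=level)
    names = [f'cA{level}'] + [f'cD{j}' for j in range(level, 0, -1)]
    return {n: float(np.mean(np.abs(t - p))) for n, t, p in zip(names, c_true, c_pred)}

y_true = np.sin(np.linspace(0, 4 * np.pi, 96))           # toy ground truth
y_pred = y_true + 0.05 * np.random.randn(96)             # hypothetical model prediction
print(wavelet_domain_errors(y_true, y_pred))
```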

As shown in Fig. 2, as the wavelet-domain features become progressively finer from top to bottom, the prediction errors of Autoformer (red and green areas) and WAVEFORM (green areas) against the ground truth gradually increase, which implies that it is harder for Autoformer to discover the fine-grained low-frequency features (conveyed by the deeper cD and cA). Meanwhile, the performance gap between Autoformer and WAVEFORM (red areas) also widens as the decomposition goes deeper, which indicates that WAVEFORM is better at revealing the complex patterns of MTS.

Figure 2: Comparison in the wavelet domain. For the Electricity dataset, predictions are made with the trained Autoformer and WAVEFORM, respectively. The predictions are transformed with the DWT, and the errors against the ground truth are computed separately. The green areas denote the errors of WAVEFORM, and the red areas denote the additional errors of Autoformer over WAVEFORM. The coefficients from top to bottom reflect different frequency resolutions; the lower, the finer.

5 Conclusion

This paper proposes WAVEFORM, a novel framework for long sequence forecasting of multivariate time series. WAVEFORM uses the DWT to transform time-domain sequences into multi-resolution coefficients in the wavelet domain and then uses graph-enhanced convolution modules to model the relationships among the multiple variables. The experiments show that the transformed wavelet-domain coefficients better characterize the input sequences at multiple resolutions, enabling the model to learn fine-grained and complex patterns. Experiments on widely used benchmark datasets show that our model outperforms SOTA models for MTS long sequence forecasting by a significant margin.
