Are Transformers Effective for Time Series Forecasting?

Abstract

Recently, there has been a surge of Transformer-based solutions for the long-term time series forecasting (LTSF) task. Despite the growing performance over the past few years, we question the validity of this line of research in this work. Specifically, the Transformer is arguably the most successful solution for extracting the semantic correlations among the elements in a long sequence. However, in time series modeling, we aim to extract the temporal relations in an ordered set of continuous points. While employing positional encoding and using tokens to embed sub-series in Transformers facilitate preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss.


To validate our claim, we introduce a set of embarrassingly simple one-layer linear models named LTSF-Linear for comparison. Experimental results on nine real-life datasets show that LTSF-Linear surprisingly outperforms existing sophisticated Transformer-based LTSF models in all cases, and often by a large margin. Moreover, we conduct comprehensive empirical studies to explore the impacts of various design elements of LTSF models on their temporal relation extraction capability. We hope this surprising finding opens up new research directions for the LTSF task. We also advocate revisiting the validity of Transformer-based solutions for other time series analysis
tasks (e.g., anomaly detection) in the future. Code is available at: https://github.com/cure-lab/LTSFLinear.


1. Introduction

Time series are ubiquitous in today’s data-driven world. Given historical data, time series forecasting (TSF) is a long-standing task that has a wide range of applications, including but not limited to traffic flow estimation, energy management, and financial investment. Over the past several decades, TSF solutions have undergone a progression from traditional statistical methods (e.g., ARIMA [1]) and machine learning techniques (e.g., GBRT [11]) to deep learning-based solutions, e.g., Recurrent Neural Networks [15] and Temporal Convolutional Networks [3, 17].


Transformer [26] is arguably the most successful sequence modeling architecture, demonstrating unparalleled performances in various applications, such as natural language processing (NLP) [7], speech recognition [8], and computer vision [19, 29]. Recently, there has also been a surge of Transformer-based solutions for time series analysis, as surveyed in [27]. Most notable models, which focus on the less explored and challenging long-term time series forecasting (LTSF) problem, include LogTrans [16] (NeurIPS 2019), Informer [30] (AAAI 2021 Best paper), Autoformer [28] (NeurIPS 2021), Pyraformer [18] (ICLR 2022 Oral), Triformer [5] (IJCAI 2022) and the recent FEDformer [31] (ICML 2022).


The main working power of Transformers comes from the multi-head self-attention mechanism, which has a remarkable capability of extracting semantic correlations among elements in a long sequence (e.g., words in texts or 2D patches in images). However, self-attention is permutation-invariant and “anti-order” to some extent. While using various types of positional encoding techniques can preserve some ordering information, it is still inevitable to have temporal information loss after applying self-attention on top of them. This is usually not a serious concern for semantic-rich applications such as NLP, e.g., the semantic meaning of a sentence is largely preserved even if we reorder some words in it. However, when analyzing time series data, there is usually a lack of semantics in the numerical data itself, and we are mainly interested in modeling the temporal changes among a continuous set of points. That is, the order itself plays the most crucial role. Consequently, we pose the following intriguing question: Are Transformers really effective for long-term time series forecasting?


Moreover, while existing Transformer-based LTSF solutions have demonstrated considerable prediction accuracy improvements over traditional methods, in their experiments, all the compared (non-Transformer) baselines perform autoregressive or iterated multi-step (IMS) forecasting [1,2,22,24], which are known to suffer from significant error accumulation effects for the LTSF problem. Therefore, in this work, we challenge Transformer-based LTSF solutions with direct multi-step (DMS) forecasting strategies to validate their real performance.


Not all time series are predictable, let alone long-term forecasting (e.g., for chaotic systems). We hypothesize that long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity. As linear models can already extract such information, we introduce a set of embarrassingly simple models named LTSF-Linear as a new baseline for comparison. LTSF-Linear regresses historical time series with a one-layer linear model to forecast future time series directly. We conduct extensive experiments on nine widely-used benchmark datasets that cover various real-life applications: traffic, energy, economics, weather, and disease predictions. Surprisingly, our results show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%). Moreover, we find that, in contrast to the claims in existing Transformers, most of them fail to extract temporal relations from long sequences, i.e., the forecasting errors are not reduced (sometimes even increased) with the increase of look-back window sizes. Finally, we conduct various ablation studies on existing Transformer-based TSF solutions to study the impact of various design elements in them.


To sum up, the contributions of this work include:

(1) To the best of our knowledge, this is the first work to challenge the effectiveness of the booming Transformers for the long-term time series forecasting task.


(2) To validate our claims, we introduce a set of embarrassingly simple one-layer linear models, named LTSF-Linear, and compare them with existing Transformer-based LTSF solutions on nine benchmarks. LTSF-Linear can be a new baseline for the LTSF problem.


(3) We conduct comprehensive empirical studies on various aspects of existing Transformer-based solutions, including the capability of modeling long inputs, the sensitivity to time series order, the impact of positional encoding and sub-series embedding, and efficiency comparisons. Our findings would benefit future research in this area.


With the above, we conclude that the temporal modeling capabilities of Transformers for time series are exaggerated, at least for the existing LTSF benchmarks. At the same time, while LTSF-Linear achieves a better prediction accuracy compared to existing works, it merely serves as a simple baseline for future research on the challenging long-term TSF problem. With our findings, we also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks in the future.


2. Preliminaries: TSF Problem Formulation

For time series containing $C$ variates, given historical data $\mathcal{X} = \{X_1^t, \dots, X_C^t\}_{t=1}^{L}$, wherein $L$ is the look-back window size and $X_i^t$ is the value of the $i$-th variate at the $t$-th time step, the time series forecasting task is to predict the values $\hat{\mathcal{X}} = \{\hat{X}_1^t, \dots, \hat{X}_C^t\}_{t=L+1}^{L+T}$ at the $T$ future time steps. When $T > 1$, iterated multi-step (IMS) forecasting [23] learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting [4] directly optimizes the multi-step forecasting objective at once.


Compared to DMS forecasting results, IMS predictions have smaller variance thanks to the autoregressive estimation procedure, but they inevitably suffer from error accumulation effects. Consequently, IMS forecasting is preferable when there is a highly-accurate single-step forecaster, and T is relatively small. In contrast, DMS forecasting generates more accurate predictions when it is hard to obtain an
unbiased single-step forecasting model, or T is large.

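To make the distinction concrete, below is a minimal sketch (not from the paper) contrasting the two strategies. The model names (`step_model`, `dms_model`) and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

L, T = 96, 24  # look-back window and forecasting horizon (illustrative values)

def ims_forecast(step_model: nn.Module, x: torch.Tensor, horizon: int) -> torch.Tensor:
    """Iterated multi-step (IMS): apply a single-step forecaster repeatedly,
    feeding each prediction back into the sliding input window."""
    window = x.clone()                       # shape: (batch, L)
    preds = []
    for _ in range(horizon):
        next_step = step_model(window)       # shape: (batch, 1)
        preds.append(next_step)
        window = torch.cat([window[:, 1:], next_step], dim=1)  # slide the window
    return torch.cat(preds, dim=1)           # shape: (batch, T)

def dms_forecast(dms_model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Direct multi-step (DMS): predict all T future values at once."""
    return dms_model(x)                      # shape: (batch, T)

# Toy linear forecasters standing in for real models.
step_model = nn.Linear(L, 1)
dms_model = nn.Linear(L, T)

x = torch.randn(8, L)                        # a batch of 8 univariate look-back windows
print(ims_forecast(step_model, x, T).shape)  # torch.Size([8, 24])
print(dms_forecast(dms_model, x).shape)      # torch.Size([8, 24])
```

The error accumulation of IMS comes from the feedback loop above: every predicted value re-enters the input window and contaminates all later steps, whereas DMS produces the whole horizon in one shot.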

3. Transformer-Based LTSF Solutions

Transformer-based models [26] have achieved unparalleled performances in many long-standing AI tasks in natural language processing and computer vision fields, thanks to the effectiveness of the multi-head self-attention mechanism. This has also triggered lots of research interest in Transformer-based time series modeling techniques [20, 27]. In particular, a large amount of research works are dedicated to the LTSF task (e.g., [16, 18, 28, 30, 31]). Considering the ability to capture long-range dependencies with Transformer models, most of them focus on the less-explored long-term forecasting problem ($T \gg 1$).


When applying the vanilla Transformer model to the LTSF problem, it has some limitations, including the quadratic time/memory complexity with the original self-attention scheme and error accumulation caused by the autoregressive decoder design. Informer [30] addresses these issues and proposes a novel Transformer architecture with reduced complexity and a DMS forecasting strategy. Later,
more Transformer variants introduce various time series features into their models for performance or efficiency improvements [18,28,31]. We summarize the design elements of existing Transformer-based LTSF solutions as follows (see Figure 1).


Figure 1. The pipeline of existing Transformer-based TSF solutions. In (a) and (b), solid boxes are essential operations and dotted boxes are applied optionally. (c) and (d) differ across methods [16, 18, 28, 30, 31].

Time series decomposition: For data preprocessing, normalization with zero-mean is common in TSF. Besides, Autoformer [28] first applies seasonal-trend decomposition behind each neural block, which is a standard method in time series analysis to make raw data more predictable [6, 13]. Specifically, they use a moving average kernel on the input sequence to extract the trend-cyclical component of
the time series. The difference between the original sequence and the trend component is regarded as the seasonal component. On top of the decomposition scheme of Autoformer, FEDformer [31] further proposes the mixture of experts’ strategies to mix the trend components extracted by moving average kernels with various kernel sizes.

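A minimal sketch of the moving-average decomposition described above (the 1D batch layout and the kernel size of 25 are illustrative choices, not values taken from the paper):

```python
import torch
import torch.nn.functional as F

def series_decomp(x: torch.Tensor, kernel_size: int = 25):
    """Split a series into a trend (moving average) and a seasonal (remainder) part.

    x: tensor of shape (batch, length). Both ends are padded by replication so that
    the trend has the same length as the input.
    """
    pad = (kernel_size - 1) // 2
    x_padded = F.pad(x.unsqueeze(1), (pad, kernel_size - 1 - pad), mode="replicate")
    trend = F.avg_pool1d(x_padded, kernel_size=kernel_size, stride=1).squeeze(1)
    seasonal = x - trend                 # remainder after removing the trend-cyclical part
    return seasonal, trend

x = torch.randn(4, 96)                   # a batch of 4 look-back windows of length 96
seasonal, trend = series_decomp(x)
print(seasonal.shape, trend.shape)       # torch.Size([4, 96]) torch.Size([4, 96])
```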

Input embedding strategies: The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series. However, local positional information, i.e., the ordering of the time series, is important. Besides, global temporal information, such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events), is also informative [30]. To enhance the temporal context of time-series inputs, a practical design in the SOTA Transformer-based methods is injecting several embeddings, like a fixed positional encoding, a channel projection embedding, and learnable temporal embeddings, into the input sequence. Moreover, temporal embeddings with a temporal convolution layer [16] or learnable timestamps [28] are introduced.


Self-attention schemes: Transformers rely on the self-attention mechanism to extract the semantic dependencies between paired elements. Motivated by reducing the $O(L^2)$ time and memory complexity of the vanilla Transformer, recent works propose two strategies for efficiency. On the one hand, LogTrans and Pyraformer explicitly introduce a sparsity bias into the self-attention scheme. Specifically, LogTrans uses a Logsparse mask to reduce the computational complexity to $O(L \log L)$, while Pyraformer adopts pyramidal attention that captures hierarchically multi-scale temporal dependencies with an $O(L)$ time and memory complexity. On the other hand, Informer and FEDformer use the low-rank property in the self-attention matrix. Informer proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to $O(L \log L)$, and FEDformer designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain $O(L)$ complexity. Lastly, Autoformer designs a series-wise auto-correlation mechanism to replace the original self-attention layer.


Decoders: The vanilla Transformer decoder outputs sequences in an autoregressive manner, resulting in a slow inference speed and error accumulation effects, especially for long-term predictions. Informer designs a generative-style decoder for DMS forecasting. Other Transformer variants employ similar DMS strategies. For instance, Pyraformer uses a fully-connected layer concatenating Spatio-temporal
axes as the decoder. Autoformer sums up two refined decomposed features from trend-cyclical components and the stacked auto-correlation mechanism for seasonal components to get the final prediction. FEDformer also uses a decomposition scheme with the proposed frequency attention block to decode the final results.


The premise of Transformer models is the semantic correlations between paired elements, while the self-attention mechanism itself is permutation-invariant, and its capability of modeling temporal relations largely depends on positional encodings associated with input tokens. Considering the raw numerical data in time series (e.g., stock prices or electricity values), there are hardly any point-wise semantic
correlations between them. In time series modeling, we are mainly interested in the temporal relations among a continuous set of points, and the order of these elements instead of the paired relationship plays the most crucial role. While employing positional encoding and using tokens to embed sub-series facilitate preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. Due to the above observations, we are interested in revisiting the effectiveness of Transformer-based LTSF solutions.

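As a quick illustration of this permutation issue (a sketch, not the authors' code), the following shows that single-head self-attention with identity projections and no positional encoding is permutation-equivariant: permuting the input tokens merely permutes the output rows, so the attended features themselves carry no information about the original order.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention with identity Q/K/V projections, no positional encoding."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(16, 8)                # 16 time steps, 8 features
perm = torch.randperm(16)

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input only permutes the output rows (permutation equivariance).
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True
```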

4. An Embarrassingly Simple Baseline

In the experiments of existing Transformer-based LTSF solutions ($T \gg 1$), all the compared (non-Transformer) baselines are IMS forecasting techniques, which are known to suffer from significant error accumulation effects. We hypothesize that the performance improvements in these works are largely due to the DMS strategy used in them.


To validate this hypothesis, we present the simplest DMS model via a temporal linear layer, named LTSF-Linear, as a baseline for comparison. The basic formulation of LTSF-Linear directly regresses historical time series for future prediction via a weighted sum operation (as illustrated in Figure 2). The mathematical expression is $\hat{X}_i = W X_i$, where $W \in \mathbb{R}^{T \times L}$ is a linear layer along the temporal axis, and $\hat{X}_i$ and $X_i$ are the prediction and input for the $i$-th variate. Note that LTSF-Linear shares weights across different variates and does not model any spatial correlations.

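A minimal sketch of this formulation (the input layout (batch, L, C) is an illustrative convention, not necessarily the authors' exact code):

```python
import torch
import torch.nn as nn

class LTSFLinear(nn.Module):
    """Vanilla Linear: one temporal linear layer, X_hat_i = W X_i with W in R^{T x L},
    shared across all C variates (no cross-variate modeling)."""
    def __init__(self, look_back: int, horizon: int):
        super().__init__()
        self.linear = nn.Linear(look_back, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, C) -> apply the same temporal linear map to each variate
        return self.linear(x.transpose(1, 2)).transpose(1, 2)  # (batch, T, C)

model = LTSFLinear(look_back=96, horizon=720)
x = torch.randn(32, 96, 7)        # e.g., 7 variates as in the ETT datasets
print(model(x).shape)             # torch.Size([32, 720, 7])
```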

LTSF-Linear is a set of linear models. Vanilla Linear is a one-layer linear model. To handle time series across different domains (e.g., finance, traffic, and energy domains), we further introduce two variants with two preprocessing methods, named DLinear and NLinear.


(1) Specifically, DLinear is a combination of the decomposition scheme used in Autoformer and FEDformer with linear layers. It first decomposes a raw data input into a trend component, extracted by a moving average kernel, and a remainder (seasonal) component. Then, two one-layer linear layers are applied to each component, and we sum up the two features to get the final prediction. By explicitly handling the trend, DLinear enhances the performance of a vanilla Linear when there is a clear trend in the data.

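A sketch of DLinear along these lines (the (batch, L, C) layout and kernel size 25 are illustrative assumptions; this is not the authors' exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DLinear(nn.Module):
    """Moving-average decomposition plus two temporal linear layers: one for the trend,
    one for the seasonal remainder; their outputs are summed."""
    def __init__(self, look_back: int, horizon: int, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.linear_trend = nn.Linear(look_back, horizon)
        self.linear_seasonal = nn.Linear(look_back, horizon)

    def decompose(self, x: torch.Tensor):
        # x: (batch, L, C) -> trend and seasonal components, both (batch, L, C)
        pad = (self.kernel_size - 1) // 2
        x_t = x.transpose(1, 2)  # (batch, C, L)
        x_padded = F.pad(x_t, (pad, self.kernel_size - 1 - pad), mode="replicate")
        trend = F.avg_pool1d(x_padded, kernel_size=self.kernel_size, stride=1)
        return (x_t - trend).transpose(1, 2), trend.transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seasonal, trend = self.decompose(x)
        out = (self.linear_seasonal(seasonal.transpose(1, 2))
               + self.linear_trend(trend.transpose(1, 2)))
        return out.transpose(1, 2)  # (batch, T, C)

model = DLinear(look_back=96, horizon=192)
print(model(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 192, 7])
```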

(2) Meanwhile, to boost the performance of LTSF-Linear when there is a distribution shift in the dataset, NLinear first subtracts the input by the last value of the sequence. Then, the input goes through a linear layer, and the subtracted part is added back before making
the final prediction. The subtraction and addition in NLinear are a simple normalization for the input sequence.

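And a sketch of NLinear's last-value normalization (again an illustrative implementation under the same (batch, L, C) convention):

```python
import torch
import torch.nn as nn

class NLinear(nn.Module):
    """Subtract the last look-back value, apply a temporal linear layer, add the value back."""
    def __init__(self, look_back: int, horizon: int):
        super().__init__()
        self.linear = nn.Linear(look_back, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, C)
        last = x[:, -1:, :]                                   # (batch, 1, C), last observed value
        x = x - last                                          # simple normalization against distribution shift
        out = self.linear(x.transpose(1, 2)).transpose(1, 2)  # (batch, T, C)
        return out + last                                     # add the subtracted value back

model = NLinear(look_back=96, horizon=336)
print(model(torch.randn(8, 96, 7)).shape)                     # torch.Size([8, 336, 7])
```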

5. Experiments

5.1. Experimental Settings

Dataset. We conduct extensive experiments on nine widely-used real-world datasets, including ETT (Electricity Transformer Temperature) [30] (ETTh1, ETTh2, ETTm1, ETTm2), Traffic, Electricity, Weather, ILI, and Exchange-Rate [15]. All of them are multivariate time series. We leave data descriptions in the Appendix.


Evaluation metric. Following previous works [28, 30,31], we use Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the core metrics to compare performance.

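For reference, a straightforward sketch of how these two metrics are computed (averaging over all predicted points and variates):

```python
import numpy as np

def mse(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean Squared Error over all predicted points."""
    return float(np.mean((pred - true) ** 2))

def mae(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean Absolute Error over all predicted points."""
    return float(np.mean(np.abs(pred - true)))

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
true = np.array([[1.5, 2.0], [2.0, 4.5]])
print(mse(pred, true), mae(pred, true))  # 0.375 0.5
```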

Compared methods. We include five recent Transformer-based methods: FEDformer [31], Autoformer [28], Informer [30], Pyraformer [18], and LogTrans [16]. Besides, we include a naive DMS method: Closest Repeat (Repeat), which repeats the last value in the look-back window, as another simple baseline. Since there are two variants of FEDformer, we compare the one with better accuracy (FEDformer-f via Fourier transform).

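The Repeat baseline is essentially a one-liner; a sketch under the same assumed (batch, L, C) layout as the earlier model sketches:

```python
import torch

def repeat_forecast(x: torch.Tensor, horizon: int) -> torch.Tensor:
    """Closest Repeat: repeat the last value of the look-back window for all T future steps."""
    return x[:, -1:, :].expand(-1, horizon, -1)  # (batch, T, C)

x = torch.randn(8, 96, 7)
print(repeat_forecast(x, 336).shape)  # torch.Size([8, 336, 7])
```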

5.2. Comparison with Transformers

Quantitative results. In Table 2, we extensively evaluate all mentioned Transformers on nine benchmarks, following the experimental setting of previous work [28, 30, 31]. Surprisingly, the performance of LTSF-Linear surpasses the SOTA FEDformer in most cases by 20% ∼ 50% improvements on the multivariate forecasting, where LTSF-Linear does not even model correlations among variates. For different time series benchmarks, NLinear and DLinear show superiority in handling the distribution shift and trend-seasonality features. We also provide results for univariate forecasting of ETT datasets in the Appendix, where LTSF-Linear still consistently outperforms Transformer-based LTSF solutions by a large margin.


Table 2. Multivariate long-term forecasting errors in terms of MSE and MAE; the lower the better. The ILI dataset has forecasting horizons $T \in \{24, 36, 48, 60\}$; for the others, $T \in \{96, 192, 336, 720\}$. Repeat repeats the last value in the look-back window. The best results are highlighted in bold, and the best results among Transformers are underlined. Accordingly, the linear models achieve the best results compared with those of Transformer-based solutions.

FEDformer achieves competitive forecasting accuracy on ETTh1. This is because FEDformer employs classical time series analysis techniques such as frequency processing, which brings in a time series inductive bias and benefits the ability of temporal feature extraction. In summary, these results reveal that existing complex Transformer-based LTSF solutions are seemingly not effective on the existing nine benchmarks, while LTSF-Linear can be a powerful baseline.


Another interesting observation is that even though the naive Repeat method shows worse results when predicting long-term seasonal data (e.g., Electricity and Traffic), it surprisingly outperforms all Transformer-based methods on Exchange-Rate (around 45%). This is mainly caused by the wrong prediction of trends in Transformer-based solutions, which may overfit toward sudden change noises in the training data, resulting in significant accuracy degradation (see Figure 3(b)). Instead, Repeat does not have the bias.


Figure 3. Illustration of the long-term forecasting output (y-axis) of five models on Electricity, Exchange-Rate, and ETTh2, with input length L=96 and output length T=192 (x-axis).

Qualitative results. As shown in Figure 3, we plot the prediction results on three selected time series datasets with Transformer-based solutions and LTSF-Linear: Electricity (Sequence 1951, Variate 36), Exchange-Rate (Sequence 676, Variate 3), and ETTh2 (Sequence 1241, Variate 2), where these datasets have different temporal patterns. When the input length is 96 steps, and the output horizon
is 336 steps, Transformers [28, 30, 31] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformer-based solutions for the LTSF task.


5.3. More Analyses on LTSF-Transformers

Can existing LTSF-Transformers extract temporal relations well from longer input sequences? The size of the look-back window greatly impacts forecasting accuracy as it determines how much we can learn from historical data. Generally speaking, a powerful TSF model with a strong temporal relation extraction capability should be able to achieve better results with larger look-back window sizes.


To study the impact of input look-back window sizes, we conduct experiments with $L \in \{24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720\}$ for long-term forecasting (T=720). Figure 4 demonstrates the MSE results on two datasets. Similar to the observations from previous studies [27, 30], existing Transformer-based models' performance deteriorates or stays stable when the look-back window size increases. In contrast, the performances of all LTSF-Linear are significantly boosted with the increase of look-back window size. Thus, existing solutions tend to overfit temporal noises instead of extracting temporal information if given a longer sequence, and the input size 96 is exactly suitable for most Transformers.


Figure 4. The MSE results (y-axis) of models with different look-back window sizes (x-axis) for long-term forecasting (T=720) on the Traffic and Electricity datasets.

What can be learned for long-term forecasting? While the temporal dynamics in the look-back window significantly impact the forecasting accuracy of short-term time series forecasting, we hypothesize that long-term forecasting depends only on whether models can capture the trend and periodicity well. That is, the farther the forecasting horizon, the less impact the look-back window itself has.


Table 3. Comparison of different input sequences under the MSE metric to explore what LTSF-Transformers rely on. For the Close input, the 96th to 191st time steps are used as the input sequence; for the Far input, the 0th to 95th time steps are used. Both predict the 192nd to (192+720)th time steps.

To validate the above hypothesis, in Table 3, we compare the forecasting accuracy for the same future 720 time steps with data from two different look-back windows: (i) the original input L=96 setting (called Close), and (ii) the far input L=96 setting (called Far) that is before the original 96 time steps. From the experimental results, the performance of the SOTA Transformers drops slightly, indicating these models only capture similar temporal information from the adjacent time series sequence. Since capturing the intrinsic characteristics of the dataset generally does not require a large number of parameters (e.g., one parameter can represent the periodicity), using too many parameters will even cause overfitting, which partially explains why LTSF-Linear performs better than Transformer-based methods.


Is the self-attention scheme effective for LTSF? We verify whether these complex designs in the existing Transformers (e.g., Informer) are essential. In Table 4, we gradually transform Informer to Linear. First, we replace each self-attention layer with a linear layer, called Att.-Linear, since a self-attention layer can be regarded as a fully-connected layer whose weights are dynamically changed. Furthermore, we discard other auxiliary designs (e.g., FFN) in Informer to leave only embedding layers and linear layers, named Embed + Linear. Finally, we simplify the model to one linear layer. Surprisingly, the performance of Informer grows with the gradual simplification, indicating that the self-attention scheme and other complex modules are unnecessary, at least for existing LTSF benchmarks.


Can existing LTSF-Transformers preserve temporal order well? Self-attention is inherently permutation-invariant, i.e., regardless of the order. However, in time series forecasting, the sequence order often plays a crucial role. We argue that even with positional and temporal embeddings, existing Transformer-based methods still suffer from temporal information loss. In Table 5, we shuffle the raw input before the embedding strategies. Two shuffling strategies are presented: Shuf. randomly shuffles the whole input sequence and Half-Ex. exchanges the first half of the input sequence with the second half. Interestingly, compared with the original setting (Ori.) on the Exchange-Rate, the performance of all Transformer-based methods does not fluctuate even when the input sequence is randomly shuffled. On the contrary, the performance of LTSF-Linear is damaged significantly. These indicate that LTSF-Transformers with different positional and temporal embeddings preserve quite limited temporal relations and are prone to overfit on noisy financial data, while LTSF-Linear can model the order naturally and avoid overfitting with fewer parameters.


Table 5. The MSE comparisons of models when shuffling the raw input sequence. Shuf. randomly shuffles the input sequence, and Half-Ex. exchanges the first half of the input sequence with the second half. The average drop is the average performance degradation over all forecasting lengths after shuffling. All results are the average test MSE of five runs.

For the ETTh1 dataset, FEDformer and Autoformer introduce a time series inductive bias into their models, enabling them to extract certain temporal information when the dataset has clearer temporal patterns (e.g., periodicity) than Exchange-Rate. Therefore, the average drops of the two Transformers are 73.28% and 56.91% under the Shuf. setting, where the whole order information is lost. Moreover, Informer suffers less from both the Shuf. and Half-Ex. settings since it has no such temporal inductive bias. Overall, the average drops of LTSF-Linear are larger than those of Transformer-based methods for all cases, indicating that the existing Transformers do not preserve temporal order well.

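For clarity, the two shuffling protocols used in Table 5 can be sketched as follows (illustrative code under the (batch, L, C) layout assumed in the earlier sketches, not the authors' implementation):

```python
import torch

def shuf(x: torch.Tensor) -> torch.Tensor:
    """Shuf.: randomly shuffle the whole input sequence along the temporal axis."""
    perm = torch.randperm(x.shape[1])
    return x[:, perm, :]

def half_ex(x: torch.Tensor) -> torch.Tensor:
    """Half-Ex.: exchange the first half of the input sequence with the second half."""
    half = x.shape[1] // 2
    return torch.cat([x[:, half:, :], x[:, :half, :]], dim=1)

x = torch.randn(8, 96, 7)               # (batch, L, C)
print(shuf(x).shape, half_ex(x).shape)  # torch.Size([8, 96, 7]) torch.Size([8, 96, 7])
```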

How effective are different embedding strategies? We study the benefits of position and timestamp embeddings used in Transformer-based methods. In Table 6, the forecasting errors of Informer largely increase without positional embeddings (wo/Pos.). Without timestamp embeddings (wo/Temp.), the performance of Informer gradually degrades as the forecasting length increases. Since Informer uses a single time step for each token, it is necessary to introduce temporal information in tokens.


Table 6. The MSE comparisons of different embedding strategies in Transformer-based methods, with a look-back window size of 96 and forecasting lengths of {96, 192, 336, 720}.

Rather than using a single time step in each token, FEDformer and Autoformer input a sequence of timestamps to embed the temporal information. Hence, they can achieve comparable or even better performance without fixed positional embeddings. However, without timestamp embeddings, the performance of Autoformer declines rapidly because of the loss of global temporal information. Instead,
thanks to the frequency-enhanced module proposed in FEDformer to introduce temporal inductive bias, it suffers less from removing any position/timestamp embeddings.


Is training data size a limiting factor for existing LTSF-Transformers? Some may argue that the poor performance of Transformer-based solutions is due to the small sizes of the benchmark datasets. Unlike computer vision or natural language processing tasks, TSF is performed on collected time series, and it is difficult to scale up the training data size. In fact, the size of the training data would indeed have a significant impact on the model performance. Accordingly, we conduct experiments on Traffic, comparing the performance of the model trained on a full dataset (17,544*0.7 hours), named Ori., with that trained on a shortened dataset (8,760 hours, i.e., 1 year), called Short. Unexpectedly, Table 7 presents that the prediction errors with reduced training data are lower in most cases. This might be because the whole-year data maintains clearer temporal features than a longer but incomplete data size. While we cannot conclude that we should use less data for training, it demonstrates that the training data scale is not the limiting reason for the performances of Autoformer and FEDformer.



Is efficiency really a top-level priority? Existing LTSF-Transformers claim that the $O(L^2)$ complexity of the vanilla Transformer is unaffordable for the LTSF problem. Although they prove to be able to improve the theoretical time and memory complexity from $O(L^2)$ to $O(L)$, it is unclear whether 1) the actual inference time and memory cost on devices are improved, and 2) the memory issue is unacceptable and urgent for today's GPUs (e.g., an NVIDIA Titan XP here). In Table 8, we compare the average practical efficiencies over 5 runs. Interestingly, compared with the vanilla Transformer (with the same DMS decoder), most Transformer variants incur similar or even worse inference time and parameters in practice. These follow-ups introduce more additional design elements, making practical costs high. Moreover, the memory cost of the vanilla Transformer is practically acceptable, even for output length L = 720, which weakens the importance of developing a memory-efficient Transformer, at least for the existing benchmarks.


Table 8. Comparison of the practical efficiency of LTSF-Transformers under L=96 and T=720. MACs are the number of multiply-accumulate operations. DLinear is used for comparison since it has double the cost of the vanilla Linear within LTSF-Linear. The inference time is averaged over 5 runs.

6. Conclusion and Future Work

Conclusion. This work questions the effectiveness of emerging favored Transformer-based solutions for the long-term time series forecasting problem. We use an embarrassingly simple linear model, LTSF-Linear, as a DMS forecasting baseline to verify our claims. Note that our contributions do not come from proposing a linear model but rather from throwing out an important question, showing surprising comparisons, and demonstrating why LTSF-Transformers are not as effective as claimed in these works through various perspectives. We sincerely hope our comprehensive studies can benefit future work in this area.

Future work. LTSF-Linear has a limited model capacity, and it merely serves as a simple yet competitive baseline with strong interpretability for future research. For example, the one-layer linear network can hardly capture the temporal dynamics caused by change points [25]. Consequently, we believe there is great potential for new model designs, data processing, and benchmarks to tackle the challenging LTSF problem.
