A TIME SERIES IS WORTH 64 WORDS: LONG-TERM FORECASTING WITH TRANSFORMERS


Abstract

We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which serve as input tokens to the Transformer; (ii) channel-independence, where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. The patching design naturally has a three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend to a longer history. Our channel-independent patch time series Transformer (PatchTST) can improve long-term forecasting accuracy significantly when compared with SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring the masked pre-trained representation learned on one dataset to others also produces SOTA forecasting accuracy.


1 INTRODUCTION

Forecasting is one of the most important tasks in time series analysis. With the rapid growth of deep learning models, the number of research works on this topic has increased significantly (Bryan & Stefan, 2021; Torres et al., 2021; Lara-Benítez et al., 2021). Deep models have shown excellent performance not only on forecasting tasks, but also on representation learning, where abstract representations can be extracted and transferred to various downstream tasks such as classification and anomaly detection to attain state-of-the-art performance.


Among deep learning models, Transformer has achieved great success in various application fields such as natural language processing (NLP) (Kalyan et al., 2021), computer vision (CV) (Khan et al., 2021), speech (Karita et al., 2019), and more recently time series (Wen et al., 2022), benefiting from its attention mechanism, which can automatically learn the connections between elements in a sequence and thus makes it ideal for sequential modeling tasks. Informer (Zhou et al., 2021), Autoformer (Wu et al., 2021), and FEDformer (Zhou et al., 2022) are among the best variants of the Transformer model successfully applied to time series data. Unfortunately, despite the complicated design of Transformer-based models, a recent paper (Zeng et al., 2022) shows that a very simple linear model can outperform all of the previous models on a variety of common benchmarks, which challenges the usefulness of Transformer for time series forecasting. In this paper, we attempt to answer this question by proposing a channel-independent patch time series Transformer (PatchTST) model that contains two key designs:


Patching. Time series forecasting aims to understand the correlation between data at different time steps. However, a single time step does not carry semantic meaning the way a word in a sentence does, so extracting local semantic information is essential for analyzing the connections between time steps. Most previous works only use point-wise input tokens, or merely handcrafted information from the series. In contrast, we enhance the locality and capture comprehensive semantic information that is not available at the point level by aggregating time steps into subseries-level patches.


Channel-independence. A multivariate time series is a multi-channel signal, and each Transformer input token can be represented by data from either a single channel or multiple channels. Depending on the design of input tokens, different variants of the Transformer architecture have been proposed. Channel-mixing refers to the latter case where the input token takes the vector of all time series features and projects it to the embedding space to mix information. On the other hand, channel-independence means that each input token only contains information
from a single channel. This was proven to work well with CNN (Zheng et al., 2014) and linear models (Zeng et al., 2022), but hasn’t been applied to Transformer-based models yet.


We offer a snapshot of our key results in Table 1 by doing a case study on Traffic dataset, which consists of 862 time series. Our model has several advantages:


Table 1: A case study of multivariate time series forecasting on the Traffic dataset, which consists of 862 time series. The prediction horizon is 96. Results with different look-back windows L and numbers of input tokens N are reported. The best results are in bold and the second best are underlined. Down-sampled means sampling every 4 steps and adding the last value. All results are from supervised training except the best result, which uses self-supervised learning.

Reduction in time and space complexity: The original Transformer has O(N^2) complexity in both time and space, where N is the number of input tokens. Without pre-processing, N has the same value as the input sequence length L, which becomes a primary bottleneck for computation time and memory in practice. By applying patching, we can reduce N by a factor of the stride: N ≈ L/S, thus reducing the complexity quadratically. Table 1 illustrates the usefulness of patching. By setting patch length P = 16 and stride S = 8 with L = 336, the training time is significantly reduced, by as much as a factor of 22 on large datasets.

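To make the reduction concrete, here is a quick back-of-the-envelope sketch using the setting quoted above (L = 336, P = 16, S = 8); the factor of 22 in training time is the paper's measurement, not something this arithmetic reproduces:

```python
# Token count with and without patching, and the resulting quadratic saving
# in attention-map size. Uses the setting quoted above: L = 336, P = 16, S = 8.
L, P, S = 336, 16, 8

n_pointwise = L                        # one token per time step
n_patched = (L - P) // S + 2           # patch tokens, with S last-value paddings

print(n_pointwise, n_patched)          # 336 42
print((n_pointwise / n_patched) ** 2)  # 64.0 -> attention map is ~64x smaller
```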

Capability of learning from a longer look-back window: Table 1 shows that by increasing the look-back window L from 96 to 336, MSE can be reduced from 0.518 to 0.397. However, simply extending L comes at the cost of larger memory and computational usage. Since time series often carry heavy temporal redundancy, some previous work tried to ignore parts of the data points by using downsampling or a carefully designed sparse connection of attention (Li et al., 2019), and the model still receives sufficient information to forecast well. We study the case where L = 380 and the time series is sampled every 4 steps with the last point added to the sequence, resulting in an input token count of N = 96. The model achieves a better MSE score (0.447) than using the data sequence containing the most recent 96 time steps (0.518), indicating that a longer look-back window conveys more important information even with the same number of input tokens. This leads us to a question: is there a way to avoid throwing away values while maintaining a long look-back window? Patching is a good answer. It can group local time steps that may contain similar values while at the same time enabling the model to reduce the input token length for computational benefit. As evident in Table 1, the MSE score is further reduced from 0.397 to 0.367 with patching when L = 336.


Capability of representation learning: With the emergence of powerful self-supervised learning techniques, sophisticated models with multiple non-linear layers of abstraction are required to capture abstract representations of the data. Simple models such as linear ones (Zeng et al., 2022) may not be preferred for that task due to their limited expressivity. With our PatchTST model, we not only confirm that Transformer is actually effective for time series forecasting, but also demonstrate a representation capability that can further enhance the forecasting performance. Our PatchTST achieves the best MSE (0.349) in Table 1.


We introduce our approach in more detail and conduct extensive experiments in the following sections to conclusively prove our claims. We not only demonstrate the model's effectiveness with supervised forecasting results and ablation studies, but also achieve SOTA self-supervised representation learning and transfer learning performance.


2 RELATED WORK

Patch in Transformer-based Models. Transformer (Vaswani et al., 2017) has demonstrated significant potential across different data modalities. Among all applications, patching is an essential part when local semantic information is important. In NLP, BERT (Devlin et al., 2018) considers subword-based tokenization (Schuster & Nakajima, 2012) instead of performing character-based tokenization. In CV, Vision Transformer (ViT) (Dosovitskiy et al., 2021) is a milestone work that splits an image into 16×16 patches before feeding it into the Transformer model. The following influential works such as BEiT (Bao et al., 2022) and masked autoencoders (He et al., 2021) all use patches as input. Similarly, in speech, researchers use convolutions to extract information at the sub-sequence level from raw audio input (Baevski et al., 2020; Hsu et al., 2021).


Transformer-based Long-term Time Series Forecasting. In recent years, a large body of work has tried to apply Transformer models to long-term time series forecasting. We summarize some of them here. LogTrans (Li et al., 2019) uses convolutional self-attention layers with a LogSparse design to capture local information and reduce the space complexity. Informer (Zhou et al., 2021) proposes a ProbSparse self-attention with distilling techniques to extract the most important keys efficiently. Autoformer (Wu et al., 2021) borrows the ideas of decomposition and auto-correlation from traditional time series analysis methods. FEDformer (Zhou et al., 2022) uses a Fourier-enhanced structure to achieve linear complexity. Pyraformer (Liu et al., 2022) applies a pyramidal attention module with inter-scale and intra-scale connections, which also achieves linear complexity.

Most of these models focus on designing novel mechanisms to reduce the complexity of the original attention mechanism, thus achieving better performance on forecasting, especially when the prediction length is long. However, most of the models use point-wise attention, which ignores the importance of patches. LogTrans (Li et al., 2019) avoids a point-wise dot product between the key and query, but its value is still based on a single time step. Autoformer (Wu et al., 2021) uses autocorrelation to get patch-level connections, but it is a handcrafted design which doesn't include all the semantic information within a patch. Triformer (Cirstea et al., 2022) proposes patch attention, but the purpose is to reduce complexity by using a pseudo timestamp as the query within a patch, so it neither treats a patch as an input unit nor reveals the semantic importance behind it.


Time Series Representation Learning. Besides supervised learning, self-supervised learning is also an important research topic since it has shown the potential to learn useful representations for downstream tasks. There are many non-Transformer-based models proposed in recent years to learn representations in time series (Franceschi et al., 2019; Tonekaboni et al., 2021; Yang & Hong, 2022; Yue et al., 2022). Meanwhile, Transformer is known to be an ideal candidate towards foundation models (Bommasani et al., 2021) and learning universal representations. However, although people have made attempts on Transformer-based models like time series Transformer (TST) (Zerveas et al., 2021) and TS-TCC (Eldele et al., 2021), the potential is still not fully realized yet.


3 PROPOSED METHOD

3.1 MODEL STRUCTURE

We consider the following problem: given a collection of multivariate time series samples with look-back window L: (x_1, ..., x_L), where each x_t at time step t is a vector of dimension M, we would like to forecast T future values (x_{L+1}, ..., x_{L+T}). Our PatchTST is illustrated in Figure 1, where the model makes use of the vanilla Transformer encoder as its core architecture.


Figure 1: PatchTST architecture. (a) Multivariate time series data are split into different channels. They share the same Transformer backbone, but the forward processes are independent. (b) Each channel's univariate series is passed through an instance normalization operator and segmented into patches. These patches are used as Transformer input tokens. (c) Masked self-supervised representation learning with PatchTST, where patches are randomly selected and set to zero. The model will reconstruct the masked patches.

Forward Process. We denote the i-th univariate series of length L starting at time index 1 as x^(i)_{1:L} = (x^(i)_1, ..., x^(i)_L), where i = 1, ..., M. The input (x_1, ..., x_L) is split into M univariate series x^(i) ∈ R^{1×L}, each of which is fed independently into the Transformer backbone according to our channel-independence setting. The Transformer backbone then provides prediction results x̂^(i) = (x̂^(i)_{L+1}, ..., x̂^(i)_{L+T}) ∈ R^{1×T} accordingly.

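As an illustration of this channel-independent forward process, the sketch below (our own, not the official code) folds the M channels into the batch dimension so that one shared backbone processes every univariate series independently; `backbone` stands in for the patching and Transformer stack described next.

```python
import torch

def channel_independent_forward(x: torch.Tensor, backbone) -> torch.Tensor:
    """x: (B, M, L) multivariate batch; backbone maps (B*M, L) -> (B*M, T).
    Every channel is treated as its own univariate series, but all channels
    share the same backbone weights."""
    B, M, L = x.shape
    x = x.reshape(B * M, L)        # fold channels into the batch dimension
    y = backbone(x)                # independent forward pass per channel
    return y.reshape(B, M, -1)     # (B, M, T)
```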

Patching. Each input univariate time series x^(i) is first divided into patches which can be either overlapped or non-overlapped. Denote the patch length as P and the stride - the non-overlapping region between two consecutive patches - as S; then the patching process generates a sequence of patches x^(i)_p ∈ R^{P×N}, where N is the number of patches, N = ⌊(L−P)/S⌋ + 2. Here, we pad S repeated copies of the last value x^(i)_L ∈ R to the end of the original sequence before patching. With the use of patches, the number of input tokens can be reduced from L to approximately L/S. This implies that the memory usage and computational complexity of the attention maps are quadratically decreased by a factor of S. Thus, constrained by training time and GPU memory, the patch design allows the model to see a longer historical sequence, which can significantly improve the forecasting performance, as demonstrated in Table 1.

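A minimal sketch of this patching step (our illustration, not the authors' implementation): the series is padded with S repeats of its last value and then split into patches of length P with stride S, giving N = ⌊(L − P)/S⌋ + 2 patches.

```python
import torch

def make_patches(x: torch.Tensor, P: int = 16, S: int = 8) -> torch.Tensor:
    """x: (B, L) univariate series. Returns (B, N, P) with N = (L - P) // S + 2."""
    pad = x[:, -1:].repeat(1, S)          # S repeated copies of the last value
    x = torch.cat([x, pad], dim=-1)       # (B, L + S)
    return x.unfold(-1, P, S)             # sliding windows of length P, stride S

patches = make_patches(torch.randn(4, 336))
print(patches.shape)                      # torch.Size([4, 42, 16])
```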

Transformer Encoder. We use a vanilla Transformer encoder that maps the observed signals to the latent representations. The patches are mapped to the Transformer latent space of dimension D via a trainable linear projection W_p ∈ R^{D×P}, and a learnable additive position encoding W_pos ∈ R^{D×N} is applied to monitor the temporal order of patches: x^(i)_d = W_p x^(i)_p + W_pos, where x^(i)_d ∈ R^{D×N} denotes the input that will be fed into the Transformer encoder in Figure 1. Then each head h = 1, ..., H in the multi-head attention will transform them into query matrices Q^(i)_h = (x^(i)_d)^T W^Q_h, key matrices K^(i)_h = (x^(i)_d)^T W^K_h, and value matrices V^(i)_h = (x^(i)_d)^T W^V_h, where W^Q_h, W^K_h ∈ R^{D×d_k} and W^V_h ∈ R^{D×D}. After that, a scaled dot-product is used to get the attention output O^(i)_h ∈ R^{D×N}:

(O^(i)_h)^T = Attention(Q^(i)_h, K^(i)_h, V^(i)_h) = Softmax( Q^(i)_h (K^(i)_h)^T / sqrt(d_k) ) V^(i)_h


The multi-head attention block also includes BatchNorm layers and a feed forward network with residual connections, as shown in Figure 1. Afterwards it generates the representation denoted as z^(i) ∈ R^{D×N}. Finally, a flatten layer with a linear head is used to obtain the prediction result x̂^(i) = (x̂^(i)_{L+1}, ..., x̂^(i)_{L+T}) ∈ R^{1×T}.

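Below is a compact sketch of this per-channel backbone using standard PyTorch modules. It is not the authors' implementation (for example, nn.TransformerEncoderLayer applies LayerNorm where the paper uses BatchNorm, and the hyperparameters are only placeholders), but it shows how the projection W_p, the position encoding W_pos, the encoder, and the flatten + linear head fit together.

```python
import torch
import torch.nn as nn

class PatchBackboneSketch(nn.Module):
    """Per-channel PatchTST-style backbone: patch projection, additive position
    encoding, vanilla Transformer encoder, and a flatten + linear forecasting head."""
    def __init__(self, num_patches=42, patch_len=16, d_model=128,
                 n_heads=16, n_layers=3, horizon=96):
        super().__init__()
        self.proj = nn.Linear(patch_len, d_model)                   # W_p in R^{D x P}
        self.pos = nn.Parameter(torch.randn(num_patches, d_model))  # W_pos in R^{D x N}
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(num_patches * d_model, horizon)       # flatten + linear head

    def forward(self, patches):                    # patches: (B, N, P)
        z = self.proj(patches) + self.pos          # (B, N, D)
        z = self.encoder(z)                        # latent representation z^(i)
        return self.head(z.flatten(start_dim=1))   # (B, T)

model = PatchBackboneSketch()
print(model(torch.randn(4, 42, 16)).shape)         # torch.Size([4, 96])
```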

Loss Function. We choose to use the MSE loss to measure the discrepancy between the prediction and the ground truth. The loss in each channel is gathered and averaged over M time series to get the overall objective loss: L = E_x [ (1/M) Σ_{i=1}^{M} || x̂^(i)_{L+1:L+T} − x^(i)_{L+1:L+T} ||_2^2 ].


Instance Normalization. This technique has recently been proposed to help mitigate the distribution shift effect between the training and testing data (Ulyanov et al., 2016; Kim et al., 2022). It simply normalizes each time series instance x^(i) to zero mean and unit standard deviation. In essence, we normalize each x^(i) before patching, and the mean and standard deviation are added back to the output prediction.

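A minimal sketch of that normalize-then-denormalize wrapper (a simplification in the spirit of the cited instance normalization, not the exact method of Kim et al., 2022):

```python
import torch

def instance_norm_forecast(x: torch.Tensor, forecast_fn, eps: float = 1e-5):
    """x: (B, L) series. forecast_fn maps the normalized series to a (B, T)
    prediction; the per-instance mean and std are added back to the output."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    y = forecast_fn((x - mean) / std)   # model only ever sees normalized inputs
    return y * std + mean               # restore the original scale
```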

3.2 REPRESENTATION LEARNING

Self-supervised representation learning has become a popular approach to extract high-level abstract representations from unlabelled data. In this section, we apply PatchTST to obtain useful representations of multivariate time series. We will show that the learnt representation can be effectively transferred to forecasting tasks. Among popular methods to learn representations via self-supervised pre-training, the masked autoencoder has been applied successfully to the NLP (Devlin et al., 2018) and CV (He et al., 2021) domains. This technique is conceptually simple: a portion of the input sequence is intentionally removed at random and the model is trained to recover the missing contents.


The masked encoder has recently been employed for time series and delivered notable performance on classification and regression tasks (Zerveas et al., 2021). The authors proposed to apply the multivariate time series to Transformer, where each input token is a vector x_i consisting of the time series values at the i-th time step. Masking is placed randomly within each time series and across different series. However, there are two potential issues with this setting. First, masking is applied at the level of single time steps. The masked values at the current time step can be easily inferred by interpolating from the immediately preceding or succeeding time values, without a high-level understanding of the entire sequence, which deviates from our goal of learning important abstract representations of the whole signal. Zerveas et al. (2021) proposed complex randomization strategies to resolve this problem, in which groups of time series with different sizes are randomly masked.


Second, the design of the output layer for the forecasting task can be troublesome. Given the representation vectors z_t ∈ R^D corresponding to all L time steps, mapping these vectors to an output containing M variables, each with prediction horizon T, via a linear map requires a parameter matrix W of dimension (L · D) × (M · T). This matrix can be particularly oversized if any of these four values is large. This may cause overfitting when the number of downstream training samples is scarce.

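To illustrate the scale of that matrix, the numbers below combine values that appear elsewhere in this paper (862 Traffic channels, L = 512, D = 128, T = 96, patch length 12); the combination is purely illustrative, and it is contrasted with the shared D × P reconstruction head that PatchTST uses, as described in the next paragraph.

```python
# Size of a single linear output layer mapping all L step representations
# (dimension D each) to M variables over a horizon of T, versus PatchTST's
# shared per-patch D x P head. Values are illustrative only.
L, D, M, T, P = 512, 128, 862, 96, 12

time_step_head = (L * D) * (M * T)   # (L*D) x (M*T) weight matrix
patch_head = D * P                   # shared across patches and channels

print(f"{time_step_head:,}")         # 5,423,235,072 parameters
print(f"{patch_head:,}")             # 1,536 parameters
```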

Our proposed PatchTST can naturally overcome the aforementioned issues. As shown in Figure 1, we use the same Transformer encoder as in the supervised setting. The prediction head is removed and a D × P linear layer is attached. As opposed to the supervised model where patches can be overlapped, we divide each input sequence into regular non-overlapping patches. This is for convenience, to ensure that observed patches do not contain information from the masked ones. We then select a subset of the patch indices uniformly at random and mask the patches at these selected indices with zero values. The model is trained with MSE loss to reconstruct the masked patches.

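A minimal sketch of this masked pre-training objective (our illustration with placeholder `encoder` and `recon_head` modules; the paper masks a fixed subset of indices, while this sketch draws the mask i.i.d. per patch for brevity):

```python
import torch

def masked_pretrain_loss(patches, encoder, recon_head, mask_ratio=0.4):
    """patches: (B, N, P) non-overlapping patches.
    encoder: (B, N, P) -> (B, N, D); recon_head: a D x P linear layer."""
    mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
    masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # zero the masked patches
    recon = recon_head(encoder(masked_in))                     # (B, N, P)
    return ((recon - patches) ** 2)[mask].mean()               # MSE on masked patches only
```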


We emphasize that each time series will have its own latent representation that is cross-learned via a shared weight mechanism. This design allows the pre-training data to contain a different number of time series from the downstream data, which may not be feasible with other approaches.


4 EXPERIMENTS

4.1 LONG-TERM TIME SERIES FORECASTING

Datasets. We evaluate the performance of our proposed PatchTST on 8 popular datasets, including Weather, Traffic, Electricity, ILI and 4 ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2). These datasets have been extensively utilized for benchmarking and are publicly available from (Wu et al., 2021). The statistics of those datasets are summarized in Table 2. We would like to highlight several large datasets: Weather, Traffic, and Electricity. They contain many more time series, so the results are more stable and less susceptible to overfitting than those on other smaller datasets.


Baselines and Experimental Settings. We choose the SOTA Transformer-based models, including FEDformer (Zhou et al., 2022), Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), Pyraformer (Liu et al., 2022), LogTrans (Li et al., 2019), and a recent non-Transformer-based model DLinear (Zeng et al., 2022), as our baselines. All of the models follow the same experimental setup with prediction length T ∈ {24, 36, 48, 60} for the ILI dataset and T ∈ {96, 192, 336, 720} for the other datasets, as in the original papers. We collect baseline results from Zeng et al. (2022) with the default look-back window L = 96 for Transformer-based models and L = 336 for DLinear. But in order to avoid under-estimating the baselines, we also run FEDformer, Autoformer and Informer with six different look-back windows L ∈ {24, 48, 96, 192, 336, 720}, and always choose the best results to create strong baselines. More details about the baselines can be found in Appendix A.1.2. We calculate the MSE and MAE of multivariate time series forecasting as metrics.


Model Variants. We propose two versions of PatchTST in Table 3. PatchTST/64 implies the number of input patches is 64, which uses the look-back window L = 512. PatchTST/42 means the number of input patches is 42, which has the default look-back window L = 336. Both of them use patch length P = 16 and stride S = 8. Thus, we could use PatchTST/42 as a fair comparison to DLinear and other Transformer-based models, and PatchTST/64 to explore even better results on larger datasets. More experimental details are provided in Appendix A.1.

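Both patch counts follow directly from N = ⌊(L − P)/S⌋ + 2; a quick check:

```python
def n_patches(L, P=16, S=8):
    return (L - P) // S + 2

print(n_patches(512))   # 64 -> PatchTST/64
print(n_patches(336))   # 42 -> PatchTST/42
```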

Results. Table 3 shows the multivariate long-term forecasting results. Overall, our model outperforms all baseline methods. Quantitatively, compared with the best results that Transformer-based models can offer, PatchTST/64 achieves an overall 21.0% reduction in MSE and 16.7% reduction in MAE, while PatchTST/42 attains an overall 20.2% reduction in MSE and 16.4% reduction in MAE. Compared with the DLinear model, PatchTST can still outperform it in general, especially on the large datasets (Weather, Traffic, Electricity) and the ILI dataset. We also experiment with univariate datasets, where the results are provided in Appendix A.3.


Table 3: Multivariate long-term forecasting results with supervised PatchTST. We use prediction lengths T ∈ {24, 36, 48, 60} for the ILI dataset and T ∈ {96, 192, 336, 720} for the others. The best results are in bold and the second best are underlined.

4.2 REPRESENTATION LEARNING

In this section, we conduct experiments with masked self-supervised learning where we set the patches to be non-overlapping. Unless otherwise stated, across all representation learning experiments the input sequence length is chosen to be 512 and the patch size is set to 12, which results in 42 patches. We consider a high masking ratio where 40% of the patches are masked with zero values. We first apply self-supervised pre-training on the datasets mentioned in Section 4.1 for 100 epochs. Once the pre-trained model on each dataset is available, we perform supervised training to evaluate the learned representation with two options: (a) linear probing and (b) end-to-end fine-tuning. With (a), we only train the model head for 20 epochs while freezing the rest of the network; with (b), we apply linear probing for 10 epochs to update the model head and then end-to-end fine-tune the entire network for 20 epochs. It was proven that a two-step strategy with linear probing followed by fine-tuning can outperform doing only fine-tuning directly (Kumar et al., 2022). We select a few representative results below; a full benchmark can be found in Appendix A.5.

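Sketched below is how option (b) can be wired up (hypothetical scaffolding; `train_for` stands in for an ordinary supervised training loop): freeze everything but the head for linear probing, then unfreeze the whole network for end-to-end fine-tuning.

```python
def linear_probe_then_finetune(backbone_params, head_params, train_for):
    """Option (b): 10 epochs of linear probing followed by 20 epochs of
    end-to-end fine-tuning. train_for(epochs) runs the usual training loop
    over whichever parameters currently require gradients."""
    # (1) linear probing: only the forecasting head is trainable.
    for p in backbone_params:
        p.requires_grad = False
    for p in head_params:
        p.requires_grad = True
    train_for(epochs=10)

    # (2) end-to-end fine-tuning: unfreeze the backbone as well.
    for p in backbone_params:
        p.requires_grad = True
    train_for(epochs=20)
```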

Comparison with Supervised Methods. Table 4 compares the performance of PatchTST (with fine-tuning, linear probing, and supervised training from scratch) with other supervised methods. As shown in the table, on large datasets our pre-training procedure contributes a clear improvement compared to supervised training from scratch. By just fine-tuning the model head (linear probing), the forecasting performance is already comparable with training the entire network from scratch and better than DLinear. The best results are observed with end-to-end fine-tuning. Self-supervised PatchTST significantly outperforms other Transformer-based models on all the datasets.



Transfer Learning. We test the capability of transferring the pre-trained model to downstream tasks. In particular, we pre-train the model on the Electricity dataset and fine-tune it on other datasets. We observe from Table 5 that overall the fine-tuning MSE is slightly worse than pre-training and fine-tuning on the same dataset, which is reasonable. The fine-tuning performance is also worse than supervised training in some cases. However, the forecasting performance is still better than that of other models. Note that as opposed to supervised PatchTST, where the entire model is trained for each prediction horizon, here we only retrain the linear head or the entire model for far fewer epochs, which results in a significant reduction in computational time.


Table 5: Transfer learning task: PatchTST is pre-trained on the Electricity dataset and the model is transferred to other datasets. The best results are in bold and the second best are underlined.

Comparison with Other Self-supervised Methods. We compare our self-supervised model with BTSF (Yang & Hong, 2022), TS2Vec (Yue et al., 2022), TNC (Tonekaboni et al., 2021), and TS-TCC (Eldele et al., 2021), which are among the state-of-the-art contrastive learning representation methods for time series. We test the forecasting performance on the ETTh1 dataset, where we only apply linear probing after the learned representation is obtained (only fine-tuning the last linear layer) to make the comparison fair. Results from Table 6 strongly indicate the superior performance of PatchTST, both from pre-training on its own ETTh1 data (self-supervised columns) and from pre-training on Traffic (transferred columns).


Table 6: Comparison of representation learning methods. "Transferred" means pre-training PatchTST on the Traffic dataset and transferring the representation to ETTh1, while "self-supervised" means both pre-training and linear probing are performed on ETTh1. The best and second best results are in bold and underlined. IMP. denotes the improvement of the best PatchTST result over the baselines, which ranges from 34.5% to 48.8% across different prediction lengths.

4.3 ABLATION STUDY

Patching and Channel-independence. We study the effects of patching and channel-independence in Table 7. We include FEDformer as the SOTA benchmark for Transformer-based models. By comparing results with and without the patching / channel-independence designs, one can observe that both of them are important factors in improving the forecasting performance. The motivation for patching is natural; furthermore, this technique improves the running time and memory consumption, as shown in Table 1, due to the shorter Transformer sequence input. Channel-independence, on the other hand, may not be as intuitive as patching in terms of its technical advantages. Therefore, we provide an in-depth analysis of the key factors that make channel-independence preferable in Appendix A.7. More ablation study results are available in Appendix A.4.


Table 7: Ablation study of patching and channel-independence in PatchTST. Four cases are included: (a) both patching and channel-independence are included in the model (P+CI); (b) channel-independence only (CI); (c) patching only (P); (d) neither is included (the original TST model). PatchTST here means supervised PatchTST/42. A "-" in the table means the model runs out of GPU memory (NVIDIA A40 48GB) even with batch size 1. The best results are in bold.

Varying Look-back Window. In principle, a longer look-back window increases the receptive field, which will potentially improve the forecasting performance. However, as argued in (Zeng et al., 2022), this phenomenon hasn't been observed in most of the Transformer-based models. We also demonstrate in Figure 2 that in most cases these Transformer-based baselines have not benefited from a longer look-back window L, which indicates their ineffectiveness in capturing temporal information. In contrast, our PatchTST consistently reduces the MSE score as the receptive field increases, which confirms our model's capability to learn from a longer look-back window.


Figure 2: Forecasting performance (MSE) with varying look-back windows on 3 large datasets: Electricity, Traffic, and Weather. The look-back windows are chosen to be L = 24, 48, 96, 192, 336, 720, and the prediction horizons are T = 96, 720. We use supervised PatchTST/42 and other open-source Transformer-based baselines in this experiment.

5 CONCLUSION AND FUTURE WORK

This paper proposes an effective design of Transformer-based models for time series forecasting tasks by introducing two key components: patching and channel-independent structure. Compared to the previous works, it could capture local semantic information and benefit from longer look-back windows. We not only show that our model outperforms other baselines in supervised learning, but also prove its promising capability in self-supervised representation learning and transfer learning.


Our model exhibits the potential to be the base model for future work on Transformer-based forecasting and to be a building block for time series foundation models. Patching is simple but proven to be an effective operator that can be transferred easily to other models. Channel-independence, on the other hand, can be further exploited to incorporate the correlation between different channels. Modeling cross-channel dependencies properly would be an important future step.

