
We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches, which serve as input tokens to the Transformer; (ii) channel-independence, where each channel contains a single univariate time series and all series share the same embedding and Transformer weights. The patching design has a three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend to a longer history. Our channel-independent patch time series Transformer (PatchTST) significantly improves long-term forecasting accuracy compared with SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring masked pre-trained representations from one dataset to others also produces SOTA forecasting accuracy.



Forecasting is one of the most important tasks in time series analysis. With the rapid growth of deep learning models, the number of research works on this topic has increased significantly (Bryan & Stefan, 2021; Torres et al., 2021; Lara-Benítez et al., 2021). Deep models have shown excellent performance not only on forecasting tasks, but also on representation learning, where abstract representations can be extracted and transferred to various downstream tasks such as classification and anomaly detection to attain state-of-the-art performance.

Among deep learning models, the Transformer has achieved great success in various application fields such as natural language processing (NLP) (Kalyan et al., 2021), computer vision (CV) (Khan et al., 2021), speech (Karita et al., 2019), and more recently time series (Wen et al., 2022). It benefits from its attention mechanism, which can automatically learn the connections between elements in a sequence, making it ideal for sequential modeling tasks. Informer (Zhou et al., 2021), Autoformer (Wu et al., 2021), and FEDformer (Zhou et al., 2022) are among the best variants of the Transformer model successfully applied to time series data. Unfortunately, regardless of the complicated design of Transformer-based models, a recent paper (Zeng et al., 2022) shows that a very simple linear model can outperform all of the previous models on a variety of common benchmarks, which challenges the usefulness of the Transformer for time series forecasting. In this paper, we attempt to answer this question by proposing a channel-independence patch time series Transformer (PatchTST) model that contains two key designs:

Patching. Time series forecasting aims to understand the correlation between data at different time steps. However, a single time step does not have semantic meaning like a word in a sentence, so extracting local semantic information is essential for analyzing their connections. Most previous works use only point-wise input tokens, or handcrafted information extracted from the series. In contrast, we enhance locality and capture comprehensive semantic information that is not available at the point level by aggregating time steps into subseries-level patches.


Channel-independence. A multivariate time series is a multi-channel signal, and each Transformer input token can be represented by data from either a single channel or multiple channels. Depending on the design of input tokens, different variants of the Transformer architecture have been proposed. Channel-mixing refers to the latter case, where the input token takes the vector of all time series features and projects it to the embedding space to mix information. Channel-independence, on the other hand, means that each input token only contains information from a single channel. This has been shown to work well with CNNs (Zheng et al., 2014) and linear models (Zeng et al., 2022), but has not yet been applied to Transformer-based models.

We offer a snapshot of our key results in Table 1 through a case study on the Traffic dataset, which consists of 862 time series. Our model has several advantages:




Reduction in time and space complexity: The original Transformer has $O(N^2)$ complexity in both time and space, where $N$ is the number of input tokens. Without pre-processing, $N$ equals the input sequence length $L$, which becomes a primary bottleneck of computation time and memory in practice. By applying patching, we can reduce $N$ by a factor of the stride, $N \approx L/S$, thus reducing the complexity quadratically. Table 1 illustrates the usefulness of patching. By setting patch length $P = 16$ and stride $S = 8$ with $L = 336$, the training time is reduced by as much as 22 times on large datasets.
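As a rough illustration of the quadratic saving, the sketch below counts attention-map entries for the setting quoted above ($L = 336$, $S = 8$). It relies on the approximation $N \approx L/S$ from the text; the helper name `attention_entries` is ours, and the count covers only the size of the attention map, not the measured training time reported in Table 1.

```python
def attention_entries(num_tokens):
    """Number of entries in one N x N self-attention map."""
    return num_tokens * num_tokens


L = 336  # look-back window from the text
S = 8    # stride from the text

n_point = L        # point-wise tokens: one token per time step
n_patch = L // S   # approximate token count with patching, N ~ L / S

# Ratio of attention-map sizes; under this approximation it equals S ** 2.
ratio = attention_entries(n_point) / attention_entries(n_patch)
print(n_point, n_patch, ratio)
```

Under these assumptions the attention map shrinks by a factor of $S^2$, which is what "quadratically" refers to.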

Capability of learning from a longer look-back window: Table 1 shows that by increasing the look-back window $L$ from 96 to 336, MSE can be reduced from 0.518 to 0.397. However, simply extending $L$ comes at the cost of larger memory and computational usage. Since time series often carry heavily redundant temporal information, some previous work tried to ignore parts of the data points by using downsampling or a carefully designed sparse connection of attention (Li et al., 2019), and the model still retains sufficient information to forecast well. We study the case when $L = 380$ and the time series is sampled every 4 steps with the last point added to the sequence, resulting in $N = 96$ input tokens. The model achieves a better MSE score (0.447) than using the data sequence containing the most recent 96 time steps (0.518), indicating that a longer look-back window conveys more important information even with the same number of input tokens. This raises a question: is there a way to avoid discarding values while maintaining a long look-back window? Patching is a good answer. It groups local time steps that may contain similar values while at the same time reducing the input token length for computational benefit. As evident in Table 1, the MSE score is further reduced from 0.397 to 0.367 with patching when $L = 336$.
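The downsampling experiment can be sketched as an index selection. The text only says the series is sampled every 4 steps with the last point added; starting the sampling at the first time step is our assumption, and `downsample_indices` is a hypothetical helper name.

```python
def downsample_indices(length, step):
    """Indices kept when sampling every `step` points and appending the last point."""
    idx = list(range(0, length, step))  # assumption: sampling starts at index 0
    if idx[-1] != length - 1:
        idx.append(length - 1)          # add the most recent time step
    return idx


kept = downsample_indices(380, 4)
print(len(kept))  # number of input tokens after downsampling
```

With $L = 380$ this keeps 95 regularly spaced points plus the last one, which matches the $N = 96$ tokens quoted in the text.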

Capability of representation learning: With the emergence of powerful self-supervised learning techniques, sophisticated models with multiple non-linear layers of abstraction are required to capture abstract representations of the data. Simple models such as linear ones (Zeng et al., 2022) may not be preferred for that task due to their limited expressiveness. With our PatchTST model, we not only confirm that the Transformer is actually effective for time series forecasting, but also demonstrate a representation capability that can further enhance the forecasting performance. Our PatchTST achieves the best MSE (0.349) in Table 1.

We introduce our approach in more detail and conduct extensive experiments in the following sections to support our claims. We not only demonstrate the model's effectiveness with supervised forecasting results and ablation studies, but also achieve SOTA self-supervised representation learning and transfer learning performance.



Patch in Transformer-based Models. The Transformer (Vaswani et al., 2017) has demonstrated significant potential on different data modalities. Across these applications, patching is an essential part when local semantic information is important. In NLP, BERT (Devlin et al., 2018) uses subword-based tokenization (Schuster & Nakajima, 2012) instead of character-based tokenization. In CV, Vision Transformer (ViT) (Dosovitskiy et al., 2021) is a milestone work that splits an image into 16×16 patches before feeding it into the Transformer model. Subsequent influential works such as BEiT (Bao et al., 2022) and masked autoencoders (He et al., 2021) all use patches as input. Similarly, in speech, researchers use convolutions to extract information at the sub-sequence level from raw audio input (Baevski et al., 2020; Hsu et al., 2021).

Transformer-based Long-term Time Series Forecasting. A large body of recent work applies Transformer models to long-term time series forecasting. We summarize some of it here. LogTrans (Li et al., 2019) uses convolutional self-attention layers with a LogSparse design to capture local information and reduce the space complexity. Informer (Zhou et al., 2021) proposes ProbSparse self-attention with distilling techniques to extract the most important keys efficiently. Autoformer (Wu et al., 2021) borrows the ideas of decomposition and auto-correlation from traditional time series analysis methods. FEDformer (Zhou et al., 2022) uses a Fourier enhanced structure to achieve linear complexity. Pyraformer (Liu et al., 2022) applies a pyramidal attention module with inter-scale and intra-scale connections, which also achieves linear complexity.

Most of these models focus on designing novel mechanisms to reduce the complexity of the original attention mechanism, thus achieving better forecasting performance, especially when the prediction length is long. However, most of them use point-wise attention, which ignores the importance of patches. LogTrans (Li et al., 2019) avoids a point-wise dot product between the key and query, but its value is still based on a single time step. Autoformer (Wu et al., 2021) uses auto-correlation to get patch-level connections, but it is a handcrafted design that does not include all the semantic information within a patch. Triformer (Cirstea et al., 2022) proposes patch attention, but its purpose is to reduce complexity by using a pseudo timestamp as the query within a patch; thus it neither treats a patch as an input unit nor reveals the semantic importance behind it.


Time Series Representation Learning. Besides supervised learning, self-supervised learning is also an important research topic, since it has shown the potential to learn useful representations for downstream tasks. Many non-Transformer-based models have been proposed in recent years to learn representations of time series (Franceschi et al., 2019; Tonekaboni et al., 2021; Yang & Hong, 2022; Yue et al., 2022). Meanwhile, the Transformer is known to be an ideal candidate for foundation models (Bommasani et al., 2021) and for learning universal representations. However, although attempts have been made with Transformer-based models such as time series Transformer (TST) (Zerveas et al., 2021) and TS-TCC (Eldele et al., 2021), the potential is still not fully realized.



We consider the following problem: given a collection of multivariate time series samples with look-back window $L$: $(x_1, \dots, x_L)$, where each $x_t$ at time step $t$ is a vector of dimension $M$, we would like to forecast $T$ future values $(x_{L+1}, \dots, x_{L+T})$. Our PatchTST is illustrated in Figure 1, where the model makes use of the vanilla Transformer encoder as its core architecture.


Forward Process. We denote the $i$-th univariate series of length $L$ starting at time index 1 as $x^{(i)}_{1:L} = (x^{(i)}_1, \dots, x^{(i)}_L)$, where $i = 1, \dots, M$. The input $(x_1, \dots, x_L)$ is split into $M$ univariate series $x^{(i)} \in \mathbb{R}^{1 \times L}$, each of which is fed independently into the Transformer backbone according to our channel-independence setting. The Transformer backbone then provides the prediction results $\hat{x}^{(i)} = (\hat{x}^{(i)}_{L+1}, \dots, \hat{x}^{(i)}_{L+T}) \in \mathbb{R}^{1 \times T}$ accordingly.
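One common way to realize channel-independence with shared weights is to fold the channel axis into the batch axis, so that every univariate series passes through the same backbone. The NumPy sketch below shows only this reshaping; the function names and the `(batch, time, channel)` layout are our assumptions, not something the text specifies.

```python
import numpy as np


def split_channels(batch):
    """(B, L, M) multivariate batch -> (B * M, L) stack of univariate series."""
    b, l, m = batch.shape
    return batch.transpose(0, 2, 1).reshape(b * m, l)


def merge_channels(pred, num_channels):
    """(B * M, T) per-series predictions -> (B, T, M) multivariate forecast."""
    bm, t = pred.shape
    return pred.reshape(bm // num_channels, num_channels, t).transpose(0, 2, 1)


# Toy batch: B = 2 samples, L = 5 time steps, M = 3 channels.
batch = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)
flat = split_channels(batch)             # each row is one x^(i)
restored = merge_channels(flat, 3)       # identity "backbone", so T = L here
print(flat.shape, restored.shape)
```

Because every row of `flat` is processed by the same weights, the number of channels $M$ does not appear in any parameter shape.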

Patching. Each input univariate time series $x^{(i)}$ is first divided into patches, which can be either overlapping or non-overlapping. Denote the patch length as $P$ and the stride, i.e. the non-overlapping region between two consecutive patches, as $S$. The patching process then generates a sequence of patches $x^{(i)}_p \in \mathbb{R}^{P \times N}$, where $N$ is the number of patches, $N = \lfloor (L - P)/S \rfloor + 2$. Here, we pad $S$ repeated copies of the last value $x^{(i)}_L \in \mathbb{R}$ to the end of the original sequence before patching. With the use of patches, the number of input tokens reduces from $L$ to approximately $L/S$. This implies that the memory usage and computational complexity of the attention map are quadratically decreased by a factor of $S$. Thus, under constraints on training time and GPU memory, the patch design allows the model to see a longer historical sequence, which can significantly improve the forecasting performance, as demonstrated in Table 1.
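The patching step described above can be sketched in NumPy as follows. This is a minimal illustration, assuming a 1-D array per series; `make_patches` is a hypothetical helper name, and the explicit Python loop is chosen for readability rather than speed.

```python
import numpy as np


def make_patches(series, patch_len, stride):
    """Split a 1-D series of length L into a (P, N) matrix of patches.

    The last value is repeated `stride` times at the end before patching,
    which gives N = floor((L - P) / S) + 2 patches.
    """
    padded = np.concatenate([series, np.repeat(series[-1], stride)])
    num_patches = (len(padded) - patch_len) // stride + 1
    columns = [padded[j * stride: j * stride + patch_len] for j in range(num_patches)]
    return np.stack(columns, axis=1)


series = np.arange(336.0)                       # toy series with L = 336
patches = make_patches(series, patch_len=16, stride=8)
print(patches.shape)                            # (P, N)
```

With $P = 16$ and $S = 8$, a window of $L = 336$ gives 42 patches and $L = 512$ gives 64, which agrees with the formula $N = \lfloor (L - P)/S \rfloor + 2$.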

Transformer Encoder. We use a vanilla Transformer encoder that maps the observed signals to the latent representations. The patches are mapped to the Transformer latent space of dimension $D$ via a trainable linear projection $W_p \in \mathbb{R}^{D \times P}$, and a learnable additive position encoding $W_{pos} \in \mathbb{R}^{D \times N}$ is applied to monitor the temporal order of patches: $x^{(i)}_d = W_p x^{(i)}_p + W_{pos}$, where $x^{(i)}_d \in \mathbb{R}^{D \times N}$ denotes the input that will be fed into the Transformer encoder in Figure 1. Then each head $h = 1, \dots, H$ in multi-head attention transforms them into query matrices $Q^{(i)}_h = (x^{(i)}_d)^T W^Q_h$, key matrices $K^{(i)}_h = (x^{(i)}_d)^T W^K_h$ and value matrices $V^{(i)}_h = (x^{(i)}_d)^T W^V_h$, where $W^Q_h, W^K_h \in \mathbb{R}^{D \times d_k}$ and $W^V_h \in \mathbb{R}^{D \times D}$. After that, a scaled dot product is used to obtain the attention output $O^{(i)}_h \in \mathbb{R}^{D \times N}$.
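To make the shapes concrete, here is a minimal NumPy sketch of the patch embedding and a single attention head, using the dimensions defined above. It assumes the standard scaled dot-product attention, softmax of $QK^T/\sqrt{d_k}$ applied to $V$; the function names and the random weights are placeholders for illustration only.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along one axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def embed_patches(x_p, w_p, w_pos):
    """x_d = W_p x_p + W_pos, with shapes (D, P) @ (P, N) + (D, N) -> (D, N)."""
    return w_p @ x_p + w_pos


def attention_head(x_d, w_q, w_k, w_v):
    """Single head: returns the output O_h with shape (D, N)."""
    q = x_d.T @ w_q                                  # (N, d_k)
    k = x_d.T @ w_k                                  # (N, d_k)
    v = x_d.T @ w_v                                  # (N, D)
    weights = softmax(q @ k.T / np.sqrt(w_q.shape[1]))  # (N, N) attention map
    return (weights @ v).T                           # (D, N)


rng = np.random.default_rng(0)
P, N, D, d_k = 16, 42, 8, 4                          # D and d_k are arbitrary here
x_p = rng.normal(size=(P, N))
x_d = embed_patches(x_p, rng.normal(size=(D, P)), rng.normal(size=(D, N)))
o_h = attention_head(x_d, rng.normal(size=(D, d_k)),
                     rng.normal(size=(D, d_k)), rng.normal(size=(D, D)))
print(x_d.shape, o_h.shape)
```

The attention map inside `attention_head` has $N \times N$ entries, which is where the quadratic dependence on the number of patches comes from.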

The multi-head attention block also includes BatchNorm layers and a feed-forward network with residual connections, as shown in Figure 1. Afterwards it generates the representation denoted as $z^{(i)} \in \mathbb{R}^{D \times N}$. Finally, a flatten layer with a linear head is used to obtain the prediction result $\hat{x}^{(i)} = (\hat{x}^{(i)}_{L+1}, \dots, \hat{x}^{(i)}_{L+T}) \in \mathbb{R}^{1 \times T}$.

Loss Function. We choose the MSE loss to measure the discrepancy between the prediction and the ground truth. The loss in each channel is gathered and averaged over the $M$ time series to get the overall objective loss: $\mathcal{L} = \mathbb{E}_{x} \frac{1}{M} \sum_{i=1}^{M} \left\| \hat{x}^{(i)}_{L+1:L+T} - x^{(i)}_{L+1:L+T} \right\|_2^2$.
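The objective above can be written directly in NumPy. The sketch follows the formula literally: a squared L2 norm over the horizon, averaged over channels and then over samples as a stand-in for the expectation. The `(batch, channel, horizon)` layout and the function name are our assumptions; implementations that average over the horizon as well differ from this only by a constant factor of $T$.

```python
import numpy as np


def patchtst_loss(pred, target):
    """Loss from the text for arrays of shape (B, M, T)."""
    per_series = ((pred - target) ** 2).sum(axis=-1)  # squared L2 norm, shape (B, M)
    per_sample = per_series.mean(axis=1)              # average over the M channels
    return per_sample.mean()                          # empirical expectation over samples


pred = np.ones((2, 3, 4))     # B = 2, M = 3, T = 4
target = np.zeros((2, 3, 4))
loss = patchtst_loss(pred, target)
print(loss)
```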

Instance Normalization. This technique has recently been proposed to help mitigate the distribution shift effect between the training and testing data (Ulyanov et al., 2016; Kim et al., 2022). It simply normalizes each time series instance $x^{(i)}$ to zero mean and unit standard deviation. In essence, we normalize each $x^{(i)}$ before patching, and the mean and standard deviation are added back to the output prediction.
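A minimal sketch of this normalize-then-restore step is shown below. The small `eps` added to the standard deviation is our assumption to avoid division by zero on constant series; the text does not mention it, and the function names are hypothetical.

```python
import numpy as np


def instance_normalize(x, eps=1e-5):
    """Normalize each series (last axis) to zero mean and unit standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True) + eps  # eps is an assumption, see lead-in
    return (x - mean) / std, mean, std


def denormalize(y, mean, std):
    """Restore the original scale and offset on the model output."""
    return y * std + mean


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 336))   # 4 univariate look-back windows
x_norm, mean, std = instance_normalize(x)
forecast_norm = x_norm[:, -96:]                     # stand-in for a model prediction
forecast = denormalize(forecast_norm, mean, std)    # back on the input scale
print(forecast.shape)
```

The statistics are computed per instance on the look-back window only, so the same `mean` and `std` can be applied to a forecast of any horizon.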


Self-supervised representation learning has become a popular approach to extract high-level abstract representations from unlabelled data. In this section, we apply PatchTST to obtain useful representations of multivariate time series. We will show that the learnt representation can be effectively transferred to forecasting tasks. Among popular methods for learning representations via self-supervised pre-training, the masked autoencoder has been applied successfully in the NLP (Devlin et al., 2018) and CV (He et al., 2021) domains. This technique is conceptually simple: a portion of the input sequence is intentionally removed at random and the model is trained to recover the missing contents.

Masked encoders have recently been employed in time series and delivered notable performance on classification and regression tasks (Zerveas et al., 2021). The authors proposed applying the Transformer to multivariate time series, where each input token is a vector $x_i$ consisting of the time series values at the $i$-th time step. Masking is placed randomly within each time series and across different series. However, there are two potential issues with this setting. First, masking is applied at the level of single time steps. The masked values at the current time step can be easily inferred by interpolating with the immediately preceding or succeeding time values, without a high-level understanding of the entire sequence, which deviates from our goal of learning important abstract representations of the whole signal. Zerveas et al. (2021) proposed complex randomization strategies to resolve the problem, in which groups of time series with different sizes are randomly masked.


Second, the design of the output layer for the forecasting task can be troublesome. Given the representation vectors $z_t \in \mathbb{R}^D$ corresponding to all $L$ time steps, mapping these vectors to the output containing $M$ variables, each with prediction horizon $T$, via a linear map requires a parameter matrix $W$ of dimension $(L \cdot D) \times (M \cdot T)$. This matrix can be particularly oversized if one or all of these four values are large. This may cause overfitting when downstream training samples are scarce.
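To get a feel for how large this matrix can become, the snippet below evaluates $(L \cdot D) \times (M \cdot T)$ for one setting. $L = 512$, $M = 862$ (the Traffic dataset) and $T = 96$ appear elsewhere in the text; the latent dimension $D = 128$ is purely an assumed value for illustration.

```python
def linear_head_params(look_back, latent_dim, num_series, horizon):
    """Entries in a dense map from (L * D) inputs to (M * T) outputs."""
    return (look_back * latent_dim) * (num_series * horizon)


# D = 128 is an assumption; the other values are taken from the text.
num_params = linear_head_params(look_back=512, latent_dim=128,
                                num_series=862, horizon=96)
print(f"{num_params:,} parameters in W")
```

Under these assumptions the output layer alone would hold several billion parameters, which illustrates the overfitting concern raised above.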

Our proposed PatchTST can naturally overcome the aforementioned issues. As shown in Figure 1, we use the same Transformer encoder as in the supervised setting. The prediction head is removed and a $D \times P$ linear layer is attached. As opposed to the supervised model, where patches can overlap, we divide each input sequence into regular non-overlapping patches. This conveniently ensures that observed patches do not contain information about the masked ones. We then select a subset of the patch indices uniformly at random and mask the patches at these indices with zero values. The model is trained with MSE loss to reconstruct the masked patches.
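The masking procedure can be sketched as follows, using the values reported in the experiments (input length 512, patch size 12, 40% masking ratio). Rounding the number of masked patches and computing the reconstruction loss only on masked positions are our reading of the text, and the helper names are hypothetical.

```python
import numpy as np


def mask_patches(patches, mask_ratio, rng):
    """Zero out a uniformly random subset of columns of a (P, N) patch matrix."""
    num_patches = patches.shape[1]
    num_masked = int(round(mask_ratio * num_patches))   # rounding is an assumption
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    masked = patches.copy()
    masked[:, mask] = 0.0
    return masked, mask


def masked_mse(reconstruction, target, mask):
    """MSE restricted to the masked patches (our reading of the objective)."""
    return ((reconstruction[:, mask] - target[:, mask]) ** 2).mean()


series = np.arange(512.0) + 1.0                 # toy series without zeros
num_patches = len(series) // 12                 # non-overlapping patches of length 12
patches = series[: num_patches * 12].reshape(num_patches, 12).T   # (P, N)
masked, mask = mask_patches(patches, mask_ratio=0.4, rng=np.random.default_rng(0))
print(patches.shape, int(mask.sum()))
```

Since the patches do not overlap, zeroing a column removes every occurrence of those time steps from the encoder input.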


We emphasize that each time series will have its own latent representation that is cross-learned via a shared weight mechanism. This design allows the pre-training data to contain a different number of time series than the downstream data, which may not be feasible with other approaches.




Datasets. We evaluate the performance of our proposed PatchTST on 8 popular datasets, including Weather, Traffic, Electricity, ILI and 4 ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2). These datasets have been extensively used for benchmarking and are publicly available from Wu et al. (2021). The statistics of those datasets are summarized in Table 2. We would like to highlight several large datasets: Weather, Traffic, and Electricity. They contain many more time series, so the results are more stable and less susceptible to overfitting than those on the smaller datasets.

Baselines and Experimental Settings. We choose the SOTA Transformer-based models, including FEDformer (Zhou et al., 2022), Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), Pyraformer (Liu et al., 2022), LogTrans (Li et al., 2019), and a recent non-Transformer-based model, DLinear (Zeng et al., 2022), as our baselines. All of the models follow the same experimental setup, with prediction length $T \in \{24, 36, 48, 60\}$ for the ILI dataset and $T \in \{96, 192, 336, 720\}$ for the other datasets, as in the original papers. We collect baseline results from Zeng et al. (2022) with the default look-back window $L = 96$ for Transformer-based models and $L = 336$ for DLinear. To avoid under-estimating the baselines, we also run FEDformer, Autoformer and Informer for six different look-back windows $L \in \{24, 48, 96, 192, 336, 720\}$, and always choose the best results to create strong baselines. More details about the baselines can be found in Appendix A.1.2. We calculate the MSE and MAE of multivariate time series forecasting as metrics.

Model Variants. We propose two versions of PatchTST in Table 3. PatchTST/64 means the number of input patches is 64, which uses the look-back window $L = 512$. PatchTST/42 means the number of input patches is 42, which uses the default look-back window $L = 336$. Both use patch length $P = 16$ and stride $S = 8$. Thus, we can use PatchTST/42 as a fair comparison to DLinear and other Transformer-based models, and PatchTST/64 to explore even better results on larger datasets. More experimental details are provided in Appendix A.1.

Results. Table 3 shows the multivariate long-term forecasting results. Overall, our model outperforms all baseline methods. Quantitatively, compared with the best results that Transformer-based models can offer, PatchTST/64 achieves an overall 21.0% reduction in MSE and 16.7% reduction in MAE, while PatchTST/42 attains an overall 20.2% reduction in MSE and 16.4% reduction in MAE. Compared with the DLinear model, PatchTST still outperforms it in general, especially on the large datasets (Weather, Traffic, Electricity) and the ILI dataset. We also experiment with univariate datasets; the results are provided in Appendix A.3.


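Table 3: Multivariate long-term forecasting results with supervised PatchTST. We use prediction lengths $T \in \{24, 36, 48, 60\}$ for the ILI dataset and $T \in \{96, 192, 336, 720\}$ for the others. The best results are in bold and the second best are underlined.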


In this section, we conduct experiments with masked self-supervised learning where we set the patches to be non-overlapping. Unless otherwise stated, across all representation learning experiments the input sequence length is chosen to be 512 and the patch size is set to 12, which results in 42 patches. We consider a high masking ratio where 40% of the patches are masked with zero values. We first apply self-supervised pre-training on the datasets mentioned in Section 4.1 for 100 epochs. Once the pre-trained model on each dataset is available, we perform supervised training to evaluate the learned representation with two options: (a) linear probing and (b) end-to-end fine-tuning. With (a), we only train the model head for 20 epochs while freezing the rest of the network. With (b), we apply linear probing for 10 epochs to update the model head and then fine-tune the entire network end-to-end for 20 epochs. It has been shown that a two-step strategy with linear probing followed by fine-tuning can outperform fine-tuning directly (Kumar et al., 2022). We select a few representative results below; a full benchmark can be found in Appendix A.5.

Comparison with Supervised Methods. Table 4 compares the performance of PatchTST (with fine-tuning, linear probing, and supervised training from scratch) with other supervised methods. As shown in the table, on large datasets our pre-training procedure contributes a clear improvement compared to supervised training from scratch. By fine-tuning only the model head (linear probing), the forecasting performance is already comparable with training the entire network from scratch and better than DLinear. The best results are observed with end-to-end fine-tuning. Self-supervised PatchTST significantly outperforms other Transformer-based models on all the datasets.



Transfer Learning. We test the capability of transferring the pre-trained model to downstream tasks. In particular, we pre-train the model on the Electricity dataset and fine-tune on other datasets. We observe from Table 5 that overall the fine-tuning MSE is slightly worse than pre-training and fine-tuning on the same dataset, which is reasonable. The fine-tuning performance is also worse than supervised training in some cases. However, the forecasting performance is still better than that of other models. Note that, as opposed to supervised PatchTST, where the entire model is trained for each prediction horizon, here we only retrain the linear head or the entire model for far fewer epochs, which results in a significant reduction in computation time.


Comparison with Other Self-supervised Methods. We compare our self-supervised model with BTSF (Yang & Hong, 2022), TS2Vec (Yue et al., 2022), TNC (Tonekaboni et al., 2021), and TS-TCC (Eldele et al., 2021), which are among the state-of-the-art contrastive learning representation methods for time series. We test the forecasting performance on the ETTh1 dataset, where we only apply linear probing after the learned representation is obtained (only fine-tuning the last linear layer) to make the comparison fair. Results from Table 6 strongly indicate the superior performance of PatchTST, both when pre-training on its own ETTh1 data (self-supervised columns) and when pre-training on Traffic (transferred columns).



Patching and Channel-independence. We study the effects of patching and channel-independence in Table 7. We include FEDformer as the SOTA benchmark for Transformer-based models. By comparing results with and without patching and channel-independence, one can observe that both are important factors in improving the forecasting performance. The motivation for patching is natural; furthermore, this technique improves the running time and memory consumption, as shown in Table 1, due to the shorter Transformer input sequence. Channel-independence, on the other hand, may not be as intuitive as patching in terms of its technical advantages. Therefore, we provide an in-depth analysis of the key factors that make channel-independence preferable in Appendix A.7. More ablation study results are available in Appendix A.4.


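Table 7: Ablation study of patching and channel-independence in PatchTST. Four cases are included: (a) both patching and channel-independence are included in the model (P+CI); (b) only channel-independence (CI); (c) only patching (P); (d) neither of them is included (the original TST model). PatchTST denotes supervised PatchTST/42. "-" in the table means the model runs out of GPU memory (NVIDIA A40 48GB) even with batch size 1. The best results are in bold.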

Varying Look-back Window. In principle, a longer look-back window increases the receptive field, which can potentially improve the forecasting performance. However, as argued in Zeng et al. (2022), this phenomenon has not been observed in most Transformer-based models. We also demonstrate in Figure 2 that in most cases these Transformer-based baselines do not benefit from a longer look-back window $L$, which indicates their ineffectiveness in capturing temporal information. In contrast, our PatchTST consistently reduces the MSE scores as the receptive field increases, which confirms our model's capability to learn from a longer look-back window.

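Figure 2: Forecasting performance (MSE) with varying look-back windows on 3 large datasets: Electricity, Traffic, and Weather. The look-back windows are selected to be $L = 24, 48, 96, 192, 336, 720$, and the prediction horizons are $T = 96, 720$. We use supervised PatchTST/42 and other open-source Transformer-based baselines for this experiment.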


This paper proposes an effective design of Transformer-based models for time series forecasting tasks by introducing two key components: patching and channel-independent structure. Compared to the previous works, it could capture local semantic information and benefit from longer look-back windows. We not only show that our model outperforms other baselines in supervised learning, but also prove its promising capability in self-supervised representation learning and transfer learning.


Our model exhibits the potential to be the base model for future work on Transformer-based forecasting and a building block for time series foundation models. Patching is simple but proves to be an effective operator that can be transferred easily to other models. Channel-independence, on the other hand, can be further exploited to incorporate the correlation between different channels. Modeling the cross-channel dependencies properly would be an important future step.


