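The paper proposes the model FPPformer ("we Fully develop the tried-and-tested Patch-wise attention mechanism and Pyramid architecture in both encoder and decoder and thereby propose") and has been accepted by Internet of Things Journal, a Top journal in Zone 1 of the Chinese Academy of Sciences (CAS) journal ranking.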
Recommendation: ★★★★★
Abstract
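Currently, Transformer and MLP are the two main paradigms for deep-learning time series forecasting, and the former is more widely used because of its attention mechanism and encoder-decoder architecture. However, data scientists seem more willing to invest research effort in the encoder, and the decoder is often overlooked. Some researchers even try to replace the decoder with a linear projection to reduce complexity. (Status quo: the decoder is neglected.)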
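We argue that extracting the features of the input sequence and exploring the connection between the input and prediction sequences are equally important; these are the characteristic functions of the encoder and the decoder respectively. Inspired by the success of FPN in computer vision, we propose FPPformer, which uses a bottom-up encoder and a top-down decoder to build a complete and rational hierarchy. This work also further explores patch-wise attention and combines it with an improved element-wise attention; the form of the encoder and decoder differs depending on how the attention mechanisms are combined. Extensive experiments on 12 benchmarks against 6 state-of-the-art baselines verify the strong performance of FPPformer and the importance of the decoder when a Transformer is used for real sequence forecasting. The source code is released at https://github.com/OrigamiSL/FPPformer. (The actual work: decoder, attention, experiments.)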
contribution
improve decoder
renovate decoder
diagonal-masked self-attention
combination of element-wise attention and patch-wise attention
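To make the "diagonal-masked self-attention" item concrete, here is a minimal sketch. It assumes that "diagonal-masked" means each position is barred from attending to itself; the function name `diagonal_masked_attention` is hypothetical and the learned projections are omitted, so this is an illustration rather than the authors' implementation.

```python
import math
import torch


def diagonal_masked_attention(q, k, v):
    """Scaled dot-product attention where position i cannot attend to itself.

    q, k, v: tensors of shape [..., L, D] with L >= 2 (with L == 1 every
    score would be masked and the softmax would produce NaN).
    """
    seq_len, dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(dim)            # [..., L, L]
    diag = torch.eye(seq_len, dtype=torch.bool, device=q.device)  # True on the diagonal
    scores = scores.masked_fill(diag, float("-inf"))              # block self-attention
    attn = torch.softmax(scores, dim=-1)                          # each row sums to 1
    return attn @ v, attn


torch.manual_seed(0)
x = torch.randn(2, 5, 8)
out, attn = diagonal_masked_attention(x, x, x)
print(out.shape)  # torch.Size([2, 5, 8])
```

Because the diagonal scores are set to negative infinity before the softmax, the corresponding attention weights are exactly zero and each output is a mixture of the other positions only.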
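Model architecture
The left figure shows the vanilla Transformer and the right figure shows the FPPformer proposed in this paper. Besides the improved decoder structure: ① instance normalization (IN) is applied before the input; ② self-attention is replaced by DM patch-wise plus DM element-wise attention (DM: diagonal-masked); ③ in a conventional Transformer only the output of the last encoder is fed to the decoder, whereas in this model the output of every encoder is fed to its corresponding decoder, so that information at different granularities is extracted.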
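Now look at the decoder on its own. Its input is the Position Embedding, which is fed into a decoder block; the attention mechanisms inside the decoder block are improved in the same way. Each decoder block is followed by a Split patches step. Finally, the decoder outputs and the encoder outputs are projected and concatenated, then passed through a RevIN layer to produce the final prediction.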
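The instance normalization at the input and the RevIN step at the output can be sketched as a normalize/denormalize pair. This is a simplified sketch under stated assumptions: the statistics are taken per series over the time axis, there are no learnable affine parameters, and the function names are hypothetical rather than taken from the repository.

```python
import torch


def instance_normalize(x, eps=1e-5):
    """Normalize each series of x ([B, L, V]) over its own time axis."""
    mean = x.mean(dim=1, keepdim=True)                               # [B, 1, V]
    std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)
    return (x - mean) / std, mean, std


def reverse_instance_normalize(y, mean, std):
    """Map model outputs ([B, L_out, V]) back to the original scale."""
    return y * std + mean


torch.manual_seed(0)
x = torch.randn(4, 96, 7) * 3.0 + 5.0       # 4 windows, 96 steps, 7 variables
x_norm, mean, std = instance_normalize(x)
restored = reverse_instance_normalize(x_norm, mean, std)
print(torch.allclose(restored, x, atol=1e-4))
```

The statistics come from the input window only, so the same `mean` and `std` can be reused to rescale a prediction of a different length.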
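Experimental results
Comparison experiments: short-term, long-term, and univariate forecasting
Comparison experiment details: 3 encoders and 3 decoders; initial patch length 6
Short-term forecasting: input length 96 (aligned with Autoformer); prediction lengths {96, 192, 336, 720}
Long-term forecasting: input lengths {192, 384, 576}; prediction length 720. The results averaged over 8 datasets are shown in the table below.
Univariate forecasting: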
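Ablation study
Prediction length: 720; input lengths: {96, 192, 384, 576}
As can be seen, the patch-wise component contributes the most to FPPformer. In combination 1, which removes it (only using point-wise attention), the MSE rises by 84.5%, from 0.345 to 0.637.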
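As a quick check of the ablation figure, the relative increase can be recomputed from the two MSE values quoted above. With these rounded values the ratio comes out at about 84.6%, marginally different from the quoted 84.5%, presumably because the quoted percentage was computed from unrounded MSEs. The helper name `relative_increase` is just for illustration.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


mse_full = 0.345         # full FPPformer, as quoted above
mse_point_only = 0.637   # combination 1: point-wise attention only
print(f"{relative_increase(mse_full, mse_point_only):.1%}")  # 84.6%
```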
Source code
Start with the encoder code, which contains the point-wise attention and patch-wise attention emphasized in the paper.
One small puzzle here: why are there three norm layers?
encoder
Encoder(
  (attn1): Attn_PointLevel(
    (query_projection): Linear(in_features=28, out_features=28, bias=True)
    (kv_projection): Linear(in_features=28, out_features=28, bias=True)
    (out_projection): Linear(in_features=28, out_features=28, bias=True)
    (dropout): Dropout(p=0.05, inplace=False)
  )
  (attn2): Attn_PatchLevel(
    (query_projection): Linear(in_features=168, out_features=168, bias=True)
    (kv_projection): Linear(in_features=168, out_features=168, bias=True)
    (out_projection): Linear(in_features=168, out_features=168, bias=True)
    (dropout): Dropout(p=0.05, inplace=False)
  )
  (activation): GELU(approximate='none')
  (norm1): LayerNorm((168,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((168,), eps=1e-05, elementwise_affine=True)
  (norm3): LayerNorm((168,), eps=1e-05, elementwise_affine=True)
  (dropout): Dropout(p=0.05, inplace=False)
  (linear1): Linear(in_features=168, out_features=672, bias=True)
  (linear2): Linear(in_features=672, out_features=168, bias=True)
)
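On the question of the three norm layers: one plausible reading, not checked against the repository's forward code, is that the block has three sub-layers (point-level attention, patch-level attention, feed-forward) and each gets its own residual connection followed by a LayerNorm. The sketch below illustrates that arrangement only. `ThreeNormBlock` and `diag_masked_self_attention` are hypothetical names, and the attention stand-in has no learned projections.

```python
import math
import torch
import torch.nn as nn


def diag_masked_self_attention(x):
    """Stand-in self-attention over the second-to-last axis, diagonal masked."""
    n, d = x.shape[-2], x.shape[-1]
    scores = x @ x.transpose(-2, -1) / math.sqrt(d)
    mask = torch.eye(n, dtype=torch.bool, device=x.device)
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ x


class ThreeNormBlock(nn.Module):
    """Hypothetical block: three sub-layers, each with residual + LayerNorm."""

    def __init__(self, patch_size=6, d_point=28, dropout=0.05):
        super().__init__()
        d_patch = patch_size * d_point                 # 6 * 28 = 168
        self.patch_size = patch_size
        self.norm1 = nn.LayerNorm(d_patch)             # after point-level attention
        self.norm2 = nn.LayerNorm(d_patch)             # after patch-level attention
        self.norm3 = nn.LayerNorm(d_patch)             # after the feed-forward part
        self.linear1 = nn.Linear(d_patch, 4 * d_patch)  # 168 -> 672
        self.linear2 = nn.Linear(4 * d_patch, d_patch)  # 672 -> 168
        self.activation = nn.GELU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                               # x: [B, V, P, L*D]
        B, V, P, _ = x.shape
        points = x.reshape(B, V, P, self.patch_size, -1)  # [B, V, P, L, D]
        point_out = diag_masked_self_attention(points).reshape(B, V, P, -1)
        x = self.norm1(x + self.dropout(point_out))
        x = self.norm2(x + self.dropout(diag_masked_self_attention(x)))
        y = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.norm3(x + self.dropout(y))


block = ThreeNormBlock().eval()
print(block(torch.randn(2, 7, 16, 168)).shape)  # torch.Size([2, 7, 16, 168])
```

Under this reading the three norms are simply the usual post-norm pattern repeated once per sub-layer, which would also explain why all three operate on the 168-dimensional patch representation.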
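Attention
The two attention modules share the same __init__ function and differ slightly in forward, because they expect different inputs: the patch-level input has shape (B, V, P, D), while the point-level input has shape (B, V, P, L, D).
For point-level attention the input d_model is fixed at 28; for patch-level attention it is not fixed.
(B, V, P, L, D) = (16, 7, 16, 6, 28)
Patch-level attention treats all the points of one patch as different dimensions of a single token and relates the patches to one another; point-level attention extracts the relations between the points inside a patch.
(B, V, P, D) = (16, 7, 16, 168), where 168 = 6 × 28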
A fairly complete view of the dimension changes
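The relationship between the two input shapes can be sketched as a reshape. This assumes the conversion between the point-level view and the patch-level view is a plain flattening of the last two axes (not verified against the repository), with B, V, P, L, D read as batch size, number of variables, number of patches, points per patch, and per-point feature size.

```python
import torch

B, V, P, L, D = 16, 7, 16, 6, 28

# Point-level view: every point inside a patch is its own token of size D.
point_view = torch.randn(B, V, P, L, D)

# Patch-level view: the L points of a patch are flattened into one token
# of size L * D = 168.
patch_view = point_view.reshape(B, V, P, L * D)
print(patch_view.shape)  # torch.Size([16, 7, 16, 168])

# The flattening is lossless: reshaping back recovers the point-level view,
# and the first D features of a patch token are that patch's first point.
restored = patch_view.reshape(B, V, P, L, D)
print(torch.equal(restored, point_view))
print(torch.equal(patch_view[0, 0, 0, :D], point_view[0, 0, 0, 0]))
```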
Attn_PointLevel
import math

import numpy as np
import torch
import torch.nn as nn

# OffDiagMask_PointLevel and TriangularCausalMask are mask helpers defined
# elsewhere in the repository.


class Attn_PointLevel(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super(Attn_PointLevel, self).__init__()
        self.query_projection = nn.Linear(d_model, d_model)
        # Keys and values share a single projection.
        self.kv_projection = nn.Linear(d_model, d_model)
        self.out_projection = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, mask='Diag'):
        B, V, P, L, D = queries.shape
        _, _, _, S, D = keys.shape
        scale = 1. / math.sqrt(D)

        queries = self.query_projection(queries)
        keys = self.kv_projection(keys)
        values = self.kv_projection(values)

        # Attention between the points inside each patch.
        scores = torch.einsum("bvpld,bvpmd->bvplm", queries, keys)  # [B V P L S]

        if mask == 'Diag':
            attn_mask = OffDiagMask_PointLevel(B, V, P, L, device=queries.device)  # [B V P L L]
            scores.masked_fill_(attn_mask.mask, -np.inf)
        elif mask == 'Causal':
            assert (L == S)
            attn_mask = TriangularCausalMask(B, V, P, L, device=queries.device)  # [B V P L L]
            scores.masked_fill_(attn_mask.mask, -np.inf)
        else:
            pass  # any other value: no masking

        attn = self.dropout(torch.softmax(scale * scores, dim=-1))  # [B V P L S]
        out = torch.einsum("bvplm,bvpmd->bvpld", attn, values)  # [B V P L D]
        return self.out_projection(out)  # [B V P L D]
Attn_PatchLevel
# Requires: math, numpy as np, torch, torch.nn as nn.
# OffDiagMask_PatchLevel is a mask helper defined elsewhere in the repository.
class Attn_PatchLevel(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super(Attn_PatchLevel, self).__init__()
        self.query_projection = nn.Linear(d_model, d_model)
        # Keys and values share a single projection.
        self.kv_projection = nn.Linear(d_model, d_model)
        self.out_projection = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, mask='Diag'):
        B, V, P, D = queries.shape
        _, _, S, D = keys.shape
        scale = 1. / math.sqrt(D)

        queries = self.query_projection(queries)
        keys = self.kv_projection(keys)
        values = self.kv_projection(values)

        # Attention between whole patches of the same variable.
        scores = torch.einsum("bvpd,bvsd->bvps", queries, keys)  # [B V P S]

        if mask == 'Diag':
            attn_mask = OffDiagMask_PatchLevel(B, V, P, device=queries.device)  # [B V P P]
            scores.masked_fill_(attn_mask.mask, -np.inf)
        else:
            pass  # any other value: no masking

        attn = self.dropout(torch.softmax(scale * scores, dim=-1))  # [B V P S]
        out = torch.einsum("bvps,bvsd->bvpd", attn, values)  # [B V P D]
        return self.out_projection(out)  # [B V P D]
How the hierarchy shows up
encoder: 168-336-672
decoder: 672-336-168
patch_size keeps doubling, 6 -> 12 -> 24, with d_model = 28
patch_size * d_model = 168 -> 336 -> 672
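The pyramid arithmetic above can be reproduced in a few lines. The patch sizes and the per-point width of 28 come from the printed modules; the patch counts are an inference that assumes an input length of 96 split into non-overlapping patches. `pyramid_dims` is a hypothetical helper.

```python
def pyramid_dims(input_len=96, patch_size=6, d_point=28, stages=3):
    """Return (patch_size, num_patches, patch_width) for each encoder stage."""
    rows = []
    for _ in range(stages):
        rows.append((patch_size, input_len // patch_size, patch_size * d_point))
        patch_size *= 2  # patches are merged pairwise at the next stage
    return rows


for size, count, width in pyramid_dims():
    print(f"patch_size={size:2d}  num_patches={count:2d}  width={width}")
# patch_size= 6  num_patches=16  width=168
# patch_size=12  num_patches= 8  width=336
# patch_size=24  num_patches= 4  width=672
```

Read top to bottom this is the encoder's bottom-up path (168-336-672); the decoder walks the same list in reverse (672-336-168).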
Patch-level attention in the first encoder
(attn2): Attn_PatchLevel(
  (query_projection): Linear(in_features=168, out_features=168, bias=True)
  (kv_projection): Linear(in_features=168, out_features=168, bias=True)
  (out_projection): Linear(in_features=168, out_features=168, bias=True)
  (dropout): Dropout(p=0.05, inplace=False)
)
Second encoder
(attn2): Attn_PatchLevel(
  (query_projection): Linear(in_features=336, out_features=336, bias=True)
  (kv_projection): Linear(in_features=336, out_features=336, bias=True)
  (out_projection): Linear(in_features=336, out_features=336, bias=True)
  (dropout): Dropout(p=0.05, inplace=False)
)
Third encoder
(attn2): Attn_PatchLevel(
  (query_projection): Linear(in_features=672, out_features=672, bias=True)
  (kv_projection): Linear(in_features=672, out_features=672, bias=True)
  (out_projection): Linear(in_features=672, out_features=672, bias=True)
  (dropout): Dropout(p=0.05, inplace=False)
)