ICLR 2022: Perceiver IO, a General Architecture for Structured Inputs & Outputs

Perceiver IO

A General Architecture for Structured Inputs & Outputs

2107.14795.pdf (arxiv.org)

deepmind-research/perceiver at master · deepmind/deepmind-research · GitHub

 Abstract 

A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.

In brief: current architectures bake in domain and task assumptions, or scale poorly to large inputs and outputs. Perceiver IO is a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of both inputs and outputs. It augments the Perceiver with a flexible querying mechanism that supports outputs of various sizes and semantics, removing the need for task-specific architecture engineering, and the same architecture achieves strong results across natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. Notably, Perceiver IO outperforms a Transformer-based BERT baseline on GLUE even without input tokenization, and reaches state-of-the-art Sintel optical flow estimation with no explicit multiscale correspondence mechanism.

 Model 

Perceiver IO builds on the Perceiver (ICML 2021), which achieves its cross-domain generality by assuming that its input is a simple 2D byte array: a set of elements (pixels or patches in vision, characters or words in language, or embeddings of some kind, learned or otherwise), each described by a feature vector. The model then encodes the information in the input array into a smaller number of latent feature vectors using Transformer-style attention, processes these latents iteratively, and finally aggregates them into a class label. [It is worth reading the Perceiver paper first to understand the original motivation.]

Rather than producing a single class, Perceiver IO aims to be as general over its outputs as the Perceiver is over its inputs: it should be able to produce arbitrary output arrays. Each element of the output array can be predicted by another attention module that queries the latent array with a query feature vector unique to that output element. In other words, the authors define a query array with as many elements as the desired output. Queries can be hand-designed, learned embeddings, or simple functions of the input; they attend to the latents to produce an output array of the desired shape.
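To make this querying mechanism concrete, here is a minimal PyTorch sketch (not the official JAX/Haiku code; all sizes are made up) of the key property: a cross-attention output has exactly one element per query vector, regardless of how many latent vectors are attended over.

```python
import torch

# With cross-attention, the output has one element per query vector,
# no matter how many latent vectors are attended over.
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

latents = torch.randn(1, 512, 256)   # N = 512 latent vectors of width D = 256
queries = torch.randn(1, 100, 256)   # O = 100 output queries

out, _ = attn(query=queries, key=latents, value=latents)
print(out.shape)                     # torch.Size([1, 100, 256]): one output vector per query
```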

 

Encoding, Processing, Decoding

  • Architecture overview:

Fig. 2 illustrates the Perceiver IO. We first encode by applying an attention module that maps input arrays x ∈ R^{M×C} to arrays in a latent space z∈R^{N×D}. We next process the latents z by applying a series of modules that take in and return arrays in this latent space. Finally, we decode by applying an attention module that maps latent arrays to output arrays y ∈ R^{O×E}. M, C, O, and E are properties of the task data and can be very large (Tab. 5), while N and D are hyperparameters and can be chosen to make model computation tractable. Following the design of the Perceiver, we implement each of the architecture’s components using Transformer-style attention modules.

    Each of these modules applies a global query-key-value (QKV) attention operation followed by a multi-layer perceptron (MLP). As usual in Transformer-style architectures, we apply the MLP independently to each element of the index dimension. Both encoder and decoder take in two input arrays, the first used as input to the module’s key and value networks, and the second used as input to the module’s query network. The module’s output has the same index dimension (the same number of elements) as the query input.

    This architecture can be applied to inputs of any shape or spatial layout including inputs or outputs with different spatial structure (e.g. sound and video). In contrast to latent spaces typically used in vision (e.g. Ronneberger et al. 2015) the latent does not explicitly share the structure (spatial or otherwise) of the inputs. To decode this information, we query for it using cross-attention.

Fig. 2 shows the three stages of Perceiver IO: encoding, processing, and decoding. First, an attention module encodes the input array x ∈ R^{M×C} into an array z ∈ R^{N×D} in a latent space. Next, a series of modules that take in and return arrays in this latent space processes z. Finally, another attention module decodes the latent array into the output array y ∈ R^{O×E}. M, C, O, and E are properties of the task data and can be very large (Tab. 5), while N and D are hyperparameters that can be chosen to keep the model's computation tractable.

Following the design of the Perceiver, Perceiver IO implements every component of the architecture with Transformer-style attention modules.

Each module applies a global query-key-value (QKV) attention operation followed by a multi-layer perceptron (MLP); as usual in Transformer-style architectures, the MLP is applied independently to each element of the index dimension. Both the encoder and the decoder take two input arrays: the first feeds the module's key and value networks, the second feeds its query network. The module's output has the same index dimension (the same number of elements) as the query input.

This architecture can be applied to inputs of any shape or spatial layout, including inputs and outputs with different spatial structure (e.g. sound and video). Unlike the latent spaces typically used in vision (e.g. Ronneberger et al. 2015), the latent array does not explicitly share the structure (spatial or otherwise) of the inputs, so Perceiver IO recovers that information at decoding time by querying the latents with cross-attention.
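Putting the three stages together, a rough PyTorch sketch of the encode-process-decode pipeline might look as follows. The `CrossAttendBlock` class, the head counts, and all array sizes are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class CrossAttendBlock(nn.Module):
    """QKV cross-attention followed by a per-element MLP, with residual connections.
    A rough sketch of the module described above, not DeepMind's JAX/Haiku code."""
    def __init__(self, q_dim, kv_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=q_dim, num_heads=num_heads,
                                          kdim=kv_dim, vdim=kv_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(q_dim), nn.Linear(q_dim, 4 * q_dim),
                                 nn.GELU(), nn.Linear(4 * q_dim, q_dim))

    def forward(self, q, kv):
        # The key/value networks see `kv`; the query network sees `q`.
        # The output keeps the index dimension (number of elements) of `q`.
        h = q + self.attn(query=q, key=kv, value=kv)[0]
        return h + self.mlp(h)

# Shapes follow the text: x in R^{M x C}, latents z in R^{N x D}, outputs y in R^{O x E}.
M, C, N, D, O, E = 4096, 64, 512, 256, 100, 128
x = torch.randn(1, M, C)                       # input array (batch of 1)
z = torch.randn(1, N, D)                       # latent array (a learned parameter in practice)
q_out = torch.randn(1, O, E)                   # task-specific output queries

encode = CrossAttendBlock(q_dim=D, kv_dim=C)   # latents attend to the inputs
process = nn.TransformerEncoderLayer(d_model=D, nhead=8, dim_feedforward=4 * D,
                                     batch_first=True)  # latent self-attention
decode = CrossAttendBlock(q_dim=E, kv_dim=D)   # output queries attend to the latents

y = decode(q_out, process(encode(z, x)))       # (1, O, E)
print(y.shape)
```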

  • Perceiver IO vs. Transformer

The Perceiver IO architecture builds on primitives similar to those in Transformers. Why aren't Transformers all you need? Transformers scale very poorly in both compute and memory (Tay et al., 2020), because they deploy attention modules homogeneously throughout their architecture, using the full input to generate queries and keys at every layer. This means each layer scales quadratically in compute and memory, which makes it impossible to apply Transformers to high-dimensional data like images without some form of preprocessing. Even on domains like language where Transformers shine, preprocessing (e.g. tokenization) is often needed to scale beyond short input sequences. Perceiver IO uses attention non-homogeneously by mapping inputs to a latent space, processing in that latent space, and decoding to an output space. Perceiver IO has no quadratic dependence on the input or output size: encoder and decoder attention modules depend linearly on the input and output size (respectively), while latent attention is independent of both input and output sizes (Sec. E.2). Because of the corresponding reduction in compute and memory requirements, Perceiver IO scales to much larger inputs and outputs. While Transformers are typically used in settings with data preprocessed to contain at most a few thousand dimensions (Brown et al., 2020; Raffel et al., 2020), we show good results on domains with hundreds of thousands of dimensions.

In short: Transformers scale poorly in compute and memory because they apply attention homogeneously, generating queries and keys from the full input at every layer, so each layer is quadratic in the input size. This makes them impractical on high-dimensional data such as images without some form of preprocessing, and even in language, where Transformers shine, preprocessing such as tokenization is usually needed to go beyond short input sequences.

Perceiver IO instead uses attention non-homogeneously: inputs are mapped into a latent space, processing happens in that latent space, and results are decoded into the output space. There is no quadratic dependence on input or output size: the encoder and decoder attention modules scale linearly with the input and output sizes respectively, and the latent attention is independent of both (Sec. E.2). Thanks to the corresponding reduction in compute and memory, Perceiver IO scales to much larger inputs and outputs: where Transformers are typically applied to preprocessed data with at most a few thousand dimensions, Perceiver IO shows good results on domains with hundreds of thousands of dimensions.
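As a back-of-the-envelope check of this scaling argument (the numbers below are illustrative choices, not figures taken from the paper), simply counting the entries of the QK^T score matrices already shows the gap:

```python
# Rough count of attention-score entries (sizes of the QK^T matrices); illustrative numbers only.
M = 224 * 224   # input points, e.g. the pixels of a 224x224 image
N = 512         # latent vectors
O = 224 * 224   # output points, e.g. a per-pixel prediction
L = 24          # latent self-attention layers

transformer_one_layer = M * M                    # full self-attention over the input
perceiver_io_total = M * N + L * N * N + N * O   # encode + latent processing + decode

print(f"Transformer, single layer: {transformer_one_layer:>13,}")   # ~2.5 billion
print(f"Perceiver IO, all stages : {perceiver_io_total:>13,}")      # ~58 million
```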

Query Array

Our goal is to produce a final output array of size O × E, given a latent representation of size N × D. We produce an output of this size by querying the decoder with an array of index dimension O. To capture the structure of the output space, we use queries containing the appropriate information for each output point, e.g. its spatial position or its modality.

Given a latent representation of size N × D, the goal is to produce a final output array of size O × E. Perceiver IO produces an output of this size by querying the decoder with an array of index dimension O; to capture the structure of the output space, each query carries the information appropriate to its output point, e.g. its spatial position or its modality.

 

We construct queries by combining (concatenating or adding) a set of vectors into a query vector containing all of the information relevant for one of the O desired outputs. This process is analogous to the way that positional information is used to query implicit functions like NeRF (Mildenhall et al., 2020). We illustrate the query structure for the tasks we consider here in Fig. 3. For tasks with simple outputs, such as classification, these queries can be reused for every example and can be learned from scratch. For outputs with a spatial or sequence structure, we include a position encoding (e.g. a learned positional encoding or a Fourier feature) representing the position to be decoded in the output. For outputs with a multi-task or multimodal structure, we learn a single query for each task or for each modality: this information allows the network to distinguish one task or modality query from the others, much as positional encodings allow attention to distinguish one position from another. For other tasks, the output should reflect the content of the input at the query location: for instance, for flow we find it helpful to include the input feature at the point being queried, and for StarCraft II we use the unit information to associate the model's output with the corresponding unit. We find that even very simple query features can produce good results, suggesting that the latent attention process is able to learn to organize the relevant information in a way that's easy to query.

A query is constructed by combining (concatenating or adding) a set of vectors into a single query vector containing all of the information relevant to one of the O desired outputs, much as positional information is used to query implicit functions such as NeRF.

Fig. 3 illustrates the query structure for the tasks considered. For tasks with simple outputs, such as classification, the queries can be reused for every example and learned from scratch. For outputs with a spatial or sequence structure, the query includes a position encoding (e.g. a learned positional encoding or a Fourier feature) representing the position to be decoded. For outputs with a multi-task or multi-modal structure, a single query is learned per task or per modality; this lets the network distinguish one task or modality query from the others, just as positional encodings let attention distinguish one position from another. For other tasks the output should reflect the input content at the query location: for optical flow the authors find it helpful to include the input feature at the queried point, and for StarCraft II the unit information is used to associate the model's output with the corresponding unit. Even very simple query features produce good results, suggesting that the latent attention process learns to organize the relevant information in a way that is easy to query.
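As a concrete illustration of position-encoded queries, below is a small PyTorch sketch that builds per-pixel decoder queries from Fourier features. The helper `fourier_features` and all sizes are illustrative assumptions, not the paper's exact parameterization.

```python
import math
import torch

def fourier_features(pos, num_bands=16, max_freq=32.0):
    """Fourier-feature position encoding in the spirit of Perceiver/NeRF:
    sines and cosines at several frequencies, plus the raw position.
    A sketch, not the paper's exact parameterization."""
    # pos: (num_points, num_dims), coordinates scaled to [-1, 1]
    freqs = torch.linspace(1.0, max_freq / 2.0, num_bands)              # (num_bands,)
    scaled = pos[..., None] * freqs * math.pi                           # (P, num_dims, num_bands)
    enc = torch.cat([pos[..., None], scaled.sin(), scaled.cos()], -1)   # (P, num_dims, 2*num_bands+1)
    return enc.flatten(1)                                               # (P, num_dims*(2*num_bands+1))

# Per-pixel decoder queries for an H x W output (e.g. dense optical flow).
H, W = 64, 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
positions = torch.stack([ys, xs], dim=-1).reshape(-1, 2)   # (H*W, 2)
queries = fourier_features(positions)                      # (4096, 66)

# For flow, the input feature at each queried point could be concatenated here;
# for multi-task or multi-modal outputs, a learned per-task embedding would be used instead.
```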

Each output point depends only on its query and the latent array, allowing us to decode outputs in parallel. This property allows us to amortize model training on datasets of very large output size. For example, Kinetics consists of labels, video voxels, and audio samples which together come to over 800,000 points (Tab. 5), which is prohibitively expensive to decode at once, even with linear scaling. Instead, we subsample the output array at training time and compute the loss on an affordable subset of points. At test time, we generate outputs in batches to produce the full output array.

Each output point depends only on its own query and the latent array, so outputs can be decoded in parallel. This makes it possible to amortize model training on datasets with very large outputs. Kinetics, for example, consists of labels, video voxels and audio samples that together come to over 800,000 points (Tab. 5), which would be prohibitively expensive to decode all at once, even with linear scaling. Instead, the output array is subsampled at training time and the loss is computed on an affordable subset of points; at test time, outputs are generated in batches to produce the full output array.
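A minimal PyTorch sketch of this subsampling-and-batching trick (stand-in decoder and illustrative shapes, not the authors' training code):

```python
import torch

N, D, E = 512, 256, 32                                   # latent count/width, output width (illustrative)
latents = torch.randn(1, N, D)                           # result of the latent processing stack
decoder = torch.nn.MultiheadAttention(embed_dim=E, num_heads=4, kdim=D, vdim=D, batch_first=True)

O_full, O_step = 100_000, 512                            # full output size vs. affordable per-step subset
full_queries = torch.randn(O_full, E)                    # stand-in for the full query array
full_targets = torch.randn(O_full, E)                    # stand-in for the full target array

# Training: decode and score only a random subset of output points.
idx = torch.randperm(O_full)[:O_step]
pred, _ = decoder(full_queries[idx].unsqueeze(0), latents, latents)
loss = torch.nn.functional.mse_loss(pred, full_targets[idx].unsqueeze(0))

# Test: decode every output point, but in batches of queries.
chunks = [decoder(q.unsqueeze(0), latents, latents)[0] for q in full_queries.split(4096)]
full_pred = torch.cat(chunks, dim=1)                     # (1, O_full, E)
```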

 

 

 Experiments 

The authors validate Perceiver IO on a wide range of experiments spanning language, optical flow, multimodal autoencoding, and ImageNet / StarCraft II / AudioSet. There is too much material to cover here; see the original paper for details.
