LLMs / Transformer: Translation and Notes on "The Transformer model family" (Computer Vision, Natural Language Processing, Audio, Multimodal, Reinforcement Learning)
Overview: The Transformer model is widely used in natural language processing by BERT and related models. Its encoder is mainly used for feature extraction, and its decoder is used for text generation.
In computer vision, the Vision Transformer (ViT) encodes images in much the same way language models encode text and achieves strong results. Later refinements include the Swin Transformer and SegFormer. Transformers are also applied to generative and multimodal tasks such as image generation and speech recognition, typically with an encoder-decoder structure.
What all of these Transformer variants have in common is that they are built on the original Transformer architecture: some use only the encoder or only the decoder, while others use the full encoder-decoder structure. This taxonomy is a useful way to analyze the differences between models in the Transformer family and to understand Transformer variants you have not encountered before.
The article surveys Transformer-based tasks and model variants across computer vision, natural language processing, audio processing, multimodal learning, and reinforcement learning, showing how the strengths of the Transformer architecture can be applied to specific tasks. Transformer models are used in many domains, and new improvements keep appearing; this versatility ultimately comes from the architecture being general-purpose and easy to extend.
Table of Contents
"The Transformer model family": Translation and Notes
Computer vision
Convolutional network: ConvNeXt
Encoder: ViT → Swin → SegFormer → BEiT → ViTMAE
Decoder: ImageGPT
Encoder-decoder: DETR
Natural language processing
Encoder: BERT → RoBERTa → ALBERT → DistilBERT → DeBERTa → Longformer
Decoder: GPT-2 → XLNet → LLMs → GPT-J → OPT → BLOOM
Encoder-decoder: BART → Pegasus → T5
Audio
Encoder: Wav2Vec2 → HuBERT
Encoder-decoder: Speech2Text → Whisper
Multimodal
Encoder: VisualBERT → ViLT → CLIP → OWL-ViT
Encoder-decoder: TrOCR → Donut
Reinforcement learning
Decoder: Decision Transformer → Trajectory Transformer
"The Transformer model family": Translation and Notes
Date | 2022 |
Source | |
Author | huggingface |
Since its introduction in 2017, the original Transformer model has inspired many new and exciting models that extend beyond natural language processing (NLP) tasks. There are models for predicting the folded structure of proteins, training a cheetah to run, and time series forecasting. With so many Transformer variants available, it can be easy to miss the bigger picture. What all these models have in common is they’re based on the original Transformer architecture. Some models only use the encoder or decoder, while others use both. This provides a useful taxonomy to categorize and examine the high-level differences within models in the Transformer family, and it’ll help you understand Transformers you haven’t encountered before.
If you aren’t familiar with the original Transformer model or need a refresher, check out the How do Transformers work chapter from the Hugging Face course.
Link: How do Transformers work? - Hugging Face NLP Course
Computer vision
Convolutional networks (CNNs) were long the dominant models for computer vision tasks, but the Vision Transformer (ViT) achieved competitive results by processing images with a Transformer encoder and no convolutions. Models such as the Swin Transformer and SegFormer build on ViT for dense prediction tasks like segmentation and detection. BEiT and ViTMAE take inspiration from BERT's pretraining objective and apply it to image pretraining.
Convolutional network: ConvNeXt
For a long time, convolutional networks (CNNs) were the dominant paradigm for computer vision tasks until the Vision Transformer demonstrated its scalability and efficiency. Even then, some of a CNN’s best qualities, like translation invariance, are so powerful (especially for certain tasks) that some Transformers incorporate convolutions in their architecture. ConvNeXt flipped this exchange around and incorporated design choices from Transformers to modernize a CNN. For example, ConvNeXt uses non-overlapping sliding windows to patchify an image and a larger kernel to increase its global receptive field. ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers!
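ConvNeXt checkpoints ship with the Hugging Face transformers library, so a quick way to try one is the image-classification pipeline. The sketch below is a minimal usage example; the facebook/convnext-tiny-224 checkpoint and the cat.jpg path are assumptions, so swap in whatever checkpoint or image you have.

```python
# Minimal usage sketch: ConvNeXt for image classification via the transformers pipeline.
# Assumes the facebook/convnext-tiny-224 checkpoint and a local image file (cat.jpg is a placeholder).
from transformers import pipeline

classifier = pipeline("image-classification", model="facebook/convnext-tiny-224")
predictions = classifier("cat.jpg")  # path, URL, or PIL.Image
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```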
Encoder: ViT → Swin → SegFormer → BEiT → ViTMAE
The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treated an image. It splits an image into fixed-size patches and uses them to create an embedding, just like how a sentence is split into tokens. ViT capitalized on the Transformers’ efficient architecture to demonstrate competitive results with the CNNs at the time while requiring fewer resources to train. ViT was soon followed by other vision models that could also handle dense vision tasks like segmentation as well as detection.
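To make the patch arithmetic concrete: a 224x224 image cut into 16x16 patches gives (224/16)^2 = 196 patch embeddings, plus one [CLS] token, for a sequence length of 197. The sketch below checks this with a randomly generated image tensor, assuming the google/vit-base-patch16-224-in21k checkpoint.

```python
# Sketch: how ViT turns an image into a token sequence (assumes google/vit-base-patch16-224-in21k).
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
pixel_values = torch.randn(1, 3, 224, 224)   # dummy image tensor; use ViTImageProcessor for real images
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# (224 / 16) ** 2 = 196 patch embeddings, plus 1 [CLS] token -> sequence length 197
print(outputs.last_hidden_state.shape)        # torch.Size([1, 197, 768])
```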
One of these models is the Swin Transformer. It builds hierarchical feature maps (like a CNN, and unlike ViT) from smaller-sized patches and merges them with neighboring patches in deeper layers. Attention is only computed within a local window, and the window is shifted between attention layers to create connections to help the model learn better. Since the Swin Transformer can produce hierarchical feature maps, it is a good candidate for dense prediction tasks like segmentation and detection. The SegFormer also uses a Transformer encoder to build hierarchical feature maps, but it adds a simple multilayer perceptron (MLP) decoder on top to combine all the feature maps and make a prediction.
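As a quick way to see the hierarchical encoder plus MLP decoder in action, a SegFormer checkpoint fine-tuned for semantic segmentation can be run through the image-segmentation pipeline. The checkpoint name and image path below are assumptions.

```python
# Sketch: semantic segmentation with a SegFormer checkpoint
# (assumes nvidia/segformer-b0-finetuned-ade-512-512 and a placeholder image path).
from transformers import pipeline

segmenter = pipeline("image-segmentation", model="nvidia/segformer-b0-finetuned-ade-512-512")
segments = segmenter("street_scene.jpg")       # path, URL, or PIL.Image
for s in segments:
    print(s["label"], s["mask"].size)          # each entry carries a class label and a PIL mask
```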
Other vision models, like BEiT and ViTMAE, drew inspiration from BERT’s pretraining objective. BEiT is pretrained by masked image modeling (MIM); the image patches are randomly masked, and the image is also tokenized into visual tokens. BEiT is trained to predict the visual tokens corresponding to the masked patches. ViTMAE has a similar pretraining objective, except it must predict the pixels instead of visual tokens. What’s unusual is 75% of the image patches are masked! The decoder reconstructs the pixels from the masked tokens and encoded patches. After pretraining, the decoder is thrown away, and the encoder is ready to be used in downstream tasks.
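The 75% masking ratio is easy to verify from the pretraining model itself. The sketch below, assuming the facebook/vit-mae-base checkpoint, runs a dummy image through ViTMAEForPreTraining and inspects the sampled mask and the reconstruction loss.

```python
# Sketch: ViTMAE's masked-autoencoder pretraining objective (assumes facebook/vit-mae-base).
import torch
from transformers import ViTMAEForPreTraining

model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")
pixel_values = torch.randn(1, 3, 224, 224)     # dummy image; use ViTImageProcessor for real inputs
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# outputs.mask is 1 for masked patches and 0 for visible ones; roughly 75% of patches are masked.
print(outputs.mask.float().mean())              # ~0.75
print(outputs.loss)                             # pixel-reconstruction loss on the masked patches
```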
Decoder: ImageGPT
Decoder-only vision models are rare because most vision models rely on an encoder to learn an image representation. But for use cases like image generation, the decoder is a natural fit, as we’ve seen from text generation models like GPT-2. ImageGPT uses the same architecture as GPT-2, but instead of predicting the next token in a sequence, it predicts the next pixel in an image. In addition to image generation, ImageGPT could also be finetuned for image classification.
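The next-pixel objective is easiest to see in a stripped-down form. The sketch below is a conceptual stand-in rather than the actual ImageGPT implementation: it treats pixels already quantized into a small colour palette (the palette size and image size are made up) as token ids and trains a tiny GPT-2-style causal language model to predict each pixel token from the ones before it.

```python
# Conceptual sketch (not the official ImageGPT implementation): an image as a sequence of
# quantized pixel tokens, modeled with a small GPT-2-style causal LM.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

n_clusters = 512   # assumed size of the colour palette used to quantize pixels
config = GPT2Config(vocab_size=n_clusters, n_positions=32 * 32, n_embd=256, n_layer=4, n_head=8)
model = GPT2LMHeadModel(config)

pixels = torch.randint(0, n_clusters, (1, 32 * 32))   # a fake 32x32 image, already quantized
out = model(input_ids=pixels, labels=pixels)          # causal LM loss = next-pixel prediction
print(out.loss)
```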
Encoder-decoder: DETR
Vision models commonly use an encoder (also known as a backbone) to extract important image features before passing them to a Transformer decoder. DETR has a pretrained backbone, but it also uses the complete Transformer encoder-decoder architecture for object detection. The encoder learns image representations and combines them with object queries (each object query is a learned embedding that focuses on a region or object in an image) in the decoder. DETR predicts the bounding box coordinates and class label for each object query.
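A minimal inference sketch, assuming the facebook/detr-resnet-50 checkpoint and a placeholder image path: each object query becomes a (class, bounding box) prediction, and low-confidence queries are filtered out during post-processing.

```python
# Sketch: end-to-end object detection with DETR (assumes the facebook/detr-resnet-50 checkpoint).
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("street_scene.jpg")          # placeholder path; use any local image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Each object query is turned into a (class, bounding box) prediction; low-confidence queries are dropped.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```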
Natural language processing
BERT is an encoder-only Transformer that randomly masks tokens in the input and predicts the masked tokens from their context, which lets it learn deeper and richer input representations. GPT-2 is a decoder-only Transformer that generates text by predicting the next word in a sequence. Large language models (LLMs) such as GPT-J, OPT, and BLOOM achieve few-shot or even zero-shot learning by pretraining on massive datasets.
BART and Pegasus are encoder-decoder models used for text generation and summarization. T5 casts every NLP task as a text-to-text problem and uses task-specific prefixes to distinguish task types.
Encoder: BERT → RoBERTa → ALBERT → DistilBERT → DeBERTa → Longformer
BERT is an encoder-only Transformer that randomly masks certain tokens in the input to avoid seeing other tokens, which would allow it to “cheat”. The pretraining objective is to predict the masked token based on the context. This allows BERT to fully use the left and right contexts to help it learn a deeper and richer representation of the inputs. However, there was still room for improvement in BERT’s pretraining strategy. RoBERTa improved upon this by introducing a new pretraining recipe that includes training for longer and on larger batches, randomly masking tokens at each epoch instead of just once during preprocessing, and removing the next-sentence prediction objective.
The dominant strategy to improve performance is to increase the model size. But training large models is computationally expensive. One way to reduce computational costs is using a smaller model like DistilBERT. DistilBERT uses knowledge distillation - a compression technique - to create a smaller version of BERT while keeping nearly all of its language understanding capabilities.
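The masked-token objective carries over directly to inference. Below is a minimal sketch with the fill-mask pipeline, assuming the bert-base-uncased checkpoint (distilbert-base-uncased can be dropped in the same way).

```python
# Sketch: BERT's masked-token prediction through the fill-mask pipeline
# (assumes the bert-base-uncased checkpoint; distilbert-base-uncased works the same way).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The goal of pretraining is to predict the [MASK] token."):
    print(f"{candidate['token_str']}: {candidate['score']:.3f}")
```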
However, most Transformer models continued to trend towards more parameters, leading to new models focused on improving training efficiency. ALBERT reduces memory consumption by lowering the number of parameters in two ways: separating the larger vocabulary embedding into two smaller matrices and allowing layers to share parameters. DeBERTa added a disentangled attention mechanism where the word and its position are separately encoded in two vectors. The attention is computed from these separate vectors instead of a single vector containing the word and position embeddings. Longformer also focused on making attention more efficient, especially for processing documents with longer sequence lengths. It uses a combination of local windowed attention (attention only calculated from fixed window size around each token) and global attention (only for specific task tokens like [CLS] for classification) to create a sparse attention matrix instead of a full attention matrix.
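The sparse attention pattern shows up in the API as an extra global_attention_mask argument. The sketch below, assuming the allenai/longformer-base-4096 checkpoint, gives only the first (classification) token global attention while every other token uses the local window.

```python
# Sketch: Longformer's sparse attention pattern (assumes the allenai/longformer-base-4096 checkpoint).
# Every token attends within a local window; only tokens flagged in global_attention_mask attend globally.
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document would go here. " * 50, return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1                # give the first (classification) token global attention

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```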
Decoder: GPT-2 → XLNet → LLMs → GPT-J → OPT → BLOOM
GPT-2 is a decoder-only Transformer that predicts the next word in the sequence. It masks tokens to the right so the model can’t “cheat” by looking ahead. By pretraining on a massive body of text, GPT-2 became really good at generating text, even if the text is only sometimes accurate or true. But GPT-2 lacked the bidirectional context from BERT’s pretraining, which made it unsuitable for certain tasks. XLNet combines the best of both BERT and GPT-2’s pretraining objectives by using a permutation language modeling objective (PLM) that allows it to learn bidirectionally.
After GPT-2, language models grew even bigger and are now known as large language models (LLMs). LLMs demonstrate few- or even zero-shot learning if pretrained on a large enough dataset. GPT-J is an LLM with 6B parameters and trained on 400B tokens. GPT-J was followed by OPT, a family of decoder-only models, the largest of which is 175B and trained on 180B tokens. BLOOM was released around the same time, and the largest model in the family has 176B parameters and is trained on 366B tokens in 46 languages and 13 programming languages.
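A minimal generation sketch with the text-generation pipeline, assuming the gpt2 checkpoint; sampling is enabled so that multiple continuations can be returned.

```python
# Sketch: autoregressive text generation with GPT-2 (assumes the gpt2 checkpoint).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
outputs = generator(
    "The Transformer architecture is",
    max_new_tokens=30,
    do_sample=True,            # sampling lets us ask for several different continuations
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```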
Encoder-decoder: BART → Pegasus → T5
BART keeps the original Transformer architecture, but it modifies the pretraining objective with text infilling corruption, where some text spans are replaced with a single mask token. The decoder predicts the uncorrupted tokens (future tokens are masked) and uses the encoder’s hidden states to help it. Pegasus is similar to BART, but Pegasus masks entire sentences instead of text spans. In addition to masked language modeling, Pegasus is pretrained by gap sentence generation (GSG). The GSG objective masks whole sentences important to a document, replacing them with a mask token. The decoder must generate the output from the remaining sentences. T5 is a more unique model that casts all NLP tasks into a text-to-text problem using specific prefixes. For example, the prefix Summarize: indicates a summarization task. T5 is pretrained by supervised (GLUE and SuperGLUE) training and self-supervised training (randomly sample and drop out 15% of tokens).
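Because the task is selected purely by the input prefix, the same T5 model and the same generate call can handle different tasks. A sketch assuming the t5-small checkpoint:

```python
# Sketch: T5's text-to-text interface (assumes the t5-small checkpoint).
# The task is selected purely by the prefix of the input string.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

for prompt in [
    "summarize: The Transformer model family covers vision, language, audio, multimodal and RL models.",
    "translate English to German: The house is wonderful.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```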
Audio
Wav2Vec2 and HuBERT use a Transformer encoder to learn speech representations directly from raw audio waveforms. Speech2Text and Whisper are automatic speech recognition (ASR) models that use an encoder-decoder structure to generate transcripts.
Encoder: Wav2Vec2 → HuBERT
Wav2Vec2 uses a Transformer encoder to learn speech representations directly from raw audio waveforms. It is pretrained with a contrastive task to determine the true speech representation from a set of false ones. HuBERT is similar to Wav2Vec2 but has a different training process. Target labels are created by a clustering step in which segments of similar audio are assigned to a cluster which becomes a hidden unit. The hidden unit is mapped to an embedding to make a prediction.
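After CTC fine-tuning, the Wav2Vec2 encoder can be used for transcription directly. The sketch below shows that downstream usage rather than the contrastive pretraining itself; it assumes the facebook/wav2vec2-base-960h checkpoint and a 16 kHz mono file named speech.wav.

```python
# Sketch: transcribing speech with a CTC-fine-tuned Wav2Vec2 encoder
# (assumes the facebook/wav2vec2-base-960h checkpoint and a 16 kHz mono WAV file).
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sampling_rate = sf.read("speech.wav")        # placeholder path; 1-D float samples at 16 kHz
inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # one prediction per audio frame
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])        # CTC decoding collapses repeats and blanks
```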
Encoder-decoder: Speech2Text → Whisper
Speech2Text is a speech model designed for automatic speech recognition (ASR) and speech translation. The model accepts log mel-filter bank features extracted from the audio waveform and is pretrained autoregressively to generate a transcript or translation. Whisper is also an ASR model, but unlike many other speech models, it is pretrained on a massive amount of ✨ labeled ✨ audio transcription data for zero-shot performance. A large chunk of the dataset also contains non-English languages, meaning Whisper can also be used for low-resource languages. Structurally, Whisper is similar to Speech2Text. The audio signal is converted to a log-mel spectrogram, which is encoded by the encoder. The decoder generates the transcript autoregressively from the encoder’s hidden states and the previous tokens.
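Because Whisper is trained on labeled transcriptions, it transcribes zero-shot out of the box. A minimal sketch with the automatic-speech-recognition pipeline, assuming the openai/whisper-tiny checkpoint and a placeholder audio path:

```python
# Sketch: zero-shot transcription with Whisper via the ASR pipeline
# (assumes the openai/whisper-tiny checkpoint and a local audio file).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("speech.wav")["text"])
```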
Multimodal
VisualBERT and ViLT are multimodal models for vision-language tasks; they feed image embeddings into a Transformer together with text embeddings. CLIP jointly trains an image encoder and a text encoder for image-text matching. OCR models such as TrOCR and Donut use Transformers for text recognition in images and for document understanding.
Encoder: VisualBERT → ViLT → CLIP → OWL-ViT
VisualBERT is a multimodal model for vision-language tasks released shortly after BERT. It combines BERT and a pretrained object detection system to extract image features into visual embeddings, passed alongside text embeddings to BERT. VisualBERT predicts the masked text based on the unmasked text and the visual embeddings, and it also has to predict whether the text is aligned with the image. When ViT was released, ViLT adopted ViT in its architecture because it was easier to get the image embeddings this way. The image embeddings are jointly processed with the text embeddings. From there, ViLT is pretrained by image text matching, masked language modeling, and whole word masking.
CLIP takes a different approach and makes a pair prediction of (image, text). An image encoder (ViT) and a text encoder (Transformer) are jointly trained on a 400 million (image, text) pair dataset to maximize the similarity between the image and text embeddings of the (image, text) pairs. After pretraining, you can use natural language to instruct CLIP to predict the text given an image or vice versa. OWL-ViT builds on top of CLIP by using it as its backbone for zero-shot object detection. After pretraining, an object detection head is added to make a set prediction over the (class, bounding box) pairs.
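CLIP's image-text similarity scores can be used directly for zero-shot classification. The sketch below assumes the openai/clip-vit-base-patch32 checkpoint, a placeholder image path, and a made-up label set.

```python
# Sketch: CLIP's image-text similarity used for zero-shot classification
# (assumes the openai/clip-vit-base-patch32 checkpoint and a local image).
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # similarity between the image and each caption
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.3f}")
```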
Encoder-decoder: TrOCR → Donut
Optical character recognition (OCR) is a long-standing text recognition task that typically involves several components to understand the image and generate the text. TrOCR simplifies the process using an end-to-end Transformer. The encoder is a ViT-style model for image understanding and processes the image as fixed-size patches. The decoder accepts the encoder’s hidden states and autoregressively generates text. Donut is a more general visual document understanding model that doesn’t rely on OCR-based approaches. It uses a Swin Transformer as the encoder and multilingual BART as the decoder. Donut is pretrained to read text by predicting the next word based on the image and text annotations. The decoder generates a token sequence given a prompt. The prompt is represented by a special token for each downstream task. For example, document parsing has a special parsing token that is combined with the encoder hidden states to parse the document into a structured output format (JSON).
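A minimal end-to-end TrOCR sketch, assuming the microsoft/trocr-base-handwritten checkpoint and a placeholder image of a single text line: the ViT-style encoder consumes the image patches and the decoder generates the text autoregressively.

```python
# Sketch: end-to-end text recognition with TrOCR
# (assumes the microsoft/trocr-base-handwritten checkpoint and an image of a single text line).
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("handwritten_line.png").convert("RGB")                  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values   # ViT-style patches
generated_ids = model.generate(pixel_values)                               # autoregressive decoding
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```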
Reinforcement learning
The Decision Transformer and Trajectory Transformer apply Transformers to reinforcement learning by modeling states, actions, and rewards as a sequence prediction problem.
Decoder: Decision Transformer → Trajectory Transformer
The Decision Transformer and Trajectory Transformer cast the states, actions, and rewards as a sequence modeling problem. The Decision Transformer generates a series of actions that lead to a future desired return based on returns-to-go, past states, and actions. For the last K timesteps, each of the three modalities are converted into token embeddings and processed by a GPT-like model to predict a future action token. The Trajectory Transformer also tokenizes the states, actions, and rewards and processes them with a GPT architecture. Unlike the Decision Transformer, which is focused on reward conditioning, the Trajectory Transformer generates future actions with beam search.
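To make the interface concrete, the sketch below builds a randomly initialized Decision Transformer from the transformers library and feeds it a dummy window of returns-to-go, states, actions, and rewards for the last K timesteps. The state and action dimensions, sequence length, and inputs are made-up values for illustration only.

```python
# Conceptual sketch: the (return-to-go, state, action) interface of a Decision Transformer,
# using randomly initialized weights and made-up dimensions for illustration.
import torch
from transformers import DecisionTransformerConfig, DecisionTransformerModel

config = DecisionTransformerConfig(state_dim=11, act_dim=3)    # assumed dims for a small control task
model = DecisionTransformerModel(config)

batch, seq_len = 1, 20                                         # the last K = 20 timesteps
states = torch.randn(batch, seq_len, config.state_dim)
actions = torch.randn(batch, seq_len, config.act_dim)
rewards = torch.randn(batch, seq_len, 1)
returns_to_go = torch.randn(batch, seq_len, 1)                 # desired future return at each step
timesteps = torch.arange(seq_len).unsqueeze(0)
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        states=states, actions=actions, rewards=rewards,
        returns_to_go=returns_to_go, timesteps=timesteps, attention_mask=attention_mask,
    )
print(outputs.action_preds.shape)    # (batch, seq_len, act_dim): predicted next actions
```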