[Paper Translation] RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE [To Be Continued]


Contents

0 ABSTRACT

1 INTRODUCTION

2 RELATED WORK

3 PRELIMINARIES

4 SYSTEM OVERVIEW

5 RT-1: ROBOTICS TRANSFORMER

5.1 MODEL

5.2 DATA

References:


0 ABSTRACT

By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io.

1 INTRODUCTION

End-to-end robotic learning, with either imitation or reinforcement, typically involves collecting task-specific data in either single-task (Kalashnikov et al., 2018; Zhang et al., 2018) or multi-task (Kalashnikov et al., 2021b; Jang et al., 2021) settings that are narrowly tailored to the tasks that the robot should perform. This workflow mirrors the classic approach to supervised learning in other domains, such as computer vision and NLP, where task-specific datasets would be collected, labeled, and deployed to solve individual tasks, with little interplay between the tasks themselves. Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets. The keys to the success of such models lie with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the knowledge present in large-scale datasets. If a model can "sponge up" experience to learn general patterns in language or perception, then it can bring them to bear on individual tasks more efficiently. While removing the need for large task-specific datasets is appealing generally in supervised learning, it is even more critical in robotics, where datasets might require engineering-heavy autonomous operation or expensive human demonstrations. We therefore ask: can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks? And does such a model enjoy the benefits observed in other domains, exhibiting zero-shot generalization to new tasks, environments, and objects?
Building such models in robotics is not easy. Although recent years have seen several large multitask robot policies proposed in the literature (Reed et al., 2022; Jang et al., 2021), such models often have limited breadth of real-world tasks, as with Gato (Reed et al., 2022), or focus on training tasks rather than generalization to new tasks, as with recent instruction following methods (Shridhar et al., 2021; 2022), or attain comparatively lower performance on new tasks (Jang et al., 2021).
The two main challenges lie in assembling the right dataset and designing the right model. While data collection and curation is often the "unsung hero" of many large-scale machine learning projects (Radford et al., 2021; Ramesh et al., 2021), this is especially true in robotics, where datasets are often robot-specific and gathered manually (Dasari et al., 2019; Ebert et al., 2021). As we will show in our evaluations, good generalization requires datasets that combine both scale and breadth, covering a variety of tasks and settings. At the same time, the tasks in the dataset should be sufficiently well-connected to enable generalization, such that the model can discover the patterns between structurally similar tasks and perform new tasks that combine those patterns in novel ways. We utilize a dataset that we gathered over the course of 17 months with a fleet of 13 robots, containing ~130k episodes and over 700 tasks, and we ablate various aspects of this dataset in our evaluation.
The second challenge lies in the design of the model itself. Effective robotic multi-task learning requires a high capacity model, and Transformer (Vaswani et al., 2017) models excel in this regard, particularly when it is necessary to learn many tasks conditioned, as in our case, on language instructions. However, robotic controllers must also be efficient enough to run in real time, which presents a major challenge for Transformers in particular. We propose a novel architecture that we call RT-1 (Robotics Transformer 1), which by encoding high-dimensional inputs and outputs, including camera images, instructions and motor commands into compact token representations to be used by the Transformer, allows for efficient inference at runtime to make real-time control feasible.
Our contribution is the RT-1 model and experiments with this model on a large and broad dataset of real-world robotic tasks. Our experiments not only demonstrate that RT-1 can exhibit significantly improved generalization and robustness compared to prior techniques, but also evaluate and ablate many design choices in both the model and in the composition of the training set. Our results show that RT-1 can perform over 700 training instructions at 97% success rate, and can generalize to new tasks, distractors, and backgrounds 25%, 36% and 18% better than the next best baseline, respectively. This level of performance allows us to execute very long-horizon tasks in the SayCan (Ahn et al., 2022) framework, with as many as 50 stages. We further show that RT-1 can incorporate data from simulation or even other robot types, retaining performance on the original tasks and improving generalization to new scenarios. A short overview of RT-1 capabilities is presented in Fig. 1b.

2 RELATED WORK

A number of recent works have proposed Transformer-based policies for robotic control. As in RT-1, several works use language commands processed with Transformers as a robust framework for specifying and generalizing to new tasks (Zhang & Chai, 2021; Pashevich et al., 2021; Silva et al., 2021; Jang et al., 2021; Ahn et al., 2022; Nair et al., 2022). Our work takes the application of Transformers a step further and treats the mapping of language and vision observations to robot actions as a sequence modelling problem, using a Transformer to learn this mapping. This idea is directly inspired by successes in game-playing (Chen et al., 2021; Lee et al., 2022a) as well as simulated robot navigation (Fang et al., 2019), locomotion (Janner et al., 2021; Gupta et al., 2022), and manipulation (Jiang et al., 2022) environments. We note that several of these works go beyond only text conditioning and use Transformers to also generalize across robot morphologies (e.g., Gupta et al. (2022)) and other modalities for task specifications (e.g., Jang et al. (2021); Jiang et al. (2022)). These extensions are promising future directions for RT-1.
Beyond Transformer-based policies, the focus of our work is on generalizable and robust real-world robotic manipulation at scale. Existing works on real-world Transformer-based robotic manipulation focus on efficiently learning tasks from a set of demonstrations per task (Shridhar et al., 2022). Behavior Transformer (Shafiullah et al., 2022) and Gato (Reed et al., 2022) advocate for training a single model on large-scale robotic and non-robotic datasets. However, these works are limited in their real-world robotic tasks; e.g., Gato effectively learns a single task (colored block stacking) without evaluating generalization to new tasks or a variety of real-world settings. On the technical side, our work examines how Transformer-based policies can be built so as to combine high capacity and generalization with the computational efficiency necessary for real-time control.
While the use of high-capacity Transformer models to learn robotic control policies is a fairly recent innovation, robotics has a long history of multi-task and language-conditioned learning, and RT-1 builds on these foundations. A significant body of work deals with learning policies and predictive models for robotic grasping (Saxena et al., 2006; Lenz et al., 2015; Pinto & Gupta, 2016; Gupta et al., 2018; Viereck et al., 2017), with the aim of generalizing to new objects. Prior works have sought to address robotic language understanding through pipelined approaches that combine language parsing, vision, and robotic control (MacMahon et al., 2006; Kollar et al., 2010; Tellex et al., 2011) and with end-to-end approaches (Mei et al., 2016; Stepputtis et al., 2020; Lynch & Sermanet, 2020; Ahn et al., 2022). Multi-task robotic learning has also been approached from the perspective of learning to reach goals (Chung et al., 2015; Raffin et al., 2019; Jurgenson et al., 2020; Huang et al., 2020), as well as learning policies that can perform tasks in a discrete set or some other parameterized form (Deisenroth et al., 2014; Devin et al., 2017; Fox et al., 2019; Kalashnikov et al., 2021a). A number of prior works in robotics have also focused on collecting datasets containing demonstrations or trials that illustrate a variety of different tasks (Sharma et al., 2018; Dasari et al., 2019; Yu et al., 2020; Singh et al., 2020; James et al., 2020). Our work adds further evidence in support of the power of multi-task, language-conditioned robotic learning, presenting experimental results at a larger scale and with a greater variety of behaviors, objects, and scenes and proposing new architectures and design choices that enable robotic learning at a significantly larger scale.

3 PRELIMINARIES

Robot learning. We aim to learn robot policies to solve language-conditioned tasks from vision. Formally, we consider a sequential decision-making environment. At timestep t = 0, the policy \pi is presented with a language instruction i and an initial image observation x_0. The policy produces an action distribution \pi(\cdot | i, x_0) from which an action a_0 is sampled and applied to the robot. This process continues, with the policy iteratively producing actions a_t by sampling from a learned distribution \pi(\cdot | i, \{x_j\}_{j=0}^t) and applying those actions to the robot. The interaction ends when a termination condition is achieved. The full interaction i, \{(x_j, a_j)\}_{j=0}^T from the starting step t = 0 to the terminating step T is referred to as an episode. At the end of an episode, the agent is given a binary reward r \in \{0, 1\} indicating whether the robot performed the instruction i. The goal is to learn a policy \pi that maximizes the average reward, in expectation over a distribution of instructions, starting states x_0, and transition dynamics.
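A minimal sketch of this interaction loop (`policy` and `robot` are hypothetical interfaces for illustration, not code from the paper):

```python
# Sketch of one episode under a language-conditioned policy pi(a | i, x_0..x_t).
# `policy` and `robot` are hypothetical interfaces, not the RT-1 codebase.

def run_episode(policy, robot, instruction, max_steps=100):
    images = [robot.get_image()]                     # x_0: initial image observation
    for t in range(max_steps):
        action = policy.sample(instruction, images)  # a_t ~ pi(. | i, x_0..x_t)
        if action.is_terminate:                      # termination condition reached
            break
        robot.apply(action)                          # execute a_t on the robot
        images.append(robot.get_image())             # observe x_{t+1}
    return robot.task_success(instruction)           # binary reward r in {0, 1}
```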
Transformers. RT-1 uses a Transformer (Vaswani et al., 2017) to parameterize the policy \pi. Generally speaking, a Transformer is a sequence model mapping an input sequence \{\xi_h\}_{h=0}^H to an output sequence \{y_k\}_{k=0}^K using combinations of self-attention layers and fully-connected neural networks. While Transformers were originally designed for text sequences, where each input \xi_h and output y_k represents a text token, they have been extended to images (Parmar et al., 2018) as well as other modalities (Lee et al., 2022a; Reed et al., 2022). As detailed in the next section, we parameterize \pi by first mapping the inputs i, \{x_j\}_{j=0}^t to a sequence \{\xi_h\}_{h=0}^H and the action outputs a_t to a sequence \{y_k\}_{k=0}^K, before using a Transformer to learn the mapping \{\xi_h\}_{h=0}^H \rightarrow \{y_k\}_{k=0}^K.
Imitation learning. Imitation learning methods train the policy \pi on a dataset D of demonstrations (Pomerleau, 1988; Zhang et al., 2018; Jang et al., 2021). Specifically, we assume access to a dataset D = \{(i^{(n)}, \{(x_t^{(n)}, a_t^{(n)})\}_{t=0}^{T^{(n)}})\}_{n=0}^N of episodes, all of which are successful (i.e., have a final reward of 1). We learn \pi using behavioral cloning (Pomerleau, 1988), which optimizes \pi by minimizing the negative log-likelihood of the actions a_t given the images and language instructions.
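A minimal sketch of this behavioral cloning objective, assuming the per-dimension action discretization described later in Sec. 5.1 (`model` and its output shape are hypothetical):

```python
import torch
import torch.nn.functional as F

# Behavioral cloning step: minimize the negative log-likelihood of the
# demonstrated actions. Assumes actions are discretized into 256 bins per
# dimension (Sec. 5.1), so the NLL reduces to a per-dimension cross-entropy.
# `model` is a hypothetical policy network, not the released RT-1 code.

def bc_loss(model, instruction_emb, image_history, expert_action_bins):
    # logits: (batch, action_dims, 256) -- one categorical per action dimension
    logits = model(instruction_emb, image_history)
    # expert_action_bins: (batch, action_dims) integer bin indices in [0, 255]
    return F.cross_entropy(
        logits.reshape(-1, 256),          # (batch * action_dims, 256)
        expert_action_bins.reshape(-1),   # (batch * action_dims,)
    )
```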

4 SYSTEM OVERVIEW

The goal of this work is to build and demonstrate a general robot learning system that can absorb large amounts of data and generalize effectively. We use mobile manipulators from Everyday Robots, which have a 7 degree-of-freedom arm, a two-fingered gripper, and a mobile base (see Fig. 2 (d)). To collect data and evaluate our method, we use three kitchen-based environments: two real office kitchens and a training environment modelled off these real kitchens. The training environment, shown in Fig. 2 (a), consists of partial counters and is constructed for large scale data collection. The two real environments, shown in Fig. 2 (b, c), have similar counter tops to the training environment, but vary in lighting, background, and full kitchen geometry (e.g., there may be a cabinet instead of a drawer or a sink may be visible). We evaluate the performance of our policies across these different environments, measuring the policy's performance and ability to generalize.
Our training data consists of human-provided demonstrations, and we annotate each episode with a textual description of the instruction that the robot just performed. The instructions usually contain a verb and one or more nouns describing the target objects. To group these instructions together, we split them into a number of skills (e.g., verbs such as "pick", "open" or "place upright") and objects (e.g., nouns such as "coke can", "apple", or "drawer"). We describe the details of our data collection strategy at scale in Sec. 5.2. Our largest dataset contains over 130k individual demonstrations constituting over 700 distinct task instructions using a large variety of objects (see Fig. 2 (f)). We describe the details of the data collected in Sec. 5.2.
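Purely for illustration, a naive verb-prefix split of an instruction into a (skill, object) pair, assuming the "verb + nouns" structure described above (the skill list is a made-up subset, not the paper's full skill set):

```python
# Illustrative only: split "verb + nouns" instructions into (skill, object).
# SKILLS is a hypothetical subset; real grouping may be more involved.
SKILLS = ("pick", "open", "close")

def split_instruction(instruction: str):
    for skill in SKILLS:
        if instruction.startswith(skill + " "):
            return skill, instruction[len(skill) + 1:]
    return None, instruction  # no known verb prefix

print(split_instruction("pick coke can"))  # ('pick', 'coke can')
```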
One of the main contributions of our system is the network architecture, Robotics Transformer 1 (RT-1), an efficient model that can absorb large amounts of data, effectively generalize, and output actions at real-time rates for practical robotic control. RT-1 takes a short sequence of images and a natural language instruction as input and outputs an action for the robot at each time step. To this end, the architecture (shown in Figure 1a) leverages several elements: first the images and text are processed via an ImageNet pretrained convolutional network (Tan & Le, 2019) conditioned on a pretrained embedding of the instruction via FiLM (Perez et al., 2018), followed by a TokenLearner (Ryoo et al., 2021) to compute a compact set of tokens, and finally a Transformer (Vaswani et al., 2017) to attend over these tokens and produce discretized action tokens. The actions consist of seven dimensions for the arm movement (x, y, z, roll, pitch, yaw, opening of the gripper), three dimensions for base movement (x, y, yaw) and a discrete dimension to switch between three modes: controlling the arm, the base, or terminating the episode. RT-1 performs closed-loop control and commands actions at 3 Hz until it either yields a "terminate" action or hits a pre-set time step limit.
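To make this data flow concrete, below is a toy-scale PyTorch sketch with stand-in modules; the FiLM language conditioning is omitted here and sketched separately in Sec. 5.1, and none of these layers are the actual RT-1 implementation:

```python
import torch
import torch.nn as nn

# Toy-scale sketch of the RT-1 data flow. The real FiLM EfficientNet-B3,
# TokenLearner, and decoder-only Transformer are replaced by small stand-ins
# so the sketch runs standalone; shapes follow the text (6 frames,
# 81 -> 8 tokens per image, 48 tokens into the Transformer, 256 bins).

HIST, D, N_BINS, ACT_DIMS = 6, 512, 256, 11

backbone = nn.Conv2d(3, D, kernel_size=33, stride=33)   # stand-in: 300x300 -> 9x9 map
token_learner = nn.Linear(81, 8)                        # stand-in: 81 -> 8 tokens
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=8)  # stand-in (RT-1 is decoder-only)
action_head = nn.Linear(D, ACT_DIMS * N_BINS)           # 11 dims x 256 bins

images = torch.randn(1, HIST, 3, 300, 300)              # history of 6 RGB frames
feats = backbone(images.flatten(0, 1)).flatten(2)       # (6, 512, 81): 81 visual tokens
tokens = token_learner(feats).transpose(1, 2)           # (6, 8, 512): 8 tokens per image
seq = tokens.reshape(1, HIST * 8, D)                    # 48 tokens for the Transformer
logits = action_head(transformer(seq).mean(dim=1))      # pool, then per-dim bin logits
print(logits.reshape(1, ACT_DIMS, N_BINS).shape)        # torch.Size([1, 11, 256])
```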

Figure 2: (a) Robot classroom where we collect data at scale; (b) a real office kitchen, one of the two realistic environments used for evaluation (named Kitchen1 in the rest of the paper); (c) a different office kitchen used for evaluation (named Kitchen2 in the rest of the paper); (d) mobile manipulator used throughout the paper; (e) a set of objects used for most of the skills to expand skill diversity; (f) a more diverse set of objects used mostly to expand object diversity of the picking skill.


5 RT-1: ROBOTICS TRANSFORMER

In this section, we describe how we tokenize the images, text, and actions, and then discuss the RT-1 model architecture. We then describe how we attain the runtime speed required for real-time control. Lastly, we describe the data collection procedure and the skills and instructions in our dataset.

5.1 MODEL

Our model is built on a Transformer architecture (Vaswani et al., 2017) and takes a history of images and task description as input and directly outputs tokenized actions, as shown in Fig. 1a and in detail in Fig. 3. In the following we describe the components of the model, following the top-to-bottom order in Fig. 3. More details on model selection at scale are provided in Appendix C.3.
Instruction and image tokenization. The RT-1 architecture relies on a data-efficient and compact tokenization of images and language instruction. RT-1 tokenizes a history of 6 images by passing images through an ImageNet pretrained EfficientNet-B3 (Tan & Le, 2019) model, which takes 6 images of resolution 300 × 300 as input and outputs a spatial feature map of shape 9 × 9 × 512 from the final convolutional layer. Unlike Reed et al. (2022), we do not patchify the images into visual tokens prior to feeding them to our Transformer backbone. We instead flatten the output feature map from the EfficientNet into 81 visual tokens which are passed on to the later layers of the network.
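A shape-level sketch of this flattening step (the tensor is random, standing in for the real EfficientNet-B3 feature map):

```python
import torch

# The final 9x9x512 feature map is flattened into 81 visual tokens of
# width 512, per the text above.
feature_map = torch.randn(1, 512, 9, 9)                  # (batch, channels, 9, 9)
visual_tokens = feature_map.flatten(2).transpose(1, 2)   # (1, 81, 512)
print(visual_tokens.shape)                               # torch.Size([1, 81, 512])
```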
To include the language instruction, we condition the image tokenizer on the natural language instruction in the form of a pretrained language embedding, allowing extraction of task-relevant image features early on and improving performance of RT-1. The instruction is first embedded via Universal Sentence Encoder (Cer et al., 2018). This embedding is then used as input to identity-initialized FiLM layers (Perez et al., 2018) added to the pretrained EfficientNet to condition the image encoder. Normally, inserting a FiLM layer into the interior of a pretrained network would disrupt the intermediate activations and negate the benefit of using pretrained weights. To overcome this, we initialize the weights of the dense layers (f_C and h_C) which produce the FiLM affine transformation to zero, allowing the FiLM layer to initially act as an identity and preserve the function of the pretrained weights. We find that identity-initialized FiLM also produces better results when training with an EfficientNet initialized from scratch, without ImageNet pretraining, but it does not surpass the initialization described above. The architecture of the image tokenizer is presented in Fig. 3.
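A minimal sketch of the identity-initialization trick, assuming a standard FiLM affine transformation (the paper's dense layers f_C and h_C are called to_gamma / to_beta here):

```python
import torch
import torch.nn as nn

# Identity-initialized FiLM: the dense layers producing the affine
# (gamma, beta) are zero-initialized, so the layer starts as the identity
# and preserves the pretrained features, as described above.
class IdentityInitFiLM(nn.Module):
    def __init__(self, feat_channels, emb_dim):
        super().__init__()
        self.to_gamma = nn.Linear(emb_dim, feat_channels)
        self.to_beta = nn.Linear(emb_dim, feat_channels)
        for layer in (self.to_gamma, self.to_beta):
            nn.init.zeros_(layer.weight)  # zero weights ...
            nn.init.zeros_(layer.bias)    # ... and biases: gamma = beta = 0 at init

    def forward(self, features, text_emb):
        # features: (B, C, H, W); text_emb: (B, emb_dim)
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        return (1 + gamma) * features + beta  # identity when gamma = beta = 0

film = IdentityInitFiLM(512, 512)
x = torch.randn(2, 512, 9, 9)
assert torch.allclose(film(x, torch.randn(2, 512)), x)  # acts as identity at init
```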
RT-1's image and instruction tokenization via FiLM EfficientNet-B3 is a total of 16M parameters, with 26 layers of MBConv blocks and FiLM layers, which output 81 vision-language tokens.
TokenLearner. To further compress the number of tokens that RT-1 needs to attend over and thus speed up inference, RT-1 uses TokenLearner (Ryoo et al., 2021). TokenLearner is an element-wise attention module that learns to map a large number of tokens into a much smaller number of tokens. This allows us to soft-select image tokens based on their information, passing only the important token combinations to the subsequent Transformer layers. The inclusion of TokenLearner subsamples the 81 visual tokens that come out of the pre-trained FiLM-EfficientNet layers to just 8 final tokens that are then passed on to our Transformer layers.
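A simplified sketch of this soft selection (not the exact TokenLearner module from Ryoo et al. (2021), which uses small MLP/conv attention heads rather than a single linear layer):

```python
import torch
import torch.nn as nn

# Simplified TokenLearner-style soft selection: learn 8 attention maps over
# the 81 input tokens and output 8 weighted combinations of them.
class TokenLearnerSketch(nn.Module):
    def __init__(self, dim=512, num_out=8):
        super().__init__()
        self.attn = nn.Linear(dim, num_out)  # per-token score for each output slot

    def forward(self, tokens):                          # tokens: (B, 81, dim)
        weights = self.attn(tokens).softmax(dim=1)      # (B, 81, 8), normalized over inputs
        return torch.einsum("bnd,bns->bsd", tokens, weights)  # (B, 8, dim)

tl = TokenLearnerSketch()
print(tl(torch.randn(1, 81, 512)).shape)  # torch.Size([1, 8, 512])
```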
Transformer. These 8 tokens per-image are then concatenated with the other images in the history, forming 48 total tokens (with added position encoding) to be fed into the Transformer backbone of RT-1. The Transformer is a decoder-only sequence model with 8 self-attention layers and 19M total parameters that outputs action tokens.
Action tokenization. To tokenize actions, each action dimension in RT-1 is discretized into 256 bins. As mentioned previously, the action dimensions we consider include seven variables for the arm movement (x, y, z, roll, pitch, yaw, opening of the gripper), three variables for base movement (x, y, yaw) and a discrete variable to switch between three modes: controlling arm, base or terminating the episode. For each variable, we map the target to one of the 256 bins, where the bins are uniformly distributed within the bounds of each variable.
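A sketch of this uniform binning for a single action dimension (the bounds here are illustrative, not the robot's actual limits):

```python
# Uniform 256-bin tokenization of one action dimension, per the text.
def to_bin(value, low, high, n_bins=256):
    idx = int((value - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)               # clip into [0, n_bins - 1]

def from_bin(idx, low, high, n_bins=256):
    return low + (idx + 0.5) * (high - low) / n_bins  # map back to the bin center

print(to_bin(0.03, -0.1, 0.1))    # e.g. a 3 cm x-translation -> bin 166
print(from_bin(166, -0.1, 0.1))   # ~0.0301
```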
Loss. We use a standard categorical cross-entropy objective and causal masking that was utilized in prior Transformer-based controllers (Reed et al., 2022; Lee et al., 2022a).
Inference speed. In contrast to many applications of large models, such as natural language or image generation, one of the unique requirements for a model that needs to run on real robots in real time is fast and consistent inference speed. Given the human speeds of executing the instructions considered in this work (which we measured to be in the 2-4 second range), we want the model to be not significantly slower than that. Based on our experiments, this requirement corresponds to at least 3 Hz control frequency and a resulting inference time budget for the model, given other latencies in the system, of less than 100 ms.
This requirement limits the size of the model that we can use. We further explore the impact of model size on inference speed in the experiments. We employ two techniques to speed up inference: (i) reduce the number of tokens generated by the pre-trained EfficientNet model by using TokenLearner (Ryoo et al., 2021), and (ii) compute these tokens only once and reuse them for the following overlapping windows in future inferences. Both of these allow us to speed up model inference by 2.4 and 1.7 times, respectively. Additional details on model inference are in Appendix C.1.
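A sketch of the token reuse in (ii), with stand-in stubs for the tokenizer and Transformer (hypothetical names, not the released implementation):

```python
from collections import deque
import torch

# Inference-time token reuse: image tokens are computed once per frame and
# cached, so each 3 Hz control step only tokenizes the newest image while
# the 6-frame window slides forward.

def tokenize_image(image):                    # stand-in for FiLM-EfficientNet + TokenLearner
    return torch.randn(8, 512)                # 8 tokens of width 512

def decode_actions(window_tokens):            # stand-in for the Transformer backbone
    return torch.zeros(11, dtype=torch.long)  # stub ignores input; 11 action bins

token_cache = deque(maxlen=6)                 # sliding window of the last 6 frames

def control_step(new_image):
    token_cache.append(tokenize_image(new_image))  # only the newest frame is tokenized
    window = torch.cat(list(token_cache))          # up to 6 * 8 = 48 tokens
    return decode_actions(window)

action = control_step(torch.randn(3, 300, 300))
```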

5.2 DATA

Paper link:

https://robotics-transformer.github.io/assets/rt1.pdf

References:

【谷歌新作】Transformer杀入机器人领域,RT-1:97%成功率,轻松完成700多条控制指令_机器学习社区的博客-CSDN博客


[论文概览] Robot Transformer & Roboagent | 多任务机器人模型概览 - 知乎 (zhihu.com)
