RT-1 Paper Translation: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE

RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE


RT-2 paper translation: https://blog.csdn.net/weixin_43334869/article/details/135858619

ABSTRACT By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project’s website and videos can be found at robotics-transformer1.github.io


1 INTRODUCTION


End-to-end robotic learning, with either imitation or reinforcement, typically involves collecting task-specific data in either single-task (Kalashnikov et al., 2018; Zhang et al., 2018) or multi-task (Kalashnikov et al., 2021b; Jang et al., 2021) settings that are narrowly tailored to the tasks that the robot should perform. This workflow mirrors the classic approach to supervised learning in other domains, such as computer vision and NLP, where task-specific datasets would be collected, labeled, and deployed to solve individual tasks, with little interplay between the tasks themselves. Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets. The keys to the success of such models lie with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the knowledge present in large-scale datasets. If a model can “sponge up” experience to learn general patterns in language or perception, then it can bring them to bear on individual tasks more efficiently. While removing the need for large task-specific datasets is appealing generally in supervised learning, it is even more critical in robotics, where datasets might require engineering-heavy autonomous operation or expensive human demonstrations. We therefore ask: can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks? And does such a model enjoy the benefits observed in other domains, exhibiting zero-shot generalization to new tasks, environments, and objects?


Building such models in robotics is not easy. Although recent years have seen several large multi-task robot policies proposed in the literature (Reed et al., 2022; Jang et al., 2021), such models often have a limited breadth of real-world tasks, as with Gato (Reed et al., 2022), or focus on training tasks rather than generalization to new tasks, as with recent instruction following methods (Shridhar et al., 2021; 2022), or attain comparatively lower performance on new tasks (Jang et al., 2021).


The two main challenges lie in assembling the right dataset and designing the right model. While data collection and curation is often the “unsung hero” of many large-scale machine learning projects (Radford et al., 2021; Ramesh et al., 2021), this is especially true in robotics, where datasets are often robot-specific and gathered manually (Dasari et al., 2019; Ebert et al., 2021). As we will show in our evaluations, good generalization requires datasets that combine both scale and breadth, covering a variety of tasks and settings. At the same time, the tasks in the dataset should be sufficiently well-connected to enable generalization, such that the model can discover the patterns between structurally similar tasks and perform new tasks that combine those patterns in novel ways. We utilize a dataset that we gathered over the course of 17 months with a fleet of 13 robots, containing ∼130k episodes and over 700 tasks, and we ablate various aspects of this dataset in our evaluation.


The second challenge lies in the design of the model itself. Effective robotic multi-task learning requires a high capacity model, and Transformer (Vaswani et al., 2017) models excel in this regard, particularly when it is necessary to learn many tasks conditioned, as in our case, on language instructions. However, robotic controllers must also be efficient enough to run in real time, which presents a major challenge for Transformers in particular. We propose a novel architecture that we call RT-1 (Robotics Transformer 1), which by encoding high-dimensional inputs and outputs, including camera images, instructions and motor commands, into compact token representations to be used by the Transformer, allows for efficient inference at runtime to make real-time control feasible.


Our contribution is the RT-1 model and experiments with this model on a large and broad dataset of real-world robotic tasks. Our experiments not only demonstrate that RT-1 can exhibit significantly improved generalization and robustness compared to prior techniques, but also evaluate and ablate many design choices in both the model and in the composition of the training set. Our results show that RT-1 can perform over 700 training instructions at 97% success rate, and can generalize to new tasks, distractors, and backgrounds 25%, 36% and 18% better than the next best baseline, respectively. This level of performance allows us to execute very long-horizon tasks in the SayCan (Ahn et al., 2022) framework, with as many as 50 stages. A short overview of RT-1 capabilities is presented in Fig. 1b.


(a) RT-1 takes images and natural language instructions and outputs discretized base and arm actions. Despite its size (35M parameters), it does this at 3 Hz, due to its efficient yet high-capacity architecture: a FiLM (Perez et al., 2018) conditioned EfficientNet (Tan & Le, 2019), a TokenLearner (Ryoo et al., 2021), and a Transformer (Vaswani et al., 2017).

(b) RT-1’s large-scale, real-world training (130k demonstrations) and evaluation (3000 real-world trials) show impressive generalization, robustness, and ability to learn from diverse data.


Figure 1: A high-level overview of RT-1’s architecture, dataset, and evaluation


2 RELATED WORK


A number of recent works have proposed Transformer-based policies for robotic control. As in RT-1, several works use language commands processed with Transformers as a robust framework for specifying and generalizing to new tasks (Zhang & Chai, 2021; Pashevich et al., 2021; Silva et al., 2021; Jang et al., 2021; Ahn et al., 2022; Nair et al., 2022). Our work takes the application of Transformers a step further and treats the mapping of language and vision observations to robot actions as a sequence modelling problem, using a Transformer to learn this mapping. This idea is directly inspired by successes in game-playing (Chen et al., 2021; Lee et al., 2022a) as well as simulated robot navigation (Fang et al., 2019), locomotion (Janner et al., 2021; Gupta et al., 2022), and manipulation (Jiang et al., 2022) environments. We note that several of these works go beyond only text conditioning and use Transformers to also generalize across robot morphologies (e.g., Gupta et al. (2022)) and other modalities for task specifications (e.g., Jang et al. (2021); Jiang et al. (2022)). These extensions are promising future directions for RT-1.


Beyond Transformer-based policies, the focus of our work is on generalizable and robust real-world robotic manipulation at scale. Existing works on real-world Transformer-based robotic manipulation focus on efficiently learning tasks from a set of demonstrations per task (Shridhar et al., 2022). Behavior Transformer (Shafiullah et al., 2022) and Gato (Reed et al., 2022) advocate for training a single model on large-scale robotic and non-robotic datasets. However, these works are limited in their real-world robotic tasks; e.g., Gato effectively learns a single task (colored block stacking) without evaluating generalization to new tasks or a variety of real-world settings. On the technical side, our work examines how Transformer-based policies can be built so as to combine high capacity and generalization with the computational efficiency necessary for real-time control.


While the use of high-capacity Transformer models to learn robotic control policies is a fairly recent innovation, robotics has a long history of multi-task and language-conditioned learning, and RT-1 builds on these foundations. A significant body of work deals with learning policies and predictive models for robotic grasping (Saxena et al., 2006; Lenz et al., 2015; Pinto & Gupta, 2016; Gupta et al., 2018; Viereck et al., 2017), with the aim of generalizing to new objects. Prior works have sought to address robotic language understanding through pipelined approaches that combine language parsing, vision, and robotic control (MacMahon et al., 2006; Kollar et al., 2010; Tellex et al., 2011) and with end-to-end approaches (Mei et al., 2016; Stepputtis et al., 2020; Lynch & Sermanet, 2020; Ahn et al., 2022). Multi-task robotic learning has also been approached from the perspective of learning to reach goals (Chung et al., 2015; Raffin et al., 2019; Jurgenson et al., 2020; Huang et al., 2020), as well as learning policies that can perform tasks in a discrete set or some other parameterized form (Deisenroth et al., 2014; Devin et al., 2017; Fox et al., 2019; Kalashnikov et al., 2021a). A number of prior works in robotics have also focused on collecting datasets containing demonstrations or trials that illustrate a variety of different tasks (Sharma et al., 2018; Dasari et al., 2019; Yu et al., 2020; Singh et al., 2020; James et al., 2020). Our work adds further evidence in support of the power of multi-task, language-conditioned robotic learning, presenting experimental results at a larger scale and with a greater variety of behaviors, objects, and scenes and proposing new architectures and design choices that enable robotic learning at a significantly larger scale.


3 PRELIMINARIES


Robot learning. We aim to learn robot policies to solve language-conditioned tasks from vision. Formally, we consider a sequential decision-making environment. At timestep $t = 0$, the policy $\pi$ is presented with a language instruction $i$ and an initial image observation $x_0$. The policy produces an action distribution $\pi(\cdot \mid i, x_0)$ from which an action $a_0$ is sampled and applied to the robot. This process continues, with the policy iteratively producing actions $a_t$ by sampling from a learned distribution $\pi(\cdot \mid i, \{x_j\}_{j=0}^{t})$ and applying those actions to the robot. The interaction ends when a termination condition is achieved. The full interaction $(i, \{(x_j, a_j)\}_{j=0}^{T})$ from the starting step $t = 0$ to terminating step $T$ is referred to as an episode. At the end of an episode, the agent is given a binary reward $r \in \{0, 1\}$ indicating whether the robot performed the instruction $i$. The goal is to learn a policy $\pi$ that maximizes the average reward, in expectation over a distribution of instructions, starting states $x_0$, and transition dynamics.

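To make this interface concrete, below is a minimal Python sketch of the episode loop described above. The `env` and `policy` objects are hypothetical stand-ins for the real system; only the structure (language instruction, growing image history, termination, binary reward) follows the formalization.

```python
from typing import Callable, List, Tuple

def run_episode(policy: Callable, env, instruction: str,
                max_steps: int = 100) -> Tuple[List, int]:
    """Roll out one episode: sample a_t ~ pi(. | i, x_0..x_t) until termination."""
    images = [env.reset(instruction)]          # initial observation x_0
    trajectory = []
    for t in range(max_steps):
        action = policy(instruction, images)   # sample from pi(. | i, {x_j}_{j<=t})
        next_image, done = env.step(action)
        trajectory.append((images[-1], action))
        images.append(next_image)
        if done:                               # termination condition reached
            break
    reward = env.success()                     # binary reward r in {0, 1}
    return trajectory, reward
```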

Transformers. RT-1 uses a Transformer (Vaswani et al., 2017) to parameterize the policy $\pi$. Generally speaking, a Transformer is a sequence model mapping an input sequence $\{\xi_h\}_{h=0}^{H}$ to an output sequence $\{y_k\}_{k=0}^{K}$ using combinations of self-attention layers and fully-connected neural networks. While Transformers were originally designed for text sequences, where each input $\xi_h$ and output $y_k$ represents a text token, they have been extended to images (Parmar et al., 2018) as well as other modalities (Lee et al., 2022a; Reed et al., 2022). As detailed in the next section, we parameterize $\pi$ by first mapping the inputs $i, \{x_j\}_{j=0}^{t}$ to a sequence $\{\xi_h\}_{h=0}^{H}$ and the action outputs $a_t$ to a sequence $\{y_k\}_{k=0}^{K}$, before using a Transformer to learn the mapping $\{\xi_h\}_{h=0}^{H} \rightarrow \{y_k\}_{k=0}^{K}$.


Imitation learning. Imitation learning methods train the policy $\pi$ on a dataset $\mathcal{D}$ of demonstrations (Pomerleau, 1988; Zhang et al., 2018; Jang et al., 2021). Specifically, we assume access to a dataset $\mathcal{D} = \{(i^{(n)}, \{(x_t^{(n)}, a_t^{(n)})\}_{t=0}^{T^{(n)}})\}_{n=0}^{N}$ of episodes, all of which are successful (i.e., have a final reward of 1). We learn $\pi$ using behavioral cloning (Pomerleau, 1988), which optimizes $\pi$ by minimizing the negative log-likelihood of the actions $a_t$ given the images and language instructions.

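As a sketch, this behavioral cloning objective is the average negative log-likelihood of the demonstrated actions. The snippet below additionally assumes the policy outputs per-step logits over discretized action bins (as RT-1 does, see Sec. 5.1); that assumption is ours, made explicit here for illustration.

```python
import numpy as np

def bc_loss(action_logits: np.ndarray, demo_actions: np.ndarray) -> float:
    """Negative log-likelihood of demonstrated action tokens.

    action_logits: (T, num_bins) per-step logits; demo_actions: (T,) bin indices.
    """
    shifted = action_logits - action_logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(demo_actions)), demo_actions]
    return float(nll.mean())
```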

4 SYSTEM OVERVIEW


The goal of this work is to build and demonstrate a general robot learning system that can absorb large amounts of data and generalize effectively. We use mobile manipulators from Everyday Robots, which have a 7 degree-of-freedom arm, a two-fingered gripper, and a mobile base (see Fig. 2 (d)). To collect data and evaluate our method, we use three kitchen-based environments: two real office kitchens and a training environment modelled off these real kitchens. The training environment, shown in Fig. 2 (a), consists of partial counters and is constructed for large scale data collection. The two real environments, shown in Fig. 2 (b, c), have similar counter tops to the training environment, but vary in lighting, background, and full kitchen geometry (e.g., there may be a cabinet instead of a drawer or a sink may be visible). We evaluate the performance of our policies across these different environments, measuring the policy’s performance and ability to generalize.

Figure 2: (a) Robot classroom where we collect data at scale; (b) a real office kitchen, one of the two realistic environments used for evaluation (named Kitchen1 in the rest of the paper); (c) a different office kitchen used for evaluation (named Kitchen2 in the rest of the paper); (d) mobile manipulator used throughout the paper; (e) a set of objects used for most of the skills to expand skill diversity; (f) a more diverse set of objects used mostly to expand object diversity of the picking skill.


Our training data consists of human-provided demonstrations, and we annotate each episode with a textual description of the instruction that the robot just performed. The instructions usually contain a verb and one or more nouns describing the target objects. To group these instructions together, we split them into a number of skills (e.g., verbs such as “pick”, “open” or “place upright”) and objects (e.g., nouns such as “coke can”, “apple”, or “drawer”). We describe the details of our data collection strategy at scale in Sec. 5.2. Our largest dataset contains over 130k individual demonstrations constituting over 700 distinct task instructions using a large variety of objects (see Fig. 2 (f)). We describe the details of the data collected in Sec. 5.2.


One of the main contributions of our system is the network architecture, Robotics Transformer 1 (RT-1), an efficient model that can absorb large amounts of data, effectively generalize, and output actions at real-time rates for practical robotic control. RT-1 takes a short sequence of images and a natural language instruction as input and outputs an action for the robot at each time step. To this end, the architecture (shown in Figure 1a) leverages several elements: first the images and text are processed via an ImageNet pretrained convolutional network (Tan & Le, 2019) conditioned on a pretrained embedding of the instruction via FiLM (Perez et al., 2018), followed by a TokenLearner (Ryoo et al., 2021) to compute a compact set of tokens, and finally a Transformer (Vaswani et al., 2017) to attend over these tokens and produce discretized action tokens. The actions consist of seven dimensions for the arm movement (x, y, z, roll, pitch, yaw, opening of the gripper), three dimensions for base movement (x, y, yaw) and a discrete dimension to switch between three modes: controlling the arm, the base, or terminating the episode. RT-1 performs closed-loop control and commands actions at 3 Hz until it either yields a “terminate” action or hits a pre-set time step limit.

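The action space described above can be summarized with a simple record type. This is an illustrative encoding with our own field names, not identifiers from the released system.

```python
from dataclasses import dataclass

@dataclass
class RT1Action:
    """11 action dimensions: 7 for the arm, 3 for the base, 1 discrete mode."""
    # Arm: end-effector pose change and gripper opening.
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float
    gripper: float
    # Base: planar movement.
    base_x: float
    base_y: float
    base_yaw: float
    # Mode: 0 = control arm, 1 = control base, 2 = terminate episode.
    mode: int
```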

5 RT-1: ROBOTICS TRANSFORMER


In this section, we describe how we tokenize the images, text, and actions, and then discuss the RT-1 model architecture. We then describe how we attain the runtime speed required for real-time control. Lastly, we describe the data collection procedure and the skills and instructions in our dataset.


5.1 MODEL


Our model is built on a Transformer architecture (Vaswani et al., 2017) and takes a history of images and a task description as input and directly outputs tokenized actions, as shown in Fig. 1a and in detail in Fig. 3. In the following we describe the components of the model, following the top-to-bottom order in Fig. 3. More details on model selection at scale are provided in Appendix C.3.

Figure 3: The architecture diagram of RT-1. The instruction is transformed into a USE embedding and used to condition a pre-trained EfficientNet via FiLM layers. The resulting vision-language tokens are reduced by the TokenLearner and fed into a decoder-only Transformer, which outputs tokenized actions.


Instruction and image tokenization. The RT-1 architecture relies on a data-efficient and compact tokenization of images and language instruction. RT-1 tokenizes a history of 6 images by passing images through an ImageNet pretrained EfficientNet-B3 (Tan & Le, 2019) model, which takes 6 images of resolution 300×300 as input and outputs a spatial feature map of shape 9×9×512 from the final convolutional layer. Unlike Reed et al. (2022), we do not patchify the images into visual tokens prior to feeding them to our Transformer backbone. We instead flatten the output feature map from the EfficientNet into 81 visual tokens which are passed on to the later layers of the network.

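The flattening step can be illustrated in a few lines; the random array below is only a stand-in for the EfficientNet-B3 output.

```python
import numpy as np

# Placeholder for the final conv feature map of one 300x300 image: 9x9x512.
feature_map = np.random.randn(9, 9, 512)

# No patchification: each of the 9x9 spatial cells becomes one 512-d token.
visual_tokens = feature_map.reshape(-1, 512)
assert visual_tokens.shape == (81, 512)
```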

To include the language instruction, we condition the image tokenizer on the natural language instruction in the form of a pretrained language embedding, allowing extraction of task-relevant image features early on and improving performance of RT-1. The instruction is first embedded via Universal Sentence Encoder (Cer et al., 2018). This embedding is then used as input to identity-initialized FiLM layers (Perez et al., 2018) added to the pretrained EfficientNet to condition the image encoder. Normally, inserting a FiLM layer into the interior of a pretrained network would disrupt the intermediate activations and negate the benefit of using pretrained weights. To overcome this, we initialize the weights of the dense layers ($f_c$ and $h_C$) which produce the FiLM affine transformation to zero, allowing the FiLM layer to initially act as an identity and preserve the function of the pretrained weights. We find that identity-initialized FiLM also produces better results when training with an EfficientNet initialized from scratch, without ImageNet pretraining, but it does not surpass the initialization described above. The architecture of the image tokenizer is presented in Fig. 3.

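A minimal numpy sketch of identity-initialized FiLM is shown below, under the assumptions stated in this paragraph: the projections producing the affine parameters start at zero, so at initialization the layer multiplies features by 1 and adds 0, leaving the pretrained activations untouched. This is a simplification of per-channel FiLM conditioning, not the exact published module.

```python
import numpy as np

class IdentityInitFiLM:
    """FiLM conditioning that acts as an identity at initialization."""

    def __init__(self, embed_dim: int, channels: int):
        # Zero-initialized dense layers (the f_c and h_C of the text):
        # their outputs start at 0, so gamma = 1 and beta = 0 initially.
        self.w_gamma = np.zeros((embed_dim, channels))
        self.w_beta = np.zeros((embed_dim, channels))

    def __call__(self, features: np.ndarray, text_embed: np.ndarray) -> np.ndarray:
        gamma = 1.0 + text_embed @ self.w_gamma   # per-channel scale
        beta = text_embed @ self.w_beta           # per-channel shift
        return features * gamma + beta            # identity until trained

film = IdentityInitFiLM(embed_dim=512, channels=512)
x, e = np.random.randn(81, 512), np.random.randn(512)
assert np.allclose(film(x, e), x)  # exact identity before any training
```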

RT-1’s image and instruction tokenization via FiLM EfficientNet-B3 is a total of 16M parameters, with 26 layers of MBConv blocks and FiLM layers, which output 81 vision-language tokens.


TokenLearner. To further compress the number of tokens that RT-1 needs to attend over and thus speed up inference, RT-1 uses TokenLearner (Ryoo et al., 2021). TokenLearner is an elementwise attention module that learns to map a large number of tokens into a much smaller number of tokens. This allows us to soft-select image tokens based on their information, passing only the important token combinations to the subsequent Transformer layers. The inclusion of TokenLearner subsamples the 81 visual tokens that come out of the pre-trained FiLM-EfficientNet layers to just 8 final tokens that are then passed on to our Transformer layers.

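Below is a simplified numpy sketch of this kind of token reduction: learned score maps soft-select 8 output tokens as weighted combinations of the 81 inputs. The real TokenLearner computes its attention maps with a small learned network; the random weight matrix here is only a placeholder.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_learner(tokens: np.ndarray, w_select: np.ndarray) -> np.ndarray:
    """tokens: (81, d); w_select: (d, 8). Returns 8 soft-selected tokens."""
    scores = softmax(tokens @ w_select, axis=0)  # (81, 8): one map per output token
    return scores.T @ tokens                     # (8, d): weighted sums of inputs

reduced = token_learner(np.random.randn(81, 512), 0.02 * np.random.randn(512, 8))
assert reduced.shape == (8, 512)
```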

Transformer. These 8 tokens per image are then concatenated with the tokens of the other images in the history, forming 48 total tokens (with added position encoding) to be fed into the Transformer backbone of RT-1. The Transformer is a decoder-only sequence model with 8 self-attention layers and 19M total parameters that outputs action tokens.

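Assembling the Transformer input is then a straightforward concatenation, sketched below; the learned position encoding is represented by a placeholder array.

```python
import numpy as np

history = [np.random.randn(8, 512) for _ in range(6)]  # 8 tokens for each of 6 images
sequence = np.concatenate(history, axis=0)             # (48, 512) total tokens
position_encoding = 0.02 * np.random.randn(48, 512)    # stand-in for a learned embedding
transformer_input = sequence + position_encoding
assert transformer_input.shape == (48, 512)
```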

Action tokenization. To tokenize actions, each action dimension in RT-1 is discretized into 256 bins. As mentioned previously, the action dimensions we consider include seven variables for the arm movement (x, y, z, roll, pitch, yaw, opening of the gripper), three variables for base movement (x, y, yaw) and a discrete variable to switch between three modes: controlling arm, base or terminating the episode. For each variable, we map the target to one of the 256 bins, where the bins are uniformly distributed within the bounds of each variable.

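A sketch of this uniform discretization (and its inverse) is shown below; the bounds are illustrative, since the per-variable limits are not specified in this section.

```python
import numpy as np

def tokenize_dim(value: float, low: float, high: float, num_bins: int = 256) -> int:
    """Map a continuous action dimension to one of 256 uniform bins."""
    value = np.clip(value, low, high)
    frac = (value - low) / (high - low)
    return min(int(frac * num_bins), num_bins - 1)

def detokenize_dim(bin_idx: int, low: float, high: float, num_bins: int = 256) -> float:
    """Return the center of the selected bin."""
    return low + (bin_idx + 0.5) * (high - low) / num_bins

token = tokenize_dim(0.13, low=-1.0, high=1.0)   # e.g. an arm x-delta
value = detokenize_dim(token, low=-1.0, high=1.0)
```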

Loss. We use a standard categorical cross-entropy objective and causal masking, as utilized in prior Transformer-based controllers (Reed et al., 2022; Lee et al., 2022a).


Inference speed. In contrast to many applications of large models, such as natural language or image generation, one of the unique requirements for a model that needs to run on real robots in real time is fast and consistent inference speed. Given the human speeds of executing the instructions considered in this work (which we measured to be in the 2–4 second range), we want the model to be not significantly slower than that. Based on our experiments, this requirement corresponds to a control frequency of at least 3 Hz and, given the other latencies in the system, a resulting inference time budget for the model of less than 100 ms.

This requirement limits the size of the model that we can use. We further explore the impact of model size on inference speed in the experiments. We employ two techniques to speed up inference: (i) reducing the number of tokens generated by the pre-trained EfficientNet model by using TokenLearner (Ryoo et al., 2021), and (ii) computing these tokens only once and reusing them across the overlapping windows of future inferences. These allow us to speed up model inference by 2.4x and 1.7x, respectively. Additional details on model inference are in Appendix C.1.

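Technique (ii) can be sketched as a small rolling cache: consecutive 6-image windows share 5 images, so only the newest image is pushed through the tokenizer. `tokenize_image` below is a hypothetical stand-in for the FiLM-EfficientNet and TokenLearner stages.

```python
from collections import deque
import numpy as np

def tokenize_image(image: np.ndarray) -> np.ndarray:
    """Placeholder for the FiLM-EfficientNet + TokenLearner path (8 tokens)."""
    return np.random.randn(8, 512)

token_cache = deque(maxlen=6)  # rolling window of per-image token sets

def tokens_for_step(new_image: np.ndarray) -> np.ndarray:
    token_cache.append(tokenize_image(new_image))     # tokenize the newest image only
    return np.concatenate(list(token_cache), axis=0)  # up to 48 tokens
```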

5.2 DATA


Our goal is to build a system that exhibits high performance, generalization to new tasks, and robustness to distractors and backgrounds. We therefore aim to collect a large, diverse dataset of robot trajectories that includes multiple tasks, objects and environments. Our primary dataset consists of ∼130k robot demonstrations, collected with a fleet of 13 robots over the course of 17 months. We conducted this large-scale data collection in a series of office kitchen segments, which we refer to as robot classrooms, shown in Fig. 2. More details on data collection are in Appendix C.2.


Skills and instructions. While the definition of a task remains inconsistent in the literature, in this work we count the number of language instructions that the system can perform, where an instruction corresponds to a verb surrounded by one or multiple nouns, such as “place water bottle upright”, “move the coke can to the green chip bag” or “open the drawer”. RT-1 is able to perform over 700 language instructions in multiple realistic office kitchen environments that we evaluate and describe in detail in the experiments. In order to group the evaluations and draw conclusions on the performance of the system, we group the instructions by the verbs used in them, which we refer to as skills. A more detailed list of instructions is shown in Table 1, with examples and the number of instructions per skill.

Table 1: The list of skills collected for RT-1 together with their descriptions and example instructions.


The current set of skills includes picking, placing, opening and closing drawers, getting items in and out of drawers, placing elongated items upright, knocking them over, pulling napkins and opening jars. The skills were chosen to demonstrate multiple behaviors with many objects (seen in Fig. 2(e)) to test aspects of RT-1 such as generalization to new instructions and the ability to perform many tasks. We then greatly expanded the object diversity for the “pick” skill to make sure that the skills generalize to varied objects (see the expanded set of objects in Fig. 2(f)). The skills were further expanded while we conducted the ablations to include the instructions added in the last row of Table 1, which were used for the experiments described in Sec. 6.3 and 6.4. These additional skills focused on realistic, long-horizon instructions in an office kitchen. The entire process of adding tasks and data is described in Appendix C.4. Since we do not make any assumptions about particular skills when adding new instructions, the system is easily extendable, and we can continuously provide more diverse data to improve its capabilities.

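As a toy illustration of this grouping, instructions can be bucketed into skills by their verb; the snippet below uses the first word as a crude proxy for the verb, which is our simplification rather than the paper's exact procedure.

```python
from collections import defaultdict

def skill_of(instruction: str) -> str:
    """Crude skill extraction: treat the leading word as the verb."""
    return instruction.lower().split()[0]

instructions = ["pick coke can", "open the drawer", "pick apple",
                "place water bottle upright"]
skills = defaultdict(list)
for inst in instructions:
    skills[skill_of(inst)].append(inst)
# skills now maps "pick" -> 2 instructions, "open" -> 1, "place" -> 1.
```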

6 EXPERIMENTS


Our experiments seek to answer the following questions:

1. Can RT-1 learn to perform a large number of instructions, as well as generalize in zero shot to new tasks, objects and environments? (Section 6.2)
2. Can we push the resulting model even further by incorporating heterogeneous data sources, such as simulated data or data from different robots? (Section 6.3)
3. How do various methods generalize to long-horizon robotic scenarios? (Section 6.4)
4. How do generalization metrics change with varying amounts of data quantity and data diversity? (Section 6.5)
5. What are the important and practical decisions in the design of the model and how do they affect performance and generalization? (Appendix Section D.4)


Throughout this section we will compare to two state-of-the-art baseline architectures, Gato (Reed et al., 2022) and BC-Z (Jang et al., 2021). Importantly, both of these are trained on our data described in detail in Sec. 5.2 (which is an important part of our system), since the original models in these publications would not exhibit the generalization properties required for our evaluation tasks. Gato is, similarly to RT-1, based on a Transformer architecture, but varies from RT-1 in multiple aspects. First, it computes image tokens without the notion of language, and each image token embedding is computed separately for each image patch, as opposed to the early language fusion and global image embedding in our model. Second, it does not use a pre-trained text embedding to encode the language string. It also does not include the inference-time considerations necessary for real robots discussed in Sec. 5.1, such as TokenLearner and the removal of auto-regressive actions. In order to run Gato on real robots at a high enough frequency, we also limit the size of the model compared to the original publication, which was 1.2B parameters (resulting in an on-robot inference time of 1.9 s), to be of similar size to RT-1 (37M parameters for Gato vs. 35M for RT-1). BC-Z is based on a ResNet architecture, and was used in SayCan (Ahn et al., 2022). BC-Z differs from RT-1 in that it is a feedforward model that does not use previous timesteps, and it uses continuous actions rather than discrete action tokens. In addition to the original BC-Z model size, we also compare our method to a larger version of BC-Z that has a similar number of parameters to RT-1, which we refer to as BC-Z XL. We study and analyze how each of these design decisions changes performance in Appendix Sections D.4 and D.5.


We evaluate the success rate in experiments to measure performance on training instructions, generalization to unseen instructions, robustness to backgrounds and distractors, and performance in long-horizon scenarios, as detailed below. Throughout this section, we evaluate our approach and baselines with over 3000 real-world trials, making this one of the largest-scale evaluations of a robot learning system to date.


6.1 EXPERIMENTAL SETUP


As mentioned in Section 4, we evaluate RT-1 with a set of mobile manipulators from Everyday Robots in three environments: two real office kitchens and a training environment modelled off these real kitchens. The training environment, shown in Fig. 2 (a), consists of partial counters while the two real environments, shown in Fig. 2 (b, c), have similar counter tops to the training environment, but vary in lighting, background, and full kitchen geometry (e.g., there may be a cabinet instead of a drawer or a sink may be visible). The policies are evaluated for performance on training tasks as well as generalization to new tasks, robustness to unseen environments, and performance when chained together for long-horizon tasks, as detailed below.


Seen task performance. To evaluate performance on seen instructions, we evaluate performance on instructions sampled from the training set. Note, however, that this evaluation still involves varying the placement of objects and other factors of the setup (e.g., time of day, robot position), requiring the skills to generalize to realistic variability in the environment. In all, we test over 200 tasks in this evaluation: 36 for picking objects, 35 for knocking objects, 35 for placing things upright, 48 for moving objects, 18 for opening and closing various drawers, and 36 for picking out of and placing objects into drawers.


Unseen tasks generalization. To evaluate generalization to unseen tasks, we test 21 novel, unseen instructions. These instructions are distributed across skills and objects. This ensures that at least some instances of each object and skill were present in the training set but they will be combined in novel ways. For example, if “pick up the apple” is held out, then there are other training instructions that include the apple. The list of all unseen instructions can be found in the Appendix D.1.


Robustness. To evaluate robustness, we perform 30 real-world tasks for distractor robustness and 22 tasks for background robustness. The background robustness was tested by evaluating in new kitchens (which have different lighting and background visuals) and with different counter surfaces (e.g., a patterned table cloth). Example configurations of the robustness evaluation scenarios are depicted in Fig. 4.

Figure 4: Evaluation scenarios for distractors (first row), from left to right: easy (0-5 distractors), medium (9 distractors), hard (9 distractors and occluded object); background (second row), from left to right: original environment, patterned table cloth, new kitchen; and realistic scenarios in the real kitchen (third row), generalization levels from left to right: L1, L2 and L3.


Long-horizon scenarios. We also evaluate generalization to more realistic long-horizon scenarios, which each require executing a sequence of skills. The goal of this evaluation is to combine multiple generalization axes such as new tasks, objects, and environments and test the overall generalization capabilities in realistic settings. These evaluations consist of 15 long-horizon instructions in two real kitchens, which require executing sequences of skills consisting of ∼ 10 distinct steps, with each step of roughly comparable scope to the training instructions. These steps are obtained automatically from higher level instructions, such as “how would you throw away all the items on the table?” by using the SayCan system (Ahn et al., 2022), as described in detail in Section 6.4 and Appendix D.3.


6.2 CAN RT-1 LEARN TO PERFORM A LARGE NUMBER OF INSTRUCTIONS, AND TO GENERALIZE TO NEW TASKS, OBJECTS AND ENVIRONMENTS?


To answer our first question, we analyze the overall performance, generalization, and robustness capabilities of RT-1 compared to previously proposed models. Specifically, we compare to the model architectures used by Gato (Reed et al., 2022) and BC-Z (Jang et al., 2021), as well as a larger version of BC-Z, which we refer to as BC-Z XL. Note, however, that all models are trained on the same data as RT-1, and the evaluation only compares the model architectures, not the task sets, datasets, or overall robotic systems. The capabilities of RT-1 are determined to a large extent by the dataset and task set, which we believe improve significantly over prior works (e.g. BC-Z uses 100 tasks and the original Gato model trains a stacking task with various shapes), and thus this comparison should be viewed as rather favorable to the prior models, which also benefit from the large and diverse dataset and task set that we collected.


The results are shown in Table 2. Across each category, we find that RT-1 outperforms the prior models significantly. On seen tasks, RT-1 is able to perform 97% of the more than 200 instructions successfully, which is 25% more than BC-Z and 32% more than Gato. On unseen tasks, RT-1 shows it is capable of generalizing to novel instructions, performing 76% of the never-before-seen instructions, 24% more than the next best baseline. While such generalization to novel instructions is made possible due to natural language conditioning of the policy, as the policy is able to understand new combinations of previously seen concepts, all of the baselines are also conditioned on natural language and in principle enjoy the same benefits. We further ablate different components of RT-1 in the next section to better understand what aspects of our method contribute the most to this difference. On distractors and backgrounds, we find that RT-1 is quite robust, successfully executing 83% of the distractor robustness tasks and 59% of the background robustness tasks (36% and 18% higher than the next best alternative, respectively). Overall, we find that RT-1 has high general performance, while exhibiting impressive degrees of generalization and robustness. We show example trajectories of the RT-1 agent including instructions that cover different skills, environments and objects in Fig. 5. We also present additional trajectory examples for different generalization tests in the Appendix, which include backgrounds (Fig. 10), and distractors (Fig. 12).


Figure 5: Example evaluation trajectories for RT-1 across various instructions.

Table 2: Overall performance of RT-1 and baselines across seen tasks, generalization to unseen tasks, and robustness to distractors and backgrounds.


Generalization to realistic instructions. Next, we test whether our method generalizes enough across all the different axes that we evaluated previously to be deployed in a real kitchen, which poses multiple distribution shifts all at once, such as new task combinations, object distractors, as well as a novel environment.


To evaluate our algorithm in realistic scenarios in a real kitchen, we construct task sequences to accomplish a number of realistic goals. The robot restocks several snacks in drawers, tidies up knocked-over condiment bottles and closes drawers left open by humans, prepares a snack with an orange and a napkin, and fetches lost sunglasses and an octopus toy from several places in the kitchen. The detailed instructions used in these scenarios are listed in Appendix D.1. The office kitchen involves a dramatic shift from the training environment, and we categorize tasks across these scenarios with varying levels of generalization: L1 for generalization to the new counter-top layout and lighting conditions, L2 for additional generalization to unseen distractor objects, and L3 for additional generalization to drastically new task settings, new task objects, or objects in unseen locations such as near a sink. The three levels that correspond to the three tasks of restocking, preparing a snack, and fetching a lost object in the real kitchen are depicted in the last row of Fig. 4. Example trajectories for different levels are presented in the Appendix in Fig. 11.


We report the per-task success rate in these realistic scenarios along with the varying generalization levels in Table 3 and find RT-1 to be the most robust on all levels. Gato generalizes fairly well at the first level but its performance drops significantly for the more difficult generalization scenarios. BC-Z and its XL equivalent perform fairly well at the L2 level and better than Gato at L3, but they are still not at the generalization level of RT-1.

Table 3: Realistic generalization scenarios: we compare model success rate in realistic Google kitchen scenarios across three levels of generalization: L1 for generalization to the new counter-top layout and lighting conditions, L2 for additional generalization to unseen distractor objects, and L3 for additional generalization to drastically new task settings, new task objects, or objects in unseen locations such as near a sink.


6.3 CAN WE PUSH THE RESULTING MODEL FURTHER BY INCORPORATING HETEROGENEOUS DATA SOURCES SUCH AS SIMULATION OR DATA FROM DIFFERENT ROBOTS?


Next, we explore the limits of RT-1 for utilizing highly heterogeneous data. We demonstrate how RT-1 can incorporate and learn from vastly different data sources, improving from such data without sacrificing its original-task performance across the varied tasks inherent in this data. To this end, we conduct two experiments: (1) RT-1 trained and tested on both real data and simulation data, and (2) RT-1 trained across large datasets of different tasks, originally collected by different robots. More information on each is provided in Appendix D.2.


Absorbing simulation data. Table 4 shows the ability of RT-1, and baselines, to absorb both real and simulation data. To test this, we take all of the real demonstration data but also provide additional simulation data that includes objects that the robot has never seen in the real world. Specifically, we specify different generalization scenarios: for seen skills with real objects, the training data has real data of that instruction (i.e., performance on seen tasks); for seen skills with sim objects, the training data has sim data of that instruction (e.g., “pick up a sim object”, which was present in sim); and for unseen skills with sim objects, the training data has sim data of that object, but there are no examples of the instruction describing the skill with that object either in sim or in real (e.g., “move a sim object to apple”, even though the robot has only practiced picking that sim object, not moving it near other objects). All evaluations are done in the real world, but to limit the number of instructions evaluated, we focus on the pick and move-to skills.

Table 4: Experimental results for incorporating simulation data in RT-1. Adding simulation data does not impact the performance on real objects, while significantly improving real performance on objects that were only introduced in simulation (+64%). It also improves real-world generalization on simulated objects used with skills seen only in the real world (+26%), e.g. “move X to Y” where X only appeared in simulated “pick X” task.


We find in Table 4 that RT-1 does not lose performance when adding simulation data, compared to the Real Only dataset. We do, however, see a significant increase in performance (from 23% to 87%) on objects and tasks seen only in simulation, approximately matching the performance on those seen in real, demonstrating an impressive degree of domain transfer. We also see a significant increase in performance on unseen instructions, from 7% to 33%; impressive given that the object in question has never been seen in real and the instruction has never been seen at all. Overall, we find that RT-1 is able to efficiently absorb new data, even from a very different domain.


Absorbing data from different robots. To push the data absorption limits of RT-1, we conduct an additional set of experiments where we combine two data sources that originate from different robots: Kuka IIWA as well as the Everyday Robots mobile manipulators used in the experiments so far. The Kuka data contains all the successful examples collected in QT-Opt (Kalashnikov et al., 2018), which corresponds to 209k episodes, where the robot was indiscriminately grasping objects in a bin (see an example of a Kuka episode in Table 5). To test whether RT-1 can effectively absorb these two very different datasets, we evaluate the performance on the standard “Classroom eval”, as well as the performance on the newly constructed tasks that reflect the bin-picking setup present in the Kuka data, which we refer to as the “Bin-picking eval” (see Fig. 6).

Figure 6: In Table 5, RT-1 is trained with data from two robotics platforms and learns to generalize across them.
Table 5: Experimental results for mixing data from two different robots. Incorporating Kuka bin-picking data from QT-Opt (Kalashnikov et al., 2018) in RT-1 minimally impacts the standard Classroom evaluation performance and results in an almost 2x improvement in generalization to the Bin-picking evaluation (which is similar to the setup in the Kuka data) on the Everyday Robots manipulator. This demonstrates an effective transfer across two different robot morphologies.


We would like to emphasize the difficulty of this setting by noting the major differences between the datasets. Not only are the robots that collected the data different in appearance and action space, but the environments they were deployed in also differ in appearance and dynamics. In addition, the QT-Opt data presents a completely different action distribution: it was collected by an RL agent, as opposed to the human demonstrations present in our dataset.


The results are presented in Table 5. We observe that the model that mixes the RT-1 data and the Kuka data has only a minimal decrease in the original tasks’ performance (i.e., the Classroom eval) of 2%. Even more importantly, in the Bin-picking eval, we observe that the model trained on multi-robot data performs at 39%, compared to the 22% of the model that was trained only on the RT-1 data. This is a 17% performance difference (almost 2x). Additionally, RT-1 trained on Kuka bin-picking data and evaluated on the bin-picking tasks with the Everyday Robots (EDR) robot achieves 0% performance, confirming that it is difficult to transfer a behavior from another robot morphology. However, mixing the data from both robots allows RT-1 to infer the correct actions of the EDR robot even when faced with the states observed by Kuka robots. This is achieved without explicit demonstrations of bin-picking on the EDR robot, by taking advantage of past experiences collected by Kuka robots. These results indicate that RT-1’s absorption properties also include the ability to acquire new skills by observing other robots’ experiences, and present an exciting avenue of future work where we combine many more multi-robot datasets to enhance the robot capabilities.

结果如表5所示。我们观察到混合RT-1数据和Kuka数据的模型在原始任务的性能上(即Classroom eval)仅有2%的轻微下降。更重要的是,在Bin-picking eval中,我们观察到在混合机器人数据上训练的模型的性能为39%,而仅在RT-1数据上训练的模型的性能为22%。这是一个17%的性能差异(几乎是2倍)。此外,RT-1在Kuka拾取容器数据上训练并在Everyday Robots(EDR)机器人上进行容器拾取任务的评估时,性能为0%,证实了在不同机器人形态间转移行为的困难。然而,混合来自两个机器人的数据使RT-1能够推断出在Kuka机器人观察到的状态下,EDR机器人的正确动作,而无需在EDR机器人上明确演示拾取容器。这是通过利用Kuka机器人收集的过去经验实现的。这些结果表明,RT-1的吸收特性还包括通过观察其他机器人的经验获取新技能的能力,并展示了未来工作的激动人心的方向,我们将结合更多的多机器人数据集来增强机器人的能力。

6.4 HOW DO VARIOUS METHODS GENERALIZE TO LONG-HORIZON ROBOTIC SCENARIOS?

6.4 各种方法如何泛化到长时程机器人场景?

In the next set of experiments we evaluate whether our method generalizes enough to be used in long-horizon realistic kitchen settings. To answer this question, we execute RT-1 and various baselines within the SayCan (Ahn et al., 2022) framework in two different real kitchens. Since SayCan combines many low-level instructions to perform high-level instructions, the number of possible high-level instructions increases combinatorially with skills, so the skill-breadth of RT-1 can be fully seen (for more details on the SayCan algorithm please refer to Ahn et al. (2022)). The success rate of long-horizon tasks also decreases exponentially with the length of the task, so high success rates in manipulation skills are particularly important. Furthermore, as mobile manipulation tasks require both navigation and manipulation, the policy's ability to be robust to base position is crucial. More detail is provided in Appendix D.3.

在接下来的一系列实验中,我们评估我们的方法是否具有足够的泛化能力,可以用于长时程的真实厨房环境。为了回答这个问题,我们在SayCan(Ahn等人,2022)框架中,于两个不同的真实厨房里执行RT-1和各种基线。由于SayCan组合许多低级指令来执行高级指令,可能的高级指令数量随技能数量呈组合式增长,因此RT-1的技能广度可以得到充分展现(有关SayCan算法的更多详细信息,请参阅Ahn等人(2022))。长时程任务的成功率也随任务长度的增加呈指数下降,因此操纵技能的高成功率尤为重要。此外,由于移动操纵任务既需要导航又需要操纵,策略对基座位置的鲁棒性至关重要。更多细节见附录D.3。

Table 6 shows our results (on instructions listed in Appendix Table 12). Except for the original SayCan, all methods achieve an 87% planning success rate, and RT-1 performs the best, with a 67% execution success rate in Kitchen1. Kitchen2 constitutes a much more challenging generalization scene, since the Robot Classroom training scenes are modeled after Kitchen1 (see the pictures of the kitchens in Fig. 2). Due to this generalization difficulty, SayCan with Gato is not able to finish any long-horizon task, and SayCan with BC-Z is able to achieve a success rate of 13%. The original SayCan paper did not evaluate performance in a new kitchen. Surprisingly, the manipulation performance does not see a visible drop from Kitchen1 to Kitchen2 for our method. In the supplementary video, we show that this enables us to operate unseen drawers in Kitchen2, and that we can use SayCan-RT1 to plan and execute ultra-long horizon tasks, with as many as 50 steps.

表6显示了我们的结果(指令见附录表12)。除了原始的SayCan外,所有方法都获得了87%的规划成功率,而RT-1在Kitchen1中表现最好,执行成功率为67%。由于Robot Classroom训练场景是模仿Kitchen1搭建的,Kitchen2构成了一个更具挑战性的泛化场景(请参阅图2中厨房的图片)。由于这种泛化难度,使用Gato的SayCan无法完成任何长时程任务,而使用BC-Z的SayCan能够实现13%的成功率。原始SayCan论文没有在新的厨房中评估性能。令人惊讶的是,我们的方法从Kitchen1到Kitchen2的操纵性能并没有明显下降。在补充视频中,我们展示了这使我们能够操作Kitchen2中未见过的抽屉,并且我们可以使用SayCan-RT1来规划和执行多达50个步骤的超长时程任务。
Table 6: SayCan style long horizon tasks in Kitchen1 and Kitchen2. (*Original SayCan eval uses a slightly different prompt so the planning success rate is lower.)

表6:在Kitchen1和Kitchen2中的SayCan风格长时程任务。(*原始的SayCan评估使用了略有不同的提示,因此规划成功率较低。)

6.5 HOW DO GENERALIZATION METRICS CHANGE WITH VARYING AMOUNTS OF DATA QUANTITY AND DATA DIVERSITY?

6.5 数据数量和数据多样性的变化如何影响泛化度量?

While previous works have shown the scaling abilities of Transformer-based models (Lee et al., 2022a; Reed et al., 2022; Jiang et al., 2022) with the number of model parameters, in many robotics works the model size is often not the primary bottleneck, and the maximum size is limited by the latency requirement for running such models on real robots. Instead, in this study we focus on ablating the influence of dataset size and diversity, as they play an important role in the traditionally data-limited robot learning field. Since data collection is particularly expensive for real robots, it is important to quantify what kind of data our models need to achieve a certain performance and generalization. Thus, our last question focuses on the scaling properties of RT-1 with different data properties.

尽管以前的研究已经展示了基于Transformer的模型(Lee等人,2022a; Reed等人,2022; Jiang等人,2022)随模型参数数量扩展的能力,但在许多机器人学的工作中,模型大小通常不是主要瓶颈,最大模型尺寸往往受到在真实机器人上运行此类模型的延迟要求的限制。相反,在这项研究中,我们专注于对数据集大小和多样性的影响进行消融分析,因为它们在传统上数据受限的机器人学习领域中起着重要作用。由于对于真实机器人而言数据收集尤为昂贵,因此量化模型需要什么类型的数据才能达到特定性能和泛化水平是很重要的。因此,我们的最后一个问题聚焦于RT-1在不同数据属性下的扩展性质。

In Table 7 we show the performance, generalization, and robustness of RT-1 as we decrease the dataset size (% data) and the dataset diversity (% tasks). To separate the axes of dataset size and diversity, we create smaller datasets with the same task diversity by removing data from the tasks with the largest data, capping the number of examples per task at 200 (resulting in 51% of the data), 100 (37% of the data), and 50 (22.5% of the data). To create a narrow dataset, we remove the tasks with the least data, thus keeping 97% of the overall data but only 75% of the tasks. As we decrease dataset size, we see a general trend of decreasing performance and a steeper trend of decreasing generalization. As we make the dataset more narrow, we see much steeper performance reductions, particularly in terms of generalization. In fact, removing 25% of the tasks while keeping 97% of the data achieves an equivalent generalization performance to reducing the dataset size by as much as 49%. Our key takeaway is thus that data diversity is more essential than data quantity.

在表7中,我们展示了随着数据集大小(% 数据)和数据集多样性(% 任务)的减小,RT-1的性能、泛化和鲁棒性的变化情况。为了区分数据集大小和多样性的影响,我们创建了具有相同任务多样性的较小数据集,通过从具有最大数据的任务中删除数据,将每个任务的示例数量限制为200(结果为数据的51%)、100(数据的37%)和50(数据的22.5%)。为了创建一个狭窄的数据集,我们删除了数据最少的任务,从而保留了总体数据的97%,但只有75%的任务。随着数据集大小的减小,我们观察到性能普遍下降的趋势以及泛化程度下降的更陡峭趋势。随着数据集变得更狭窄,我们看到性能下降的趋势更为陡峭,特别是在泛化方面。事实上,保持97%的数据并删除25%的任务,其泛化性能相当于将数据集大小减小了高达49%。我们的主要结论是数据多样性比数据数量更为重要。
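To make the construction of these ablation datasets concrete, the sketch below shows one way the per-task caps (200/100/50 episodes) and the narrow 75%-of-tasks split could be implemented. This is an illustration under stated assumptions, not the authors' pipeline; the `episodes` list and its `.task` attribute are hypothetical.

```python
import random
from collections import defaultdict

def cap_per_task(episodes, cap, seed=0):
    """Smaller dataset, same task diversity: keep at most `cap` episodes
    per task (the paper uses caps of 200, 100, and 50)."""
    by_task = defaultdict(list)
    for ep in episodes:               # each episode is assumed to carry a .task label
        by_task[ep.task].append(ep)
    rng = random.Random(seed)
    capped = []
    for eps in by_task.values():
        rng.shuffle(eps)
        capped.extend(eps[:cap])      # tasks with fewer episodes keep all of them
    return capped

def drop_smallest_tasks(episodes, keep_task_fraction=0.75):
    """Narrow dataset: drop the tasks with the least data, which keeps
    most of the episodes (97% of data for 75% of tasks in the paper)."""
    by_task = defaultdict(list)
    for ep in episodes:
        by_task[ep.task].append(ep)
    ranked = sorted(by_task.values(), key=len, reverse=True)
    kept = ranked[: int(len(ranked) * keep_task_fraction)]
    return [ep for eps in kept for ep in eps]
```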

Table 7: Various data ablations of RT-1 across seen tasks, generalization to unseen tasks, and robustness to distractors and backgrounds. Data diversity has a higher impact on the performance and generalization than data quantity.
表7:RT-1在已见任务、对未见任务的泛化以及对干扰物和背景的鲁棒性上的各种数据消融。数据多样性对性能和泛化的影响高于数据数量。

7 CONCLUSIONS, LIMITATIONS AND FUTURE WORK

7 结论、局限性和未来工作

We presented Robotics Transformer 1, RT-1, a robot learning method that can effectively absorb large amounts of data and scales with data quantity and diversity. We trained RT-1 on a large dataset of demonstrations containing over 130k episodes collected over the course of 17 months with 13 robots. In our broad set of experiments, we demonstrated that our method can perform over 700 instructions at 97% success rate and effectively generalize to new tasks, objects and environments better than previously published baselines. We also demonstrated that RT-1 can successfully absorb heterogeneous data from simulation and other robot morphologies without sacrificing original-tasks performance and while improving generalization to new scenarios. Lastly, we showed how this level of performance and generalization allowed us to execute very long-horizon tasks in the SayCan (Ahn et al., 2022) framework, with as many as 50 steps.

我们提出了机器人Transformer 1(RT-1),这是一种能够有效吸收大量数据并随数据数量和多样性扩展的机器人学习方法。我们在一个包含超过130,000个episode的大型演示数据集上训练了RT-1,该数据集由13台机器人在17个月内收集而成。在广泛的实验中,我们展示了我们的方法能够以97%的成功率执行超过700条指令,并且在新任务、对象和环境方面比先前发布的基线更好地进行泛化。我们还展示了RT-1可以成功地吸收来自仿真和其他机器人形态的异构数据,既不牺牲原始任务的性能,又能提高对新场景的泛化能力。最后,我们展示了这种性能和泛化水平使我们能够在SayCan(Ahn et al., 2022)框架中执行多达50个步骤的超长时程任务。

While RT-1 presents a promising step towards large-scale robot learning with a data-absorbent model, it comes with a number of limitations. First, it is an imitation learning method, which inherits the challenges of that class of approaches, such as the fact that it may not be able to surpass the performance of the demonstrators. Second, the generalization to new instructions is limited to combinations of previously seen concepts, and RT-1 is not yet able to generalize to a completely new motion that has not been seen before. Lastly, our method is presented on a large but not very dexterous set of manipulation tasks. We plan to continue extending the set of instructions that RT-1 enables and generalizes to in order to address this challenge.

虽然RT-1朝着以数据吸收型模型实现大规模机器人学习迈出了有希望的一步,但它也存在一些局限性。首先,它是一种模仿学习方法,继承了这一类方法的挑战,例如它可能无法超越示范者的性能。其次,对新指令的泛化仅限于先前见过的概念的组合,RT-1尚不能对以前未见过的全新动作进行泛化。最后,我们的方法是在一个庞大但不太灵巧的操纵任务集上展示的。我们计划继续扩展RT-1能够执行和泛化的指令集,以应对这一挑战。

As we explore future directions for this work, we hope to scale the number of robot skills faster by developing methods that allow non-experts to train the robot via directed data collection and model prompting. While the current version of RT-1 is fairly robust especially to distractor objects, its robustness to backgrounds and environments could be further improved by greatly increasing the environment diversity. We also hope to improve the reaction speeds and context retention of RT-1 through scalable attention and memory.

在探索未来工作的方向时,我们希望通过开发允许非专家通过定向数据收集和模型提示来训练机器人的方法,更快地扩展机器人技能的数量。虽然当前版本的RT-1在处理干扰对象方面相当强大,但它对背景和环境的鲁棒性仍有待提高,这可以通过大幅增加环境多样性来实现。我们还希望通过可扩展的注意力和内存来提高RT-1的反应速度和上下文保留能力。

To allow the research community to build on top of this work, we have open-sourced the code for RT-1[4], which we hope will provide researchers with a valuable resource for future research on scaling up robot learning.

为了让研究社区能够在这项工作的基础上进行研究,我们已经开源了RT1的代码[4],我们希望这将为未来扩大机器人学习的研究提供有价值的资源。

ACKNOWLEDGMENTS

致谢

We would like to acknowledge Aleksandra Faust, Andy Christiansen, Chuyuan Fu, Daniel Kappler, David Rendleman, Eric Jang, Jessica Gomez, Jessica Lin, Jie Tan, Josh Weaver, Justin Boyd, Krzysztof Choromanski, Matthew Bennice, Mengyuan Yan, Mrinal Kalakrishnan, Nik Stewart, Paul Wohlhart, Peter Pastor, Pierre Sermanet, Wenlong Lu, Zhen Yu Song, Zhuo Xu, and the greater teams at Robotics at Google and Everyday Robots for their feedback and contributions.

我们要感谢Aleksandra Faust、Andy Christiansen、Chuyuan Fu、Daniel Kappler、David Rendleman、Eric Jang、Jessica Gomez、Jessica Lin、Jie Tan、Josh Weaver、Justin Boyd、Krzysztof Choromanski、Matthew Bennice、Mengyuan Yan、Mrinal Kalakrishnan、Nik Stewart、Paul Wohlhart、Peter Pastor、Pierre Sermanet、Wenlong Lu、Zhen Yu Song、Zhuo Xu以及谷歌机器人和Everyday Robots团队的所有成员,感谢他们的反馈和贡献。

APPENDIX

附录

A AUTHOR CONTRIBUTIONS

A 作者贡献

• 评估(消融实验、设计程序、实施和运行消融实验):Yevgen Chebotar、Keerthana Gopalakrishnan、Karol Hausman、Julian Ibarz、Brian Ichter、Alex Irpan、Isabel Leal、Kuang-Huei Lee、Yao Lu、Ofir Nachum、Kanishka Rao、Sumedh Sontakke、Austin Stone、Quan Vuong、Fei Xia、Ted Xiao和Tianhe Yu。

• 网络架构(分词器、训练、推理):Yevgen Chebotar、Keerthana Gopalakrishnan、Julian Ibarz、Alex Irpan、Kuang-Huei Lee、Yao Lu、Karl Pertsch、Kanishka Rao、Michael Ryoo、Sumedh Sontakke、Austin Stone和Quan Vuong。

• 开发基础设施(数据、训练、收集、模拟、评估、存储和运营):Anthony Brohan、Keerthana Gopalakrishnan、Karol Hausman、Alex Herzog、Jasmine Hsu、Alex Irpan、Nikhil Joshi、Ryan Julian、Dmitry Kalashnikov、Yuheng Kuang、Isabel Leal、Yao Lu、Fei Xia、Ted Xiao、Peng Xu、Sichun Xu和Tianhe Yu。

• 领导(项目管理或提供建议):Chelsea Finn、Karol Hausman、Julian Ibarz、Sally Jesmonth、Sergey Levine、Yao Lu、Igor Mordatch、Carolina Parada、Kanishka Rao、Pannag Sanketi、Vincent Vanhoucke。

• 论文(图表、可视化、写作):Keerthana Gopalakrishnan、Karol Hausman、Brian Ichter、Sergey Levine、Ofir Nachum、Karl Pertsch、Kanishka Rao、Austin Stone、Fei Xia和Ted Xiao。

• 数据收集和评估:Noah Brown、Justice Carbajal、Joseph Dabis、Tomas Jackson、Utsav Malla、Deeksha Manjunath、Jodily Peralta、Emily Perez、Jornell Quiambao、Grecia Salazar、Kevin Sayed、Jaspiar Singh、Clayton Tan、Huong Tran、Steve Vega和Brianna Zitkovich。

B MODEL CARD

B 模型卡

We present the Model Card for RT-1 in Fig. 7.

我们在图7中呈现了RT-1的模型卡。

Model Card for RT-1 (Robotics Transformer)
RT-1模型卡

Model Details
• Developed by researchers at Robotics at Google and Everyday Robots, 2022, v1.
• Transformer-based model, built upon a FiLM-conditioned EfficientNet (Tan & Le, 2019), a TokenLearner (Ryoo et al., 2021), and a Transformer (Vaswani et al., 2017).
• Trained with imitation learning, taking natural language tasks and images as inputs and outputting robot actions.

模型详细信息:
• 由谷歌机器人学和Everyday Robots的研究人员于2022年开发,版本1。
• 基于Transformer的模型,构建在FiLM-conditioned EfficientNet(Tan & Le, 2019)、TokenLearner(Ryoo et al., 2021)和Transformer(Vaswani et al., 2017)之上。
• 采用模仿学习进行训练,输入为自然语言任务和图像,输出为机器人动作。

Intended Use
• Intended to be used for controlling an Everyday Robot for manipulation tasks.
• Unclear suitability as a learned representation for different robotic embodiments, environments, or significantly varied downstream tasks.
• Not suitable for interaction with humans.

Factors
• Factors include varying backgrounds, lighting, scenes, base position, and novel natural language tasks. Hardware factors include camera and robot embodiment.

预期用途:
• 旨在用于控制Everyday Robot执行操作任务。
• 不清楚其作为学习表征是否适用于不同的机器人实体、环境或显著不同的下游任务。
• 不适用于与人类互动。

因素:
• 因素包括不同的背景、光照、场景、基座位置和新颖的自然语言任务。硬件因素包括摄像头和机器人实体。

Metrics
• Evaluation metrics include seen task performance, unseen task performance, robustness to backgrounds and distractors, and performance in long-horizon scenarios. Each measures the success rate of the model performing natural language specified tasks with randomized objects and object locations and varying scenes.

度量标准:
• 评估指标包括看到的任务性能、未见任务性能、对背景和干扰物的鲁棒性以及在长时程情景中的性能。每个指标都衡量了模型在执行随机对象和对象位置以及不同场景的自然语言指定任务时的成功率。

Training Data
• Trained on 130k tele-operation demonstrations over 13 robots and 744 tasks.

训练数据:
• 在来自13台机器人、涵盖744个任务的13万次远程操作演示上进行训练。
Evaluation Data
• Evaluated on real-world randomized scenes and over 3000 total rollouts in the environment it was trained on as well as two new office kitchen environments.

评估数据:
• 在真实世界的随机场景中进行评估:在其训练环境以及两个新的办公室厨房环境中总共进行了超过3000次试运行(rollout)。

Quantitative Analyses
• RT-1 shows high performance and robustness and can learn from heterogeneous data.

定量分析:
• RT-1表现出高性能和鲁棒性,能够从异构数据中学习。

Ethical Considerations
• Early research, model has not yet been evaluated for suitability to use outside of its current research setting.

伦理考虑:
• 早期研究,模型尚未评估是否适用于当前研究环境之外的使用。

Caveats and Recommendations
• While the current model covers only a small portion of possible robotic manipulation tasks, it presents a recipe for scalable robotic learning and an architecture that shows favorable generalization and data absorption properties.

注意事项和建议:
• 尽管当前模型仅涵盖可能的机器人操纵任务的一小部分,但它呈现了可扩展机器人学习的方法和一种显示有利的泛化和数据吸收特性的架构。

C MODEL AND DATA

C 模型和数据

C.1 MODEL INFERENCE
C.1 模型推断

In addition to the inference speed requirement, we need to ensure that our system outputs actions at a consistent frequency, avoiding jitter. To accomplish this, we introduce a fixed-time waiting mechanism that waits a fixed amount of time (280 ms, the maximum observed latency of all components) after the state that was used to compute the next action has been captured, but before applying the action, similar to the procedure described by Xiao et al. (2020).

除了推断速度的要求外,我们需要确保我们的系统以一致的频率输出动作,避免抖动。为了实现这一点,我们引入了一个固定时间等待机制,在捕获用于计算下一个动作的状态后等待一定的时间(280毫秒,所有组件的最大观测延迟),但在应用动作之前,类似于Xiao等人(2020)描述的过程。
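As an illustration, a minimal Python sketch of such a fixed-time waiting mechanism is shown below; `robot.get_observation`, `robot.apply_action`, and `policy` are hypothetical interfaces, and 280 ms is the maximum observed component latency quoted above.

```python
import time

ACTION_DELAY_S = 0.280  # max observed latency of all components (from the text above)

def control_step(robot, policy):
    """Apply each action a fixed 280 ms after the state it was computed
    from was captured, so actions are emitted at a consistent frequency
    regardless of per-step inference-time jitter."""
    t_capture = time.monotonic()
    obs = robot.get_observation()   # state used to compute the next action
    action = policy(obs)            # inference latency varies step to step...
    remaining = ACTION_DELAY_S - (time.monotonic() - t_capture)
    if remaining > 0:               # ...so pad the wait out to the fixed delay
        time.sleep(remaining)
    robot.apply_action(action)
```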

C.2 DATA COLLECTION AT SCALE
C.2 规模化数据收集

Each of the robots autonomously approaches its station at the beginning of the episode and communicates to the operator the instruction that they should demonstrate to the robot. To ensure a balanced dataset as well as randomization of the scene, we created a software module responsible for sampling the instructions to be demonstrated as well as the randomization of the background configuration. Each of the robots tells the demonstrator how to randomize the scene and which instruction to demonstrate.

每个机器人在每个episode开始时都会自主地接近其工作站,并向操作员传达应向机器人演示的指令。为了确保数据集的平衡以及场景的随机化,我们创建了一个软件模块,负责对要演示的指令进行采样以及对背景配置进行随机化。每个机器人告诉演示者如何随机化场景以及演示哪条指令。

Demonstrations are collected with direct line-of-sight between operator and robot using 2 virtual reality remotes. We map remote controls onto our policy action space to preserve consistency of the transition-dynamics. 3D position and rotational displacements of the remote are mapped to 6d displacements of the robot tool. The x, y position of the joystick is mapped to a turning angle and driving distance of the mobile base. We compute and track trajectories to the target poses that we obtain from the joystick commands.

演示是通过使用2个虚拟现实遥控器,以操作员和机器人之间的直线视线进行的。我们将遥控器控件映射到我们的策略动作空间,以保持转换动态的一致性。遥控器的3D位置和旋转位移被映射到机器人工具的6D位移。摇杆的x、y位置被映射到移动基座的转向角和行驶距离。我们计算并跟踪从摇杆命令中获取的目标姿势的轨迹。
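A rough sketch of this teleoperation mapping is given below; the gains and the exact action layout are illustrative assumptions, not values from the paper.

```python
import numpy as np

def remote_to_action(remote_dpos, remote_drot, joystick_xy,
                     turn_gain=1.0, drive_gain=1.0):
    """Map VR-remote displacements to the policy action space: the remote's
    3D position and rotation deltas become a 6D tool displacement, and the
    joystick's (x, y) becomes a base turning angle and driving distance."""
    tool_delta = np.concatenate([remote_dpos, remote_drot])  # (dx, dy, dz, droll, dpitch, dyaw)
    turn_angle = turn_gain * joystick_xy[0]
    drive_dist = drive_gain * joystick_xy[1]
    return tool_delta, (turn_angle, drive_dist)
```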

C.3 MODEL SELECTION AT SCALE
C.3 规模化的模型选择

As robot learning systems become more capable and the number of instructions they can handle increases, evaluation of these models becomes difficult (Kalashnikov et al., 2021a; Jang et al., 2021). This is an important consideration not only for evaluating different model classes and data distributions during the development process, but also for selecting the most performant model checkpoints for a particular training run. While there have been a number of proposed solutions to this problem (Dudík et al., 2011; Irpan et al., 2019; Hanna et al., 2017), mostly known in the offline reinforcement learning literature as “off-policy evaluation”, it still remains an open research challenge to evaluate multi-task robot learning systems at scale.

随着机器人学习系统变得更加强大,它们能够处理的指令数量增加,对这些模型进行评估变得困难(Kalashnikov等,2021a; Jang等,2021)。这不仅是在开发过程中评估不同模型类和数据分布的重要考虑因素,还包括选择特定训练运行的性能最佳模型检查点。虽然对于这个问题已经有许多提出的解决方案(Dudík等,2011; Irpan等,2019; Hanna等,2017),在离线强化学习文献中通常被称为“离线策略评估”,但在规模上评估多任务机器人学习系统仍然是一个开放的研究挑战。

In this work, we propose leveraging simulation for “real to sim” transfer as a scalable tool that provides an approximate estimate of model performance during training across many real tasks. We run policies trained from real data in a simulator to test the full rollout performance. Note that all of our training data comes from the real world (except the experiment in Section 6.3), and the simulator is used only for model selection. To accomplish this, we expand the simulation environment proposed by Lee et al. (2022b) to support 551 of the tasks described in Section 5.2. For each of these tasks, we define a set of scene setup randomizations, robot pose randomizations, and success detection criteria. To bridge the visual distribution shift between the real world and the simulation, we train a RetinaGAN (Ho et al., 2020) model that transforms simulated images into realistic looking images. Then, we deploy policies trained on real data directly into these simulation environments by applying RetinaGAN visual transformations at each timestep and measuring rollout simulated task success rates.

在这项工作中,我们提出利用模拟进行“真实到虚拟”的转移,作为一个可伸缩的工具,它在训练过程中为许多真实任务提供模型性能的近似估计。我们在模拟器中运行从真实数据训练的策略,以测试完整的回放性能。请注意,我们所有的训练数据都来自真实世界(除了第6.3节的实验),模拟器仅用于模型选择。为了实现这一点,我们扩展了由Lee等人提出的模拟环境(2022b),以支持第5.2节中描述的551个任务。对于这些任务中的每一个,我们定义了一组场景设置的随机化、机器人姿势的随机化和成功检测标准。为了弥合真实世界和模拟之间的视觉分布差异,我们训练了一个RetinaGAN(Ho等,2020)模型,将模拟图像转换为看起来真实的图像。然后,我们通过在每个时间步应用RetinaGAN视觉转换并测量回放模拟任务成功率,将在真实数据上训练的策略直接部署到这些模拟环境中。
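The evaluation loop can be pictured as in the sketch below, where `sim_env`, `retinagan`, and the task API are hypothetical stand-ins for the real-to-sim infrastructure described above.

```python
def sim_success_rate(policy, sim_env, retinagan, tasks, episodes_per_task=10):
    """Run a policy trained on real data inside simulation, passing every
    simulated frame through a RetinaGAN-style generator so the policy sees
    realistic-looking images, and measure rollout success rates."""
    successes, total = 0, 0
    for task in tasks:
        for _ in range(episodes_per_task):
            obs = sim_env.reset(task=task)   # randomized scene and robot pose
            done, info = False, {"success": False}
            while not done:
                obs["image"] = retinagan(obs["image"])  # bridge the visual gap
                action = policy(obs, task.instruction)
                obs, done, info = sim_env.step(action)
            successes += int(info["success"])  # task-specific success detector
            total += 1
    return successes / total
```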

While models trained only on real world data perform better in the real world than they do in simulation, we find that the simulation success rates of high-performing real world policies are higher than the simulation success rates of low-performing real world policies. In other words, the ordering of simulation policy success rates are informative for predicting the ordering of real world policy success rates. We note that in this real-to-sim evaluation setting, we have a less strict requirement for simulation accuracy compared to sim-to-real settings; as long as simulation success rates are directionally correlated with real success rates, we can accept a moderate or even high gap between real and simulation success rates.

虽然仅根据现实世界数据训练的模型在现实世界中的表现比在模拟中表现更好,但我们发现高性能真实世界策略的模拟成功率高于低性能真实世界策略的模拟成功率。换句话说,模拟策略成功率的排序对预测真实策略成功率的排序是有信息的。我们注意到,在这个真实到虚拟的评估设置中,我们对模拟准确性的要求不像虚拟到真实的设置那么严格;只要模拟成功率在方向上与真实成功率相关,我们可以接受真实和模拟成功率之间的中等甚至高的差距。
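In other words, what matters for model selection is rank correlation rather than absolute agreement; the toy check below, with made-up success rates, illustrates the criterion.

```python
from scipy.stats import spearmanr

# Hypothetical per-checkpoint success rates. Real-to-sim evaluation is useful
# for model selection as long as the *ordering* of checkpoints agrees, even
# when absolute sim and real rates differ by a large constant gap.
sim_rates  = [0.35, 0.48, 0.52, 0.61]   # simulation success per checkpoint
real_rates = [0.55, 0.63, 0.70, 0.78]   # real-world success per checkpoint

rho, _ = spearmanr(sim_rates, real_rates)
print(f"Spearman rank correlation: {rho:.2f}")  # 1.0 here: identical ordering
```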

We present example camera images from simulation as well as their RetinaGAN-based transformations in Fig. 8.

Figure 8: Example camera images showcasing raw simulation, simulation with RetinaGAN applied, and the real world.

图8:展示原始模拟、应用RetinaGAN的模拟以及真实世界的示例摄像头图像。

C.4 DATA COLLECTION PROCESS
C.4 数据收集过程

Figure 9 shows the growth of data, number of tasks, and the success rate of the policy over time. The number of tasks/instructions that our system is capable of performing grows over time as more data is collected. The same is true of the performance on seen tasks. One of the important aspects of future work is to develop techniques that allow us to grow the data as well as the robot's performance and general capabilities at a faster rate.

图9展示了随时间推移数据量、任务数量以及策略成功率的增长。随着收集更多数据,我们系统能够执行的任务/指令数量也在逐渐增加。对于已见任务的表现也是如此。未来工作的一个重要方面是开发技术,使我们能够以更快的速度增加数据以及机器人的性能和整体能力。
Figure 9: The growth of data, number of tasks, and seen instruction performance over time.
图9:随时间增长的数据量、任务数量和已见指令性能。

D EXPERIMENTS

D 实验

D.1 EVALUATION DETAILS
D.1 评估细节

In Section 6.2, we study the zero-shot generalization capabilities of RT-1 to difficult scenarios not present in the training dataset. To fairly evaluate different ablations of RT-1 as well as baseline policies, we design standardized evaluation procedures that cover a range of incremental difficulty levels.

在第6.2节中,我们研究了RT-1对于训练数据集中不存在的困难场景的零样本泛化能力。为了公平评估RT-1的不同消融版本以及基准策略,我们设计了标准化的评估程序,涵盖了一系列逐渐增加的难度级别。

Seen tasks. We evaluate on 744 tasks present in the training dataset. The breakdown between 12 skills is shown in Table 1. For all “Seen” evaluations, we use the same classroom setting used for data collection as described in Section 5.2. For each policy, we report a single representative metric that takes a skill-weighted average across individual skill evaluations.

已见任务。我们评估了训练数据集中的744个任务。表1显示了12种技能的分布。对于所有“已见”评估,我们使用与第5.2节中描述的数据收集相同的教室设置。对于每个策略,我们报告一个代表性指标,即对各项技能评估取技能加权平均。
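As a concrete reading of this metric, the sketch below computes a skill-weighted average under the assumption that each skill is weighted by the number of seen tasks it contributes; the exact weighting scheme is our assumption.

```python
def skill_weighted_success(per_skill_success, per_skill_num_tasks):
    """Single representative metric: per-skill success rates averaged with
    weights proportional to each skill's share of the seen tasks."""
    total_tasks = sum(per_skill_num_tasks.values())
    return sum(per_skill_success[skill] * n_tasks / total_tasks
               for skill, n_tasks in per_skill_num_tasks.items())
```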

Unseen tasks. We evaluate policy performance on 53 tasks that are held out during training. While the unseen instructions’ specific combinations of skills and objects are not seen during training, other combinations of the same skills and objects are present in the training set. We evaluate these unseen tasks in the same environment and the same randomization procedure as the Seen tasks. A full list of these unseen tasks is shown in Table 8.

未见任务。我们评估了模型在训练期间被保留的53个任务上的性能。虽然未见指令的特定技能和对象的组合在训练期间未见,但在训练集中存在相同技能和对象的其他组合。我们在与“已见”任务相同的环境和相同的随机化程序中评估这些未见任务。这些未见任务的完整列表在表8中显示。

Table 8: List of Unseen Instructions in Sec. 6.2. For the “Unseen Tasks” evaluation, we exclude a total of 53 tasks during training. While these exact instructions were not present in the training set, the objects and skills contained in these instructions were still present in the training set.

表8:Sec. 6.2中未见指令的列表。在“未见任务”评估中,我们在训练过程中排除了共计53个任务。尽管这些确切的指令在训练集中不存在,但这些指令中包含的物体和技能仍然存在于训练集中。

Distractor robustness. We test three tasks (“pick coke can”, “place coke can upright”, “move coke can near green rice chip bag”) with incrementally more distractor objects added to the scene. The easy setting includes 0, 2, or 5 distractor objects. The medium setting includes 9 distractor objects, but the coke can is never obscured. The hard setting includes 9 distractor objects, but the scene is more crowded and the coke can is partially occluded. Both the medium and hard settings are more difficult than scenarios in the training dataset, which contained between 0 and 4 distractors. Examples of these difficulty settings and policy evaluation rollouts are shown in Figure 12.

干扰物鲁棒性。我们测试了三个任务(“拿起可乐罐”、“把可乐罐竖起来放”、“将可乐罐移到绿色米片袋附近”),逐渐增加了场景中的干扰物对象。简单设置包括0、2或5个干扰物对象。中等设置包括9个干扰物对象,但可乐罐从未被遮挡。困难设置包括9个干扰物对象,但场景更拥挤,可乐罐被部分遮挡。中等和困难设置都比训练数据集中的场景更困难,该数据集中包含0到4个干扰物。这些难度设置和策略评估过程的示例在图12中显示。
Figure 12: “Distractors” evaluations focus on diversifying initial scene configurations well beyond the distributions contained in the training dataset, which contain between 2 and 4 distractor objects. In the most challenging scenarios, the scene is extremely cluttered and contains occlusions for the objects of interest.

图12:“Distractors”评估重点在于使初始场景配置多样化,远远超出训练数据集中包含的分布(训练数据集包含2到4个干扰物体)。在最具挑战性的情景中,场景非常杂乱,并且感兴趣的物体会被部分遮挡。

Background robustness. We test six tasks (“pick coke can”, “move blue chip bag near orange”, “knock redbull can over”, “pick green jalapeno chip bag”, “move sponge near brown chip bag”,“place redbull can upright”) with incrementally more challenging backgrounds and counter textures. In the easy setting, we utilize the same background environments and counter textures as the training dataset. In the medium setting, we utilize the same background environment but add a patterned tablecloth to change the counter texture. In the hard setting, we utilize a brand new kitchen environment with a new countertop; this changes the counter texture, drawer material and color, and background visuals. Examples of these difficulty settings and policy evaluation rollouts are shown in Figure 10.

背景鲁棒性。我们测试了六个任务(“拿起可乐罐”、“将蓝色薯片袋移到橙色附近”、“推倒红牛罐”、“拿起绿色辣椒薯片袋”、“将海绵移到棕色薯片袋附近”、“把红牛罐竖起来放”),逐渐增加了具有挑战性的背景和柜台纹理。在简单设置中,我们使用与训练数据集相同的背景环境和柜台纹理。在中等设置中,我们使用相同的背景环境,但添加了带有图案的桌布以更改柜台纹理。在困难设置中,我们使用一个全新的厨房环境,其中包括新的台面;这改变了柜台纹理、抽屉材料和颜色以及背景视觉效果。这些难度设置和策略评估过程的示例在图10中显示。

Figure 10: “Backgrounds” evaluations focus on testing the performance of RT-1 on settings with different table textures and different backgrounds, such as those found in kitchens never trained on. These visual differences are quite pronounced, which in the most challenging case entails a new kitchen with different counter texture, different lighting conditions, different counter material, and a different background.

图10:“背景”评估着重于在具有不同桌面纹理和不同背景的设置上测试RT-1的性能,例如在从未经过训练的厨房中找到的设置。这些视觉差异非常明显,在最具挑战性的情况下,涉及到一个具有不同柜台纹理、不同照明条件、不同柜台材料和不同背景的新厨房。

Realistic instructions. To study how RT-1 performs in more realistic scenarios, we propose an evaluation setting in a real office kitchen that is a dramatic shift from the original training classroom environment. We propose a variety of skills that combine aspects of the previous zero-shot evaluations, including adding new distractors, including new backgrounds, and new combinations of objects with skills. We refer to the easiest scenario as L1 generalization, which introduces a new countertop and lighting condition but keeps the skills and objects the same. Next, L2 generalization additionally adds novel distractor objects such as kitchen jar containers. Finally, L3 generalization adds new objects or new locations such as near a sink. While some of these distribution shifts are tested in Section 6.2, these realistic instructions aim to test multiple dimensions simultaneously. Examples of these instructions are presented in Fig. 11.

真实指令。为了研究RT-1在更现实的场景中的表现,我们提出了在真实办公室厨房进行评估的设置,这是与原始训练教室环境截然不同的场景。我们提出了一系列技能,结合了先前零样本评估的各个方面,包括添加新的干扰物、包括新的背景以及新的对象与技能的组合。我们将最简单的情景称为L1泛化,它引入了新的台面和照明条件,但保持了技能和对象不变。接下来,L2泛化还额外添加了新的干扰物对象,如厨房罐容器。最后,L3泛化添加了新的对象或新的位置,例如靠近水槽。尽管一些这些分布变化在第6.2节进行了测试,但这些真实指令旨在同时测试多个维度。这些指令的示例在图11中呈现。
Figure 11: “Realistic instructions” evaluations propose realistic scenarios with multiple distribution shifts that incrementally increase in difficulty. L1 generalization introduces a new real office kitchen with new lighting conditions. L2 generalization additionally adds unseen distractor objects. Finally, L3 generalization includes new objects or objects in new locations, such as next to a sink.

图11:“真实指令”评估提出了多个分布变化的真实场景,逐渐增加难度。L1泛化引入了一个新的真实办公厨房,具有新的照明条件。L2泛化进一步添加了未见过的干扰物体。最后,L3泛化包括新的物体或位于新位置的物体,例如在水槽旁边。

D.2 HETEROGENEOUS DATA
D.2 异质数据

We also explore the limits of RT-1 for utilizing highly heterogeneous data. We demonstrate how RT-1 can incorporate and learn from vastly different data sources and improve from such data without sacrificing its original-tasks performance across the varied tasks inherent in this data. To this end, we conduct two experiments: (1) RT-1 trained and tested on both real data and simulation data and (2) RT-1 trained across large datasets of different tasks, originally collected by different robots.

我们还探索了RT-1利用高度异质数据的限制。我们演示了RT-1如何吸收并学习来自非常不同的数据源的数据,并在不损害其在这些数据中固有的各种任务上的原始任务性能的情况下改进。为此,我们进行了两个实验:(1)RT-1在真实数据和模拟数据上进行训练和测试,以及(2)RT-1跨不同机器人最初收集的大量不同任务的数据进行训练。

Absorbing simulation data. Table 9 shows the ability of RT-1, and baselines, to absorb both real and simulation data. To test this, we take all of the real demonstration data but we also provide additional simulation data that includes objects that the robot has never seen in the real world. We add a set of sim objects and only show them on a subset of tasks, specifically the picking tasks, in simulation. To accomplish this, we run our real2sim method described in Sec. C.3 to bootstrap a simulation policy from the real world policy that is then trained with multi-task RL (Kalashnikov et al., 2021a) with additional objects in simulation. From this process, we extract 518k successful trajectories of picking new objects and mix them with the real data that was used in the previous experiments. The goal of this experiment is to demonstrate that by expanding the dataset of simulation trajectories, we can benefit RT-1’s generalization capabilities without sacrificing the original training performance – a desired property of an absorbent model.

吸收模拟数据。表9显示了RT-1及基准模型吸收真实和模拟数据的能力。为了测试这一点,我们获取所有真实演示数据,但我们还提供了额外的模拟数据,其中包括机器人在真实世界中从未见过的对象。我们添加了一组模拟对象,并仅在模拟中的一部分任务中显示它们,具体来说是在模拟中的拾取任务。为了实现这一点,我们运行了我们在C.3中描述的real2sim方法,以从真实世界策略中引导一个模拟策略,然后使用模拟中的额外对象进行多任务RL(Kalashnikov等人,2021a)的训练。从这个过程中,我们提取了拾取新对象的518k成功轨迹,并将它们与先前实验中使用的真实数据混合在一起。这个实验的目标是演示通过扩展模拟轨迹数据集,我们可以在不损害原始训练性能的情况下提高RT-1的泛化能力 - 这是一个吸收模型的期望属性。
Table 9: Experimental results for incorporating simulation data in RT-1. Adding simulation data does not impact the performance on real objects, while significantly improving real performance on objects that were only introduced in simulation.

表9:在RT-1中整合模拟数据的实验结果。添加模拟数据不影响对真实对象的性能,而显著提高了仅在模拟中引入的对象的真实性能。

To evaluate the properties of this model, we specify different generalization scenarios: for seen skills with real objects the training data has real data of that instruction (i.e., performance on seen tasks), for seen skills with sim objects the training data has sim data of that instruction (e.g. “pick up a sim object”, which was present in sim), and for unseen skills with sim objects the training data has sim data of that object but there are no examples of the instruction describing the skill with that object either in sim or in real (e.g., “move a sim object to apple”, even though the robot has only practiced picking that sim object and not moving it near other objects). All evaluations are done in the real world, but to limit the number of instructions evaluated, we focus on pick and move-to skills.

为了评估该模型的性能,我们指定了不同的泛化场景:对于具有真实对象的已见技能,训练数据包含该指令的真实数据(即在已见任务上的性能),对于具有模拟对象的已见技能,训练数据包含该指令的模拟数据(例如“拾取模拟对象”,该对象在模拟中存在),对于具有模拟对象的未见技能,训练数据包含该对象的模拟数据,但在模拟或真实中都没有描述该对象技能的示例(例如“将模拟对象移动到苹果附近” ,尽管机器人只练习了拾取该模拟对象而不是将其移动到其他对象附近)。所有评估都在真实世界中进行,但为了限制评估的指令数量,我们专注于拾取和移动技能。

We find in Table 9 that for RT-1, we do not lose performance by adding simulation data compared to the Real Only dataset. We do, however, see a significant increase in performance (from 23% to 87%) on objects and tasks seen only in simulation, to approximately the performance of those seen in real, demonstrating an impressive degree of domain transfer. We also see a significant increase in performance on unseen instructions from 7% to 33%; impressive given the object in question has never been seen in real and the instruction never seen at all. Overall, we find that RT-1 is able to efficiently “sponge up” new data, even from a very different domain.

在表9中,我们发现对于RT-1,与仅使用真实数据相比,添加模拟数据不会损失性能。然而,对于仅在模拟中见过的对象和任务,我们确实看到了性能的显著提升(从23%提升到87%),接近真实对象上的性能,展示了令人印象深刻的领域迁移程度。我们还看到在未见指令上的性能从7%显著提升到33%;考虑到相关对象在真实世界中从未见过、指令也完全没有出现过,这是令人印象深刻的。总体而言,我们发现RT-1能够高效地“吸收”新数据,即使这些数据来自非常不同的领域。

Absorbing data from different robots. To push the data absorption limits of RT-1, we conduct an additional set of experiments where we combine two data sources that originate from different robots: Kuka IIWA as well as the Everyday Robots mobile manipulators used in the experiments so far. The Kuka data contains all the successful examples collected in QT-Opt (Kalashnikov et al., 2018), which corresponds to 209k episodes, where the robot was indiscriminately grasping objects in a bin (see an example of a Kuka episode in Table 10). Our goal in this experiment is to analyze whether the performance on the RT-1 tasks drops when adding the additional data and, more importantly, whether we can observe any transfer from data collected by a different robot morphology.

吸收来自不同机器人的数据。为了推动RT-1的数据吸收极限,我们进行了另一组实验,将来自不同机器人的两个数据源结合起来:Kuka IIWA以及到目前为止在实验中使用的Everyday Robots移动操纵器。Kuka数据包含在QT-Opt(Kalashnikov等人,2018)中收集的所有成功示例,对应于209k个episode,其中机器人在容器中不加选择地抓取物体(请参见表10中的Kuka episode示例)。我们在这个实验中的目标是分析在添加额外数据时RT-1任务的性能是否下降,更重要的是,我们是否能观察到来自不同机器人形态的数据的任何迁移。
Table 10: Experimental results for mixing data from two different robots. Incorporating Kuka bin-picking data from QT-Opt (Kalashnikov et al., 2018) in RT-1 minimally impacts the standard classroom evaluation performance and results in almost a 2x improvement in generalization to the Bin-picking evaluation (that is similar to the setup in the Kuka data) on the Everyday Robots manipulator. This demonstrates an effective transfer across two different robot morphologies.

表10:混合来自两个不同机器人的数据的实验结果。在RT-1中整合来自QT-Opt(Kalashnikov等人,2018)的Kuka容器拾取数据,对标准课堂评估性能影响很小,并使Everyday Robots机械手在与Kuka数据设置类似的Bin-picking评估中的泛化性能提高了近2倍。这证明了在两种不同机器人形态之间的有效迁移。
We would like to emphasize the difficulty of this setting by noting the major differences between the datasets. Not only are the robots that collected the data different in appearance and action space, but the environments they were deployed in also differ in appearance and dynamics. In addition, the QT-Opt data presents a completely different action distribution – it was collected by an RL agent as opposed to the human demonstrations present in our dataset.

我们要强调这个设置的难度,注意数据集之间的主要差异。收集数据的机器人不仅在外观和动作空间上不同,而且它们部署的环境在外观和动力学上也不同。此外,QT-Opt数据呈现了完全不同的动作分布 - 它是由RL代理收集的,而不是我们数据集中存在的人类演示。

To mix the Kuka data together with the RT-1 data, we first transform the original Kuka 4-DOF action space into the same action space as RT-1, namely we set the roll and pitch to 0, while keeping the yaw values that were present in the original Kuka data. In addition, we transform the binary gripper-close command into a continuous gripper-closedness command that is present in the RT-1 data. We also need text instructions corresponding to the task performed and since the Kuka data does not contain the name of the object that was grasped, we relabel all the data to the “pick anything” instruction. With these modifications, we mix both datasets with the 2:1 (RT-1 data : Kuka data) ratio and train RT-1 to obtain the final model.

为了将Kuka数据与RT-1数据混合在一起,我们首先将原始Kuka 4-DOF动作空间转换为与RT-1相同的动作空间,即将滚动和俯仰设置为0,同时保留原始Kuka数据中存在的偏航值。此外,我们将二进制夹持器关闭命令转换为RT-1数据中存在的连续夹持器关闭命令。我们还需要与执行的任务相对应的文本指令,由于Kuka数据不包含被抓取的对象的名称,我们将所有数据重新标记为“拾取任何东西”的指令。通过这些修改,我们以2:1(RT-1数据:Kuka数据)的比例混合了两个数据集,并训练RT-1获得最终模型。
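A sketch of this alignment step is shown below; the field names are hypothetical, but the logic follows the text: zero roll and pitch, keep yaw, convert the binary gripper command to a continuous closedness value, relabel the instruction, and mix at a 2:1 ratio.

```python
import numpy as np

def kuka_to_rt1_action(kuka_action):
    """Lift a 4-DOF Kuka action (xyz translation, yaw, binary gripper) into
    the RT-1 action space."""
    return {
        "world_vector": np.asarray(kuka_action["xyz"]),               # 3D displacement
        "rotation_delta": np.array([0.0, 0.0, kuka_action["yaw"]]),   # roll = pitch = 0
        "gripper_closedness": 1.0 if kuka_action["close_gripper"] else 0.0,
        "instruction": "pick anything",                               # relabeled task string
    }

# The converted episodes are then interleaved with RT-1 data at a 2:1
# (RT-1 : Kuka) ratio, e.g. via per-dataset sampling weights.
MIX_WEIGHTS = {"rt1": 2.0, "kuka": 1.0}
```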

To test whether RT-1 can effectively absorb these two very different datasets, we evaluate the performance on the original RT-1 tasks (in this case, we also focus on “pick” and “move to” skills), which we refer to as the standard “Classroom eval”, as well as the performance on the newly constructed tasks that reflect the bin-picking setup present in the Kuka data, which we refer to as the “Bin-picking eval”. For the Bin-picking eval to be close to the original dataset, we put in the same-looking bin for the objects as well as modify the robot to be similar to the Kuka manipulators by adding extra wires and coloring the gripper gray. For all of the evaluations we use the Everyday Robots robot with the picking commands and evaluate it based on 72 grasping trials.

为了测试RT-1是否能有效吸收这两个非常不同的数据集,我们评估了模型在原始RT-1任务上的性能(在这种情况下,我们同样专注于“拾取”和“移动到”技能),我们将其称为标准的“Classroom eval”;以及在反映Kuka数据中容器拾取设置的新构建任务上的性能,我们将其称为“Bin-picking eval”。为了使Bin-picking eval接近原始数据集,我们为物体放入了外观相同的容器,并通过添加额外的电线以及将夹爪涂成灰色来修改机器人,使其更像Kuka操作器。对于所有评估,我们使用Everyday Robots机器人执行拾取指令,并基于72次抓取试验进行评估。

The results are presented in Table 10. We observe that the model that mixes the RT-1 data and the Kuka data has only a minimal decrease in the original tasks’ performance (i.e. Classroom eval), i.e. 2%. Even more importantly, in the Bin-picking eval, we observe that the model trained on multirobot data performs at 39% compared to the 22% of the model that was trained only on the RT-1 data. This is a 17% performance difference (almost 2x). Additionally, RT-1 trained on Kuka bin-picking data and evaluated on the bin-picking tasks with the Everyday Robots (EDR) robot achieves 0% performance, confirming that it is difficult to transfer a behavior from another robot morphology. However, mixing the data from both robots allows RT-1 to infer the correct actions of the EDR robot even when faced with the states observed by Kuka robots. This is achieved without explicit demonstrations of bin-picking on EDR robot and by taking advantage of past experiences collected by Kuka robots. These results indicate that RT-1’s absorption properties also include the ability to acquire new skills through observing other robots’ experiences and present an exciting avenue of future work where we combine many more multi-robot datasets to enhance the robot capabilities.

结果显示在表10中。我们观察到混合RT-1数据和Kuka数据的模型在原始任务(即Classroom eval)的性能上仅有微小的下降,即2%。更重要的是,在Bin-picking eval中,我们观察到使用多机器人数据训练的模型的成功率为39%,而仅在RT-1数据上训练的模型为22%。这是17%的性能差异(几乎是2倍)。此外,仅在Kuka容器拾取数据上训练的RT-1在使用Everyday Robots(EDR)机器人执行容器拾取任务时成功率为0%,证实了从另一种机器人形态迁移行为的困难性。然而,混合来自两个机器人的数据使RT-1能够在面对Kuka机器人观察到的状态时推断出EDR机器人的正确动作。这是在没有在EDR机器人上明确演示容器拾取的情况下,利用Kuka机器人收集的过去经验实现的。这些结果表明,RT-1的吸收特性还包括通过观察其他机器人的经验获得新技能的能力,并展示了一个令人兴奋的未来研究方向,即结合更多的多机器人数据集以增强机器人能力。

D.3 LONG-HORIZON EVALUATION DETAILS
D.3 长时任务评估细节

In addition to short-horizon individual skill evaluations shown in previous sections, we also evaluate how RT-1 performs in a long-horizon realistic kitchen setting that chains multiple manipulation and navigation skills to accomplish natural language instructions within the SayCan framework (Ahn et al., 2022). A list of long-horizon instructions used for these evaluations is listed in Table 12.

除了前面章节中展示的短时程单项技能评估之外,我们还评估了RT-1在长时程现实厨房环境中的表现:在SayCan框架(Ahn等人,2022)中串联多个操纵和导航技能,以完成自然语言指令。用于这些评估的长时程指令列表见表12。
Table 12: List of SayCan instructions evaluated in Sec. 6.4
表12:在第6.4节中评估的SayCan指令列表

The success rate of long-horizon tasks decreases exponentially with the length of the task, so high success rates in manipulation skills are particularly important. Furthermore, as mobile manipulation tasks require both navigation and manipulation, the policy's ability to be robust to base position is crucial. Since SayCan combines many low-level instructions to perform high-level instructions, the number of possible high-level instructions increases combinatorially with instructions, so the skill-breadth of RT-1 can be fully seen.

长时任务的成功率随任务长度的增加呈指数下降,因此操纵技能的高成功率尤为重要。此外,由于移动操纵任务既需要导航又需要操纵,因此策略对基座位置的鲁棒性至关重要。由于SayCan将许多低级指令组合起来执行高级指令,因此随着指令的增加,可能的高级指令数量呈组合增加,因此RT-1的技能广度可以得到充分展现。

SayCan works by grounding language models in robotic affordances and it leverages few-shot prompting to break down a long horizon task expressed in natural language to a sequence of low level skills. An example of long horizon task would be “Bring me two different sodas”, and one feasible plan would be “1. find a coke, 2. pick up the coke, 3. bring it to you, 4. put down the coke, 5. find a pepsi, 6. pick up the pepsi, 7. bring it to you, 8. put down the pepsi, 9. done.” To obtain the affordance function we use value functions trained with MT-OPT (Kalashnikov et al., 2021a). For a detailed description of SayCan algorithm please refer to (Ahn et al., 2022).

SayCan的工作方式是将语言模型与机器人可供性(affordance)相结合,并利用少样本提示将以自然语言表达的长时程任务分解为一系列低级技能。一个长时程任务的示例是“给我拿两种不同的苏打水”,一个可行的计划是“1. 找到一罐可乐,2. 拿起可乐,3. 带给你,4. 放下可乐,5. 找到一罐百事,6. 拿起百事,7. 带给你,8. 放下百事,9. 完成。”为了获得可供性函数,我们使用了通过MT-OPT(Kalashnikov等人,2021a)训练的价值函数。有关SayCan算法的详细描述,请参阅(Ahn等人,2022)。
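For intuition, a minimal sketch of SayCan-style skill selection is given below: each candidate skill is scored by the product of the language model's probability and the learned affordance value. `llm_logprob` and `affordance_value` are hypothetical callables, and a real system would execute each selected skill before re-scoring.

```python
import math

def saycan_plan(instruction, skills, llm_logprob, affordance_value, state, max_steps=10):
    """Greedy SayCan-style planning sketch: score every low-level skill by
    p_LLM(skill | instruction, plan so far) * value(skill, state) and pick
    the best, stopping when "done" is selected."""
    plan = []
    for _ in range(max_steps):
        scores = {
            skill: math.exp(llm_logprob(instruction, plan, skill))
                   * affordance_value(skill, state)
            for skill in skills
        }
        best = max(scores, key=scores.get)
        plan.append(best)
        if best == "done":
            break
        # In the real system the robot executes `best` here and `state` updates.
    return plan
```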

Since the focus of this paper is acquisition of many generalizable skills, we focus our evaluation on one subset of tasks presented in Ahn et al. (2022). It is the long-horizon family of tasks, involving 15 instructions, each instruction requires an average of 9.6 steps to complete, and involves an average of 2.4 manipulation skills per instruction. A full list of the instructions can be found in Table 12.

由于本文的重点是获取许多可推广技能,我们将评估重点放在Ahn等人(2022)中介绍的任务子集上。这是一个长时任务系列,涉及15个指令,每个指令平均需要完成9.6个步骤,并涉及每个指令平均2.4个操纵技能。指令的完整列表可在表12中找到。

We compare against 3 baselines. 1) SayCan with BC-Z, which uses SayCan planning algorithm with BC-Z as manipulation policy, 2) SayCan with Gato, which uses SayCan planning algorithm with Gato as manipulation policy, 3) Originally reported SayCan results, which use SayCan planning algorithm with BC-Z, but since it uses a slightly different prompt, the planning success rate is lower. We reimplemented 3) in 1) for a fair comparison.

我们与3个基准进行比较。1) SayCan与BC-Z,使用SayCan规划算法和BC-Z作为操纵策略,2) SayCan与Gato,使用SayCan规划算法和Gato作为操纵策略,3) 最初报告的SayCan结果,使用SayCan规划算法和BC-Z,但由于它使用了稍有不同的提示,规划成功率较低。我们重新实现了3)以进行公平比较。

As shown in Table 11, except for the original SayCan, all methods achieve an 87% planning success rate, and RT-1 performs the best, with a 67% execution success rate in Kitchen1. Kitchen2 constitutes a much more challenging generalization scene, since the Robot Classroom training scenes are modeled after Kitchen1 (see the pictures of the kitchens in Fig. 2). Due to this generalization difficulty, SayCan with Gato is not able to finish any long-horizon task, and SayCan with BC-Z is able to achieve a success rate of 13%. The original SayCan paper did not evaluate performance in a new kitchen. Surprisingly, the manipulation performance does not see a visible drop from Kitchen1 to Kitchen2 for our method. In the supplementary video, we show that this enables us to operate unseen drawers in Kitchen2, and that we can use SayCan-RT1 to plan and execute ultra-long horizon tasks, with as many as 50 steps.

如表11所示,除了原始SayCan外,所有方法的规划成功率都为87%,而RT-1表现最佳,在Kitchen1中的执行成功率为67%。Kitchen2构成了一个更具挑战性的泛化场景,因为机器人教室的训练场景是模仿Kitchen1搭建的(请参见图2中的厨房图片)。由于这种泛化难度,SayCan与Gato无法完成任何长时程任务,而SayCan与BC-Z能够达到13%的成功率。最初的SayCan论文没有在新厨房中评估性能。令人惊讶的是,我们的方法从Kitchen1到Kitchen2的操纵性能没有明显下降。在补充视频中,我们展示了这使我们能够操作Kitchen2中未见过的抽屉,并且我们可以使用SayCan-RT1规划和执行多达50个步骤的超长时程任务。
Table 11: SayCan style long horizon tasks in Kitchen1 and Kitchen2. (*Original SayCan eval uses a slightly different prompt so the planning success rate is lower.)

表11:在Kitchen1和Kitchen2中的SayCan风格的长时任务。(*原始的SayCan评估使用稍有不同的提示,因此规划成功率较低。)

D.4 MODEL ABLATIONS
D.4 模型消融

What are the important and practical decisions in the design of the model and how do they affect performance and generalization?
模型设计中的重要且实际的决策是什么,它们如何影响性能和泛化?

To answer this question, we perform a set of ablations over different design decisions in RT-1. We aim to test a number of hypotheses that will help us disambiguate where the benefits of our method come from. Possible hypotheses about the source of improvement include: (i) the capacity and expressiveness of our model, which we verify by ablating the model size, trying other architectures (e.g., by removing the Transformer component); (ii) the particular action representation, which makes it easy to represent complex multi-modal action distributions, which we test by switching to continuous (normally distributed) actions, as well as by ablating the auto-regressive action representation; (iii) the ImageNet pre-trained initialization of the components, which we test by initializing the model’s weights randomly; and (iv) access to the short history, which we test by excluding observation history. More concretely, we ablate our model by (1) decreasing the model size (from 35M to 21M parameters), (2) removing the Transformer architecture (using a pre-trained EfficientNet instead), (3) using a continuous instead of discrete action space (using an MSE loss and multivariate normal output), (4) auto-regressively conditioning on actions, (5) removing ImageNet pre-training of the FiLM EfficientNet, and (6) removing history (reducing the sequence of six images as input to a single image). For each ablation we compare on the axes of performance on seen tasks, performance on unseen tasks, as well as inference speed and robustness to distractors and backgrounds (with a more detailed description of each category in Section 6.1 and Appendix D.1).

为了回答这个问题,我们在RT-1中对不同的设计决策进行了一系列模型消融实验。我们旨在测试一些假设,以帮助我们澄清我们的方法的好处来自何处。关于改进来源的可能假设包括:(i) 我们模型的容量和表达能力,我们通过缩小模型大小、尝试其他架构(例如,去除Transformer组件)来进行验证;(ii) 特定的动作表示,它使得表示复杂的多模态动作分布变得容易,我们通过切换到连续(正态分布的)动作以及去除自回归动作表示进行测试;(iii) ImageNet预训练初始化组件,我们通过随机初始化模型权重进行测试;和 (iv) 访问短时历史,我们通过排除观测历史进行测试。更具体地说,我们通过以下方式对我们的模型进行消融:(1) 减小模型大小(从35M减小到21M参数),(2) 移除Transformer架构(使用预训练的EfficientNet代替),(3) 使用连续而不是离散的动作空间(使用MSE损失和多元正态输出),(4) 自回归地对动作进行条件处理,(5) 移除FiLM EfficientNet的ImageNet预训练,以及 (6) 移除历史(将六个图像的序列作为输入缩减为单个图像)。对于每个消融实验,我们在已见任务上的表现、未见任务上的表现、推断速度以及对干扰物和背景的鲁棒性这几个方面进行比较(详细描述见第6.1节和附录D.1)。

Table 13 shows the results of each ablation and the delta performance compared to the full RT-1. RT-1 achieves impressive performance on tasks and new environments, and particularly outperforms baselines on the most challenging robustness problems. We also find that each design decision is important, though at varying levels. We first evaluate a model that replaces the per-dimension discretized action representation in our model with a more standard continuous Gaussian distribution. We observe a significant decline in performance from this modification. The per-dimension discretization allows our model to represent complex multi-modal distributions, while the Gaussian distribution captures only a single mode. These results suggest that this standard and popular choice is highly suboptimal with the more complex and diverse demonstration data used by our system. ImageNet pre-training is particularly important for model generalization and robustness: removing it decreases the unseen task performance rate by 33%, a consequence of the large and diverse visuals of the ImageNet dataset. Adding history has an impact primarily on generalization to distractors, while removing the Transformer component has a uniform but small negative impact across the seen tasks, unseen tasks and distractors. In order to keep the ImageNet pre-training while reducing the model size, we reduce the number of parameters only by 40% (from 31M to 25M). The resulting performance drops across training and generalization tasks, but not as much as in other ablations. Finally, autoregressively conditioning on actions, as used in (Reed et al., 2022; Chen et al., 2021; Lee et al., 2022a), did not benefit performance and slowed inference by more than 2x.

表13显示了每个消融实验的结果,以及与完整RT-1相比的性能差异。RT-1在任务和新环境上取得了令人印象深刻的表现,特别是在最具挑战性的鲁棒性问题上超过了基线。我们还发现每个设计决策都很重要,尽管影响程度有所不同。我们首先评估了一种将我们模型中的每维离散化动作表示替换为更标准的连续高斯分布的模型。我们观察到这种修改使性能显著下降。每维离散化允许我们的模型表示复杂的多模态分布,而高斯分布只能捕捉一个模态。这些结果表明,对于我们系统使用的更复杂、多样的演示数据,这种标准而流行的选择是高度次优的。ImageNet预训练对模型的泛化能力和鲁棒性尤为重要:去除它会使未见任务的性能下降33%,这一差距源自ImageNet数据集的大规模和多样性。添加历史主要影响对干扰物的泛化,而去除Transformer组件在已见任务、未见任务和干扰物方面都有一致但较小的负面影响。为了保持ImageNet预训练同时减小模型大小,我们仅减少了40%的参数(从31M减少到25M)。其结果是训练和泛化任务的性能均有下降,但不像其他消融实验中那么明显。最后,自回归地对动作进行条件处理,如(Reed et al., 2022; Chen et al., 2021; Lee et al., 2022a)中所使用的方法,对性能没有好处,并使推断速度减慢了2倍以上。
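To make the comparison concrete, here is a small sketch of per-dimension action discretization, assuming RT-1's 256 uniform bins per dimension. A categorical head over such bins can place probability mass on several distinct actions, which a single Gaussian cannot.

```python
import numpy as np

N_BINS = 256  # uniform bins per action dimension (assumption consistent with RT-1)

def discretize(action, low, high, n_bins=N_BINS):
    """Bin each continuous action dimension independently into integer tokens."""
    action = np.clip(action, low, high)
    idx = np.floor((action - low) / (high - low) * n_bins).astype(np.int64)
    return np.minimum(idx, n_bins - 1)

def undiscretize(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to the continuous bin-center values."""
    return low + (tokens.astype(np.float64) + 0.5) * (high - low) / n_bins
```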

As described in Sec. 5.1, in order to run large Transformer models on real robots, we require a model that supports fast inference for real-time operation. Note that in order to achieve our target control rate of 3Hz (described in Sec. 5.1), we also need to consider other sources of latency in the pipeline, such as the camera latency and communication overhead. However, these factors will be constant for all the models, and therefore we focus our evaluation on just the network inference time. The last column of Table 13 shows the inference speed of all the models. RT-1 is almost an order of magnitude faster than Gato with a similar number of parameters, but it is also considerably slower than a ResNet-based BC-Z. In terms of the different ablations of our model, we observe that the biggest slow-down is caused by including auto-regressive actions (∼2x slow-down), and since this does not significantly influence the performance, the final version of RT-1 does not generate actions auto-regressively.

正如第5.1节所述,为了在真实机器人上运行大型Transformer模型,我们需要一个支持快速推断以实现实时操作的模型。请注意,为了实现3Hz的目标控制速率(第5.1节中描述),我们还需要考虑流水线中的其他延迟来源,如摄像头延迟和通信开销。然而,这些因素对所有模型都是恒定的,因此我们将评估重点放在网络推断时间上。表13的最后一列显示了所有模型的推断速度。在参数数量相近的情况下,RT-1比Gato快近一个数量级,但也明显慢于基于ResNet的BC-Z。在我们模型的不同消融实验中,我们观察到最大的减速来自引入自回归动作(约2倍减速);由于这对性能没有显著影响,RT-1的最终版本不采用自回归方式生成动作。
Table 13: Various model ablations of RT-1 across seen tasks, generalization to unseen tasks, and robustness to distractors and backgrounds.
表13:RT-1的各种模型消融实验,涉及已见任务、对未见任务的泛化以及对干扰物和背景的鲁棒性。

D.5 SUMMARY AND ANALYSIS
D.5 总结与分析

In this section, we summarize some of our findings and propose intuition for RT-1's high performance, generalization, and robustness. First, ImageNet pretraining (along with Universal Sentence Encoder language embedding) has a large impact, particularly on unseen tasks. We observe that RT-1 inherits some of the knowledge that results from the generality and diversity of the datasets these models were trained on. Second, the action representation has a large impact across all aspects of performance. This has been previously observed and may be due to the ability to represent more complex action distributions – the per-dimension discretization allows our model to represent complex multi-modal distributions, while the Gaussian distribution captures only a single mode. Third, given such expressive multitask models, data diversity has a larger impact than data size. Indeed, even datasets collected in simulated environments or from different robotic embodiments can be leveraged by RT-1, opening avenues for new regimes of data collection.

在本节中,我们总结了一些发现,并提出了对RT-1高性能、泛化和鲁棒性的直观理解。首先,ImageNet预训练(以及通用句子编码器语言嵌入)对未见任务产生了特别大的影响。我们观察到RT-1继承了这些模型所训练数据集的普遍性和多样性所产生的一些知识。其次,动作表示在性能的各个方面都有很大影响。这已经在先前的研究中观察到,并且可能是由于能够表示更复杂的动作分布:每个维度的离散化允许我们的模型表示复杂的多模态分布,而高斯分布只能捕捉到单一模态。第三,鉴于如此表达力强大的多任务模型,数据的多样性比数据大小更具影响力。实际上,甚至在模拟环境或不同机器人形态中收集的数据集也可以被RT-1利用,为新的数据收集方案打开了途径。

Finally, RT-1 fuses language into the image pipeline early via FiLM conditioning, compared to e.g., Gato’s late fusion. This enables image tokens that focus only on relevant features for the instruction at hand, which may be the cause of poor distractor performance for Gato. Figure 13 visualizes the attention during rollouts of RT-1. We see that the attention is focused on relevant features and particularly on interaction between the gripper and the object of interest. The bottleneck of attention layers such as these results in a compact representation which effectively ignores distractors and varying backgrounds.

最后,RT-1通过FiLM调制将语言早早地融入图像管线,与Gato等模型的后期融合相比。这使得图像标记仅关注手头指令相关的特征,这可能是Gato在处理干扰物时性能较差的原因。图13可视化了RT-1在推演过程中的注意力。我们看到注意力集中在相关特征上,特别是在夹具和感兴趣的对象之间的互动上。这些注意力层的瓶颈导致了一个紧凑的表示,有效地忽略了干扰物和不同的背景。
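The early-fusion mechanism can be sketched as a standard FiLM layer, shown below in PyTorch; this is a generic illustration of FiLM conditioning, not the exact RT-1 module.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """FiLM conditioning: a language embedding produces per-channel scale
    (gamma) and shift (beta) terms that modulate intermediate image
    features, fusing the instruction into the vision backbone early."""
    def __init__(self, lang_dim, num_channels):
        super().__init__()
        self.gamma = nn.Linear(lang_dim, num_channels)
        self.beta = nn.Linear(lang_dim, num_channels)

    def forward(self, feats, lang_emb):
        # feats: (B, C, H, W) image features; lang_emb: (B, lang_dim)
        g = self.gamma(lang_emb)[:, :, None, None]
        b = self.beta(lang_emb)[:, :, None, None]
        return (1.0 + g) * feats + b  # near-identity modulation when g, b are small
```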

Figure 13: In this figure we show the attention map of the RT-1 policy. Different layers and heads generally focus on different parts of the image. Most commonly, they focus on the parts of the scene with the richest interaction affordances, such as graspable objects. For example, Layer 2 Head 6 focuses on the jalapeno chips and pepsi can in grasping tasks; and Layer 4 Head 2 focuses on the drawer in drawer opening tasks.

图13:在这张图中,我们展示了RT-1策略的注意力图。不同的层和头部通常会集中在图像的不同部分。最常见的情况是,它们会集中在场景中具有最丰富交互属性的部分,比如可抓取的物体。例如,第2层第6头部集中在夹持任务中的辣味玉米片和百事可乐罐上;第4层第2头部集中在抽屉打开任务中的抽屉上。
