[Paper Translation] Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Paper link: Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Note: I am just getting started with LLM + Robotics and have not yet studied much of the relevant background in depth. I will keep refining this translation and adding my own commentary over time; discussion and feedback are very welcome!

Abstract

        Large, high-capacity models trained on diverse datasets have shown remarkable successes in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website robotics-transformer-x.github.io.

Ⅰ. INTRODUCTION

        A central lesson from recent advances in machine learning and artificial intelligence is that large-scale learning from broad and diverse datasets can enable capable AI systems by providing for general-purpose pretrained models. In fact, large-scale general-purpose models typically trained on large and diverse datasets can often outperform their narrowly targeted counterparts trained on smaller but more task-specific data. For instance, open-vocabulary image classifiers (e.g., CLIP [1]) trained on large datasets scraped from the web tend to outperform fixed-vocabulary models trained on more limited datasets, and large language models [2, 3] trained on massive text corpora tend to outperform systems that are only trained on narrow task-specific datasets. Increasingly, the most effective way to tackle a given narrow task (e.g., in vision or NLP) is to adapt a general-purpose model. However, these lessons are difficult to apply in robotics: any single robotic domain might be too narrow, and while computer vision and NLP can leverage large datasets sourced from the web, comparably large and broad datasets for robotic interaction are hard to come by. Even the largest data collection efforts still end up with datasets that are a fraction of the size and diversity of benchmark datasets in vision (5-18M) [4, 5] and NLP (1.5B-4.5B) [6, 7]. Perhaps more importantly, such datasets are often still narrow along some axes of variation, either focusing on a single environment, a single set of objects, or a narrow range of tasks. How can we overcome these challenges in robotics and move the field of robotic learning toward the kind of large data regime that has been so successful in other domains?
        Inspired by the generalization made possible by pretraining large vision or language models on diverse data, we take the perspective that the goal of training generalizable robot policies requires X-embodiment training, i.e., with data from multiple robotic platforms. While each individual robotic learning dataset might be too narrow, the union of all such datasets provides a better coverage of variations in environments and robots. Learning generalizable robot policies requires developing methods that can utilize X-embodiment data, tapping into datasets from many labs, robots, and settings. Even if such datasets in their current size and coverage are insufficient to attain the impressive generalization results that have been demonstrated by large language models, in the future, the union of such data can potentially provide this kind of coverage. Because of this, we believe that enabling research into X-embodiment robotic learning is critical at the present juncture.
        Following this rationale, our work has two primary goals: (1) Demonstrate that policies trained on data from many different robots and environments enjoy the benefits of positive transfer, attaining better performance than policies trained only on data from each evaluation setup. (2) Provide datasets, data formats and models for the robotics community to enable future research on X-embodiment models.
        We focus our work on robotic manipulation. Addressing goal (1), our empirical contribution is to demonstrate that several recent robotic learning methods, with minimal modification, can utilize X-embodiment data and enable positive transfer. Specifically, we train the RT-1 [8] and RT-2 [9] models on 9 different robotic manipulators. We show that the resulting models, which we call RT-X, can improve over policies trained only on data from the evaluation domain, exhibiting better generalization and new capabilities. Addressing (2), we provide the Open X-Embodiment (OXE) Repository, which includes a dataset with 22 different robotic embodiments from 21 different institutions that can enable the robotics community to pursue further research on X-embodiment models, along with open-source tools to facilitate such research. Our aim is not to innovate in terms of the particular architectures and algorithms, but rather to provide the model that we trained together with data and tools to energize research around X-embodiment robotic learning.

Ⅱ. RELATED WORK

Transfer across embodiments. A number of prior works have studied methods for transfer across robot embodiments in simulation [10–22] and on real robots [23–29]. These methods often introduce mechanisms specifically designed to address the embodiment gap between different robots, such as shared action representations [14, 30], incorporating representation learning objectives [17, 26], adapting the learned policy on embodiment information [11, 15, 18, 30, 31], and decoupling robot and environment representations [24]. Prior work has provided initial demonstrations of X-embodiment training [27] and transfer [25, 29, 32] with transformer models. We investigate complementary architectures and provide complementary analyses, and, in particular, study the interaction between X-embodiment transfer and web-scale pretraining. Similarly, methods for transfer across human and robot embodiments also often employ techniques for reducing the embodiment gap, i.e. by translating between domains or learning transferable representations [33–43]. Alternatively, some works focus on sub-aspects of the problem such as learning transferable reward functions [17, 44–48], goals [49, 50], dynamics models [51], or visual representations [52–59] from human video data. Unlike most of these prior works, we directly train a policy on X-embodiment data, without any mechanisms to reduce the embodiment gap, and observe positive transfer by leveraging that data.
Large-scale robot learning datasets. The robot learning community has created open-source robot learning datasets, spanning grasping [60–71], pushing interactions [23, 72–74], sets of objects and models [75–85], and teleoperated demonstrations [8, 86–95]. With the exception of RoboNet [23], these datasets contain data of robots of the same type, whereas we focus on data spanning multiple embodiments. The goal of our data repository is complementary to these efforts: we process and aggregate a large number of prior datasets into a single, standardized repository, called Open X-Embodiment, which shows how robot learning datasets can be shared in a meaningful and useful way.
Language-conditioned robot learning. Prior work has aimed to endow robots and other agents with the ability to understand and follow language instructions [96–101], often by learning language-conditioned policies [8, 40, 45, 102–106]. We train language-conditioned policies via imitation learning like many of these prior works but do so using large-scale multi-embodiment demonstration data. Following previous works that leverage pre-trained language embeddings [8, 40, 45, 103, 107–112] and pre-trained vision-language models [9, 113–115] in robotic imitation learning, we study both forms of pre-training in our experiments, specifically following the recipes of RT-1 [8] and RT-2 [9].

Ⅲ. THE OPEN X-EMBODIMENT REPOSITORY

        We introduce the Open X-Embodiment Repository (robotics-transformer-x.github.io) – an open-source repository which includes large-scale data along with pre-trained model checkpoints for X-embodied robot learning research. More specifically, we provide and maintain the following open-source resources for the broader community:
        Open X-Embodiment Dataset: robot learning dataset with 1M+ robot trajectories from 22 robot embodiments.
        Pre-Trained Checkpoints: a selection of RT-X model checkpoints ready for inference and finetuning.
        We intend for these resources to form a foundation for X-embodiment research in robot learning, but they are just the start. Open X-Embodiment is a community-driven effort, currently involving 21 institutions from around the world, and we hope to further broaden participation and grow the initial Open X-Embodiment Dataset over time. In this section, we summarize the dataset and X-embodiment learning framework, before discussing the specific models we use to evaluate our dataset and our experimental results.
A. The Open X-Embodiment Dataset
        The Open X-Embodiment Dataset contains 1M+ real robot trajectories spanning 22 robot embodiments, from single robot arms to bi-manual robots and quadrupeds. The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world and converting them into a consistent data format for easy download and usage. We use the RLDS data format [119], which saves data in serialized tfrecord files and accommodates the various action spaces and input modalities of different robot setups, such as differing numbers of RGB cameras, depth cameras and point clouds. It also supports efficient, parallelized data loading in all major deep learning frameworks. For more details about the data storage format and a breakdown of all 60 datasets, see robotics-transformer-x.github.io.
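        To make the data format concrete, the sketch below shows one typical way to stream RLDS episodes with tensorflow_datasets. The bucket path follows the pattern published on the project website, but the dataset name, version, and feature keys here are illustrative assumptions that vary across the 60 underlying datasets.

```python
import tensorflow_datasets as tfds

# Illustrative sketch: stream episodes from one component dataset stored in
# the RLDS format. The path and feature keys are assumptions for this example.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0/")  # hypothetical path
episodes = builder.as_dataset(split="train[:10]")  # first 10 episodes

for episode in episodes:
    # An RLDS episode is a nested dict; "steps" is itself a tf.data.Dataset
    # of timesteps containing observations, actions, and episode markers.
    for step in episode["steps"]:
        obs = step["observation"]  # e.g. RGB image(s); keys vary per dataset
        act = step["action"]       # embodiment-specific action
```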
B. Dataset Analysis
        Fig. 2 analyzes the Open X-Embodiment Dataset. Fig. 2(a) shows the breakdown of datasets by robot embodiments, with the Franka robot being the most common. This is reflected in the number of distinct scenes (based on dataset metadata) per embodiment (Fig. 2(b)), where Franka dominates. Fig. 2(c) shows the breakdown of trajectories per embodiment. To further analyze the diversity, we use the language annotations present in our data. We use the PaLM language model [3] to extract objects and behaviors from the instructions. Fig. 2(d,e) show the diversity of skills and objects. While most skills belong to the pick-place family, the long tail of the dataset contains skills like "wiping" or "assembling". Additionally, the data covers a range of household objects, from appliances to food items and utensils.

Ⅳ. RT-X DESIGN

        To evaluate how much X-embodiment training can improve the performance of learned policies on individual robots, we require models that have sufficient capacity to productively make use of such large and heterogeneous datasets. To that end, our experiments will build on two recently proposed Transformer-based robotic policies: RT-1 [8] and RT-2 [9]. We briefly summarize the design of these models in this section, and discuss how we adapted them to the X-embodiment setting in our experiments.
A. Data format consolidation
        One challenge of creating X-embodiment models is that observation and action spaces vary significantly across robots. We use a coarsely aligned action and observation space across datasets. The model receives a history of recent images and language instructions as observations and predicts a 7-dimensional action vector controlling the end-effector (x, y, z, roll, pitch, yaw, and gripper opening or the rates of these quantities). We select one canonical camera view from each dataset as the input image, resize it to a common resolution and convert the original action set into a 7 DoF end-effector action. We normalize each dataset's actions prior to discretization. This way, an output of the model can be interpreted (de-normalized) differently depending on the embodiment used. It should be noted that despite this coarse alignment, the camera observations still vary substantially across datasets, e.g. due to differing camera poses relative to the robot or differing camera properties, see Figure 3. Similarly, for the action space, we do not align the coordinate frames across datasets in which the end-effector is controlled, and allow action values to represent either absolute or relative positions or velocities, as per the original control scheme chosen for each robot. Thus, the same action vector may induce very different motions for different robots.
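        As an illustration of this normalize-then-discretize pipeline (the 256 uniform bins are described in the next subsection), here is a minimal numpy sketch; the min/max normalization scheme and the per-dimension bounds are assumptions for illustration, since the actual statistics are computed per dataset.

```python
import numpy as np

NUM_BINS = 256  # matches the 256 uniform action bins described below

def normalize(action, low, high):
    """Map a 7-DoF end-effector action into [-1, 1] using dataset-specific bounds."""
    return 2.0 * (action - low) / (high - low) - 1.0

def discretize(norm_action):
    """Uniformly bucket each normalized dimension into one of 256 bins."""
    bins = np.floor((norm_action + 1.0) / 2.0 * (NUM_BINS - 1))
    return np.clip(bins, 0, NUM_BINS - 1).astype(np.int32)

def denormalize(bins, low, high):
    """Invert discretization for a particular embodiment at inference time."""
    norm = bins.astype(np.float32) / (NUM_BINS - 1) * 2.0 - 1.0
    return low + (norm + 1.0) / 2.0 * (high - low)

# Hypothetical bounds: +/-5 cm translation, +/-pi rotation, [0, 1] gripper.
low = np.array([-0.05] * 3 + [-np.pi] * 3 + [0.0])
high = np.array([0.05] * 3 + [np.pi] * 3 + [1.0])
action = np.array([0.01, -0.02, 0.0, 0.1, 0.0, -0.1, 1.0])
tokens = discretize(normalize(action, low, high))
recovered = denormalize(tokens, low, high)  # approximately equals `action`
```

        Because de-normalization uses each dataset's own bounds, the same model output is interpreted differently per embodiment, as described above.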
B. Policy architectures
        We consider two model architectures in our experiments: (1) RT-1 [8], an efficient Transformer-based architecture designed for robotic control, and (2) RT-2 [9], a large vision-language model co-fine-tuned to output robot actions as natural language tokens. Both models take in a visual input and a natural language instruction describing the task, and output a tokenized action. For each model, the action is tokenized into 256 bins uniformly distributed along each of eight dimensions; one dimension for terminating the episode and seven dimensions for end-effector movement. Although both architectures are described in detail in their original papers [8, 9], we provide a short summary of each below:
        RT-1 [8] is a 35M parameter network built on a Transformer architecture [118] and designed for robotic control, as shown in Fig. 3. It takes in a history of 15 images along with the natural language instruction. Each image is processed through an ImageNet-pretrained EfficientNet [117] and the natural language instruction is transformed into a USE [120] embedding. The visual and language representations are then interwoven via FiLM [116] layers, producing 81 vision-language tokens. These tokens are fed into a decoder-only Transformer, which outputs the tokenized actions.
        RT-2 [9] is a family of large vision-language-action models (VLAs) trained on Internet-scale vision and language data along with robotic control data. RT-2 casts the tokenized actions to text tokens, e.g., a possible action may be "1 128 91 241 5 101 127". As such, any pretrained vision-language model (VLM [121–123]) can be fine-tuned for robotic control, thus leveraging the backbone of VLMs and transferring some of their generalization properties. In this work, we focus on the RT-2-PaLI-X variant [121] built on a backbone of a visual model, ViT [124], and a language model, UL2 [125], and pretrained primarily on the WebLI [121] dataset.
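        A tiny sketch of this action-as-text interface is given below, round-tripping the example string quoted above; the real RT-2 pipeline maps bins into the VLM's token vocabulary, so this is a deliberate simplification.

```python
def action_to_text(bins):
    """Render discretized action bins as a space-separated token string."""
    return " ".join(str(b) for b in bins)

def text_to_action(text):
    """Parse the model's text output back into integer action bins."""
    return [int(tok) for tok in text.split()]

# Round-trip the example action string quoted above.
s = "1 128 91 241 5 101 127"
assert action_to_text(text_to_action(s)) == s
```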
C. Training and inference details
        Both models use a standard categorical cross-entropy objective over their output space (discrete buckets for RT-1 and all possible language tokens for RT-2).
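        As a sketch, a per-dimension categorical cross-entropy of this kind might look as follows in TensorFlow; the shapes and the helper name are illustrative assumptions, not the actual training code.

```python
import tensorflow as tf

def action_token_loss(logits, targets):
    """Mean categorical cross-entropy over discretized action tokens.

    logits:  [batch, num_dims, vocab] unnormalized scores (vocab = 256 action
             bins for RT-1; the language-token vocabulary for RT-2).
    targets: [batch, num_dims] integer token ids.
    """
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)
    return tf.reduce_mean(ce)

# Toy usage: 8 action dimensions (terminate + 7 end-effector), 256 bins each.
logits = tf.random.normal([2, 8, 256])
targets = tf.random.uniform([2, 8], maxval=256, dtype=tf.int32)
loss = action_token_loss(logits, targets)
```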
        We define the robotics data mixture used across all of the experiments as the data from 9 manipulators, taken from the RT-1 [8], QT-Opt [66], Bridge [95], Task Agnostic Robot Play [126, 127], Jaco Play [128], Cable Routing [129], RoboTurk [86], NYU VINN [130], Austin VIOLA [131], Berkeley Autolab UR5 [132], TOTO [133] and Language Table [91] datasets. RT-1-X is trained only on the robotics mixture data defined above, whereas RT-2-X is trained via co-fine-tuning (similarly to the original RT-2 [9]), with an approximately one-to-one split of the original VLM data and the robotics data mixture. Note that the robotics data mixture used in our experiments includes 9 embodiments, which is fewer than the entire Open X-Embodiment dataset (22) – the practical reason for this difference is that we have continued to extend the dataset over time, and at the time of the experiments, the dataset above represented all of the data. In the future, we plan to continue training policies on the extended versions of the dataset as well as continue to grow the dataset together with the robot learning community.
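        A rough sketch of such a co-fine-tuning mixture with tf.data is shown below; the stand-in datasets and the exact weights are illustrative assumptions based on the approximately one-to-one split described above, not the actual training pipeline.

```python
import tensorflow as tf

# Stand-ins for the real pipelines: in practice these would stream RLDS
# robot episodes and web-scale vision-language examples, respectively.
robot_ds = tf.data.Dataset.from_tensor_slices(["robot_example"]).repeat()
vlm_ds = tf.data.Dataset.from_tensor_slices(["vlm_example"]).repeat()

# Approximately one-to-one co-fine-tuning split, per the description above.
mixture = tf.data.Dataset.sample_from_datasets(
    [robot_ds, vlm_ds], weights=[0.5, 0.5])
```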
        At inference time, each model is run at the rate required for the robot (3-10 Hz), with RT-1 run locally and RT-2 hosted on a cloud service and queried over the network.

Ⅴ. EXPERIMENTAL RESULTS

        Our experiments answer three questions about the effect of X-embodiment training: (1) Can policies trained on our X-embodiment dataset effectively enable positive transfer, such that co-training on data collected on multiple robots improves performance on the training task? (2) Does co-training models on data from multiple platforms and tasks improve generalization to new, unseen tasks? (3) What is the influence of different design dimensions, such as model size, model architecture or dataset composition, on performance and generalization capabilities of the resulting policy? To answer these questions, we conduct a total of 3600 evaluation trials across 6 different robots.
A. In-distribution performance across different embodiments
        To assess the ability of our RT-X model variants to learn from X-embodiment data, we evaluate their performance on in-distribution tasks. We split our evaluation into two types of use cases: evaluation on domains that only have small-scale datasets (Fig. 4), where we would expect transfer from larger datasets to significantly improve performance, and evaluation on domains that have large-scale datasets (Table I), where we expect further improvement to be more challenging. Note that we use the same robotics data training mixture (defined in Sec. IV-C) for all the evaluations presented in this section. For small-scale dataset experiments, we consider Kitchen Manipulation [128], Cable Routing [129], NYU Door Opening [130], AUTOLab UR5 [132], and Robot Play [134]. We use the same evaluation and robot embodiment as in the respective publications. For large-scale dataset experiments, we consider Bridge [95] and RT-1 [8] for in-distribution evaluation and use their respective robots: WidowX and Google Robot.
        For each small dataset domain, we compare the performance of the RT-1-X model, and for each large dataset we consider both the RT-1-X and RT-2-X models. For all experiments, the models are co-trained on the full X-embodiment dataset. Throughout this evaluation we compare with two baseline models: (1) The model developed by the creators of the dataset, trained only on that respective dataset. This constitutes a reasonable baseline insofar as it can be expected that the model has been optimized to work well with the associated data; we refer to this baseline model as the Original Method model. (2) An RT-1 model trained on the dataset in isolation; this baseline allows us to assess whether the RT-X model architectures have enough capacity to represent policies for multiple different robot platforms simultaneously, and whether co-training on multi-embodiment data leads to higher performance.
        Small-scale dataset domains (Fig. 4). RT-1-X outperforms the Original Method trained on each of the robot-specific datasets on 4 of the 5 datasets, with a large average improvement, demonstrating that domains with limited data benefit substantially from co-training on X-embodiment data.
        Large-scale dataset domains (Table I). In the large-dataset setting, the RT-1-X model does not outperform the RT-1 baseline trained on only the embodiment-specific dataset, which indicates underfitting for that model class. However, the larger RT-2-X model outperforms both the Original Method and RT-1, suggesting that X-robot training can improve performance in the data-rich domains, but only when utilizing a sufficiently high-capacity architecture.

B. Improved generalization to out-of-distribution settings
        We now examine how X-embodiment training can enable better generalization to out-of-distribution settings and more complex and novel instructions. These experiments focus on the high-data domains, and use the RT-2-X model.
        Unseen objects, backgrounds and environments. We first conduct the same evaluation of generalization properties as proposed in [9], testing for the ability to manipulate unseen objects in unseen environments and against unseen backgrounds. We find that RT-2 and RT-2-X perform roughly on par (Table II, rows (1) and (2), last column). This is not unexpected, since RT-2 already generalizes well (see [9]) along these dimensions due to its VLM backbone.
        Emergent skills evaluation. To investigate the transfer of knowledge across robots, we conduct experiments with the Google Robot, assessing the performance on tasks like the ones shown in Fig. 5. These tasks involve objects and skills that are not present in the RT-2 dataset but occur in the Bridge dataset [95] for a different robot (the WidowX robot). Results are shown in Table II, Emergent Skills Evaluation column. Comparing rows (1) and (2), we find that RT-2-X outperforms RT-2 by ∼ 3×, suggesting that incorporating data from other robots into the training improves the range of tasks that can be performed even by a robot that already has large amounts of data available. Our results suggest that co-training with data from other platforms imbues the RT-2-X controller with additional skills for the platform that are not present in that platform's original dataset.
        Our next ablation involves removing the Bridge dataset from RT-2-X training: Row (3) shows the results for RT-2-X that includes all data used for RT-2-X except the Bridge dataset. This variation significantly reduces performance on the hold-out tasks, suggesting that transfer from the WidowX data may indeed be responsible for the additional skills that can be performed by RT-2-X with the Google Robot.

C. Design decisions
        Lastly, we perform ablations to measure the influence of different design decisions on the generalization capabilities of our most performant RT-2-X model; the results are presented in Table II. We note that including a short history of images significantly improves generalization performance (row (4) vs row (5)). Similarly to the conclusions in the RT-2 paper [9], web-based pre-training of the model is critical to achieving high performance for the large models (row (4) vs row (6)). We also note that the 55B model has a significantly higher success rate in the Emergent Skills evaluation compared to the 5B model (row (2) vs row (4)), demonstrating that higher model capacity enables a higher degree of transfer across robotic datasets. Contrary to previous RT-2 findings, co-fine-tuning and fine-tuning have similar performance in both the Emergent Skills and Generalization evaluations (row (4) vs row (7)), which we attribute to the fact that the robotics data used in RT-2-X is much more diverse than the previously used robotics datasets.

Ⅵ. DISCUSSION, FUTURE WORK, AND OPEN PROBLEMS

        We presented a consolidated dataset that combines data from 22 robotic embodiments collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We also presented an experimental demonstration that Transformer-based policies trained on this data can exhibit significant positive transfer between the different robots in the dataset. Our results showed that the RT-1-X policy has a 50% higher success rate than the original, state-of-the-art methods contributed by different collaborating institutions, while the bigger vision-language-model-based version (RT-2-X) demonstrated ∼ 3× generalization improvements over a model trained only on data from the evaluation embodiment. In addition, we provided multiple resources for the robotics community to explore X-embodiment robot learning research, including: the unified X-robot and X-institution dataset, sample code showing how to use the data, and the RT-1-X model to serve as a foundation for future exploration.
        While RT-X demonstrates a step towards an X-embodied robot generalist, there are many more steps needed to make this future a reality. Our experiments have a number of limitations: they do not consider robots with very different sensing and actuation modalities, they do not study generalization to new robots, and they do not provide a decision criterion for when positive transfer does or does not happen. Studying these questions is an important direction for future work. We hope that this work will serve not only as an example that X-robot learning is feasible and practical, but also provide the tools to advance research in this direction in the future.
