AutoRT Paper Translation

AUTORT: EMBODIED FOUNDATION MODELS FOR LARGE SCALE ORCHESTRATION OF ROBOTIC AGENTS

ABSTRACT

Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such “in-the-wild” data collected by AutoRT is significantly more diverse, and that AutoRT’s use of LLMs allows for instruction following data collection robots that can align to human preferences.

1 INTRODUCTION

One of the central goals of autonomous robotics research is to enable independent and broadly capable robotic agents: systems that can be tasked with some high-level goals (“keep the kitchen clean”), formulate plans for addressing these goals, and then carry out those plans with the skills and resources available to them. While current robotic learning methods offer appealing solutions for acquiring individual robotic skills, and large language models (LLMs), vision-language models (VLMs) and large multimodal models offer the ability to reason over such abstract tasks (Ahn et al., 2022; Rana et al., 2023), truly open-ended tasks still present major challenges. Performing innumerable tasks in diverse settings requires a grounded and generalist agent that can robustly adapt to scenarios beyond those in which robots are trained. The bottleneck for achieving these goals, however, is the need for large amounts of robotic experience in the real world – much larger than robot datasets collected in lab settings with well-defined environments.

In this paper, we study how we can design agents to gather robotic experience for themselves at scale. Central to our work is leveraging knowledge contained in foundation models to drive realworld robots. We specifically focus on diverse robotic data acquisition: when a robot is placed in a new environment, potentially with a user command to collect data around some theme (e.g. office tasks), the robot should determine which tasks can be performed, which of its own skills to trigger to attempt them, and when it should rely on human teleoperators. We view this from the perspective of controlling a fleet of robots, spread across multiple locations, where there are many more robots than human supervisors, necessitating mixing expert demonstrations with suboptimal autonomous policies in a safe and appropriate way. Our system for large-scale orchestration of robotic agents, which we call AutoRT, tackles this problem.

At the core of AutoRT is a large foundation model that acts as a robot orchestrator, prescribing appropriate tasks to one or more robots in an environment based on the user’s prompt and environmental affordances (“task proposals”) discovered from visual observations. The scene description step perceives objects in the environment, the task proposal step suggests possible things the robot could do with them, and then the affordance filtering step decides which tasks to attempt, and how, based on these observations and the prompt. This process takes into account constraints specified via “constitutional prompting”, where rules about robot behaviour can be defined by the user. It additionally accounts for the availability of human teleoperators, and handles working around the capabilities of the robot (e.g., the robot can pick up a cup but not a table, it can approach the sink but can’t sit in a chair, etc.).

We describe the AutoRT system, instantiate it with a fleet of real-world mobile manipulators, and present the results of an extensive real-world evaluation over 7 months, 4 different office buildings, and a fleet of over 20 robots, which resulted in the collection of 77,000 real-world robotic trials with both teleoperation and autonomous execution. AutoRT is, to the best of our knowledge, the first system where LLM-controlled robots are allowed to drive autonomously in real-world settings, propose their own goals, and take actions toward those goals. We show that AutoRT scales robot deployment by allowing 1 human to supervise 3-5 mobile manipulators. Our evaluation studies how AutoRT can collect highly diverse data, how it can be instructed to collect task-appropriate data, and shows that such data can be used to improve state-of-the-art robot learning models. AutoRT also introduces aligning robot behavior to human preferences using prompting and critiquing with a robot constitution.

2 RELATED WORK

Real robot data collection. Large scale real robot data collection for robotic manipulation falls mainly into two categories: autonomous data collection and human assisted demonstrations. Autonomous data collection in prior works is often conducted in constrained robot lab environments, on tasks like grasping (Pinto & Gupta, 2015; Levine et al., 2016; Kalashnikov et al., 2018; Platt, 2022), pushing (Yu et al., 2016; Ebert et al., 2018; Dasari et al., 2020), or pick and place (Kalashnikov et al., 2021; Bousmalis et al., 2023). Our work focuses on tackling more varied environments, similar to Gupta et al. (2018), and tackling a wider set of tasks. Human demonstrated data collection can be done in varied environments (Sharma et al., 2018; Mandlekar et al., 2019; Jang et al., 2021; Brohan et al., 2022), and teleoperated data can be far more diverse and valuable for skill learning than autonomously collected data, but is bottlenecked by the availability of humans when scaling to many robots. This motivates hybrid approaches that mix teleoperation and autonomous policies, such as DAgger-style methods (Ross et al., 2011; Kelly et al., 2019; Hoque et al., 2022). AutoRT is such a hybrid approach, collecting both teleoperated and autonomous episodes based on the supply of human supervision, with a focus on collecting data on novel tasks in novel environments.

Large language models. Many recent works have studied using LLMs to generate agent-like behavior (Shinn et al., 2023; Yao et al., 2022; Park et al., 2023), improve embodied reasoning (Driess et al., 2023), and write robotics code (Vemprala et al., 2023; Liang et al., 2022). Works like Ahn et al. (2022) and Rana et al. (2023) use LLMs to generate language plans for robots to solve an instruction given by a user. Our work self-generates instructions for the robot to perform, which was proposed in Xian et al. (2023). Most similar is Voyager (Wang et al., 2023), an LLM-driven agent that autonomously explores a Minecraft environment. AutoRT runs on a real-world robot for extended periods of time, introducing challenges like reliability and safety that are less present in simulated environments.

3 PROBLEM STATEMENT

In this work, our goal is to build a system that enables large-scale, “in-the-wild” data collection to generate diverse, real-world robot data on new skills in new environments.

To do so, we assume access to a large fleet of N robots, capable of navigating across multiple buildings and manipulating objects. The buildings are populated: both robots and people are free to move around the space. We do not make any assumptions about the layout of the buildings, or the objects available for manipulation. We assume a limited bandwidth of human supervision, meaning there are more robots than human supervisors – that is, we cannot expect that a human will always be in charge of teleoperating a single robot.

Our goal is to have a single system that can handle any state s ∈ S observed by a robot, and generate tasks t executable by one of k different collect policies π ∈ {π_1, …, π_k} = Π. For instance, π_i can be an autonomous policy π_i^auto, either hand-designed or learned a priori, or a policy executed by querying a human teleoperator, i.e., π_i^teleop. The goal of such a system, mapping S → Π, is to guide the data collection of the fleet of N robots by observing the state s and using this information to identify a set of feasible language-specified tasks t that correspond to specific policies π. In addition, the system needs to take into account other factors that impact the throughput of data collection and safety. These include tradeoffs between autonomous and teleoperated policy primitives, and the generation of diverse and novel task proposals while at the same time considering guardrails and safety criteria.

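To make the S → Π mapping concrete, the sketch below writes out the orchestration interface in Python. It is a minimal illustration under our own naming assumptions (CollectPolicy, orchestrate, and the helper callables are all hypothetical), not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class CollectPolicy:
    name: str            # e.g. "teleop" (a pi_teleop) or "scripted_pick" (a pi_auto)
    description: str     # text summary of what this policy can do
    execute: Callable    # runs a language-specified task t on the robot

def orchestrate(state,
                policies: List[CollectPolicy],
                propose_tasks: Callable,
                filter_tasks: Callable) -> Optional[Tuple[str, CollectPolicy]]:
    """Map an observed state s to a feasible task t and the policy pi to run it."""
    tasks = propose_tasks(state)                     # LLM-driven task proposal
    feasible = filter_tasks(state, tasks, policies)  # affordance and safety filtering
    return feasible[0] if feasible else None         # a (t, pi) pair, or nothing to do
```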

4 AUTORT: EXPLORING AND EXECUTING IN THE WILD

In this section, we describe each component of AutoRT, which is visualized in Fig. 1. At a high level, AutoRT gathers data via an open-vocabulary object detector to first understand and describe the scene, then an LLM parses this description and generates sensible and safe language goals given high-level objectives, and finally an LLM is used to determine how to execute these goals.

Figure 1: System diagram for AutoRT. Each robot explores the environment, sampling a random navigation target close to objects. The scene and objects in it are described by a VLM to give text to an LLM, which generates manipulation tasks for the robot. Valid tasks are run by the robot, the episodes are scored, and the process repeats. No part of this requires advance knowledge of the layout of the environment or objects it contains, making it easy to run on a fleet of 20+ robots that are each in novel settings. Green sections are contributions of this work.

The robot platform used in AutoRT is a mobile manipulator with a camera, robot arm, and mobile base. Herein, we only consider manipulation data collection, so navigation is only used to gather diverse manipulation settings – however, we note that the system is general to other robotic embodiments and modes of collection. Further details on the robot platform and the implementation are in Appendix A.

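Before the per-stage details in Sections 4.1-4.5, it may help to see one episode end to end. The sketch below is a hedged reconstruction of the loop in Fig. 1; every helper it calls (navigate_to_scene, vlm_describe, llm_propose, llm_filter, sample_policy, diversity_score) is a hypothetical stand-in for the corresponding component, not an API from the paper.

```python
import random

def run_episode(robot, policies, guidance="N/A"):
    image = navigate_to_scene(robot)                       # Sec. 4.1: explore to a scene
    scene, objects = vlm_describe(image)                   # Sec. 4.3: scene description
    policy = sample_policy(policies)                       # Sec. 4.5: pick a collect policy
    tasks = llm_propose(scene, objects, policy, guidance)  # Sec. 4.3: task proposal
    accepted = llm_filter(tasks, policies)                 # Sec. 4.4: affordance filtering
    doable = [task for task, p in accepted if p is policy]
    if not doable:
        return None                                        # nothing safe and feasible here
    episode = policy.execute(robot, random.choice(doable))
    episode.diversity = diversity_score(episode)           # Sec. 5.1: score, then repeat
    return episode
```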

4.1 EXPLORATION: NAVIGATING TO THE TARGET

The first stage of AutoRT is to explore the space and find interesting scenes for manipulation. To map the environment, we use the natural language map approach proposed by Chen et al. (2023), which is built using a VLM to encode object detections into visual-language embeddings φi , with corresponding position (xi , yi ,zi) determined by the robot’s depth sensor and SLAM. Thus, given a textual target q like “sponge”, we can direct the robot towards a sponge by querying for a φi that is close to the text embedding for q. To determine navigation goals we sample this map for regions of interest via sampling states proportional to their latent distance to an average embedding of previously seen objects (see Appendix B for more details). For each environment, this map is generated once, then copied to all robots collecting in the space and loaded from cache to save time in future episodes.

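As a minimal sketch of querying such a map, assuming the stored embeddings are unit-norm and embed_text is a hypothetical text encoder matched to the VLM:

```python
import numpy as np

def nearest_object(nl_map, query_text, embed_text):
    """nl_map: list of (phi_i, (x_i, y_i, z_i)) pairs with unit-norm phi_i."""
    q = np.asarray(embed_text(query_text))      # e.g. embed_text("sponge")
    q = q / np.linalg.norm(q)
    sims = [float(np.dot(phi, q)) for phi, _ in nl_map]
    best = int(np.argmax(sims))                 # phi_i closest to the text embedding
    return nl_map[best][1]                      # navigation target position
```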

4.2 ROBOT CONSTITUTION

Key to safe robot operation is breaking down high level objectives relevant to humans into tasks a robot may perform. We specify this to robots using what we call a Robot Constitution, a list of rules an LLM is instructed to follow, inspired by methods like Constitutional AI (Bai et al., 2022). These rules are divided into three categories:

• Foundational rules inspired by Asimov’s three laws (Asimov, 1942) that govern robotics in general and govern interactions with humans. We modify the exact text of these laws as described in Appendix D.

• Safety rules describing what tasks are considered unsafe or undesired based on current capabilities in deployment. These discourage the collect policies from interacting with humans or animals. They also discourage handling sharp and fragile objects or electrical equipment.

• Embodiment rules describing limitations of the robot’s embodiment, such as its maximum payload and its unimanual nature, to discourage attempting tasks with heavier objects or tasks that require two arms (e.g. “opening a fridge and picking up a drink”).

A fourth category, the guidance rules, provides an input for an optional high-level human command: “The human command, which the robot should follow if given: {guidance}”. The way the robot constitution is used in task generation and affordance is explained below.

4.3 TASK GENERATION

Once a robot is in front of a manipulation scene s_i, it needs to generate a list of manipulation tasks to attempt. This is done via two steps:

• Scene description: Given an image from the robot camera, a VLM outputs text describing the scene the robot observes, and 5 objects that exist in that scene. For example, as shown in Fig. 1, the VLM lists soap, napkin, snack, cloth, sponge in the given scene.

• Task proposal: In this step, AutoRT is prompted to generate a list of tasks. This prompt begins with a system prompt, such as: “I am a robot operating in an office environment”, which describes the role the LLM should play. It continues with a list of rules that should be followed for task generation, codified by the robot constitution. The prompt ends with a section where we can inject the scene and object description from the prior VLM call. Given this prompt, an LLM generates a list of potential manipulation tasks (see Fig. 1). We note that the LLM is not fine-tuned to our specific use case, to maintain the generality of the underlying model.

An important detail of AutoRT is that we use multiple collect policies {π_1, π_2, …, π_k}, sampling one each episode. When a collect policy is sampled, task generation must be modified to match the capabilities of that policy. Thus, for each policy π_j, we append a π_j-specific suffix to the end of the task generation prompt. See Appendix D for the full text of the prompts.

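A hedged sketch of how these pieces could be assembled with Python string formatting (the mechanism Appendix D says the prompts use); the literal rule and suffix strings live in Appendix D, so the ones below are placeholders:

```python
SYSTEM_PROMPT = "I am a robot operating in an office environment."

def build_task_prompt(constitution_rules, scene, objects, policy_suffix,
                      guidance="N/A", num_tasks=10):
    rules = "\n".join(f"- {rule}" for rule in constitution_rules)
    return (
        f"{SYSTEM_PROMPT}\n"
        f"Rules to follow when proposing tasks:\n{rules}\n"
        f"The human command, which the robot should follow if given: {guidance}\n"
        f"Scene: {scene}\n"
        f"Objects: {', '.join(objects)}\n"
        f"Propose {num_tasks} manipulation tasks.\n"
        f"{policy_suffix}"  # pi_j-specific suffix, e.g. diversity hints for teleop
    )
```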

4.4 AFFORDANCE

Tasks generated by the LLM on the first pass may not fully follow the provided prompt and thus AutoRT uses an extra step of task filtering. This is done using another prompted LLM; one can view this as a self-reflection step where an LLM is prompted to critique its own output, inspired by approaches such as Reflexion (Shinn et al., 2023), ReAct (Yao et al., 2022), and Constitutional AI (Bai et al., 2022).

During the affordance step, in addition to the robot constitution, the LLM is further prompted with the list of collect policies available and text summaries of what each collect policy can do. For each generated task, the LLM is asked to either output a collect policy or a reason to reject that task. A few examples are provided to guide the LLM output into the desired format. This can be viewed as a classifier between the k collect policies, with an extra category for unknown tasks. The final task is then selected by randomly sampling from the accepted tasks. For instance, as shown in Fig. 1, the originally sampled policy is π_teleop. The first two tasks proposed by the LLM are classified as π_teleop, the next two tasks are classified as π_rt2, an autonomous policy from Brohan et al. (2023), and the last task is rejected because the embodiment of the robot does not allow for a bimanual task. The final task is sampled from the first two tasks. We found classifying between all collect policies worked fine, even though for filtering it would be sufficient to classify between π_i and not-π_i per episode.

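Read as code, the affordance step is a small (k+1)-way classification loop. In this sketch, llm, affordance_prompt, and parse_affordance_reply are hypothetical helpers standing in for the prompted LLM call and the few-shot output format described above:

```python
import random

def affordance_filter(llm, tasks, policies, sampled_policy):
    accepted = []
    for task in tasks:
        reply = llm(affordance_prompt(task, policies))  # a policy name, or a rejection reason
        label = parse_affordance_reply(reply)           # e.g. "teleop", "rt2", or "reject"
        if label == sampled_policy.name:                # keep tasks the sampled policy can do
            accepted.append(task)
    return random.choice(accepted) if accepted else None
```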

4.5 DATA COLLECTION

Any number of collect policies could be used, but our instance of AutoRT uses three: teleoperation, a scripted pick policy, and RT-2 (Brohan et al., 2023). The scripted pick policy pseudocode is provided in Appendix H. Each π_i has a different sampling probability p_i that is adjusted during collection, primarily based on the number of robots supervised per person. For example, if 1 person is supervising 3 robots, then the human teleoperation collect policy was sampled with probability p < 1/3 to respect the available supervision. After manipulation, the episode’s diversity is scored (see Section 5.1 for how), and the robot resets to start again. The human supervisor may occasionally reset the environment by hand.

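A sketch of supervision-aware sampling consistent with the example above; the names and the capping rule are our own illustrative assumptions:

```python
import random

def sample_collect_policy(policies, humans_available, robots_running):
    weights = []
    for p in policies:
        w = p.sampling_probability
        if p.name == "teleop":
            # Cap teleop at the human:robot ratio, so e.g. 1 human
            # supervising 3 robots gives p < 1/3 for teleoperation.
            w = min(w, humans_available / robots_running)
        weights.append(w)
    return random.choices(policies, weights=weights, k=1)[0]
```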

Recent works like Brohan et al. (2023) suggest Internet-scale visual-language data can drive generalization in downstream robotic models. Assuming these trends continue, the upcoming bottleneck will be action diversity – collecting useful, diverse motions that make progress towards new tasks in novel environments. Teleoperation is the most action-diverse collect policy, so we focus on keeping the throughput of teleoperation high (no worse than a “1 human 1 robot” setup), potentially at the cost of assisting autonomous robots less frequently. We additionally prompt task generation for teleop to collect varied tasks by including lines like “none of these tasks should be simple pick and place”. For a breakdown of throughput by collect policy, or a visualization of action trajectories, see Appendix I.

4.6 GUARDRAILS

AutoRT deploys foundation models in “in-the-wild” settings, but foundation models, even if prompted correctly and instruction-finetuned, have no guarantees on safety. We complement them with traditional robot environment controls as an additional layer of safety. These measures are detailed in Appendix C.

5 EXPERIMENTAL EVALUATION

Our experimental evaluation studies the deployment of AutoRT in a variety of real-world environments, covering about 7 months, 4 different buildings, simultaneous operation of over 20 robots, and about 77,000 real-world robotic trials. We aim to evaluate the diversity of the data collected by AutoRT, the degree to which we can steer the tasks that AutoRT attempts by modifying the prompt, the semantic and functional appropriateness of the automatically generated task proposals, and an initial evaluation showing an example application of the AutoRT-collected data to improve the RT-1 (Brohan et al., 2022) model.

AutoRT Environment Scaling: Our collection environments for the robots include offices, kitchens, and cafeterias. The same code is used in every environment; the only per-environment change is the difference in driving bounds, allowing AutoRT to start collecting in a new environment in < 1 day without too much setup. Some of these environments are shown in Fig. 2.

Figure 2: Examples of robot collect environments used. These environments have a variety of surfaces and semantically different objects to practice manipulation on, along with freedom for the robot to move between manipulation scenes.

AutoRT Robot Deployment Scaling: Each human supervised between 3 and 5 robots at once, allowing mobile manipulator deployment to scale faster than the number of humans employed. Some of AutoRT was run using stationary robots that skipped navigation, only running task generation and manipulation in a loop. These robots were easier to supervise due to their smaller range of motion, and were run with 1 human watching up to 8 robots. Human availability dictated the sampling ratios for collect policies.

Data statistics: In total, 53 robots were used to collect 77,000 new episodes over the course of 7 months, with a peak load of over 20 simultaneous robots. Over 6,650 unique instructions appear in the dataset. More details can be found in Fig. 3, Fig. 4, and Table 1. Interestingly, we find that the RT-2 success rate is quite low during collection, because the complex environments, objects, and requirement for navigation differed significantly from RT-2’s training set and inference capabilities. This influenced our decision to run RT-2 less frequently.

Figure 3: On the left is AutoRT robot usage and on the right is t-SNE visualization of tasks, colored by collect policy used. Each point corresponds to a different task string.

Figure 4: AutoRT episodes collected and unique tasks over time

Table 1: AutoRT data, split by collect policy used. The scripted policy is used most frequently, while teleoperation has the highest success rate.

5.1 DIVERSITY SCORING

Given a fixed budget of human oversight and a fleet of robots, we aim to collect as much useful data as possible. Evaluating this is challenging, because downstream methods for utilizing such data are still imperfect – despite considerable recent progress, RL methods present scalability challenges in such diverse environments (Cobbe et al., 2020), while imitation learning methods require near-optimal data. Thus, our measure of success for AutoRT is the diversity of the collected data. We consider two different axes of diversity: visual diversity (how diverse are the collected trajectories visually), and language diversity (how diverse are the natural language instructions proposed by our system). We additionally present an evaluation of the RT-1 model via filtered BC in Section 5.4; however, we note our evaluation is preliminary, and we hope that future advances in low-level robotic learning algorithms (e.g., RL and IL) will lead to better approaches for utilizing such data.

Language diversity: To measure language diversity, we use the L2 distance in a language embedding space – specifically the normalized 512-d embeddings of the Universal Sentence Encoder (Cer et al., 2018). We compare AutoRT’s task generation approach with the hand-designed tasks from three previous works: tasks from Language Table (Lynch et al., 2023), tasks from BC-Z (Jang et al., 2021), and tasks from RT-1 (Brohan et al., 2022). Table 2 shows AutoRT has a higher average distance between language embeddings and generates more diverse language than all other approaches.

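Concretely, we read this metric as the mean pairwise L2 distance over normalized embeddings. In the sketch below, embed stands in for the Universal Sentence Encoder; this is our reading of the measure, not the paper's evaluation code:

```python
import itertools
import numpy as np

def language_diversity(task_strings, embed):
    """Mean pairwise L2 distance between normalized 512-d sentence embeddings."""
    vecs = [np.asarray(embed(t), dtype=np.float64) for t in task_strings]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    pairs = itertools.combinations(vecs, 2)
    return float(np.mean([np.linalg.norm(a - b) for a, b in pairs]))
```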

Table 2: Diversity of language embeddings from task generators. AutoRT generates language embeddings that are further apart.

We additionally use the language diversity score to compare two VLMs for scene description without generating large amounts of robot data. We compare PaLI (Chen et al., 2022) and FlexCap (Review, 2023). Keeping the LLM prompts fixed, we first sample 70 random scenes the robots had seen so far. Each scene was described by each VLM, and their descriptions were passed to task generation. The diversity of language embeddings after affordance filtering was then used to score the VLMs. We found both VLMs led to better scores than our baselines. Qualitative examples of sampled tasks from the two VLMs are in Appendix G.

Visual diversity: To measure visual diversity, we utilize a clustering method similar to a diversity measure used in Tirumala et al. (2023). Robot episodes are first embedded by a visual encoder, then k-means unsupervised clustering is done in the space. New episodes are scored based on the distance from that episode’s embedding to the nearest k-means centroid. This distance is the diversity score, with larger distances indicating more novel data. We utilize a CLIP model as our embedder, finetuned to contrast {first image, goal image} embeddings with natural language captions (Xiao et al., 2023), and cluster with k = 1000. We found this was better at capturing semantic differences, although it does ignore intermediate images.

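A sketch of this scoring under the description above, with episode embeddings assumed to be precomputed by the CLIP-style embedder; scikit-learn's KMeans is used for the clustering step:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(prior_episode_embeddings, k=1000):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    return km.fit(prior_episode_embeddings).cluster_centers_

def visual_diversity(episode_embedding, centroids):
    dists = np.linalg.norm(centroids - episode_embedding, axis=1)
    return float(dists.min())  # distance to the nearest centroid; larger = more novel
```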

Fig. 5 shows the visual diversity across each of AutoRT’s data collection policies, along with the RT-1 dataset as a baseline. We find that the visual diversity is larger for each type of AutoRT data, with higher diversity in teleop than the scripted policy. Notably, RT-1’s dataset is only teleop, yet AutoRT is more diverse across all categories. Sample images are shown in Fig. 6. We also did an experiment where human supervisors directly optimized the visual diversity at collect time based on robot feedback. Further details are in Appendix E.

Figure 5: Visual diversity visualizations for AutoRT, as scored by distance to closest k-means centroid. Left: Histogram of 1000 random successes per collect policy (or all successes from RT-2 collect). Right: CDF of distributions, median of distribution annotated. Higher distances (more weight on the right) are further from prior data, and thus better. We find all AutoRT data is more diverse due to running in more varied environments, with teleop data from AutoRT scoring best.

Figure 6: Example last-frame images (color corrected) from RT-1 (left) and AutoRT (right)

5.2 TASK GENERATION

In this section we study the quality of task generation prior to filtering, based on feasibility (is the task possible) and relevance (does the task follow high-level guidance), and compare against two baselines. First, a simple templated language approach that matches a random verb from a hardcoded list with an object seen by the VLM, e.g. “ ”. This reflects the language instruction process used in RT-1. Second, to ablate how well AutoRT can be steered towards useful tasks, we consider an AutoRT (unguided) variant that removes the guidance rules from the prompt.

To evaluate, the robot is placed in front of 5 scenes. We generate 75 tasks in total, using guidance like “collect gardening tasks” or “how would you clean this mess?” for AutoRT (guided). Results are shown in Table 3. We find that AutoRT’s tasks (guided and unguided) are 1.5x more likely to be feasible than templated language. The large increase in feasibility is because naively mix-and-matching verbs is likely to generate nonsense language like “open keyboard”, whereas LLMs will tend to generate sensible language. We further find that we can guide task generation towards gardening, cleaning, etc., which is promising for allowing end-users to tell robots what data they would like them to collect. Qualitative outputs are in Appendix G.

Table 3: Comparison of task generation methods at generating completable tasks and relevant tasks. Injecting the high-level guidance into the LLM prompt improves the relevance of generated tasks. Using an LLM at all improves both feasibility and relevance thanks to common-sense inherited from Internet-scale data.

5.3 AFFORDANCE AND ROBOT CONSTITUTION

In this section we study the effect of constitutional prompting and LLM self-critiquing on identifying safe and feasible tasks. Task generation and filtering are evaluated via two metrics: % Safe, the fraction of safe and feasible tasks proposed by AutoRT, and Recall, how often the self-critiquing step correctly rejects unsuitable tasks generated during the task proposal step.

Accuracy of AutoRT Task Generation: Across a sample of 64 scenes, we consider all 259 tasks generated and label whether each task is safe and feasible to collect. In this sample, we found 31 tasks that ought to have been rejected, giving a base rate of 228/259 = 88% acceptable tasks. After the LLM affordance filtering step, we see the rate of acceptable tasks increase to 200/214 = 93%.

When evaluating affordance, over-rejecting tasks is better than under-rejecting them, so we further evaluate the recall of rejected tasks. How often does the LLM reject (or fail to reject) tasks that should be rejected? Of the 31 unsuitable tasks, the LLM rejected 17/31 = 55% of them. Additionally, we find that all 14 errors occurred during teleop task sampling, attributable to forcing teleop task generation to remain highly diverse. These tasks were rejected by the teleoperator during collection, indicating the importance of human-in-the-loop supervision, both as a safety mechanism and as a source of intervention data to improve the affordance of task generation.

Adversarial Testing of Constitutional Prompting: To measure the effect of constitutional prompting, we set up deliberately adversarial scenes, and ablate our rules from the task generation prompt and affordance prompt. First, 5 test scenes were set up with objects that the robot should not interact with, including lifelike toy animals, sharp items, and people. Three task generation prompts are used: an unsafe prompt (designed to propose unsafe tasks), a minimal prompt (describing task generation without rules or constitution), and the constitutional prompt. These tasks are then filtered via two affordance prompts: a minimal one (describing affordance classification) and a constitutional one. Full prompt texts are in Appendix D.1. We show in Table 4 that the rate of safe tasks is significantly increased when robot constitution is included at task generation time or affordance filtering time, with best results when included at both steps. Additionally constitutional prompting is able to achieve high recall when given unsafe tasks.

Table 4: Effect of constitutional prompting on safety of proposed tasks

5.4 MODEL TRAINING

The data generated by AutoRT covers a significantly wider range of language and visuals than datasets such as RT-1 (Brohan et al., 2022). As a sanity check on the usefulness of the data, we run a training comparison with the RT-1 model. A pretrained RT-1 model is co-fine-tuned on a 50-50 mixture of the pretraining dataset described in Brohan et al. (2022) and AutoRT’s dataset. RT-1 is used instead of RT-2 because it trains more quickly and cheaply.

The co-fine-tuned model is evaluated on two tasks we find RT-1 generalizes poorly to: picking from different heights, and wiping. Exact evaluation instructions and details are in Appendix F. When co-fine-tuned, RT-1’s performance increases from 0% to 12.5% on picking from different heights, and from 10% to 30% on wiping. We additionally include an ablation where we train on only the teleoperated segment of AutoRT data. We find this model is no longer able to pick from different heights, indicating that non-teleoperated AutoRT data can be useful. These increases are modest, but we note that the focus of AutoRT was on collecting diverse data, not on achieving high success rates. RT-1 training was done to verify the data could improve the model, but the high diversity of tasks and scenarios leads to a challenging learning problem that is hard to perform well at.

Table 5: Results from co-finetuning RT-1 on AutoRT data

6 CONCLUSION, LIMITATIONS, AND FUTURE WORK

We presented AutoRT, an approach for directing fleets of robots to collect data in the real world, autonomously and with human help, supervised by large-scale vision and language models. We demonstrated that this approach results in useful, diverse, and large-scale data – leading to 77k real-world demonstrations collected by over 20 robots in 7 months in 4 buildings. We further introduced a robot constitution – which defined foundational rules, outlined safety constraints, and detailed the robot’s embodiment – and ablated the system design to show its usefulness. Finally, by training a model on this collected data we demonstrated novel capabilities and improved generalization over state-of-the-art models. We believe this work is a step towards scaling robot data collection to the breadth of foundation models as well as embodying foundation models into robotic systems.

Despite the promise of AutoRT, the current approach comes with a number of limitations.

  1. AutoRT relies in large part on scripted and learned policies to scale collection for a fixed teleoperation budget. If these policies only handle simpler tasks or have lower success rates in unseen settings, it lowers the throughput of successful episodes. Scaling the generation of higher-quality data requires more robust and diverse autonomous collect policies, as in Arenas et al. (2023).

  2. Communication bandwidth between scene description and language model can introduce an information bottleneck in AutoRT. Failures of perception, such as hallucination of objects, lack of generalization to novel environments, and motion blur, can introduce and propagate failures in the system. As noted by prior work (Ahn et al., 2022; Mees et al., 2023; Gao et al., 2023), foundation models also face challenges in reasoning about task- and embodiment-specific information, such as the physics of objects and the capabilities of the robot. We ignored this for simplicity, but expect future efforts to require more accurate real-world reasoning.

  3. The type of data collected by AutoRT tends to be highly diverse, leading to fewer samples per task and lots of variety in scenes and object configurations. This “sparse” data presents a harder learning problem than the datasets used in existing state-of-the-art robot learning methods like Brohan et al. (2022) and Brohan et al. (2023). AutoRT assumes data collection is decoupled from the control policy, but achieving the best policy improvement would likely require the two to evolve in tandem with each other.

  4. Lastly, though constitutional prompting improves the safety of generated tasks, prompting an LLM does not guarantee that the prompt’s instructions will be followed, and a small percentage of unsafe tasks generated by the LLM will pass the affordance filtering. This necessitates some degree of human supervision.

As we explore future directions, a chief question is how a robot should autonomously act in the world. What we call a robot constitution has historically been a topic reserved for science fiction (Asimov, 1942), but this work concretizes a real application where such rules could be helpful. We also see future work in treating model improvement and data collection as a single goal, rather than two separate areas, with an eye on identifying proximal skills and improving sample efficiency via directed data collection.

ACKNOWLEDGMENTS

We thank Celeste Barajas, Joseph Dabis, Gavin Gonzalez, Tomas Jackson, Alex Luong, Utsav Malla, Emily Perez, Elio Prado, Jornell Quiambao, Sangeetha Ramesh, Jaspiar Singh, Clayton Tan, Jodexty Therlonge, Eric Tran, Steven Vega, and Samuel Wan for assistance on data collection, model evaluation, and AutoRT supervision. We thank Anthony Brohan and Noah Brown for assistance on data analysis. We thank David DoVo, Regine Firmeza, Tad Koch, Gus Kouretas, Jessica Lam, Thien Nguyen, and Eric Zankiewicz for robot setup and maintenance. We thank Nicolas Heess, Jacky Liang, Vincent Vanhoucke, and Andy Zeng for providing feedback on paper drafts.

APPENDIX

A ROBOT AND SYSTEM SETUP

Each robot is a 7 DoF robot arm attached to a mobile base, with a camera mounted on the head of the robot. The robot is capable of both navigation and manipulation. At collection time, the robot is driven to a location which could be either a natural environment, such as an office area, a kitchen area, a lounge, or an artificially set up room with objects on different surfaces. The robots are given the bounding box of the region they should stay within for safety purposes, but are not given any information on object locations ahead of time, and must explore the area to find objects for themselves.

The code is structured in a form we call the policy graph. Each node v ∈ V of the policy graph is a subpolicy π(a | s, data), where s is the robot state, a is the robot action, and data is information that accumulates as we go through the graph. The collect policies {π_1, …, π_k} are themselves subpolicies in the policy graph, but the policy graph also includes subpolicies for navigation, and subpolicies whose focus is only querying the LLM. Subpolicies that do not move the robot simply output a no-op action a.

After every timestep, we check the transition conditions β defined for each node. Transition conditions β : S × Data → {0, 1} × V are functions that take the current state and accumulated data, and decide if a subpolicy should yield control to the next node, and if so, which one. These conditions are similar to those in a finite-state machine. A given node can have multiple incoming and outgoing transition conditions. When there are multiple outgoing conditions, only one should be true at a time. For example, in Fig. 1 the AffordanceFilter has k outgoing transition conditions, one for each of the collect policies π_i ∈ {π_1, …, π_k}, and the DiversityScoring node has k incoming transition conditions, one from each collect policy.

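A hedged sketch of this structure in Python; the class names and the apply_action hook are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Transition:
    condition: Callable   # beta(state, data) -> bool: should control move on?
    target: "Node"        # node to hand control to when the condition fires

@dataclass
class Node:
    subpolicy: Callable   # pi(a | s, data); outputs a no-op if it doesn't move the robot
    transitions: List[Transition] = field(default_factory=list)

def step(node, state, data, apply_action):
    apply_action(node.subpolicy(state, data))  # one timestep of the current subpolicy
    for t in node.transitions:                 # at most one outgoing condition fires
        if t.condition(state, data):
            return t.target
    return node                                # otherwise stay in the current node
```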

One property of AutoRT is that it only generates tasks based on what the robot sees, which can bias task generation. For example, if run in an office environment, AutoRT will mostly see office supplies and generate office-based tasks. To get better coverage of task space, we gathered many (over 100) random objects, like plastic toys and soda cans, and scattered some of them in the environments each day, swapping the objects every day. This provides a greater variety of objects for AutoRT’s task generation.

B NAVIGATION SAMPLING

We first define a fixed query embedding with the goal of biasing sampling towards easier tasks. A short list of object names from previous works was gathered.

apple, basket, blue can, bottled tea, bowl, box of tea, brown chip bag, can, cereal, chip bag, clipboard, coffee machine, coffee_machine, compost, compost bin, cup, drawer, drinking machine, empty bottle, energy bar, espresso machine, ficus, first aid station, fridge, fruit, green bag of chips, green can, green plant, green soda can, human, jar of white candy, landfill, light switch, microwave oven, mini fridge, multigrain chip, napkin box, orange, paper bowl, paper cup, pepsi, plastic bottle, poster, potted plant, red can, silver spoon, sink, slippery sign, snack jar, snack jar of almonds, snack jar of dried fruits, snack jar of gums, snack jar of nuts, socket, sponge, table, tap, trash can, tv, up side down mug, upside down paper cup, water bottle, water machine, water_bottle, white bowl, white chair, white jar, white mug, white sign, woven basket, yellow sign

This list was gathered once, and not changed or ablated during the project.

We defined φ_q as the normalized average text embedding for these object names. Each navigation target φ_i was then scored from 0 to 1 by:

[Scoring equation shown as an image in the original post.]

and sampled proportionally to score(φ_i)^β, where β is a hyperparameter deciding the temperature of sampling. We use β = 1 in data collection to maintain higher variation during collection, but recommend using a larger β when doing more targeted data collection.

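The scoring formula survives only as an image in this copy of the post. One reconstruction consistent with the surrounding definitions (unit-norm embeddings, scores in [0, 1], temperature β), offered as an assumption rather than the paper's verbatim formula:

```latex
% Hedged reconstruction, not the original equation:
\mathrm{score}(\varphi_i) = \frac{1 + \varphi_q^{\top} \varphi_i}{2} \in [0, 1],
\qquad
P(\text{navigate to target } i) \propto \mathrm{score}(\varphi_i)^{\beta}
```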

C GUARDRAILS

The following guardrails are put in place to ensure operational safety.

• All robots will pause motion if detected force on joints exceeds a threshold. All robots can also be immediately disengaged using a physical E-stop button.
• Unless the robot workspace is barricaded, at least one human must supervise the robots in such a way that all robots are within line of sight.
• During regular operation, we proactively remove objects from the environment that are unsafe for a robot to handle. This is in addition to prompting the LLM to not interact with them.
• Whenever we collect a human demonstration, the human expert sanity checks the generated task, since they are already available to provide human feedback to the model.

Many of these controls are standard practice in robot learning. As robot policies and LLMs improve, user expectations of robots will increase, and we anticipate verification protocols to become more complex and important to get right.

D PROMPTS

All prompts are based on Python string formatting. When doing teleop task generation, we use num_tasks=10. Task generation guidance is set to “N/A” unless specified otherwise.

Robot constitution:

Asimov’s three laws of robotics are modified in two ways. The first law removes the “through inaction” part, as our robot’s agency is limited and we do not want to bias towards inaction. The order of the second and third laws is swapped, since our robots are currently more in need of protection from humans asking for tasks which could endanger the robots, rather than the other way around.

[Prompt image omitted.]

Task generation prompt for teleop policy:

[Prompt image omitted.]

Task generation prompts for RT-2:

[Prompt image omitted.]

Task generation prompts for scripted pick:

[Prompt images omitted.]

Affordance LLM prompt:

[Prompt image omitted.]

D.1 PROMPTS FOR ADVERSARIAL EXPERIMENTS

Minimal task generation prompt for teleop. This is identical to the default prompt, without the inclusion of robot constitution rules.

[Prompt image omitted.]

Unsafe task generation prompt for teleop. This both removes the constitutional rules and modifies the prompt to oversample tasks we want the affordance filter to capture.

[Prompt image omitted.]

Minimal affordance LLM prompt used for affordance filtering ablation. This is identical to the default one, without the inclusion of the robot constitution rules.

[Prompt image omitted.]

E OPTIMIZING VISUAL DIVERSITY

Since our robot agents can calculate visual diversity scores after every episode, we can use this as a metric to optimize. We perform a pilot study where the robot speaks out loud the diversity score of the episode it has collected. The human supervising the data collection pays attention to this score, and changes the scene between episodes to try to maximize the spoken score. The resulting scenes in Fig. 7 feature more distractor objects, askew tables, and unconventional object arrangements like turned-over recycling bins and objects on top of chairs. This demonstrates another benefit of quantifying data diversity – it can provide online feedback that allows for faster iteration loops during data collection.

Figure 7: Robot environments before and after adjusting scene based on visual diversity. Note the unconventional arrangement of objects, surfaces, and distractors.

F MODEL IMPROVEMENT EVALUATION TASKS

For picking from different heights, pick attempts were done at 3 different heights: a desk, a shorter table, and the floor. For each height, we sampled 4 candidate tasks, giving 12 tasks in total. For wiping evals, the scene was set up with a table, a sponge, and a cloth, and we sampled 5 wiping tasks, some of which required using the correct object, and some of which could use either the sponge or the cloth. All tasks were attempted 2 times each. Exact task strings are in Table 6.

Table 6: Tasks used to evaluate training ablations

G QUALITATIVE EXAMPLES

We collect qualitative examples of LLM generations here. Table 7 lists sample text generations from AutoRT when using different VLMs. Table 8 lists tasks from the Section 5.2 experiments for templated language, unguided AutoRT, and guided AutoRT. Table 9 lists tasks from adversarial testing of constitutional prompting.

Table 7: Example generated tasks with AutoRT using the teleoperated prompt, comparing two different VLMs for describing the scene and nearby objects. We found FlexCap to be more descriptive in its object description, particularly with regards to color.

Table 8: Examples from Section 5.2 experiments testing relevance and feasibility

Table 9: Tasks generated in Section 5.3 experiments. We present an image the robot sees, tasks generated by the unsafe task generation prompt, and the reply of both the minimal affordance and constitutional affordance.

H SCRIPTED PICK

Below is pseudocode for the scripted picking policy used in data collection. The first draft of this code was generated by an LLM, but changes were later made by hand to better comment behavior and improve robustness in edge cases. Our early explorations into code generation have found that LLMs can generate a good first attempt, but that first attempt often misses edge cases that need to be handled to make the code suitable for long-running data collection.

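The pseudocode image itself is omitted from this copy of the post, so the sketch below is our reconstruction of what such a scripted pick policy typically looks like. Every helper (detect_objects, top_down_grasp_pose, move_to_pose, above) is a hypothetical robot-API stand-in, not the authors' code:

```python
def scripted_pick(robot, target_object):
    detections = detect_objects(robot.camera_image())
    matches = [d for d in detections if d.label == target_object]
    if not matches:
        return "failure: object not found"          # the kind of edge case handled by hand
    grasp = top_down_grasp_pose(matches[0])
    move_to_pose(robot, above(grasp, height=0.15))  # pre-grasp approach from above
    move_to_pose(robot, grasp)
    robot.close_gripper()
    move_to_pose(robot, above(grasp, height=0.15))  # lift the object
    if robot.gripper_holding_object():              # verify the grasp actually succeeded
        return "success"
    return "failure: grasp slipped"
```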

I TRAJECTORY DIVERSITY

Figure 8: Robot trajectories from scripted motion (left) and teleop motion (right). Note that teleop is, on the whole, much more diverse from a trajectory perspective.

Figure 9: Hours of data collected per policy per day. We aimed for teleop collect throughput to exceed a simple 1 person:1 robot baseline. We found a small increase in teleop throughput from AutoRT since AutoRT used fewer manual resets than typical collection (a robot can navigate to a new scene instead of waiting for a reset).
