Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
https://umi-gripper.github.io
Abstract—We present Universal Manipulation Interface (UMI) – a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, the UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.
I. INTRODUCTION
How should we demonstrate complex manipulation skills for robots to learn from? Attempts in the field have approached this question primarily from two directions: collecting targeted in-the-lab robot datasets via teleoperation or leveraging unstructured in-the-wild human videos. Unfortunately, neither is sufficient, as teleoperation requires high setup costs for hardware and expert operators, while human videos exhibit a large embodiment gap to robots.
Recently, using sensorized hand-held grippers as a data collection interface [41, 50, 36] has emerged as a promising middle-ground alternative – simultaneously minimizing the embodiment gap while remaining intuitive and flexible. Despite their potential, these approaches still struggle to balance action diversity with transferability. While users can theoretically collect any actions with these hand-held devices, much of that data cannot be transferred to an effective robot policy. As a result, despite achieving impressive visual diversity across hundreds of environments, the collected actions are constrained to simple grasping [41] or quasi-static pick-and-place [50, 36], lacking action diversity.
What prevents action transfer in previous work? We identified a few subtle yet critical issues:
• Insufficient visual context: While using a wrist-mounted camera is key for aligning the observation space and enhancing device portability, it restricts the scene's visual coverage. The camera's proximity to the manipulated object often results in heavy occlusions, providing insufficient visual context for action planning.
• Action imprecision: Most hand-held devices rely on monocular structure-from-motion (SfM) to recover robot actions. However, such methods often struggle to recover precise global actions due to scale ambiguity, motion blur, or insufficient texture, which significantly restricts the precision of tasks for which the system can be employed.
• Latency discrepancies: During hand-held data collection, observation and action recording occur without latency. However, during inference, various latency sources, including sensor, inference, and execution latencies, arise within the system. Policies unaware of these latency discrepancies will encounter out-of-distribution input and in turn, generate out-of-sync actions. This issue is especially salient for fast and dynamic actions.
• Insufficient policy representation: Prior works often use simple policy representations (e.g., MLPs) with an action regression loss, limiting their capacity to capture the complex multimodal action distributions inherent in human data. Consequently, even with precisely recovered demonstrated actions and all discrepancies removed, the resulting policy could still struggle to fit the data accurately. This further hampers large-scale, distributed human data collection, as more demonstrators increase action multimodality.
In this paper, we address these issues with careful design of the demonstration and policy interface:
• First, we aim to identify the right physical interface for human demonstration that is intuitive while also able to capture all the information necessary for policy learning. Specifically, we use a fisheye lens to increase the field of view and visual context, and add side mirrors on the gripper to provide implicit stereo observation. Combined with the GoPro's built-in IMU sensor, this enables robust tracking under fast motion.
• Second, we explore the right policy interface (i.e., observation and action representations) that could make the policy hardware-agnostic and thereby enable effective skill transfer. Concretely, we employ inference-time latency matching to handle different sensor observation and execution latencies, use a relative trajectory as the action representation to remove the need for precise global actions, and finally, apply Diffusion Policy [9] to model multimodal action distributions.
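To make the relative-trajectory action representation concrete, here is a minimal sketch (an illustration under our own naming, not the authors' released code) of re-expressing absolute end-effector poses in the current end-effector frame, which removes any dependence on a global or robot-base coordinate frame:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_mat(pos, rotvec):
    """Position (3,) + axis-angle rotation (3,) -> 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R.from_rotvec(rotvec).as_matrix()
    T[:3, 3] = pos
    return T

def to_relative_trajectory(current_pose, future_poses):
    """Express each future EE pose in the current EE frame:
    T_rel = inv(T_current) @ T_future, removing global-frame dependence."""
    T_cur_inv = np.linalg.inv(pose_to_mat(*current_pose))
    return [T_cur_inv @ pose_to_mat(p, r) for p, r in future_poses]
```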
The final system, Universal Manipulation Interface (UMI), provides a practical and accessible framework to unlock new robot manipulation skills, allowing us to demonstrate any actions in any environment while maintaining high transferability from human demonstration to robot policy.
Fig. 1: Universal Manipulation Interface (UMI) is a portable, intuitive, low-cost data collection and policy learning framework. This framework allows us to transfer diverse human demonstrations to effective visuomotor policies. We showcase the framework on tasks that would be difficult with traditional teleoperation, such as dynamic, precise, bimanual and long-horizon tasks.
Fig. 2: UMI Demonstration Interface Design. Left: Hand-held grippers for data collection, with a GoPro as the only sensor and recording device. Middle: Image from the GoPro's 155° fisheye view. Note the physical side mirrors highlighted in green, which provide implicit stereo information. Right: The UMI-compatible robot gripper and camera setup makes the observation similar to the hand-held gripper view.
With just a wrist-mounted camera on the hand-held gripper (Fig. 2), we show that UMI is capable of achieving a wide range of manipulation tasks that involve dynamic, bimanual, precise and long-horizon actions by only changing the training data for each task (Fig. 1). Furthermore, when trained with diverse human demonstrations, the final policy exhibits zero-shot generalization to novel environments and objects, achieving a remarkable 70% success rate in out-of-distribution tests, a level of generalizability seldom observed in other behavior cloning frameworks. We open-source the hardware and software system at https://umi-gripper.github.io.
II. RELATED WORKS
A key enabler for any data-driven robotics system is the data itself. Here, we review a few typical data collection workflows in the context of robotic manipulation.
A. Teleoperated Robot Data
Imitation learning learns policies from expert demonstrations. Behavior cloning (BC), utilizing teleoperated robot demonstrations, stands out for its direct transferability. However, teleoperating real robots for data collection poses significant challenges. Previous approaches utilized interfaces such as the 3D SpaceMouse [9, 54], VR or AR controllers [35, 3, 13, 19, 31, 51, 12], smartphones [44, 45, 22], and haptic devices [38, 47, 43, 26, 4] for teleoperation. These methods are either very expensive or hard to use due to high latency and lack of user intuitiveness. While recent advancements in leader-follower (i.e., puppeteering) devices such as ALOHA [53, 15] and GELLO [46] offer promise with intuitive and low-cost interfaces, their reliance on real robots during data collection limits the type and number of environments the system can gain access to for "in-the-wild" data acquisition. Exoskeletons [14, 20] remove the dependence on real robots during data collection; however, they require fine-tuning using teleoperated real-robot data for deployment. Moreover, the resulting data and policies from the aforementioned devices are embodiment-specific, preventing reuse across different robots. In contrast, UMI eliminates the need for physical robots during data collection and offers a more portable interface for in-the-wild robot teaching, providing data and policies that are transferable to different robot embodiments (e.g., 6DoF or 7DoF robot arms).
B. Visual Demonstrations from Human Video
There is a distinct line of work dedicated to policy learning from in-the-wild video data (e.g., YouTube videos). The most common approach is to learn from diverse passive human demonstration videos. Utilizing passive human demonstrations, previous works learn task cost functions [37, 8, 1, 21], affordance functions [2], dense object descriptors [40, 24, 39], action correspondences [33, 28], and pre-trained visual representations [23, 48]. However, this approach encounters three major challenges. First, most video demonstrations lack explicit action information, which is crucial for learning generalizable policies. To infer action data from passive human video, previous works resort to hand pose detectors [44, 1, 38, 28], or combine human videos with in-domain teleoperated robot data to predict actions [33, 20, 34, 28]. Second, the evident embodiment gap between humans and robots hinders action transfer. Efforts to bridge the gap include learning human-to-robot action mappings with hand pose retargeting [38, 28] or extracting embodiment-agnostic keypoints [49]. Despite these attempts, the inherent embodiment differences still complicate policy transfer from human video to physical robots. Third, the observation gap induced by the embodiment gap introduces an inevitable mismatch between train-time and inference-time observation data, further hampering the transferability of the resulting policies, despite efforts to align demonstration observations with robot observations [20, 28]. In contrast, data collected with UMI exhibit a minimal embodiment gap in both action and observation spaces, enabled by precise manipulation action extraction via robust visual-inertial camera tracking and the shared fisheye wrist-mounted cameras during teaching and testing. Consequently, this enables in-the-wild zero-shot policy transfer for dynamic, bimanual, precise, and long-horizon manipulation tasks.
C. Hand-Held Grippers for Quasi-static Actions
Hand-held grippers [41, 50, 10, 32, 27, 25] minimize observation embodiment gaps in manipulation data collection, offering portability and intuitive interfaces for efficient data collection in the wild. However, accurately and robustly extracting the 6DoF end-effector (EE) pose from these devices remains challenging, hindering the deployment of robot policies learned from these data on fine-grained manipulation tasks. Prior works attempted to address this issue through various approaches, such as SfM [50, 25], which suffers from scale ambiguity; RGB-D fusion [41], which requires expensive sensors and onboard compute; and external motion tracking [32, 27], which is limited to lab settings. These devices, constrained to quasi-static actions due to low EE tracking accuracy and robustness, often necessitate cumbersome onboard computers or external motion capture (MoCap) systems, diminishing their feasibility for in-the-wild data collection. In contrast, UMI integrates state-of-the-art SLAM [6] with built-in IMU data from the GoPro to accurately capture 6DoF actions at absolute metric scale. The high-accuracy data enables the trained BC policy to learn bimanual tasks. With thorough latency matching, UMI further enables real-world deployable policies for dynamic actions such as tossing.
Recently, Dobb-E [36] proposed a "reacher-grabber" tool mounted with an iPhone to collect single-arm demonstrations for the Stretch robot. Yet, Dobb-E only demonstrates policy deployment for quasi-static tasks and requires environment-specific policy fine-tuning. Conversely, using only data collected with UMI enables the trained policy to zero-shot generalize to novel in-the-wild environments, unseen objects, and multiple robot embodiments, for dynamic, bimanual, precise and long-horizon tasks.
III. METHOD
Universal Manipulation Interface (UMI) is a hand-held data collection and policy learning framework that allows direct transfer from in-the-wild human demonstrations to deployable robot policies. It is designed with the following goals in mind:
• Portable. The hand-held UMI grippers can be taken to any environment and start data collection with close-to-zero setup time.
• Capable. The ability to capture and transfer natural and complex human manipulation skills beyond pick-and-place.
• Sufficient. The collected data should contain sufficient information for learning effective robot policies and contain minimal embodiment-specific information that would prevent transfer.
• Reproducible. Researchers and enthusiasts should be able to consistently build UMI grippers and use the data to train their own robots, even with different robot arms.
The following sections describe how we enable the above goals through our hardware and policy interface design.
A. Demonstration Interface Design
UMI's data collection hardware takes the form of a trigger-activated, hand-held, 3D-printed parallel-jaw gripper with soft fingers, mounted with a GoPro camera as the only sensor and recording device (see HD1). For bimanual manipulation, UMI can be trivially extended with another gripper. The key research question we need to address here is:
How can we capture sufficient information for a wide variety of tasks with just a wrist-mounted camera?
Specifically, on the observation side, the device needs to capture sufficient visual context to infer actions HD2 and critical depth information HD3. On the action side, it needs to capture precise robot actions under fast human motion HD4 and subtle adjustments in gripping width HD5, and automatically check whether each demonstration is valid given the robot hardware kinematics HD6. The following sections describe how we achieve these goals.
HD1. Wrist-mounted cameras as input observation. We rely solely on wrist-mounted cameras, without the need for any external camera setups. When deploying UMI on a robot, we place GoPro cameras at the same location with respect to the same 3D-printed fingers as on the hand-held gripper. This design provides the following benefits:
1) Minimizing the observation embodiment gap. Thanks to our hardware design, the videos observed in wrist-mounted cameras are almost indistinguishable between human demonstrations and robot deployment, making the policy input less sensitive to embodiment.
2) Mechanical robustness. Because the camera is mechanically fixed relative to the fingers, mounting UMI on robots does not require camera-robot-world calibration. Hence, the system is much more robust to mechanical shocks, making it easy to deploy.
3) Portable hardware setup. Without the need for an external static camera or additional onboard compute, we largely simplify the data collection setup and make the whole system highly portable.
4) Camera motion for natural data diversification. A side benefit we observed from experiments is that when training with a moving camera, the policy learns to focus on task-relevant objects or regions instead of background structures (similar in effect to random cropping). As a result, the final policy naturally becomes more robust against distractors at inference time.
Avoiding the use of external static cameras also introduces additional challenges for downstream policy learning. For example, the policy now needs to handle non-stationary and partial observations. We mitigate these issues by leveraging a wide-FoV fisheye lens HD2 and robust visual tracking HD4, described in the following sections.
Fig. 3: Fisheye vs. Rectilinear. (a) UMI policies use the raw fisheye image as observation. (b) Rectifying a large 155° FoV image to the pinhole model severely stretches the peripheral view (outside the blue line), while compressing the most important information at the center into a small area (inside the red line).
HD2. Fisheye lens for visual context. We use a 155-degree fisheye lens attachment on the wrist-mounted GoPro camera, which provides sufficient visual context for a wide range of tasks, as shown in Fig. 2. As the policy input, we directly use raw fisheye images without undistortion, since the fisheye effect conveniently preserves resolution in the center while compressing information in the peripheral view. In contrast, a rectified pinhole image (Fig. 3 right) exhibits extreme distortions, making it unsuitable for learning due to the wide FoV. Beyond improving SLAM robustness with increased visual features and overlap [52], our quantitative evaluation (Sec. V-A) shows that the fisheye lens improves policy performance by providing the necessary visual context.
Fig. 4: UMI Side Mirrors. The ultra-wide-angle camera, coupled with strategically positioned mirrors, facilitates implicit stereo depth estimation. (a): The view through each mirror effectively creates two virtual cameras, whose poses are reflections of the main camera across the mirror planes. (b): Ketchup on the plate, occluded from the main camera view, is visible inside the right mirror, demonstrating that the mirrors simulate cameras with different optical centers. (c): We digitally reflect the content inside the mirrors for policy observation. Note that the orientation of the cup handle becomes consistent across all 3 views after reflection.
HD3. Side mirrors for implicit stereo. To mitigate the lack of direct depth perception from the monocular camera view, we placed a pair of physical mirrors in the cameras’ peripheral view which creates implicit stereo views all in the same image. As illustrated in Fig 4 (a), the images inside the mirrors are equivalent to what can be seen from additional cameras reflected along the mirror plane, without the additional cost and weight. To make use of these mirror views, we found that digitally reflecting the crop of the images in the mirrors, shown in Fig 4 (c), yields the best result for policy learning (Sec. V-A). Note that without digital reflection, the orientation of objects seen through side mirrors is the opposite of that in the main camera view.
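A minimal sketch of this digital reflection step, assuming the mirror regions can be approximated by fixed axis-aligned pixel boxes (the actual crop geometry depends on the gripper and lens, and is not specified here):

```python
import numpy as np

def reflect_mirror_crops(img: np.ndarray, mirror_boxes) -> np.ndarray:
    """Horizontally flip each mirror crop in place so that object
    orientations match the main camera view.
    img: (H, W, 3) frame; mirror_boxes: iterable of (x0, y0, x1, y1)."""
    out = img.copy()
    for x0, y0, x1, y1 in mirror_boxes:
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1][:, ::-1]
    return out
```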
HD4. IMU-aware tracking. UMI captures rapid movements with absolute scale by leveraging the GoPro's built-in capability to record IMU data (accelerometer and gyroscope) into standard mp4 video files [18]. By jointly optimizing visual tracking and inertial pose constraints, our inertial-monocular SLAM system based on ORB-SLAM3 [7] maintains tracking for a short period of time even if visual tracking fails due to motion blur or a lack of visual features (e.g., looking down at a table). This allows UMI to capture and deploy highly dynamic actions such as tossing (shown in Fig. 7). In addition, the joint visual-inertial optimization allows direct recovery of the real metric scale, important for action precision and inter-gripper pose proprioception PD2.3: a critical ingredient for enabling bimanual policies.
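Because the camera is rigidly mounted relative to the fingers (HD1), converting the SLAM camera trajectory into end-effector poses reduces to applying one fixed extrinsic transform. A sketch under that assumption; the identity placeholder stands in for a real calibrated value:

```python
import numpy as np

# Placeholder extrinsic mapping the camera frame to the tool-center-point
# frame; a real value would come from the CAD model or a calibration step.
T_CAM_TO_TCP = np.eye(4)

def camera_traj_to_tcp_traj(world_from_camera_poses):
    """Each input is a 4x4 world-from-camera pose from visual-inertial SLAM;
    the output is the corresponding world-from-TCP trajectory."""
    return [T_wc @ T_CAM_TO_TCP for T_wc in world_from_camera_poses]
```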
HD5. Continuous gripper control. In contrast to the binary open-close action used in prior works [41, 44, 54], we found that commanding gripper width continuously significantly expands the range of tasks doable by parallel-jaw grippers. For example, the tossing task (Fig. 7) requires precise timing for releasing objects. Since objects have different widths, binary gripper actions are unlikely to meet the precision requirement. On the UMI gripper, finger width is continuously tracked via fiducial markers [16] (Fig. 2 left). Using the series-elastic end effector principle [42], UMI can implicitly record and control grasp forces by regulating the deformation of the soft fingers through continuous gripper width control.
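To illustrate the idea, the sketch below detects one ArUco marker per finger with OpenCV and maps the pixel distance between marker centers to a metric width; the marker IDs, dictionary, and linear calibration are assumptions for this example, not the paper's actual values:

```python
import cv2
import numpy as np

DICTIONARY = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
DETECTOR = cv2.aruco.ArucoDetector(DICTIONARY)  # OpenCV >= 4.7 detector API

def estimate_gripper_width(frame, left_id=0, right_id=1,
                           px_to_m=1e-4, offset_m=0.0):
    """Return the estimated finger width in meters, or None if the markers
    are not visible. px_to_m/offset_m form an assumed linear calibration,
    plausible because the markers sit at a fixed distance from the camera."""
    corners, ids, _ = DETECTOR.detectMarkers(frame)
    if ids is None:
        return None
    centers = {int(i): c.reshape(4, 2).mean(axis=0)
               for i, c in zip(ids.flatten(), corners)}
    if left_id not in centers or right_id not in centers:
        return None
    px = np.linalg.norm(centers[left_id] - centers[right_id])
    return px * px_to_m + offset_m
```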
HD6. Kinematic-based data filtering. While the data collection process is robot-agnostic, we apply simple kinematic-based data filtering to select valid trajectories for different robot embodiments. Concretely, when the robot's base location and kinematics are known, the absolute end-effector pose recovered by SLAM allows kinematics- and dynamics-feasibility filtering on the demonstration data. Training on the filtered dataset ensures policies comply with embodiment-specific kinematic constraints.
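A simplified sketch of such a filter: here reachability is approximated by a spherical workspace around a known base and a Cartesian speed limit, whereas a full check would run the target embodiment's inverse kinematics on every pose; all limit values are placeholders:

```python
import numpy as np

def demo_is_feasible(ee_positions, timestamps, reach_m=0.85, max_speed=2.0):
    """ee_positions: (T, 3) EE positions in the robot base frame;
    timestamps: (T,) seconds. Both limits are illustrative placeholders."""
    if np.any(np.linalg.norm(ee_positions, axis=1) > reach_m):
        return False  # pose outside the (approximate) reachable workspace
    speeds = (np.linalg.norm(np.diff(ee_positions, axis=0), axis=1)
              / np.diff(timestamps))
    return bool(np.all(speeds <= max_speed))  # within dynamic limits

# Keep only demonstrations the target embodiment can actually execute:
# valid_demos = [d for d in demos if demo_is_feasible(*d)]
```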
Putting everything together. The UMI gripper weighs 780 g, with external dimensions of 310 mm (L) × 175 mm (W) × 210 mm (H) and a finger stroke of 80 mm. The 3D-printed gripper has a BoM cost of $73, while the GoPro camera and accessories total $298. As shown in Fig. 2, we can equip any robot arm with a compatible gripper and camera setup.
B. Policy Interface Design
Fig. 5: UMI Policy Interface Design. (b) UMI policy takes in a sequence of synchronized observations (RGB image, relative EE pose, and gripper width) and outputs a sequence of desired relative EE pose and gripper width as action. (a) We synchronize different observation streams with physically measured latencies. (c) We send action commands ahead of time to compensate for robots’ execution latency.
With the collected demonstration data, we can train a visuomotor policy that takes in a sequence of synchronized observations (RGB images, 6 degrees-of-freedom end-effector pose, and gripper width) and produces a sequence of actions (end-effector pose and gripper width) as shown in Fig. 5 (b). In this paper, we use Diffusion Policy [9] for all of our experiments, while other frameworks such as ACT [53] could potentially serve as a drop-in replacement. An important goal of UMI’s policy interface design is to ensure the interface is agnostic to underlying robotic hardware platforms such that the resulting policy, trained on one data source (i.e., hand-held gripper), could be directly deployed to different robot platforms. To do so, we aim to address the following two key challenges:
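For concreteness, a sketch of the policy interface's input/output structure; field names and shapes are illustrative, and the observation and action horizons are hyperparameters:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UmiObservation:          # a short history of T_obs synchronized steps
    rgb: np.ndarray            # (T_obs, H, W, 3) raw fisheye frames
    ee_pose: np.ndarray        # (T_obs, 6) relative EE pose (pos + rot-vec)
    gripper_width: np.ndarray  # (T_obs, 1) finger width in meters

@dataclass
class UmiAction:               # a chunk of T_act future steps
    ee_pose: np.ndarray        # (T_act, 6) desired relative EE poses
    gripper_width: np.ndarray  # (T_act, 1) desired finger widths
```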
• Hardware-specific latency. The latency of various hardware (streaming camera, robot controller, industrial gripper) is highly variable across system deployments, ranging from single-digit to hundreds of milliseconds. In contrast, all information streams captured by UMI grippers have zero latency with respect to the image observation, thanks to the GoPro's synchronized video, IMU measurements and the vision-based gripper width estimation.
• Embodiment-specific proprioception. Commonly used proprioception observations such as joint angles and EE pose are only well-defined with respect to a specific robot arm and robot base placement. In contrast, UMI needs to collect data across diverse environments and be generalizable to multiple robot embodiments.
In the following sections, we will describe three policy interface designs that address these challenges.
PD1. Inference-time latency matching. While UMI's policy interface assumes synchronized observation streams and immediate action execution, physical robot systems do not conform to this assumption. If not carefully handled, the timing mismatch between training and testing can cause large performance drops on dynamic manipulation tasks that require rapid movement and precise hand-eye coordination, as demonstrated in Sec. V-B. In this paper, we separately handle timing discrepancies on the observation and action sides:
PD1.1) Observation latency matching. On real robotic systems, different observation streams (RGB image, EE pose, gripper width) are captured by distributed micro-controllers, resulting in different observation latency.
For each observation stream, we individually measure its latency (for details, see §A1–A3). At inference time, we align all observations with respect to the stream with the highest latency (usually the camera). Specifically, we first temporally downsample the RGB camera observations to the desired frequency (often 10–20 Hz), and then use the capture timestamp of each image to linearly interpolate the gripper and robot proprioception streams. In bimanual systems, we soft-synchronize the two cameras by finding the nearest-neighbor frames, which can be off by at most half a frame interval. The result is a sequence of synchronized observations that conform to the UMI policy interface, shown in Fig. 5 (a).
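A minimal single-arm sketch of this alignment step, interpolating the proprioception streams to each image's capture timestamp (component-wise linear interpolation of the pose is a reasonable approximation at 10–20 Hz; a rotation slerp would be more exact):

```python
import numpy as np

def align_observations(img_ts, pose_ts, poses, grip_ts, widths):
    """img_ts: (T,) image capture times, already downsampled to policy rate;
    pose_ts/poses: (N,), (N, 6) EE pose stream; grip_ts/widths: gripper stream.
    Returns pose and width samples interpolated at the image timestamps."""
    pose_at_img = np.stack(
        [np.interp(img_ts, pose_ts, poses[:, d]) for d in range(poses.shape[1])],
        axis=1)
    width_at_img = np.interp(img_ts, grip_ts, widths)
    return pose_at_img, width_at_img
```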
PD1.2) Action latency matching. The UMI policy assumes the output is a sequence of synchronized EE poses and gripper widths. However, in practice, robot arms and grippers can only track the desired pose sequence up to an execution latency that varies across different robot hardware. To make sure the robots and grippers reach the desired pose at the desired time (given by the policy), we need to send commands ahead of time to compensate for execution latency, as shown in Fig. 5 (c). See §A4 for execution latency calibration details.
During execution, the UMI policy predicts the action sequence starting at the last step of observation. The first few predicted actions are immediately outdated due to the observation latency t_obs, the policy inference latency t_inf, and the execution latency t_act. We simply discard the outdated actions and only execute actions with a desired timestamp after t_act for each hardware.
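A sketch of this compensation logic under assumed names (the dispatch call is a placeholder for the real controller command): actions whose target timestamps can no longer be met are discarded, and the rest are sent t_act early so the hardware reaches each pose on schedule.

```python
import time

def dispatch(pose, width):
    """Placeholder for the robot/gripper controller command."""
    pass

def execute_action_chunk(actions, t_act):
    """actions: list of (target_time, pose, width) stamped by the policy,
    in monotonic-clock seconds; t_act: measured execution latency."""
    earliest = time.monotonic() + t_act         # first still-reachable target
    for target_time, pose, width in actions:
        if target_time < earliest:
            continue                            # outdated action: discard
        time.sleep(max(0.0, (target_time - t_act) - time.monotonic()))
        dispatch(pose, width)                   # command t_act ahead of time
```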
IV. EVALUATIONS
V. CAPABILITY EXPERIMENTS
VI. IN-THE-WILD GENERALIZATION EXPERIMENTS
VII. DATA COLLECTION THROUGHPUT AND ACCURACY
VIII. LIMITATIONS AND FUTURE WORKS
IX. CONCLUSION
ACKNOWLEDGMENTS
This work was supported in part by the Toyota Research Institute, NSF Award #2037101, and #2132519. We want to thank Google and TRI for the UR5 robots, and IRIS and IPRL lab for the Franka robot hardware. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. We would like to thank Andy Zeng, Pete Florence, Huy Ha, Yihuai Gao, Samir Gadre, Mandi Zhao, Mengda Xu, Alper Canberk, Kevin Zakka, Zeyi Liu, Dominik Bauer, Tony Zhao, Zipeng Fu and Lucy Shi for their thoughtful discussions. We thank Alex Alspach, Brandan Hathaway, Aimee Goncalves, Phoebe Horgan, and Jarod Wilson for their help on hardware design and prototyping. We thank Naveen Kuppuswamy, Dale McConachie, and Calder Phillips-Graffine for their help on low-level controllers. We thank John Lenard, Frank Michel, Charles Richter, and Xiang Li for their advice on SLAM. We thank Eric Dusel, Nwabisi C., and Letica Priebe Rocha for their help on the MoCap dataset collection. We thank Chen Wang, Zhou Xian, Moo Jin Kim, and Marion Lepert for their assistance with the Franka setup. We especially thank Steffen Urban for his open-source projects on GoPro SLAM and Camera-IMU calibration, and John @ 3D printing world for inspiration of the gripper mechanism.
REFERENCES
[1] Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. In Proceedings of Robotics: Science and Systems (RSS), 2022.
[2] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023.
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[4] Anais Brygo, Ioannis Sarakoglou, Nadia Garcia-Hernandez, and Nikolaos Tsagarakis. Humanoid robot teleoperation with vibrotactile based balancing feedback. In Haptics: Neuroscience, Devices, Modeling, and Applications: 9th International Conference, EuroHaptics 2014, Versailles, France, June 24-26, 2014, Proceedings, Part II 9, pages 266–275. Springer, 2014.
[5] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517, 2015. doi: 10.1109/ICAR.2015.7251504.
[6] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
[7] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO.2021.3075644.
[8] Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. In Proceedings of Robotics: Science and Systems (RSS), 2021.
[9] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[10] Kiran Doshi, Yijiang Huang, and Stelian Coros. On hand-held grippers and the morphological gap in human manipulation demonstration. arXiv preprint arXiv:2311.01832, 2023.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[12] Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, and Ranjay Krishna. Ar2-d2: Training a robot without a robot. 2023.
[13] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with crossdomain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022.
[14] Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Low-cost exoskeletons for learning whole-arm manipulation in the wild. arXiv preprint arXiv:2309.14975, 2023.
[15] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
[16] S. Garrido-Jurado, R. Muñoz-Salinas, F.J. Madrid-Cuevas, and M.J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014. ISSN 0031-3203. doi: 10.1016/j.patcog.2014.01.005. URL https://www.sciencedirect.com/science/article/pii/S0031320314000235.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[18] GoPro Inc. GPMF introduction: Parser for GPMF™ formatted telemetry data used within GoPro® cameras. https://gopro.github.io/gpmf-parser/. Accessed: 2023-01-31.
[19] Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (CoRL), volume 164, pages 991–1002. PMLR, 2022.
[20] Moo Jin Kim, Jiajun Wu, and Chelsea Finn. Giving robots a hand: Broadening generalization via handcentric human video demonstrations. In Deep Reinforcement Learning Workshop NeurIPS, 2022.
[21] Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations, 2023.
[22] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning (CoRL), volume 87, pages 879–893. PMLR, 2018.
[23] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 892–909. PMLR, 2022.
[24] Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, and David Held. Tax-pose: Task-specific cross-pose estimation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1783–1792. PMLR, 2023.
[25] Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS), 2022.
[26] Luka Peternel and Jan Babič. Learning of compliant human–robot interaction using full-body haptic interface. Advanced Robotics, 27(13):1003–1012, 2013.
[27] Pragathi Praveena, Guru Subramani, Bilge Mutlu, and Michael Gleicher. Characterizing input methods for human-to-robot demonstrations. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 344–353. IEEE, 2019.
[28] Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Springer, 2022.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[30] Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual review of control, robotics, and autonomous systems, 3:297–330, 2020.
[31] Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for taskagnostic offline reinforcement learning. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1838–1849. PMLR, 2023.
[32] Felipe Sanches, Geng Gao, Nathan Elangovan, Ricardo V Godoy, Jayden Chapman, Ke Wang, Patrick Jarvis, and Minas Liarokapis. Scalable, intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6318–6325. IEEE, 2023.
[33] Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020.
[34] Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. In Proceedings of the 2020 Conference on Robot Learning (CoRL), volume 155, pages 339–354. PMLR, 2021.
[35] Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), pages 1–8. IEEE, 2023.
[36] Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.
[37] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021.
[38] Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 654–665. PMLR, 2023.
[39] William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 405–424. PMLR, 2023.
[40] Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022.
[41] Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020.
[42] H.J. Terry Suh, Naveen Kuppuswamy, Tao Pang, Paul Mitiguy, Alex Alspach, and Russ Tedrake. SEED: Series elastic end effectors in 6d for visuotactile tool use. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4684–4691, 2022. doi: 10.1109/IROS47612.2022.9982092.
[43] Alexander Toedtheide, Xiao Chen, Hamid Sadeghian, Abdeldjallil Naceri, and Sami Haddadin. A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12624–12630. IEEE, 2023.
[44] Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 201–221. PMLR, 2023.
[45] Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, and Roberto Martín-Martín. Error-aware imitation learning from teleoperation data for mobile manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164, pages 1367–1378. PMLR, 2022.
[46] Philipp Wu, Fred Shentu, Xingyu Lin, and Pieter Abbeel. GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023.
[47] Keenan A Wyrobek, Eric H Berger, HF Machiel Van der Loos, and J Kenneth Salisbury. Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot. In 2008 IEEE International Conference on Robotics and Automation, pages 2165–2170. IEEE, 2008.
[48] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv:2203.06173, 2022.
[49] Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021.
[50] Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy. In Conference on Robot Learning (CoRL), volume 155, pages 1992–2005. PMLR, 2021.
[51] Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018.
[52] Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 801–808, 2016. doi: 10.1109/ICRA.2016.7487210.
[53] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[54] Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1199–1210. PMLR, 2023.
APPENDIX
Please check out our website (https://umi-gripper.github.io) for additional results and comparisons. In the appendix, we present additional details on latency measurement §A, data collection protocol §B, evaluation protocol §C, SLAM §D, policy implementation §E, and hardware implementation §F.