Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
https://umi-gripper.github.io
Abstract—We present Universal Manipulation Interface (UMI) – a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, the UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.
I. INTRODUCTION
How should we demonstrate complex manipulation skills for robots to learn from? Attempts in the field have approached this question primarily from two directions: collecting targeted in-the-lab robot datasets via teleoperation or leveraging unstructured in-the-wild human videos. Unfortunately, neither is sufficient, as teleoperation requires high setup costs for hardware and expert operators, while human videos exhibit a large embodiment gap to robots.
Recently, using sensorized hand-held grippers as a data collection interface [41, 50, 36] has emerged as a promising middle-ground alternative – simultaneously minimizing the embodiment gap while remaining intuitive and flexible. Despite their potential, these approaches still struggle to balance action diversity with transferability. While users can theoretically collect any actions with these hand-held devices, much of that data cannot be transferred to an effective robot policy. As a result, despite achieving impressive visual diversity across hundreds of environments, the collected actions are constrained to simple grasping [41] or quasi-static pick-and-place [50, 36], lacking action diversity.
What prevents action transfer in previous work? We identified a few subtle yet critical issues:
• Insufficient visual context: While using a wrist-mounted camera is key for aligning the observation space and enhancing device portability, it restricts the scene's visual coverage. The camera's proximity to the manipulated object often results in heavy occlusions, providing insufficient visual context for action planning.
• Action imprecision: Most hand-held devices rely on monocular structure-from-motion (SfM) to recover robot actions. However, such methods often struggle to recover precise global actions due to scale ambiguity, motion blur, or insufficient texture, which significantly restricts the precision of tasks for which the system can be employed.
• Latency discrepancies: During hand-held data collection, observation and action recording occur without latency. However, during inference, various latency sources, including sensor, inference, and execution latencies, arise within the system. Policies unaware of these latency discrepancies will encounter out-of-distribution input and in turn, generate out-of-sync actions. This issue is especially salient for fast and dynamic actions.
• Insufficient policy representation: Prior works often use simple policy representations (e.g., MLPs) with an action regression loss, limiting their capacity to capture the complex multimodal action distributions inherent in human data. Consequently, even with precisely recovered demonstrated actions and all discrepancies removed, the resulting policy could still struggle to fit the data accurately. This further hampers large-scale, distributed human data collection, as more demonstrators increase action multimodality.
In this paper, we address these issues with careful design of the demonstration and policy interface:
• First, we aim to identify the right physical interface for human demonstration that is intuitive while also able to capture all the information necessary for policy learning. Specifically, we use a fisheye lens to increase the field of view and visual context, and add side mirrors on the gripper to provide implicit stereo observation. Combined with the GoPro's built-in IMU sensor, this enables robust tracking under fast motion.
• Second, we explore the right policy interface (i.e., observation and action representations) that could make the policy hardware-agnostic and thereby enable effective skill transfer. Concretely, we employ inference-time latency matching to handle different sensor observation and execution latencies, use a relative trajectory as the action representation to remove the need for precise global actions, and finally, apply Diffusion Policy [9] to model multimodal action distributions.
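To make the relative-trajectory action representation concrete, here is a minimal sketch (an illustration under our own naming, not the authors' released code) of re-expressing absolute end-effector poses in the current end-effector frame, which removes any dependence on a global or robot-base coordinate frame:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_mat(pos, rotvec):
    """Position (3,) + axis-angle rotation (3,) -> 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R.from_rotvec(rotvec).as_matrix()
    T[:3, 3] = pos
    return T

def to_relative_trajectory(current_pose, future_poses):
    """Express each future EE pose in the current EE frame:
    T_rel = inv(T_current) @ T_future, removing global-frame dependence."""
    T_cur_inv = np.linalg.inv(pose_to_mat(*current_pose))
    return [T_cur_inv @ pose_to_mat(p, r) for p, r in future_poses]
```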
The final system, Universal Manipulation Interface (UMI), provides a practical and accessible framework to unlock new robot manipulation skills, allowing us to demonstrate any actions in any environment while maintaining high transferability from human demonstration to robot policy.
Fig. 1: Universal Manipulation Interface (UMI) is a portable, intuitive, low-cost data collection and policy learning framework. This framework allows us to transfer diverse human demonstrations to effective visuomotor policies. We showcase the framework on tasks that would be difficult with traditional teleoperation, such as dynamic, precise, bimanual and long-horizon tasks.
Fig. 2: UMI Demonstration Interface Design. Left: Hand-held grippers for data collection, with a GoPro as the only sensor and recording device. Middle: Image from the GoPro's 155° fisheye view. Note the physical side mirrors highlighted in green, which provide implicit stereo information. Right: The UMI-compatible robot gripper and camera setup makes the observation similar to the hand-held gripper view.
With just a wrist-mounted camera on the hand-held gripper (Fig. 2), we show that UMI is capable of achieving a wide range of manipulation tasks that involve dynamic, bimanual, precise and long-horizon actions by only changing the training data for each task (Fig. 1). Furthermore, when trained with diverse human demonstrations, the final policy exhibits zero-shot generalization to novel environments and objects, achieving a remarkable 70% success rate in out-of-distribution tests, a level of generalizability seldom observed in other behavior cloning frameworks. We open-source the hardware and software system at https://umi-gripper.github.io.
II. RELATED WORKS
A key enabler for any data-driven robotics system is the data itself. Here, we review a few typical data collection workflows in the context of robotic manipulation.
A. Teleoperated Robot Data
Imitation learning learns policies from expert demonstrations. Behavior cloning (BC), utilizing teleoperated robot demonstrations, stands out for its direct transferability. However, teleoperating real robots for data collection poses significant challenges. Previous approaches utilized interfaces such as the 3D SpaceMouse [9, 54], VR or AR controllers [35, 3, 13, 19, 31, 51, 12], smartphones [44, 45, 22], and haptic devices [38, 47, 43, 26, 4] for teleoperation. These methods are either very expensive or hard to use due to high latency and lack of user intuitiveness. While recent advancements in leader-follower (i.e., puppeteering) devices such as ALOHA [53, 15] and GELLO [46] offer promise with intuitive and low-cost interfaces, their reliance on real robots during data collection limits the type and number of environments the system can gain access to for "in-the-wild" data acquisition. Exoskeletons [14, 20] remove the dependence on real robots during data collection; however, they require fine-tuning using teleoperated real-robot data for deployment. Moreover, the resulting data and policies from the aforementioned devices are embodiment-specific, preventing reuse across different robots. In contrast, UMI eliminates the need for physical robots during data collection and offers a more portable interface for in-the-wild robot teaching, providing data and policies that are transferable to different robot embodiments (e.g., 6DoF or 7DoF robot arms).
B. Visual Demonstrations from Human Video
There is a distinct line of work dedicated to policy learning from in-the-wild video data (e.g., YouTube videos). The most common approach is to learn from diverse passive human demonstration videos. Utilizing passive human demonstrations, previous works learn task cost functions [37, 8, 1, 21], affordance functions [2], dense object descriptors [40, 24, 39], action correspondences [33, 28], and pre-trained visual representations [23, 48]. However, this approach encounters three major challenges. First, most video demonstrations lack explicit action information, which is crucial for learning generalizable policies. To infer action data from passive human video, previous works resort to hand pose detectors [44, 1, 38, 28], or combine human videos with in-domain teleoperated robot data to predict actions [33, 20, 34, 28]. Second, the evident embodiment gap between humans and robots hinders action transfer. Efforts to bridge the gap include learning human-to-robot action mappings with hand pose retargeting [38, 28] or extracting embodiment-agnostic keypoints [49]. Despite these attempts, the inherent embodiment differences still complicate policy transfer from human video to physical robots. Third, the observation gap induced by the embodiment gap introduces an inevitable mismatch between train-time and inference-time observation data, further hampering the transferability of the resulting policies, despite efforts to align demonstration observations with robot observations [20, 28]. In contrast, data collected with UMI exhibit a minimal embodiment gap in both action and observation spaces, enabled by precise manipulation action extraction via robust visual-inertial camera tracking and the shared fisheye wrist-mounted cameras during teaching and testing. Consequently, this enables in-the-wild zero-shot policy transfer for dynamic, bimanual, precise, and long-horizon manipulation tasks.
C. Hand-Held Grippers for Quasi-static Actions
Hand-held grippers [41, 50, 10, 32, 27, 25] minimize observation embodiment gaps in manipulation data collection, offering portability and intuitive interfaces for efficient data collection in the wild. However, accurately and robustly extracting the 6DoF end-effector (EE) pose from these devices remains challenging, hindering the deployment of robot policies learned from these data on fine-grained manipulation tasks. Prior works attempted to address this issue through various approaches, such as SfM [50, 25], which suffers from scale ambiguity; RGB-D fusion [41], which requires expensive sensors and onboard compute; and external motion tracking [32, 27], which is limited to lab settings. These devices, constrained to quasi-static actions due to low EE tracking accuracy and robustness, often necessitate cumbersome onboard computers or external motion capture (MoCap) systems, diminishing their feasibility for in-the-wild data collection. In contrast, UMI integrates state-of-the-art SLAM [6] with built-in IMU data from the GoPro to accurately capture 6DoF actions at absolute metric scale. The high-accuracy data enables the trained BC policy to learn bimanual tasks. With thorough latency matching, UMI further enables real-world deployable policies for dynamic actions such as tossing.
Recently, Dobb-E [36] proposed a "reacher-grabber" tool mounted with an iPhone to collect single-arm demonstrations for the Stretch robot. Yet, Dobb-E only demonstrates policy deployment for quasi-static tasks and requires environment-specific policy fine-tuning. Conversely, using only data collected with UMI enables the trained policy to zero-shot generalize to novel in-the-wild environments, unseen objects, and multiple robot embodiments, for dynamic, bimanual, precise and long-horizon tasks.
III. METHOD
Universal Manipulation Interface (UMI) is a hand-held data collection and policy learning framework that allows direct transfer from in-the-wild human demonstrations to deployable robot policies. It is designed with the following goals in mind:
• Portable. The hand-held UMI grippers can be taken to any environment and start data collection with close-to-zero setup time.
• Capable. The ability to capture and transfer natural and complex human manipulation skills beyond pick-and-place.
• Sufficient. The collected data should contain sufficient information for learning effective robot policies and contain minimal embodiment-specific information that would prevent transfer.
• Reproducible. Researchers and enthusiasts should be able to consistently build UMI grippers and use the data to train their own robots, even with different robot arms.
The following sections describe how we enable the above goals through our hardware and policy interface design.
A. Demonstration Interface Design
UMI's data collection hardware takes the form of a trigger-activated, hand-held, 3D-printed parallel-jaw gripper with soft fingers, mounted with a GoPro camera as the only sensor and recording device (see HD1). For bimanual manipulation, UMI can be trivially extended with another gripper. The key research question we need to address here is:
How can we capture sufficient information for a wide variety of tasks with just a wrist-mounted camera?
Specifically, on the observation side, the device needs to capture sufficient visual context to infer actions HD2 and critical depth information HD3. On the action side, it needs to capture precise robot actions under fast human motion HD4 and subtle adjustments in gripping width HD5, and automatically check whether each demonstration is valid given the robot hardware kinematics HD6. The following sections describe how we achieve these goals.
HD1. Wrist-mounted cameras as input observation. We rely solely on wrist-mounted cameras, without the need for any external camera setups. When deploying UMI on a robot, we place GoPro cameras at the same location with respect to the same 3D-printed fingers as on the hand-held gripper. This design provides the following benefits:
1) Minimizing the observation embodiment gap. Thanks to our hardware design, the videos observed in wrist-mounted cameras are almost indistinguishable between human demonstrations and robot deployment, making the policy input less sensitive to embodiment.
2) Mechanical robustness. Because the camera is mechanically fixed relative to the fingers, mounting UMI on robots does not require camera-robot-world calibration. Hence, the system is much more robust to mechanical shocks, making it easy to deploy.
3) Portable hardware setup. Without the need for an external static camera or additional onboard compute, we largely simplify the data collection setup and make the whole system highly portable.
4) Camera motion for natural data diversification. A side benefit we observed from experiments is that when training with a moving camera, the policy learns to focus on task-relevant objects or regions instead of background structures (similar in effect to random cropping). As a result, the final policy naturally becomes more robust against distractors at inference time.
Avoiding the use of external static cameras also introduces additional challenges for downstream policy learning. For example, the policy now needs to handle non-stationary and partial observations. We mitigate these issues by leveraging a wide-FoV fisheye lens HD2 and robust visual tracking HD4, described in the following sections.
Fig. 3: Fisheye vs. Rectilinear. (a) UMI policies use the raw fisheye image as observation. (b) Rectifying a large 155° FoV image to the pinhole model severely stretches the peripheral view (outside the blue line), while compressing the most important information at the center into a small area (inside the red line).
HD2. Fisheye lens for visual context. We use a 155-degree fisheye lens attachment on the wrist-mounted GoPro camera, which provides sufficient visual context for a wide range of tasks, as shown in Fig. 2. As the policy input, we directly use raw fisheye images without undistortion, since the fisheye effect conveniently preserves resolution in the center while compressing information in the peripheral view. In contrast, a rectified pinhole image (Fig. 3 right) exhibits extreme distortions, making it unsuitable for learning due to the wide FoV. Beyond improving SLAM robustness with increased visual features and overlap [52], our quantitative evaluation (Sec. V-A) shows that the fisheye lens improves policy performance by providing the necessary visual context.
Fig. 4: UMI Side Mirrors. The ultra-wide-angle camera, coupled with strategically positioned mirrors, facilitates implicit stereo depth estimation. (a): The view through each mirror effectively creates two virtual cameras, whose poses are reflections of the main camera across the mirror planes. (b): Ketchup on the plate, occluded from the main camera view, is visible inside the right mirror, demonstrating that the mirrors simulate cameras with different optical centers. (c): We digitally reflect the content inside the mirrors for policy observation. Note that the orientation of the cup handle becomes consistent across all 3 views after reflection.
HD3. Side mirrors for implicit stereo. To mitigate the lack of direct depth perception from the monocular camera view, we placed a pair of physical mirrors in the cameras’ peripheral view which creates implicit stereo views all in the same image. As illustrated in Fig 4 (a), the images inside the mirrors are equivalent to what can be seen from additional cameras reflected along the mirror plane, without the additional cost and weight. To make use of these mirror views, we found that digitally reflecting the crop of the images in the mirrors, shown in Fig 4 (c), yields the best result for policy learning (Sec. V-A). Note that without digital reflection, the orientation of objects seen through side mirrors is the opposite of that in the main camera view.
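A minimal sketch of this digital reflection step, assuming the mirror regions can be approximated by fixed axis-aligned pixel boxes (the actual crop geometry depends on the gripper and lens, and is not specified here):

```python
import numpy as np

def reflect_mirror_crops(img: np.ndarray, mirror_boxes) -> np.ndarray:
    """Horizontally flip each mirror crop in place so that object
    orientations match the main camera view.
    img: (H, W, 3) frame; mirror_boxes: iterable of (x0, y0, x1, y1)."""
    out = img.copy()
    for x0, y0, x1, y1 in mirror_boxes:
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1][:, ::-1]
    return out
```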
HD4. IMU-aware tracking. UMI captures rapid movements with absolute scale by leveraging the GoPro's built-in capability to record IMU data (accelerometer and gyroscope) into standard mp4 video files [18]. By jointly optimizing visual tracking and inertial pose constraints, our inertial-monocular SLAM system based on ORB-SLAM3 [7] maintains tracking for a short period of time even if visual tracking fails due to motion blur or a lack of visual features (e.g., looking down at a table). This allows UMI to capture and deploy highly dynamic actions such as tossing (shown in Fig. 7). In addition, the joint visual-inertial optimization allows direct recovery of the real metric scale, important for action precision and inter-gripper pose proprioception PD2.3: a critical ingredient for enabling bimanual policies.
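Because the camera is rigidly mounted relative to the fingers (HD1), converting the SLAM camera trajectory into end-effector poses reduces to applying one fixed extrinsic transform. A sketch under that assumption; the identity placeholder stands in for a real calibrated value:

```python
import numpy as np

# Placeholder extrinsic mapping the camera frame to the tool-center-point
# frame; a real value would come from the CAD model or a calibration step.
T_CAM_TO_TCP = np.eye(4)

def camera_traj_to_tcp_traj(world_from_camera_poses):
    """Each input is a 4x4 world-from-camera pose from visual-inertial SLAM;
    the output is the corresponding world-from-TCP trajectory."""
    return [T_wc @ T_CAM_TO_TCP for T_wc in world_from_camera_poses]
```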
HD5. Continuous gripper control. In contrast to the binary open-close action used in prior works [41, 44, 54], we found that commanding gripper width continuously significantly expands the range of tasks doable by parallel-jaw grippers. For example, the tossing task (Fig. 7) requires precise timing for releasing objects. Since objects have different widths, binary gripper actions are unlikely to meet the precision requirement. On the UMI gripper, finger width is continuously tracked via fiducial markers [16] (Fig. 2 left). Using the series-elastic end effector principle [42], UMI can implicitly record and control grasp forces by regulating the deformation of the soft fingers through continuous gripper width control.
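To illustrate the idea, the sketch below detects one ArUco marker per finger with OpenCV and maps the pixel distance between marker centers to a metric width; the marker IDs, dictionary, and linear calibration are assumptions for this example, not the paper's actual values:

```python
import cv2
import numpy as np

DICTIONARY = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
DETECTOR = cv2.aruco.ArucoDetector(DICTIONARY)  # OpenCV >= 4.7 detector API

def estimate_gripper_width(frame, left_id=0, right_id=1,
                           px_to_m=1e-4, offset_m=0.0):
    """Return the estimated finger width in meters, or None if the markers
    are not visible. px_to_m/offset_m form an assumed linear calibration,
    plausible because the markers sit at a fixed distance from the camera."""
    corners, ids, _ = DETECTOR.detectMarkers(frame)
    if ids is None:
        return None
    centers = {int(i): c.reshape(4, 2).mean(axis=0)
               for i, c in zip(ids.flatten(), corners)}
    if left_id not in centers or right_id not in centers:
        return None
    px = np.linalg.norm(centers[left_id] - centers[right_id])
    return px * px_to_m + offset_m
```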
HD6. Kinematic-based data filtering. While the data collection process is robot-agnostic, we apply simple kinematic-based data filtering to select valid trajectories for different robot embodiments. Concretely, when the robot's base location and kinematics are known, the absolute end-effector pose recovered by SLAM allows kinematics- and dynamics-feasibility filtering on the demonstration data. Training on the filtered dataset ensures policies comply with embodiment-specific kinematic constraints.
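A simplified sketch of such a filter: here reachability is approximated by a spherical workspace around a known base and a Cartesian speed limit, whereas a full check would run the target embodiment's inverse kinematics on every pose; all limit values are placeholders:

```python
import numpy as np

def demo_is_feasible(ee_positions, timestamps, reach_m=0.85, max_speed=2.0):
    """ee_positions: (T, 3) EE positions in the robot base frame;
    timestamps: (T,) seconds. Both limits are illustrative placeholders."""
    if np.any(np.linalg.norm(ee_positions, axis=1) > reach_m):
        return False  # pose outside the (approximate) reachable workspace
    speeds = (np.linalg.norm(np.diff(ee_positions, axis=0), axis=1)
              / np.diff(timestamps))
    return bool(np.all(speeds <= max_speed))  # within dynamic limits

# Keep only demonstrations the target embodiment can actually execute:
# valid_demos = [d for d in demos if demo_is_feasible(*d)]
```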
Putting everything together. The UMI gripper weighs 780 g, with external dimensions of 310 mm (L) × 175 mm (W) × 210 mm (H) and a finger stroke of 80 mm. The 3D-printed gripper has a BoM cost of $73, while the GoPro camera and accessories total $298. As shown in Fig. 2, we can equip any robot arm with a compatible gripper and camera setup.
B. Policy Interface Design
Fig. 5: UMI Policy Interface Design. (b) UMI policy takes in a sequence of synchronized observations (RGB image, relative EE pose, and gripper width) and outputs a sequence of desired relative EE pose and gripper width as action. (a) We synchronize different observation streams with physically measured latencies. (c) We send action commands ahead of time to compensate for robots’ execution latency.
With the collected demonstration data, we can train a visuomotor policy that takes in a sequence of synchronized observations (RGB images, 6 degrees-of-freedom end-effector pose, and gripper width) and produces a sequence of actions (end-effector pose and gripper width) as shown in Fig. 5 (b). In this paper, we use Diffusion Policy [9] for all of our experiments, while other frameworks such as ACT [53] could potentially serve as a drop-in replacement. An important goal of UMI’s policy interface design is to ensure the interface is agnostic to underlying robotic hardware platforms such that the resulting policy, trained on one data source (i.e., hand-held gripper), could be directly deployed to different robot platforms. To do so, we aim to address the following two key challenges:
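For concreteness, a sketch of the policy interface's input/output structure; field names and shapes are illustrative, and the observation and action horizons are hyperparameters:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UmiObservation:          # a short history of T_obs synchronized steps
    rgb: np.ndarray            # (T_obs, H, W, 3) raw fisheye frames
    ee_pose: np.ndarray        # (T_obs, 6) relative EE pose (pos + rot-vec)
    gripper_width: np.ndarray  # (T_obs, 1) finger width in meters

@dataclass
class UmiAction:               # a chunk of T_act future steps
    ee_pose: np.ndarray        # (T_act, 6) desired relative EE poses
    gripper_width: np.ndarray  # (T_act, 1) desired finger widths
```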
• Hardware-specific latency. The latency of various hardware (streaming camera, robot controller, industrial gripper) is highly variable across system deployments, ranging from single-digit to hundreds of milliseconds. In contrast, all information streams captured by UMI grippers have zero latency with respect to the image observation, thanks to the GoPro's synchronized video, IMU measurements and the vision-based gripper width estimation.
• Embodiment-specific proprioception. Commonly used proprioception observations such as joint angles and EE pose are only well-defined with respect to a specific robot arm and robot base placement. In contrast, UMI needs to collect data across diverse environments and be generalizable to multiple robot embodiments.
In the following sections, we will describe three policy interface designs that address these challenges.
PD1. Inference-time latency matching. While UMI's policy interface assumes synchronized observation streams and immediate action execution, physical robot systems do not conform to this assumption. If not carefully handled, the timing mismatch between training and testing can cause large performance drops on dynamic manipulation tasks that require rapid movement and precise hand-eye coordination, as demonstrated in Sec. V-B. In this paper, we separately handle timing discrepancies on the observation and action sides:
PD1.1) Observation latency matching. On real robotic systems, different observation streams (RGB image, EE pose, gripper width) are captured by distributed micro-controllers, resulting in different observation latency.
For each observation stream, we individually measure its latency (for details, see §A1–A3). At inference time, we align all observations with respect to the stream with the highest latency (usually the camera). Specifically, we first temporally downsample the RGB camera observations to the desired frequency (often 10–20 Hz), and then use the capture timestamp of each image to linearly interpolate the gripper and robot proprioception streams. In bimanual systems, we soft-synchronize the two cameras by finding the nearest-neighbor frames, which can be off by at most half a frame interval. The result is a sequence of synchronized observations that conform to the UMI policy interface, shown in Fig. 5 (a).
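A minimal single-arm sketch of this alignment step, interpolating the proprioception streams to each image's capture timestamp (component-wise linear interpolation of the pose is a reasonable approximation at 10–20 Hz; a rotation slerp would be more exact):

```python
import numpy as np

def align_observations(img_ts, pose_ts, poses, grip_ts, widths):
    """img_ts: (T,) image capture times, already downsampled to policy rate;
    pose_ts/poses: (N,), (N, 6) EE pose stream; grip_ts/widths: gripper stream.
    Returns pose and width samples interpolated at the image timestamps."""
    pose_at_img = np.stack(
        [np.interp(img_ts, pose_ts, poses[:, d]) for d in range(poses.shape[1])],
        axis=1)
    width_at_img = np.interp(img_ts, grip_ts, widths)
    return pose_at_img, width_at_img
```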
PD1.2) Action latency matching. The UMI policy assumes the output is a sequence of synchronized EE poses and gripper widths. However, in practice, robot arms and grippers can only track the desired pose sequence up to an execution latency that varies across different robot hardware. To make sure the robots and grippers reach the desired pose at the desired time (given by the policy), we need to send commands ahead of time to compensate for execution latency, as shown in Fig. 5 (c). See §A4 for execution latency calibration details.
During execution, the UMI policy predicts the action sequence starting at the last step of observation. The first few predicted actions are immediately outdated due to the observation latency t_obs, the policy inference latency t_inf, and the execution latency t_act. We simply discard the outdated actions and only execute actions with a desired timestamp after t_act for each hardware.
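A sketch of this compensation logic under assumed names (the dispatch call is a placeholder for the real controller command): actions whose target timestamps can no longer be met are discarded, and the rest are sent t_act early so the hardware reaches each pose on schedule.

```python
import time

def dispatch(pose, width):
    """Placeholder for the robot/gripper controller command."""
    pass

def execute_action_chunk(actions, t_act):
    """actions: list of (target_time, pose, width) stamped by the policy,
    in monotonic-clock seconds; t_act: measured execution latency."""
    earliest = time.monotonic() + t_act         # first still-reachable target
    for target_time, pose, width in actions:
        if target_time < earliest:
            continue                            # outdated action: discard
        time.sleep(max(0.0, (target_time - t_act) - time.monotonic()))
        dispatch(pose, width)                   # command t_act ahead of time
```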
IV. EVALUATIONS
V. CAPABILITY EXPERIMENTS
VI. IN-THE-WILD GENERALIZATION EXPERIMENTS
VII. DATA COLLECTION THROUGHPUT AND ACCURACY
VIII. LIMITATIONS AND FUTURE WORKS
IX. CONCLUSION
ACKNOWLEDGMENTS
This work was supported in part by the Toyota Research Institute, NSF Award #2037101, and #2132519. We want to thank Google and TRI for the UR5 robots, and IRIS and IPRL lab for the Franka robot hardware. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. We would like to thank Andy Zeng, Pete Florence, Huy Ha, Yihuai Gao, Samir Gadre, Mandi Zhao, Mengda Xu, Alper Canberk, Kevin Zakka, Zeyi Liu, Dominik Bauer, Tony Zhao, Zipeng Fu and Lucy Shi for their thoughtful discussions. We thank Alex Alspach, Brandan Hathaway, Aimee Goncalves, Phoebe Horgan, and Jarod Wilson for their help on hardware design and prototyping. We thank Naveen Kuppuswamy, Dale McConachie, and Calder Phillips-Graffine for their help on low-level controllers. We thank John Lenard, Frank Michel, Charles Richter, and Xiang Li for their advice on SLAM. We thank Eric Dusel, Nwabisi C., and Letica Priebe Rocha for their help on the MoCap dataset collection. We thank Chen Wang, Zhou Xian, Moo Jin Kim, and Marion Lepert for their assistance with the Franka setup. We especially thank Steffen Urban for his open-source projects on GoPro SLAM and Camera-IMU calibration, and John @ 3D printing world for inspiration of the gripper mechanism.
REFERENCES
[1] Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. In Proceedings of Robotics: Science and Systems (RSS), 2022.
[2] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023.
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[4] Anais Brygo, Ioannis Sarakoglou, Nadia Garcia-Hernandez, and Nikolaos Tsagarakis. Humanoid robot teleoperation with vibrotactile based balancing feedback. In Haptics: Neuroscience, Devices, Modeling, and Applications: 9th International Conference, EuroHaptics 2014, Versailles, France, June 24-26, 2014, Proceedings, Part II 9, pages 266–275. Springer, 2014.
[5] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517, 2015. doi: 10.1109/ICAR.2015.7251504.
[6] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
[7] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO.2021.3075644.
[8] Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. In Proceedings of Robotics: Science and Systems (RSS), 2021.
[9] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[10] Kiran Doshi, Yijiang Huang, and Stelian Coros. On hand-held grippers and the morphological gap in human manipulation demonstration. arXiv preprint arXiv:2311.01832, 2023.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[12] Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, and Ranjay Krishna. Ar2-d2: Training a robot without a robot. 2023.
[13] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with crossdomain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022.
[14] Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Low-cost exoskeletons for learning whole-arm manipulation in the wild. arXiv preprint arXiv:2309.14975, 2023.
[15] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
[16] S. Garrido-Jurado, R. Muñoz-Salinas, F.J. Madrid-Cuevas, and M.J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014. ISSN 0031-3203. doi: 10.1016/j.patcog.2014.01.005. URL https://www.sciencedirect.com/science/article/pii/S0031320314000235.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[18] GoPro Inc. GPMF introduction: Parser for GPMF™ formatted telemetry data used within GoPro® cameras. https://gopro.github.io/gpmf-parser/. Accessed: 2023-01-31.
[19] Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (CoRL), volume 164, pages 991–1002. PMLR, 2022.
[20] Moo Jin Kim, Jiajun Wu, and Chelsea Finn. Giving robots a hand: Broadening generalization via handcentric human video demonstrations. In Deep Reinforcement Learning Workshop NeurIPS, 2022.
[21] Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations, 2023.
[22] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning (CoRL), volume 87, pages 879–893. PMLR, 2018.
[23] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 892–909. PMLR, 2022.
[24] Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, and David Held. Tax-pose: Task-specific cross-pose estimation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1783–1792. PMLR, 2023.
[25] Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS), 2022.
[26] Luka Peternel and Jan Babič. Learning of compliant human–robot interaction using full-body haptic interface. Advanced Robotics, 27(13):1003–1012, 2013.
[27] Pragathi Praveena, Guru Subramani, Bilge Mutlu, and Michael Gleicher. Characterizing input methods for human-to-robot demonstrations. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 344–353. IEEE, 2019.
[28] Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Springer, 2022.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[30] Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual review of control, robotics, and autonomous systems, 3:297–330, 2020.
[31] Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for taskagnostic offline reinforcement learning. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1838–1849. PMLR, 2023.
[32] Felipe Sanches, Geng Gao, Nathan Elangovan, Ricardo V Godoy, Jayden Chapman, Ke Wang, Patrick Jarvis, and Minas Liarokapis. Scalable, intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6318–6325. IEEE, 2023.
[33] Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020.
[34] Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. In Proceedings of the 2020 Conference on Robot Learning (CoRL), volume 155, pages 339–354. PMLR, 2021.
[35] Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), pages 1–8. IEEE, 2023.
[36] Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.
[37] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021.
[38] Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 654–665. PMLR, 2023.
[39] William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 405–424. PMLR, 2023.
[40] Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022.
[41] Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020.
[42] H.J. Terry Suh, Naveen Kuppuswamy, Tao Pang, Paul Mitiguy, Alex Alspach, and Russ Tedrake. SEED: Series elastic end effectors in 6d for visuotactile tool use. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4684–4691, 2022. doi: 10.1109/IROS47612.2022.9982092.
[43] Alexander Toedtheide, Xiao Chen, Hamid Sadeghian, Abdeldjallil Naceri, and Sami Haddadin. A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12624–12630. IEEE, 2023.
[44] Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 201–221. PMLR, 2023.
[45] Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, and Roberto Martín-Martín. Error-aware imitation learning from teleoperation data for mobile manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164, pages 1367–1378. PMLR, 2022.
[46] Philipp Wu, Fred Shentu, Xingyu Lin, and Pieter Abbeel. GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023.
[47] Keenan A Wyrobek, Eric H Berger, HF Machiel Van der Loos, and J Kenneth Salisbury. Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot. In 2008 IEEE International Conference on Robotics and Automation, pages 2165–2170. IEEE, 2008.
[48] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv:2203.06173, 2022.
[49] Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021.
[50] Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy. In Conference on Robot Learning (CoRL), volume 155, pages 1992–2005. PMLR, 2021.
[51] Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018.
[52] Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 801–808, 2016. doi: 10.1109/ICRA.2016.7487210.
[53] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[54] Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1199–1210. PMLR, 2023.
APPENDIX
Please check out our website (https://umi-gripper.github.io) for additional results and comparisons. In the appendix, we present additional details on latency measurement §A, data collection protocol §B, evaluation protocol §C, SLAM §D, policy implementation §E, and hardware implementation §F.