ALOHA论文翻译：Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

YYGe

已于 2024-01-16 15:05:01 修改

阅读量3.4k

点赞数 26

分类专栏：机器人文章标签： markdown 斯坦福机器人家务机器人

于 2024-01-16 10:43:30 首次发布

本文链接：https://blog.csdn.net/weixin_43334869/article/details/135605693

版权

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

学习用低成本硬件进行精细双手操作

这是ALOHA 翻译，别搞混了。
Mobile ALOHA 论文翻译，请移步：Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

文章目录

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
学习用低成本硬件进行精细双手操作

在这里插入图片描述

图1：ALOHA：一种用于双手远程操作的低成本开源硬件系统。整个系统使用现成的机器人和3D打印组件，总成本不到20,000美元。左图：用户通过反向驱动领导机器人进行远程操作，从而使跟随机器人模仿运动。右图：ALOHA能够执行精准、接触丰富且动态的任务。我们展示了远程操作和学习技能的示例。

Abstract

摘要

Abstract—Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in highprecision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: tonyzhaozh.github.io/aloha
精细操纵任务，如穿线束扎带或插入电池，对于机器人而言常常具有挑战性，因为它们需要精确性、对接触力的仔细协调以及闭环视觉反馈。执行这些任务通常需要高端机器人、准确的传感器或精细的校准，这可能昂贵且难以设置。学习是否能让低成本和不精确的硬件执行这些精细操纵任务？我们提出了一个低成本系统，直接从真实演示中进行端到端模仿学习，这些演示是使用定制的远程操作界面收集的。然而，模仿学习在高精度领域中也面临着挑战：策略中的错误可能随时间而累积，人类演示可能是非静态的。为了解决这些问题，我们开发了一种简单而新颖的算法，称为Action Chunking with Transformers（ACT），它在动作序列上学习生成模型。ACT使机器人能够在真实世界中学习6项困难任务，例如打开半透明的调味品杯和插入电池，成功率达到80-90％，仅需10分钟的演示。项目网站：tonyzhaozh.github.io/aloha。

I. INTRODUCTION

I. 引言

Fine manipulation tasks involve precise, closed-loop feedback and require high degrees of hand-eye coordination to adjust and re-plan in response to changes in the environment. Examples of such manipulation tasks include opening the lid of a condiment cup or slotting a battery, which involve delicate operations such as pinching, prying, and tearing rather than broad-stroke motions such as picking and placing. Take opening the lid of a condiment cup in Figure 1 as an example, where the cup is initialized upright on the table: the right gripper needs to first tip it over, and nudge it into the opened left gripper. Then the left gripper closes gently and lifts the cup off the table. Next, one of the right fingers approaches the cup from below and pries the lid open. Each of these steps requires high precision, delicate hand-eye coordination, and rich contact. Millimeters of error would lead to task failure.
精细操纵任务涉及精确的闭环反馈，需要高度的手眼协调能力，以便根据环境的变化进行调整和重新规划。此类操纵任务的示例包括打开调味品杯的盖子或插入电池，涉及到精细的操作，如捏、撬和撕裂，而不是像拾取和放置这样的大范围运动。以图1中打开调味品杯盖子为例，杯子首先在桌子上直立初始化：右抓手需要先将其倾斜，然后轻推到已打开的左抓手中。接着，左抓手轻轻合上并将杯子从桌子上抬起。然后，右手指之一从下方接近杯子，撬开盖子。这些步骤都需要高度的精确性、精细的手眼协调和丰富的接触。毫米级的误差将导致任务失败。

Existing systems for fine manipulation use expensive robots and high-end sensors for precise state estimation [29, 60, 32, 41]. In this work, we seek to develop a low-cost system for fine manipulation that is, in contrast, accessible and reproducible. However, low-cost hardware is inevitably less precise than high-end platforms, making the sensing and planning challenge more pronounced. One promising direction to resolve this is to incorporate learning into the system. Humans also do not have industrial-grade proprioception [71], and yet we are able to perform delicate tasks by learning from closed-loop visual feedback and actively compensating for errors. In our system, we therefore train an end-to-end policy that directly maps RGB images from commodity web cameras to the actions. This pixel-to-action formulation is particularly suitable for fine manipulation, because fine manipulation often involves objects with complex physical properties, such that learning the manipulation policy is much simpler than modeling the whole environment. Take the condiment cup example: modeling the contact when nudging the cup, and also the deformation when prying open the lid involves complex physics on a large number of degrees of freedom. Designing a model accurate enough for planning would require significant research and task specific engineering efforts. In contrast, the policy of nudging and opening the cup is much simpler, since a closed-loop policy can react to different positions of the cup and lid rather than precisely anticipating how it will move in advance.
现有的精细操纵系统使用昂贵的机器人和高端传感器进行精确的状态估计[29, 60, 32, 41]。在这项工作中，我们致力于开发一种低成本的精细操纵系统，与之相反，该系统既易于获取又可复制。然而，低成本硬件不可避免地比高端平台不够精确，这使得感知和规划的挑战更加明显。解决这个问题的一个有前途的方向是将学习纳入系统。人类也没有工业级的本体感知[71]，但我们能够通过从闭环视觉反馈中学习并积极补偿错误来执行精细任务。因此，在我们的系统中，我们训练一个端到端策略，直接将来自通用网络摄像头的RGB图像映射到动作上。这种从像素到动作的表述特别适用于精细操纵，因为精细操纵通常涉及具有复杂物理特性的对象，学习操纵策略比对整个环境建模要简单得多。以调味品杯为例：在轻推杯子时建模接触，以及在撬开盖子时建模变形，涉及大量自由度上的复杂物理。设计一个足够准确的用于规划的模型将需要大量的研究和任务特定的工程工作。相比之下，轻推和打开杯子的策略要简单得多，因为闭环策略可以对杯子和盖子的不同位置做出反应，而不是精确地预测它们将如何提前移动。

Training an end-to-end policy, however, presents its own challenges. The performance of the policy depends heavily on the training data distribution, and in the case of fine manipulation, high-quality human demonstrations can provide tremendous value by allowing the system to learn from human dexterity. We thus build a low-cost yet dexterous teleoperation system for data collection, and a novel imitation learning algorithm that learns effectively from the demonstrations. We overview each component in the following two paragraphs.
然而，训练端到端策略也面临着自己的挑战。策略的性能在很大程度上取决于训练数据的分布，在精细操纵的情况下，高质量的人类演示可以通过让系统从人类的灵巧中学习提供巨大的价值。因此，我们建立了一个低成本但灵巧的远程操作系统进行数据收集，并采用一种新颖的模仿学习算法，能够有效地从演示中学习。我们将在接下来的两段中概述每个组件。

Teleoperation system. We devise a teleoperation setup with two sets of low-cost, off-the-shelf robot arms. They are approximately scaled versions of each other, and we use jointspace mapping for teleoperation. We augment this setup with 3D printed components for easier backdriving, leading to a highly capable teleoperation system within a $20k budget. We showcase its capabilities in Figure 1, including teleoperation of precise tasks such as threading a zip tie, dynamic tasks such as juggling a ping pong ball, and contact-rich tasks such as assembling the chain in the NIST board #2 [4].
远程操作系统。 我们设计了一个远程操作设置，其中包括两组低成本的现成机器人手臂。它们大致是彼此的缩小版本，我们使用关节空间映射进行远程操作。我们通过3D打印组件来增强这个设置，使其更容易进行反向驱动，从而在20000美元的预算内建立了一个非常强大的远程操作系统。我们在图1中展示了其能力，包括精确任务的远程操作，如穿过拉链扎带，动态任务，如抛接乒乓球，以及接触丰富的任务，如在NIST板 #2 中组装链条。

Imitation learning algorithm. Tasks that require precision and visual feedback present a significant challenge for imitation learning, even with high-quality demonstrations. Small errors in the predicted action can incur large differences in the state, exacerbating the “compounding error” problem of imitation learning [47, 64, 29]. To tackle this, we take inspiration from action chunking, a concept in psychology that describes how sequences of actions are grouped together as a chunk, and executed as one unit [35]. In our case, the policy predicts the target joint positions for the next k timesteps, rather than just one step at a time. This reduces the effective horizon of the task by k-fold, mitigating compounding errors. Predicting action sequences also helps tackle temporally correlated confounders [61], such as pauses in demonstrations that are hard to model with Markovian single-step policies. To further improve the smoothness of the policy, we propose temporal ensembling, which queries the policy more frequently and averages across the overlapping action chunks. We implement action chunking policy with Transformers [65], an architecture designed for sequence modeling, and train it as a conditional VAE (CVAE) [55, 33] to capture the variability in human data. We name our method Action Chunking with Transformers (ACT), and find that it significantly outperforms previous imitation learning algorithms on a range of simulated and real-world fine manipulation tasks.

模仿学习算法。 需要精确性和视觉反馈的任务即使在有高质量演示的情况下，也对模仿学习构成了重大挑战。在预测动作中出现的小错误可能导致状态出现较大差异，加剧了模仿学习中“复合误差”问题[47, 64, 29]。为了解决这个问题，我们受到了心理学中描述动作序列如何作为一个“块”分组并作为一个单元执行的“动作分块”概念的启发[35]。在我们的情况下，策略预测接下来k个时间步的目标关节位置，而不仅仅是一次预测一个步骤。这将任务的有效视野减少了k倍，有助于减轻复合错误。预测动作序列还有助于解决时间上相关的混淆因素[61]，例如演示中的暂停，这在使用马尔可夫单步策略难以建模。为了进一步提高策略的平滑性，我们提出了时间集成，它更频繁地查询策略并对重叠的动作块进行平均。我们使用为序列建模设计的Transformers [65]实现了动作分块策略，并将其训练为有条件的变分自编码器（CVAE）[55, 33]以捕捉人类数据的变异性。我们将我们的方法命名为Action Chunking with Transformers (ACT)，并发现在一系列模拟和真实世界的精细操纵任务中，它明显优于先前的模仿学习算法。

The key contribution of this paper is a low-cost system for learning fine manipulation, comprising a teleoperation system and a novel imitation learning algorithm. The teleoperation system, despite its low cost, enables tasks with high precision and rich contacts. The imitation learning algorithm, Action Chunking with Transformers (ACT), is capable of learning precise, close-loop behavior and drastically outperforms previous methods. The synergy between these two parts allows learning of 6 fine manipulation skills directly in the real-world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, from only 10 minutes or 50 demonstration trajectories.
本文的主要贡献是一种用于学习精细操纵的低成本系统，包括一个远程操作系统和一种新颖的模仿学习算法。尽管远程操作系统成本低廉，但它能够执行高精度和接触丰富的任务。模仿学习算法，即Action Chunking with Transformers (ACT)，能够学习精确的闭环行为，并在性能上远超过先前的方法。这两部分之间的协同作用使得能够直接在现实世界中学习6项精细操纵技能，例如成功率达到80-90%的打开半透明的调味品杯和插入电池，仅需10分钟或50次演示轨迹。

II. RELATED WORK

II. 相关工作

Imitation learning for robotic manipulation. Imitation learning allows a robot to directly learn from experts. Behavioral cloning (BC) [44] is one of the simplest imitation learning algorithms, casting imitation as supervised learning from observations to actions. Many works have then sought to improve BC, for example by incorporating history with various architectures [39, 49, 26, 7], using a different training objective [17, 42], and including regularization [46]. Other works emphasize the multi-task or few-shot aspect of imitation learning [14, 25, 11], leveraging language [51, 52, 26, 7], or exploiting the specific task structure [43, 68, 28, 52]. Scaling these imitation learning algorithms with more data has led to impressive systems that can generalize to new objects, instructions, or scenes [15, 26, 7, 32]. In this work, we focus on building an imitation learning system that is low-cost yet capable of performing delicate, fine manipulation tasks. We tackle this from both hardware and software, by building a high-performance teleoperation system, and a novel imitation learning algorithm that drastically improves previous methods on fine manipulation tasks.
机器人操纵的模仿学习。 模仿学习允许机器人直接从专家那里学习。行为克隆（BC）[44]是最简单的模仿学习算法之一，将模仿视为从观察到动作的监督学习过程。许多工作随后试图改进BC，例如通过使用不同的架构来纳入历史信息[39, 49, 26, 7]，采用不同的训练目标[17, 42]，以及引入正则化[46]。其他工作强调了模仿学习的多任务或少样本学习方面[14, 25, 11]，利用语言[51, 52, 26, 7]，或者利用特定任务结构[43, 68, 28, 52]。使用更多数据扩展这些模仿学习算法已经导致了令人印象深刻的系统，能够推广到新的对象、指令或场景[15, 26, 7, 32]。在这项工作中，我们专注于构建一个低成本的模仿学习系统，能够执行精细、精细的操纵任务。我们从硬件和软件两方面解决这个问题，通过构建一个高性能的远程操作系统，以及一种新颖的模仿学习算法，该算法在精细操纵任务上显著改进了先前的方法。

Addressing compounding errors. A major shortcoming of BC is compounding errors, where errors from previous timesteps accumulate and cause the robot to drift off of its training distribution, leading to hard-to-recover states [47, 64]. This problem is particularly prominent in the fine manipulation setting [29]. One way to mitigate compounding errors is to allow additional on-policy interactions and expert corrections, such as DAgger [47] and its variants [30, 40, 24]. However, expert annotation can be time-consuming and unnatural with a teleoperation interface [29]. One could also inject noise at demonstration collection time to obtain datasets with corrective behavior [36], but for fine manipulation, such noise injection can directly lead to task failure, reducing the dexterity of teleoperation system. To circumvent these issues, previous works generate synthetic correction data in an offline manner [16, 29, 70]. While they are limited to settings where lowdimensional states are available, or a specific type of task like grasping. Due to these limitations, we need to address the compounding error problem from a different angle, compatible with high-dimensional visual observations. We propose to reduce the effective horizon of tasks through action chunking, i.e., predicting an action sequence instead of a single action, and then ensemble across overlapping action chunks to produce trajectories that are both accurate and smooth.
解决复合误差。 BC的一个主要缺点是复合误差，即来自先前时间步的错误会累积并导致机器人偏离其训练分布，导致难以恢复的状态[47, 64]。在精细操纵的环境中，这个问题尤为突出[29]。减轻复合误差的一种方法是允许额外的在线交互和专家纠正，例如DAgger [47]及其变体[30, 40, 24]。然而，专家注释在使用远程操作界面时可能耗时且不自然[29]。也可以在演示收集时注入噪声以获得具有纠正行为的数据集[36]，但对于精细操纵而言，这种噪声注入可能直接导致任务失败，降低了远程操作系统的灵巧性。为了绕过这些问题，先前的工作以离线方式生成合成的纠正数据[16, 29, 70]。尽管它们局限于具有低维状态或特定类型任务（如抓取）的情境。由于这些限制，我们需要从不同的角度解决复合误差问题，使其兼容高维度的视觉观察。我们提出通过动作分块来减小任务的有效视野，即预测一个动作序列而不是单个动作，然后在重叠的动作块上进行集成，以产生既准确又平滑的轨迹。

Bimanual manipulation. Bimanual manipulation has a long history in robotics, and has gained popularity with the lowering of hardware costs. Early works tackle bimanual manipulation from a classical control perspective, with known environment dynamics [54, 48], but designing such models can be timeconsuming, and they may not be accurate for objects with complex physical properties. More recently, learning has been incorporated into bimanual systems, such as reinforcement learning [9, 10], imitating human demonstrations [34, 37, 59, 67, 32], or learning to predict key points that chain together motor primitives [20, 19, 50]. Some of the works also focus on fine-grained manipulation tasks such as knot untying, cloth flattening, or even threading a needle [19, 18, 31], while using robots that are considerably more expensive, e.g. the da Vinci surgical robot or ABB YuMi. Our work turns to low-cost hardware, e.g. arms that cost around $5k each, and seeks to enable them to perform high-precision, closed-loop tasks. Our teleoperation setup is most similar to Kim et al. [32], which also uses joint-space mapping between the leader and follower robots. Unlike this previous system, we do not make use of special encoders, sensors, or machined components. We build our system with only off-the-shelf robots and a handful of 3D printed parts, allowing non-experts to assemble it in less than 2 hours.
双手操纵。 在机器人学中，双手操纵有着悠久的历史，并随着硬件成本的降低而变得越来越受欢迎。早期的工作从经典控制的角度处理双手操纵，考虑已知环境动态[54, 48]，但设计这样的模型可能耗时，并且对于具有复杂物理性质的物体可能不准确。最近，学习已经被纳入到双手系统中，例如强化学习[9, 10]，模仿人类演示[34, 37, 59, 67, 32]，或学习预测将运动基元连接起来的关键点[20, 19, 50]。一些工作还专注于精细的操纵任务，如解开结、展平布料，甚至穿过针孔[19, 18, 31]，同时使用显著更昂贵的机器人，例如 da Vinci 外科机器人或 ABB YuMi。我们的工作转向低成本硬件，例如每只手臂成本约为5000美元，并试图使它们能够执行高精度、闭环任务。我们的远程操作设置与 Kim 等人的工作[32]最为相似，也使用了领导机器人和跟随机器人之间的关节空间映射。与先前的系统不同，我们没有使用特殊的编码器、传感器或加工部件。我们的系统仅使用现成的机器人和少量3D打印部件构建，使非专业人员能够在不到2小时内组装它。

III. ALOHA: A LOW-COST OPEN-SOURCE HARDWARE SYSTEM FOR BIMANUAL TELEOPERATION

III. ALOHA: 一个用于双手远程操作的低成本开源硬件系统

We seek to develop an accessible and high-performance teleoperation system for fine manipulation. We summarize our design considerations into the following 5 principles.
1) Low-cost: The entire system should be within budget for most robotic labs, comparable to a single industrial arm.
2) Versatile: It can be applied to a wide range of fine manipulation tasks with real-world objects.
3) User-friendly: The system should be intuitive, reliable, and easy to use.
4) Repairable: The setup can be easily repaired by researchers, when it inevitably breaks.
5) Easy-to-build: It can be quickly assembled by researchers, with easy-to-source materials.

我们致力于开发一个易于获取且高性能的用于精细操纵的远程操作系统。我们将我们的设计考虑总结为以下5个原则。
1)低成本： 整个系统应在大多数机器人实验室的预算范围内，与单个工业机械臂相当。
2) 多功能： 可以应用于各种真实世界对象的精细操纵任务。
3) 用户友好： 系统应直观、可靠且易于使用。
4) 可修复： 当系统不可避免地发生故障时，研究人员可以轻松修复设置。
5) 易建造： 可以由研究人员快速组装，使用易获得的材料。

When choosing the robot to use, principles 1, 4, and 5 lead us to build a bimanual parallel-jaw grippers setup with two ViperX 6-DoF robot arms [1, 66]. We do not employ de

最低0.47元/天解锁文章