Paper Translation: OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics

OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics



Fig. 1: OK-Robot is an Open Knowledge robotic system, which integrates a variety of learned models trained on publicly available data, to pick and drop objects in real-world environments. Using Open Knowledge models such as CLIP, Lang-SAM, AnyGrasp, and OWL-ViT, OK-Robot achieves a 58.5% success rate across 10 unseen, cluttered home environments, and 82.4% on cleaner, decluttered environments.


Abstract— Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile systems, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose applications of robotics still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state of the art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8× the performance of prior work. On cleaner, uncluttered environments, OK-Robot's performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules.

I. INTRODUCTION

Creating a general-purpose robot has been a longstanding dream of the robotics community. With the increase in data-driven approaches and large robot models, impressive progress is being made [1–4]. However, current systems are brittle, closed, and fail when encountering unseen scenarios. Even the largest robotics models can often only be deployed in previously seen environments [5, 6]. The brittleness of these systems is further exacerbated in settings where little robotic data is available, such as in unstructured home environments.

The poor generalization of robotic systems lies in stark contrast to large vision models [7–10], which show capabilities of semantic understanding [11–13], detection [7, 8], and connecting visual representations to language [9, 10, 14]. At the same time, base robotic skills for navigation [15], grasping [16–19], and rearrangement [20, 21] are fairly mature. Hence, it is perplexing that robotic systems that combine modern vision models with robot-specific primitives perform so poorly. To highlight the difficulty of this problem, the recent NeurIPS 2023 challenge for open-vocabulary mobile manipulation (OVMM) [22] registered a success rate of 33% for the winning solution [23].

So what makes open-vocabulary robotics so hard? Unfortunately, there isn't a single challenge that makes this problem difficult. Instead, inaccuracies in different components compound and together result in an overall drop in performance. For example, the quality of open-vocabulary retrievals of objects in homes depends on the quality of the query strings, navigation targets determined from VLMs may not be reachable by the robot, and the choice of grasping model may lead to large differences in grasping performance. Hence, making progress on this problem requires a careful and nuanced framework that integrates VLMs and robotics primitives while being flexible enough to incorporate newer models as they are developed by the VLM and robotics communities.

We present OK-Robot, an Open Knowledge Robot that integrates state-of-the-art VLMs with powerful robotics primitives for navigation and grasping to enable pick-and-drop. Here, Open Knowledge refers to learned models trained on large, publicly available datasets. When placed in a new home environment, OK-Robot is seeded with a scan taken from an iPhone. Given this scan, dense vision-language representations are computed using Lang-SAM [24] and CLIP [9] and stored in a semantic memory. Then, given a language query for an object that has to be picked, the language representation of the query is matched against the semantic memory. After this, navigation and picking primitives are applied sequentially to move to the desired object and pick it up. A similar process is carried out for dropping the object.
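
As an illustration of this retrieval step, below is a minimal sketch of matching a language query against a CLIP-based semantic memory. The SemanticMemory class, its entry format, and the localize method are illustrative assumptions rather than the actual OK-Robot interface.

```python
# A minimal sketch, not the OK-Robot implementation: match a text query against
# stored per-object CLIP image embeddings and return the 3D location of the
# best-scoring object. Assumes each memory entry holds a CLIP embedding of
# shape (D,) and an "xyz" location recovered from the posed scan.
import clip    # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torch

class SemanticMemory:
    def __init__(self, entries):
        feats = torch.stack([e["clip_embedding"] for e in entries])   # (N, D)
        self.features = feats / feats.norm(dim=-1, keepdim=True)
        self.locations = [e["xyz"] for e in entries]

    @torch.no_grad()
    def localize(self, query: str, model, device: str = "cpu"):
        """Return (xyz, score) for the memory entry most similar to the query."""
        tokens = clip.tokenize([query]).to(device)
        text_feat = model.encode_text(tokens).float()                 # (1, D)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = (self.features @ text_feat.T).squeeze(-1)            # cosine similarities
        best = scores.argmax().item()
        return self.locations[best], scores[best].item()

# Usage (illustrative):
#   model, _ = clip.load("ViT-B/32", device="cpu")
#   xyz, score = memory.localize("a blue coffee mug", model)
```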

To study OK-Robot, we tested it in 10 real-world home environments. Through our experiments, we found that in a never-before-seen, natural home environment, a zero-shot deployment of our system achieves 58.5% success on average. However, this success rate is largely dependent on the “naturalness” of the environment: as we show, by improving the queries, decluttering the space, and excluding objects that are clearly adversarial (too large, too translucent, or too slippery), this success rate reaches about 82.4%. Overall, through our experiments, we make the following observations:

• Pre-trained VLMs are highly effective for open-vocabulary navigation: Current open-vocabulary vision-language models such as CLIP [9] or OWL-ViT [8] offer strong performance in identifying arbitrary objects in the real world, and enable navigating to them in a zero-shot manner (see Section II-A).

• Pre-trained grasping models can be directly applied to mobile manipulation: Similar to VLMs, special-purpose robot models pre-trained on large amounts of data can be applied out of the box to open-vocabulary grasping in homes. These robot models do not require any additional training or fine-tuning (see Section II-B).

• How components are combined is crucial: Given the pre-trained models, we find that they can be combined with no training using a simple state-machine model. We also find that using heuristics to counteract the robot's physical limitations leads to a better success rate in the real world (see Section II-D).

• Several challenges still remain: While OK-Robot improves upon prior work given the immense challenge of operating zero-shot in arbitrary homes, our analysis of its failure modes shows that significant improvements can still be made to the VLMs, robot models, and robot morphology, which would directly increase the performance of open-knowledge manipulation agents (see Section III-D).

To encourage and support future work in open-knowledge robotics, we will share the code and modules for OK-Robot, and are committed to supporting reproduction of our results. More information, along with robot videos, is available on our project website: https://ok-robot.github.io.

II. TECHNICAL COMPONENTS AND METHOD

At a high level, our method solves the problem described by the query: “Pick up A (from B) and drop it on/in C”, where A is an object and B and C are places in a real-world environment such as a home. The system we introduce combines three primary subsystems on a Hello Robot Stretch: the open-vocabulary object navigation module, the open-vocabulary RGB-D grasping module, and the dropping heuristic. In this section, we describe each of these components in more detail.
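
To make the control flow concrete, the following is a minimal sketch of how such a pick-and-drop query could be sequenced as a simple state machine over the three subsystems. The robot and memory interfaces (navigate_to, pick, drop, localize) are hypothetical placeholders, not the actual OK-Robot code.

```python
# A minimal sketch of sequencing the three subsystems as a simple state machine.
# `robot.navigate_to`, `robot.pick`, `robot.drop`, and `memory.localize` are
# hypothetical stand-ins for the navigation, grasping, and dropping primitives
# and the semantic-memory lookup; they are not the actual OK-Robot API.
from enum import Enum, auto

class State(Enum):
    NAV_TO_OBJECT = auto()
    PICK = auto()
    NAV_TO_TARGET = auto()
    DROP = auto()
    DONE = auto()

def pick_and_drop(object_query: str, target_query: str, robot, memory):
    state = State.NAV_TO_OBJECT
    while state != State.DONE:
        if state == State.NAV_TO_OBJECT:
            xyz, _ = memory.localize(object_query, robot.clip_model)
            robot.navigate_to(xyz)             # open-vocabulary navigation module
            state = State.PICK
        elif state == State.PICK:
            robot.pick(object_query)           # open-vocabulary RGB-D grasping module
            state = State.NAV_TO_TARGET
        elif state == State.NAV_TO_TARGET:
            xyz, _ = memory.localize(target_query, robot.clip_model)
            robot.navigate_to(xyz)
            state = State.DROP
        elif state == State.DROP:
            robot.drop()                       # dropping heuristic
            state = State.DONE
```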

A. Open-home, open-vocabulary object navigation

The first component of our method is an open-home, open-vocabulary object navigation model that we use to map a home and subsequently navigate to any object of interest designated by a natural language query.

Scanning the home: For open vocabulary object navigation, we follow the approach from CLIP-Fields [27] and assume a pre-mapping phase where the home is “scanned” manually using an iPhone. This manual scan simply consists of taking a video of the home using the Record3D app on the iPhone, which results in a sequence of posed RGB-D images.


Alternatively, this could be done automatically using frontier-based exploration [15, 25, 26], but for speed and simplicity we prefer the manual approach [26, 27]. We take this approach because frontier-based methods tend to be slow and cumbersome, especially in a novel space, while our “scan” takes less than one minute per room. Once collected, the RGB-D images, along with the camera poses and positions, are exported to our library for map-building.

To ensure our semantic memory contains both the objects of interest as well as the navigable surface and any obstacles, the recording must capture the floor surface alongside the objects and receptacles in the environment.

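As a rough illustration of the map-building step, the sketch below fuses posed RGB-D frames into a single point cloud with Open3D, from which floor and obstacle maps can be derived. The frames format (rgb, depth, intrinsics, camera-to-world pose) is an assumed export layout, and this is an illustration rather than the actual OK-Robot map builder.

```python
# A minimal sketch of fusing posed RGB-D frames from the phone scan into a single
# point cloud. The `frames` structure (rgb, depth, intrinsics, pose) is an assumed
# export format; this is not the actual OK-Robot map-building code.
import numpy as np
import open3d as o3d

def build_point_cloud(frames, voxel_size=0.05):
    merged = o3d.geometry.PointCloud()
    for f in frames:
        color = o3d.geometry.Image(f["rgb"].astype(np.uint8))        # HxWx3
        depth = o3d.geometry.Image(f["depth"].astype(np.float32))    # HxW, in meters
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, depth_scale=1.0, depth_trunc=5.0,
            convert_rgb_to_intensity=False)
        h, w = f["depth"].shape
        fx, fy, cx, cy = f["intrinsics"]
        intrinsic = o3d.camera.PinholeCameraIntrinsic(w, h, fx, fy, cx, cy)
        # Open3D expects the world-to-camera transform as the extrinsic matrix.
        extrinsic = np.linalg.inv(f["pose"])          # pose: camera-to-world, 4x4
        merged += o3d.geometry.PointCloud.create_from_rgbd_image(
            rgbd, intrinsic, extrinsic)
    return merged.voxel_down_sample(voxel_size)       # de-duplicate dense points
```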

Detecting objects: On each frame of the scan, we run an open-vocabulary object detector. Unlike previous works, which used Detic [7], we chose OWL-ViT [8] as the object detector since we found it to perform better in preliminary queries. We apply the detector on every frame, extract each object's bounding box, CLIP embedding, and detector confidence, and pass them on to the object memory module of our navigation module.
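
For concreteness, below is a minimal sketch of running OWL-ViT on a single frame using the Hugging Face implementation. The checkpoint name, query list, and score threshold are illustrative choices, not necessarily those used by OK-Robot.

```python
# A minimal sketch of per-frame open-vocabulary detection with the Hugging Face
# OWL-ViT implementation. The query list and score threshold are illustrative;
# OK-Robot uses a much larger query set derived from ScanNet200.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def detect(image: Image.Image, queries, threshold=0.1):
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes)[0]
    # Each detection: a box in pixel XYXY coordinates, a confidence score, and
    # the index of the matched text query.
    return list(zip(results["boxes"].tolist(),
                    results["scores"].tolist(),
                    results["labels"].tolist()))

# Usage: detect(Image.open("frame.png"), ["a photo of a mug", "a photo of a sofa"])
```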

Building on top of previous work [27], we further refine the bounding boxes into object masks with Segment Anything (SAM) [28]. Note that, in many cases, open-vocabulary object detectors still require a set of natural language object queries that they try to detect. We supply a large set of such object queries, derived from the original Scannet200 labels [29], such that the detector captures most common objects in the scene.

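A minimal sketch of this box-to-mask refinement with the segment_anything package is shown below; the checkpoint path is a placeholder, and boxes are assumed to be in pixel XYXY format as produced by the detector above.

```python
# A minimal sketch of refining OWL-ViT boxes into instance masks with Segment
# Anything (SAM). The checkpoint path is a placeholder; boxes are assumed to be
# in pixel XYXY format, as returned by the detection sketch above.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

def boxes_to_masks(image_rgb: np.ndarray, boxes):
    """image_rgb: HxWx3 uint8 RGB frame; boxes: iterable of (x0, y0, x1, y1)."""
    predictor.set_image(image_rgb)
    masks = []
    for box in boxes:
        mask, _, _ = predictor.predict(
            box=np.asarray(box), multimask_output=False)
        masks.append(mask[0])          # (H, W) boolean mask for this box
    return masks
```

The resulting masks, together with the CLIP embeddings and detector confidences extracted above, are what populate the object memory used by the navigation module.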
