Robots Driven by Large Language Models (LLMs), from Fei-Fei Li's group. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Abstract:

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite this progress, most approaches still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck.

In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open set of instructions and an open set of objects. We first observe that LLMs excel at inferring affordances and constraints from free-form language instructions. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps that ground this knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories that are robust to dynamic perturbations.

We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language.

Project website: voxposer.github.io


Figure 1: VoxPoser extracts language-conditioned affordances and constraints from LLMs and grounds them into the perceptual space using VLMs via a code interface, without additional training of either component. The composed map is referred to as a 3D value map, which enables zero-shot synthesis of trajectories for a large variety of everyday manipulation tasks with an open set of instructions and an open set of objects.

1 Introduction

Language is a compressed medium through which humans distill and communicate their knowledge and experience of the world. Large language models (LLMs) have emerged as a promising approach to capture this abstraction, learning to represent the world through projection into language space [1–4]. While these models are believed to internalize generalizable knowledge in text form, it remains an open question how to use this generalizable knowledge to enable embodied agents to physically act in the real world.

We focus on the problem of grounding abstract language instructions (e.g., "set up the table") into robot actions [5]. Prior works have leveraged lexical analysis to parse instructions [6–8], while more recently language models have been used to decompose instructions into a textual sequence of steps [9–11]. However, to physically interact with the environment, existing approaches typically rely on a repertoire of manually designed or pre-trained motion primitives (i.e., skills) that can be invoked by an LLM or a planner. This reliance on individual skill acquisition is typically considered a major bottleneck of the system due to the lack of large-scale robotic data. The question then arises: how can we leverage the wealth of internal knowledge of LLMs at the fine-grained action level, without laborious data collection or manual design for every single primitive?

In addressing this challenge, we first note that it is infeasible for LLMs to directly output control actions in text, as such actions are typically driven by high-frequency control signals in high-dimensional spaces. However, we find that LLMs excel at inferring language-conditioned affordances and constraints, and by leveraging their code-writing capabilities, they can compose dense 3D voxel maps that ground them in the visual space by orchestrating perception calls (e.g., via CLIP [12] or open-vocabulary detectors [13–15]) and array operations (e.g., via NumPy [16]). For example, given the instruction "open the top drawer and watch out for the bottle", LLMs can be prompted to infer that: 1) the top drawer handle should be grasped, 2) the handle needs to be translated outwards, and 3) the robot should stay away from the bottle. While these are expressed in text, LLMs can generate Python code that invokes perception APIs to obtain spatial-geometric information of relevant objects or parts (e.g., "handle") and then manipulates the 3D voxels to prescribe reward or cost at relevant locations in observation space (e.g., the target location of the handle is assigned a high value while the surroundings of the bottle are assigned low values). Finally, the composed value maps can serve as objective functions for a motion planner to directly synthesize robot trajectories that achieve the given instruction, without requiring additional training data for each task or for the LLM. (The approach also bears resemblance and connections to potential field methods in path planning [17] and constrained optimization methods in manipulation planning [18].) Figure 1 shows an illustration and a subset of the tasks we consider.
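To make this concrete, below is a minimal sketch of the kind of program an LLM might generate for the example above, written against the environment APIs listed in Appendix A.1. The dictionary keys ('position', 'normal') and all numeric values are illustrative assumptions; the actual generated code varies with the prompt and the scene.

```python
# Hypothetical LLM-generated program for: "pull open the top drawer
# and watch out for the bottle". Runs inside the LMP sandbox where the
# Appendix A.1 APIs (detect, execute, ...) are exposed.
handle = detect('top drawer handle')[0]   # VLM-backed perception call
bottle = detect('bottle')[0]

# Affordance: the handle (the "entity of interest") should move ~30 cm
# outward along its mean normal; mark that target region with high value.
affordance_map = get_empty_affordance_map()
target_voxel = handle['position'] + cm2index(30, handle['normal'])
set_voxel_by_radius(affordance_map, target_voxel, radius_cm=3, value=1)

# Constraint: assign high avoidance cost around the bottle.
avoidance_map = get_empty_avoidance_map()
set_voxel_by_radius(avoidance_map, bottle['position'], radius_cm=15, value=1)

# The composed value maps become the planner's objective.
execute(movable=handle, affordance_map=affordance_map, avoidance_map=avoidance_map)
```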

We refer to this approach as VoxPoser, a formulation that extracts affordances and constraints from LLMs to compose voxel value maps in a 3D observation space, which in turn guide robot interactions with the environment. In particular, the method leverages LLMs to compose the key aspects of trajectory generation rather than attempting to train policies on robot data that are often limited in quantity or variety, effectively achieving zero-shot generalization to open-set instructions. By integrating it into a model-based planning framework, we demonstrate closed-loop executions via model predictive control (MPC) that are robust to external disturbances. We further demonstrate how VoxPoser can also efficiently learn a dynamics model for contact-rich tasks from limited online interactions.

Our contributions are summarized as follows:

• We propose VoxPoser, a method for extracting affordances and constraints from pre-trained language models for robotic manipulation, which requires no additional training and generalizes to open-set instructions.

• Using VoxPoser to represent task objectives, we show that the synthesized trajectories can be robustly executed via MPC on a large variety of manipulation tasks in both simulated and real-world environments.

• We demonstrate that VoxPoser can also benefit from a limited amount of online interactions by efficiently learning a dynamics model, e.g., learning to open a door with a lever handle within 3 minutes.

2 Related Work

Grounding Language Instructions

Language grounding has been studied extensively both in intelligent agents [19–22] and in robotics [23, 6, 24, 25, 5, 7, 26], where language can serve as a tool for compositional goal specification [5, 27–33], as a semantic anchor for training multi-modal representations [12, 34, 35], or as an intermediate substrate for planning and reasoning [36–38, 9, 10, 39, 40]. Prior works have studied interpreting language instructions with classical tools such as lexical analysis, formal logic, and graphical models [27, 7, 6, 26]. More recently, motivated by successes in offline domains [41–43, 1], end-to-end approaches have been applied to directly ground language instructions in robot interactions by learning from language-annotated data, spanning model learning [44], imitation learning [45, 46, 30, 47–54], and reinforcement learning [55–57]. Most related to our work, Sharma et al. [50] optimize an end-to-end cost predictor via supervised learning that maps language instructions to 2D cost maps, which are used to steer a motion planner to generate preferred collision-free trajectories. In contrast, we rely on the open-world knowledge of pre-trained language models and tackle the more challenging setting of 3D robotic manipulation.

Language Models in Robotics

Leveraging pre-trained language models for embodied applications is an active area of research, where a large body of work focuses on planning and reasoning with language models [9–11, 58, 31, 39, 59–72, 36, 73]. To allow language models to perceive the physical environment, textual descriptions of the scene [39, 11, 59] or perception APIs [74] can be provided, vision can be used during decoding [67], or observations can be taken directly as input by multi-modal language models. Beyond perception, to truly close the perception-action loop, an embodied language model must also know how to act, which is typically achieved through a library of pre-defined primitives. Liang et al. [74] showed that LLMs exhibit behavioral commonsense useful for low-level control. Despite these promising signs, hand-designed motion primitives are still required, and while LLMs are shown to be capable of composing sequential policy logic, it remains unclear whether such composition can happen at the spatial level. A related line of work has also explored using LLMs for reward specification in reward design [75] and for exploration in reinforcement learning [76–79], as well as for human preference learning [80]. In contrast, we focus on grounding the rewards generated by LLMs in the 3D observation space of the robot, which we argue is most useful for manipulation tasks.

Learning-Based Trajectory Optimization

Many works have explored leveraging learning-based approaches for trajectory optimization. While the literature is vast, they can be broadly categorized into those that learn the model [81–89] and those that learn the cost/reward or constraints [90–93, 50, 94], where the data typically come from in-domain interactions. To enable generalization in the wild, a parallel line of work has explored learning task specifications from large-scale offline data [95–97, 35, 34, 44, 98, 99, 54], particularly egocentric videos [100, 101], or leveraging pre-trained foundation models [102–104, 33, 105, 106]. The learned cost functions are then used by reinforcement learning [102, 99, 107], imitation learning [97, 96], or trajectory optimization [95, 35] to generate robot actions. In this work, we leverage LLMs for in-the-wild cost specification that requires no in-domain interaction data and generalizes better. Compared to prior works that also leverage foundation models, we ground the cost directly in the 3D observation space with real-time visual feedback, which makes VoxPoser amenable to robust closed-loop MPC.


Figure 2: Overview of VoxPoser. Given the RGB-D observation of the environment and a language instruction, LLMs generate code, which interacts with VLMs, to produce a sequence of 3D affordance maps and constraint maps (collectively referred to as value maps) grounded in the robot's observation space (a). The composed value maps then serve as objective functions for a motion planner to synthesize trajectories for robot manipulation (b). The entire process does not involve any additional training.

3 Method

We first provide the formulation of VoxPoser as an optimization problem (Section 3.1). We then describe how VoxPoser can be used as a general zero-shot framework to map language instructions to 3D value maps (Section 3.2). Next, we demonstrate how trajectories can be synthesized in closed loop for robot manipulation (Section 3.3). While zero-shot in nature, we also show how VoxPoser can learn from online interactions to efficiently solve contact-rich tasks (Section 3.4).

3.1 Problem Formulation

Consider a manipulation problem given as a free-form language instruction \mathcal{L} (e.g., "open the top drawer"). Generating robot trajectories according to \mathcal{L} can be difficult, however, because \mathcal{L} may be arbitrarily long-horizon or under-specified (i.e., requiring contextual understanding). Instead, we focus on individual phases (sub-tasks) ℓ_i of the problem, each of which distinctly specifies a manipulation task (e.g., "grasp the drawer handle", "pull open the drawer"), where the decomposition \mathcal{L} → (ℓ_1, ℓ_2, . . . , ℓ_n) is given by a high-level planner (e.g., an LLM or a search-based planner). The core problem investigated in this work is to generate a motion trajectory \mathcal{T}^r_i for the robot r and each manipulation phase described by instruction ℓ_i. We represent \mathcal{T}^r_i as a sequence of dense end-effector waypoints to be executed by an Operational Space Controller [108], where each waypoint consists of a desired 6-DoF end-effector pose, end-effector velocity, and gripper action. However, it is worth noting that other representations of trajectories, such as joint-space trajectories, can also be used. Given each sub-task ℓ_i, we formulate this as an optimization problem defined as follows:

\min_{\mathcal{T}^r_i} \; \mathcal{F}_{task}(\mathbf{T}_i, ℓ_i) + \mathcal{F}_{control}(\mathcal{T}^r_i) \quad \text{subject to} \quad \mathcal{C}(\mathbf{T}_i)    (1)

where \mathbf{T}_i denotes the evolution of the environment state (with \mathcal{T}^r_i ⊆ \mathbf{T}_i being the robot trajectory), \mathcal{F}_{task} scores how well \mathbf{T}_i completes the instruction ℓ_i, \mathcal{F}_{control} specifies control costs (e.g., encouraging shorter or smoother trajectories), and \mathcal{C}(\mathbf{T}_i) denotes the dynamics and kinematics constraints.

3.2 Grounding Language Instructions via VoxPoser

Computing \mathcal{F}_{task} with respect to free-form language instructions is extremely challenging, not only because of the rich semantic space that language can convey but also because of the lack of robot data labeled with \mathbf{T} and ℓ. However, we make a key observation that a large number of tasks can be characterized by a voxel value map \mathbf{V} ∈ \mathbb{R}^{w \times h \times d} in the robot's observation space, which guides the motion of an "entity of interest" in the scene, such as the robot end-effector, an object, or an object part. For example, consider the task "open the top drawer" and its first sub-task "grasp the top drawer handle" (inferred by LLMs) in Figure 2. The "entity of interest" is the robot end-effector, and the voxel value map should reflect an attraction toward the drawer handle. By further commanding "watch out for the bottle", the map can also be updated to reflect a repulsion from the bottle. We denote the "entity of interest" as e and its trajectory as \mathcal{T}^e. Using this voxel value map, \mathcal{F}_{task} for a given instruction ℓ_i can be approximated by accumulating the values that e traverses through \mathbf{V}_i, formally computed as

\mathcal{F}_{task}(\mathbf{T}_i, ℓ_i) ≈ −\sum_{j=1}^{|\mathcal{T}^e_i|} \mathbf{V}_i(p^e_j)    (2)

where p^e_j ∈ ℕ^3 is the discretized (x, y, z) position of e at step j.
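Concretely, the accumulation in Equation 2 is a few lines of NumPy; a minimal sketch, assuming a dense (w, h, d) value map and a trajectory given as integer voxel coordinates:

```python
import numpy as np

def task_cost(value_map: np.ndarray, traj_voxels: np.ndarray) -> float:
    """Approximate F_task (Eq. 2): the negated sum of voxel values
    visited by the entity of interest along its trajectory.

    value_map:   (w, h, d) array of values.
    traj_voxels: (T, 3) integer array of discretized (x, y, z) positions.
    """
    x, y, z = traj_voxels[:, 0], traj_voxels[:, 1], traj_voxels[:, 2]
    return -float(value_map[x, y, z].sum())
```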

Remarkably, we observe that large language models, having been pre-trained on Internet-scale data, exhibit the capability not only to identify the "entity of interest" but also to compose value maps that accurately reflect the task instruction by writing Python programs. Specifically, when an instruction is given as a comment in the code, LLMs can be prompted to:

1) invoke perception APIs (which call vision-language models (VLMs) such as open-vocabulary detectors [13–15]) to obtain spatial-geometric information of relevant objects,

2) generate NumPy operations to manipulate 3D arrays, and

3) prescribe precise values at relevant locations.

We term this approach VoxPoser. Concretely, we aim to obtain the voxel value map \mathbf{V}^t_i = VoxPoser(o^t, ℓ_i) by prompting an LLM and executing the generated code via a Python interpreter, where o^t is the RGB-D observation at time t and ℓ_i is the current instruction. Furthermore, because \mathbf{V} is typically sparse, we densify the voxel maps via smoothing operations, as they encourage the motion planner to optimize for smoother trajectories.
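The densification step can be as simple as a Gaussian blur over the sparse map; a minimal sketch using SciPy, where the kernel width is an illustrative choice rather than a value reported here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def densify(value_map: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Smooth a sparse voxel value map so values fall off gradually
    around the prescribed targets, encouraging smoother trajectories."""
    smoothed = gaussian_filter(value_map.astype(float), sigma=sigma)
    peak = smoothed.max()
    # Rescale so the peak value is preserved after blurring.
    return smoothed * (value_map.max() / peak) if peak > 0 else smoothed
```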

Additional trajectory parametrization. The above formulation of VoxPoser uses LLMs to compose V : ℕ^3 → \mathbb{R}, a map from discretized coordinates in voxel space to a real-valued "cost", which we can use to optimize a path consisting only of positional terms. To extend to SE(3) poses, we can also use LLMs to compose rotation maps \mathbf{V}_r : ℕ^3 → SO(3) over coordinates relevant to the task objective (e.g., "the end-effector should face the support normal of the handle"). Similarly, we further compose gripper maps \mathbf{V}_g : ℕ^3 → {0, 1} to control the opening/closing of the gripper and velocity maps \mathbf{V}_v : ℕ^3 → \mathbb{R} to specify target velocities. Note that while these additional trajectory parametrizations do not map to real-valued "costs", they can likewise be incorporated into the optimization procedure (Equation 1) to parametrize the trajectory.
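For instance, a rotation map and gripper map for a grasping sub-task could be composed along these lines; a hypothetical snippet reusing the Appendix A.1 helpers, with the dictionary keys, sign of the pointing direction, and radii again assumed for illustration:

```python
# Hypothetical LLM-generated snippet for "grasp the handle, with the
# end-effector facing the handle's support normal".
handle = detect('drawer handle')[0]

rotation_map = get_empty_rotation_map()        # (100, 100, 100, 4); defaults to current quaternion
target_quat = pointat2quat(-handle['normal'])  # point against the mean normal vector (assumed sign)
set_voxel_by_radius(rotation_map, handle['position'], radius_cm=10, value=target_quat)

gripper_map = get_empty_gripper_map()          # 1 = closed, 0 = open
set_voxel_by_radius(gripper_map, handle['position'], radius_cm=2, value=1)  # close at the handle
```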

3.3 Zero-Shot Trajectory Synthesis with VoxPoser

After obtaining the task cost \mathcal{F}_{task}, we can now approach the full problem defined in Equation 1 to plan a motion trajectory. We use simple zeroth-order optimization that randomly samples trajectories and scores them with the proposed objective. The optimization is further implemented in a model predictive control framework that iteratively replans the trajectory at every step using the current observation, which robustly executes the trajectories even under dynamic disturbances; either a learned or a physics-based model can be used. However, because VoxPoser effectively provides "dense rewards" in observation space and we can replan at every step, we surprisingly find that the overall system can already achieve a large variety of the manipulation tasks considered in this work even with simple heuristics-based models. Since some value maps are defined over the "entity of interest", which may not necessarily be the robot, we also use the dynamics model to find the robot trajectory needed to minimize the task cost (i.e., what interactions between the robot and the environment achieve the desired object motion). Because our approach is agnostic to the specific instantiation of motion planning, we defer an extended discussion of our implementation to Section 4.
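A minimal sketch of one iteration of this zeroth-order optimization, with the sampling distribution, horizon, and step size as illustrative choices (the actual instantiation is described in Section 4):

```python
import numpy as np

def plan_step(value_map, current_voxel, n_samples=1024, horizon=8, step=2.0, rng=None):
    """One MPC iteration: sample random waypoint sequences from the current
    position (in voxel units), score each by its accumulated value along
    the path (cf. Eq. 2), and return the first waypoint of the best one."""
    rng = rng or np.random.default_rng()
    dirs = rng.normal(size=(n_samples, horizon, 3))
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)     # unit directions
    paths = current_voxel + np.cumsum(dirs * step, axis=1)   # (n_samples, horizon, 3)

    # Accumulate the value of each candidate path (clip to stay inside the map).
    idx = np.clip(paths.round().astype(int), 0, np.array(value_map.shape) - 1)
    scores = value_map[idx[..., 0], idx[..., 1], idx[..., 2]].sum(axis=1)

    best = paths[scores.argmax()]
    return best[0]  # execute only the first waypoint, then replan with a new observation
```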

3.4 Efficient Dynamics Learning with Online Experiences

Figure 3: Visualization of composed 3D value maps and rollouts in real-world environments. The top row shows cases where the "entity of interest" is an object or part, where the value maps guide it toward target positions. The bottom two rows show tasks where the "entity of interest" is the robot end-effector. The bottom-most task involves two phases, which are also orchestrated by LLMs.

4 Experiments and Analysis

We first discuss the design choices in our implementation of VoxPoser in Section 4.1. Then we validate whether VoxPoser can perform everyday manipulation tasks directly in a real-world system in Section 4.2. We also present a detailed quantitative study in simulation comparing the generalization performance of VoxPoser against learned and LLM-based baselines in Section 4.3. We further demonstrate how VoxPoser can benefit from only a small amount of online experiences to learn a dynamics model for contact-rich tasks in Section 4.4. Finally, we study the sources of error in the full system and discuss how improvements can be made in Section 4.5.

4.1 Implementation of VoxPoser

Here we discuss our instantiation of VoxPoser, focusing on design choices shared between the simulated and real-world domains. More details of the environment setup for each domain can be found in the Appendix.

• LLMs and Prompting

We follow the prompting structure of Liang et al. [74], which recursively calls LLMs using their own generated code, where each language model program (LMP) is responsible for a unique functionality (e.g., processing perception calls). We use GPT-4 from the OpenAI API. Prompts can be found in the Appendix.


• VLMs and Perception

Given an object/part query from the LLM, we first invoke the open-vocabulary detector OWL-ViT [15] to obtain a bounding box, feed it into Segment Anything [109] to obtain a mask, and track the mask using the video tracker XMEM [110]. The tracked mask, together with the RGB-D observation, is used to reconstruct the point cloud of the object/part.

• Value Map Composition

We define the following types of value maps: affordance, avoidance, end-effector velocity, end-effector rotation, and gripper action. Each type uses a different LMP, which takes in an instruction and outputs a voxel map of shape (100, 100, 100, k), where k differs for each value map (e.g., k = 1 for affordance and avoidance as they specify cost, and k = 4 for rotation as it specifies SO(3)). We apply a Euclidean distance transform to affordance maps and a Gaussian filter to avoidance maps, as shown in the sketch below. On top of the value map LMPs, we define two high-level LMPs to orchestrate their behaviors: the planner takes a user instruction \mathcal{L} as input (e.g., "open the drawer") and outputs a sequence of sub-tasks ℓ_{1:N}, and the composer takes a sub-task ℓ_i and invokes the relevant value map LMPs with detailed language parametrization.
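The two post-processing steps above amount to standard array operations; a minimal sketch, assuming binary target/obstacle occupancy masks as input and one natural reading of the normalization:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def postprocess_affordance(target_mask: np.ndarray) -> np.ndarray:
    """Euclidean distance transform: values increase toward the target
    region, giving the planner a dense attraction gradient."""
    dist = distance_transform_edt(~target_mask.astype(bool))
    return 1.0 - dist / max(dist.max(), 1e-6)  # 1 at the target, decaying with distance

def postprocess_avoidance(obstacle_mask: np.ndarray) -> np.ndarray:
    """Gaussian filter: spread repulsive cost smoothly around obstacles."""
    cost = gaussian_filter(obstacle_mask.astype(float), sigma=3.0)
    return cost / cost.max() if cost.max() > 0 else cost
```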

• Motion Planner

We consider only the affordance and avoidance maps in the planner optimization, which uses greedy search to find a sequence of collision-free end-effector positions p_{1:N} ∈ \mathbb{R}^3. The remaining parametrizations (e.g., rotation maps, velocity maps) are then enforced at each p via the corresponding value maps. The cost map used by the motion planner is computed as the negative of the weighted sum of the normalized affordance and avoidance maps, with weights 2 and 1, respectively. After a 6-DoF trajectory is synthesized, the first waypoint is executed, after which a new trajectory is replanned at 5 Hz.
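The cost map combination reduces to a few array operations; a minimal sketch, where the sign convention (affordance attracts, avoidance repulses) is our reading of the weighting described above:

```python
import numpy as np

def combine_cost(affordance: np.ndarray, avoidance: np.ndarray) -> np.ndarray:
    """Planner cost map: negative weighted sum of the normalized maps
    (weights 2 and 1). Lower cost is better; greedy search descends it."""
    def normalize(m):
        span = m.max() - m.min()
        return (m - m.min()) / span if span > 0 else m
    return -(2.0 * normalize(affordance) - 1.0 * normalize(avoidance))
```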

• Environment Dynamics Models

For tasks where the specified "entity of interest" is the robot, we assume identity environment dynamics and replan at every step to account for the latest observation. For tasks where the "entity of interest" is an object, we study only a planar pushing model parametrized by a contact point, a push direction, and a push distance. The heuristics-based dynamics model translates the input point cloud along the push direction by the push distance. We use MPC with random shooting to optimize the action parameters, after which a pre-defined pushing primitive is executed according to them. However, we note that no primitive is needed when the action parameters are defined over the robot's end-effector or joint space, which would likely yield smoother trajectories at the cost of more time for the optimization.
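The heuristics-based pushing model is a rigid translation of the object point cloud; a minimal sketch of it together with the random-shooting loop, where the parameter ranges and the world-to-voxel mapping to_voxel are assumptions for illustration:

```python
import numpy as np

def heuristic_push(points: np.ndarray, direction: np.ndarray, distance: float) -> np.ndarray:
    """Heuristic dynamics: translate the (N, 3) point cloud along the
    unit push direction by the push distance."""
    return points + direction / np.linalg.norm(direction) * distance

def random_shooting(points, value_map, to_voxel, n_samples=256, rng=None):
    """Sample planar push parameters, roll each through the heuristic
    model, and keep the push whose predicted cloud lands on the highest
    values. A contact point is also sampled for the pushing primitive."""
    rng = rng or np.random.default_rng()
    best_score, best_action = -np.inf, None
    for _ in range(n_samples):
        contact = points[rng.integers(len(points))]
        theta = rng.uniform(0.0, 2.0 * np.pi)
        direction = np.array([np.cos(theta), np.sin(theta), 0.0])  # planar push
        distance = rng.uniform(0.02, 0.15)                         # meters (assumed range)
        vox = np.clip(to_voxel(heuristic_push(points, direction, distance)),
                      0, np.array(value_map.shape) - 1)
        score = value_map[vox[:, 0], vox[:, 1], vox[:, 2]].mean()
        if score > best_score:
            best_score, best_action = score, (contact, direction, distance)
    return best_action
```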

4.2 VoxPoser for Everyday Manipulation Tasks

We study whether VoxPoser can zero-shot synthesize robot trajectories to perform everyday manipulation tasks. We set up a real-world tabletop environment with a Franka Emika Panda robot; more details can be found in Appendix A.2. While the proposed method generalizes to an open set of instructions and an open set of objects, as shown in Figure 1, we pick 5 representative tasks for quantitative evaluation. We also show additional qualitative results in Figure 3, including environment rollouts and value map visualizations. As shown in Table 1, we find that VoxPoser can effectively synthesize robot trajectories for everyday manipulation tasks with a high average success rate. In particular, by leveraging the rich world knowledge in LLMs, we can extract language-conditioned affordances that apply to diverse scenes and objects. For example, LLMs can infer that a bottle can be opened by rotating counterclockwise around the z-axis; VoxPoser can then ground this knowledge in observation space to directly guide the motion planner to complete the task. We further compare against a variant of Code as Policies [74] that uses LLMs to parametrize a pre-defined list of simple primitives (e.g., move to pose, open gripper). We find that, compared to chaining sequential policy logic, the ability to compose value maps spatially while jointly optimizing under other constraints is a more flexible formulation that unlocks more manipulation tasks and leads to more robust execution. Notably, by leveraging the composed value maps in MPC, VoxPoser can effectively recover from external disturbances, such as moving targets/obstacles and a drawer being pulled open after the robot has closed it.

4.3 Generalization to Unseen Instructions and Attributes

Here we study the generalization capability of VoxPoser. To provide rigorous quantitative results, we set up a simulated environment that mirrors our real-world setup [111] but with a fixed list of objects (including a cabinet as well as 10 colored blocks and lines) and a fixed set of 13 templated instructions (e.g., "push [obj] to [pos]"), where [obj] and [pos] are attributes randomized over pre-defined lists. The instructions and attributes are divided into seen and unseen groups, where seen instructions/attributes may appear in the prompt (or in the training data for supervised baselines). We further group them into two categories: "object interactions" refers to tasks that require interacting with objects (rather than collision-free path planning), while "spatial composition" refers to tasks that require the robot to account for spatial constraints in the environment in its trajectory (e.g., moving at a slower speed near a particular object). For baselines, we ablate the two components of VoxPoser, the LLM and the motion planner, by comparing against a variant of [74] that pairs an LLM with primitives and a variant of [50] that trains a U-Net [112] to synthesize cost maps for motion planning. Table 2 shows the average success rate over 20 episodes per task. VoxPoser outperforms both baselines across both test categories, especially on unseen instructions or attributes. Compared to cost specification by a U-Net trained via supervised learning, LLMs generalize better by explicitly reasoning about language-conditioned affordances and constraints. On the other hand, grounding LLM knowledge in observation through value map composition, rather than directly specifying primitive parameters, provides consistent performance that generalizes beyond the examples given in the prompt.

4.4 Efficient Dynamics Learning with Online Experiences

Despite zero-shot generalization to unseen instructions, we also investigate how VoxPoser can benefit from online interactions for tasks that involve more challenging contact-rich dynamics, as many behavioral nuances may not be present in LLMs. To this end, we study a suite of simulated tasks that involve interacting with common articulated objects, including opening doors, fridges, and windows. We hypothesize that, although these tasks are challenging for autonomous agents due to difficult exploration, the trajectories zero-shot synthesized by VoxPoser provide useful hints for exploration (e.g., "to open a door, its handle first needs to be pressed down"). Specifically, we first use VoxPoser to synthesize k different trajectories, each represented as a sequence of end-effector waypoints. We then learn an MLP dynamics model that predicts o_{t+1} from o_t and a_t through an iterative procedure in which the agent alternates between data collection and model learning. The initially synthesized trajectories are used as a prior for the MPC action sampling distribution, where we add ε ∼ N(0, δ²) to each waypoint in \mathcal{T}^r_0 to encourage local exploration; a sketch follows below. Results are shown in Table 3. For these tasks involving complex interactions with articulated objects, we find that the trajectories zero-shot synthesized by VoxPoser are meaningful but insufficient. However, by using them as exploration priors, we can learn an effective dynamics model with less than 3 minutes of online interactions, yielding a high final success rate. In contrast, learning a dynamics model without this prior is extremely difficult, as most actions do not lead to meaningful environment changes; in all such cases, the experiments exceeded the maximum limit of 12 hours.
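A minimal sketch of the two ingredients described above, using scikit-learn's MLPRegressor as a stand-in for the MLP dynamics model; the architecture and noise scale are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def sample_exploration_traj(zero_shot_waypoints: np.ndarray, delta: float, rng=None):
    """Exploration prior: add eps ~ N(0, delta^2) to each waypoint of a
    zero-shot trajectory to encourage local exploration around it."""
    rng = rng or np.random.default_rng()
    return zero_shot_waypoints + rng.normal(0.0, delta, size=zero_shot_waypoints.shape)

def fit_dynamics(obs: np.ndarray, actions: np.ndarray, next_obs: np.ndarray):
    """One round of model learning: regress o_{t+1} from (o_t, a_t) pairs
    collected during exploration; alternates with data collection."""
    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
    model.fit(np.concatenate([obs, actions], axis=1), next_obs)
    return model
```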

4.5 Error Analysis

Since VoxPoser involves multiple components working together to synthesize trajectories for a large variety of manipulation tasks, here we analyze the errors produced by each component and discuss how the overall system can be further improved. We conduct the experiments in simulation, where we have access to ground-truth perception and dynamics models (i.e., the simulator). Results are shown in Figure 4. "U-Net + MP" [50] trains a U-Net [112, 113] that directly maps RGB-D observations to value maps, which are then used by a motion planner (MP), and thus has no standalone perception module; "specification error" for this baseline refers to errors by the U-Net, such as noisy predictions that are difficult to optimize. "LLM + Primitives" [74] uses an LLM to sequentially compose primitives and thus has no dynamics module. For this baseline and for VoxPoser, "specification error" refers to errors made by the LLM when composing policy logic or value maps. In comparison, although VoxPoser uses multiple components, by formulating them as a joint model-based optimization problem it achieves the lowest overall error, with its largest error source being the perception module. We also observe that a better dynamics model than the heuristics-based one, such as a learned model or a physics-based model, can further improve overall performance.

5 Emergent Behavioral Capabilities

Emergent capabilities refer to unpredictable phenomena that appear only in large models [114]. As VoxPoser uses pre-trained LLMs as its backbone, we observe similar embodied emergent capabilities driven by the rich world knowledge of LLMs. In particular, we focus our study on behavioral capabilities unique to VoxPoser. We observe the following:

1) Estimating physical properties: given two blocks of unknown mass, the robot is tasked to use the available tools to conduct physics experiments that determine which block is heavier;

2) Behavioral commonsense reasoning: while the robot is setting the table, the user can specify behavioral preferences such as "I am left-handed", which requires the robot to comprehend their meaning in the context of the task;

3) Fine-grained language correction: for tasks that require high precision, such as "cover the teapot with the lid", the user can give the robot precise instructions such as "you are off by 1 cm";

4) Multi-step visual programs [115, 116]: given a task such as "open the drawer precisely by half", for which there is insufficient information because no object model is available, VoxPoser can devise a multi-step manipulation strategy based on visual feedback: it first opens the drawer fully while recording the handle's displacement, then closes it back to the mid-point to satisfy the requirement.

6 Conclusion, Limitations, and Future Work

In this work, we presented VoxPoser, a general framework for robot manipulation that extracts affordances and constraints from LLMs, providing significant generalization advantages for open-set instructions and objects. In particular, we use code-writing LLMs to interact with VLMs to compose 3D value maps in observation space, which are used to synthesize trajectories for everyday manipulation tasks. Additionally, we showed how VoxPoser can benefit from online interactions by efficiently learning a dynamics model for contact-rich tasks.

VoxPoser has several limitations. First, it relies on external perception modules, which is limiting for tasks that require holistic visual reasoning or understanding of fine-grained object geometries. Second, while applicable to efficient dynamics learning, a general-purpose dynamics model is still required to achieve contact-rich behaviors at the same level of generalization. Third, our motion planner considers only end-effector trajectories, while whole-arm planning is also feasible and would likely be a better design choice [117–119]. Finally, manual prompt engineering is required for the LLMs.

We also see several exciting venues for future work. For example, recent successes in multi-modal LLMs [68, 2, 120] can be directly translated into VoxPoser for direct visual grounding. Methods developed for alignment [121, 122] and prompting [123–126] may also be used to improve the quality of the synthesized value maps and to alleviate the prompt engineering effort. Finally, while we use greedy search in trajectory optimization, more advanced optimization methods can be developed that best interface with the value maps synthesized by VoxPoser.

Acknowledgements

We would like to thank Andy Zeng, Igor Mordatch, and the members of the Stanford Vision and Learning Lab for fruitful discussions. This work was in part supported by AFOSR YIP FA9550-23-1-0127, ONR MURI N00014-22-1-2740, ONR MURI N00014-21-1-2801, the Stanford Institute for Human-Centered AI (HAI), JPMC, and Analog Devices. Wenlong Huang is partially supported by the Stanford School of Engineering Fellowship. Ruohan Zhang is partially supported by the Wu Tsai Human Performance Alliance Fellowship.

Appendix

A.1 APIs for VoxPoser

Central to VoxPoser is an LLM generating Python code that is executed by a Python interpreter. Besides exposing NumPy [16] and the Transforms3d library to the LLM, we provide the following environment APIs that LLMs can choose to invoke:

detect(obj_name): Takes in an object name and returns a list of dictionaries, where each dictionary corresponds to one instance of the matching object, containing center position, occupancy grid, and mean normal vector.

execute(movable,affordance_map,avoidance_map,rotation_map,velocity_map,gripper_map): Takes in an “entity of interest” as “movable” (a dictionary returned by detect) and (optionally) a list of value maps and invokes the motion planner to execute the trajectory. Note that in MPC settings, “movable” and the input value maps are functions that can be re-evaluated to reflect the latest environment observation.

cm2index(cm,direction): Takes in a desired offset distance in centimeters along direction and returns 3-dim vector reflecting displacement in voxel coordinates.

index2cm(index,direction): Inverse of cm2index. Takes in an integer “index” and a “direction” vector and returns the distance in centimeters in world coordinates displaced by the “integer” in voxel coordinates.

pointat2quat(vector): Takes in a desired pointing direction for the end-effector and returns a satisfying target quaternion.

set_voxel_by_radius(voxel_map,voxel_xyz,radius_cm,value): Assigns "value" to voxels within "radius_cm" from "voxel_xyz" in "voxel_map".

get_empty_affordance_map(): Returns a default affordance map initialized with 0, where a high value attracts the entity.

get_empty_avoidance_map(): Returns a default avoidance map initialized with 0, where a high value repulses the entity.

get_empty_rotation_map(): Returns a default rotation map initialized with the current end-effector quaternion.

get_empty_gripper_map(): Returns a default gripper map initialized with current gripper action, where 1 indicates “closed” and 0 indicates “open”.

get_empty_velocity_map(): Returns a default velocity map initialized with 1, where the number represents a scale factor (e.g., 0.5 for half of the default velocity).

reset_to_default_pose(): Resets the robot to its rest pose.
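As a concrete reference for how two of these helpers might be implemented, here is a minimal sketch assuming a 100x100x100 grid over a 1 m workspace; the resolution and implementation details are assumptions, as the actual code is not listed here:

```python
import numpy as np

VOXELS_PER_CM = 1.0  # assumed: 100 voxels spanning a 1 m workspace

def cm2index(cm: float, direction: np.ndarray) -> np.ndarray:
    """Convert an offset in centimeters along `direction` into a 3-dim
    displacement vector in voxel coordinates."""
    unit = direction / np.linalg.norm(direction)
    return np.round(unit * cm * VOXELS_PER_CM).astype(int)

def set_voxel_by_radius(voxel_map, voxel_xyz, radius_cm, value):
    """Assign `value` to all voxels within `radius_cm` of `voxel_xyz`.
    Works for scalar maps (w, h, d) and vector-valued maps (w, h, d, k)."""
    grid = np.indices(voxel_map.shape[:3])  # (3, w, h, d)
    dist = np.linalg.norm(grid - np.asarray(voxel_xyz).reshape(3, 1, 1, 1), axis=0)
    voxel_map[dist <= radius_cm * VOXELS_PER_CM] = value
    return voxel_map
```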

A.2 Real-World Environment Setup

We use a Franka Emika Panda robot in a tabletop setup. We use an Operational Space Controller with impedance control from Deoxys [127]. Two RGB-D cameras (Azure Kinect) are mounted at two opposite corners of the table: viewed from above, one at the bottom-right of the table and the other at the top-left. At the start of each experiment, both cameras begin recording and return real-time RGB-D observations at 20 Hz.

For each task, we evaluate each method under two settings: without disturbances and with disturbances. For the latter, we apply three kinds of disturbances to the environment, whose sequence is pre-selected at the start of each evaluation:

1) random forces applied to the robot,

2) random displacements of task-relevant and distractor objects, and

3) reversal of task progress (e.g., pulling the drawer open while the robot is closing it). We only apply the third disturbance to tasks in which the "entity of interest" is an object or an object part.

We compare against a variant of Code as Policies [74] as a baseline, which uses an LLM with low-level action primitives. The primitives include: move to position, rotate to quaternion, set velocity, open gripper, and close gripper. We do not provide primitives such as pick-and-place, as they would be tailored to a specific suite of tasks, which we do not restrict in our study (similar to the VoxPoser control APIs specified in Sec. A.1).

A.2.1 Tasks

Move&Avoid: “Move to the top of [obj1] while staying away from [obj2]”, where [obj1] and [obj2] are randomized everyday objects selected from the list: apple, banana, yellow bowl, headphones, mug, wood block.

Set Up Table: “Please set up the table by placing utensils for my pasta”.

Close Drawer: “Close the [deixis] drawer”, where [deixis] can be “top” or “bottom”.

Open Bottle: “Turn open the vitamin bottle”.

Sweep Trash: “Please sweep the paper trash into the blue dustpan”.

A.3 Simulated Environment Setup

We implement a tabletop manipulation environment in SAPIEN [111] with a Franka Emika Panda robot. The controller takes a desired 6-DoF end-effector pose as input, computes a sequence of interpolated waypoints via inverse kinematics, and follows the waypoints with a PD controller. Besides the robot, we use a set of 10 colored blocks and 10 colored lines, as well as an articulated cabinet with 3 drawers; their initialization depends on the specific task. The lines are used as visual landmarks and cannot be interacted with. To perceive the environment, a total of 4 RGB-D cameras are mounted, one at each side of the table, pointing at the center of the workspace.

A.3.1 Tasks

We create a custom suite of 13 tasks, shown in Table 4. Each task comes with a templated instruction (shown in Table 4) that may contain one or more attributes randomly selected from the pre-defined lists below. At reset time, a certain number of objects (depending on the specific task) are selected and randomly placed in the workspace, while ensuring that the task is not already complete at reset and that task completion is feasible. The full list of attributes is given below, divided into "seen" and "unseen" groups:

Seen Attributes:

• [pos]: [“back left corner of the table”, “front right corner of the table”, “right side of the table”, “back side of the table”]

• [obj]: [“blue block”, “green block”, “yellow block”, “pink block”, “brown block”]

• [preposition]: [“left of”, “front side of”, “top of”]

• [deixis]: [“topmost”, “second to the bottom”]

• [dist]: [3, 5, 7, 9, 11]

• [region]: [“right side of the table”, “back side of the table”]

• [velocity]: [“faster speed”, “a quarter of the speed”]

• [line]: [“blue line”, “green line”, “yellow line”, “pink line”, “brown line”]

Unseen Attributes:

• [pos]: [“back right corner of the table”, “front left corner of the table”, “left side of the table”, “front side of the table”]

• [obj]: [“red block”, “orange block”, “purple block”, “cyan block”, “gray block”]

• [preposition]: [“right of”, “back side of”]

• [deixis]: [“bottommost”, “second to the top”]

• [dist]: [4, 6, 8, 10]

• [region]: [“left side of the table”, “front side of the table”]

• [velocity]: [“slower speed”, “3x speed”]

• [line]: [“red line”, “orange line”, “purple line”, “cyan line”, “gray line”]

A.3.2 Full Results on Simulated Environments

A.4 Prompts

Prompts used in Sec. 4.2 and Sec. 4.3 can be found below.

planner: Takes in a user instruction \mathcal{L} and generates a sequence of sub-tasks ℓ_i, which is fed into "composer" (note that planner is not used in simulation, as the evaluated tasks consist of a single manipulation phase).

real-world: voxposer.github.io/prompts/real_planner_prompt.txt

composer: Takes in a sub-task instruction ℓ_i and invokes the necessary value map LMPs to compose affordance maps and constraint maps.

simulation: voxposer.github.io/prompts/sim_composer_prompt.txt
real-world: voxposer.github.io/prompts/real_composer_prompt.txt

parse_query_obj: Takes in a text query of an object/part name and returns a list of dictionaries, where each dictionary corresponds to one instance of the matching object, containing center position, occupancy grid, and mean normal vector.

simulation: voxposer.github.io/prompts/sim_parse_query_obj_prompt.txt
real-world: voxposer.github.io/prompts/real_parse_query_obj_prompt.txt

get_affordance_map: Takes in natural language parametrization from composer and returns a NumPy array for the task affordance map.

simulation: voxposer.github.io/prompts/sim_get_affordance_map_prompt.txt
real-world: voxposer.github.io/prompts/real_get_affordance_map_prompt.txt

get_avoidance_map: Takes in natural language parametrization from composer and returns a NumPy array for the task avoidance map.

simulation: voxposer.github.io/prompts/sim_get_avoidance_map_prompt.txt
real-world: voxposer.github.io/prompts/real_get_avoidance_map_prompt.txt

get_rotation_map: Takes in natural language parametrization from composer and returns a NumPy array for the end-effector rotation map.

simulation: voxposer.github.io/prompts/sim_get_rotation_map_prompt.txt
real-world: voxposer.github.io/prompts/real_get_rotation_map_prompt.txt

get_gripper_map: Takes in natural language parametrization from composer and returns a NumPy array for the gripper action map.

simulation: voxposer.github.io/prompts/sim_get_gripper_map_prompt.txt
real-world: voxposer.github.io/prompts/real_get_gripper_map_prompt.txt

get_velocity_map: Takes in natural language parametrization from composer and returns a NumPy array for the end-effector velocity map.

simulation: voxposer.github.io/prompts/sim_get_velocity_map_prompt.txt
real-world: voxposer.github.io/prompts/real_get_velocity_map_prompt.txt

References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[2] OpenAI. Gpt-4 technical report. arXiv, 2023.

[3] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

[4] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[5] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek. Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems, 2020.

[6] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pages 1507–1514, 2011.

[7] T. Kollar, S. Tellex, D. Roy, and N. Roy. Toward understanding natural language directions. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 259–266. IEEE, 2010.

[8] M. Bollini, S. Tellex, T. Thompson, N. Roy, and D. Rus. Interpreting and executing recipes with a cooking robot. In Experimental Robotics, pages 481–495. Springer, 2013.

[9] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning. PMLR, 2022.

[10] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, and M. Yan. Do as I can and not as I say: Grounding language in robotic affordances. In arXiv preprint arXiv:2204.01691, 2022.

[11] A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.

[12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[13] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.

[14] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.

[15] M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al. Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230, 2022.

[16] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, Sept. 2020. doi:10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.

[17] Y. K. Hwang, N. Ahuja, et al. A potential field approach to path planning. IEEE Transactions on Robotics and Automation, 8(1):23–32, 1992.

[18] M. Toussaint, J. Harris, J.-S. Ha, D. Driess, and W. Hönig. Sequence-of-constraints mpc: Reactive timing-optimal control of sequential manipulation. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13753–13760. IEEE, 2022.

[19] J. Andreas, D. Klein, and S. Levine. Learning with latent language. arXiv preprint arXiv:1711.00482, 2017.

[20] R. Zellers, A. Holtzman, M. Peters, R. Mottaghi, A. Kembhavi, A. Farhadi, and Y. Choi. Piglet: Language grounding through neuro-symbolic interaction in a 3d world. arXiv preprint arXiv:2106.00188, 2021.

[21] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 2021.

[22] V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi. Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483, 2020.

[23] T. Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, Massachusetts Institute of Technology, Project MAC, 1971.

[24] V. Blukis, R. A. Knepper, and Y. Artzi. Few-shot object grounding and mapping for natural language robot instruction following. arXiv preprint arXiv:2011.07384, 2020.

[25] S. Tellex, R. Knepper, A. Li, D. Rus, and N. Roy. Asking for help using inverse semantics. 2014.

[26] T. Kollar, S. Tellex, D. Roy, and N. Roy. Grounding verbs of motion in natural language commands to robots. In Experimental Robotics, pages 31–47. Springer, 2014.

[27] J. Thomason, S. Zhang, R. J. Mooney, and P. Stone. Learning to interpret natural language commands through human-robot dialog. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[28] J. Thomason, A. Padmakumar, J. Sinapov, N. Walker, Y. Jiang, H. Yedidsion, J. Hart, P. Stone, and R. Mooney. Jointly improving parsing and perception for natural language commands through human-robot dialog. Journal of Artificial Intelligence Research, 67:327–374, 2020.

[29] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2021.

[30] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

[31] D. Shah, B. Osinski, B. Ichter, and S. Levine. Lm-nav: Robotic navigation with large pretrained models of language, vision, and action. arXiv preprint arXiv:2207.04429, 2022.

[32] Y. Cui, S. Karamcheti, R. Palleti, N. Shivakumar, P. Liang, and D. Sadigh. “No, to the right”: online language corrections for robotic manipulation via shared autonomy. arXiv preprint arXiv:2301.02555, 2023.

[33] A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, B. Zitkovich, F. Xia, C. Finn, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023.

[34] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.

[35] Y. J. Ma, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language-image representations and rewards for robotic control. 2023.

[36] P. A. Jansen. Visually-grounded planning without vision: Language models infer detailed plans from high-level instructions. arXiv preprint arXiv:2009.14259, 2020.

[37] V. Micheli and F. Fleuret. Language models are few-shot butlers. arXiv preprint arXiv:2104.07972, 2021.

[38] P. Sharma, A. Torralba, and J. Andreas. Skill induction and planning with latent language. arXiv preprint arXiv:2110.01517, 2021.

[39] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter. Inner monologue: Embodied reasoning through planning with language models. In arXiv preprint arXiv:2207.05608, 2022.

[40] B. Z. Li, W. Chen, P. Sharma, and J. Andreas. Lampp: Language models as probabilistic priors for perception and action. arXiv e-prints, pages arXiv–2302, 2023.

[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[43] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[44] S. Nair, E. Mitchell, K. Chen, S. Savarese, C. Finn, et al. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In Conference on Robot Learning, pages 1303–1315. PMLR, 2022.

[45] M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022.

[46] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.

[47] S. Li, X. Puig, Y. Du, C. Wang, E. Akyurek, A. Torralba, J. Andreas, and I. Mordatch. Pretrained language models for interactive decision-making. arXiv preprint arXiv:2202.01771, 2022.

[48] O. Mees, L. Hermann, and W. Burgard. What matters in language conditioned robotic imitation learning. arXiv preprint arXiv:2204.06252, 2022.

[49] O. Mees, J. Borja-Diaz, and W. Burgard. Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911, 2022.

[50] P. Sharma, B. Sundaralingam, V. Blukis, C. Paxton, T. Hermans, A. Torralba, J. Andreas, and D. Fox. Correcting robot plans with natural language feedback. arXiv preprint arXiv:2204.05186, 2022.

[51] W. Liu, C. Paxton, T. Hermans, and D. Fox. Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects. In 2022 International Conference on Robotics and Automation (ICRA), pages 6322–6329. IEEE, 2022.

[52] C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648.

[53] C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022.

[54] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021.

[55] J. Luketina, N. Nardelli, G. Farquhar, J. N. Foerster, J. Andreas, E. Grefenstette, S. Whiteson, and T. Rocktäschel. A survey of reinforcement learning informed by natural language. In IJCAI, 2019.

[56] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. ArXiv, abs/1611.01796, 2017.

[57] Y. Jiang, S. S. Gu, K. P. Murphy, and C. Finn. Language as an abstraction for hierarchical deep reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

[58] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler. Open-vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022.

[59] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302, 2022.

[60] C. Huang, O. Mees, A. Zeng, and W. Burgard. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022.

[61] S. S. Raman, V. Cohen, E. Rosen, I. Idrees, D. Paulius, and S. Tellex. Planning with large language models via corrective re-prompting. arXiv preprint arXiv:2211.09935, 2022.

[62] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. arXiv preprint arXiv:2212.04088, 2022.

[63] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.

[64] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor. Chatgpt for robotics: Design principles and model abilities. 2023.

[65] K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023.

[66] Y. Ding, X. Zhang, C. Paxton, and S. Zhang. Task and motion planning with large language models for object rearrangement. arXiv preprint arXiv:2303.06247, 2023.

[67] W. Huang, F. Xia, D. Shah, D. Driess, A. Zeng, Y. Lu, P. Florence, I. Mordatch, S. Levine, K. Hausman, et al. Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855, 2023.

[68] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.

[69] H. Yuan, C. Zhang, H. Wang, F. Xie, P. Cai, H. Dong, and Z. Lu. Plan4mc: Skill reinforcement learning and planning for open-world minecraft tasks. arXiv preprint arXiv:2303.16563, 2023.

[70] Y. Xie, C. Yu, T. Zhu, J. Bai, Z. Gong, and H. Soh. Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128, 2023.

[71] Y. Lu, P. Lu, Z. Chen, W. Zhu, X. E. Wang, and W. Y. Wang. Multimodal procedural planning via dual text-image prompting. arXiv preprint arXiv:2305.01795, 2023.

[72] D. Patel, H. Eghbalzadeh, N. Kamra, M. L. Iuzzolino, U. Jain, and R. Desai. Pretrained language models as visual planners for human assistance. arXiv preprint arXiv:2304.09179, 2023.

[73] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[74] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022.

[75] M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh. Reward design with language models. arXiv preprint arXiv:2303.00001, 2023.

[76] A. Tam, N. Rabinowitz, A. Lampinen, N. A. Roy, S. Chan, D. Strouse, J. Wang, A. Banino, and F. Hill. Semantic exploration from language abstractions and pretrained representations. Advances in Neural Information Processing Systems, 35:25377–25389, 2022.

[77] J. Mu, V. Zhong, R. Raileanu, M. Jiang, N. Goodman, T. Rocktäschel, and E. Grefenstette. Improving intrinsic exploration with language abstractions. arXiv preprint arXiv:2202.08938, 2022.

[78] C. Colas, T. Karch, N. Lair, J.-M. Dussoux, C. Moulin-Frier, P. Dominey, and P.-Y. Oudeyer. Language as a cognitive tool to imagine goals in curiosity driven exploration. Advances in Neural Information Processing Systems, 33:3761–3774, 2020.

[79] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas. Guiding pretraining in reinforcement learning with large language models. arXiv preprint arXiv:2302.06692, 2023.

[80] H. Hu and D. Sadigh. Language instructed reinforcement learning for human-ai coordination. arXiv preprint arXiv:2304.07297, 2023.

[81] I. Lenz, R. A. Knepper, and A. Saxena. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems, volume 10. Rome, Italy, 2015.

[82] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger. Learning-based model predictive control: Toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems, 3:269–296, 2020.

[83] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

[84] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 29, 2016.

[85] Z. Xu, J. Wu, A. Zeng, J. B. Tenenbaum, and S. Song. Densephysnet: Learning dense physical object representations via multi-step dynamic interactions. arXiv preprint arXiv:1906.03853, 2019.

[86] A. Byravan and D. Fox. Se3-nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 173–180. IEEE, 2017.

[87] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar. Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pages 1101–1112. PMLR, 2020.

[88] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning, pages 4470–4479. PMLR, 2018.

[89] Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566, 2018.

[90] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58. PMLR, 2016.

[91] J. Fu, K. Luo, and S. Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

[92] D. Driess, O. Oguz, J.-S. Ha, and M. Toussaint. Deep visual heuristics: Learning feasibility of mixed-integer programs for manipulation planning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9563–9569. IEEE, 2020.

[93] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable mpc for end-to-end planning and control. Advances in Neural Information Processing Systems, 31, 2018.

[94] M. Mittal, D. Hoeller, F. Farshidian, M. Hutter, and A. Garg. Articulated object interaction in unknown scenes with whole-body mobile manipulation. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1647–1654. IEEE, 2022.

[95] S. Bahl, A. Gupta, and D. Pathak. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022.

[96] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023.

[97] C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023.

[98] H. Bharadhwaj, A. Gupta, S. Tulsiani, and V. Kumar. Zero-shot robot manipulation from passive human videos. arXiv preprint arXiv:2302.02011, 2023.

[99] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022.

[100] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018.

[101] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

[102] Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran. Can foundation models perform zero-shot task specification for robot manipulation? In Learning for Dynamics and Control Conference, pages 893–905. PMLR, 2022.

[103] T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang, J. Singh, C. Tan, J. Peralta, B. Ichter, et al. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023.

[104] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022.

[105] T. Xiao, H. Chan, P. Sermanet, A. Wahid, A. Brohan, K. Hausman, S. Levine, and J. Tompson. Robotic skill acquisition via instruction augmentation with vision-language models. arXiv preprint arXiv:2211.11736, 2022.

[106] C. Wang, D. Xu, and L. Fei-Fei. Generalizable task planning through representation pretraining. IEEE Robotics and Automation Letters, 7(3):8299–8306, 2022.

[107] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, M. Lingelbach, J. Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023.

[108] O. Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987.

[109] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

[110] H. K. Cheng and A. G. Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 640–658. Springer, 2022.

[111] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.

[112] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.

[113] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016.

[114] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.

[115] T. Gupta and A. Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.

[116] D. Surís, S. Menon, and C. Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.

[117] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996.

[118] N. D. Ratliff, J. Issac, D. Kappler, S. Birchfield, and D. Fox. Riemannian motion policies. arXiv preprint arXiv:1801.02854, 2018.

[119] T. Marcucci, M. Petersen, D. von Wrangel, and R. Tedrake. Motion planning around obstacles with convex optimization. arXiv preprint arXiv:2205.04422, 2022.

[120] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.

[121] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

[122] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.

[123] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.

[124] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.

[125] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.

[126] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.

[127] Y. Zhu, A. Joshi, P. Stone, and Y. Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. 6th Annual Conference on Robot Learning, 2022.
