Vision-Language Foundation Models as Effective Robot Imitators
Table of Contents

- ABSTRACT
- 1 INTRODUCTION
- 2 RELATED WORK
- 3 BACKGROUND
- 4 ROBOFLAMINGO
- 5 EXPERIMENTS
- 6 CONCLUSION AND FUTURE WORK
- ACKNOWLEDGEMENTS
- A ENVIRONMENTAL SETUPS
- B EXTENDED EXPERIMENTAL RESULTS
ABSTRACT
Recent progress in vision-language foundation models has shown their ability to understand multimodal data and resolve complicated vision-language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. RoboFlamingo can be trained or evaluated on a single GPU server, and we believe it has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy. Code and model weights are publicly available at roboflamingo.github.io.
1 INTRODUCTION
Recent progress in vision-language foundation models (VLMs) has demonstrated their exhilarating ability to model and align the representations of images and words, and their unlimited potential to resolve a wide range of downstream tasks with multi-modal data, for instance, visual question answering (Li et al., 2023; Zhou et al., 2022), image captioning (Zeng et al., 2022; Wang et al., 2022; Li et al., 2021), and human-agent interaction (Liu et al., 2022b; Oertel et al., 2020; Seaborn et al., 2021). These successes, undeniably, encourage people to imagine a generalist robot equipped with such vision-language comprehension ability to interact naturally with humans and perform complex manipulation tasks.
Therefore, we aim to explore integrating vision-language foundation models to serve as robot manipulation policies. While there have been some previous studies that incorporated large language models (LLMs) and vision-language models (VLMs) into robot systems as high-level planners (Ahn et al., 2022; Driess et al., 2023), making use of them directly for low-level control still poses challenges. Most VLMs are trained on static image-language pairs, whereas robotics tasks require video comprehension for closed-loop control. Additionally, VLM outputs primarily consist of language tokens, which significantly differ in representation compared to robot actions. A recent work (Brohan et al., 2023), namely Robotics Transformer 2 (RT-2), has demonstrated a possible solution for adapting VLMs to low-level robot control. However, democratizing such an expensive framework for all robotics practitioners proves difficult as it utilizes private models and necessitates co-fine-tuning on extensive vision-language data to fully showcase its effectiveness. Consequently, there is an urgent need for robot communities to have a low-cost alternative solution that effectively enables a robot manipulation policy with VLMs.
To this end, we introduce RoboFlamingo, a novel vision-language manipulation framework that leverages publicly accessible pre-trained VLMs to effectively construct manipulation policies for robotics. Specifically, RoboFlamingo is grounded upon the open-source VLM OpenFlamingo (Awadalla et al., 2023), and resolves these challenges by decoupling vision-language understanding from decision making. Unlike previous works, RoboFlamingo takes advantage of pre-trained VLMs mainly for understanding vision observations and language instructions at every decision step, models the historical features with an explicit policy head, and is fine-tuned solely on language-conditioned manipulation datasets using imitation learning. With such a decomposition, only a minimal amount of data is required to adapt the model to downstream manipulation tasks, and RoboFlamingo also offers flexibility for open-loop control and deployment on low-performance platforms. Moreover, benefiting from the pre-training on extensive vision-language tasks, RoboFlamingo achieves state-of-the-art performance by a large margin over previous works, and generalizes well to zero-shot settings and environments. It is worth noting that RoboFlamingo can be trained or evaluated on a single GPU server. As a result, we believe RoboFlamingo can be a cost-effective yet high-performance solution for robot manipulation, empowering everyone with the ability to fine-tune their own robots with VLMs.
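To make the decomposition concrete, here is a minimal PyTorch sketch of the general recipe: a pre-trained VLM encodes each (observation, instruction) pair into a per-step feature, an explicit recurrent policy head aggregates those features over the history, and the model is fine-tuned with a simple behavior-cloning objective. The class names, the LSTM head, the 6-DoF-plus-gripper action split, and all dimensions below are hypothetical stand-ins for illustration, not the released RoboFlamingo implementation.

```python
import torch
import torch.nn as nn

class RecurrentPolicyHead(nn.Module):
    """Explicit policy head: aggregates per-step VLM features over the history
    and predicts a continuous arm action plus a binary gripper action."""
    def __init__(self, feat_dim=1024, hidden_dim=512, arm_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm_head = nn.Linear(hidden_dim, arm_dim)   # relative end-effector pose
        self.gripper_head = nn.Linear(hidden_dim, 1)     # open/close logit

    def forward(self, feats, state=None):
        # feats: (batch, time, feat_dim) -- one VLM feature per decision step
        out, state = self.lstm(feats, state)
        return self.arm_head(out), self.gripper_head(out), state

def behavior_cloning_loss(arm_pred, grip_logit, arm_target, grip_target):
    """Imitation objective: MSE on continuous actions, BCE on the gripper."""
    mse = nn.functional.mse_loss(arm_pred, arm_target)
    bce = nn.functional.binary_cross_entropy_with_logits(grip_logit, grip_target)
    return mse + bce

# Toy usage with random stand-in features (in practice these would come from
# the lightly fine-tuned VLM backbone, one feature per decision step).
B, T, D = 2, 8, 1024
vlm_feats = torch.randn(B, T, D)
arm_gt = torch.randn(B, T, 6)
grip_gt = torch.randint(0, 2, (B, T, 1)).float()

policy = RecurrentPolicyHead()
arm_pred, grip_logit, _ = policy(vlm_feats)
loss = behavior_cloning_loss(arm_pred, grip_logit, arm_gt, grip_gt)
loss.backward()
print("BC loss:", loss.item())
```

Because only the lightweight head carries temporal state while the heavy VLM backbone processes each step independently, the head can be unrolled on its own, which is the flexibility for open-loop control and low-performance deployment that the decomposition refers to.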
Through extensive experiments, we demonstrate that RoboFlamingo outperforms existing methods by a clear margin. Specifically, we evaluate its performance using the Composing Actions from Language and Vision benchmark (CALVIN) (Mees et al., 2022b), a widely-recognized simulation benchmark for long-horizon language-conditioned tasks. Our findings indicate that RoboFlamingo is an effective and competitive alternative for adapting VLMs to robot control, achieving 2x performance improvements compared with the previous state-of-the-art method. Our comprehensive results also yield valuable insights into the use of pre-trained VLMs for robot manipulation tasks, offering potential directions for further research and development.
2 RELATED WORK
Language can be the most intuitive and pivotal interface for human-robot interaction, enabling non-expert humans to seamlessly convey their instructions to robots for achieving diverse tasks. Consequently, the realm of language-conditioned multi-task manipulation has garnered substantial attention in recent years. Intuitively, such tasks require robots to have a good understanding not only of the visual captures of the outside world, but also of the instructions represented by words. With the strong representation ability of pre-trained vision and language models, many previous works have incorporated pre-trained models into the learning framework. We roughly classify them into the following three categories, which are also compared illustratively in Fig. 1.
Figure 1: Comparison of RoboFlamingo with existing vision-language manipulation solutions.
Fine-tuning. While some early works such as Jang et al. (2022); Lynch & Sermanet (2020) trained a vision encoder and a language encoder to learn representations for the input language and vision data from manipulation tasks, some recent works directly take pre-trained models to obtain strong representations, then either train the policy model on top of them from scratch or fine-tune the whole model. For instance, Jiang et al. (2023) utilizes a pre-trained T5 (Raffel et al., 2020) model to encode multi-modal prompts, and learns actions by fine-tuning the T5 model and additionally training an object encoder and attention layers. HULC (Mees et al., 2022a) utilizes the vision encoder of Lynch & Sermanet (2020) trained on the CALVIN dataset (Mees et al., 2022b) together with pre-trained language encoders such as the sentence transformer (Reimers & Gurevych, 2019), and its successor HULC++ (Mees et al., 2023) also fine-tunes these encoders. Besides, Brohan et al. (2022) proposed RT-1, i.e., Robotics Transformer, a 35M-parameter vision-language-action (VLA) model that tokenizes the action and aligns vision, language, and action in the token space; it is trained on a large amount of real-world manipulation data, using the Universal Sentence Encoder (Cer et al., 2018) to obtain the language embedding and the pre-trained EfficientNet-B3 (Tan & Le, 2019) as the vision tokenizer.
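As a concrete illustration of the action-tokenization idea behind RT-1, the sketch below uniformly discretizes each continuous action dimension into a fixed number of bins so that actions can be represented as discrete tokens in the same space as language tokens. The bin count and value range are illustrative assumptions, not the actual values used by that model.

```python
import numpy as np

def tokenize_action(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to a discrete bin index so the
    action can be treated like a vocabulary token alongside language tokens."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)                    # in [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

def detokenize_action(bins, low=-1.0, high=1.0, num_bins=256):
    """Inverse map: bin index back to the center of its continuous interval."""
    return low + (bins + 0.5) / num_bins * (high - low)

# Example: a 7-dimensional action (6-DoF end-effector delta + gripper)
a = np.array([0.12, -0.4, 0.83, 0.0, -0.95, 0.3, 1.0])
tokens = tokenize_action(a)
recovered = detokenize_action(tokens)
print(tokens, recovered)
```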