VLA Agents: Translation and Commentary on "Magma: A Foundation Model for Multimodal AI Agents"

Overview: This paper introduces Magma, a new foundation model for multimodal AI agents, together with the accompanying SoM and ToM techniques. It effectively addresses the poor generalization and limited training data of existing VLA models and offers a new path toward building stronger, more general AI agents.

>> Background and pain points: Existing vision-language-action (VLA) models are usually trained for a specific task, such as UI navigation or robot manipulation, which leads to poor generalization and makes them hard to transfer across tasks and domains. A general multimodal AI agent must combine multimodal understanding (semantic, spatial, and temporal) with multimodal action prediction (decomposing long-horizon tasks into precise action sequences). In addition, available VLA data are limited in scale, which makes large-scale model training difficult.

>> Proposed solution: the Magma model with SoM/ToM. Magma is a multimodal foundation model that combines vision-language understanding with spatial-temporal reasoning to carry out agentic tasks ranging from UI navigation to robot manipulation. To endow Magma with these abilities, the paper introduces two auxiliary training tasks:

● Set-of-Mark (SoM): used for action grounding. Actionable visual objects in an image (e.g., clickable buttons in a GUI) are overlaid with marks, and the model learns to predict the locations of these marks.

● Trace-of-Mark (ToM): used for action planning. The motion traces of objects in a video (e.g., the trajectory of a human hand or a robot arm) are labeled, and the model learns to predict these traces.

Through SoM and ToM, image and video data are converted into vision-language-action data, which bridges the gap between multimodal understanding and action execution and makes effective use of large amounts of unlabeled video.
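
The following is a minimal, hypothetical sketch of how SoM-style supervision could be constructed: detected actionable regions (e.g., clickable buttons) are assigned numeric marks, and the training target asks the model to predict each mark's normalized center. The function name, coordinate normalization, and target string format are assumptions for illustration, not the paper's exact recipe.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def set_of_mark_targets(boxes: List[Box], width: int, height: int):
    """Assign numeric marks to actionable regions and build a text target."""
    marks = []
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx = round((x0 + x1) / 2 / width, 3)   # normalized center x in [0, 1]
        cy = round((y0 + y1) / 2 / height, 3)  # normalized center y in [0, 1]
        marks.append({"mark": i, "center": (cx, cy)})
    # Serialized supervision string the model is asked to predict.
    target = "; ".join(f"Mark {m['mark']} at {m['center']}" for m in marks)
    return marks, target


if __name__ == "__main__":
    # Two hypothetical clickable buttons detected on a 1280x800 UI screenshot.
    buttons = [(100.0, 20.0, 200.0, 52.0), (900.0, 700.0, 1100.0, 760.0)]
    _, target = set_of_mark_targets(buttons, width=1280, height=800)
    print(target)  # Mark 1 at (0.117, 0.045); Mark 2 at (0.781, 0.912)
```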

>> Core training pipeline: Magma is trained in the following steps:

● Data collection: gather a large-scale, heterogeneous dataset from UI interfaces, robot manipulation, instructional videos, and other sources.

● SoM/ToM generation: produce SoM and ToM annotations with image segmentation, object detection, or point-tracking models; for video data, camera motion must also be handled.

● Model pretraining: a shared vision encoder encodes images and videos, and the resulting tokens are fed together with text into a decoder-only LLM, which is trained to predict the SoM and ToM targets (a minimal sketch follows this list).

● Downstream fine-tuning: the pretrained Magma model is fine-tuned on different downstream tasks to improve performance.
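
To illustrate how the pretraining step above could be wired together, here is a minimal sketch (assuming PyTorch) in which a shared vision encoder turns an image into patch tokens, the tokens are concatenated with text embeddings, and a small causal Transformer stands in for the decoder-only LLM that predicts SoM/ToM tokens. The toy convolutional encoder, layer counts, and dimensions are illustrative placeholders, not Magma's actual architecture.

```python
import torch
import torch.nn as nn


class MagmaStyleSketch(nn.Module):
    """Toy stand-in for the pretraining setup described above."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        # Shared vision encoder (a real system would use a pretrained ViT/ConvNeXt).
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                      # (B, d_model, N)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        # Causal self-attention stack standing in for the decoder-only LLM.
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W); text_ids: (B, T)
        vis = self.vision_encoder(frames).transpose(1, 2)   # (B, N, d_model)
        txt = self.text_embed(text_ids)                      # (B, T, d_model)
        seq = torch.cat([vis, txt], dim=1)                   # joint multimodal sequence
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=causal)
        return self.lm_head(hidden)                          # next-token logits


if __name__ == "__main__":
    model = MagmaStyleSketch()
    logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
    print(logits.shape)  # torch.Size([1, 212, 32000]): 196 patch + 16 text tokens
```

In such a setup, SoM/ToM targets serialized as text (as in the earlier sketch) would be appended to the text sequence and supervised with an ordinary next-token cross-entropy loss.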

>> Strengths:

● Generality: Magma handles data and tasks across multiple modalities and generalizes well.

● Efficiency: SoM and ToM exploit large amounts of unlabeled data, improving training efficiency.

● Strong performance: Magma sets new SOTA results on UI navigation and robot manipulation and also performs well on multimodal understanding tasks.

>> Conclusions and key points:

● Magma is the first foundation model able to perform multimodal AI agentic tasks in both the digital and physical worlds.

● SoM and ToM effectively improve the model's spatial-temporal reasoning and enable large-scale pretraining.

● Magma achieves SOTA results on UI navigation and robot manipulation while remaining competitive on multimodal understanding tasks.

● The paper argues that both multimodal understanding and action prediction are necessary for building general AI agents, and proposes an effective pretraining method.

● The paper also discusses the model's social impacts and limitations, as well as responsible AI development.

Contents

Translation and Commentary on "Magma: A Foundation Model for Multimodal AI Agents"

Abstract

1. Introduction

Figure 2: A multimodal AI agent should be capable of multimodal understanding and action prediction towards a given goal.

Conclusion

Social Impacts and Limitations

Responsible AI


Translation and Commentary on "Magma: A Foundation Model for Multimodal AI Agents"

Paper: [2502.13130] Magma: A Foundation Model for Multimodal AI Agents

Date: 2025-02-18

Authors: Microsoft Research et al.

Abstract

We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at this https URL.

1. Introduction

A long-standing research topic of AI is to develop autonomous agents that can perceive visual stimuli, language inputs, and other environmentally-grounded data and produce meaningful embodied actions in physical and digital environments to complete specific tasks.

Recently, there has been a growing interest in developing AI agents based on Vision-Language-Action (VLA) models [54, 29, 5, 6, 42, 19]. These models are typically pretrained on large amounts of vision-language data and then on action trajectories to attain the ability to take actions given VL inputs. However, due to the inherent differences between environments (e.g., the 2D digital world and the 3D physical one), VLA models are typically trained separately for simplicity and then used for different tasks. Exemplary models in the digital world include Pix2ACT [108], WebGUM [34], and Ferret-UI [131] for UI navigation. VLA models in the 3D physical world include RT-2 [5] and OpenVLA [54] for robotics manipulation. Although claimed to be generalist, most of these models prioritize learning a task-specific action policy at the cost of a significant decline in generic multimodal understanding capabilities, rendering limited generalizability across tasks and domains.

In this research, we strive to develop a foundation model for multimodal AI agents and argue that it requires simultaneously possessing the following capabilities:

• Multimodal Understanding to understand multimodal input from various domains (both digital and physical) not only semantically, but also spatially and temporally.

• Multimodal Action Prediction to break down the long-horizon task into an accurate action sequence, which can be effectively executed by AI agent systems.

Such an agent system should be driven by external goals specified by human commands as shown in Fig. 2.

To endow these broad capabilities, we effectively leverage large amounts of heterogeneous vision-language and action datasets, including UI datasets such as SeeClick [19], the robotic manipulation dataset OXE [23], human instructional videos like Ego-4d [40], and image-text pairs used in LMMs [71, 13]. Instead of sequentially training on one domain and adapting to another, we train a single foundation model which can be applied in a zero-shot manner to different downstream tasks in various settings.

Simply combining those datasets, however, does not bring benefits to the foundation model, due to the significant gap between multimodal understanding, which is mostly verbal (i.e., textual descriptions for images and videos), and the action-taking tasks, which are mostly spatial (i.e., 2D coordinates for UI or 7-DoF for a robot arm). To bridge the gap, we propose two surrogate tasks for model training, action grounding and action planning, by asking the model to predict the proximal action outputs given the visual-spatial observations, represented as images or video frames. Specifically, in each image we label the actionable visual objects with Set-of-Mark (SoM) (e.g., clickable buttons in Fig. 1 bottom-middle), and in each video we label the object movements, which are the results of actions, with Trace-of-Mark (ToM) (e.g., the trace of a human hand or robotic arm in Fig. 1 top-middle). In this way, the image and video datasets, which are not labeled with actions, are transformed into “vision-language-action” data to morph the gap among different types of tasks. We show through extensive empirical studies that SoM and ToM are environment-agnostic and easy to generalize to new agentic tasks, offering an effective and efficient approach to scaling up our Magma model pretraining using large amounts of unlabeled videos, such as raw instructional videos.
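
To make ToM concrete, the hypothetical sketch below shows how a trace target could be derived once a point tracker has produced per-frame positions for a marked object: the future positions are quantized onto a coarse grid and serialized as text so the language model can predict them as ordinary tokens. The grid size, horizon, and token format are illustrative assumptions, not the paper's exact procedure.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) normalized to [0, 1]


def trace_of_mark_target(track: List[Point], horizon: int = 8, bins: int = 256) -> str:
    """Turn a tracked mark's future positions into a discrete text target.

    `track` holds the mark's normalized position in the current frame followed
    by its positions in subsequent frames (e.g., from a point-tracking model).
    The future positions are quantized into a bins x bins grid and serialized
    so a language model can predict them as ordinary tokens.
    """
    future = track[1:horizon + 1]
    cells = [(min(int(x * bins), bins - 1), min(int(y * bins), bins - 1))
             for x, y in future]
    return " ".join(f"<{cx},{cy}>" for cx, cy in cells)


if __name__ == "__main__":
    # Hypothetical track of a robot gripper mark drifting to the right.
    track = [(0.40, 0.55), (0.43, 0.54), (0.47, 0.53), (0.52, 0.52), (0.58, 0.52)]
    print(trace_of_mark_target(track, horizon=4))
    # -> <110,138> <120,135> <133,133> <148,133>
```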

To the best of our knowledge, Magma is the first foundation model for multimodal AI agents that can understand multimodal inputs (see Fig. 1 left), perform action grounding and planning for the future (see Fig. 1 middle), and finally adapt to downstream (unseen) agentic tasks in both the digital and physical environments (see Fig. 1 right). We evaluated Magma on three task categories: UI navigation (e.g., Mind2Web, AITW), where it has to reason and act in evolving digital environments; vision-language understanding (e.g., GQA, VideoMME), where it grounds language in visual objects and events; and finally robotic manipulation (e.g., Bridge, LIBERO), which tests its 3D spatial intelligence for physical interaction. Magma achieves new SOTA results on UI navigation and robotic manipulation tasks, outperforming even domain-specific models, while maintaining strong performance on VL tasks that is comparable to SOTA LMMs.

In summary, the main contributions of this work are:

• We propose Magma, the first foundation model that acquires not only multimodal understanding but also spatial-temporal reasoning abilities for agentic tasks in both digital and physical environments.

• We propose the use of Set-of-Mark and Trace-of-Mark techniques to significantly enhance the spatial-temporal intelligence for action grounding and planning, and allow Magma to be pretrained effectively on large amounts of heterogeneous datasets.

• We curate a large-scale pretraining dataset, which consists of not only open-source VL datasets, but also UI, robotics data and human instructional videos, auto-labeled using SoM and ToM. In total, our training corpus contains approximately 39 million diverse samples.

• We extensively evaluate the pretrained Magma model to demonstrate the superior model performance across a wide range of tasks. Magma with a single suite of parameters achieves new SOTA on both robotic manipulation and UI navigation over open-sourced counterparts.

• We show that the proposed Magma pretraining method significantly improves model’s verbal and spatial-temporal intelligence abilities. For instance, Magma can achieve SOTA performance on the BLINK dataset without instruction fine-tuning, and SOTA performance on video question-answering benchmarks despite being pretrained on much fewer frames.

Figure 1: We introduce Magma, the first foundation model that is capable of interpreting and grounding multimodal inputs within its environment. Given a described goal, Magma is able to formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal and spatial intelligence to navigate complex tasks.

Figure 2: A multimodal AI agent should be capable of multimodal understanding and action prediction towards a given goal.

Conclusion

We present the Magma foundation model that can understand and act on multimodal inputs to complete agentic tasks in different environments. Our experiments show that the use of SoM and ToM prediction tasks in pretraining helps the model learn to ground and plan actions, respectively. In our experiments, Magma shows strong spatial-temporal reasoning ability and significantly outperforms baselines on downstream UI navigation and robotic manipulation tasks.

Social Impacts and Limitations

To develop a foundation model with both verbal and spatial intelligence capable of handling diverse agentic tasks in digital and physical environments, we curated a comprehensive pretraining dataset from a wide range of image, video, and robotics domains:

• UI navigation data. We leverage two pretraining datasets, SeeClick and Vision2UI.

• Instructional videos. As our goal was to learn an agentic model that can undertake daily tasks like humans, we compiled videos from Epic Kitchen, Ego4d, Something-Something v2, and other instructional videos.

• Robotics manipulation data. For robotics tasks, we follow OpenVLA to leverage the robotics data in Open-X-Embodiment.

• Multimodal understanding data. Lastly, we include a small set of multimodal pretraining data, ShareGPT4V, and the instruction tuning data LLaVA-1.5, plus a number of other domain-specific datasets, to retain the generic multimodal understanding capability of the pretrained model.

The data markup of the robotics and UI navigation data is fairly standardized, focusing on generic manipulation tasks (“Place x object on y object”) and generic UI navigation tasks (“Click search button”). We, however, performed a detailed data reflection exercise on the video data of people performing certain tasks. The core inferences we took from these videos were the trajectories of objects over time as the tasks were performed.

We note that the distribution of identities and activities in the instructional videos is not representative of the global human population and the diversity in society. We are cognizant of the unintended societal, gender, racial and other biases in training with these data, so we will ensure required disclaimers are in place when publishing the models. The training dataset, task list and descriptions focus only on the next action to perform; they do not describe, act on, or perform any analysis of the subject itself. While there can be unintended outputs from the model based on adverse task descriptions, we will ensure to highlight the use cases the model was trained for and its intended use.

Responsible AI

It is important to note that the model is specifically designed for UI navigation in a controlled Web UI and Android simulator, and for robotic manipulation tasks, and should not be broadly applied to other tasks. The recommended usage is within the settings it was trained on, namely, an enclosure equipped with a robotic arm and everyday objects for robotic manipulation, and an Android simulator running on a computer for UI manipulation. For the UI navigation task, researchers should make sure that a human is in the loop and in control of every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences can occur as a result of performing the UI action proposed by the model.

The model by itself demonstrates good-enough capability in UI navigation and robotic manipulation, but is not usable as is for exploitation scenarios. A threat actor could, however, use specific training data for a specific malicious task to leverage the model as a base for performing automated UI navigation. This is a generic risk associated with agentic models.
