[Multimodal Agents + Games] Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case

Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case (a multimodal agent applied to Red Dead Redemption II)

Paper link
Code link


Abstract

Despite the success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize to different scenarios, mainly due to dramatic differences in the observations and actions across scenarios. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input, and producing keyboard and mouse operations as output, similar to human-computer interaction. The main challenges of achieving GCC are: 1) the multimodal observations for decision-making, 2) the requirements of accurate control of keyboard and mouse, 3) the need for long-term memory and reasoning, and 4) the abilities of efficient exploration and self-improvement. To target GCC, we introduce Cradle, an agent framework with six main modules, including: 1) information gathering to extract multi-modality information, 2) self-reflection to rethink past experiences, 3) task inference to choose the best next task, 4) skill curation for generating and updating relevant skills for given tasks, 5) action planning to generate specific operations for keyboard and mouse control, and 6) memory for storage and retrieval of past experiences and known skills. To demonstrate the capabilities of generalization and self-improvement of Cradle, we deploy it in the complex AAA game Red Dead Redemption II, serving as a preliminary attempt towards GCC with a challenging target. To our best knowledge, our work is the first to enable LMM-based agents to follow the main storyline and finish real missions in complex AAA games, with minimal reliance on prior knowledge or resources. The project website is at https://baai-agents.github.io/Cradle/.

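Despite their success in specific tasks and scenarios, existing foundation agents, even when empowered by large models (LMs) and advanced tools, still cannot generalize across scenarios, mainly because observations and actions differ dramatically from one scenario to another.

In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input and producing keyboard and mouse operations as output, similar to human-computer interaction.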
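To make the GCC input/output contract concrete, here is a minimal sketch of it as Python types. All names (`Observation`, `KeyPress`, `MouseMove`, `MouseClick`, `GCCAgent`, `HoldForwardAgent`, `blank_observation`) are hypothetical and chosen for illustration; they are not taken from the Cradle codebase.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Union


@dataclass(frozen=True)
class Observation:
    """Input side of GCC: a screenshot and, optionally, audio."""
    screen_rgb: bytes              # raw RGB pixels, 3 bytes per pixel
    width: int
    height: int
    audio: Optional[bytes] = None  # audio is optional in the GCC setting


@dataclass(frozen=True)
class KeyPress:
    key: str
    duration_s: float = 0.1


@dataclass(frozen=True)
class MouseMove:
    dx: int
    dy: int


@dataclass(frozen=True)
class MouseClick:
    button: str = "left"


# Output side of GCC: only keyboard and mouse operations.
Action = Union[KeyPress, MouseMove, MouseClick]


class GCCAgent(Protocol):
    """Anything that maps an observation to keyboard/mouse actions."""
    def act(self, obs: Observation) -> List[Action]: ...


class HoldForwardAgent:
    """Toy agent (illustration only): ignores the screen and holds 'w'."""
    def act(self, obs: Observation) -> List[Action]:
        return [KeyPress("w", 0.5)]


def blank_observation(width: int = 4, height: int = 3) -> Observation:
    """Build an all-black frame as a stand-in for a real screenshot."""
    return Observation(screen_rgb=bytes(width * height * 3),
                       width=width, height=height)
```

A real agent would replace `HoldForwardAgent.act` with a model call that reads the screenshot; the point of the sketch is only that nothing besides pixels (and possibly audio) goes in, and nothing besides keyboard and mouse events comes out.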

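The main challenges in achieving GCC are:

  1. multimodal observations for decision-making;
  2. the requirement of accurate keyboard and mouse control;
  3. the need for long-term memory and reasoning;
  4. the ability to explore efficiently and self-improve.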

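To target GCC, we introduce Cradle, an agent framework with six main modules:

  1. information gathering, to extract multimodal information;
  2. self-reflection, to rethink past experiences;
  3. task inference, to choose the best next task;
  4. skill curation, to generate and update relevant skills for given tasks;
  5. action planning, to generate specific operations for keyboard and mouse control;
  6. memory, to store and retrieve past experiences and known skills.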
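The six modules listed above can be pictured as one pass of a loop. The sketch below is a toy illustration under strong assumptions: the function names, the `Memory` layout, and the rule-based stand-ins are all hypothetical, a text string stands in for the screenshot, and in the actual framework these steps would be driven by a large multimodal model rather than by hand-written rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Module 6: stores past experiences and known skills."""
    experiences: List[str] = field(default_factory=list)
    skills: Dict[str, Callable[[], List[str]]] = field(default_factory=dict)


def gather_information(screen_text: str) -> Dict[str, str]:
    """Module 1: extract information from the observation (text stand-in)."""
    return {"instruction": screen_text.strip()}


def self_reflect(memory: Memory) -> str:
    """Module 2: look back at the most recent experience, if any."""
    return memory.experiences[-1] if memory.experiences else "no history yet"


def infer_task(info: Dict[str, str], reflection: str) -> str:
    """Module 3: choose the next task (toy rule: follow the instruction;
    the reflection is accepted but not used by this stand-in)."""
    return info["instruction"] or "explore"


def curate_skills(task: str, memory: Memory) -> None:
    """Module 4: create a skill for the task if none is known yet."""
    if task not in memory.skills:
        keys = ["w"] if "forward" in task else ["e"]  # toy mapping, assumed
        memory.skills[task] = lambda keys=keys: list(keys)


def plan_actions(task: str, memory: Memory) -> List[str]:
    """Module 5: turn the chosen skill into concrete key presses."""
    return memory.skills[task]()


def step(screen_text: str, memory: Memory) -> List[str]:
    """Run one pass through modules 1-5 and record the result in memory."""
    info = gather_information(screen_text)
    reflection = self_reflect(memory)
    task = infer_task(info, reflection)
    curate_skills(task, memory)
    actions = plan_actions(task, memory)
    memory.experiences.append(f"{task} -> {actions}")
    return actions
```

Calling `step("move forward", Memory())` returns `["w"]` and leaves one entry in `experiences`; a later call with the same memory can reuse the stored skill instead of regenerating it, which is the role the abstract assigns to the memory module.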

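To demonstrate Cradle's capabilities of generalization and self-improvement, we deploy it in the complex AAA game Red Dead Redemption II, as a preliminary attempt towards GCC with a challenging target. To the best of our knowledge, our work is the first to enable LMM-based agents to follow the main storyline and finish real missions in complex AAA games, with minimal reliance on prior knowledge or resources.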
