New from Google Cloud AI: CROME, Cross-Modal Adapters for Efficient Multimodal LLMs

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

https://arxiv.org/pdf/2408.06610

Abstract

Research focus: Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation.

Problem: Existing approaches often necessitate expensive language model retraining and limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning.

Method: We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM (a minimal sketch of such an adapter follows this abstract breakdown).

Advantages: This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it fine-tunes with exceptional parameter efficiency, competing with task-specific state-of-the-art specialist methods (see the training-setup sketch below).

Conclusion: CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
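To make the "gated cross-modal adapter" idea concrete, here is a minimal PyTorch sketch of one plausible realization: text tokens attend to visual tokens via cross-attention, and a zero-initialized gate controls how much visual information is injected before the fused sequence goes into the frozen LLM. The class, parameter names, and exact fusion rule are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a gated cross-attention adapter that fuses visual
# tokens into the text token stream before a frozen LLM. Names and the exact
# fusion rule are assumptions, not CROME's actual code.
import torch
import torch.nn as nn

class GatedCrossModalAdapter(nn.Module):
    def __init__(self, d_model: int = 4096, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Zero-initialized scalar gate: training starts from text-only behavior
        # and gradually lets visual information through.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, T_text, d_model); vis_tokens: (batch, T_vis, d_model)
        attended, _ = self.cross_attn(
            query=self.norm(text_tokens), key=vis_tokens, value=vis_tokens
        )
        # Gated residual fusion; the result is fed to the frozen LLM as its
        # input embeddings.
        return text_tokens + torch.tanh(self.gate) * attended
```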

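The claim that the adapter is "trained with minimal parameters" amounts to freezing the LLM (and typically the vision encoder) and optimizing only the adapter. Below is a rough sketch of that setup; `llm`, `vision_encoder`, and `adapter` are hypothetical placeholders for whatever modules are actually used.

```python
# Rough sketch of a parameter-efficient training setup: only the adapter is
# trainable, the LLM and vision encoder stay frozen. Module names are
# placeholders, not from the paper.
import torch

def collect_trainable_params(llm, vision_encoder, adapter):
    for module in (llm, vision_encoder):
        for p in module.parameters():
            p.requires_grad = False      # frozen backbones
    for p in adapter.parameters():
        p.requires_grad = True           # only the adapter learns
    return list(adapter.parameters())

# Usage (assuming the three modules exist):
# optimizer = torch.optim.AdamW(
#     collect_trainable_params(llm, vision_encoder, adapter), lr=2e-4
# )
```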

Multimodal large language models (MLLMs) have recently achieved remarkable breakthroughs across many domains, especially in vision-language learning. Notably, OpenAI's GPT-4V [1] and Google's Gemini Pro Vision [2] show excellent performance on tasks such as image captioning and visual question answering. However, these commercial models are typically offered only through prediction-only APIs.
