CROME: Cross-Modal Adapters for Efficient Multimodal LLM
https://arxiv.org/pdf/2408.06610
Abstract
Research subject: Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation.
Problem: Existing approaches often necessitate expensive language-model retraining and offer limited adaptability. Additionally, the current focus on zero-shot performance improvements provides insufficient guidance for task-specific tuning.
Method: We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM (see the illustrative adapter sketch after this summary).
Advantages: This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it supports fine-tuning with exceptional parameter efficiency, remaining competitive with task-specific specialist state-of-the-art methods.
Experimental conclusion: CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
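The following is a minimal sketch of what a gated cross-modal adapter of this kind could look like, assuming text tokens cross-attend to visual tokens and a zero-initialized learnable gate controls how much visual information is mixed into the text representations before they enter the frozen LLM. The class name, gating form, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a gated cross-modal adapter (illustrative, not CROME's exact design).
import torch
import torch.nn as nn

class GatedCrossModalAdapter(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Text tokens attend to visual tokens (queries = text, keys/values = vision).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Scalar gate initialized to zero: training starts from the frozen LLM's
        # original text embeddings and gradually mixes in visual information.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, T_text, d_model), vis_emb: (B, T_vis, d_model)
        attn_out, _ = self.cross_attn(self.norm(text_emb), vis_emb, vis_emb)
        # Gated residual fusion; the fused tokens are then fed to the frozen LLM.
        return text_emb + torch.tanh(self.gate) * attn_out

# Usage: fuse projected image features with token embeddings, then pass the
# result to the frozen language model's transformer layers.
adapter = GatedCrossModalAdapter(d_model=4096)
fused = adapter(torch.randn(2, 32, 4096), torch.randn(2, 256, 4096))
print(fused.shape)  # torch.Size([2, 32, 4096])
```

Because only the adapter parameters (attention, norm, gate) are trained while the LLM stays frozen, this kind of design keeps the trainable parameter count small, which is consistent with the parameter-efficiency claims above.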
Introduction
Multimodal Large Language Models (MLLMs) have recently achieved impressive breakthroughs in multiple domains, particularly in vision-language learning. Notably, OpenAI's GPT-4v [1] and Google's Gemini Pro Vision [2] show excellent performance on tasks such as image captioning and visual question answering. However, these commercial models are typically accessible only through prediction-only APIs