谷歌云AI新作：CROME，跨模态适配器高效多模态大语言模型

最新推荐文章于 2025-03-20 12:49:59 发布

Phoenixtree_DongZhao

最新推荐文章于 2025-03-20 12:49:59 发布

阅读量1.6k

点赞数 28

分类专栏： Large Model Transformer 文章标签：人工智能大语言模型适配器

本文链接：https://blog.csdn.net/u014546828/article/details/141355921

版权

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

https://arxiv.org/pdf/2408.06610

Abstract

研究对象：Multimodal Large Language Models (MLLMs) demonstrate remarkable imagelanguage capabilities, but their widespread use faces challenges in cost-effective training and adaptation.

提出问题：Existing approaches often necessitate expensive language model retraining and limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning.

本文方法：We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM.

方法优点：This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it yields fine-tuning with exceptional parameter efficiency, competing with task-specific specialist state-of-the-art methods.

实验结论：CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.