Paper Digest | Recent Research Progress on Multimodal Large Models
From the 57 papers published between 2024-12-03 and 2024-12-05, we have selected 5 outstanding works to share with readers.
- AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
- Understanding the World’s Museums through Vision-Language Reasoning
- NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
- OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations
- VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
1.AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Authors: Seongmin Park, Hyungmin Kim, Wonseok Jeon, Juyoung Yang, Byeongwook Jeon, Yoonseon Oh, Jungwook Choi
https://arxiv.org/abs/2412.01034
Abstract
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.
Brief Review
This paper targets the shortcomings of current multimodal large language models (MLLMs) in audio-visual understanding and proposes AV-Odyssey Bench, giving the research community a comprehensive and practical evaluation tool. With a dataset of 4,555 multimodal questions, the authors systematically test models on basic auditory tasks and on integrating audio-visual information, and their DeafTest exposes the limitations of current models on simple auditory discrimination tasks. The paper's contribution is to fill the gap in existing benchmarks around auditory understanding while offering a valuable reference for future dataset construction and model optimization. Overall, the data are comprehensive and the question design is sound, giving an important push to research on multimodal models.
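As a concrete illustration of why the multiple-choice format removes the need for human or LLM-assisted grading, here is a minimal scoring sketch; the answer-extraction regex and the sample predictions are our own illustrative assumptions, not code from the paper.

```python
# Minimal sketch: scoring multiple-choice predictions by letter match.
# `preds` and `gold` are hypothetical lists; AV-Odyssey's actual data format
# and evaluation script may differ.
import re

def extract_choice(response: str) -> str:
    """Pull the first standalone option letter (A-D) out of a model response."""
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else ""

def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

if __name__ == "__main__":
    preds = ["The answer is B.", "A", "I think (C) is correct.", "D"]
    gold = ["B", "A", "C", "B"]
    print(f"accuracy = {accuracy(preds, gold):.2f}")  # 0.75
```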
2.Understanding the World’s Museums through Vision-Language Reasoning
Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool
https://arxiv.org/abs/2412.01370
Abstract
Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights on the complex and fine-grained understanding of museum exhibits. In particular, we show that some questions whose answers can often be derived directly from visual features are well answered by both types of models. On the other hand, questions that require the grounding of the visual features in repositories of human knowledge are better answered by the large vision-language models, thus demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: github.com/insait-institute/Museum-65
Brief Review
This paper introduces MUSEUM-65, a large-scale dataset of 65 million images and 200 million question-answer pairs aimed at advancing visual question answering (VQA) in museum settings. It compares two vision-language models, BLIP and LLaVA, and evaluates their performance on a variety of cultural-heritage tasks. The study's main contribution is a rich resource that lays a solid foundation for future research. By comparing BLIP and LLaVA, the paper also offers valuable insights into how these models differ across cultural contexts. It further highlights the importance of MUSEUM-65 for education and research: the dataset fills a gap left by existing collections and lets researchers explore the visual expression of cultural and historical knowledge in greater depth. In sum, the paper demonstrates both the value of the MUSEUM-65 data and its significance for the VQA field, making it well worth readers' attention.
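For a feel of the kind of catalog-style VQA query the paper benchmarks, the sketch below asks an off-the-shelf BLIP VQA checkpoint a question via Hugging Face transformers. The checkpoint, image path, and question are placeholders, and the paper itself fine-tunes BLIP and LLaVA on MUSEUM-65 rather than using them off the shelf.

```python
# Minimal sketch: querying a pretrained BLIP VQA model about an exhibit image.
# "exhibit.jpg" is a hypothetical local file; swap in any museum photo.
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("exhibit.jpg").convert("RGB")
question = "What material is this exhibit made of?"

inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```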
3.NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
Authors: Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
https://arxiv.org/abs/2412.01256
Abstract
The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to further enhance robustness. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix, effectively partitioning datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representation and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
Brief Review
This paper explores how the proposed NLPrompt method improves prompt learning for vision-language models. At its core, the method uses the mean absolute error (MAE) loss together with the PromptOT data purification strategy to effectively separate clean samples from noisy ones. Experiments show that this design significantly improves model performance across different noise levels. Overall, the paper provides an effective solution to an important problem in machine learning, namely learning with noisy labels, and offers a new direction for the development of vision-language models.
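The robustness argument for MAE can be seen in a toy example: cross-entropy assigns an unbounded loss to a sample the model is confident about but that carries a wrong (noisy) label, so that sample dominates the gradient, whereas MAE is bounded. The sketch below only illustrates this intuition; the probabilities and class count are made up, and it is not the paper's implementation.

```python
# Minimal sketch of why MAE is more tolerant of label noise than cross-entropy.
import torch
import torch.nn.functional as F

def ce_loss(probs: torch.Tensor, label: int) -> torch.Tensor:
    # Standard cross-entropy for a single sample given class probabilities.
    return -torch.log(probs[label])

def mae_loss(probs: torch.Tensor, label: int) -> torch.Tensor:
    # L1 distance between the probability vector and the one-hot target.
    one_hot = F.one_hot(torch.tensor(label), num_classes=probs.numel()).float()
    return torch.abs(probs - one_hot).sum()

probs = torch.tensor([0.90, 0.05, 0.05])   # model is confident in class 0
clean_label, noisy_label = 0, 2            # noisy label contradicts the model

for name, fn in [("CE", ce_loss), ("MAE", mae_loss)]:
    print(name,
          f"clean={fn(probs, clean_label).item():.3f}",
          f"noisy={fn(probs, noisy_label).item():.3f}")
# CE on the mislabeled sample is -log(0.05) ≈ 3.0 and keeps growing as the model
# gets more confident; MAE stays bounded (≤ 2), which damps the noise's influence.
```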
4.OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations
Authors: Caixin Kang, Yubo Chen, Shouwei Ruan, Shiji Zhao, Ruochen Zhang, Jiayi Wang, Shan Fu, Xingxing Wei
https://arxiv.org/abs/2412.02479
Abstract
With the rise of deep learning, facial recognition technology has seen extensive research and rapid development. Although facial recognition is considered a mature technology, we find that existing open-source models and commercial algorithms lack robustness in certain real-world Out-of-Distribution (OOD) scenarios, raising concerns about the reliability of these systems. In this paper, we introduce OODFace, which explores the OOD challenges faced by facial recognition models from two perspectives: common corruptions and appearance variations. We systematically design 30 OOD scenarios across 9 major categories tailored for facial recognition. By simulating these challenges on public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V, and YTF-C/V. We then conduct extensive experiments on 19 different facial recognition models and 3 commercial APIs, along with extended experiments on face masks, Vision-Language Models (VLMs), and defense strategies to assess their robustness. Based on the results, we draw several key insights, highlighting the vulnerability of facial recognition systems to OOD data and suggesting possible solutions. Additionally, we offer a unified toolkit that includes all corruption and variation types, easily extendable to other datasets. We hope that our benchmarks and findings can provide guidance for future improvements in facial recognition model robustness.
Brief Review
OODFace, a benchmark suite for face recognition under common corruptions and appearance variations, addresses an important direction in current research. Targeting the weaknesses of face recognition systems, the paper designs 30 OOD scenarios and builds three evaluation benchmarks on public datasets. The experiments show that although models and commercial APIs vary in how they handle OOD challenges, they share common problems such as insufficient robustness to local details and poor generalization. These findings provide a valuable reference for future work and prompt us to consider how to strengthen OOD robustness, for example by refining algorithms and improving feature extraction. Overall, the paper offers an in-depth analysis of how face recognition systems behave in OOD environments and of the problems they exhibit, and it is clearly significant for advancing this area of research.
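To make "common corruptions" concrete, the sketch below applies Gaussian noise at several severities to a face crop. The severity scale and file names are illustrative assumptions and do not reproduce the OODFace toolkit, which covers 30 scenarios across 9 categories.

```python
# Minimal sketch: one corruption type (Gaussian noise) at increasing severity.
import numpy as np
from PIL import Image

def gaussian_noise(img: Image.Image, severity: int = 1) -> Image.Image:
    """Add zero-mean Gaussian noise; sigma grows with severity (1-5)."""
    sigma = [0.04, 0.06, 0.08, 0.10, 0.12][severity - 1]   # illustrative scale
    arr = np.asarray(img).astype(np.float32) / 255.0
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))

if __name__ == "__main__":
    face = Image.open("face.jpg").convert("RGB")   # hypothetical input path
    for s in range(1, 6):
        gaussian_noise(face, s).save(f"face_gauss_s{s}.jpg")
```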
5.VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
https://arxiv.org/abs/2412.01558
Abstract
Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at: https://github.com/dpaul06/VideoLights.
Brief Review
VideoLights is a novel approach that combines the video and text modalities to tackle the joint task of highlight detection and moment retrieval in video. The paper proposes a new framework that leverages multiple pretrained vision-language models to achieve better feature alignment and exploit the interdependence between the two tasks. The authors introduce several new modules that improve video-text fusion and overall performance, providing a useful tool for researchers and practitioners in computer vision and natural language processing. With strong results on the latest benchmarks, VideoLights demonstrates its potential as a powerful tool for joint highlight detection and moment retrieval.
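The core fusion idea, in which video clip features attend to text tokens and text tokens attend back to video clips, can be sketched with standard PyTorch attention layers. The layer sizes below are arbitrary; this illustrates generic bi-directional cross-modal attention, not the actual VideoLights architecture.

```python
# Minimal sketch of bi-directional cross-modal attention between video and text.
import torch
import torch.nn as nn

class BiDirectionalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.vid_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        # Video clips attend to text tokens, and text tokens attend to video clips.
        v_attn, _ = self.vid_to_txt(query=video, key=text, value=text)
        t_attn, _ = self.txt_to_vid(query=text, key=video, value=video)
        return self.norm_v(video + v_attn), self.norm_t(text + t_attn)

if __name__ == "__main__":
    fusion = BiDirectionalFusion()
    video = torch.randn(2, 75, 256)   # 75 clip features per video
    text = torch.randn(2, 20, 256)    # 20 query-token features
    v, t = fusion(video, text)
    print(v.shape, t.shape)           # (2, 75, 256) and (2, 20, 256)
```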
How to Learn Large Model AI?
Because new roles are more productive than the roles they replace, overall productivity in society rises.
For individuals, though, the most that can be said is:
"Those who master AI first will have a competitive edge over those who master it later."
This was just as true at the dawn of the computer, the internet, and the mobile internet.
In more than a decade at leading internet companies, I have mentored many junior colleagues and helped a lot of people learn and grow.
I realized that I have plenty of experience and knowledge worth sharing, and that our skills and experience can resolve many of the questions people run into when learning AI, so I keep organizing and sharing material despite a busy schedule. Because the channels for spreading this knowledge are limited, many peers in the internet industry cannot get the right materials to improve. For that reason I am sharing key large model AI resources for free, including a getting-started mind map for large model AI, curated books and handbooks, video tutorials, and recorded hands-on sessions.
Stage 1 (10 days): Beginner applications
This stage gives you a cutting-edge understanding of large model AI, putting your grasp of it ahead of 95% of people, so you can offer informed, independent, and practical views in discussions. While others merely chat with AI, you will be able to steer it and connect large models to real business with code.
- What can large model AI do?
- How do large models acquire "intelligence"?
- Core principles for using AI well
- Business architecture of large model applications
- Technical architecture of large model applications
- Code example: feeding new knowledge into GPT-3.5 (see the sketch after this list)
- The purpose and core ideas of prompt engineering
- Typical structure of a prompt
- Instruction-tuning methodology
- Chain of thought and tree of thought
- Prompt attacks and defenses
- …
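A minimal sketch of the "feeding new knowledge into GPT-3.5" item, assuming in-context injection through the official openai client; the model name, the injected policy fact, and an OPENAI_API_KEY environment variable are assumptions for illustration.

```python
# Minimal sketch: put new knowledge in the system prompt so the model can use it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
new_knowledge = "Our refund window was extended from 7 days to 30 days on 2024-12-01."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": f"Answer customer questions using this policy update:\n{new_knowledge}"},
        {"role": "user", "content": "How long do I have to request a refund?"},
    ],
)
print(response.choices[0].message.content)
```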
Stage 2 (30 days): Advanced applications
In this stage we move into hands-on, advanced work with large model AI: building a private knowledge base to extend what AI can do, and quickly developing a complete agent-based chatbot. You will master the most capable large model development frameworks and keep up with the latest technical advances. It is suitable for Python and JavaScript programmers.
- Why RAG?
- Building a simple ChatPDF
- Retrieval basics
- What are vector representations (embeddings)?
- Vector databases and vector retrieval
- RAG based on vector retrieval (see the sketch after this list)
- Further topics for building RAG systems
- Introduction to hybrid retrieval and RAG-Fusion
- Deploying embedding models locally
- …
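A minimal sketch of vector-retrieval RAG, assuming the sentence-transformers library for embeddings; the corpus, the embedding model name, and the prompt format are illustrative choices, and the final generation step is left to whatever chat model you use.

```python
# Minimal sketch: embed documents, retrieve the closest one, build a RAG prompt.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "vLLM serves LLMs with paged attention for high throughput.",
    "RAG retrieves external documents and adds them to the prompt.",
    "LoRA fine-tunes a model by training low-rank adapter matrices.",
]
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (vectors normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

query = "What does RAG do?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # hand this prompt to any chat model
```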
Stage 3 (30 days): Model training
Congratulations: if you have made it this far, you can basically land a large model AI job and train GPT-style models yourself. Through fine-tuning you can train your own vertical-domain large model, independently train open-source multimodal large models, and master more technical approaches.
At this point roughly two months have passed and you have become an "AI whiz kid". Ready to keep exploring?
- What is a model?
- What is model training?
- Introduction to solvers & loss functions
- Mini-experiment 2: hand-write a simple neural network and train it (see the sketch after this list)
- What do training / pre-training / fine-tuning / lightweight fine-tuning mean?
- An overview of the Transformer architecture
- Lightweight (parameter-efficient) fine-tuning
- Building experimental datasets
- …
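A minimal sketch of what "hand-write a simple neural network and train it" might look like in PyTorch, fitting a tiny MLP to XOR; the layer sizes, learning rate, and step count are arbitrary illustrative choices.

```python
# Minimal sketch: a tiny MLP trained with a solver (Adam) and a loss (BCE) on XOR.
import torch
import torch.nn as nn

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                                      # the "loss function"
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)   # the "solver"

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(model(x).round().squeeze())  # expected: tensor([0., 1., 1., 0.])
```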
Stage 4 (20 days): Closing the commercial loop
Develop a working understanding of the world's large models in terms of performance, throughput, and cost; learn to deploy them in the cloud, on-premises, and in other environments; find a project or startup direction that suits you; and become an AI-empowered product manager.
- Choosing hardware
- A tour of large models around the world
- Using domestic Chinese large model services
- Setting up an OpenAI proxy
- Warm-up: deploying Stable Diffusion on Alibaba Cloud PAI
- Running large models on your local machine
- Private deployment of large models
- Serving large models with vLLM (see the sketch after this list)
- Case study: deploying an open-source large model privately on Alibaba Cloud, the elegant way
- Deploying an open-source LLM project end to end
- Content safety
- Algorithm filing for internet information services in China
- …
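A minimal sketch of serving a model with vLLM's offline inference API; the model name is a placeholder and should be swapped for whatever open-source checkpoint fits your hardware.

```python
# Minimal sketch: offline batch inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")          # placeholder checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what private LLM deployment means."], params)
for out in outputs:
    print(out.outputs[0].text)
```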
Learning is a process, and any real learning comes with challenges. Effort pays off: the harder you work, the better you become.
If you can finish all of these tasks within 15 days, you are a prodigy. But if you can complete 60-70% of the content, you are already showing the right qualities of a large model AI practitioner.