MLMs / R1-Omni: Translation and Interpretation of 《R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning》
Overview: This paper studies how reinforcement learning can improve the emotion recognition ability of multimodal large language models. In short, it demonstrates that RLVR and GRPO are effective for video multimodal emotion recognition and offers a new way to optimize multimodal large language models, while also pointing out the model's limitations as directions for future research.
>> Background and pain points: Existing work on emotion recognition with large language models focuses mainly on image-text modalities and lacks studies of video multimodal models that carry richer information, such as audio and dynamic visual content. In particular, how to use reinforcement learning effectively to improve the emotion recognition ability of video multimodal models, and how to explain the model's reasoning process, remain open challenges.
>> Proposed solution: RLVR and GRPO. The paper combines Reinforcement Learning with Verifiable Reward (RLVR) and Group Relative Policy Optimization (GRPO) to train R1-Omni, a video multimodal emotion recognition model. RLVR scores the correctness of model outputs directly with a verification function, which removes the need to train an intermediate reward model and improves training efficiency and reliability. GRPO judges the relative quality of several generated responses against each other, so no extra critic model is needed.
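As background, GRPO scores each of the G responses sampled for the same input by a group-relative advantage. The formula below is the standard GRPO formulation from the literature and is included here only as a reference sketch; it is not reproduced from the paper excerpts below:

\[
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
\]

where r_i is the verifiable reward of the i-th response. Responses scoring above the group average receive a positive advantage and are reinforced, which is why no separate value or critic model is required.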
>> Core approach:
● Cold start: HumanOmni-0.5B is first fine-tuned on the Explainable Multimodal Emotion Reasoning (EMER) dataset and a manually annotated dataset to give the model an initial reasoning capability. EMER contains multimodal data annotated with detailed reasoning processes.
● RLVR training: The model is then trained for emotion recognition with RLVR. The reward function combines an accuracy reward (Racc), which rewards correct emotion predictions, with a format reward (Rformat), which rewards outputs that follow the specified format (see the sketch after this list).
● GRPO optimization: GRPO is applied on top of this reward, optimizing the model by comparing the rewards of multiple responses sampled for the same input.
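Below is a minimal Python sketch of what such a verifiable reward might look like. It is illustrative only: the function names, the <think>...</think><answer>...</answer> output template, and the 0/1 reward values are assumptions based on common RLVR setups, not the exact implementation used in R1-Omni.

```python
import re

def format_reward(output: str) -> float:
    # 1.0 if the response follows the assumed "<think>...</think><answer>...</answer>"
    # template, 0.0 otherwise (the exact template used by R1-Omni may differ).
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(output: str, gold_label: str) -> float:
    # Extract the predicted emotion from the <answer> tag and compare it with the
    # ground-truth label; no learned reward model is needed, so the reward is verifiable.
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip().lower()
    return 1.0 if prediction == gold_label.strip().lower() else 0.0

def total_reward(output: str, gold_label: str) -> float:
    # R = Racc + Rformat, mirroring the two reward terms described above.
    return accuracy_reward(output, gold_label) + format_reward(output)

# Hypothetical model output for one video clip:
sample = "<think>The trembling voice and the frown suggest distress.</think><answer>sad</answer>"
print(total_reward(sample, "sad"))  # -> 2.0
```

During GRPO training, this total reward would be computed for every response in a sampled group and converted into the group-relative advantage shown earlier.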
>> Advantages:
● Enhanced reasoning: R1-Omni generates more detailed and more interpretable reasoning for its emotion predictions.
● Improved understanding: On the MAFW and DFEW datasets, R1-Omni's emotion recognition accuracy is significantly higher than that of the supervised fine-tuning (SFT) baseline.
● Stronger generalization: On the out-of-distribution RAVDESS dataset, R1-Omni generalizes markedly better.
>> Conclusions and takeaways:
● RLVR and GRPO effectively improve the emotion recognition ability of video multimodal models.
● R1-Omni achieves a clear performance gain on emotion recognition and shows stronger reasoning and generalization.
● The paper also points out limitations, such as inaccurate subtitle recognition, hallucination in the reasoning process, and underutilization of audio cues, which suggest directions for future work: strengthening the foundation model, reducing hallucination in reasoning, making better use of audio cues, and deepening reasoning and emotional intelligence.
Table of Contents
Translation and Interpretation of 《R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning》
5.1 Inaccurate Subtitle Recognition
5.2 Hallucination in Reasoning
5.3 Underutilization of Audio Cues
5.4 Implications for Future Research
Translation and Interpretation of 《R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning》
Paper: [2503.05379] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
Date: March 7, 2025
Authors: Alibaba Tongyi Qianwen team
Abstract
In this work, we present the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model in the context of emotion recognition, a task where both visual and audio modalities play crucial roles. We leverage RLVR to optimize the Omni model, significantly enhancing its performance in three key aspects: reasoning capability, emotion recognition accuracy, and generalization ability. The introduction of RLVR not only improves the model's overall performance on in-distribution data but also demonstrates superior robustness when evaluated on out-of-distribution datasets. More importantly, the improved reasoning capability enables clear analysis of the contributions of different modalities, particularly visual and audio information, in the emotion recognition process. This provides valuable insights into the optimization of multimodal large language models.
1. Introduction
With the advent of DeepSeek R1 [2], the potential of Reinforcement Learning (RL) has garnered increasing attention from researchers working on large models. A key innovation introduced by DeepSeek R1 is Reinforcement Learning with Verifiable Reward (RLVR), which leverages rule-based reward mechanisms to optimize models in a highly efficient and reliable manner. This approach has demonstrated remarkable success in enhancing the capabilities of large language models (LLMs) even with limited training data. Recent studies have extended this methodology to multimodal large language models (MLLMs), further showcasing its versatility. For instance, R1-V [1] has been applied to tasks such as geometry reasoning and visual counting, where MLLMs trained with RLVR not only exhibit strong reasoning abilities but also achieve performance comparable to Supervised Fine-Tuning (SFT) on in-domain tests, while significantly outperforming SFT models on out-of-distribution (OOD) evaluations.
In another notable work, Visual-RFT [6], the authors validated the effectiveness of RLVR on classical computer vision tasks such as image classification and object detection. Their results demonstrated that RLVR consistently outperforms SFT across nearly all categories, highlighting its broad applicability and robustness.
Despite these advancements, the integration of RLVR with MLLMs has thus far been limited to image-text modalities. To the best of our knowledge, no prior work has explored the application of RLVR to video-based multimodal models that incorporate richer sources of information, such as audio and dynamic visual content. Bridging this gap, we present the first exploration of RLVR in conjunction with Video Omni-multimodal Models, focusing on the task of emotion recognition, a domain where both visual and audio modalities provide critical cues for accurate predictions.
In this study, we build upon HumanOmni [8], the first open-source Omni model designed for human-centric scene understanding. By applying RLVR to HumanOmni, we aim to investigate its potential in enhancing emotion recognition performance. Our findings reveal several key insights:
>> Enhanced Reasoning Capability: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
>> Improved Understanding Capability: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks.
>> Stronger Generalization Capability: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.
5 Limitations
Despite the significant improvements achieved by R1-Omni, there remain several limitations that warrant further investigation. To illustrate these challenges, we present three representative examples in Figure 3.
5.1 Inaccurate Subtitle Recognition
In the first example, although the model produces a correct emotion prediction, we observe that inaccuracies in subtitle recognition remain a potential limitation. This issue arises because neither the HumanOmni base model nor the subsequent SFT and RLVR training processes explicitly focus on improving subtitle recognition capabilities. Addressing this limitation will require integrating more robust subtitle processing techniques, such as fine-tuning on specialized datasets or incorporating advanced natural language understanding models.
5.2 Hallucination in Reasoning
The second example demonstrates a common issue, hallucination, where the model generates reasoning outputs that are not grounded in the actual content of the video. For instance, the statement “The voiceover reveals her neutral initial reaction, which gradually turns into mild excitement and anger over time” does not align with the video’s actual emotional trajectory. This fabricated reasoning leads the model to incorrectly predict the emotion as surprise, highlighting the need for mechanisms to ensure the model’s outputs remain faithful to the input data.
5.3 Underutilization of Audio Cues
The third example underscores the model’s limited ability to fully utilize audio cues, such as tone and intonation, which are critical for accurate emotion recognition. Although our model is capable of reasoning about emotions by integrating both audio and visual information, it appears that in certain cases, the use of audio features is not as thorough or effective as the use of visual cues. In this specific instance, the character’s vocal delivery provides strong emotional signals, yet the model fails to adequately incorporate these nuances into its reasoning process.
5.4 Implications for Future Research
The limitations identified in our analysis highlight several promising directions for future research to further enhance the capabilities of R1-Omni. Specifically, we propose the following key areas of exploration:
(1) Strengthening the Foundation Model’s Capabilities: While RLVR significantly enhances the reasoning and generalization abilities of the base model, the inherent performance of the foundation model remains a critical determinant of overall success. Therefore, continuous efforts to improve the underlying Omni model, such as through larger-scale pretraining, more diverse datasets, or advanced architectural designs, are essential to unlock the full potential of RLVR-based approaches.
(2) Mitigating Hallucination in Reasoning Outputs: Due to the inherent challenges of multimodal data, such as the weaker causal relationships within video and audio tokens compared to text tokens, as well as the lack of explicit supervision for reasoning content, hallucinations can occur during the model’s reasoning process. These inaccuracies not only degrade performance but also negatively impact user experience. Developing mechanisms to detect and mitigate hallucinations will be crucial for improving the reliability and usability of the model.
(3) Enhancing Audio Cue Utilization: The underutilization of audio cues, such as tone and intonation, represents a limitation in the current model. Future work should focus on improving the model’s ability to extract and integrate audio features effectively.
(4) Enhancing Reasoning Depth and Emotional Intelligence: The current reasoning process tends to be somewhat mechanistic, focusing primarily on directly observable features such as visual cues and audio signals. However, human emotion recognition often involves deeper psychological insights, such as understanding the motivations, intentions, or internal states of individuals. By guiding the model to explore more nuanced aspects of reasoning, such as inferring psychological activities or emotional drivers, we can elevate its emotional intelligence and enhance its ability to capture complex emotional dynamics. This advancement would enable the model to better simulate human-like empathy and reasoning in real-world scenarios.