Cross-modal Contrastive Learning for Multimodal Fake News Detection

Code and data are available at https://github.com/wishever/COOLANT

copyright:

Permission to make digital or hard copies of part or all of this work for personal or classroom use 
is granted without fee provided that copies are not made or distributed for profit or commercial advantage 
and that copies bear this notice and the full citation on the first page. Copyrights for third-party components 
of this work must be honored. For all other uses, contact the owner/author(s).
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0108-5/23/10.
https://doi.org/10.1145/3581783.3613850 

ABSTRACT

Automatic detection of multimodal fake news has gained widespread attention recently. Many existing approaches seek to fuse unimodal features to produce multimodal news representations.
However, the potential of powerful cross-modal contrastive learning methods for fake news detection has not been well exploited. Besides, how to aggregate features from different modalities to boost the performance of the decision-making process is still an open question. To address these issues, we propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection, aiming to achieve more accurate image-text alignment. To further capture the fine-grained alignment between vision and language, we leverage an auxiliary task to soften the loss term of negative samples during the contrast process. A cross-modal fusion module is developed to learn the cross-modality correlations. An attention mechanism with an attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations and the cross-modality correlations. Finally, we evaluate COOLANT and conduct a comparative study on two widely used datasets, Twitter and Weibo. The experimental results demonstrate that COOLANT outperforms previous approaches by a large margin and achieves new state-of-the-art results on the two datasets.


Recently, several contrastive learning-based multimodal pretraining methods have achieved great success, suggesting that contrastive learning may be a powerful paradigm for multimodal representation learning [1, 12, 16, 21, 22, 28, 37]. A contrastive loss aims to align image features and text features by pulling the embeddings of positive image-text pairs together while pushing those of negative image-text pairs apart. It has been shown to be an effective objective for improving the unimodal encoders to better understand the semantic meaning of images and texts. While effective, the one-hot labels in contrastive learning penalize all negative predictions regardless of their correctness [10, 22].

Therefore, this contrastive framework for multimodal fake news detection suffers from several key limitations: (1) a huge number of image-text pairs in fake news are inherently not matched (e.g., Figure 1d), and the contrastive objective may overfit to those data and degrade the model's generalization performance; (2) different image-text pairs may have potential correlations (especially in the case of different multimodal news about the same event), yet existing contrastive objectives directly treat those pairs as negative, which may confuse the model. Therefore, although these advanced techniques can be beneficial in multimodal representation learning, their application to multimodal fake news detection remains to be explored.

Figure 1: Examples of fake news from Twitter. (a) The image provides no substantive information, while the text content suggests the post is probably fake. (b) The tampered image implies this may be fake news. (c) Both the text and the image indicate that the post is most likely fake. (d) An example of a false correlation: the text describes a war-like scenario, but the image depicts a cheerful mood.

Taking the above considerations into account, we propose COOLANT, a Cross-modal Contrastive Learning framework for Multimodal Fake News Detection. We utilize a simple dual-encoder framework to construct a visual semantics level and a linguistic semantics level. Then we use the image-text contrastive (ITC) learning objective to ensure the alignment between the image and text modalities. As mentioned above, the contrastive learning framework utilized for detecting multimodal fake news is subject to certain constraints, primarily stemming from the one-hot labeling method. To alleviate this problem and further improve the alignment precision, we leverage an auxiliary task, called cross-modal consistency learning, to introduce more supervision and bring in more fine-grained semantic information.
Specifically, the contrastive learning objective ensures that the image-text pairs are in perfect one-to-one correspondence, and the consistency learning task can derive the potential semantic similarity features to soften the loss of negative samples (unpaired samples). After that, we feed the aligned unimodal representations into a cross-modal fusion module to learn the cross-modality correlations. Finally, we design an attention mechanism module to help effectively aggregate the aligned unimodal representations and the cross-modality correlations. Inspired by [6], we introduce an attention guidance module to quantify the ambiguity between text and image by estimating the divergence of their representation distributions, which can help guide the attention mechanism to assign reasonable weights to modalities. In this way, COOLANT can acquire more sophisticated aligned unimodal representations and cross-modal features, and then effectively aggregate these features to boost the performance of multimodal fake news detection.
The main contributions of this paper are as follows:
• We propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection, aiming to achieve more accurate image-text alignment.
• We soften the loss term of negative samples during the contrast process to ease the strict constraint, so as to make it more compatible with our task.
• We introduce an attention mechanism with an attention guidance module to help effectively and interpretably aggregate features from different modalities.
• We conduct experiments on two widely used datasets, Twitter and Weibo. Experimental results demonstrate that our model outperforms previous systems by a large margin and achieves new state-of-the-art results on the two datasets.

2 RELATED WORKS

2.1 Fake News Detection

2.1.1 Unimodal Methods.
… …

2.1.2 Multimodal Methods.

More recently, several methods based on cross-modal discriminative patterns have been proposed to obtain superior performance in fake news detection. To learn the cross-modal characteristics, EANN [31] leverages an additional event discriminator to aid feature extraction. MVAE [19] introduces a multimodal variational autoencoder to learn probabilistic latent variable models and then reconstructs the original texts and low-level image features. MCAN [33] stacks multiple co-attention layers to better fuse textual and visual features for fake news detection. However, studies in multimodal fake news detection have rarely considered the application of the recently emerged multimodal representation learning paradigms. Besides, some methods work on the principle of weak and strong modalities. CAFE [6] measures cross-modal ambiguity by evaluating the Kullback-Leibler (KL) divergence between the distributions of unimodal features. The learned ambiguity score then linearly adjusts the weight of unimodal and multimodal features before final classification. LIIMR [29] identifies the modality that presents more substantial confidence towards fake news detection. In this paper, we effectively leverage features from different modalities and make the decision process more interpretable.

2.2 Contrastive Learning

Recently, contrastive learning has achieved great success in computer vision (CV) [4, 5, 12] and natural language processing (NLP) [9, 35]. It has also been adapted to vision-language representation learning. WenLan [15] proposes a two-tower Chinese multimodal pre-training model and adapts MoCo [12] into the cross-modal scenario. CLIP [28] and ALIGN [16] demonstrate that dual-encoder models pretrained with contrastive objectives on massive noisy web data can learn strong image and text representations, which enable zero-shot transfer of the model to various downstream tasks. ALBEF [22] employs a contrastive loss to effectively align the vision and language representations, followed by a cross-attention model for fusion. Furthermore, ALBEF [22] presents a hard negative mining strategy founded on the contrastive similarity distribution, a method similarly employed by BLIP [21] and VLMo [1]. CoCa [37] combines a contrastive loss and a captioning (generative) loss in a modified encoder-decoder architecture, which is widely applicable to many types of downstream tasks, and obtains a series of state-of-the-art results. In this paper, we propose a cross-modal contrastive learning framework for multimodal fake news detection. In particular, our study utilizes an image-text contrastive (ITC) learning objective to effectively align the visual and language representations through a straightforward dual-encoder framework, thereby producing a unified latent embedding space. Moreover, we leverage an auxiliary cross-modal consistency learning task to measure the semantic similarity between images and texts, and then provide soft targets for the contrastive learning module.

3 METHODOLOGY

In this section, we present our proposed framework, COOLANT, which leverages a cross-modal contrastive learning task to align the features from the image and text modalities. The overall model structure is illustrated in Figure 2. Given image-text pairs, we first extract unimodal features with the modal-specific encoders (§3.1).
Then our method consists of three main components: the cross-modal contrastive learning module (§3.2) for the alignment between the image and text modalities, the cross-modal fusion module (§3.3) for learning the cross-modality correlations, and the cross-modal aggregation module (§3.4) with an attention mechanism and an attention guidance module for assigning reasonable attention scores to each modality, which then boosts the performance of multimodal fake news detection.


3.1 Modal-specific Encoder

Let each input multimodal news item be $x = [x^v, x^t] \in D$, where $x^v$, $x^t$ and $D$ denote the image, the text, and the dataset, respectively. Since the modal-specific encoders are not the focus of this work, we leverage pre-training techniques to encode the image $x^v$ and the text $x^t$ into unimodal embeddings $e^v$ and $e^t$, respectively.

3.1.1 Visual Encoder.

Given a visual content $x^v$, we utilize a ResNet [13] pretrained on the ImageNet database to extract regional features. The final visual embedding $e^v$ is obtained by using a fully connected layer to transform the regional features captured by ResNet.

3.1.2 Text Encoder.

To precisely capture both semantic and contextualised representations, we adopt BERT [8] as the core module of our textual language model. Specifically, given a text $x^t$ consisting of a set of words, each word is tokenized with a pre-prepared vocabulary, and we then utilize BERT to obtain the aggregate sequence representation as temporal textual features. The final textual embedding $e^t$ is obtained by transforming the temporal textual features through a fully connected layer.
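
For concreteness, a minimal PyTorch sketch of these modal-specific encoders is given below. The choice of ResNet-50, the `bert-base-uncased` checkpoint, and the 256-dimensional projection are illustrative assumptions, not the exact configuration of the paper (a recent torchvision is assumed for the `weights` argument).

```python
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class ModalSpecificEncoders(nn.Module):
    """Encode an image-text pair into the unimodal embeddings e^v and e^t."""
    def __init__(self, embed_dim=256):  # embed_dim is an assumed projection size
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")        # pretrained on ImageNet
        self.visual_backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.visual_fc = nn.Linear(resnet.fc.in_features, embed_dim)
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # assumed checkpoint
        self.text_fc = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, image, input_ids, attention_mask):
        # Regional features from ResNet, projected by a fully connected layer -> e^v
        e_v = self.visual_fc(self.visual_backbone(image).flatten(1))
        # Aggregate sequence representation from BERT, projected -> e^t
        text_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        e_t = self.text_fc(text_out.pooler_output)
        return e_v, e_t
```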

3.2 Cross-modal Contrastive Learning

Features from different modalities may have huge semantic gaps, so we adopt a more advanced multimodal representation learning paradigm, cross-modal contrastive learning, to align the features from different modalities by transforming the unimodal embeddings into a shared space. Specifically, we utilize a simple dual-encoder framework, establishing distinct visual semantics and linguistic semantics levels to construct a cross-modal contrastive learning module. As mentioned above, the one-hot labeling method in contrastive learning imposes a penalty on all negative predictions irrespective of their accuracy. Therefore, we propose to leverage an auxiliary cross-modal consistency learning task, which can help measure the semantic similarity between images and texts. The consistency learning module can provide semantic similarity matrices as soft targets for the contrastive learning module.

3.2.1 Consistency Learning.

The cross-modal consistency learning is a binary classification task, which predicts whether a pair of image and text is positive (matched) or negative (not matched) given their multimodal feature. Specifically, we begin by crafting a new dataset $D' = [D_{pos}, D_{neg}]$ on the basis of $D$, where a text-image pair is labeled $y' = 1$ if the textual and visual embeddings are from the same piece of real news, otherwise $y' = 0$. We feed the modal-specific encoders with $x' = [x^{v'}, x^{t'}] \in D'$ to obtain unimodal embeddings $e^{v'}$ and $e^{t'}$. The unimodal embeddings are projected to a shared semantic space via a modality-specific multilayer perceptron (MLP) to learn shared embeddings $e^{v'}_s$ and $e^{t'}_s$. Then, the shared embeddings are fed to an average pooling layer, followed by a fully connected layer as a binary classifier. We use the cosine embedding loss with margin $d$ as supervision:

$$\mathcal{L}_{ITM} = \begin{cases} 1-\cos(e^{v'}_s, e^{t'}_s) & \text{if } y'=1 \\ \max\big(0,\ \cos(e^{v'}_s, e^{t'}_s)-d\big) & \text{if } y'=0 \end{cases} \tag{1}$$
where $\cos(\cdot)$ denotes the normalized cosine similarity and the margin $d$ is set to 0.2 based on empirical studies. With the gradients from back-propagation, the cross-modal consistency learning task can automatically learn a shared semantic space between multimodal embeddings, which helps measure their semantic similarity. The task can be learned in parallel with the contrastive learning task.
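
As a point of reference, Eq. (1) coincides with the standard cosine embedding loss once the 0/1 labels are mapped to ±1; a minimal sketch under that reading (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def consistency_loss(e_v_s, e_t_s, y_prime, margin=0.2):
    """Cosine embedding loss of Eq. (1): pull matched pairs together,
    push unmatched pairs below the margin d = 0.2."""
    cos = F.cosine_similarity(e_v_s, e_t_s, dim=-1)
    pos_loss = 1.0 - cos                             # case y' = 1
    neg_loss = torch.clamp(cos - margin, min=0.0)    # case y' = 0
    return torch.where(y_prime == 1, pos_loss, neg_loss).mean()

# Equivalent built-in form (labels mapped from {0, 1} to {-1, +1}):
# F.cosine_embedding_loss(e_v_s, e_t_s, 2 * y_prime - 1, margin=0.2)
```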


3.2.2 Contrastive Learning.

For a batch of $N$ image-text pairs $\{(x^v_i, x^t_i)\}^N_{i=1}$, where $i$ indicates the $i$-th pair, the normalized embedded vectors $\{(e^v_i, e^t_i)\}^N_{i=1}$ of the same dimension are obtained by the modal-specific encoders. The image-text contrastive learning aims to predict which of the $N\times N$ possible image-text pairings across a batch actually occurred. There are $N^2-N$ negative image-text pairs within a training batch. Our contrastive losses are designed to achieve the alignment between the visual representation and the textual representation. Specifically, for the $i$-th pair, the predicted vision-to-text similarity $p^{v\rightarrow t}_i = \{p^{v\rightarrow t}_{ij}\}^N_{j=1}$ and text-to-vision similarity $p^{t\rightarrow v}_i = \{p^{t\rightarrow v}_{ij}\}^N_{j=1}$ can be calculated through:


$$\begin{aligned} p^{v\rightarrow t}_{ij} &= \frac{\exp\big(\mathrm{sim}(e^v_i, e^t_j)/\tau\big)}{\sum^N_{j=1}\exp\big(\mathrm{sim}(e^v_i, e^t_j)/\tau\big)} \\ p^{t\rightarrow v}_{ij} &= \frac{\exp\big(\mathrm{sim}(e^t_i, e^v_j)/\tau\big)}{\sum^N_{j=1}\exp\big(\mathrm{sim}(e^t_i, e^v_j)/\tau\big)} \end{aligned} \tag{2}$$

where $\tau$ is a learnable temperature parameter initialized to 0.07 and the function $\mathrm{sim}(\cdot)$ is the dot product used to measure similarity scores. The corresponding one-hot ground-truth label vectors $y^{v\rightarrow t}_i = \{y^{v\rightarrow t}_{ij}\}^N_{j=1}$ and $y^{t\rightarrow v}_i = \{y^{t\rightarrow v}_{ij}\}^N_{j=1}$, with the positive pair denoted by 1 and negatives by 0, are used as the targets to calculate the cross-entropy:


$$\mathcal{L}^{v\rightarrow t} = -\frac{1}{N}\sum^N_{i=1}\sum^N_{j=1} y^{v\rightarrow t}_{ij}\log p^{v\rightarrow t}_{ij} \tag{3}$$
Likewise, we can compute $\mathcal{L}^{t\rightarrow v}$ and then obtain:
$$\mathcal{L}_{ITC} = \frac{\mathcal{L}^{v\rightarrow t}+\mathcal{L}^{t\rightarrow v}}{2} \tag{4}$$

However, as mentioned above, such hard targets may not be entirely compatible with multimodal fake news detection. To further improve the alignment precision, we use the consistency learning module to build a more refined semantic level as soft targets, which provide more accurate supervision.
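
A minimal sketch of the ITC objective of Eqs. (2)-(4), assuming the unimodal embeddings are already L2-normalized and using a fixed temperature for brevity (the paper learns $\tau$; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def itc_loss(e_v, e_t, temperature=0.07):
    """Image-text contrastive loss over the N x N in-batch similarity matrix;
    the diagonal entries are the positive (matched) pairs."""
    logits_v2t = e_v @ e_t.t() / temperature        # vision-to-text similarities (Eq. 2)
    logits_t2v = e_t @ e_v.t() / temperature        # text-to-vision similarities
    targets = torch.arange(e_v.size(0), device=e_v.device)  # one-hot ground truth
    loss_v2t = F.cross_entropy(logits_v2t, targets)          # Eq. (3)
    loss_t2v = F.cross_entropy(logits_t2v, targets)
    return 0.5 * (loss_v2t + loss_t2v)                       # Eq. (4)
```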


3.2.3 Build Soft Target.

Building upon the previous unimodal embeddings $e^v$ and $e^t$, the consistency learning module can project them to shared embeddings $e^v_s$ and $e^t_s$. For a batch of $N$ image-text pairs, we propose to leverage the shared embeddings $\{(e^v_s)_i, (e^t_s)_i\}^N_{i=1}$ to build the semantic similarity matrix as the soft targets. Take the semantic vision-to-text similarity as an example. For the $i$-th pair, the semantic vision-to-text similarity $s^{v\rightarrow t}_i=\{s^{v\rightarrow t}_{ij}\}^N_{j=1}$ can be calculated through:


$$s^{v\rightarrow t}_{ij} = \frac{\exp\big(\mathrm{sim}((e^v_s)_i, (e^t_s)_j)/\tau\big)}{\sum^N_{j=1}\exp\big(\mathrm{sim}((e^v_s)_i, (e^t_s)_j)/\tau\big)} \tag{5}$$
where $\tau$ is the temperature initialized at 0.07. Likewise, we can compute the semantic text-to-vision similarity $s^{t\rightarrow v}_i$.


3.2.4 Semantic Matching Loss.

The semantic similarities $s^{v\rightarrow t}_i$ and $s^{t\rightarrow v}_i$ are used as the soft targets to calculate the semantic matching loss. The semantic matching loss is hence the cross entropy between the predicted similarities and the soft targets:


$$\mathcal{L}_{SEM} = -\frac{1}{2N}\sum^N_{i=1}\sum^N_{j=1}\big(s^{v\rightarrow t}_{ij}\log p^{v\rightarrow t}_{ij} + s^{t\rightarrow v}_{ij}\log p^{t\rightarrow v}_{ij}\big) \tag{6}$$
The final learning objective of the cross-modal contrastive learning module is defined as:

$$\mathcal{L}_{CL} = \mathcal{L}_{ITC} + \lambda\mathcal{L}_{SEM} \tag{7}$$
where $\lambda$ controls the contribution of the soft-target mechanism. We jointly train the cross-modal contrastive learning module to produce the semantically aligned unimodal representations $m^v$ and $m^t$ as the input of the cross-modal fusion module and the cross-modal aggregation module.
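
A sketch of the soft targets of Eq. (5) and the semantic matching loss of Eq. (6), reusing the `itc_loss` sketch above to form Eq. (7). Detaching the shared embeddings so that they act purely as targets is our assumption, not something stated in the paper:

```python
import torch
import torch.nn.functional as F

def semantic_matching_loss(e_v, e_t, e_v_s, e_t_s, temperature=0.07):
    """L_SEM of Eq. (6): cross entropy between the predicted similarities (Eq. 2)
    and the soft targets built from the shared embeddings (Eq. 5)."""
    logits_v2t = e_v @ e_t.t() / temperature
    logits_t2v = e_t @ e_v.t() / temperature
    with torch.no_grad():                      # soft targets only guide the alignment
        s_v2t = F.softmax(e_v_s @ e_t_s.t() / temperature, dim=1)
        s_t2v = F.softmax(e_t_s @ e_v_s.t() / temperature, dim=1)
    l_v2t = -(s_v2t * F.log_softmax(logits_v2t, dim=1)).sum(1).mean()
    l_t2v = -(s_t2v * F.log_softmax(logits_t2v, dim=1)).sum(1).mean()
    return 0.5 * (l_v2t + l_t2v)

# Eq. (7): total loss of the contrastive learning module, with lambda = 0.2
# l_cl = itc_loss(e_v, e_t) + 0.2 * semantic_matching_loss(e_v, e_t, e_v_s, e_t_s)
```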


3.3 Cross-modal Fusion

In order to capture the semantic interactions between different modalities, we adopt a cross-modal fusion module to learn the cross-modality correlations [6]. Specifically, given the aligned unimodal representations $m^v$ and $m^t$, we first obtain the inter-modal attention weights by calculating the association between the unimodal representations:

$$\begin{aligned} f_{t\rightarrow v} &= \mathrm{softmax}\big([m^v][m^t]^T/\sqrt{dim}\big) \\ f_{v\rightarrow t} &= \mathrm{softmax}\big([m^t][m^v]^T/\sqrt{dim}\big) \end{aligned} \tag{8}$$
where $dim$ denotes the dimension size of the unimodal representation. Then, we update the original unimodal embedding vectors with the inter-modal attention weights to obtain the explicit correlation features:

$$\begin{aligned} m^v_f &= f_{t\rightarrow v} \times m^v \\ m^t_f &= f_{v\rightarrow t} \times m^t \end{aligned} \tag{9}$$
Finally, we use an outer product between $m^v_f$ and $m^t_f$ to define their interaction matrix $m^f$:

$$m^f = m^v_f \otimes m^t_f \tag{10}$$

where $\otimes$ denotes the outer product. The final correlation matrix $m^f$ is flattened into a vector.
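
A rough sketch of Eqs. (8)-(10), reading $m^v$ and $m^t$ as per-sample vectors of length $dim$. The trailing linear projection, which maps the flattened interaction matrix back to the $L \times 1$ shape used later by the aggregation module, is our assumption:

```python
import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of Eqs. (8)-(10): inter-modal attention, re-weighted features,
    and a flattened outer-product interaction matrix."""
    def __init__(self, dim):
        super().__init__()
        # Projection back to `dim` so m^f matches the L x 1 feature used in
        # the aggregation module (this projection is our assumption).
        self.proj = nn.Linear(dim * dim, dim)

    def forward(self, m_v, m_t):            # m_v, m_t: (batch, dim)
        dim = m_v.size(-1)
        # Eq. (8): inter-modal attention weights
        f_t2v = torch.softmax(m_v.unsqueeze(2) @ m_t.unsqueeze(1) / math.sqrt(dim), dim=-1)
        f_v2t = torch.softmax(m_t.unsqueeze(2) @ m_v.unsqueeze(1) / math.sqrt(dim), dim=-1)
        # Eq. (9): update the unimodal vectors with the attention weights
        m_v_f = (f_t2v @ m_v.unsqueeze(2)).squeeze(2)
        m_t_f = (f_v2t @ m_t.unsqueeze(2)).squeeze(2)
        # Eq. (10): outer product, flattened into a correlation vector
        m_f = (m_v_f.unsqueeze(2) @ m_t_f.unsqueeze(1)).flatten(1)
        return self.proj(m_f)
```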

3.4 Cross-modal Aggregation

The input of the aggregation module is obtained by adaptively concatenating two sets of embeddings: the aligned unimodal representations $m^v$ and $m^t$ from the cross-modal contrastive learning module, and the cross-modality correlations $m^f$ from the cross-modal fusion module.


3.4.1 Attention Mechanism.

Since not all modalities play an equal role in the decision-making process [29], we propose to apply an attention mechanism module to reweight these features before their aggregation. Inspired by the success of the Squeeze-and-Excitation Network (SE-Net) [14, 40], we adopt an attention module to model modality-wise relationships and then weight each feature adaptively. Specifically, given the three $L \times 1$ features $m^v$, $m^t$ and $m^f$, we first concatenate them into one $L \times 3$ feature, where $L$ represents the length of the feature. We adopt global average pooling $F_{sq}(\cdot)$ to squeeze global modality information into a $1 \times 3$ vector. Then, we employ a simple gating mechanism $F_{ex}(\cdot, W)$ with a sigmoid activation to fully capture modality-wise dependencies. The final output of the attention mechanism module is obtained by rescaling $F_{scale}(\cdot,\cdot)$ the $L \times 3$ feature, which is then used to obtain the attention weights $a = \{a^v, a^t, a^f\}$. More details can be found in [14].
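
A minimal sketch of this SE-style modality attention. The size of the excitation MLP and the use of the gate values themselves as the attention weights $a = \{a^v, a^t, a^f\}$ are simplifying assumptions:

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """SE-style attention over the three modality features m^v, m^t, m^f (each batch x L)."""
    def __init__(self, num_modalities=3):
        super().__init__()
        self.excite = nn.Sequential(               # F_ex: simple gating with a sigmoid
            nn.Linear(num_modalities, num_modalities),
            nn.ReLU(inplace=True),
            nn.Linear(num_modalities, num_modalities),
            nn.Sigmoid(),
        )

    def forward(self, m_v, m_t, m_f):
        feats = torch.stack([m_v, m_t, m_f], dim=-1)   # concatenate into an L x 3 feature
        squeezed = feats.mean(dim=1)                   # F_sq: global average pooling -> (batch, 3)
        a = self.excite(squeezed)                      # modality-wise weights a^v, a^t, a^f
        rescaled = feats * a.unsqueeze(1)              # F_scale: rescale the L x 3 feature
        return a, rescaled
```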


3.4.2 Attention Guidance.

However, this kind of decision-making process is still a black box, in which the network design cannot explain why such weights are assigned to each modality. To make this process more interpretable, we utilize a Variational Autoencoder (VAE) [19] to model the latent variables and form the attention guidance module. Specifically, given the aligned unimodal features $m^v$ and $m^t$, the variational posterior can be denoted as $q(z \mid m) = \mathcal{N}(z \mid \mu(m), \sigma(m))$, in which the mean $\mu$ and variance $\sigma$ can be obtained from the modal-specific encoder. Considering the distribution over the entire dataset:

$$\begin{aligned} q(z^v) &= \frac{1}{N}\sum^N_{i=1} q(z^v_i \mid m^v_i) \\ q(z^t) &= \frac{1}{N}\sum^N_{i=1} q(z^t_i \mid m^t_i) \end{aligned} \tag{11}$$
[6] suggests that when unimodal features present strong ambiguity, the fake news detector should pay more attention to cross-modal features, and vice versa, which is formulated as the cross-modal ambiguity learning problem. Following the definition of cross-modal ambiguity, we measure the ambiguity between different modalities in a data sample $x_i$ by the averaged Kullback-Leibler (KL) divergence between the distributions of unimodal features:


$$g^{v\rightarrow t}_i = \frac{D_{KL}\big(q(z^v_i \mid m^v_i)\,\|\,q(z^t_i \mid m^t_i)\big)}{D_{KL}\big(q(z^v)\,\|\,q(z^t)\big)} \tag{12}$$

where $D_{KL}(\cdot\,\|\,\cdot)$ stands for the KL divergence. Likewise, we can compute $g^{t\rightarrow v}_i$ and then obtain:


$$g_i = \mathrm{sigmoid}\Big(\tfrac{1}{2}\big(g^{v\rightarrow t}_i + g^{t\rightarrow v}_i\big)\Big) \tag{13}$$

Then we can obtain the cross-modal ambiguity scores $g = \{[1-g_i,\ 1-g_i,\ g_i]\}^N_{i=1}$. We develop another loss function $\mathcal{L}_{AG}$, which calculates the logarithmic difference between the attention weights $a = \{a^v, a^t, a^f\}$ from the attention mechanism module and the ambiguity scores $g$:

$$\mathcal{L}_{AG} = D_{KL}(a\,\|\,g) \tag{14}$$
By minimizing $\mathcal{L}_{AG}$, the attention mechanism module learns to assign reasonable attention scores to the different modalities, which means that the weight of each modality is assigned according to its ambiguity.
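
A sketch of the guidance signal of Eqs. (12)-(14), assuming diagonal Gaussian posteriors with a closed-form KL divergence and approximating the dataset-level divergence with batch-averaged means and variances (both are our assumptions):

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_p, var_p, mu_q, var_q, eps=1e-8):
    """Closed-form KL between two diagonal Gaussians, summed over dimensions."""
    return 0.5 * (torch.log(var_q / (var_p + eps) + eps)
                  + (var_p + (mu_p - mu_q) ** 2) / (var_q + eps) - 1).sum(-1)

def attention_guidance_loss(a, mu_v, var_v, mu_t, var_t, eps=1e-8):
    """a: (batch, 3) attention weights [a^v, a^t, a^f] from the SE-style module."""
    # Per-sample divergence normalized by a batch-level divergence (Eq. 12)
    g_v2t = gaussian_kl(mu_v, var_v, mu_t, var_t) / (
        gaussian_kl(mu_v.mean(0, keepdim=True), var_v.mean(0, keepdim=True),
                    mu_t.mean(0, keepdim=True), var_t.mean(0, keepdim=True)) + eps)
    g_t2v = gaussian_kl(mu_t, var_t, mu_v, var_v) / (
        gaussian_kl(mu_t.mean(0, keepdim=True), var_t.mean(0, keepdim=True),
                    mu_v.mean(0, keepdim=True), var_v.mean(0, keepdim=True)) + eps)
    g_i = torch.sigmoid(0.5 * (g_v2t + g_t2v))                 # Eq. (13)
    g = torch.stack([1 - g_i, 1 - g_i, g_i], dim=-1)           # ambiguity scores
    # Eq. (14): D_KL(a || g), guiding the attention weights toward the scores
    return F.kl_div((g + eps).log(), a, reduction="batchmean")
```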

3.4.3 Classifier.

Given the unimodal representations, the cross-modality correlations and the attention weights, the final representation $\tilde{x}$ can be calculated through:


$$\tilde{x} = (a^v \times m^v) \oplus (a^t \times m^t) \oplus (a^f \times m^f) \tag{15}$$

where $\oplus$ represents the concatenation operation. Then, we feed it into a fully-connected network to predict the label:


$$\hat{y} = \mathrm{softmax}\big(\mathrm{MLP}(\tilde{x})\big) \tag{16}$$

We use the cross-entropy loss function as:

$$\mathcal{L}_{CLS} = -\big(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\big) \tag{17}$$
where $y$ denotes the ground-truth label. The final learning objective of the cross-modal aggregation module is defined as:

$$\mathcal{L}_{CA} = \mathcal{L}_{CLS} + \gamma\mathcal{L}_{AG} \tag{18}$$

where $\gamma$ controls the ratio of $\mathcal{L}_{AG}$. We jointly train the cross-modal aggregation module to assign reasonable attention scores to each modality and to effectively leverage information from all modalities to boost the performance of multimodal fake news detection.
The final loss function for COOLANT is defined as the combination of the consistency learning loss in Eq. 1, the contrastive learning loss in Eq. 7, and the cross-modal aggregation learning loss in Eq. 18:


$$\mathcal{L} = \mathcal{L}_{ITM} + \mathcal{L}_{CL} + \mathcal{L}_{CA} \tag{19}$$
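
Finally, a sketch of the classifier of Eqs. (15)-(16) and of how the overall objective of Eq. (19) could be assembled from the losses sketched above; the two-layer MLP head and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Aggregate the re-weighted modality features and predict real vs. fake."""
    def __init__(self, dim, hidden=64, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, a, m_v, m_t, m_f):
        # Eq. (15): weight each feature by its attention score, then concatenate
        x_tilde = torch.cat([a[:, 0:1] * m_v, a[:, 1:2] * m_t, a[:, 2:3] * m_f], dim=-1)
        return self.mlp(x_tilde)   # Eq. (16): softmax is folded into the cross-entropy loss

# Overall objective (Eq. 19), reusing the losses sketched in the previous sections:
# loss = l_itm + (l_itc + lam * l_sem) + (F.cross_entropy(logits, labels) + gamma * l_ag)
```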

4 EXPERIMENTS

4.1 Experimental Configurations

4.1.1 Datasets.

Our model is evaluated on two real-world datasets: Twitter [3] and Weibo [17]. The Twitter dataset was released for the Verifying Multimedia Use task at MediaEval. In experiments, we keep the same data split scheme as the benchmark [3, 6]. The training set contains 6,840 real tweets and 5,007 fake tweets, and the test set contains 1,406 posts. The Weibo dataset collected by [17] contains 3,749 fake news items and 3,783 real news items for training, and 1,000 fake news items and 996 real news items for testing. In experiments, we follow the same steps as in [17, 31] to remove duplicated and low-quality images to ensure the quality of the entire dataset.


4.1.3 Implementation Details.

The evaluation metrics include Accuracy, Precision, Recall, and F1-score. We use a batch size of 64 and train the model using Adam [20] with an initial learning rate of 0.001 for 50 epochs with early stopping. The $\lambda$ in the contrastive learning loss (Eq. 7) and the $\gamma$ in the cross-modal aggregation learning loss (Eq. 18) are set to 0.2 and 0.5, respectively. All code is implemented with PyTorch [24] and run on an NVIDIA RTX TITAN.
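
A minimal sketch of this training configuration; the placeholder module below stands in for the full COOLANT network assembled from the components sketched above:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder for the full COOLANT model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_size, num_epochs = 64, 50          # trained with early stopping
lam, gamma = 0.2, 0.5                    # lambda in Eq. (7), gamma in Eq. (18)
```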


5 CONCLUSION

In this paper, we propose COOLANT, a novel cross-modal contrastive learning framework for multimodal fake news detection, which uses the image-text contrastive learning objective to achieve more accurate image-text alignment. To further improve the alignment precision, we leverage an auxiliary task to soften the loss term of negative samples during the contrast process. After that, we feed the aligned unimodal representations into a cross-modal fusion module to learn the cross-modality correlations. An attention mechanism with an attention guidance module is implemented to help effectively and interpretably aggregate features from different modalities. Experimental results on two datasets, Twitter and Weibo, demonstrate that COOLANT outperforms previous approaches by a large margin and achieves new state-of-the-art results on the two datasets.

