Multimodal Short Video Rumor Detection System Based on Contrastive Learning


Abstract

        With the rise of short video platforms as prominent channels for news dissemination, major platforms in China have gradually become fertile ground for the proliferation of fake news. However, distinguishing short video rumors is challenging because videos carry a large amount of information and share many features, which leads to homogeneity. To counter the spread of short video rumors effectively, our research group proposes a methodology that combines multimodal feature fusion with the integration of external knowledge, weighing the merits and drawbacks of each algorithm. The proposed detection approach consists of the following steps: (1) creation of a comprehensive dataset comprising multiple features extracted from short videos; (2) development of a multimodal rumor detection model: we first employ the Temporal Segment Networks (TSN) video encoding model to extract video features, then use Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) to extract textual features, and finally fuse the textual and video features with the BERT model; (3) classification through contrastive learning: we acquire external knowledge by crawling reliable sources and use a vector database to incorporate this knowledge into the classification output. Our research is driven by practical needs, and its results are valuable in practical scenarios such as short video rumor identification and the management of public opinion.


1. Introduction

        Based on previous research, this project proposes a new multimodal short video rumor detection system that uses contrastive learning. The general framework of the project is as follows: 1) Dataset establishment: building a rumor short video dataset with multiple features; 2) Multimodal rumor detection model: first, the TSN (Temporal Segment Networks) video encoding model is used to extract video visual features; then Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) are combined to extract text; next, the BERT model fuses the text features with the video visual features; finally, contrastive learning is used for discrimination: external knowledge is crawled and introduced through a vector database to produce the final classification result.


2. Overall model architecture

Figure 1. Overall model architecture

2.1 Data collection, processing and labeling

The first part of this project is data collection, processing, and labeling, which mainly consists of the following three steps:

  1. Use a Python web crawler to take entries from common text rumor datasets as keywords, search domains where rumors are likely to appear (such as knowledge or news) on popular short video platforms (e.g., Douyin), reverse-search the rumor keywords, and download the corresponding short videos.
  2. Process the collected rumor dataset and manually label the rumor data.
  3. Use crawlers to collect verified information from official or trustworthy knowledge platforms (such as China's anti-rumor platform 中国辟谣网 and Baidu Baike), split the original articles into knowledge paragraphs, extract their text features to obtain feature vectors, and insert them into the database, as sketched below.
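
A minimal sketch of step 3 under stated assumptions: the article URL is a placeholder, the paragraph-splitting rule is a simple heuristic, and encode() merely stands in for the trained text feature model described in Section 3; the hnswlib index mirrors the vector database used in Section 3.5.

```python
import requests
from bs4 import BeautifulSoup
import numpy as np
import hnswlib

def fetch_paragraphs(url: str) -> list[str]:
    """Download an article and split it into knowledge paragraphs (simple heuristic)."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text("\n")
    return [p.strip() for p in text.split("\n") if len(p.strip()) > 40]

def encode(paragraphs: list[str]) -> np.ndarray:
    """Placeholder for the trained text feature model of Section 3; returns 768-d vectors."""
    return np.random.rand(len(paragraphs), 768).astype(np.float32)

# vector database for the knowledge paragraphs (see Section 3.5)
index = hnswlib.Index(space="cosine", dim=768)
index.init_index(max_elements=100000, ef_construction=200, M=16)

paragraphs = fetch_paragraphs("https://example.com/debunk-article")  # placeholder URL
index.add_items(encode(paragraphs), np.arange(len(paragraphs)))
```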

2.2 Feature extraction

The second part of this project is the multimodal feature extraction module, which mainly includes the following five steps:

(1). Video image feature processing. Frames are extracted from the video, and the ResNet50 neural network model is used to process the raw information of RGB and optical flow extracted from the video through the RGB model and optical flow model, respectively, to obtain RGB and optical flow features. The two features are pooled and fused as the image features of the video.

(2). Subtitle text detection. For the subtitles in the short video, text boxes are located and cropped, and adjacent frames that carry the same subtitles are filtered out using efficient similarity measures such as perceptual hashing. Finally, optical character recognition (OCR) converts the subtitle images into text.

(3). Audio processing. For the audio extracted from the short video, VAD speech endpoint detection is used to detect the start and end timestamps of each segment of effective speech, and only the detected segments of effective speech are retained as input for the subsequent recognition engine, filtering out invalid sound fragments and reducing recognition errors in later steps. The core step of audio processing is to use automatic speech recognition (ASR) technology to convert each segment of speech into text. To make the recognized text more readable during manual debugging and to retain information such as sentence breaks and emotion for downstream steps, a punctuation prediction step is performed after speech recognition.

(4). Fusion of image text and audio text. The recognized sentences, together with their start and end timestamps, are merged and aligned with the image text extracted from the subtitles. Similarity is measured by edit distance for deduplication, yielding the video text information.

(5). Fusion of video text information and image information. The encoded video text and image features are fed into the BERT model, and BERT's attention mechanism fuses the two modalities.


3. Model specific implementation

3.1 Image Feature Extraction

        In 2014, Simonyan et al. [2] proposed a two-stream network model using convolutional neural networks (CNNs), which fused the classification results of two branches, a spatial stream and a temporal stream, to effectively extract spatial and temporal features. Through a series of convolutional and fully connected layers, the RGB image and optical flow were combined, and a softmax classifier output the prediction result. Finally, the scores of the two network streams were fused to obtain the final classification performance.

Because they have limited access to long-range temporal context, traditional two-stream convolutional networks can only process a single frame or a single video segment at a time. Wang et al. [3] therefore proposed TSN, a network that combines video-level supervision with a sparse sampling strategy for video frames. TSN is built on the two-stream network but takes a different approach: it sparsely samples segments from the video, makes an initial prediction of the action type for each segment, and then applies a segmental consensus function to obtain a video-level prediction, which is combined across streams to produce the final result. Because of its simplicity and efficiency, this paper uses TSN to extract image features.
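
A minimal PyTorch sketch of the sparse-sampling and segmental-consensus idea (the segment count, the ResNet-50 classification head, and the average consensus below are illustrative assumptions, not the exact configuration used in this work):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TSNSketch(nn.Module):
    """TSN-style classifier: sparsely sampled segments + average consensus (assumed settings)."""
    def __init__(self, num_classes=3, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        backbone = resnet50(weights=None)                 # in practice, a pre-trained backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, frames):                            # frames: (B, K, 3, 224, 224)
        b, k, c, h, w = frames.shape
        scores = self.backbone(frames.reshape(b * k, c, h, w))  # per-segment class scores
        return scores.reshape(b, k, -1).mean(dim=1)       # segmental consensus: average over segments

# one clip represented by 8 sparsely sampled RGB frames
logits = TSNSketch()(torch.randn(1, 8, 3, 224, 224))      # shape (1, num_classes)
```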


        First, the video is decomposed into frames, yielding RGB frames and optical flow frames. Optical flow describes how objects in a scene change between two consecutive frames due to motion (of the object or of the camera). It is essentially a two-dimensional vector field in which each vector represents the displacement of a scene point from the previous frame to the current frame. The RGB frames and the optical flow frames are then fed separately through pre-trained ResNet-50 networks to obtain RGB features and optical flow features. This paper uses ResNet-50 as the basic feature extraction network for motion information. The most characteristic design of ResNet-50 is the bottleneck residual block. As shown in Figure 2, a 256-channel feature map is input; a 1×1 convolution with 64 filters reduces the 256 channels to 64, a 3×3 convolution operates on the 64 channels, and a final 1×1 convolution restores the 256 channels. Stage 0 to Stage 4 correspond to conv1, conv2_x, conv3_x, conv4_x, and conv5_x, respectively, and each stage downsamples with a stride of 2. conv1 mainly performs convolution, normalization, activation, and max pooling on the input, while conv2_x, conv3_x, conv4_x, and conv5_x all contain bottleneck residual blocks, each with three convolution layers. After this series of convolutional operations, motion features are extracted. Before the results are output, average pooling converts the extracted motion features into feature vectors, and a softmax classifier operates on each feature vector to output the probability of each action category. To balance performance and computational cost, this paper adopts ResNet-50 as one of the basic feature extraction networks. Through the ResNet-50 network, we obtain two feature vectors, each with 200 dimensions.
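
        For illustration, the bottleneck pattern described above (1×1 reduce, 3×3, 1×1 restore, plus an identity shortcut) can be written as the following generic PyTorch sketch; it is not the project's actual ResNet-50 code, which comes from a standard pre-trained model:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, plus identity shortcut (generic sketch)."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),            # 256 -> 64
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),  # 3x3 on 64 channels
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),            # 64 -> 256
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))        # residual connection

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)                       # torch.Size([1, 256, 56, 56])
```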


Figure 2. Structure of the bottleneck residual module and the network structure of ResNet-50

        The two features are then concatenated to form a 400-dimensional feature vector, and all frames are transformed into a 100 × 400 video feature through an average pooling layer.
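
        A minimal sketch of how per-frame RGB and optical-flow features could be combined into the 100 × 400 video feature (the 2048-to-200 linear projections and the adaptive pooling down to 100 time steps are assumptions introduced only to match the shapes stated above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# assumed per-frame backbone outputs: 2048-d ResNet-50 features projected to 200 dims
project_rgb = nn.Linear(2048, 200)
project_flow = nn.Linear(2048, 200)

def build_video_feature(rgb_feats, flow_feats, target_len=100):
    """rgb_feats, flow_feats: (num_frames, 2048) -> (target_len, 400) video feature."""
    frames = torch.cat([project_rgb(rgb_feats), project_flow(flow_feats)], dim=-1)  # (T, 400)
    # average-pool along the time axis down to target_len steps
    pooled = F.adaptive_avg_pool1d(frames.t().unsqueeze(0), target_len)             # (1, 400, 100)
    return pooled.squeeze(0).t()                                                    # (100, 400)

video_feature = build_video_feature(torch.randn(240, 2048), torch.randn(240, 2048))
print(video_feature.shape)   # torch.Size([100, 400])
```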


3.2 Text feature extraction

3.2.1 Speech recognition 

        We use Paraformer [4], an efficient non-autoregressive end-to-end speech recognition framework open-sourced by Alibaba. The Paraformer model consists of five components: Encoder, Predictor, Sampler, Decoder, and Loss function, as shown in the figure below. The Encoder can adopt different network structures such as self-attention, Conformer, and SAN-M. The Predictor consists of two FFN layers and predicts the number of target tokens while extracting the corresponding acoustic vectors. The Sampler is a module without learnable parameters that produces semantically meaningful feature vectors from the input acoustic vectors and target vectors. The Decoder structure is similar to that of autoregressive models, except that it is bidirectional (autoregressive decoders are unidirectional). In addition to the cross-entropy (CE) and MWER discriminative optimization targets, the loss function also includes the Predictor optimization target MAE.

        The main points are as follows. The Predictor module, based on Continuous Integrate-and-Fire (CIF), extracts the acoustic feature vectors corresponding to the target characters and can more accurately predict the number of target characters in the speech. The Sampler transforms acoustic feature vectors and target word vectors into feature vectors containing semantic information through sampling, and works with the bidirectional decoder to strengthen the model's ability to model context. The MWER training criterion is based on negative sampling.


Speech endpoint detection

        FSMN-Monophone VAD is a speech endpoint detection model used to detect the starting and ending time points of valid speech in input audio, and input the detected valid audio segments into the recognition engine for recognition, reducing recognition errors caused by invalid speech.

Figure 3. FSMN-Monophone VAD model structure

        The structure of the FSMN-Monophone VAD model is shown in Figure 3. At the model structure level, the FSMN structure can take contextual information into account, trains and infers quickly, and has controllable latency; at the same time, the network structure and the number of right-context frames of the FSMN have been adapted to the requirements of VAD model size and low latency. At the modeling unit level, speech information is rich and a single speech class has limited capacity to represent it, so the single speech class is upgraded to monophones. Subdividing the modeling units avoids parameter averaging, enhances abstract learning ability, and improves discriminability.


Punctuation prediction

        We use the punctuation module in the Controllable Time-delay Transformer post-processing framework, which is also one of Alibaba's open-source models. It predicts punctuation for the recognized speech-to-text results, making the text more readable.
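
        The models above (Paraformer ASR, FSMN-Monophone VAD, and the Controllable Time-delay Transformer punctuation model) are all available in Alibaba's open-source FunASR toolkit. The sketch below follows FunASR's published AutoModel examples; the model identifiers and the audio path are assumptions about how this project wires the components together:

```python
# pip install funasr  (Alibaba's open-source speech toolkit; interface per its documented examples)
from funasr import AutoModel

# chain VAD -> Paraformer ASR -> punctuation restoration
model = AutoModel(
    model="paraformer-zh",      # non-autoregressive Paraformer ASR
    vad_model="fsmn-vad",       # FSMN speech endpoint detection
    punc_model="ct-punc",       # Controllable Time-delay Transformer punctuation
)

result = model.generate(input="audio.wav")  # placeholder path to the extracted audio track
for segment in result:
    # each item carries the recognized, punctuated text; timestamps can also be requested
    print(segment["text"])
```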


3.2.2 Image text recognition

        Firstly, perceptual hash (phash) [5] is calculated for each video frame to obtain the image similarity, and similar frames are filtered out. Then, text detection is performed on the filtered frames, and perceptual hash (phash) calculation is performed again for each text box image to filter out similar text boxes. Finally, text recognition is performed to obtain the text in the image.

        The implementation steps of perceptual hash (phash) are as follows: (1). The image is downscaled to remove high-frequency detail while retaining its structural light and dark information. Here the image is reduced to 32×32 pixels for the DCT step below, which eliminates differences caused by varying sizes and aspect ratios. (2). The color is simplified. The downscaled image is converted to 64-level grayscale; in other words, all pixels take one of 64 gray values. (3). The Discrete Cosine Transform (DCT) is calculated. The DCT decomposes the image into frequency components. Although JPEG uses an 8×8 DCT, a 32×32 DCT is used here. (4). The DCT is cropped. Although the result of the DCT is a 32×32 matrix, only the top-left 8×8 block is retained, which represents the lowest frequencies of the image. (5). The average value of these 64 DCT coefficients is calculated. (6). The hash value is computed. This is the most important step: based on the 8×8 DCT block, each of the 64 bits of the hash is set to "1" if the corresponding coefficient is greater than or equal to the mean, and "0" otherwise.
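
        A minimal sketch of this phash computation using Pillow and SciPy (the library choice is an assumption; the 32×32 resize, 32×32 DCT, top-left 8×8 block, and mean threshold follow the steps above):

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(image_path: str) -> int:
    """64-bit perceptual hash: resize -> grayscale -> 32x32 DCT -> top-left 8x8 -> threshold at mean."""
    img = Image.open(image_path).convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    # 2-D DCT via two 1-D type-II DCTs, then keep the 8x8 low-frequency block
    coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")[:8, :8]
    bits = (coeffs >= coeffs.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)
```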


        Specifically, for each image, after phash calculation, we obtain a 64-bit binary hash value. If the number of different bits between two hash values is less than 8, we consider these two images as similar; otherwise, they are not similar. For each frame of the video, we compare it with the adjacent 10 frames. If all of them are similar, we discard the redundant frames; otherwise, we keep them. After that, we perform text detection on the remaining frames to obtain several text box images. We conduct a similar deduplication process for these text box images by comparing them with the adjacent 100 text box images while preserving their timestamps. Finally, we perform text recognition on the text box images to obtain the image text.
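
        A hedged sketch of the frame deduplication rule (the keep/drop policy below, dropping a frame when it is within 8 differing bits of any frame already kept in the trailing 10-frame window, is one reasonable reading of the description above, not necessarily the project's exact logic):

```python
def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two 64-bit phash values."""
    return bin(h1 ^ h2).count("1")

def dedup_frames(hashes, window=10, threshold=8):
    """Return indices of frames to keep.

    hashes: per-frame 64-bit phash values (e.g. from the phash() sketch above).
    A frame is dropped when it is similar (fewer than `threshold` differing bits)
    to a frame already kept inside the trailing `window` of frames.
    """
    kept = []
    for i, h in enumerate(hashes):
        recent = [j for j in kept if i - j <= window]
        if all(hamming(h, hashes[j]) >= threshold for j in recent):
            kept.append(i)
    return kept
```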


        To merge and align the audio text and the image text by their timestamps, we traverse both with two pointers; whenever one pointer's timestamp is smaller than the other's, we process the data at that pointer. For each sentence to be added, we compare its similarity to the sentences already added using the edit distance (Levenshtein distance) and drop image text with high similarity, achieving text deduplication. The results are then concatenated to obtain the video text. Calculation of the edit distance: the Levenshtein distance is the minimum number of edit operations required to transform one string into another, and we compute it with a dynamic programming algorithm, as sketched below.
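
        The standard dynamic programming recurrence for the Levenshtein distance is shown below as a generic sketch (the similarity threshold used for dropping near-duplicate sentences is not specified in the text, so none is assumed here):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("短视频谣言", "短视频辟谣"))   # 2
```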


3.3 Image and text information integration

This article adopts a fusion mechanism in which the text modality is primary and the video modality is supplementary. It uses the BERT (Bidirectional Encoder Representations from Transformers) model to combine the large amount of text and video image information, so that classification is essentially text-driven.

As the input feature dimension of the BERT model is (512, 768), i.e., 512 vectors of 768 dimensions, we allocate one vector for the CLS token, one for the SEP token, m vectors for the video features, and 510 − m vectors for the text features. According to subsequent experiments, the best result is achieved when m is 25. It can be seen that the text modality plays the dominant role in the fusion, while the video modality plays a complementary and corrective role.

The BERT input mainly consists of video information and textual information. For the video information, the 100 × 400 output of TSN (the video feature extraction model) is average-pooled into a 25 × 768 matrix. For the textual information, the text is converted into a 485 × 768 word-vector matrix. Together with a CLS and a SEP token, these two parts constitute the full input of the BERT model in this project.
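
A minimal sketch of this fusion step with the Hugging Face transformers interface (bert-base-chinese, the 400-to-768 linear projection, the pooling from 100 to 25 time steps, and the CLS / video / text / SEP ordering are assumptions made to reconcile the shapes stated above; inputs_embeds is the documented way to feed pre-built embeddings to BertModel):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
emb = bert.get_input_embeddings()                      # token embedding table (vocab, 768)
video_proj = nn.Linear(400, 768)                       # assumed 400 -> 768 projection

def fuse(video_feat: torch.Tensor, text: str) -> torch.Tensor:
    """video_feat: (100, 400) TSN output; returns the fused CLS representation C of size (768,)."""
    # video: pool 100 -> 25 time steps, project 400 -> 768
    v = F.adaptive_avg_pool1d(video_feat.t().unsqueeze(0), 25).squeeze(0).t()   # (25, 400)
    v = video_proj(v)                                                           # (25, 768)
    # text: at most 485 tokens, embedded with BERT's own token embeddings
    ids = tokenizer(text, truncation=True, max_length=485, add_special_tokens=False,
                    return_tensors="pt")["input_ids"][0]
    t = emb(ids)                                                                # (<=485, 768)
    cls = emb(torch.tensor([tokenizer.cls_token_id]))                           # (1, 768)
    sep = emb(torch.tensor([tokenizer.sep_token_id]))                           # (1, 768)
    inputs = torch.cat([cls, v, t, sep], dim=0).unsqueeze(0)                    # (1, <=512, 768)
    out = bert(inputs_embeds=inputs)
    return out.last_hidden_state[0, 0]                                          # CLS vector C

C = fuse(torch.randn(100, 400), "待检测短视频的文本内容")
print(C.shape)   # torch.Size([768])
```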


Figure 7. Input image features and text into the BERT model to get the image and text fusion features

        The token embeddings, positional embeddings, and segment embeddings are summed and input into the BERT model for fusion. After passing through BERT, the CLS token yields a representation vector C, which is a feature vector that fully integrates the image and text features through multiple layers of multi-head self-attention mechanisms, where the text features dominate.


3.4 Contrastive Learning

        In the original dataset, we have many samples labeled 0, 1, or 2 (non-rumor, rumor, or debunking video). To improve retrieval accuracy in our vector database, we need to separate these samples well in the vector space. We randomly pair the samples and use the pairs as training data, where the pair label is 1 if the two samples share the same label and 0 otherwise. In this setting we can use a contrastive loss: similar pairs (label 1) are pulled together so that they are close in the vector space, while dissimilar pairs that are closer than the defined margin are pushed apart.

        In other words, we try to make samples with the same label cluster as tightly as possible in the vector space, while keeping samples with different labels at a cosine distance of at least 0.5. We use the cosine distance (i.e., 1 minus the cosine similarity) in our basic contrastive loss function, with a margin of 0.5, meaning that dissimilar samples should be separated by a cosine distance of at least 0.5. An improved version of the contrastive loss is OnlineContrastiveLoss, which looks for negative pairs whose distance is lower than that of the farthest positive pair, and positive pairs whose distance is higher than that of the closest negative pair. In other words, this loss automatically detects the hard cases in a batch and computes the loss only for those cases.
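
        A minimal PyTorch sketch of the margin-based contrastive loss with cosine distance described above (the margin of 0.5 follows the text; the generic squared-hinge formulation and the omission of OnlineContrastiveLoss's batch-level hard-pair mining are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, pair_labels, margin: float = 0.5):
    """emb_a, emb_b: (B, D) embeddings of the two samples in each pair.
    pair_labels: (B,) with 1 for same-label (similar) pairs and 0 for different-label pairs.
    Cosine distance = 1 - cosine similarity; margin = 0.5 as in the text."""
    dist = 1.0 - F.cosine_similarity(emb_a, emb_b)                 # (B,) cosine distances
    pos = pair_labels * dist.pow(2)                                # pull similar pairs together
    neg = (1 - pair_labels) * F.relu(margin - dist).pow(2)         # push dissimilar pairs beyond margin
    return 0.5 * (pos + neg).mean()

# toy usage with random 768-d embeddings for 4 pairs
a, b = torch.randn(4, 768), torch.randn(4, 768)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(a, b, labels))
```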


3.5 Vector search and result prediction

        We obtain feature vectors by feeding existing knowledge into a pre-trained feature representation model, and then insert them into the vector database.

        Approximate nearest neighbor searching (ANNS) is currently the mainstream approach for vector search. Its core idea is to perform calculations and searches only in a subset of the original vector space, thereby speeding up the overall search speed.

        The HNSW (Hierarchical NSW) [6] and IVF-PQ algorithms are both approximate nearest neighbor (ANN) algorithms. ANN is an approximation of the k-nearest neighbor (kNN) algorithm. In a single vector retrieval problem, we need to find the top-k nearest vectors in the retrieval vector database (gallery) for a given query vector; this is the problem that the kNN algorithm solves. The most intuitive kNN method computes the similarity between the query vector and every vector in the retrieval database one by one. In practical applications, the number of vectors in the retrieval database may grow explosively, and brute-force kNN search cannot meet real-time requirements. The purpose of ANN is to trade a tolerable loss of accuracy for much faster search speed. HNSW and IVF-PQ are the two most commonly used methods; HNSW is graph-based and IVF-PQ is encoding-based.

        HNSW is based on the NSW (Navigable Small World) method. NSW refers to a navigable small world structure without hierarchy, and building an NSW graph is very simple, as shown in Figure 8. For each newly inserted element (the green query node in the figure), we start from a randomly chosen existing point (the entry point node in the figure) and find its set of nearest neighbors (an approximate Delaunay graph) in the structure. As more and more elements are inserted, edges that were originally short-distance links become long-distance links, forming a navigable small world.


Figure 8. How to build an NSW graph

        The search in the NSW graph uses a greedy search approach where the algorithm calculates the distance from the query Q to each vertex in the friend list of the current vertex and selects the vertex with the minimum distance. If the distance between the query and the selected vertex is smaller than the distance between the query and the current element, the algorithm moves to the selected vertex and it becomes the new current vertex. The algorithm stops when it reaches a local minimum: a vertex whose friend list does not contain a vertex closer to the query than the vertex itself.

        HNSW (Hierarchical Navigable Small World) is a hierarchical construction of the NSW graph, as shown in the figure. The algorithm uses a greedy strategy to traverse elements from upper layers until a local minimum is reached. Then, the search switches to a lower layer (with shorter connections) and starts searching again from the local minimum in the previous layer. HNSW uses a layered structure to divide edges based on their feature radii, reducing the computational complexity of NSW from polylogarithmic to logarithmic complexity.

        We then perform a nearest neighbor search in the vector database and retrieve the 10 most similar vectors. For each label (real, fake, or debunking video), we sum the cosine similarities of the vectors among the top 10 results that carry that label; the resulting sum is the weight of that label. The label with the highest weight is output as the final prediction.
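
        A hedged sketch of this retrieval and label-weighted voting step with the hnswlib library mentioned below (the index parameters, 768-dimensional embeddings, and label names are assumptions; hnswlib's cosine space returns distances equal to one minus the cosine similarity):

```python
import numpy as np
import hnswlib

dim, labels = 768, ["real", "fake", "debunk"]

# build the HNSW index over the knowledge-base / training embeddings
kb_vectors = np.random.rand(10000, dim).astype(np.float32)     # placeholder embeddings
kb_labels = np.random.randint(0, 3, size=10000)                # placeholder label ids
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(kb_vectors), ef_construction=200, M=16)
index.add_items(kb_vectors, np.arange(len(kb_vectors)))
index.set_ef(50)                                               # query-time accuracy/speed trade-off

def predict(query_vec: np.ndarray) -> str:
    """Top-10 search, then sum cosine similarities per label and return the heaviest label."""
    ids, dists = index.knn_query(query_vec, k=10)
    weights = np.zeros(len(labels))
    for i, d in zip(ids[0], dists[0]):
        weights[kb_labels[i]] += 1.0 - d        # cosine similarity = 1 - cosine distance
    return labels[int(weights.argmax())]

print(predict(np.random.rand(1, dim).astype(np.float32)))
```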


The hnswlib library is available on GitHub (github.com/nmslib/hnswlib).

4.1 The Dataset

We used the FakeSV short video rumor dataset and a self-annotated short video rumor dataset.

5. Conclusion

In this article, we propose a relatively complete framework for detecting and classifying short video rumors based on multimodal feature fusion: 1) Dataset establishment: we build a rumor short video dataset with multiple features; 2) Multimodal rumor detection model: we first use the TSN (Temporal Segment Networks) video encoding model to extract video visual features; then we combine optical character recognition (OCR) and automatic speech recognition (ASR) to extract text features; next, we use the BERT model to fuse the text and video visual features; finally, we use contrastive learning for discrimination: we crawl external knowledge and introduce it through the vector database to produce the final classification output. Throughout our research we have been oriented towards practical needs. We hope to continue optimizing our results and improving the model's performance in detecting short video rumors, and to apply our knowledge and achievements to practical scenarios such as short video rumor identification and the management of public opinion.

