Deepfake Video Detection Using Recurrent Neural Networks: Reading Notes
D. Güera and E. J. Delp, “Deepfake Video Detection Using Recurrent Neural Networks,” 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018, pp. 1-6, doi: 10.1109/AVSS.2018.8639163.
Introduction
Uses a convolutional neural network (CNN) to extract frame-level features.
These features are then used to train a recurrent neural network (RNN) that learns to classify whether a video has been subject to manipulation.
The main contributions of this work are summarized as follows. First, we propose a two-stage analysis composed of a CNN to extract features at the frame level, followed by a temporally-aware RNN to capture temporal inconsistencies between frames introduced by the face-swapping process. Second, we have used a collection of 600 videos to evaluate the proposed method, with half of the videos being deepfakes collected from multiple video hosting websites. Third, we show experimentally the effectiveness of the described approach, which allows us to detect whether a suspect video is a deepfake manipulation with 94% more accuracy than a random detector baseline in a balanced setting.
Related Work
- Digital Media Forensics
two pre-trained deep CNNs
two different face swapping manipulations using a two-stream network
- Face-based Video Manipulation Methods
Face2Face: a real-time facial reenactment system, capable of altering facial movements in different types of video streams.
Generative adversarial networks (GANs): GANs show remarkable results in altering face attributes such as age, facial hair, or mouth expressions.
- Recurrent Neural Networks
LSTM networks
When a deep learning architecture is equipped with a CNN combined with an LSTM, it is typically considered "deep in space" and "deep in time" respectively, which can be seen as two distinct system modalities.
Training procedure (the U-Net network is trained the same way)
Two sets of training images are required:
- the original face
- the desired face
Generation procedure
pass a latent representation of a face generated from the original subject present in the video to the decoder network trained on faces of the subject we want to insert in the video
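The routing described above (encode the original subject's face, decode with the target subject's decoder) can be sketched as follows. All functions here are hypothetical stand-ins; the real face-swapping tools use convolutional autoencoders, and this toy version only illustrates how a shared encoder and per-identity decoders are wired together:

```python
# Toy sketch of deepfake autoencoder routing. The encoder/decoders are
# hypothetical stand-ins, not the actual convolutional networks.

def shared_encoder(face_pixels):
    # Both identities are trained against ONE shared encoder, so latent
    # codes live in a common space. Here we just summarise the input.
    return [sum(face_pixels) / len(face_pixels), min(face_pixels), max(face_pixels)]

def decoder_for_target(latent):
    # Trained only on the target subject's faces: it reconstructs a
    # target-styled face from any latent code in the shared space.
    mean, lo, hi = latent
    return {"style": "target_subject", "mean": mean, "range": hi - lo}

def swap_face(original_face_pixels):
    # The core trick: encode the ORIGINAL subject's face, then decode it
    # with the TARGET subject's decoder, so the target's appearance is
    # rendered with the original's pose/expression in the latent code.
    latent = shared_encoder(original_face_pixels)
    return decoder_for_target(latent)

fake = swap_face([0.1, 0.5, 0.9, 0.5])
print(fake["style"])  # prints "target_subject"
```

Because the encoder is shared, a latent code from one identity is meaningful to the other identity's decoder; this is what makes the cross-decoding swap possible.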
Flaws:
Boundary effects
Because the encoder is not aware of the skin or other scene information, it is very common to have boundary effects due to the seamed fusion between the new face and the rest of the frame.
Inherent to the generation process of the final video itself
Because the autoencoder is used frame-by-frame, it is completely unaware of any previously generated face that it may have created.
What the CNN captures:
The most prominent is an inconsistent choice of illuminants between frames, which leads to a flickering phenomenon in the face region common to the majority of fake videos. Although this phenomenon can be hard to perceive with the naked eye in the best manually-tuned deepfake manipulations, it is easily captured by a pixel-level CNN feature extractor.
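As a crude illustration of the flicker cue (not the paper's method; the paper lets the CNN learn it from pixels), one could track the mean brightness of the face region across frames and look for abrupt frame-to-frame jumps caused by inconsistent illuminants:

```python
def brightness_flicker(frame_means):
    # frame_means: mean brightness of the face region in each frame.
    # Returns the largest absolute frame-to-frame jump; a large value
    # hints at the illumination flicker common in spliced deepfake faces.
    return max(abs(b - a) for a, b in zip(frame_means, frame_means[1:]))

# A hypothetical brightness trace with one inconsistent frame:
print(round(brightness_flicker([0.50, 0.51, 0.49, 0.72, 0.50]), 2))  # 0.23
```

A learned pixel-level CNN feature extractor picks up this kind of inconsistency far more reliably than a single hand-crafted statistic like this one.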
Recurrent Network for Deepfake Detection
Convolutional LSTM
- CNN for frame feature extraction.
The final classification layers are removed to directly output a deep representation of each frame using the ImageNet pre-trained model.
The 2048-dimensional feature vectors after the last pooling layers are then used as the sequential LSTM input.
- LSTM for temporal sequence analysis.
A 2048-wide LSTM takes a sequence of 2048-dimensional ImageNet feature vectors,
followed by a 512-unit fully-connected layer,
a softmax layer to compute the probabilities of the frame sequence being either pristine or deepfake
without the need of auxiliary loss functions.
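Putting the pieces together, the detection pipeline reduces to: per-frame 2048-d CNN features, a 2048-wide LSTM, a 512-unit fully-connected layer, and a 2-way softmax. The shape-level sketch below uses hypothetical stand-in functions (random vectors instead of real network outputs) purely to make the tensor shapes at each stage concrete:

```python
import math
import random

# Sequence length is an assumption for illustration; 2048/512/2 are the
# layer widths described in the paper.
FRAMES, FEAT, HIDDEN, CLASSES = 20, 2048, 512, 2

def cnn_features(frame):
    # Stand-in for the ImageNet-pretrained CNN with its classifier
    # removed: each frame maps to a 2048-dimensional feature vector.
    return [random.random() for _ in range(FEAT)]

def lstm(feature_sequence):
    # Stand-in for the 2048-wide LSTM: consumes the whole feature
    # sequence and yields a final hidden state (here, 512 random values).
    return [random.random() for _ in range(HIDDEN)]

def dense(hidden):
    # Stand-in for the 512-unit fully-connected layer projecting down
    # to 2 class logits (pristine vs. deepfake).
    return [sum(hidden) * 0.01, -sum(hidden) * 0.01]

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

frames = [None] * FRAMES                      # placeholder video frames
features = [cnn_features(f) for f in frames]  # shape (FRAMES, 2048)
probs = softmax(dense(lstm(features)))        # [P(pristine), P(deepfake)]
assert len(probs) == CLASSES and abs(sum(probs) - 1.0) < 1e-9
```

Note that the whole sequence is classified with a single softmax at the end, which is consistent with the paper's remark that no auxiliary loss functions are needed.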