Video Face Manipulation Detection Through Ensemble of CNNs
Contributions:
- Attention mechanism: identifies the parts of the face that drive the real/fake decision, which improves interpretability
  The proposed attention-based solution provides interesting insights on which part of each frame drives face manipulation detection, thus enabling a small step forward towards the explainability of the network results.
- Siamese training: improves the discriminative performance of the model
  A triplet siamese training strategy which extracts deep features from data to achieve better classification performances.
- Ensemble learning: combines the results of several models using a voting strategy
  We therefore focus on investigating whether and how it is possible to train different CNN-based classifiers to capture different high-level semantic information that complement one another, thus positively contributing to the ensemble for this specific problem.
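1. Network architecture
Four networks are built on EfficientNet:
- EfficientNetB4: the baseline backbone network
- EfficientNetB4Att: adds an attention mechanism and is trained end to end
- EfficientNetB4ST: the baseline backbone network, trained with the siamese strategy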
- EfficientNetB4AttST: adds the attention mechanism and is trained with the siamese strategy
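2. Implementation
Datasets:
- FF++: a well-known public dataset, built by applying manipulation methods such as Face2Face and FaceSwap to real videos downloaded from YouTube
- DFDC: the dataset of the Deepfake Detection Challenge hosted on Kaggle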
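The paper also analyzes how many frames should be extracted from each video. Increasing the number of frames per video (FPV) effectively reduces overfitting, but brings little improvement on the validation set. Given hardware limits and computational cost, the paper therefore extracts only 32 frames from each video.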
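The frame budget can be illustrated with a minimal sketch. It assumes the 32 frames are spread evenly over the video; the notes do not say how the paper picks them, and the function name `sample_frame_indices` is hypothetical.

```python
from typing import List


def sample_frame_indices(total_frames: int, frames_per_video: int = 32) -> List[int]:
    """Return up to `frames_per_video` evenly spaced frame indices.

    Assumption: uniform spacing. The paper's actual sampling scheme
    is not described in these notes.
    """
    if total_frames <= 0:
        return []
    if total_frames <= frames_per_video:
        # Short video: keep every frame.
        return list(range(total_frames))
    step = total_frames / frames_per_video  # step > 1, so indices are distinct
    return [int(i * step) for i in range(frames_per_video)]


if __name__ == "__main__":
    # A 300-frame video yields 32 indices between 0 and 299.
    print(sample_frame_indices(300))
```

Raising `frames_per_video` increases extraction and training cost roughly linearly, which is the trade-off the paragraph above describes.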
End-to-end training uses the LogLoss function to evaluate training results.
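For reference, LogLoss is the usual binary cross-entropy averaged over samples. The sketch below is a plain-Python illustration, not the paper's code; the label convention (1 = fake, 0 = real) and the clipping constant `eps` are assumptions.

```python
import math
from typing import Sequence


def log_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy.

    labels: 0/1 ground truth (assumed convention: 1 = fake, 0 = real).
    probs:  predicted probability of class 1 for each sample.
    eps:    clipping constant (assumed value) that keeps log() finite.
    """
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip away from 0 and 1
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


if __name__ == "__main__":
    # Confident correct predictions give a lower loss than hesitant ones.
    print(log_loss([1, 0], [0.9, 0.1]), log_loss([1, 0], [0.6, 0.4]))
```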
Siamese training uses a triplet margin loss over (anchor, positive, negative) samples.
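A triplet margin loss pulls the anchor towards the positive sample and pushes it away from the negative one until the two distances differ by at least a margin. The sketch below assumes Euclidean distance and a margin of 1.0; both are illustrative choices, and the paper's actual settings are in its code.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


if __name__ == "__main__":
    anchor, positive = [0.0, 0.0], [0.0, 1.0]
    # Negative far from the anchor: the margin is satisfied, loss is zero.
    print(triplet_margin_loss(anchor, positive, [3.0, 4.0]))
    # Negative close to the anchor: the loss is positive.
    print(triplet_margin_loss(anchor, positive, [0.0, 1.5]))
```

Here the anchor and the positive would be features of two faces from the same class (both real or both fake), and the negative a face from the other class.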
See the code for implementation details and hyperparameters.
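3. Experimental results
- To analyze the role of the attention mechanism in extracting the most informative parts of a face, the paper inspects a selection of faces from FF++. The eye and teeth regions of these faces are still rendered relatively coarsely and are the main basis for detection.
- Siamese training: the face images separate into well-formed clusters when projected with the t-SNE algorithm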
- The per-frame scores of the four models are compared pairwise. The scores are far from perfectly correlated, so combining the four models in an ensemble can improve detection accuracy
  All plots outside of the main diagonal show that different networks provide slightly different scores for each frame. Indeed, the point clouds do not perfectly align on a shape that can be easily described by a simple relation. This motivates us in using the different trained models in an ensemble way. If all networks were perfectly correlated, this would not be reasonable.
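The pairwise comparison can also be summarized numerically. The paper presents it as scatter plots; the sketch below swaps in the Pearson correlation coefficient as one possible summary, which is an assumption of these notes. The function names and the toy scores are purely illustrative and are not results from the paper.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


def pairwise_correlations(
    scores: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of models (names sorted)."""
    names = sorted(scores)
    return {
        (a, b): pearson(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Toy per-frame scores, made up for illustration only.
    toy = {
        "B4": [0.10, 0.80, 0.40, 0.90],
        "B4Att": [0.20, 0.70, 0.55, 0.85],
        "B4ST": [0.05, 0.90, 0.30, 0.60],
    }
    for pair, r in pairwise_correlations(toy).items():
        print(pair, round(r, 3))
```

A coefficient of exactly 1 for every pair would mean the models carry the same information, in which case an ensemble would add nothing.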
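- Finally, ensembling the four models gives good results: a large improvement over the earlier XceptionNet on the FF++ dataset, while on DFDC an ensemble of only B4 and B4ST also performs well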
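One simple way to combine the models is soft voting: average the per-frame scores across models, then average across frames to get a video-level score. This is a sketch of that reading of the "voting strategy" mentioned above, not necessarily the paper's exact aggregation; `ensemble_video_score` is a hypothetical name.

```python
from typing import Dict, Sequence


def ensemble_video_score(frame_scores: Dict[str, Sequence[float]]) -> float:
    """Soft-voting ensemble over models and frames.

    frame_scores maps a model name to its per-frame scores for one video.
    All models must score the same frames in the same order.
    Assumption: plain averaging over models, then over frames.
    """
    per_model = list(frame_scores.values())
    if not per_model or not per_model[0]:
        raise ValueError("need at least one model and one frame")
    n_frames = len(per_model[0])
    if any(len(scores) != n_frames for scores in per_model):
        raise ValueError("all models must score the same number of frames")
    # Average across models for each frame.
    per_frame = [
        sum(scores[i] for scores in per_model) / len(per_model)
        for i in range(n_frames)
    ]
    # Average across frames for the video-level score.
    return sum(per_frame) / n_frames


if __name__ == "__main__":
    # Toy scores for a two-frame video, for illustration only.
    print(ensemble_video_score({"B4": [0.2, 0.4], "B4ST": [0.6, 0.8]}))
```

Passing only the entries for B4 and B4ST corresponds to the two-model ensemble mentioned for DFDC.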