Vi2CLR: Video and Image for Visual Contrastive Learning of Representation

Article link: https://blog.csdn.net/qq_43296217/article/details/137082218

“Vi2CLR: Video and Image for Visual Contrastive Learning of Representation ¹”

(Diba et al., 2021, p. 1502)

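Introduction:

The core of the method is still contrastive learning, with clustering used to construct the positive and negative sample pairs.
Earlier contrastive learning methods built positive pairs from different views of the same sample, extracting features with a momentum encoder or drawing on a memory bank. This paper uses clustering instead: the outputs of the image encoder and the video encoder are concatenated, and clustering is performed in that joint feature space.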

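Network architecture

The network is structured as follows: video and images are learned jointly, each with its own encoder. The video input is a clip sampled from a video, and the image input is the middle frame of that clip. The output features of the video encoder and the image encoder are concatenated and then clustered.
The paper also gives pseudocode for this procedure.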
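As a rough illustration of the joint-feature step described above, here is a minimal sketch in plain Python. It is not the paper's pseudocode. The function names (`middle_frame`, `joint_feature`, `l2_normalize`) and the toy feature values are hypothetical, and the L2 normalization of the concatenated vector is an assumption made so that dot products behave like cosine similarities.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def middle_frame(clip):
    """Return the centre frame of a clip given as a sequence of frames."""
    return clip[len(clip) // 2]

def joint_feature(video_feat, image_feat):
    """Concatenate a clip's video feature with its middle-frame image feature.

    Normalizing the result is an assumption of this sketch, not a detail
    taken from the paper.
    """
    return l2_normalize(list(video_feat) + list(image_feat))

# Toy stand-ins for a clip and for encoder outputs (hypothetical values).
clip = ["f0", "f1", "f2", "f3", "f4"]
image_input = middle_frame(clip)   # the frame that would go to the image encoder
video_feat = [3.0, 0.0]            # pretend output of the video encoder
image_feat = [0.0, 4.0]            # pretend output of the image encoder
joint = joint_feature(video_feat, image_feat)
```

In a real pipeline the two features would come from the video and image networks, and the joint vectors of all samples would then be passed to the clustering step.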

Contrastive losses

Three losses are used:

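1. Contrastive loss against the cluster centers
$\mathcal{L}_{CenterNCE}=\sum_{i=1}^n-\log \frac{\exp \left(r_{J i} \cdot c_s / \phi_s\right)}{\sum_{j=1}^k \exp \left(r_{J i} \cdot c_j / \phi_j\right)}$
Here $r_{J i}$ is the joint feature of sample $i$, $c_s$ is the center of the cluster it belongs to, and $\phi_s$ is the concentration estimate of each cluster, which keeps the clusters balanced.
Positive and negative pairs: the current joint feature and the center of its own cluster form the positive pair; the joint feature and the centers of all other clusters form the negative pairs.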
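2. Image or video contrastive loss
$\mathcal{L}_{MIL-NCE}=\sum_{i=1}^n-\log \frac{\sum_{p \in \mathcal{P}_i} \exp \left(r_i \cdot r_p / \tau\right)}{\sum_{p \in \mathcal{P}_i} \exp \left(r_i \cdot r_p / \tau\right)+\sum_{n \in N_i} \exp \left(r_i \cdot r_n / \tau\right)}$
This is the multiple-instance InfoNCE loss: each sample has more than one positive pair, and the only change from the standard InfoNCE loss is that the terms of all positive pairs are summed together.
Positive and negative pairs: positives pair the current sample with other samples in the same cluster; negatives pair it with samples from different clusters.
Both positives and negatives are sampled at random from the clusters.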
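3. Final training loss
$\mathcal{L}_{\mathrm{Vi}^2 \text{CLR}}=\mathcal{L}_{\text{CenterNCE}}+\mathcal{L}_{\text{VideoNCE}}+\mathcal{L}_{\text{ImgNCE}}$
Here $\mathcal{L}_{\text{VideoNCE}}$ and $\mathcal{L}_{\text{ImgNCE}}$ are the MIL-NCE loss from item 2, computed on the video features and the image features respectively.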
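The three losses above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the default `tau=0.1`, and the toy features are hypothetical, and for simplicity every other sample in the batch is used as a positive or negative, whereas the notes above say positives and negatives are sampled at random from the clusters.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def center_nce(joint_feats, centers, phis, assignments):
    """CenterNCE: pull each joint feature towards the center of its own cluster."""
    loss = 0.0
    for r, s in zip(joint_feats, assignments):
        # One logit per cluster center, scaled by that cluster's concentration.
        logits = [dot(r, c) / phi for c, phi in zip(centers, phis)]
        loss += log_sum_exp(logits) - logits[s]
    return loss

def mil_nce(feats, assignments, tau=0.1):
    """MIL-NCE: all same-cluster samples are positives, all others negatives."""
    loss = 0.0
    n = len(feats)
    for i in range(n):
        pos = [dot(feats[i], feats[p]) / tau
               for p in range(n) if p != i and assignments[p] == assignments[i]]
        neg = [dot(feats[i], feats[q]) / tau
               for q in range(n) if assignments[q] != assignments[i]]
        if not pos:
            continue  # a singleton cluster has no positive pair; skipped here
        # -log( sum_pos / (sum_pos + sum_neg) ), computed in log space.
        loss += log_sum_exp(pos + neg) - log_sum_exp(pos)
    return loss

# Toy batch: four samples in two clusters (hypothetical values).
assignments = [0, 0, 1, 1]
video_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
joint_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
phis = [1.0, 1.0]

center_loss = center_nce(joint_feats, centers, phis, assignments)
video_loss = mil_nce(video_feats, assignments)
image_loss = mil_nce(image_feats, assignments)
total = center_loss + video_loss + image_loss
```

With cluster assignments that match the features, as in this toy batch, all three terms are small; shuffling `assignments` so that samples are paired with the wrong cluster makes each term grow.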

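Experiments

Ablation studies

1. Comparison of clustering methods

The paper uses the FINCH ² clustering method, which has no parameters, runs fast, and does not require the number of clusters to be specified. It is compared against K-means clustering, and FINCH performs better.

2. Ablation of the loss terms

The contrastive loss on the cluster centers appears to contribute a great deal.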

Other:

An explanation of FINCH:
https://blog.csdn.net/qq_36560894/article/details/122307420
Code repository:
https://github.com/ssarfraz/FINCH-Clustering
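
For orientation, the first-neighbor idea behind FINCH can be sketched in a few lines of plain Python. This is a simplified illustration, not the code from the repository above: it computes only a first partition, by linking each point to its nearest neighbor and taking connected components, uses brute-force Euclidean distances, and the function names are hypothetical. The full method goes on to build a hierarchy by repeating the step on cluster means.

```python
import math

def first_neighbors(points):
    """Index of each point's nearest other point under Euclidean distance."""
    nearest = []
    for i, p in enumerate(points):
        best = min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(p, points[j]))
        nearest.append(best)
    return nearest

def first_partition(points):
    """Link every point to its first neighbor and label the connected components."""
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(first_neighbors(points)):
        parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Two well-separated groups; the number of clusters is never specified.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (10.0, 2.5)]
labels = first_partition(points)
num_clusters = len(set(labels))
```

The point of the sketch is that the cluster count falls out of the neighbor links themselves, which is why no parameter such as K-means' `k` is needed.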

References

1. Diba, Ali, Vivek Sharma, Reza Safdari, Dariush Lotfi, M. Saquib Sarfraz, Rainer Stiefelhagen, and Luc Van Gool. "Vi2CLR: Video and Image for Visual Contrastive Learning of Representation." In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 1482–92. Montreal, QC, Canada: IEEE, 2021. https://doi.org/10.1109/ICCV48922.2021.00153.
2. Sarfraz, M. Saquib, Vivek Sharma, and Rainer Stiefelhagen. "Efficient Parameter-Free Clustering Using First Neighbor Relations." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8934–43, 2019. https://arxiv.org/abs/1902.11266