SigLIP: A Technical Summary

Latest recommended article published on 2025-03-12 09:59:51

莫叶何竹

Category: Multimodal. Tags: CLIP, SigLIP, SigLiT, contrastive learning

Article link: https://blog.csdn.net/weixin_40779727/article/details/142611538

paper	https://arxiv.org/abs/2303.15343
github	https://github.com/google-research/big_vision
author's blog	http://myhz0606.com/article/siglip

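1 Background

Since its introduction, CLIP [1] has been widely used for zero-shot classification, cross-modal retrieval, multimodal alignment, and more. Its strong capabilities have drawn broad research attention and many follow-up improvements.

Current work on improving CLIP falls into two broad groups: reducing the training cost of CLIP, and improving its performance.

For the first group, there are three common approaches. 1) Optimize the training setup: LiT [2] freezes the image encoder and trains only the text encoder to align text with images, which speeds up training. 2) Reduce the number of training tokens: FLIP [3] masks image patches and computes visual representations only for the unmasked patches (the idea from MAE [4]). 3) Change the objective: CatLIP [5] converts captions into class labels and replaces the contrastive task with a classification task.

For the second group, the most common and effective lever is data curation, i.e. building high-quality, large-scale, highly diverse image-text data. DFN [6] is a representative example.

The SigLIP paper proposes training image-text contrastive models with a sigmoid loss. This lowers the training cost and, at smaller batch sizes (below 32k), also outperforms the conventional softmax-based approach.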

2 Method

To simplify the discussion, the notation is defined as follows:

item	symbol
image encoder	$f(\cdot)$
text encoder	$g(\cdot)$
image	$I$
text	$T$
mini-batch	$\mathcal{B} = \{ (I_1, T_1), (I_2, T_2), \cdots \}$

The classic softmax-based objective, InfoNCE, pulls positive image-text pairs as close together as possible and pushes negative pairs as far apart as possible. It is computed as:

$-\frac{1}{2|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \left( \overbrace{\log \frac{e^{t \mathbf{x}_i \cdot \mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t \mathbf{x}_i \cdot \mathbf{y}_j}}}^{\mathrm{image} \to \mathrm{text\ softmax}} + \overbrace{\log \frac{e^{t \mathbf{x}_i \cdot \mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t \mathbf{x}_j \cdot \mathbf{y}_i}}}^{\mathrm{text} \to \mathrm{image\ softmax}} \right) \tag{1}$

where $\mathbf{x}_i = \frac{f(I_i)}{\| f(I_i) \|_2}$, $\mathbf{y}_i = \frac{g(T_i)}{\| g(T_i) \|_2}$, and $t$ is the temperature.
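
To make Eq. (1) concrete, here is a minimal pure-Python sketch of the two-way softmax loss. The function names (`info_nce`, `log_softmax_at`, and so on) are made up for illustration, and embeddings are plain lists of floats; a real implementation would use a tensor library.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def log_softmax_at(row, k):
    """log-softmax of `row` evaluated at index k (max-shifted for stability)."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(r - m) for r in row))
    return row[k] - log_sum_exp

def info_nce(img_emb, txt_emb, t):
    """Eq. (1): average of image->text and text->image softmax losses."""
    x = [l2_normalize(v) for v in img_emb]
    y = [l2_normalize(v) for v in txt_emb]
    n = len(x)
    logits = [[t * dot(x[i], y[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        # Normalise over all texts for image i (row of the logit matrix).
        img_to_txt = log_softmax_at(logits[i], i)
        # Normalise over all images for text i (column of the logit matrix).
        txt_to_img = log_softmax_at([logits[j][i] for j in range(n)], i)
        total += img_to_txt + txt_to_img
    return -total / (2 * n)

# Two perfectly aligned, mutually orthogonal pairs give a loss close to 0.
print(info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], 10.0))
```

Note that every row and every column needs its own normalisation pass over the full logit matrix, which is the source of the cost discussed next.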

Drawbacks of InfoNCE

Softmax is numerically unstable, so extra tricks are needed to keep the computation stable. See the appendix for details.
It is computationally expensive. The softmax loss is asymmetric, so normalization has to be done twice (once over texts, once over images), since $\sum_{j=1}^{|\mathcal{B}|} e^{t \mathbf{x}_j \cdot \mathbf{y}_i} \neq \sum_{j=1}^{|\mathcal{B}|} e^{t \mathbf{x}_i \cdot \mathbf{y}_j}$. The stabilization trick adds further computation.
It is memory-hungry. Normalization requires keeping a large matrix of pairwise probabilities in memory. With a batch size of 32k, this matrix has size $32k \times 32k$.
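
A quick back-of-the-envelope check of that last point. This assumes "32k" means 32 × 1024 and that each entry is stored as a 4-byte float32; both are assumptions for illustration, not figures from the paper.

```python
def similarity_matrix_bytes(batch_size, bytes_per_entry=4):
    """Bytes needed to hold a full batch_size x batch_size matrix."""
    return batch_size * batch_size * bytes_per_entry

batch_size = 32 * 1024  # "32k", read as 32 * 1024 (assumption)
size_gib = similarity_matrix_bytes(batch_size) / 2**30
print(f"{batch_size} x {batch_size} float32 matrix: {size_gib} GiB")
# prints: 32768 x 32768 float32 matrix: 4.0 GiB
```

That is the size of one copy of the matrix; gradients and intermediate buffers would come on top of it.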

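Now consider the sigmoid loss proposed in the paper. It is defined as:

$-\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \underbrace{\log \frac{1}{1 + e^{z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)}}}_{\mathcal{L}_{ij}} \tag{2}$

As the formula shows, the sigmoid loss treats every image-text pair independently and turns each pair into its own binary classification problem.

$(I_i, T_i)$ is a positive pair.
$(I_i, T_j)$ with $j \neq i$ is a negative pair.

Here $z_{ij}$ is the label of the pair: 1 for a positive and -1 for a negative. Eq. (2) clearly suffers from class imbalance: with batch size $|\mathcal{B}|$ there are $|\mathcal{B}|$ positives and $|\mathcal{B}|^2 - |\mathcal{B}|$ negatives. To mitigate this, the authors make $t$ and $b$ learnable parameters that modulate the gradients of positive and negative pairs. The temperature is parameterized as $t = \exp(t')$, with initial values $t' = \log 10$ and $b = -10$. The appendix gives a brief analysis of how these two parameters act.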
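
A minimal pure-Python sketch of the pairwise sigmoid loss follows. It uses the convention of the paper's reference pseudocode, `logit = t * x·y + b`, under which the initial $b = -10$ pushes every pair toward the negative class at the start of training. Treat that sign convention as an assumption: Eq. (2) as printed carries $b$ with the opposite sign. The function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def log_sigmoid(a):
    # Stable form: log(sigmoid(a)) = min(a, 0) - log(1 + exp(-|a|)).
    return min(a, 0.0) - math.log1p(math.exp(-abs(a)))

def sigmoid_loss(img_emb, txt_emb, t_prime, bias):
    """Pairwise sigmoid loss; every (i, j) pair is its own binary problem."""
    t = math.exp(t_prime)  # temperature is parameterized as exp(t')
    x = [l2_normalize(v) for v in img_emb]
    y = [l2_normalize(v) for v in txt_emb]
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0          # +1 positive, -1 negative
            logit = t * dot(x[i], y[j]) + bias   # assumed sign convention
            total += log_sigmoid(z * logit)
    return -total / n

# Initial values from the text: t' = log 10, b = -10.
loss = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]],
                    math.log(10), -10.0)
print(loss)
```

Unlike the softmax version, no term needs a sum over the rest of the batch, so the double loop can be split into blocks in any order.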

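In the multi-device setting, the communication strategy in Eq. (3) enables efficient training: each device keeps its local positives and repeatedly swaps chunks of negatives with the other devices, so only one per-device block of the loss is computed at a time.

$-\frac{1}{|\mathcal{B}|} \underbrace{\sum_{d_i=1}^{D}}_{\mathbf{A:}\ \forall\ \mathrm{device}\ d_i} \overbrace{\sum_{d_j=1}^{D}}^{\substack{\mathbf{B:}\ \mathrm{swap\ negs} \\ \mathrm{across\ devices}}} \overbrace{\underbrace{\sum_{i=bd_i}^{b(d_i+1)}}_{\substack{\mathrm{all\ local} \\ \mathrm{positives}}} \underbrace{\sum_{j=bd_j}^{b(d_j+1)}}_{\substack{\mathrm{negs\ from} \\ \mathrm{next\ device}}} \mathcal{L}_{ij}}^{\mathbf{C:}\ \mathrm{per\ device\ loss}} \tag{3}$

Here $D$ is the number of devices and $b = |\mathcal{B}| / D$ is the per-device batch size (not the bias $b$ of Eq. (2)).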
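
The decomposition in Eq. (3) can be checked with a small single-process simulation. This is only a sketch: "devices" are index ranges, the "swap" is a rotation of which text chunk each device looks at, and `pair_term` is a hypothetical stand-in for $\mathcal{L}_{ij}$ using the assumed convention `logit = t * x·y + bias`. Indices are 0-based and the batch size is assumed divisible by the number of devices.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def log_sigmoid(a):
    return min(a, 0.0) - math.log1p(math.exp(-abs(a)))

def pair_term(x_i, y_j, positive, t, bias):
    z = 1.0 if positive else -1.0
    return log_sigmoid(z * (t * dot(x_i, y_j) + bias))

def full_loss(x, y, t, bias):
    """Reference: sum over the whole |B| x |B| grid at once."""
    n = len(x)
    total = sum(pair_term(x[i], y[j], i == j, t, bias)
                for i in range(n) for j in range(n))
    return -total / n

def chunked_loss(x, y, t, bias, num_devices):
    """Same loss, accumulated one per-device block at a time."""
    n = len(x)
    per_dev = n // num_devices
    total = 0.0
    for d_i in range(num_devices):              # A: every device
        rows = range(d_i * per_dev, (d_i + 1) * per_dev)
        for step in range(num_devices):         # B: rotate text chunks
            d_j = (d_i + step) % num_devices
            cols = range(d_j * per_dev, (d_j + 1) * per_dev)
            # C: only a per_dev x per_dev block is needed at this point.
            total += sum(pair_term(x[i], y[j], i == j, t, bias)
                         for i in rows for j in cols)
    return -total / n

rng = random.Random(0)
x = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(8)]
y = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(8)]
print(full_loss(x, y, 10.0, -10.0), chunked_loss(x, y, 10.0, -10.0, 4))
```

The two numbers agree up to floating-point rounding, which is what lets each device hold only a small block instead of the full pairwise matrix.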

That covers the theory of sigmoid-based contrastive learning. The following sections look at the properties of the sigmoid loss from an experimental angle.

3 Experiment

3.1 Setting

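(1) Models

The authors call CLIP trained with the sigmoid loss SigLIP (Sigmoid loss for Language-Image Pre-training), and the variant that combines the sigmoid loss with the LiT [2] setup SigLiT (sigmoid LiT).

(2) Evaluation metrics

The authors evaluate the models mainly with two metrics:

Zero-shot accuracy on ImageNet
Zero-shot cross-modal retrieval accuracy on the multilingual XM3600 dataset

(3) Training dataset

WebLI [9]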

3.2 The influence of batch size

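Prior work [8] showed that larger batch sizes give better results in contrastive learning, but because of cost, the largest batch size studied was 64k. This paper scales the batch size up to 1M. The results show that the gains from further scaling become small once the batch size reaches 32k, and performance peaks at 256k. Building on this, the authors compare how well the sigmoid and softmax losses scale with batch size. The main conclusions are:

The sigmoid loss is more memory-efficient than the softmax loss. With the sigmoid loss, four TPU-v4 chips fit a batch size of 4096; with softmax they fit only 2048.
At smaller batch sizes (below 32k), the sigmoid-based loss clearly outperforms the softmax-based loss. As the batch size grows further, the gap gradually narrows.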

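The authors share two practical tips:

1) When fine-tuning, do not apply weight decay to the image encoder.

2) As the batch size grows, transformer training becomes unstable; setting a smaller $\beta_2$ in the optimizer helps mitigate this.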

3.3 The influence of positive and negative pairs ratio

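The sigmoid loss is computed per pair, so positives and negatives are heavily imbalanced. Take a batch size of $|\mathcal{B}| = 16k$ (16k image-text pairs): there are only $16k$ positive samples but $16k \times 16k - 16k = 16k(16k - 1)$ negative samples, a positive-to-negative ratio of roughly $1 : 16k$.

It is therefore worth studying how this imbalance affects the model. Because the sigmoid loss (Eq. 2) is computed per pair, the ratio of positives to negatives is easy to control by hand. The authors try four ways of adjusting it:

Random: randomly mask out negative samples to reach the target positive-to-negative ratio.
Hard: mask out the negatives with the lowest loss (keeping the hardest ones) to reach the target ratio.
Hard, matched pairs: mask out the lowest-loss negatives as in Hard. Because masking reduces the number of "pairs seen", this variant increases the number of iterations so that pairs seen matches the unmasked baseline (comparable to the common resampling approach).
Easy: mask out the negatives with the highest loss (keeping the easiest ones) to reach the target ratio.

The authors run these four masking schemes on SigLiT, with $|\mathcal{B}| = 16k$ and $\mathrm{Iter} = 900M$.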
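
The three selection rules can be sketched as follows. This is an illustrative helper, not the paper's implementation: `select_negatives`, its arguments, and the idea of passing a flat list of per-pair negative losses are all assumptions made for the example.

```python
import random

def select_negatives(neg_losses, keep, mode, rng=None):
    """Return sorted indices of the negatives that stay unmasked.

    neg_losses: per-pair loss of each negative (higher = harder).
    keep:       how many negatives to keep.
    mode:       "random", "hard" or "easy".
    """
    indices = list(range(len(neg_losses)))
    if mode == "random":
        rng = rng or random.Random(0)
        return sorted(rng.sample(indices, keep))
    if mode == "hard":
        # Mask the lowest-loss negatives, i.e. keep the highest-loss ones.
        ranked = sorted(indices, key=lambda k: neg_losses[k], reverse=True)
    elif mode == "easy":
        # Mask the highest-loss negatives, i.e. keep the lowest-loss ones.
        ranked = sorted(indices, key=lambda k: neg_losses[k])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(ranked[:keep])

losses = [0.1, 2.0, 0.5, 1.5]
print(select_negatives(losses, 2, "hard"))  # [1, 3]
print(select_negatives(losses, 2, "easy"))  # [0, 2]
```

The "matched pairs" variant would use the same "hard" selection and simply train for more iterations, so it needs no separate selection rule.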

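The results show:

Without matched pairs, all three masking schemes reduce accuracy. The size of the drop ranks easy > random > hard.
Hard-negative mining combined with matched pairs further improves model performance.
As the imbalance between positives and negatives decreases, both the learnable bias and the pair logits rise, which suggests that the learnable bias plays a useful role.

Overall, thanks to the learnable temperature and learnable bias, the positive/negative imbalance of the sigmoid loss causes essentially no drop in model performance.

The paper also ablates the initial values of these two parameters; the results indicate that injecting suitable prior knowledge is very effective for model performance.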

3.4 Label noise robustness

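The authors also evaluate how robust the model is to noisy data. The training data is corrupted in the following five ways:

Image: with probability $p$, replace the image of a pair with uniform noise.
Text: with probability $p$, replace the text token sequence of a pair with a randomly sampled token sequence of the same length.
Batch alignment: randomly shuffle the image-text pairing for $p\%$ of the samples in a batch.
Image & text: apply both the Image and the Text corruption.
Image, text & batch: apply the Image & text corruption together with Batch alignment.

The results show that the sigmoid loss performs better on corrupted data.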
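
As a rough sketch of these corruptions on toy data: images are flat lists of floats, texts are lists of token ids, and $p$ is treated as a fraction in $[0, 1]$ for all three switches. The function `corrupt_batch`, its flags, and these data representations are hypothetical and chosen only to illustrate the idea.

```python
import random

def corrupt_batch(images, texts, p, vocab_size, rng,
                  image=False, text=False, batch=False):
    """Return corrupted copies of (images, texts); inputs are left untouched."""
    images = [list(im) for im in images]
    texts = [list(tx) for tx in texts]
    n = len(images)
    if image:
        for k in range(n):
            if rng.random() < p:
                # Replace the image with uniform noise of the same size.
                images[k] = [rng.random() for _ in images[k]]
    if text:
        for k in range(n):
            if rng.random() < p:
                # Replace the tokens with random tokens of the same length.
                texts[k] = [rng.randrange(vocab_size) for _ in texts[k]]
    if batch:
        # Pick a fraction p of the pairs and permute their texts among them.
        chosen = rng.sample(range(n), int(round(p * n)))
        sources = chosen[:]
        rng.shuffle(sources)
        moved = {dst: texts[src] for dst, src in zip(chosen, sources)}
        for dst, tokens in moved.items():
            texts[dst] = tokens
    return images, texts

images = [[0.5, 0.5], [0.1, 0.2], [0.3, 0.4], [0.9, 0.8]]
texts = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
noisy_images, noisy_texts = corrupt_batch(
    images, texts, 0.5, 100, random.Random(0), image=True, text=True, batch=True)
print(noisy_images, noisy_texts)
```

Setting `image=True, text=True` corresponds to "Image & text", and switching on all three flags corresponds to "Image, text & batch".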

4 Summary

Sigmoid-based contrastive learning recasts the classic softmax-based objective from "pick the right class" into "rate this pair". This makes training both compute-efficient and memory-efficient, and the experiments show that SigLIP has the advantage at smaller batch sizes (below 32k).

5 References

[1] Learning Transferable Visual Models From Natural Language Supervision

[2] LiT: Zero-Shot Transfer With Locked-Image Text Tuning

[3] Scaling Language-Image Pre-training via Masking

[4] Masked Autoencoders Are Scalable Vision Learners

[5] CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

[6] Data Filtering Networks

[7] Representation Learning with Contrastive Predictive Coding

[8] Combined Scaling for Zero-shot Transfer Learning

[9] PaLI: A Jointly-Scaled Multilingual Language-Image Model

6 Appendix

6.1 Appendix 1: Softmax overflow

Avoiding overflow

$\begin{aligned} \text{Softmax}(x_{i}) &= \frac{\exp(x_i) }{ \sum_{j=1}^{N} \exp(x_j)} \\ &= \frac{\exp(x_i) / \exp{(x_{max})}}{ \sum_{j=1}^{N} \exp(x_j) / \exp{(x_{max})} } \\ &= \frac{\exp(x_i - x_{max})}{ \sum_{j=1}^{N} \exp(x_j - x_{max})} \end{aligned} \tag{3}$

When $x_{max}$ is much larger than $x_i$, the numerator $\exp(x_i - x_{max})$ can underflow to $0$. Combined with cross entropy this produces $\log(0)$, so the expression should be rewritten as follows.

$\begin{aligned} \log \mathrm{softmax}(x_i) &= \log \Bigl( \frac{\exp(x_i - x_{max})}{\sum_{j=1}^{N} \exp(x_j - x_{max})} \Bigr) \\ &= \log \exp(x_i - x_{max}) - \log \sum_{j=1}^{N} \exp(x_j - x_{max}) \\ &= (x_i - x_{max}) - \log \underbrace{\sum_{j=1}^{N} \exp(x_j - x_{max})}_{\geq 1} \end{aligned} \tag{4}$
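
The max-shifted form of Eq. (4) translates directly into code. A minimal sketch (the function name `log_softmax` is chosen here for illustration):

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax following Eq. (4)."""
    x_max = max(xs)
    shifted = [v - x_max for v in xs]           # all entries are <= 0
    # The sum contains exp(0) = 1 for the maximum, so it is >= 1
    # and the logarithm is always well defined.
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

# A naive exp(1000.0) would overflow; the shifted version does not.
print(log_softmax([1000.0, 1000.0]))
# Underflow in the small entry no longer leads to log(0).
print(log_softmax([0.0, -1000.0]))
```

The design point is that the subtraction happens in log space, so the tiny probability is never formed and then passed through a logarithm.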

6.2 Appendix 2: Gradient analysis of the sigmoid loss

$\begin{aligned} \mathcal{L} &= -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \frac{1}{1 + e^{z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)}} \\ &= -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \mathrm{Sigmoid}(-z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)) \end{aligned} \tag{5}$

Gradient

$\mathrm{Sigmoid}(x)^{\prime} = \mathrm{Sigmoid}(x) (1 - \mathrm{Sigmoid}(x)) \tag{6}$

Only the terms of Eq. (5) with row index $i$ depend on $\mathbf{x}_i$, so:

$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{x}_i} &= -\frac{1}{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \frac{\partial \log \mathrm{Sigmoid}(-z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b))}{\partial \mathbf{x}_i} \\ &= -\frac{1}{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \frac{\mathrm{Sigmoid}(-z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)) \, (1 - \mathrm{Sigmoid}(-z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)))}{\mathrm{Sigmoid}(-z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b))} \, (z_{ij} t) \, \mathbf{y}_j \\ &= -\frac{t}{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \underbrace{z_{ij} \, (1 - \mathrm{Sigmoid}(-z_{ij}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)))}_{\mathrm{coef}} \, \mathbf{y}_j \end{aligned} \tag{7}$

For the positive pair ($j = i$, $z_{ij} = 1$), the contribution to the gradient is

$\left. \frac{\partial \mathcal{L}}{\partial \mathbf{x}_i} \right|_{j = i} = -\frac{t}{|\mathcal{B}|} (1 - \mathrm{Sigmoid}(t \mathbf{x}_i \cdot \mathbf{y}_i - b)) \, \mathbf{y}_i \tag{8}$

For the negative pairs ($j \neq i$, $z_{ij} = -1$), the contribution is

$\left. \frac{\partial \mathcal{L}}{\partial \mathbf{x}_i} \right|_{j \neq i} = \frac{t}{|\mathcal{B}|} \sum_{j \neq i} (1 - \mathrm{Sigmoid}(-t \mathbf{x}_i \cdot \mathbf{y}_j + b)) \, \mathbf{y}_j \tag{9}$
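
The gradient above can be sanity-checked numerically. The sketch below takes the sigmoid argument exactly as written in Eq. (5), i.e. $z_{ij}(t \mathbf{x}_i \cdot \mathbf{y}_j - b)$, treats $t$ and $b$ as constants and $\mathbf{x}_i$ as a free vector (no re-normalization), and compares the closed form with central finite differences. The function names and the toy values are made up for the check.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))  # fine for the moderate values used here

def loss(x, y, t, b):
    """Eq. (5), with the sigmoid argument z_ij * (t * x_i.y_j - b)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total += math.log(sigmoid(z * (t * dot(x[i], y[j]) - b)))
    return -total / n

def analytic_grad(x, y, t, b, i):
    """Closed-form gradient of the loss with respect to x_i."""
    n = len(x)
    grad = [0.0] * len(x[i])
    for j in range(n):
        z = 1.0 if i == j else -1.0
        coef = z * (1.0 - sigmoid(z * (t * dot(x[i], y[j]) - b)))
        for c in range(len(grad)):
            grad[c] += -(t / n) * coef * y[j][c]
    return grad

def numeric_grad(x, y, t, b, i, eps=1e-6):
    """Central finite differences on each coordinate of x_i."""
    grad = []
    for c in range(len(x[i])):
        x_plus = [row[:] for row in x]
        x_minus = [row[:] for row in x]
        x_plus[i][c] += eps
        x_minus[i][c] -= eps
        grad.append((loss(x_plus, y, t, b) - loss(x_minus, y, t, b)) / (2 * eps))
    return grad

x = [[0.6, 0.8], [1.0, 0.0]]
y = [[0.8, 0.6], [0.0, 1.0]]
print(analytic_grad(x, y, 2.0, 0.5, 0))
print(numeric_grad(x, y, 2.0, 0.5, 0))
```

The two printed vectors should agree to several decimal places. The positive pair contributes a term along $-\mathbf{y}_i$ (pulling $\mathbf{x}_i$ toward its text) and each negative a term along $+\mathbf{y}_j$, both scaled by $t$.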