[TPAMI 2024]Vision-Language Models for Vision Tasks: A Survey

②Content: a) background of VLMs in visual tasks, b) foundations of VLMs, c) datasets, d) pre-training, transfer learning and knowledge distillation methods for VLMs, e) benchmarks, f) challenges

laborious adj. requiring much effort; arduous

2.2. Introduction

①New paradigm: Pre-training (on large-scale data, with or without labels), Fine-tuning (on task-specific labelled training data), and Prediction, see (a) and (b):

②Vision-Language Model Pre-training and Zero-shot Prediction, which does not need fine-tuning:

③VLM publication number on Google Scholar:

frisbee n. a flying disc thrown in games

2.3. Background

2.3.1. Training Paradigms for Visual Recognition

（1）Traditional Machine Learning and Prediction

①Mostly hand-crafted and lightweight, but hard to cope with complex or multiple tasks

②Poor scalability

（2）Deep Learning From Scratch and Prediction

①Slow convergence when training from scratch

②A large amount of labelled data is needed

（3）Supervised Pre-Training, Fine-Tuning and Prediction

①Speed up convergence

（4）Unsupervised Pre-Training, Fine-Tuning & Prediction

①Does not require labelled data

②Better performance from learning on more samples

（5）VLM Pre-Training and Zero-Shot Prediction

①Discarding fine-tuning

②Future directions: a) large scale informative image-text data, b) high-capacity models, c) new pre-training objectives

2.3.2. Development of VLMs for Visual Recognition

①3 improvements to VLMs:

2.3.3. Relevant Surveys

①Framework of their review:

2.4. VLM Foundations

2.4.1. Network Architectures

①Number of image-text pairs: $N$

②Dataset of image-text pairs: $\mathcal{D}=\left \{ x^I_n, x^T_n\right \}^N_{n=1}$ , where superscript $I$ denotes an image sample and superscript $T$ denotes a text sample

③Image encoder and text encoder in DNN: $f_\theta$ / $f_\phi$

④Encoding operation: $z_n^I=f_\theta(x_n^I)$ and $z_n^T=f_\phi(x_n^T)$

（1）Architectures for Learning Image Features

①CNN-based architectures: such as VGG, ResNet and EfficientNet

②Transformer-based architectures: such as ViT

（2）Architectures for Learning Language Features

①The standard Transformer framework: 6 blocks in the encoder (each with a multi-head self-attention layer and an MLP) and 6 blocks in the decoder (each with a multi-head attention layer, a masked multi-head attention layer and an MLP)

2.4.2. VLM Pre-Training Objectives

（1）Contrastive Objectives

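①Image Contrastive Learning: pulls the query close to its positive keys and pushes it far from its negative keys in the embedding space. For a batch of $B$ images (the authors phrase this in terms of batch size, which is close to how the code is written; conceptually, just read it as the total number of samples), the loss is typically: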

$\mathcal{L}_I^\mathrm{InfoNCE}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp{(z_i^I\cdot z_+^I/\tau)}}{\sum_{j=1,j\neq i}^{B+1}\exp(z_i^I\cdot z_j^I/\tau)}$

where $z_i^I$ denotes the query embedding, $\{z_j^I\}_{j=1,j\neq i}^{B+1}$ denotes the key embeddings, $z_+^I$ denotes the positive key of $z_i^I$, and $\tau$ denotes the temperature hyper-parameter
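
The per-query term of this loss can be sketched in plain Python. This is a minimal illustration under assumptions, not the survey's code: the helper names `dot` and `info_nce`, the toy 2-D embeddings and the default `tau=0.07` are all made up for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE for one query: negative log-softmax score of the positive key.

    The denominator runs over all keys, i.e. the positive plus the negatives.
    """
    logits = [dot(query, positive) / tau] + [dot(query, k) / tau for k in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy embeddings (hypothetical values): the query is close to its positive key.
query = [1.0, 0.0]
close_key = [0.9, 0.1]
far_key = [0.1, 0.9]

loss_when_positive_is_close = info_nce(query, close_key, [far_key])
loss_when_positive_is_far = info_nce(query, far_key, [close_key])
```

Averaging `info_nce` over the $B$ queries of a batch gives the batch-level loss; the loss is small when the positive key is the most similar key and large otherwise.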

②Image-Text Contrastive Learning: pulls paired embeddings closer and pushes the others away:

$\begin{gathered} \mathcal{L}_{I\to T} =-\frac1B\sum_{i=1}^B\log\frac{\exp{(z_i^I\cdot z_i^T/\tau)}}{\sum_{j=1}^B\exp(z_i^I\cdot z_j^T/\tau)} \\ \mathcal{L}_{T\to I} =-\frac1B\sum_{i=1}^B\log\frac{\exp{(z_i^T\cdot z_i^I/\tau)}}{\sum_{j=1}^B\exp(z_i^T\cdot z_j^I/\tau)}\\ \mathcal{L}_{\mathrm{infoNCE}}^{IT}=\mathcal{L}_{I\to T}+\mathcal{L}_{T\to I} \end{gathered}$

where $\mathcal{L}_{I\to T}$ denotes contrasting the query image with the text keys, $\mathcal{L}_{T\to I}$ denotes contrasting the query text with image keys
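
A minimal sketch of the symmetric loss, assuming the $i$-th image and the $i$-th text in a batch form the matched pair. The function names, the toy embeddings and `tau=0.07` are illustrative assumptions.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def image_text_contrastive(img_emb, txt_emb, tau=0.07):
    """Sum of the image-to-text and text-to-image InfoNCE losses."""
    batch = len(img_emb)
    # sim[i][j] = z_i^I . z_j^T / tau
    sim = [[sum(a * b for a, b in zip(zi, zt)) / tau for zt in txt_emb] for zi in img_emb]
    sim_t = [list(col) for col in zip(*sim)]  # transposed: text queries, image keys
    loss_i2t = -sum(log_softmax_row(sim[i])[i] for i in range(batch)) / batch
    loss_t2i = -sum(log_softmax_row(sim_t[i])[i] for i in range(batch)) / batch
    return loss_i2t + loss_t2i

# Toy batch (hypothetical values): pair i is aligned in the first case only.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = image_text_contrastive(images, matched_texts)
shuffled_loss = image_text_contrastive(images, shuffled_texts)
```

The two directions share one similarity matrix; only the axis along which the softmax is taken changes.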

③Image-Text-Label Contrastive Learning: brings supervised class labels into image-text contrastive learning:

$\begin{gathered} \mathcal{L}_{I\to T}^{ITL} =-\sum_{i=1}^B\frac1{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp{(z_i^I\cdot z_k^T/\tau)}}{\sum_{j=1}^B\exp(z_i^I\cdot z_j^T/\tau)} \\ \mathcal{L}_{T\to I}^{ITL} =-\sum_{i=1}^B\frac1{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp{(z_i^T\cdot z_k^I/\tau)}}{\sum_{j=1}^B\exp(z_i^T\cdot z_j^I/\tau)}\\ \mathcal{L}_{\mathrm{infoNCE}}^{ITL}=\mathcal{L}_{I\to T}^{ITL}+\mathcal{L}_{T\to I}^{ITL} \end{gathered}$

where $k\in\mathcal{P}(i)=\{k|k\in B,y_k=y_i\}$ and $y$ denotes the class label of $(z^I,z^T)$ (in effect, this adds an extra loop over the samples that share the same class)
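
The extra loop over same-class samples can be made explicit in a short sketch. The function names and toy inputs are assumptions for illustration; like the formula above, the sketch sums over the batch without dividing by $B$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def directional_itl(queries, keys, labels, tau):
    """One direction of the image-text-label loss (queries contrasted with keys)."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # P(i): every sample in the batch that shares the label of sample i.
        positives = [k for k in range(len(keys)) if labels[k] == labels[i]]
        total -= sum(logits[k] - log_denominator for k in positives) / len(positives)
    return total

def image_text_label_contrastive(img_emb, txt_emb, labels, tau=0.07):
    """Image-to-text plus text-to-image terms."""
    return (directional_itl(img_emb, txt_emb, labels, tau)
            + directional_itl(txt_emb, img_emb, labels, tau))

# Toy batch (hypothetical values): both samples belong to class 0.
same_class_emb = [[1.0, 0.0], [1.0, 0.0]]
same_class_loss = image_text_label_contrastive(same_class_emb, same_class_emb, [0, 0])
```

When every label in the batch is distinct, $\mathcal{P}(i)=\{i\}$ and each term reduces to the plain image-text contrastive term.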

（2）Generative Objectives

①Masked Image Modelling: learns cross-patch correlation by masking a set of patches and reconstructing the image. The loss is usually:

$\mathcal{L}_{MIM}=-\frac1B\sum_{i=1}^B\log f_\theta(\overline{x}_i^I\mid\hat{x}_i^I)$

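where $\overline{x}_i^I$ denotes the masked patches and $\hat{x}_i^I$ denotes the unmasked patches (the "|" is a conditional: $f_\theta(\overline{x}_i^I\mid\hat{x}_i^I)$ is the likelihood of the masked patches given the unmasked ones, so the model reconstructs what is hidden from what is visible, and the direction is not reversed)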

②Masked Language Modelling: masks text tokens at a given ratio and reconstructs them from the unmasked tokens:

$\mathcal{L}_{MLM}=-\frac1B\sum_{i=1}^B\log f_\phi(\overline{x}_i^T\mid\hat{x}_i^T)$
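
A minimal sketch of the masking step and the per-sample loss, under stated assumptions: the `[MASK]` placeholder, the helper names and the uniform `toy_predictor` are hypothetical stand-ins for a real text encoder $f_\phi$. The same pattern applies to image patches in masked image modelling.

```python
import math
import random

MASK = "[MASK]"
TOKENS = ["a", "dog", "catches", "a", "frisbee", "outside"]
VOCAB = sorted(set(TOKENS))

def mask_tokens(tokens, ratio, rng):
    """Replace a fraction `ratio` of the tokens with MASK; return (visible, positions)."""
    n_masked = max(1, round(ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_masked))
    visible = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    return visible, masked_positions

def toy_predictor(visible, position):
    """Hypothetical model: a uniform distribution over the vocabulary."""
    return {word: 1.0 / len(VOCAB) for word in VOCAB}

def mlm_loss(tokens, visible, masked_positions, predictor):
    """Negative log-likelihood of the masked tokens given the visible ones."""
    loss = 0.0
    for position in masked_positions:
        probs = predictor(visible, position)
        loss -= math.log(probs[tokens[position]])
    return loss

visible, masked_positions = mask_tokens(TOKENS, ratio=0.5, rng=random.Random(0))
loss = mlm_loss(TOKENS, visible, masked_positions, toy_predictor)
```

A trained model would assign higher probability to the true tokens than the uniform baseline, which lowers the loss.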

③Masked Cross-Modal Modelling: randomly masks a subset of image patches and a subset of text tokens, then reconstructs them from the unmasked patches and tokens:

$\mathcal{L}_{MCM}=-\frac{1}{B}\sum_{i=1}^{B}[\log f_{\theta}(\overline{x}_{i}^{I}|\hat{x}_{i}^{I},\hat{x}_{i}^{T})+\log f_{\phi}(\overline{x}_{i}^{T}|\hat{x}_{i}^{I},\hat{x}_{i}^{T})]$

④Image-to-Text Generation: predicts the text autoregressively, conditioned on the paired image:

$\mathcal{L}_{ITG}=-\sum_{l=1}^L \log f_\theta(x^T\mid x_{<l}^T,z^I)$

where $L$ denotes the number of tokens, $z^I$ is the embedding of the image paired with $x^T$
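
The autoregressive sum can be sketched as follows. The sketch assumes that the $l$-th term scores the $l$-th token given the earlier tokens and the image embedding; `toy_decoder` is a hypothetical stand-in that ignores its inputs and returns a uniform distribution.

```python
import math

TOY_VOCAB = ["a", "dog", "frisbee", "<eos>"]

def toy_decoder(prefix_tokens, image_embedding):
    """Hypothetical decoder: uniform next-token distribution over a tiny vocabulary."""
    return {word: 1.0 / len(TOY_VOCAB) for word in TOY_VOCAB}

def itg_loss(tokens, image_embedding, decoder):
    """Sum over positions of -log p(token_l | tokens before l, image embedding)."""
    loss = 0.0
    for l, token in enumerate(tokens):
        probs = decoder(tokens[:l], image_embedding)
        loss -= math.log(probs[token])
    return loss

caption = ["a", "dog", "<eos>"]
toy_image_embedding = [0.2, 0.5]  # stands in for z^I
caption_loss = itg_loss(caption, toy_image_embedding, toy_decoder)
```

Each step only sees the prefix `tokens[:l]`, which is what makes the objective autoregressive.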

（3）Alignment Objectives

①Image-Text Matching: models the global image-text correlation with a binary cross-entropy (BCE) style objective:

$\mathcal{L}_{IT}=p\log\mathcal{S}(z^I,z^T)+(1-p)\log(1-\mathcal{S}(z^I,z^T))$

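where $\mathcal{S}\left ( \cdot \right )$ measures the alignment probability between the image and the text, and $p=1$ when they match and 0 otherwise (as written, the expression is a log-likelihood; the quantity to minimize is its negative)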

②Region-Word Matching: model local cross-modal correlation in dense scenes:

$\mathcal{L}_{RW}=p\log\mathcal{S}^r(r^I,w^T)+(1-p)\log(1-\mathcal{S}^r(r^I,w^T))$

where $(r^I,w^T)$ denotes a region-word pair, and $p=1$ when the region and the word match and 0 otherwise
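
Both matching objectives share the same binary form, so one sketch covers them. The score function here (a sigmoid of the dot product) is an assumption made for illustration, and the loss is written with a leading minus sign so that lower is better.

```python
import math

def alignment_probability(emb_a, emb_b):
    """Hypothetical score S: sigmoid of the dot product of two embeddings."""
    logit = sum(a * b for a, b in zip(emb_a, emb_b))
    return 1.0 / (1.0 + math.exp(-logit))

def matching_loss(score, p):
    """Binary cross-entropy: p is 1 for a matched pair and 0 otherwise."""
    return -(p * math.log(score) + (1 - p) * math.log(1 - score))

# Toy pair (hypothetical values): an image (or region) and a text (or word) embedding.
pair_score = alignment_probability([1.0, 2.0], [0.5, 0.5])
loss_if_matched = matching_loss(pair_score, p=1)
loss_if_unmatched = matching_loss(pair_score, p=0)
```

For region-word matching, the same `matching_loss` is applied to region and word embeddings instead of whole-image and whole-text embeddings.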

2.4.3. VLM Pre-Training Frameworks

①Two-tower, two-leg and one-tower pre-training frameworks:

2.4.4. Evaluation Setups and Downstream Tasks

（1）Zero-Shot Prediction

①Image Classification: apply prompt engineering and compare embeddings of images and texts

②Semantic Segmentation: compare the embeddings of the given image pixels and texts

③Object Detection: compare the embeddings of the given object proposals and texts

④Image-Text Retrieval: retrieve the demanded samples from one modality given cues from the other modality, either text-to-image or image-to-text
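
All four zero-shot tasks above reduce to ranking candidates by embedding similarity. The sketch below shows this for image classification; the prompt template and the toy vectors are assumptions that stand in for the outputs of real encoders $f_\theta$ and $f_\phi$.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(query_emb, candidate_embs):
    """Return candidate names sorted from most to least similar to the query."""
    scores = {name: cosine(query_emb, emb) for name, emb in candidate_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

class_names = ["dog", "cat", "frisbee"]
# Prompt engineering: each class name is wrapped in a template before text encoding.
prompts = {name: f"a photo of a {name}." for name in class_names}

# Toy vectors standing in for the encoded prompts (hypothetical values).
toy_text_embs = {"dog": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0], "frisbee": [0.0, 0.0, 1.0]}
# Toy vector standing in for the encoded image (hypothetical values).
toy_image_emb = [0.8, 0.1, 0.3]

ranking = rank_by_similarity(toy_image_emb, toy_text_embs)
predicted_class = ranking[0]
```

Image-to-text retrieval uses the full `ranking`; text-to-image retrieval swaps the roles, with a text embedding as the query and image embeddings as the candidates.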

（2）Linear Probing

①Freeze the pre-trained VLM → extract embeddings → train a linear classifier on these embeddings
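
A minimal sketch of the last step, assuming the frozen VLM has already produced the embeddings. It trains a binary logistic-regression probe with plain stochastic gradient descent; the toy embeddings, learning rate and epoch count are illustrative assumptions, and a real probe would usually be multi-class.

```python
import math

def train_linear_probe(embeddings, labels, epochs=200, lr=0.5):
    """Fit weights w and bias b of a logistic-regression classifier with SGD."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(w, b, x):
    """Return class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy frozen embeddings (hypothetical values) for two classes.
frozen_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

w, b = train_linear_probe(frozen_embeddings, labels)
predictions = [predict(w, b, x) for x in frozen_embeddings]
```

Only `w` and `b` are updated; the embeddings are treated as fixed inputs, which is what makes the probe a measure of the representation quality of the VLM.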

2.5. Datasets

①Widely Used Image-Text Datasets for VLM Pre-Training:

②Widely-Used Visual Recognition Datasets for VLM Evaluation:

2.5.1. Datasets for Pre-Training VLMs

①Collection of image-text data is easier and cheaper than traditional crowd-labelled data

②⭐Some studies use auxiliary datasets to provide additional information for better vision-language modelling; for example, GLIP leverages Objects365 to extract region-level features

2.5.2. Datasets for VLM Evaluation

①Counts the evaluation datasets by task type

2.6. Vision-Language Model Pre-Training

①Vision-Language Model Pre-Training Methods:

2.6.1. VLM Pre-Training With Contrastive Objectives

（1）Image Contrastive Learning

①e.g. SLIP utilizes infoNCE loss to learn the discriminative image features

（2）Image-Text Contrastive Learning

①Learns the correlation between paired images and texts, pulling matched pairs together and pushing unmatched pairs apart:

（3）Image-Text-Label Contrastive Learning

①Encodes images, texts and labels into one shared space:

（4）Discussion

①Challenge 1: Jointly optimizing positive and negative pairs is complicated and challenging

②Challenge 2: Heuristic temperature hyper-parameter selection
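
To see why the temperature matters, the sketch below applies a softmax to the same similarities at several values of $\tau$. The similarity values and the temperatures tried are arbitrary assumptions chosen for illustration.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax of similarities / tau, computed in a numerically stable way."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

similarities = [0.9, 0.7, 0.1]  # hypothetical query-key similarities
for tau in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(similarities, tau)
    print(tau, [round(p, 3) for p in probs])
```

A small $\tau$ concentrates almost all probability on the most similar key, while a large $\tau$ flattens the distribution, so the contrastive loss behaves quite differently depending on this single hand-picked value.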

2.6.2. VLM Pre-Training With Generative Objectives

（1）Masked Image Modelling

①Image patches mask strategy:

（2）Masked Language Modelling

①Text mask strategy:

（3）Masked Cross-Modal Modelling

①Mask image and text at the same time

（4）Image-to-Text Generation

①Encode images and then decode them to match the texts

（5）Discussion

①Generative objectives learn rich vision, language and cross-modal context information

2.6.3. VLM Pre-Training With Alignment Objectives

（1）Image-Text Matching

①Match image and text pairs

（2）Region-Word Matching

①Match region and text pairs:

（3）Discussion

①Alignment objectives are usually adopted as auxiliary losses that enhance the modelling of cross-modal correlation

2.6.4. Summary and Discussion

①Recent VLM pre-training either learns global vision-language correlation or models local fine-grained vision-language correlation via region-word matching

2.7. VLM Transfer Learning

2.7.1. Motivation of Transfer Learning

①Challenges for pre-trained VLMs: a) different downstream distributions, b) different downstream tasks

2.7.2. Common Setup of Transfer Learning

①Unsupervised transfer (no labelled downstream data) is more efficient and promising than the supervised setups

2.7.3. Common Transfer Learning Methods

①3 types of VLM transfer models:

（1）Transfer Via Prompt Tuning

①Transfer with Text Prompt Tuning:

②Transfer with Visual Prompt Tuning:

③Transfer with Text-Visual Prompt Tuning: tune image and text together

④Discussion: the main challenge of prompt tuning is its low flexibility, since prompting follows the manifold (distribution) of the original VLM

（2）Transfer Via Feature Adaptation

①Fine-tunes the features with an additional lightweight feature adapter:

However, feature adaptation has to modify the network architecture, so it cannot handle VLMs that have intellectual property concerns
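
A minimal sketch of one possible adapter design, assumed here for illustration in the style of CLIP-Adapter: a small bottleneck MLP whose output is blended with the frozen feature through a residual ratio `alpha`. The weight shapes, the values and `alpha=0.2` are hypothetical.

```python
def linear(weights, vector):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    return [max(0.0, x) for x in vector]

def adapt(feature, w_down, w_up, alpha=0.2):
    """Blend the adapted feature with the original frozen feature."""
    hidden = relu(linear(w_down, feature))  # project down to the bottleneck
    adapted = relu(linear(w_up, hidden))    # project back to the feature size
    return [alpha * a + (1 - alpha) * f for a, f in zip(adapted, feature)]

# Toy frozen feature and adapter weights (hypothetical values).
frozen_feature = [1.0, 2.0]
w_down = [[0.5, 0.5]]    # 2 -> 1
w_up = [[1.0], [0.0]]    # 1 -> 2

adapted_feature = adapt(frozen_feature, w_down, w_up)
```

Only `w_down` and `w_up` would be trained on the downstream task; with `alpha=0.0` the adapter is bypassed and the frozen feature is returned unchanged.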

（3）Other Transfer Methods

①Lists other methods

2.7.4. Summary and Discussion

①2 main approaches to VLM transfer learning: prompt tuning and feature adaptation

2.8. VLM Knowledge Distillation

2.8.1. Motivation of Distilling Knowledge From VLMs

①VLM knowledge distillation distils general and robust VLM knowledge to task-specific models without the restriction of VLM architecture
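
As a rough illustration of the idea, the sketch below uses an L1 embedding-matching loss, one common distillation choice that is assumed here rather than taken from these notes. The student can have any architecture, as long as its output embedding has the same size as the embedding of the teacher VLM; all names and values are hypothetical.

```python
def l1_distillation_loss(student_embedding, teacher_embedding):
    """Mean absolute difference between student and teacher embeddings."""
    differences = [abs(s - t) for s, t in zip(student_embedding, teacher_embedding)]
    return sum(differences) / len(differences)

# Toy embeddings (hypothetical values).
teacher_embedding = [0.2, 0.8, -0.1]    # from the frozen VLM
student_embedding = [0.0, 1.0, 0.1]     # from the task-specific model

distillation_loss = l1_distillation_loss(student_embedding, teacher_embedding)
```

Minimizing this loss pulls the student embedding towards the teacher embedding without touching the teacher weights.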

intact adj. complete; undamaged

2.8.2. Common Knowledge Distillation Methods

（1）Knowledge Distillation for Object Detection

①Introduces basic and prompt-based knowledge distillation for open-vocabulary object detection

（2）Knowledge Distillation for Semantic Segmentation

①Similarly, basic and weakly supervised distillation methods

2.8.3. Summary and Discussion

①More flexible than transfer learning

2.9. Performance Comparison

2.9.1. Performance of VLM Pre-Training

①Performance comparison on image classification:

②Data and model size test:

③The main source of VLM advantages: a) large samples, b) large model, c) task-agnostic learning

④Segmentation performance:

⑤Detection performance:

⑥Limitations: a) performance saturates as the model scale keeps growing, b) heavy computing costs in pre-training, c) excessive computation and memory overheads in both training and inference