[ICLR 2023] LPT: Long-tailed Prompt Tuning for Image Classification

最新推荐文章于 2023-09-30 20:02:44 发布

连理o

最新推荐文章于 2023-09-30 20:02:44 发布

阅读量1.8k

点赞数 1

分类专栏：长尾问题文章标签： ICLR 2023

本文链接：https://blog.csdn.net/weixin_42437114/article/details/130936403

版权

长尾问题专栏收录该内容

4 篇文章 1 订阅

订阅专栏

Introduction
Preliminary Study
- Performance Investigation of VPT (Visual Prompt Tuning)
- Analysis of Prompt Tuning
Long-tailed Prompt Tuning (LPT)
Experiments
References

Introduction

作者提出 Long-tailed Prompt Tuning (LPT)，通过 prompt learning 来解决长尾问题，包括 (1) 使用 shared prompt 学习 general features 并将预训练模型 adapt 到 target domain；(2) 使用 group-specific prompts 学习 group-specific features 来提高模型的 fine-grained discriminative ability

在这里插入图片描述

Preliminary Study

Performance Investigation of VPT (Visual Prompt Tuning)

作者首先通过对比 VPT (Visual Prompt Tuning) 和 linear probing 在 Places-LT 数据集上的精度来说明 prompt tuning 对长尾数据集是有效的 (VPT 的输入为 input tokens 加上 learnable prompts (tokens)，同时和 linear probing 一样在预训练模型最后加上 linear classifier)
从下表中可以看出：a) prompt tuning 可以持续提高模型的 LTR 性能；b) prompt tuning 对长尾分布具有鲁棒性，能更好地学习尾部类别。同时也可以注意到，简单的 prompt tuning 并不能直接让模型在长尾数据集上达到 SOTA

Analysis of Prompt Tuning

作者接下来分析了为什么 prompt tuning 适合长尾识别 (但仍然没有从原理上分析为什么)
由下图的 LDA 可视化可以看出 (use the pretrained ViT-B and the ViT-B fine-tuned by VPT on Places-LT to extract features of ImageNet val set and Places-LT val set)，prompt tuning 可以很好地将下游任务数据分布 (Places-LT) 和预训练数据分布 (ImageNet) 对齐，可以更好地让预训练模型 adapt 到长尾任务的 target domain (from domain adaptation perspective)
作者计算了 ViT-B 和 VPT 输出特征的平均类内距离、平均类间距离以及两者之商 $\gamma$ ，可以看到，VPT 的平均类内距离和 $\gamma$ 都更小，KNN 分类准确率更高，说明 VPT 输出的特征更具有区分度

Long-tailed Prompt Tuning (LPT)

在这里插入图片描述

Phase 1: Shared Prompt Tuning

在这里插入图片描述

类似于 VPT-Deep，给 ViT 的 $L$ 层都各自加上额外的 prompts，因此 phase 1 需要优化 shared prompt $\mathbf u=[\mathbf u_1,...,\mathbf u_L]$ 和 cosine classifier $f$ ，其中 shared prompt 用于学习所有类别的共同特征，并带来了上节讨论的 prompt tuning 的各种好处，包括 domain adaptation 和输出更具区分度的特征
每层里的前向过程为
其中， $\mathbf c$ 为 [CLS]， $\mathbf z$ 为 token embed. 新添加的 prompts 不需要计算对应的自注意力输出，只需要作为 key 和 value 与 token embed 做交互即可
损失函数为

Phase 2: Group Prompts Tuning

在这里插入图片描述

作者在 phase 2 加入了 $m$ 组 group-specific prompts $\mathcal R=\{(\mathbf k_1,\mathbf r^1),...,(\mathbf k_m,\mathbf r^m)\}$ 用于学习 group-specific knowledge 从而增强模型的 fine-grained discriminative ability，其中 $\mathbf k_i$ 为 $i$ -th group 的 key， $\mathbf r^i$ 为 $i$ -th group 的 prompts，包含 $L - K$ 个 prompt 序列 (只在后 $L - K$ 层使用 group-specific prompts).
Phase 2 包含两个步骤：(1) 冻住 shared prompts，经过 $L$ 层推理得到 $\mathbf c_L$ 作为 query $\mathbf q$ 与 $m$ 个 keys 计算余弦相似度，选出相似度最高的 $k$ 个 groups
然后对选出的 $k$ 个 groups 的 prompts 进行 prompt ensembling
(2) 重新使用步骤 (1) 在前向传播中得到的 $(\mathbf c_K,\mathbf z_K)$ ，在后 $L - K$ 层重新进行前向传播，每层的输入包括 [CLS] embed $\mathbf c$ 、patch embed $\mathbf z$ 、shared prompt $\mathbf u$ 和 group-specific prompt $\mathbf r$ ，每层里的前向过程为
损失函数为
其中， $\beta$ 为 scale factor，第二项损失函数被用于增大 $\mathbf q$ 和其匹配的 $k$ 个 groups 的 keys 之间的余弦相似度，这是由于 Phase 1 生成的特征已经比较 compact 并且在 Phase 2 是不变的，因此该损失项可以使得 keys 靠近特征空间中的不同聚类中心，使得不同 groups 对应不同的 group-specific feature
Dual Sampling. class-balanced sampling 和 instance-balanced sampling 分别容易使得模型对尾部和头部类别过拟合，作者采用 Dual Sampling，从 instance-balanced sampler 和 class-balanced sampler 分别采样一个 mini-batch $\{\mathbf I\}_{\text{ins}}$ 和 $\{\mathbf I\}_{\text{bal}}$ . $\{\mathbf I\}_{\text{bal}}$ 的损失函数对应 $\beta=1$ 时的 $\mathcal L_{\mathbf P_2}$ ， $\{\mathbf I\}_{\text{ins}}$ 的损失函数对应 $\beta=\eta(E-e)/E$ 时的 $\mathcal L_{\mathbf P_2}$ ，其中 $\eta=0.5$ 为 initialized weight， $E$ 为总的训练 epoch 数， $e$ 为当前 epoch 数

Loss Function

phase 1/2 中使用的 $\mathcal L_{\text{cls}}$ 采用 asymmetric GCL loss $\mathcal L_{\text{A-GCL}}$ .
首先根据 GCL 对 logits $\hat {\mathbf s}$ 进行加上 bias 和 rescale
其中， $\alpha$ 为 scaling factor， $\epsilon$ 为从高斯分布中采样的随机变量 ( $\|\epsilon\|$ 为取绝对值)， $n_i$ 为训练集中类别 $i$ 的样本数， $n_{max}$ 为训练集中的最大类别样本数. 对应的 per-class probability 为
然后根据 ASL 进行 Asymmetric Focusing
$\mathcal{L}_{\mathrm{A}-\mathrm{GCL}}=-\mathbf y_{\mathrm j}\left(1-\mathbf{p}_{\mathrm{j}}\right)^{\lambda_{+}} \log \left(\mathbf{p}_{\mathrm{j}}\right)-\sum_{1 \leq \mathrm{i} \leq \mathrm{C}, \mathrm{i} \neq \mathrm{j}}\mathbf y_{\mathrm i}\left(\mathbf{p}_{\mathrm{i}}\right)^{\lambda_{-}} \log \left(\mathbf{p}_{\mathrm{i}}\right)$ 其中， $j$ 为输入样本的标签类别， $λ_+=0,λ_−=4$ 为 focusing parameter， $\mathbf y$ 为 label smoothing 后的类别标签向量，即 $\mathbf y_{\mathrm j}=0.9+0.1/C,\mathbf y_{\mathrm i}=0.1/C$ (疑问：ASL 本来是 BCE 上用的，但这里是 CE + label smoothing 之后再加上 ASL 的动态加权， $\left(1-\mathbf{p}_{\mathrm{j}}\right)^{\lambda_{+}}$ 的意义和 ASL 一样，都是筛选出难样本，但感觉 $\left(\mathbf{p}_{\mathrm{i}}\right)^{\lambda_{-}}$ 的意义已经和 ASL 完全不同了，可以等进一步理解 label smoothing 为什么有用之后再来看)

Experiments

Model. ViT-B/16 with ImageNet-21k pretrained model.
Shared Prompt. default length of prompt as 10.
Group-specific Prompts. shared layer number $K = 6$ and the size of prompt size $m = 20$ ; for each prompt in the set, the prompt length is also set as 10 (Note that setting $K = 6$ may lead to 1.5x inference cost compared to VPT). prompt ensemble number $k = 2$ .

Comparison with State-of-The-Art Methods

Comparison on Places-LT.
Comparison on CIFAR100-LT.
Comparison on iNaturalist 2018.

Robustness with Domain Shift

在这里插入图片描述

Ablation Study

Different Model Size and Pretrained Models.
Effect of Each Phase.
Decoupled Training. during joint training, the shared prompt is still updated simultaneously, thus the query function is sub-optimal during training, resulting in worse matching results.
Query Function and Group Size $m$ .
when we further increase the size to 40, the final accuracy declines to 49.87%. A possible reason is that, some classes in the dataset may share some similar group-specific feature or knowledge
Effect of $K$ . $K$ 过大会导致无法学得有效的 group-specific knowledge，过小会导致 Phase 2 匹配 groups 时无法充分利用 Phase 1 得到的 adapted feature representation
Effect of Ensemble Number $k$ .
Effect of Asymmetric GCL Loss.
Statistic of Prompt Matching.

References

连理o

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
[ICLR 2023] LPT: Long-tailed Prompt Tuning for Image Classification

[ICLR 2023] LPT: Long-tailed Prompt Tuning for Image Classification
复制链接

扫一扫