Abstract
In this paper, we present token labeling, a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs, which computes the classification loss on an additional trainable class token, our proposed objective takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token an individual location-specific supervision signal generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters, token labeling allows the model to achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model up to 150M parameters, making it the smallest model to reach 86%, compared with previous models of 250M+ parameters. We also show that token labeling can clearly improve the generalization capability of the pretrained models on downstream tasks with dense prediction, such as semantic segmentation.
Conventional ViTs aggregate global information on an extra trainable class token for the final classification and compute the classification loss on that token. The authors instead propose a new high-performance vision transformer, called LV-ViT, whose key idea is to reformulate the image classification problem into multiple token-level recognition problems and to assign each patch token an individual, location-specific supervision signal generated by a machine annotator.
Source code
Code is available at https://github.com/zihangJiang/TokenLabeling.
Paper walkthrough
The code shown below is not identical to the authors' source code.
Introduction
In this paper, the authors propose a training scheme, called token labeling, that exploits the useful information in both the patch tokens and the class token. The method uses a K-dimensional score map as supervision to supervise all tokens in a dense manner, where K is the number of classes in the target dataset. In this way, each patch token is explicitly associated with an individual location-specific supervision signal indicating whether the target object is present in the corresponding image patch.
Token Labeling
A conventional ViT appends a trainable class token to the patch tokens, aggregates global information on it, and uses it for the final classification prediction. The loss is computed as

$$L_{cls} = H(X^{cls}, y^{cls}),$$

where $H(\cdot,\cdot)$ denotes the softmax cross-entropy loss, $X^{cls}$ is the output of the class token, and $y^{cls}$ is the ground-truth class label.
The above formulation ignores the rich information contained in the patch tokens and cannot fully exploit the complementary information between the patch tokens and the class token. To address this, the authors propose a new training scheme called token labeling. A $K \times N$ score map is introduced so that each patch token is explicitly associated with an individual location-specific supervision signal, and the auxiliary token-labeling loss is

$$L_{aux} = \frac{1}{N} \sum_{i=1}^{N} H(X^{i}, y^{i}),$$

where $X^{i}$ is the output of the $i$-th patch token and $y^{i}$ is its location-specific label taken from the score map.
The overall loss is

$$L_{total} = L_{cls} + \beta L_{aux},$$

where $\beta$ is a hyperparameter balancing the two terms (set to 0.5 in the paper).
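As a concrete illustration, the combined objective above can be sketched in a few lines of NumPy. The function names and tensor shapes here are illustrative assumptions, not the authors' implementation (which operates on batched PyTorch tensors).

```python
import numpy as np

def softmax_ce(logits, target):
    """Softmax cross-entropy H(logits, target) over the last axis."""
    logp = logits - logits.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    return -(target * logp).sum(axis=-1)

def token_labeling_loss(cls_logits, patch_logits, cls_target, token_targets, beta=0.5):
    """L_total = H(X_cls, y_cls) + beta * (1/N) * sum_i H(X_i, y_i).

    cls_logits: (K,) class-token output; patch_logits: (N, K) patch-token outputs;
    cls_target: (K,) one-hot image label; token_targets: (N, K) per-token labels.
    """
    l_cls = softmax_ce(cls_logits, cls_target)
    l_aux = softmax_ce(patch_logits, token_targets).mean()  # average over the N tokens
    return l_cls + beta * l_aux
```

Note that the auxiliary term only averages per-token cross-entropies, so it adds essentially no cost on top of the standard classification loss.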
Notably, unlike conventional knowledge distillation, where a teacher model has to generate supervision labels during training, the score maps used by token labeling can be generated once in advance by a pretrained machine annotator; at training time they only need to be cropped and interpolated, so the extra computational cost is negligible.
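A minimal sketch of the "crop and interpolate" step, assuming the pre-computed score map is stored as a K×H×W array. Nearest-neighbor interpolation and the function and argument names are simplifying assumptions; the paper does not prescribe this exact procedure.

```python
import numpy as np

def crop_and_resize(score_map, box, out_hw):
    """Crop a region from a dense K x H x W score map and resize it to the
    token-grid resolution using nearest-neighbor interpolation."""
    K, H, W = score_map.shape
    top, left, h, w = box
    crop = score_map[:, top:top + h, left:left + w]
    # nearest-neighbor source indices for the target grid
    ys = (np.arange(out_hw[0]) * h / out_hw[0]).astype(int)
    xs = (np.arange(out_hw[1]) * w / out_hw[1]).astype(int)
    return crop[:, ys][:, :, xs]  # (K, out_h, out_w)
```

Since this is pure indexing over a pre-stored array, it matches the paper's claim that reusing the score maps adds negligible cost.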
Token Labeling with MixToken
To improve performance and robustness, and to avoid the mixed patches that directly applying CutMix to raw images would produce (patches containing content from two different images), the authors adopt a data augmentation method called MixToken, which performs the mixing on the token sequences after patch embedding.
Given two images with token sequences $T_1 = [t_1^1, \dots, t_1^N]$ and $T_2 = [t_2^1, \dots, t_2^N]$ and a binary mask matrix $M$, the new token sequence is

$$\hat{T} = T_1 \odot M + T_2 \odot (1 - M),$$

where $\odot$ denotes element-wise multiplication.
The corresponding token labels are

$$\hat{Y} = Y_1 \odot M + Y_2 \odot (1 - M).$$
The label for the class token is

$$\hat{y}^{cls} = \bar{M} y_1^{cls} + (1 - \bar{M}) y_2^{cls},$$

where $\bar{M}$ is the mean value of all elements in $M$.
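The three mixing rules above can be sketched directly in NumPy. The shapes and names below are illustrative assumptions: the mask is treated as a length-N binary vector applied per token, rather than the 2D patch-grid mask the authors use.

```python
import numpy as np

def mix_token(t1, t2, y1, y2, y1_cls, y2_cls, mask):
    """MixToken: mix two token sequences and their labels with a binary mask.

    t1, t2: (N, D) token sequences; y1, y2: (N, K) token labels;
    y1_cls, y2_cls: (K,) class-token labels; mask: (N,) binary vector.
    """
    m = mask[:, None]
    t_hat = t1 * m + t2 * (1 - m)  # T_hat = T1 (.) M + T2 (.) (1 - M)
    y_hat = y1 * m + y2 * (1 - m)  # Y_hat = Y1 (.) M + Y2 (.) (1 - M)
    m_bar = mask.mean()            # M_bar: fraction of tokens kept from image 1
    y_cls_hat = m_bar * y1_cls + (1 - m_bar) * y2_cls
    return t_hat, y_hat, y_cls_hat
```

Because each token is taken wholly from one image or the other, every mixed token keeps a clean, unambiguous token label, which is exactly what MixToken is designed to guarantee.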