Abstract
In this paper, we present token labeling, a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs, which computes the classification loss on an additional trainable class token, our proposed objective takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token an individual location-specific supervision signal generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters, token labeling allows the model to achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model up to 150M parameters, making it the smallest model to reach 86%, compared with previous models of 250M+ parameters. We also show that token labeling can clearly improve the generalization capability of the pretrained models on downstream tasks with dense prediction, such as semantic segmentation.
Conventional ViTs aggregate global information on an extra trainable class token for the final classification and compute the classification loss on that token. The authors instead propose a new high-performance vision transformer, called LV-ViT, whose key idea is to reformulate the image classification problem into multiple token-level recognition problems and to assign each patch token an individual, location-specific supervision signal generated by a machine annotator.
Source code
Code is available at https://github.com/zihangJiang/TokenLabeling.
Paper walkthrough
The code shown below is not identical to the authors' source code.
Introduction
In this paper, the authors propose a training scheme, called token labeling, that exploits the useful information in both the patch tokens and the class token. The method uses a K-dimensional score map as supervision to supervise all tokens in a dense manner, where K is the number of classes in the target dataset. In this way, each patch token is explicitly associated with an individual location-specific supervision signal indicating whether the target object is present in the corresponding image patch.
Token Labeling
A conventional ViT appends a trainable class token to the patch tokens, aggregates global information on it, and uses it for the final classification prediction. The loss is computed as

$$L_{cls} = H(X^{cls}, y^{cls}),$$

where $H(\cdot,\cdot)$ denotes the softmax cross-entropy loss, $X^{cls}$ is the output of the class token, and $y^{cls}$ is the ground-truth class label.
The above formulation ignores the rich information contained in the patch tokens and cannot fully exploit the complementary information between the patch tokens and the class token. To address this, the authors propose a new training scheme called token labeling. A $K \times N$ score map is introduced so that each patch token is explicitly associated with an individual location-specific supervision signal, and the auxiliary token-labeling loss is

$$L_{aux} = \frac{1}{N} \sum_{i=1}^{N} H(X^{i}, y^{i}),$$

where $X^{i}$ is the output of the $i$-th patch token and $y^{i}$ is its location-specific label taken from the score map.
The overall loss is

$$L_{total} = L_{cls} + \beta L_{aux},$$

where $\beta$ is a hyperparameter balancing the two terms (set to 0.5 in the paper).
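As a concrete illustration, the combined objective above can be sketched in a few lines of NumPy. The function names and tensor shapes here are illustrative assumptions, not the authors' implementation (which operates on batched PyTorch tensors).

```python
import numpy as np

def softmax_ce(logits, target):
    """Softmax cross-entropy H(logits, target) over the last axis."""
    logp = logits - logits.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    return -(target * logp).sum(axis=-1)

def token_labeling_loss(cls_logits, patch_logits, cls_target, token_targets, beta=0.5):
    """L_total = H(X_cls, y_cls) + beta * (1/N) * sum_i H(X_i, y_i).

    cls_logits: (K,) class-token output; patch_logits: (N, K) patch-token outputs;
    cls_target: (K,) one-hot image label; token_targets: (N, K) per-token labels.
    """
    l_cls = softmax_ce(cls_logits, cls_target)
    l_aux = softmax_ce(patch_logits, token_targets).mean()  # average over the N tokens
    return l_cls + beta * l_aux
```

Note that the auxiliary term only averages per-token cross-entropies, so it adds essentially no cost on top of the standard classification loss.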
Notably, unlike conventional knowledge distillation, where a teacher model has to generate supervision labels during training, the score maps used by token labeling can be generated once in advance by a pretrained machine annotator; at training time they only need to be cropped and interpolated, so the extra computational cost is negligible.
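A minimal sketch of the "crop and interpolate" step, assuming the pre-computed score map is stored as a K×H×W array. Nearest-neighbor interpolation and the function and argument names are simplifying assumptions; the paper does not prescribe this exact procedure.

```python
import numpy as np

def crop_and_resize(score_map, box, out_hw):
    """Crop a region from a dense K x H x W score map and resize it to the
    token-grid resolution using nearest-neighbor interpolation."""
    K, H, W = score_map.shape
    top, left, h, w = box
    crop = score_map[:, top:top + h, left:left + w]
    # nearest-neighbor source indices for the target grid
    ys = (np.arange(out_hw[0]) * h / out_hw[0]).astype(int)
    xs = (np.arange(out_hw[1]) * w / out_hw[1]).astype(int)
    return crop[:, ys][:, :, xs]  # (K, out_h, out_w)
```

Since this is pure indexing over a pre-stored array, it matches the paper's claim that reusing the score maps adds negligible cost.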
Token Labeling with MixToken
To improve performance and robustness, and to avoid the mixed patches that directly applying CutMix to raw images would produce (patches containing content from two different images), the authors adopt a data augmentation method called MixToken, which performs the mixing on the token sequences after patch embedding.
Given two images with token sequences $T_1 = [t_1^1, \dots, t_1^N]$ and $T_2 = [t_2^1, \dots, t_2^N]$ and a binary mask matrix $M$, the new token sequence is

$$\hat{T} = T_1 \odot M + T_2 \odot (1 - M),$$

where $\odot$ denotes element-wise multiplication.
The corresponding token labels are

$$\hat{Y} = Y_1 \odot M + Y_2 \odot (1 - M).$$
The label for the class token is

$$\hat{y}^{cls} = \bar{M} y_1^{cls} + (1 - \bar{M}) y_2^{cls},$$

where $\bar{M}$ is the mean value of all elements in $M$.
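The three mixing rules above can be sketched directly in NumPy. The shapes and names below are illustrative assumptions: the mask is treated as a length-N binary vector applied per token, rather than the 2D patch-grid mask the authors use.

```python
import numpy as np

def mix_token(t1, t2, y1, y2, y1_cls, y2_cls, mask):
    """MixToken: mix two token sequences and their labels with a binary mask.

    t1, t2: (N, D) token sequences; y1, y2: (N, K) token labels;
    y1_cls, y2_cls: (K,) class-token labels; mask: (N,) binary vector.
    """
    m = mask[:, None]
    t_hat = t1 * m + t2 * (1 - m)  # T_hat = T1 (.) M + T2 (.) (1 - M)
    y_hat = y1 * m + y2 * (1 - m)  # Y_hat = Y1 (.) M + Y2 (.) (1 - M)
    m_bar = mask.mean()            # M_bar: fraction of tokens kept from image 1
    y_cls_hat = m_bar * y1_cls + (1 - m_bar) * y2_cls
    return t_hat, y_hat, y_cls_hat
```

Because each token is taken wholly from one image or the other, every mixed token keeps a clean, unambiguous token label, which is exactly what MixToken is designed to guarantee.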