[Paper Reading Notes] (2021 CVPR) Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels

Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels

(2021 CVPR)

Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, Sanghyuk Chun

Notes

Contributions

In this paper, we propose a re-labeling strategy, ReLabel, to obtain pixel-wise labeling L ∈ R^(H×W×C), which are both multi-labels and localized labels, on the ImageNet training set. We use strong classifiers trained on external training data to generate those labels. The predictions before the final pooling layer have been used. We also contribute a novel training scheme, LabelPooling, for training classifiers based on the dense labels. For each random crop sample, we compute the multi-label ground truth by pooling the label scores from the crop region. ReLabel incurs only a one-time cost for generating the label maps per dataset, unlike e.g. Knowledge Distillation [20], which involves one forward pass per training iteration to generate the supervision. Our LabelPooling supervision adds only a small amount of computational cost on top of the usual single-label cross-entropy supervision.

Why does knowledge distillation also provide localized labels?

With the random crop augmentation in place, every training iteration would involve a forward pass through the strong yet heavy teacher.

Each random crop is fed to the teacher to produce a soft label, which effectively serves as a localized label.

 


 

Method

Re-labeling ImageNet

We obtain dense ground truth labels from a machine annotator, a state-of-the-art classifier that has been pretrained on a super-ImageNet scale (e.g. JFT-300M [43] or InstagramNet-1B [32]) and fine-tuned on ImageNet to predict ImageNet classes. We show the detailed process of obtaining a label map in Figure A1. The original classifier takes an input image, computes the feature map (R^(H×W×d)), conducts global average pooling (R^(1×1×d)), and generates the predicted label L_org ∈ R^(1×1×C) with the fully-connected layer (W_fc ∈ R^(d×C)). The modified classifier, on the other hand, does not have a global average pooling layer and outputs a label map L_ours ∈ R^(H×W×C) from the feature map (R^(H×W×d)). Note that the fully-connected layer (W_fc ∈ R^(d×C)) of the original classifier and the 1 × 1 conv (W_1×1conv ∈ R^(1×1×d×C)) of the modified classifier are identical. Specifically, we save storage space by storing only the top-5 predictions per image, resulting in 10 GB of label map data.
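The key observation above — that an FC layer after global average pooling is equivalent to a 1 × 1 conv applied per pixel — can be verified numerically. A minimal numpy sketch with toy sizes (the paper's actual sizes are H = W = 15, d = 5504, C = 1000):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, d, C = 15, 15, 8, 10           # toy sizes for illustration
feat = rng.standard_normal((H, W, d))
W_fc = rng.standard_normal((d, C))   # shared by the FC layer and the 1x1 conv

# Original classifier: global average pooling, then fully-connected layer.
L_org = feat.mean(axis=(0, 1)) @ W_fc   # shape (C,)

# Modified classifier: apply W_fc at every spatial position (a 1x1 conv),
# yielding the dense label map.
L_ours = feat @ W_fc                    # shape (H, W, C)

# Because both operations are linear, pooling the dense map recovers L_org.
assert np.allclose(L_ours.mean(axis=(0, 1)), L_org)
```

This linearity is why dropping the pooling layer changes nothing about the classifier's global prediction while exposing per-pixel label scores.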

 

Training a Classifier with Dense Multi-labels

LabelPooling loads a pre-computed label map and conducts a regional pooling operation (RoIAlign) on the label map corresponding to the coordinates of the random crop. Global average pooling and softmax operations are performed on the pooled prediction maps to get a multi-label ground-truth vector in [0, 1]C with which the model is trained. We use the cross-entropy loss.
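The LabelPooling step can be sketched in a few lines of numpy. Note this is a simplification: the paper uses RoIAlign for the regional pooling, while here the crop region is averaged over integer map coordinates; the function names are mine:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def label_pooling(label_map, crop):
    """Pool label scores over a crop (x0, y0, x1, y1 in label-map coords).

    Simplified stand-in for the RoIAlign used in the paper: average the
    label scores inside the integer crop region, then softmax.
    """
    x0, y0, x1, y1 = crop
    region = label_map[y0:y1, x0:x1]      # (h, w, C)
    pooled = region.mean(axis=(0, 1))     # global average pooling
    return softmax(pooled)                # multi-label target in [0, 1]^C

def cross_entropy(pred_logits, target):
    logp = pred_logits - np.log(np.exp(pred_logits).sum())
    return -(target * logp).sum()

rng = np.random.default_rng(0)
label_map = rng.standard_normal((15, 15, 10))   # toy C=10 label map
target = label_pooling(label_map, crop=(2, 3, 9, 11))
assert np.isclose(target.sum(), 1.0) and (target >= 0).all()

logits = rng.standard_normal(10)
loss = cross_entropy(logits, target)            # soft-label cross-entropy
assert loss > 0
```

The target is a full probability vector rather than a one-hot index, so the usual cross-entropy loss is applied against the soft target.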

 


 

Results

 

 ReLabel is both multi-label and pixel-wise. To examine the necessity of the two properties, we conduct an experiment by ablating each of them.

  1. ReLabel: RoIAlign + softmax [probability vector]
  2. Localized single labels: RoIAlign + argmax [one-hot]
  3. Global multi-label: GAP + softmax [probability vector]
  4. Global single-label: GAP + argmax (same as the original ImageNet labels, except machine-generated) [one-hot]
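The four ablation variants differ only in which pooling (regional vs. global) and which normalization (softmax vs. argmax) are applied to the same label map. A numpy sketch, again substituting a plain integer-region average for RoIAlign:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def one_hot(i, C):
    v = np.zeros(C)
    v[i] = 1.0
    return v

rng = np.random.default_rng(0)
H, W, C = 15, 15, 10
label_map = rng.standard_normal((H, W, C))
x0, y0, x1, y1 = 2, 3, 9, 11                      # toy crop region

roi = label_map[y0:y1, x0:x1].mean(axis=(0, 1))   # localized pooled scores
gap = label_map.mean(axis=(0, 1))                 # global pooled scores

relabel          = softmax(roi)                   # 1. RoI + softmax -> p vector
localized_single = one_hot(roi.argmax(), C)       # 2. RoI + argmax  -> one-hot
global_multi     = softmax(gap)                   # 3. GAP + softmax -> p vector
global_single    = one_hot(gap.argmax(), C)       # 4. GAP + argmax  -> one-hot

assert np.isclose(relabel.sum(), 1.0) and np.isclose(global_multi.sum(), 1.0)
assert localized_single.sum() == 1.0 and global_single.sum() == 1.0
```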

 

Label smoothing [45] assigns a slightly weaker weight on the foreground class (1 − ε) and distributes the remaining weight ε uniformly across background classes.
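Following the description above literally (ε spread over the C − 1 background classes; some implementations instead spread ε over all C classes), the smoothed target looks like:

```python
import numpy as np

def label_smooth(target_class, C, eps=0.1):
    # (1 - eps) on the foreground class, eps spread uniformly
    # over the C - 1 background classes, as described in the text.
    v = np.full(C, eps / (C - 1))
    v[target_class] = 1.0 - eps
    return v

t = label_smooth(3, C=10, eps=0.1)
assert np.isclose(t.sum(), 1.0) and np.isclose(t[3], 0.9)
```

Unlike ReLabel, this target is neither image-specific beyond the class index nor localized — every off-class gets the same weight.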

Label cleaning by Beyer et al. [2] prunes out all training samples where the ground truth annotation does not agree with the prediction of a strong teacher classifier, namely BiT-L [24].

We show multi-label accuracies on two versions: ReaL [2] and Shankar et al. [39]. The metrics are identical: a prediction counts as correct if the top-1 predicted class hits any label in the image's multi-label set.

The difference between the metrics lies in the ground-truth multi-label annotation. 
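The shared metric is easy to state in code — top-1 accuracy where a hit against any of an image's ground-truth labels counts (function name is mine):

```python
import numpy as np

def multilabel_top1_accuracy(top1_preds, multilabel_sets):
    """Top-1 prediction is correct if it lies in the image's label set."""
    hits = [p in labels for p, labels in zip(top1_preds, multilabel_sets)]
    return float(np.mean(hits))

# Toy example: 3 images with (possibly multi-label) ground-truth sets.
preds = [5, 2, 7]
labels = [{5, 9}, {1}, {7}]
acc = multilabel_top1_accuracy(preds, labels)
assert np.isclose(acc, 2 / 3)   # images 1 and 3 are hits
```

Only the annotation sets differ between ReaL and Shankar et al., not this scoring rule.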

 

We utilize EfficientNet-L2 [15] as our machine annotator classifier, whose input size is 475×475. For all training images, we resize them to 475 × 475 without cropping and generate label maps by feed-forwarding them. The spatial size of the label map (W, H) is (15, 15), the number of channels d is 5504, and the number of classes C is 1000.

 

 CutMix
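ReLabel also composes with CutMix: each image contributes a pooled target from its label map, and the two targets are mixed by the area ratio λ. A hedged sketch — the exact pooling regions used per image are my simplification (the paper pools from the corresponding crop regions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
C = 10
map_a = rng.standard_normal((15, 15, C))   # label map of image A
map_b = rng.standard_normal((15, 15, C))   # label map of image B

x0, y0, x1, y1 = 4, 4, 10, 10                    # pasted box (map coords)
lam = 1 - ((x1 - x0) * (y1 - y0)) / (15 * 15)    # area kept from image A

# Simplification: pool A globally, pool B over the pasted box only.
target_a = softmax(map_a.mean(axis=(0, 1)))
target_b = softmax(map_b[y0:y1, x0:x1].mean(axis=(0, 1)))
target = lam * target_a + (1 - lam) * target_b   # area-weighted soft target

assert np.isclose(target.sum(), 1.0) and 0 < lam < 1
```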

 

What do 8-bit and 16-bit mean here?

the resulting label map dimension is L ∈ R^(15×15×1000). Saving the entire label maps for all classes will require more than 1 TB of storage: (1.28 × 10^6) images × (15 × 15 × 1000) dims/image × 4 bytes/dim ≈ 1.0 TB.

They refer to how many bits are used to store each pixel-level predicted probability in the label map.
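The storage arithmetic is easy to check. The top-5 encoding below — a float32 score plus an int32 class index per entry — is an assumption on my part, chosen because it lands near the reported ~10 GB:

```python
# Rough storage arithmetic for the label maps (sizes from the paper).
n_images, H, W, C, top_k = 1_280_000, 15, 15, 1000, 5

full = n_images * H * W * C * 4            # float32 score for every class
top5 = n_images * H * W * top_k * (4 + 4)  # assumed: score + class index

print(f"full label maps: {full / 1e12:.2f} TB")   # ≈ 1.15 TB
print(f"top-5 label maps: {top5 / 1e9:.2f} GB")   # ≈ 11.5 GB
```

Lower-precision scores (the 8-bit / 16-bit variants asked about above) shrink this further at the cost of quantizing the stored probabilities.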

How is robustness studied?

By applying adversarial attacks to the input images.
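A minimal sketch of one such attack, FGSM (fast gradient sign method), on a toy linear classifier — the linear model and all names here are my illustration, not the paper's evaluation setup, which uses real networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 16, 10
W = rng.standard_normal((d, C))   # toy linear "classifier"
x = rng.standard_normal(d)        # toy "image"
y = 3                             # true class index

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z):
    return -np.log(softmax(z @ W)[y])   # cross-entropy on the true class

# Gradient of the loss w.r.t. the input for a linear model:
# dL/dx = W @ (p - onehot(y))
p = softmax(x @ W)
onehot = np.zeros(C)
onehot[y] = 1.0
grad = W @ (p - onehot)

eps = 0.1
x_adv = x + eps * np.sign(grad)   # FGSM: one signed-gradient step

# The perturbation stays within the eps L-infinity ball...
assert np.all(np.abs(x_adv - x) <= eps + 1e-12)
# ...and (by convexity of this loss in x) cannot decrease the loss.
assert loss(x_adv) >= loss(x)
```

A model trained with better labels is then judged by how slowly its accuracy degrades under such perturbations.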
