Bag of Tricks for Image Classification with Convolutional Neural Networks
Contents
3. Efficient Training
Large-batch training
- Linear scaling learning rate
- Learning rate warmup
- Zero γ
- No bias decay
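The paper does not include reference code for these heuristics; the following is a minimal PyTorch-flavoured sketch of how the four of them are commonly wired together. The use of torchvision's resnet50 and the epoch-granularity warmup are assumptions made for brevity (the paper warms up per batch).

```python
import torch
from torch import nn
from torchvision.models import resnet50
from torchvision.models.resnet import Bottleneck

model = resnet50()  # torchvision also offers resnet50(zero_init_residual=True)

# Zero gamma: initialise the scale of the last BN in each residual block to zero,
# so every block starts out as an identity mapping.
for m in model.modules():
    if isinstance(m, Bottleneck):
        nn.init.zeros_(m.bn3.weight)

# No bias decay: apply weight decay only to conv/FC weights;
# biases and BatchNorm gamma/beta (all 1-D parameters) are left undecayed.
decay = [p for p in model.parameters() if p.ndim > 1]
no_decay = [p for p in model.parameters() if p.ndim <= 1]

# Linear scaling: lr = 0.1 * batch_size / 256, i.e. 0.4 for a batch size of 1024.
batch_size = 1024
base_lr = 0.1 * batch_size / 256

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=base_lr, momentum=0.9)

# Learning rate warmup: ramp linearly from 0 to base_lr over the first 5 epochs,
# then fall back to the baseline step decay (divide by 10 at epochs 30, 60, 90).
def lr_lambda(epoch):
    if epoch < 5:
        return (epoch + 1) / 5
    return 0.1 ** sum(epoch >= e for e in (30, 60, 90))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```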
Low-precision training
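The paper trains in FP16 with an FP32 master copy of the weights and a scaled loss (in MXNet). A rough PyTorch analogue using automatic mixed precision is sketched below; `model`, `optimizer`, `criterion`, and `train_loader` are assumed to come from the surrounding training script.

```python
import torch

scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling

for images, targets in train_loader:
    images, targets = images.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward pass in FP16 where safe
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()                 # scale the loss to keep FP16 gradients in range
    scaler.step(optimizer)                        # unscale, then update the FP32 master weights
    scaler.update()
```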
The evaluation results for ResNet-50 are shown in Table 3. Compared to the baseline with batch size 256 and FP32, using a larger batch size of 1024 together with FP16 reduces the training time for ResNet-50 from 13.3 min per epoch to 4.4 min per epoch. In addition, by stacking all heuristics for large-batch training, the model trained with batch size 1024 and FP16 even gains 0.5% top-1 accuracy over the baseline model.
The ablation study of all heuristics is shown in Table 4. Increasing the batch size from 256 to 1024 with linear scaling of the learning rate alone leads to a 0.9% drop in top-1 accuracy, while stacking the remaining three heuristics closes the gap. Switching from FP32 to FP16 at the end of this stacking does not affect the accuracy.
4. Model Tweaks
ResNet-B
ResNet-B changes the downsampling block of ResNet. The observation is that the convolution in path A ignores three-quarters of the input feature map because it uses a 1 × 1 kernel with a stride of 2. ResNet-B swaps the strides of the first two convolutions in path A, as shown in Figure 2a, so no information is ignored. Because the second convolution has a 3 × 3 kernel, the output shape of path A remains unchanged.
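As an illustration, here is a minimal PyTorch sketch of path A of a downsampling block before and after the ResNet-B tweak; the 256/128/512 channel counts are hypothetical (a typical bottleneck), and the final ReLU, which is really applied after adding the shortcut, is glossed over.

```python
from torch import nn

def conv_bn_relu(in_ch, out_ch, kernel_size, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

# Original ResNet path A: the 1x1 convolution with stride 2 skips 3/4 of the input.
path_a_resnet = nn.Sequential(
    conv_bn_relu(256, 128, kernel_size=1, stride=2),
    conv_bn_relu(128, 128, kernel_size=3, stride=1),
    conv_bn_relu(128, 512, kernel_size=1, stride=1))

# ResNet-B: swap the strides, so the stride-2 convolution is the 3x3 one,
# which still covers every input position; the output shape is unchanged.
path_a_resnet_b = nn.Sequential(
    conv_bn_relu(256, 128, kernel_size=1, stride=1),
    conv_bn_relu(128, 128, kernel_size=3, stride=2),
    conv_bn_relu(128, 512, kernel_size=1, stride=1))
```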
ResNet-C
The observation is that the computational cost of a convolution is quadratic in the kernel width and height: a 7 × 7 convolution is 5.4 times more expensive than a 3 × 3 convolution. This tweak therefore replaces the 7 × 7 convolution in the input stem with three consecutive 3 × 3 convolutions, as shown in Figure 2b. The first convolution has 32 output channels and a stride of 2, the second has 32 output channels, and the last convolution uses 64 output channels.
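A sketch of the corresponding ResNet-C input stem in PyTorch (a hypothetical stand-alone module, keeping the same max-pool as the original stem):

```python
from torch import nn

# ResNet-C: three 3x3 convolutions replace the single 7x7/stride-2 stem convolution.
stem_resnet_c = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),   # halves the resolution
    nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
```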
ResNet-D
Inspired by ResNet-B, we note that the 1 × 1 convolution in path B of the downsampling block also ignores three-quarters of the input feature map, and we would like to modify it so that no information is ignored. Empirically, adding a 2 × 2 average pooling layer with a stride of 2 before the convolution, whose stride is then changed to 1, works well in practice and has little impact on the computational cost. This tweak is illustrated in Figure 2c.
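A sketch of the shortcut (path B) of the downsampling block with and without the ResNet-D tweak, again with hypothetical 256-in / 512-out channel counts:

```python
from torch import nn

# Original shortcut: a 1x1 convolution with stride 2 discards 3/4 of the input feature map.
path_b_resnet = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(512))

# ResNet-D: average-pool first (2x2, stride 2), then apply the 1x1 convolution
# with stride 1, so every input activation contributes to the output.
path_b_resnet_d = nn.Sequential(
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(512))
```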
5. Training Refinements
- Cosine Learning Rate Decay
- Label Smoothing
- Knowledge Distillation
- Mixup Training
For mixup training, the number of epochs is increased from 120 to 200 because the mixed examples require a longer training schedule to converge well.
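The paper gives no reference code for these refinements either; below is a compact PyTorch sketch of cosine learning rate decay, label smoothing (ε = 0.1), and mixup (α = 0.2) combined in one training loop. `model`, `optimizer`, and `train_loader` are assumed to exist already, and knowledge distillation is omitted because it requires a pretrained teacher model.

```python
import numpy as np
import torch
from torch import nn

epochs = 200                            # mixup training: 200 epochs instead of 120
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing with eps = 0.1

# Cosine decay: lr_t = 0.5 * (1 + cos(pi * t / T)) * lr_0, stepped once per batch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * len(train_loader))

alpha = 0.2                             # Beta(alpha, alpha) distribution for mixup

for epoch in range(epochs):
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()

        # Mixup: blend each example with a randomly chosen partner.
        lam = float(np.random.beta(alpha, alpha))
        index = torch.randperm(x.size(0), device=x.device)
        x_mixed = lam * x + (1 - lam) * x[index]

        logits = model(x_mixed)
        loss = lam * criterion(logits, y) + (1 - lam) * criterion(logits, y[index])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```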
6. Transfer Learning
6.1 Object Detection
We can observe that a base model with higher validation accuracy consistently leads to a higher mAP for Faster R-CNN.
6.2 Semantic Segmentation
In contrast to our results on object detection, the cosine learning rate schedule effectively improves FCN performance, while the other refinements provide suboptimal results. A potential explanation is that semantic segmentation predicts at the pixel level: models trained with label smoothing, distillation, and mixup favor softened labels, and this blurred pixel-level information may degrade overall pixel-level accuracy.