https://www.bilibili.com/video/BV11E411y7Dr/?spm_id_from=333.788.videocard.12
ResNeSt: Split-Attention Networks (an improved version of ResNet)
(1) Large mini-batch, combined with a cosine learning-rate decay schedule, warmup, and careful BN-layer parameter settings (see the sketch after this list).
(2) Label smoothing
(3) Auto augmentation
(4) Mixup training
(5) Large crop size setting
(6) Regularization
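A minimal sketch of tricks (1), (2) and (4) in PyTorch is given below. It assumes a standard classification training loop and a recent PyTorch version (label_smoothing on nn.CrossEntropyLoss); the warmup length, smoothing factor and mixup alpha are illustrative values, not the ones used in the ResNeSt paper.

# Sketch of tricks (1), (2) and (4): cosine LR decay with warmup, label smoothing, mixup.
# Hyperparameters below are illustrative only.
import math
import numpy as np
import torch
import torch.nn as nn

def cosine_lr_with_warmup(optimizer, base_lr, step, warmup_steps, total_steps):
    """Linear warmup, then cosine decay of the learning rate to zero."""
    if step < warmup_steps:
        lr = base_lr * (step + 1) / warmup_steps
    else:
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        lr = 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr

def mixup(images, targets, alpha=0.2):
    """Mix random pairs of samples; return mixed images, both label sets, and the mixing weight."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, targets, targets[perm], lam

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing, trick (2)

def train_step(model, optimizer, images, targets, step, base_lr=0.1,
               warmup_steps=500, total_steps=100000):
    cosine_lr_with_warmup(optimizer, base_lr, step, warmup_steps, total_steps)  # trick (1)
    mixed, t_a, t_b, lam = mixup(images, targets)                               # trick (4)
    logits = model(mixed)
    loss = lam * criterion(logits, t_a) + (1.0 - lam) * criterion(logits, t_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()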
BiSeNet
https://github.com/CoinCheung/BiSeNet
These are the tricks that I find might be useful:
use an online hard example mining (OHEM) loss. This lets the model be trained more efficiently (see the sketch after this list).
do not apply weight decay to BN parameters or to the bias parameters of nn.Conv2d and nn.Linear.
use a 10x larger learning rate for the model output layers.
use crop evaluation. We do not want the evaluation scale to be too far from the training scale, so we crop chips from the images, evaluate each chip, and then combine the results into the final prediction.
multi-scale training and multi-scale flip evaluation. At each scale, the scores of the original image and its flipped version are summed, and the exponential of the sum is taken as the prediction for that scale.
a warmup of 1000 iterations so that the model is better initialized.
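The sketch below illustrates the OHEM loss and the optimizer parameter groups described above (no weight decay on BN/bias parameters, 10x learning rate on the output layers). It assumes a PyTorch segmentation model; the threshold, n_min, and the "head" parameter-name prefix are illustrative assumptions, not taken from the BiSeNet repo.

# Sketch of the OHEM loss and the weight-decay / learning-rate parameter groups.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class OhemCrossEntropy(nn.Module):
    """Cross-entropy that keeps only the hardest pixels: those whose loss exceeds
    -log(score_thresh), but always at least n_min pixels per batch."""
    def __init__(self, score_thresh=0.7, n_min=100000, ignore_index=255):
        super().__init__()
        self.thresh = -math.log(score_thresh)
        self.n_min = n_min
        self.ignore_index = ignore_index

    def forward(self, logits, labels):
        loss = F.cross_entropy(logits, labels, ignore_index=self.ignore_index,
                               reduction="none").view(-1)
        loss, _ = torch.sort(loss, descending=True)
        n_min = min(self.n_min, loss.numel() - 1)
        if loss[n_min] > self.thresh:
            loss = loss[loss > self.thresh]   # enough hard pixels: keep all of them
        else:
            loss = loss[:n_min]               # otherwise keep the n_min hardest
        return loss.mean()

def build_param_groups(model, base_lr, weight_decay=5e-4):
    """No weight decay on BN/bias parameters; 10x learning rate on the output head."""
    decay, no_decay, head = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if name.startswith("head"):   # output layers; the prefix "head" is an assumption
            head.append(p)
        elif p.ndim == 1:             # BN weights/biases and conv/linear biases
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "lr": base_lr, "weight_decay": weight_decay},
        {"params": no_decay, "lr": base_lr, "weight_decay": 0.0},
        {"params": head, "lr": 10 * base_lr, "weight_decay": weight_decay},
    ]

# optimizer = torch.optim.SGD(build_param_groups(model, base_lr=1e-2), momentum=0.9)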
torch-cv
# Zero-initialize the last BN in each residual branch,
# so that the residual branch starts with zeros, and each residual block behaves like an identity.
# This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
if zero_init_residual:
    for m in self.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)
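For reference, a minimal usage sketch, assuming torchvision's ResNet constructors, which expose this initialization through the zero_init_residual keyword:

import torchvision

# Zero-init the last BN weight of each residual branch at construction time.
model = torchvision.models.resnet50(zero_init_residual=True)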