Introduction
- Rather than explicitly locating discriminative regions and extracting their features, the authors exploit the semantic granularity carried by the feature maps of different CNN stages, and use a Jigsaw Puzzle Generator as data augmentation to help the model learn multi-granularity image features, improving fine-grained classification performance. Notably, the Jigsaw Puzzle Generator's augmentation process closely resembles how Swin Transformer merges image patches, and the paper further shows that fusing the predictions of multiple stages yields a large improvement on fine-grained classification.
Progressive Multi-Granularity (PMG) training framework
- Network Architecture: PMG can use any backbone $F$. Suppose the backbone has $L$ stages, and let the feature map output by the $l$-th stage be $F^l \in \mathbb{R}^{H_l \times W_l \times C_l}$. Since the authors also want to impose a classification loss on each of the last $S$ feature maps $F^l$, each such stage is paired with a convolution block $H_{conv}^l$ (2 conv + max pooling) that produces a feature vector $V^l = H_{conv}^l(F^l)$, followed by $H^l_{class}$ with BatchNorm and ELU (2 FC + Softmax) to obtain $y^l = H^l_{class}(V^l)$. In addition, the $V^l$ of the last $S$ stages are concatenated into $V^{concat}$, on which another classification loss is imposed: $y^{concat} = H^{concat}_{class}(V^{concat})$ (the authors choose $S = 3$).
- Progressive Training: the authors adopt progressive training, i.e. the low stages are trained first and the later stages are added step by step (at each iteration, a batch of data $d$ is used for $S + 1$ steps). Because a low stage has a limited receptive field and representational capacity, it is forced, in order to classify correctly, to focus on discriminative information from local details (i.e. object textures). This incremental nature allows the model to locate discriminative information from local details to global structures as the features are gradually sent into higher stages, instead of learning all the granularities simultaneously.
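The $S + 1$-step schedule above can be sketched in plain Python. This is a minimal sketch, not the authors' code: `train_step` is a hypothetical stub standing in for one forward/backward pass on the indicated classifier head, and $L = 5$, $S = 3$ follow the paper's ResNet50 setup.

```python
# Sketch of one PMG training iteration: S + 1 steps on the same batch,
# assuming L = 5 backbone stages and S = 3 supervised stages (ResNet50).
L, S = 5, 3

def patch_grid(l, L=L):
    # n = 2^(L - l + 1): finer jigsaw patches for lower stages.
    return 2 ** (L - l + 1)

def train_step(head, n):
    # Hypothetical stub: shuffle the batch with an n x n jigsaw grid
    # (n == 1 would mean the original image), forward through the
    # backbone up to `head`, and backpropagate that head's loss.
    return (head, n)

def pmg_iteration():
    steps = []
    for l in range(L - S + 1, L + 1):        # stages 3, 4, 5
        steps.append(train_step(f"y^{l}", patch_grid(l)))
    steps.append(train_step("y^concat", 1))  # final step: original image
    return steps
```

For the default setup this yields jigsaw grids of 8, 4, and 2 for stages 3, 4, and 5, followed by the combined step on the unshuffled image.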
- Jigsaw Puzzle Generator: jigsaw puzzle solving (Wei, Chen, et al. "Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.) has been shown to be an effective image augmentation method. The authors use jigsaw puzzles to augment the input data of each training step, forcing the model to learn the granularity corresponding to the current stage (devise different granularity regions and force the model to learn information specific to the corresponding granularity level at each training step; only the last step (the combined step) is still trained with original images). Given an input image $d \in \mathbb{R}^{3 \times W \times H}$, it is split into $n \times n$ patches, which are randomly shuffled and reassembled into a new image; the larger $n$ is, the finer the granularity of the patches. The $n$ of each stage must satisfy two conditions: (i) the patch size should be smaller than the receptive field of the current stage; (ii) the patch size should increase with the receptive field of the stage. Since the receptive field roughly doubles between adjacent stages, the authors set $n$ for the $l$-th stage to $2^{L-l+1}$. Note that the jigsaw puzzle generator cannot always guarantee that a fine-grained discriminative region falls entirely within a single patch, but since random cropping is also used, this issue does not degrade model performance.
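The patch-shuffling operation itself is simple to implement. Below is a minimal NumPy sketch (an assumption of this write-up, not the authors' code): split a CHW image into an $n \times n$ grid, permute the patches, and reassemble.

```python
import numpy as np

def jigsaw_generator(image: np.ndarray, n: int, rng=None) -> np.ndarray:
    """Split a CHW image into an n x n grid of patches, shuffle the
    patches randomly, and reassemble them into an image of the same
    shape. `rng` may be a seed or a numpy Generator."""
    rng = np.random.default_rng(rng)
    c, h, w = image.shape
    assert h % n == 0 and w % n == 0, "image size must be divisible by n"
    ph, pw = h // n, w // n
    # Collect the n*n patches in row-major order.
    patches = [image[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(n) for j in range(n)]
    order = rng.permutation(n * n)
    out = np.empty_like(image)
    for idx, k in enumerate(order):
        i, j = divmod(idx, n)
        out[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw] = patches[k]
    return out
```

The output has the same shape and the same multiset of pixel values as the input; only the spatial arrangement of the patches changes.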
- Inference: classification can use only the concat feature, or fuse the classification results of multiple stages.
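Fusing the stage predictions can be as simple as combining their softmax outputs. The sketch below (an assumption; equal weighting is my choice, and summing instead of averaging would give the same argmax) averages the per-stage probabilities with the concat-feature prediction.

```python
import numpy as np

def fuse_predictions(stage_probs, concat_probs):
    """Average the softmax probability vectors of all stage heads and
    the concat head into one fused prediction (equal weights assumed)."""
    all_probs = np.vstack(list(stage_probs) + [concat_probs])
    return all_probs.mean(axis=0)

# Toy 3-class example: y^3, y^4, y^5 and y^concat softmax outputs.
y3 = np.array([0.6, 0.3, 0.1])
y4 = np.array([0.5, 0.4, 0.1])
y5 = np.array([0.2, 0.7, 0.1])
yc = np.array([0.3, 0.6, 0.1])
fused = fuse_predictions([y3, y4, y5], yc)  # predicted class: fused.argmax()
```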
Experiments
- Implementation Details: see Section 4.1 (the input images are resized to a fixed size of $550 \times 550$ and randomly cropped into $448 \times 448$)
- Comparisons with State-of-the-Art Methods
- Ablation Study
- Visualization (Grad-CAM)