[Paper Reading] Look-into-Object: Self-supervised Structure Modeling for Object Recognition
Abstract
Most object recognition approaches predominantly focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotations and therefore is labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions into the traditional framework. We show the recognition backbone can be substantially enhanced for more robust representation learning, without any cost of extra annotation and inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among the instances in the same category. We then design a spatial context learning module for modeling the internal structures of the object, through predicting the relative positions within the extent. These two modules can be easily plugged into any backbone networks during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: https://github.com/JDAI-CV/LIO.
Method
The framework consists of three modules:
Classification Module (CM): the backbone classification network, which extracts the basic image representation and produces the final object category.
Object-Extent Learning module (OEL): a module for localizing the main object in a given image.
Spatial Context Learning module (SCL): a self-supervised module that strengthens the connections between regions through interactions among the feature cells in CM.
Given an image $I$ with label $l$, the CNN produces a feature map $f(I)$ of size $N \times N \times C$.
Object-Extent Learning Module
1. Split the feature map $f(I)$ into $N \times N$ feature vectors $\boldsymbol{f}(I)_{i, j} \in \mathbb{R}^{1 \times C}$, where $i$ and $j$ are the horizontal and vertical indices; each feature vector mainly responds to a certain region of the input image $I$.
2. Inspired by the principle that objects in images from the same category always share some commonalities, which in turn help the model recognize the object, sample a set of images with the same label as $I$, $\boldsymbol{I}^{\prime}=\left\{I_{1}^{\prime}, I_{2}^{\prime}, \cdots, I_{P}^{\prime}\right\}$, and then measure the region-level correlation between them:
$$\varphi_{i, j}\left(I, I^{\prime}\right)=\frac{1}{C} \max _{1 \leq i^{\prime}, j^{\prime} \leq N}\left\langle\boldsymbol{f}(I)_{i, j}, \boldsymbol{f}\left(I^{\prime}\right)_{i^{\prime}, j^{\prime}}\right\rangle$$
where $\langle\cdot, \cdot\rangle$ denotes the dot product.
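The region-level correlation above can be sketched in numpy as follows (a minimal sketch, assuming `f_I` and `f_Ip` are the $N \times N \times C$ backbone feature maps of $I$ and one positive image $I^{\prime}$; the function name is ours, not from the paper):

```python
import numpy as np

def region_correlation(f_I, f_Ip):
    """phi_{i,j}(I, I'): for each region (i, j) of I, the maximum dot
    product with any region (i', j') of I', scaled by 1/C."""
    N, _, C = f_I.shape
    # Flatten I' into N*N region vectors of dimension C.
    regions_p = f_Ip.reshape(-1, C)          # (N*N, C)
    phi = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # Dot product with every region of I', keep the maximum.
            phi[i, j] = regions_p.dot(f_I[i, j]).max() / C
    return phi
```

The result is an $N \times N$ map whose high values mark regions of $I$ that resemble some region of the same-category image.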
After joint training with the classification objective $L_{cls}$, the correlation score $\varphi_{i, j}$ is usually positively correlated with the semantic relevance to the label $l$.
3. Construct an $N \times N$ semantic mask matrix from $\varphi_{i, j}$ for the object extent in $I$. The commonalities among images of the same category can thus be well captured by this semantic correlation mask $\varphi$, whose values naturally distinguish the main object region from the background:
$$M\left(I, \boldsymbol{I}^{\prime}\right)=\frac{1}{P} \sum_{p=1}^{P} \varphi\left(I, I_{p}^{\prime}\right)$$
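The pseudo mask $M$ is just the average of the $P$ per-image correlation maps (a sketch; `correlation_fn` stands in for the $\varphi$ computation defined earlier):

```python
import numpy as np

def extent_mask(f_I, f_positives, correlation_fn):
    """M(I, I'): average the correlation maps phi(I, I'_p) over the
    P positive images sharing I's label."""
    phis = [correlation_fn(f_I, f_p) for f_p in f_positives]
    return np.mean(phis, axis=0)   # N x N mask, higher on object regions
```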
4. A simple stream is attached after $f(I)$ to fuse all the feature maps in $f(I)$ with learned weights. These features are processed by a $1 \times 1$ convolution to obtain a single-channel output $m^{\prime}(I)$.
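A $1 \times 1$ convolution with a single output channel reduces, per spatial position, to a weighted sum over the $C$ channels. A minimal sketch (`w` stands in for the hypothetical learned convolution weights, not a name from the paper):

```python
import numpy as np

def predict_mask(f_I, w, b=0.0):
    """m'(I): fuse the C channels of f(I) into one channel, i.e. a 1x1
    convolution = per-position dot product over the channel axis."""
    return f_I @ w + b     # (N, N, C) @ (C,) -> (N, N)
```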
Unlike conventional attention, which aims to detect certain specific parts or regions, the OEL module is trained to gather all regions within the object while ignoring the background and other irrelevant objects.
$$\mathcal{L}_{o e l}=\sum \operatorname{MSE}\left(m^{\prime}(I), M\left(I, \boldsymbol{I}^{\prime}\right)\right)$$
$\mathcal{L}_{o e l}$ is defined as the distance between the pseudo mask of the object extent $M(I, I^{\prime})$ and $m^{\prime}(I)$.
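With $m^{\prime}(I)$ and $M(I, \boldsymbol{I}^{\prime})$ both being $N \times N$ maps, the loss is a plain mean squared error per image, summed over the batch in the paper's formula. A minimal sketch of the per-image term:

```python
import numpy as np

def oel_loss(m_pred, M_target):
    """Per-image L_oel term: MSE between the predicted single-channel
    map m'(I) and the pseudo mask M(I, I')."""
    return float(np.mean((m_pred - M_target) ** 2))
```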
To be continued...
————————————————————————————————————————
Continued at: https://blog.csdn.net/qq_41118968/article/details/117946950?spm=1001.2014.3001.5502