CVPR2020 Harmonizing Transferability and Discriminability for Adapting Object Detector






The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the underlying complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment.


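Because of the domain shift between the source domain and the target domain, an object detector trained on the source domain does not generalize well to the target domain.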

This hinders the deployment of models in real-world situations where data distributions typically vary from one domain to another. Unsupervised Domain Adaptation (UDA) serves as a promising solution to solve this problem by transferring knowledge from a labeled source domain to a fully unlabeled target domain.


The paper defines transferability and discriminability as follows:

Note that, in this paper, the transferability refers to the invariance of the learned representations across domains, and discriminability refers to the ability of the detector to localize and distinguish different instances.

Related Work

Unsupervised Domain Adaptation

Typically, UDA methods propose to bridge different domains by matching the high-order statistics of source and target feature distributions in the latent space.

With insights from the practice of Generative Adversarial Nets (GAN), tremendous works have been done by leveraging the two-player game to achieve domain confusion with Gradient Reversal Layer (GRL) for feature alignment.

In addition, other GAN-based works aim to achieve pixel-level adaptation in virtue of image-to-image translation techniques.

UDA for Object Detection

Nevertheless, all these UDA methods do not properly handle the potential contradiction between transferability and discriminability when adapting object detectors in the context of adversarial adaptation.


Hierarchical Transferability Calibration Network (HTCN)

Because transferability and discriminability are to some extent contradictory, the authors tackle the problem with the modules described below.


Importance Weighted Adversarial Training with Input Interpolation


In the IWAT-I module, the domain discriminator $D_2$ outputs $d_i = D_2\left(G_1 \circ G_2(x_i)\right)$, the predicted domain label of the input sample $x_i$. The uncertainty of this prediction is measured by its entropy:
$$v_i = H(d_i) = -d_i \cdot \log(d_i) - (1-d_i) \cdot \log(1-d_i) \qquad (1)$$
The weight of each sample is then $1+v_i$.

Images with high uncertainty (hard to distinguish by $D_2$) should be up-weighted, and vice versa.

The feature $f_i$ of each image is re-weighted accordingly:
$$g_i = f_i \times \left(1+v_i\right) \qquad (2)$$
The adversarial loss of $D_3$ is therefore
$$\mathcal{L}_{ga} = \mathbb{E}[\log(D_3(G_3(g_i^s)))] + \mathbb{E}[1 - \log(D_3(G_3(g_i^t)))] \qquad (3)$$
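The re-weighting in Eqs. (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the use of the natural logarithm, the `eps` clamp, and the toy feature vectors are not from the paper.

```python
import math


def binary_entropy(d, eps=1e-12):
    """Entropy H(d) of a domain prediction d in [0, 1], as in Eq. (1).

    Assumption: natural logarithm; d is clamped to avoid log(0).
    """
    d = min(max(d, eps), 1.0 - eps)
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def reweight_feature(feature, d):
    """Scale an image-level feature vector by 1 + H(d), as in Eq. (2)."""
    weight = 1.0 + binary_entropy(d)
    return [x * weight for x in feature]


# Toy example: the same feature under a confused and a confident discriminator.
confused = reweight_feature([1.0, 2.0], d=0.5)    # D_2 cannot tell the domain
confident = reweight_feature([1.0, 2.0], d=0.99)  # D_2 is almost certain
```

With the natural logarithm the weight ranges from 1 (fully confident discriminator) to 1 + ln 2 (prediction of 0.5), so hard-to-distinguish images contribute more to the global alignment.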

Context-Aware Instance-Level Alignment

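Previous instance-level alignment methods align local features obtained after ROI-Pooling, mainly matching object scale, viewpoint, deformation, and appearance across domains. The problem is that these features only cover the local region around the object and ignore the holistic context. Instance features therefore differ across domains, whereas the context vector, which is aggregated from lower-layer features, stays domain-invariant. The authors fuse the two so that they complement each other.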

Context vectors $f_c^i$ ($i=1,2,3$) are obtained from different layers of the backbone, and the feature of the $j$-th region of the $i$-th image after ROI-Pooling is $f_{ins}^{i,j}$. The simplest fusion concatenates the instance feature with the context vectors, which gives $[f_c^1, f_c^2, f_c^3, f_{ins}]$. With concatenation, however, the context vectors and the instance-level feature remain independent of each other, so the complementary effect described above is not realized.

Because the two features are also asymmetric, the authors consider fusing them with a tensor product. Multiplying $f_c^i$ ($i=1,2,3$) with $f_{ins}$ in this way, however, causes a dimension explosion:
$$\boldsymbol{f}_{fus} = [f_c^1, f_c^2, f_c^3] \otimes f_{ins} \qquad (4)$$

we propose to leverage the randomized methods as an unbiased estimator of the tensor product.

$$\boldsymbol{f}_{fus} = \frac{1}{\sqrt{d}}(\boldsymbol{R}_{1}\boldsymbol{f}_{c})\odot(\boldsymbol{R}_{2}\boldsymbol{f}_{ins}) \qquad (5)$$

where $\odot$ denotes the Hadamard product.
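A minimal sketch of the randomized fusion in Eq. (5) may make the dimensions clearer. It assumes Gaussian entries for the random matrices, and the helper names (`random_matrix`, `project`, `randomized_fusion`), the output dimension `d = 8`, and the toy vectors are hypothetical choices for illustration only.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Hypothetical helper: rows x cols matrix with standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(matrix, vec):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def randomized_fusion(f_c, f_ins, r1, r2):
    """Eq. (5): (1 / sqrt(d)) * (R1 f_c) elementwise-times (R2 f_ins)."""
    d = len(r1)  # output dimension, i.e. number of rows of R1 and R2
    p_c = project(r1, f_c)
    p_ins = project(r2, f_ins)
    scale = 1.0 / math.sqrt(d)
    return [scale * a * b for a, b in zip(p_c, p_ins)]


rng = random.Random(0)
f_c = [0.2, -0.1, 0.4, 0.3, 0.0, 0.5]  # toy concatenated context vector
f_ins = [1.0, 0.5, -0.5, 0.2]          # toy instance-level feature
d = 8
r1 = random_matrix(d, len(f_c), rng)
r2 = random_matrix(d, len(f_ins), rng)
f_fus = randomized_fusion(f_c, f_ins, r1, r2)
```

In this toy setting the full tensor product of Eq. (4) would have 6 × 4 = 24 entries, while the fused vector has only `d = 8`, and `d` can be chosen independently of the input sizes.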

$$\mathcal{L}_{ins} = -\frac{1}{N_s}\sum^{N_s}_{i=1}\sum_{i,j}\log(D_{ins}(\boldsymbol{f}^{i,j}_{fus})_s) - \frac{1}{N_t}\sum^{N_t}_{i=1}\sum_{i,j}\log(1-D_{ins}(\boldsymbol{f}^{i,j}_{fus})_t) \qquad (6)$$
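The instance-level loss in Eq. (6) is a binary cross-entropy over fused region features, normalized by the number of images per domain. The sketch below assumes that $D_{ins}$ outputs the probability that a region comes from the source domain; the function name, the nested-list layout (one inner list of region scores per image), and the `eps` clamp are illustrative assumptions.

```python
import math


def instance_alignment_loss(source_probs, target_probs, eps=1e-12):
    """Eq. (6) on precomputed discriminator outputs.

    source_probs / target_probs: one list per image, each holding the
    D_ins output for every region of that image (assumed layout).
    """
    def clamp(p):
        # Keep probabilities away from 0 and 1 so the logs stay finite.
        return min(max(p, eps), 1.0 - eps)

    n_s, n_t = len(source_probs), len(target_probs)
    src = sum(math.log(clamp(p)) for regions in source_probs for p in regions)
    tgt = sum(math.log(1.0 - clamp(p)) for regions in target_probs for p in regions)
    return -src / n_s - tgt / n_t


# Toy example: two source images and two target images.
loss = instance_alignment_loss([[0.9, 0.8], [0.7]], [[0.2], [0.1, 0.3]])
```

The discriminator lowers this value by separating the domains, while the feature extractor is trained adversarially against it.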

Local Feature Mask for Semantic Consistency

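Although images from different domains differ in scene layouts, object co-occurrence, and background, objects of the same class should share the same sketch across domains. The authors assume that some regions of an image are more descriptive and dominant than others, so they compute a mask from the local features of the shallow layers and use it to guide the subsequent semantic consistency.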

The pixel-wise adversarial loss for $G_1$ is
$$\mathcal{L}_{la} = \frac{1}{N_s \cdot HW}\sum^{N_s}_{i=1}\sum^{HW}_{k=1} \log(D_1(G_1(x^s_i)_k))^2 + \frac{1}{N_t \cdot HW}\sum^{N_t}_{i=1}\sum^{HW}_{k=1} \log(1-D_1(G_1(x^t_i)_k))^2 \qquad (7)$$
The feature masks $m^s_f$ and $m^t_f$ for the source and target domains are computed from the uncertainty of $D_1$.

The features that $G_1$ extracts from source and target samples are $r^k_i = (G_1(x_i))_k$, and the output of $D_1$ is $d^k_i = D_1(r^k_i)$. Analogous to Eq. (1), the uncertainty of each region is $v(r^k_i) = H(d^k_i)$, which gives the feature mask $m^k_f = 2 - v(r^k_i)$. The re-weighted feature is $\widetilde{r}^k_i \leftarrow r^k_i \cdot m^k_i$.
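The mask computation described above can be sketched as follows. The nested-list layout (`[C][H][W]` for features, `[H][W]` for discriminator outputs), the function names, and the natural logarithm are assumptions made for illustration, not details taken from the paper.

```python
import math


def binary_entropy(d, eps=1e-12):
    """H(d) as in Eq. (1); natural log and clamping are assumptions."""
    d = min(max(d, eps), 1.0 - eps)
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def local_feature_mask(d_map):
    """Per-location mask m = 2 - H(d) from the pixel-wise output of D_1."""
    return [[2.0 - binary_entropy(d) for d in row] for row in d_map]


def apply_mask(features, mask):
    """Multiply every channel of a [C][H][W] feature map by the [H][W] mask."""
    return [
        [[x * m for x, m in zip(f_row, m_row)] for f_row, m_row in zip(channel, mask)]
        for channel in features
    ]


# Toy 1-channel, 1x2 feature map: one uncertain and one confident location.
d_map = [[0.5, 0.99]]
features = [[[1.0, 1.0]]]
mask = local_feature_mask(d_map)
masked = apply_mask(features, mask)
```

Note the direction of the weighting: because the mask is `2 - H(d)`, locations where $D_1$ is uncertain receive a smaller weight than locations where it is confident, which is the opposite of the `1 + H(d)` weighting used in IWAT-I.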

The adversarial loss of $D_2$ is
$$\mathcal{L}_{ma} = \mathbb{E}[\log(D_2(G_2(\widetilde{f}^s_i)))] + \mathbb{E}[1-\log(D_2(G_2(\widetilde{f}^t_i)))] \qquad (8)$$

Training Loss

$\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ are the object detection losses. The overall objective is
$$\max\limits_{D_1,D_2,D_3}\min\limits_{G_1,G_2,G_3} \mathcal{L}_{cls}+\mathcal{L}_{reg}-\lambda(\mathcal{L}_{la}+\mathcal{L}_{ma}+\mathcal{L}_{ga}+\mathcal{L}_{ins}) \qquad (9)$$


Cityscapes to Foggy-Cityscapes


PASCAL VOC to Clipart


Sim10K to Cityscapes


Ablation Study


