1 Author
Swami Sankaranarayanan 1*, Yogesh Balaji 1*, Arpit Jain 2, Ser Nam Lim 2,3, Rama Chellappa1
1 UMIACS, University of Maryland, College Park, MD
2 GE Global Research, Niskayuna, NY
3 Avitas Systems, GE Venture, Boston, MA
∗First two authors contributed equally
2 Abstract
Contrary to previous approaches that use a simple adversarial objective or superpixel information to aid the process, we propose an approach based on Generative Adversarial Networks (GANs) that brings the embeddings closer in the learned feature space.
3 Introduction
The focus of this paper is in developing domain adaptation algorithms for semantic segmentation. Specifically, we focus on the hard case of the problem where no labels
from the target domain are available. This class of techniques is commonly referred to as Unsupervised Domain Adaptation.
Traditional approaches for domain adaptation involve minimizing some measure of distance between the source and target distributions. Two commonly used measures are the Maximum Mean Discrepancy (MMD) and a distance metric learned using DCNNs, as done in adversarial approaches.
The main contribution of this work is a technique that employs generative models to align the source and target distributions in the feature space.
4 Method
- We provide an input-output description of different network blocks in our pipeline.
- We describe separately the treatment of source and target data, followed by a description of the different loss functions and the corresponding update steps.
- We motivate the design choices involved in the discriminator (D) architecture.
4.1 Description of network blocks
(a) The base network, whose architecture is similar to a pre-trained model such as VGG-16, is split into two parts: the embedding, denoted by F, and the pixel-wise classifier, denoted by C. The output of C is a label map up-sampled to the same size as the input of F.
(b) The generator network (G) takes the learned embedding as input and reconstructs the RGB image.
(c) The discriminator network (D) performs two different tasks given an input: (a) it classifies the input as real or fake in a domain-consistent manner; (b) it performs a pixel-wise labeling task similar to the C network. Note that (b) is active only for source data, since target data does not have any labels during training.
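The input/output contract of the four blocks can be sketched with placeholder NumPy operations. This is only a shape-flow sketch: the function names, downsampling factor of 8, and all sizes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

H, W, N_CLASSES, EMB_C = 64, 64, 19, 128  # illustrative sizes (assumed)

def F_embed(image):
    """Embedding network F: RGB image -> spatially downsampled feature map."""
    h, w, _ = image.shape
    return np.zeros((h // 8, w // 8, EMB_C))  # placeholder for conv layers

def C_classify(embedding):
    """Pixel-wise classifier C: embedding -> label map upsampled to input size."""
    h, w, _ = embedding.shape
    return np.zeros((h * 8, w * 8, N_CLASSES))  # placeholder logits

def G_reconstruct(embedding):
    """Generator G: embedding -> reconstructed RGB image."""
    h, w, _ = embedding.shape
    return np.zeros((h * 8, w * 8, 3))

def D_discriminate(image):
    """Discriminator D: image -> (patch real/fake map, pixel-wise label map)."""
    h, w, _ = image.shape
    realfake_map = np.zeros((h // 8, w // 8))  # real/fake score per patch
    label_map = np.zeros((h, w, N_CLASSES))    # auxiliary labeling (source only)
    return realfake_map, label_map

x = np.zeros((H, W, 3))
emb = F_embed(x)
seg = C_classify(emb)        # label map at the same spatial size as the input
recon = G_reconstruct(emb)   # RGB reconstruction of the input
rf, aux = D_discriminate(recon)
```

The key contract to note is that C and G both consume the same embedding produced by F, which is what lets the adversarial signal on G's reconstructions flow back into the segmentation features.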
4.2 Treatment of source and target data
As shown in Figure 3, D performs two tasks for source data: (1) distinguishing real source inputs from generated source images (source-real/source-fake) and (2) producing a pixel-wise label map of the generated source image.
Given a target input $X^{t}$, the generator network G takes the target embedding from F as input and reconstructs the target image. Similar to the previous case, D is trained to distinguish between real target data (target-real) and the generated target images from G (target-fake). However, different from the previous case, D performs only a single task, i.e. it classifies the target input as target-real/target-fake. Since the target data does not have any labels during training, the classifier network C is not active when the system is presented with target inputs.
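The asymmetric treatment of the two domains can be summarized as a dispatch over which loss terms are active for a given batch. This is a bookkeeping sketch only; the loss names follow the notation of the next subsection, and the grouping into a dictionary is an assumption for illustration.

```python
def active_losses(domain):
    """Return which loss terms are computed for a source vs. target batch."""
    assert domain in ("source", "target")
    is_src = (domain == "source")
    return {
        "L_seg":   is_src,  # supervised segmentation loss requires labels
        "L_aux":   is_src,  # D's auxiliary pixel labeling requires labels
        "L_adv_D": True,    # real/fake discrimination works in both domains
        "L_adv_G": True,    # G is trained adversarially in both domains
        "L_rec":   True,    # image reconstruction needs no labels
    }

src = active_losses("source")
tgt = active_losses("target")
```

The table makes the unsupervised setting explicit: every label-dependent term is switched off for target batches, while the adversarial and reconstruction terms remain, and these are what align the two feature distributions.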
4.3 Iterative optimization
The directions of information flow across the different network blocks are shown in Figure 2.
The network blocks are updated iteratively in the following order:
- D-update: a combination of the within-domain adversarial loss $\mathcal{L}_{adv,D}^{s}$, the auxiliary classification loss $\mathcal{L}_{aux}^{s}$, and, for target inputs, $\mathcal{L}_{adv,D}^{t}$. Overall: $\mathcal{L}_{D}=\mathcal{L}_{adv,D}^{s}+\mathcal{L}_{adv,D}^{t}+\mathcal{L}_{aux}^{s}$
- G-update: a combination of the adversarial losses $\mathcal{L}_{adv,G}^{s}+\mathcal{L}_{adv,G}^{t}$ and the reconstruction losses $\mathcal{L}_{rec}^{s}$ and $\mathcal{L}_{rec}^{t}$. Overall: $\mathcal{L}_{G}=\mathcal{L}_{adv,G}^{s}+\mathcal{L}_{adv,G}^{t}+\mathcal{L}_{rec}^{s}+\mathcal{L}_{rec}^{t}$
- F-update: the parameters of F are updated using a combination of several loss terms: $\mathcal{L}_{F}=\mathcal{L}_{seg}+\alpha\,\mathcal{L}_{aux}^{s}+\beta\left(\mathcal{L}_{adv,F}^{s}+\mathcal{L}_{adv,F}^{t}\right)$
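Under assumed placeholder values, the three combined objectives above reduce to simple weighted sums. The per-term numbers and the values of α and β below are purely illustrative; only the structure of the three sums follows the text.

```python
# Placeholder per-term loss values for one iteration (illustrative numbers).
losses = {
    "adv_D_s": 0.7, "adv_D_t": 0.6, "aux_s": 1.2,
    "adv_G_s": 0.5, "adv_G_t": 0.4, "rec_s": 0.3, "rec_t": 0.2,
    "seg": 1.0, "adv_F_s": 0.5, "adv_F_t": 0.4,
}
alpha, beta = 0.1, 0.01  # assumed weighting coefficients

# D-update: within-domain adversarial losses plus auxiliary classification loss.
L_D = losses["adv_D_s"] + losses["adv_D_t"] + losses["aux_s"]

# G-update: adversarial losses plus reconstruction losses for both domains.
L_G = losses["adv_G_s"] + losses["adv_G_t"] + losses["rec_s"] + losses["rec_t"]

# F-update: segmentation loss plus weighted auxiliary and adversarial terms.
L_F = losses["seg"] + alpha * losses["aux_s"] \
      + beta * (losses["adv_F_s"] + losses["adv_F_t"])
```

Note that the three objectives are minimized in turn rather than jointly, which is why the update order (D, then G, then F) matters in the iterative scheme.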
4.4 Motivating the design choices of D
- Recent works on image generation have utilized the idea of a patch discriminator, in which the output is a two-dimensional feature map where each pixel carries a real/fake probability. In our case, the output map indicates real/fake probabilities across the source and target domains, hence resulting in four classes per pixel: src-real, src-fake, tgt-real, tgt-fake.
- Inspired by the Auxiliary Classifier GAN (ACGAN), where adding an auxiliary classification loss to D yields more stable GAN training and even enables generating large-scale images, we extend their idea to the segmentation problem by employing an auxiliary pixel-wise labeling loss on the D network.
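The two-headed discriminator output described above can be sketched as follows. The class ordering, spatial sizes, and random logits are assumptions made for illustration; the point is the shape of the two heads: a 4-way domain map per patch and an auxiliary per-pixel label map.

```python
import numpy as np

DOMAIN_CLASSES = ["src-real", "src-fake", "tgt-real", "tgt-fake"]

def patch_discriminator_output(h_patches, w_patches, n_seg_classes):
    """Sketch of D's two heads: a 4-way domain map and an auxiliary label map."""
    rng = np.random.default_rng(0)
    # Head 1: per-patch logits over the four domain/realness classes.
    domain_logits = rng.standard_normal(
        (h_patches, w_patches, len(DOMAIN_CLASSES)))
    # Head 2: per-pixel logits for the auxiliary labeling task
    # (supervised only on source data, which has labels).
    aux_logits = rng.standard_normal(
        (h_patches * 8, w_patches * 8, n_seg_classes))
    return domain_logits, aux_logits

dom, aux = patch_discriminator_output(8, 8, 19)
# Each patch is assigned one of the four domain classes.
patch_pred = dom.argmax(axis=-1)
```

Collapsing real/fake and source/target into a single 4-way decision per patch is what lets one discriminator provide a domain-consistent adversarial signal for both domains at once.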