Fully Convolutional Adaptation Networks for Semantic Segmentation
Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei
Abstract
Problem: pixel-level annotation for semantic segmentation is extremely labor-intensive
One idea: render synthetic data (e.g., from computer games) and generate ground truth automatically
Drawback: simply applying the models learnt on synthetic images may lead to high generalization error on real images due to domain shift
Solution: Fully Convolutional Adaptation Networks (FCAN) = Appearance Adaptation Networks (AAN, learns a transformation from one domain to the other in pixel space) + Representation Adaptation Networks (RAN, optimized in an adversarial learning manner to maximally fool the domain discriminator with the learnt source and target representations)
visual appearance-level domain adaptation: adapts source-domain images to appear as if drawn from the “style” in the target domain
representation-level domain adaptation: learn domain-invariant representations
Introduction
Labeling data is laborious, but synthesizing data directly from game scenes such as GTA5 introduces a "domain shift" problem
Solution: unsupervised domain adaptation, i.e., utilize labeled examples from the source domain and a large number of unlabeled examples in the target domain to reduce the prediction error on the target data
build invariance across domains by minimizing the measure of domain shift such as correlation distances or maximum mean discrepancy
build appearance level and representation-level invariance
appearance level invariance: recombine the image content in one domain with the “style” from the other domain
representation-level invariance: model domain distribution via an adversarial objective with respect to a domain discriminator. guiding the representation learning in both domains, making the difference between source and target representation distributions indistinguishable through the domain discriminator
FCAN:
AAN: constructs an image that captures the high-level content of a source image and the low-level pixel information of the target domain; starts from a white-noise image and adjusts the output image by gradient descent to minimize the Euclidean distance between the feature maps of the output image and those of the source image, as well as the mean feature maps of the images in the target domain (similar to neural style transfer)
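A minimal numpy sketch of this optimization, with a toy fixed linear map standing in for the pretrained CNN feature extractor (an assumption for illustration; the paper uses CNN feature maps). The output image starts as white noise and is updated by gradient descent toward the source content features and the mean target-domain features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature extractor": a fixed random linear map, standing in for CNN features.
D_in, D_feat = 64, 32
W = rng.standard_normal((D_feat, D_in)) / np.sqrt(D_in)

x_src = rng.standard_normal(D_in)        # a source-domain image (flattened)
x_tgt = rng.standard_normal((10, D_in))  # a batch of target-domain images

f_src = W @ x_src                        # content features of the source image
f_tgt_mean = (W @ x_tgt.T).mean(axis=1)  # mean target-domain features ("style")

def aan_loss_and_grad(x, w_style=1.0):
    """Euclidean distance to source features plus distance to mean target features."""
    f = W @ x
    r_content, r_style = f - f_src, f - f_tgt_mean
    loss = (r_content ** 2).sum() + w_style * (r_style ** 2).sum()
    grad = 2 * W.T @ r_content + 2 * w_style * W.T @ r_style
    return loss, grad

# Start from a white-noise image and adjust it by gradient descent.
x = rng.standard_normal(D_in)
losses = []
for _ in range(200):
    loss, grad = aan_loss_and_grad(x)
    x -= 0.05 * grad
    losses.append(loss)
```

With both distance terms active, the optimum balances source content against the average target appearance, which is the intuition behind the adapted image.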
RAN: a fully convolutional network is first employed to produce image representations in each domain, followed by bilinear interpolation to upsample the outputs for pixel-level classification, while a domain discriminator is trained to distinguish between the source and target domains
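The bilinear upsampling step can be sketched in numpy as follows (an illustrative implementation with align-corners-style sampling, not the paper's code):

```python
import numpy as np

def bilinear_upsample(score, out_h, out_w):
    """Bilinearly upsample an (H, W, C) score map to (out_h, out_w, C)."""
    h, w, _ = score.shape
    # Sampling grid in input coordinates (corners of input and output align).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = score[y0][:, x0] * (1 - wx) + score[y0][:, x1] * wx
    bot = score[y1][:, x0] * (1 - wx) + score[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# Upsample a 2x2 single-channel score map to 3x3.
coarse = np.array([[[0.], [2.]],
                   [[4.], [6.]]])
fine = bilinear_upsample(coarse, 3, 3)
```

Because upsampling is a fixed interpolation, gradients flow straight through it to the coarse score maps during training.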
Atrous Spatial Pyramid Pooling (ASPP): enlarges the filters' field of view on the feature map and endows the domain discriminator with more power
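A 1-D numpy sketch of why atrous (dilated) convolution enlarges the field of view without adding weights; the parallel branches with different rates mimic ASPP's multi-scale structure (a simplified illustration, not the paper's 2-D implementation):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """'Atrous' 1-D convolution: taps are spaced `rate` apart, so the field of
    view grows from len(kernel) to (len(kernel) - 1) * rate + 1."""
    k = len(kernel)
    span = (k - 1) * rate
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

# ASPP-style parallel branches: same 3-tap kernel, increasing dilation rates,
# each branch seeing progressively wider context.
x = np.arange(16, dtype=float)
kernel = np.array([1., 1., 1.])
branches = [dilated_conv1d(x, kernel, r) for r in (1, 2, 4)]
```

In ASPP the branch outputs are then fused, so the discriminator sees evidence aggregated over several context sizes at once.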
loss: a classification loss measures pixel-level semantics; an adversarial loss maximally fools the domain discriminator with the learnt source and target representations
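A numpy sketch of the two terms and their combination. The cross-entropy form of the segmentation loss and the inverted-label form of the adversarial loss are standard choices; the weighting `lam` is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def seg_loss(logits, labels):
    """Pixel-level cross-entropy: logits (N, C), labels (N,), N = #pixels."""
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def adv_loss(d_out_src, d_out_tgt):
    """Inverted-label adversarial term on the discriminator's outputs
    (D outputs probability of "target"): representations are pushed so that
    D cannot tell source from target."""
    eps = 1e-8
    return -(np.log(1 - d_out_src + eps).mean() + np.log(d_out_tgt + eps).mean())

# Combined objective on toy data; lam balances the two terms (illustrative).
lam = 0.1
rng = np.random.default_rng(1)
logits = rng.standard_normal((6, 4))
labels = rng.integers(0, 4, 6)
d_src = rng.uniform(0.1, 0.9, 5)
d_tgt = rng.uniform(0.1, 0.9, 5)
total = seg_loss(logits, labels) + lam * adv_loss(d_src, d_tgt)
```

Confident, correct predictions drive the segmentation term toward zero, while the adversarial term is smallest when the discriminator scores both domains alike.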
Related Work
Semantic segmentation: FCN, multi-scale feature ensemble (Dilated Convolution, RefineNet, DeepLab and HAZNet), context information preservation (ParseNet, PSPNet and DST-FCN), post-processing (CRF), weak supervision (instance-level bounding boxes, image-level tags)
Deep domain adaptation: transfer a model learnt in a labeled source domain to a target domain within a deep learning framework. unsupervised adaptation (Deep Correlation Alignment, Adversarial Discriminative Domain Adaptation), supervised adaptation (Deep Domain Confusion), semi-supervised (Deep Adaptation Network)
Fully Convolutional Adaptation Networks (FCAN) for Semantic Segmentation
Appearance Adaptation Networks (AAN)
Identical to neural style transfer
Representation Adaptation Networks (RAN)
To further reduce the impact of domain shift, guide the learning of feature representations in both domains by fooling a domain discriminator D with the learnt source and target representations
resize the images to multiple resolutions so the discriminator sees multi-scale inputs
simultaneously optimize the standard pixel-level classification loss L_seg and the adversarial loss L_adv