Macro-Micro Adversarial Network for Human Parsing
ECCV-2018 2018-10-27 15:15:07
Paper: https://arxiv.org/pdf/1807.08260.pdf
Code: https://github.com/RoyalVane/MMAN
Motivation-1: Why use the Adversarial Loss?
In CNN-based architectures, the pixel-wise classification loss is usually used [19,34,10]; it penalizes the classification error of each pixel. Although it provides an effective baseline, the pixel-wise classification loss, which is designed for per-pixel category prediction, has two drawbacks.
First, the pixel-wise classification loss may lead to local inconsistency, such as holes and blur. The reason is that it merely penalizes the false prediction on every pixel without explicitly considering the correlation among adjacent pixels.
Second, the pixel-wise classification loss may lead to semantic inconsistency in the overall segmentation map, such as unreasonable human poses and incorrect spatial relationships between body parts. Compared to the local inconsistency, the semantic inconsistency is generated from deeper layers. When only looking at a local region, the learned model does not have an overall sense of the topology of body parts.
To address these inconsistency problems, conditional random fields (CRFs) [17] can be employed as a post-processing method. However, because of their pairwise potentials, CRFs usually handle inconsistency only in a very limited (local) scope, and they may even generate worse label maps when the initial segmentation result is poor. As an alternative to CRFs, a recent work proposes the use of an adversarial network [24]. Since the adversarial loss assesses whether a label map is real or fake from the joint configuration of many label variables, it can enforce higher-level consistency, which cannot be achieved with pairwise terms or the per-pixel classification loss. An increasing number of works now combine the cross-entropy loss with an adversarial loss to produce label maps closer to the ground truth [5,27,12].
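The routine of combining a cross-entropy loss with an adversarial loss can be sketched as follows. This is only an illustration in plain Python, not the authors' implementation: the function names, the scalar discriminator score `d_score`, and the weight `adv_weight` are all assumptions made for this example.

```python
import math


def pixel_cross_entropy(probs, labels):
    """Mean per-pixel cross-entropy.

    probs:  H x W x C nested lists of class probabilities per pixel.
    labels: H x W nested lists of ground-truth class indices.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for pixel_probs, label in zip(prob_row, label_row):
            # Each pixel is penalized independently of its neighbours.
            total += -math.log(pixel_probs[label])
            count += 1
    return total / count


def generator_adversarial_loss(d_score):
    """Generator-side adversarial term.

    d_score is the (hypothetical) discriminator's probability, in (0, 1],
    that the whole predicted label map is real. The term is small when
    the discriminator is fooled and large when it is not.
    """
    return -math.log(d_score)


def combined_loss(probs, labels, d_score, adv_weight=0.01):
    """Cross-entropy plus a weighted adversarial term (adv_weight is assumed)."""
    return pixel_cross_entropy(probs, labels) + adv_weight * generator_adversarial_loss(d_score)


# Tiny 1 x 2 label map with two classes.
probs = [[[0.9, 0.1], [0.2, 0.8]]]
labels = [[0, 1]]
loss_fooled = combined_loss(probs, labels, d_score=0.9)      # discriminator mostly fooled
loss_not_fooled = combined_loss(probs, labels, d_score=0.1)  # discriminator not fooled
```

The cross-entropy term scores every pixel in isolation, while the adversarial term depends on a single judgment of the whole map, which is why it can penalize configurations that look wrong only when many pixels are considered jointly.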
Motivation-2: Why use Two Discriminators?
Nevertheless, the previous adversarial network also has its limitations.
First, a single discriminator back-propagates only one adversarial loss to the generator. However, the local inconsistency is generated from top layers and the semantic inconsistency is generated from deep layers. The two targeted layers cannot be trained separately with only one adversarial loss.
Second, a single discriminator has to look at the overall high-resolution image (or a large part of it) in order to supervise the global consistency. As noted in several works [7,14], it is very difficult for a generator to fool the discriminator on a high-resolution image. As a result, the single discriminator invariably back-propagates a maximum adversarial loss, which makes the training unbalanced. We call this the poor convergence problem, as shown in Fig. 2.
Our Proposed Approach:
In this paper, the basic objective is to improve the local and semantic consistency of label maps in human parsing. We adopt the idea of adversarial training and at the same time aim to address its limitations, i.e., the inferior ability of a single adversarial loss to improve parsing consistency, and the poor convergence problem. Specifically, we introduce the Macro-Micro Adversarial Nets (MMAN). MMAN consists of a dual-output generator (G) and two discriminators (D), named Macro D and Micro D. The three modules constitute two adversarial networks (Macro AN, Micro AN), addressing the semantic consistency and the local consistency, respectively.
Difference with Previous Works:
A brief pipeline of the proposed framework is shown in Fig. 3. It is in two critical aspects that MMAN departs from previous works.
First, our method explicitly copes with the local inconsistency and semantic inconsistency problems using two task-specific adversarial networks, one for each problem.
Second, our method does not use large-sized FOVs on high-resolution images, so we can avoid the poor convergence problem. A more detailed description of the merits of the proposed network is provided in Section 3.5.
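To make the FOV argument concrete, the sketch below computes the receptive field of one output unit of a stack of convolution layers and compares how much of a label map that field covers at two resolutions. The layer configurations and the map sizes (16 and 256) are hypothetical values chosen for illustration, and the premise that the two generator outputs differ in resolution is an assumption here; none of these numbers are taken from the paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per side) of one output unit.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # stride multiplies the step between adjacent units
    return rf


def side_coverage(rf, map_side):
    """Fraction of one side of a square label map seen by a single output unit."""
    return min(1.0, rf / map_side)


# Hypothetical small discriminator: three 4x4 convolutions with strides 2, 2, 1.
small_d = [(4, 2), (4, 2), (4, 1)]
fov = receptive_field(small_d)

# The same small FOV spans an entire low-resolution map,
# but only a small local patch of a high-resolution map.
coverage_low_res = side_coverage(fov, map_side=16)
coverage_high_res = side_coverage(fov, map_side=256)
```

Under these assumed sizes, a discriminator with a small FOV can supervise global structure when it is applied to a low-resolution label map, and local detail when it is applied to a high-resolution one, without any single unit having to judge a large high-resolution region.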
Our Contributions:
– We propose a new framework called Macro-Micro Adversarial Network (MMAN) for human parsing. The Macro AN and Micro AN focus on semantic and local inconsistency respectively, and work in a complementary way to improve the parsing quality.
– The two discriminators in our framework achieve local and global supervision on the label maps with small fields of view (FOVs), which avoids the poor convergence problem caused by high-resolution images.
– The proposed adversarial net achieves very competitive mIoU on the LIP and PASCAL-Person-Part datasets, and generalizes well to the relatively small PPSS dataset.
==