# Problem
- Traditional GAN methods operate directly on the whole image, and inevitably change attribute-irrelevant regions.
- The performance of traditional regression methods depends heavily on paired training data, which is difficult to acquire.
# Related work
- ResGAN: learning a residual image avoids changing the attribute-irrelevant regions, by constraining most of the residual image to be zero.
- Strength: this work is insightful in enforcing that the manipulation concentrates mainly on local areas, especially for local attributes.
- Drawback: the location and the appearance of the target attribute are modeled in a single sparse residual image, which is harder to optimize favorably than modeling them separately.
# Method
- SaGAN: alter only the attribute-specific region and keep the rest unchanged.
- The generator contains an attribute manipulation network (AMN) to edit the face image, and a spatial attention network (SAN) to localize the attribute-specific region, which restricts the alteration of AMN within this region.
# Contribution
- Spatial attention is introduced into the GAN framework, forming an end-to-end generative model for face attribute editing (referred to as SaGAN), which alters only the attribute-specific region and keeps the irrelevant regions unchanged.
- The proposed SaGAN adopts a single generator with the attribute as a conditional signal, rather than two dual generators for the two inverse face attribute editing directions.
- The proposed SaGAN achieves quite promising results, especially for local attributes, with the attribute-irrelevant details well preserved. Besides, the approach also benefits face recognition via data augmentation.
# Generative Adversarial Network with Spatial Attention

notation | meaning
---|---
$I$ | input image
$\hat{I}$ | output image
$I_a$ | an edited face image output by AMN
$c$ | attribute value
$c^g$ | ground-truth attribute label of the real image $I$
$D_{src}(I)$ | probability of an image $I$ being real
$D_{cls}(c\|I)$ | probability of an image $I$ having the attribute $c$
$F_m$ | the attribute manipulation network (AMN)
$F_a$ | the spatial attention network (SAN)
$b$ | spatial attention mask, used to restrict the alteration of AMN within the attribute-specific region
$\lambda_1$, $\lambda_2$ | balance parameters
$\lambda_{gp}$ | hyper-parameter controlling the gradient penalty, default = 10
- The goal of face attribute editing is to translate $I$ into a new image $\hat{I}$, which should be realistic, have attribute $c$, and look the same as the input image outside the attribute-specific region.
## Discriminator
- Two objectives: one to distinguish the generated images from the real ones, and another to classify the attributes of the generated and real images.
- The two classifiers are both designed as CNNs with a softmax function, denoted as $D_{src}$ and $D_{cls}$ respectively.
- The two networks can share the first few convolutional layers, followed by distinct fully-connected layers for the different classifications.
$$\mathcal{L}_{src}^D = \mathbb{E}_{I}[-\log D_{src}(I)] + \mathbb{E}_{\hat{I}}[-\log(1-D_{src}(\hat{I}))] \tag{1}$$

$$\mathcal{L}_{cls}^D = \mathbb{E}_{I,c^g}[-\log D_{cls}(c^g|I)]$$
Overall objective for the discriminator $D$:

$$\min_{D_{src},D_{cls}} \mathcal{L}_D = \mathcal{L}_{src}^D + \mathcal{L}_{cls}^D$$
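The two discriminator terms above can be sketched in plain Python on scalar probabilities. This is a minimal illustration, not the paper's implementation; the function names (`d_src_loss`, `d_cls_loss`) are hypothetical, and in practice these are standard cross-entropy losses computed on network outputs.

```python
import math

def d_src_loss(p_real, p_fake):
    """Real/fake loss L_src^D: -log D_src(I) on real images plus
    -log(1 - D_src(I_hat)) on edited images, averaged over a batch."""
    real_term = sum(-math.log(p) for p in p_real) / len(p_real)
    fake_term = sum(-math.log(1.0 - p) for p in p_fake) / len(p_fake)
    return real_term + fake_term

def d_cls_loss(p_true_attr):
    """Attribute classification loss L_cls^D: -log D_cls(c^g | I), where
    each entry is the predicted probability of the ground-truth label."""
    return sum(-math.log(p) for p in p_true_attr) / len(p_true_attr)

# Overall discriminator objective: L_D = L_src^D + L_cls^D
loss_D = d_src_loss([0.9, 0.8], [0.2, 0.1]) + d_cls_loss([0.7, 0.95])
```

Note that $\mathcal{L}_{cls}^D$ is computed only on real images, since their attribute labels $c^g$ are known.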
## Generator
- $G$ contains two modules, an attribute manipulation network (AMN) and a spatial attention network (SAN).
- AMN focuses on how to manipulate, and SAN focuses on where to manipulate.
- The attribute manipulation network takes a face image $I$ and an attribute value $c$ as input, and outputs an edited face image $I_a$:

$$I_a = F_m(I, c)$$

- The spatial attention network takes the face image $I$ as input and predicts a spatial attention mask $b$, which is used to restrict the alteration of AMN within the attribute-specific region.
- Ideally, the attribute-specific region of $b$ should be 1, and the rest should be 0.
- Regions with non-zero attention values are all regarded as attribute-specific, and the rest with zero attention values are regarded as attribute-irrelevant.

$$b = F_a(I)$$

- The attribute-specific regions are manipulated towards the target attribute while the rest remain the same:

$$\hat{I} = G(I, c) = I_a \cdot b + I \cdot (1-b)$$
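The attention-guided composition above is just a per-pixel convex blend. A minimal sketch in plain Python (the `blend` name and flattened-list image representation are illustrative assumptions; real implementations operate on tensors):

```python
def blend(image, edited, mask):
    """Compose the final output I_hat = I_a * b + I * (1 - b), applied
    per pixel; mask values lie in [0, 1] (SAN ends with a sigmoid)."""
    return [ia * b + i * (1.0 - b)
            for i, ia, b in zip(image, edited, mask)]

# Where the mask is 1 the AMN edit passes through; where it is 0 the
# original pixel is kept, so attribute-irrelevant regions are untouched.
original = [0.2, 0.4, 0.6, 0.8]
edited   = [0.9, 0.9, 0.9, 0.9]
mask     = [1.0, 1.0, 0.0, 0.0]   # attribute-specific region on the left
print(blend(original, edited, mask))  # → [0.9, 0.9, 0.6, 0.8]
```

This composition is why SaGAN preserves attribute-irrelevant details by construction: pixels with zero attention are copied from the input unchanged, rather than merely encouraged to stay similar by a loss.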
- To make the edited face image $\hat{I}$ photo-realistic, an adversarial loss is designed to confuse the real/fake classifier:

$$\mathcal{L}^G_{src} = \mathbb{E}_{\hat{I}}[-\log D_{src}(\hat{I})] \tag{2}$$

- To make $\hat{I}$ correctly carry the target attribute $c$, an attribute classification loss is designed to enforce that the attribute prediction of $\hat{I}$ from the attribute classifier approximates the target value $c$:
$$\mathcal{L}_{cls}^G = \mathbb{E}_{\hat{I}}[-\log D_{cls}(c|\hat{I})]$$

- To keep the attribute-irrelevant region unchanged, a reconstruction loss is employed, similar to CycleGAN and StarGAN:

$$\mathcal{L}_{rec}^G = \lambda_1\mathbb{E}_{I,c,c^g}[\|I-G(G(I,c),c^g)\|_1] + \lambda_2\mathbb{E}_{I,c^g}[\|I-G(I,c^g)\|_1]$$

- Overall objective for the generator $G$:
$$\min_{F_m,F_a} \mathcal{L}_G = \mathcal{L}_{src}^G + \mathcal{L}_{cls}^G + \mathcal{L}_{rec}^G$$
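The reconstruction term combines a cycle-consistency part (editing to $c$ and back to $c^g$ should recover $I$) and an identity part (editing towards the image's own attribute $c^g$ should change nothing). A minimal sketch, assuming flattened-list images and hypothetical helper names:

```python
def l1(a, b):
    """Mean absolute difference ||a - b||_1 / n between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def g_rec_loss(image, cycled, identity, lam1=1.0, lam2=1.0):
    """Reconstruction loss L_rec^G:
    lam1 * ||I - G(G(I, c), c^g)||_1  (cycle-consistency term)
    + lam2 * ||I - G(I, c^g)||_1      (identity term)."""
    return lam1 * l1(image, cycled) + lam2 * l1(image, identity)

# The full generator objective then sums the three terms:
#   L_G = L_src^G + L_cls^G + L_rec^G
```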
# Implementation
## Optimization
To optimize the adversarial real/fake classification more stably, in all experiments the objectives in Eq.(1) and Eq.(2) are optimized using WGAN-GP:
$$\mathcal{L}_{src}^D = -\mathbb{E}_{I}[D_{src}(I)] + \mathbb{E}_{\hat{I}}[D_{src}(\hat{I})] + \lambda_{gp}\mathbb{E}_{\tilde{I}}[(\|\nabla_{\tilde{I}}D_{src}(\tilde{I})\|_2-1)^2]$$
where $\tilde{I}$ is sampled uniformly along a straight line between the edited image $\hat{I}$ and the real image $I$.
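The sampling of $\tilde{I}$ can be sketched as a per-sample convex combination with a single random coefficient. This is only the sampling step (the `sample_interpolates` name is hypothetical); computing the gradient-penalty term itself requires automatic differentiation through $D_{src}$, e.g. `torch.autograd.grad` in a PyTorch implementation.

```python
import random

def sample_interpolates(real, fake):
    """Draw I_tilde uniformly on the straight line between a real image I
    and an edited image I_hat (one shared t for all pixels of a sample),
    as needed for the WGAN-GP gradient-penalty term."""
    t = random.uniform(0.0, 1.0)
    return [t * r + (1.0 - t) * f for r, f in zip(real, fake)]
```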
## Network Architecture
- For the generator, AMN and SAN share the same network architecture, except for slight differences in the input and output:

Network | Input | Output | Activation function
---|---|---|---
AMN | 4-channel input: an input image and an attribute | 3-channel RGB image | Tanh
SAN | 3-channel input: an input image | 1-channel attention mask | Sigmoid

- For the discriminator, the same architecture as PatchGAN is used, considering its promising performance.