LADN: Local Adversarial Disentangling Network for Facial Makeup and De-Makeup (ICCV 2019)

3. LADN

3.1. Problem Formulation

Define the domains: $X \subset \mathbb{R}^{H\times W\times 3}$ for before-makeup faces and $Y \subset \mathbb{R}^{H\times W\times 3}$ for after-makeup faces.

The dataset consists of $\{x_i\}_{i=1,\cdots,M},\ x_i \in X$ and $\{y_j\}_{j=1,\cdots,N},\ y_j \in Y$.

Goal: learn a makeup mapping function $\Phi_Y: x_i, y_j \rightarrow \tilde{y}_i$ and a de-makeup mapping function $\Phi_X: y_j \rightarrow \tilde{x}_j$.
Note that makeup transfer requires a reference image as a condition, whereas makeup removal requires no condition.

3.2. Network Architecture

Core idea: separate the makeup style latent variable from non-makeup features (identity, facial structure, head pose, etc.) and generate new images through recombination of these latent variables.
To this end, the paper adopts the disentanglement framework DRIT.

[Fig. 2: network architecture]
Definitions: an attribute space $A$ that captures the makeup style latent, and a content space $S$ that includes the non-makeup features.

Networks: content encoders $\{E_X^c, E_Y^c\}$ for the two domains, style encoders $\{E_X^a, E_Y^a\}$ (superscript $a$ because they are really attribute encoders), and generators $\{G_X, G_Y\}$.

The encoder networks extract attribute and content features from $x_i$ and $y_j$:
$$E_X^a(x_i)=A_i \quad E_Y^a(y_j)=A_j \qquad E_X^c(x_i)=C_i \quad E_Y^c(y_j)=C_j$$
These are then fed into the generators to produce the de-makeup result $\tilde{x}_j$ and the makeup-transfer result $\tilde{y}_i$:
$$G_X(A_i, C_j)=\tilde{x}_j \quad G_Y(A_j, C_i)=\tilde{y}_i \qquad(1)$$

The encoders and decoders are designed with a U-Net structure. The latent variables $A$ and $C$ are concatenated at the bottleneck, and skip connections are used between the content encoder and the generator. This structure helps retain more identity details from the source in the generated image.
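The bottleneck concatenation can be sketched with plain array operations. The shapes below (a channel-first content feature map and an attribute vector broadcast over the spatial grid) are illustrative assumptions, not the paper's exact dimensions:

```python
import numpy as np

def concat_at_bottleneck(content_feat, attr_vec):
    """Tile the attribute code over the spatial grid and concatenate it
    with the content feature map along the channel axis (illustrative)."""
    c, h, w = content_feat.shape
    # Broadcast the 1-D attribute code to an (len(attr_vec), h, w) map.
    attr_map = np.broadcast_to(attr_vec[:, None, None], (attr_vec.shape[0], h, w))
    return np.concatenate([content_feat, attr_map], axis=0)

content = np.random.rand(256, 16, 16)   # hypothetical content bottleneck C
attr = np.random.rand(8)                # hypothetical attribute code A
fused = concat_at_bottleneck(content, attr)
print(fused.shape)                      # (264, 16, 16)
```

The generator would decode `fused` back to image resolution, with skip connections from the content encoder carrying identity details around the bottleneck.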

For the two domains, two discriminators $\{D_X, D_Y\}$ are set up, giving the adversarial loss $L_{domain}^{adv}=L_X^{adv}+L_Y^{adv}$:
$$\begin{aligned} &L_X^{adv}=\mathbb{E}_{x\sim P_X}\left[\log D_X(x)\right] + \mathbb{E}_{\tilde{x}\sim G_X}\left[\log\left(1-D_X(\tilde{x})\right)\right] \\ &L_Y^{adv}=\mathbb{E}_{y\sim P_Y}\left[\log D_Y(y)\right] + \mathbb{E}_{\tilde{y}\sim G_Y}\left[\log\left(1-D_Y(\tilde{y})\right)\right] \end{aligned} \qquad(2)$$
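Eq. (2) is the standard minimax GAN objective per domain. A minimal numpy sketch, assuming the discriminator outputs are probabilities in (0, 1) and the batches are toy values:

```python
import numpy as np

def domain_adv_loss(d_real, d_fake):
    """Eq. (2) for one domain: E[log D(real)] + E[log(1 - D(fake))].
    d_real / d_fake are discriminator probabilities on real and
    generated images (illustrative batches, not the paper's code)."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Toy probabilities for the domain-X and domain-Y discriminators.
L_X = domain_adv_loss(np.array([0.9, 0.8]), np.array([0.2, 0.1]))
L_Y = domain_adv_loss(np.array([0.7, 0.95]), np.array([0.3, 0.05]))
L_domain = L_X + L_Y   # L_domain^adv = L_X^adv + L_Y^adv
```

The discriminators maximize this quantity while the encoders and generators minimize it.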

3.3. Local Style Discriminator

[Fig. 3: local patches used by the local style discriminators]
In addition to the two global discriminators $D_X, D_Y$, local discriminators are introduced.

For each unpaired sample $(x_i, y_j)$, a synthetic ground truth $W(x_i, y_j)$ is generated by warping and blending $y_j$ onto $x_i$ according to their facial landmarks.

The warping result naturally contains artifacts, which the network is expected to fix.
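The blending step of building $W(x_i, y_j)$ can be sketched as a mask-weighted combination. The landmark-driven warp itself is assumed to be done elsewhere (e.g. a triangulation or thin-plate-spline warp), so `warped_y` and the soft face `mask` below are hypothetical inputs, not the paper's pipeline:

```python
import numpy as np

def blend_synthetic_gt(x, warped_y, mask):
    """Blend the warped makeup face onto the before-makeup face.
    mask in [0, 1] is a soft face-region mask aligned with x
    (a sketch of the blending idea, not the paper's exact method)."""
    return mask * warped_y + (1.0 - mask) * x

x = np.zeros((4, 4, 3))          # toy before-makeup image
warped_y = np.ones((4, 4, 3))    # toy warped reference
mask = np.full((4, 4, 1), 0.5)   # uniform soft mask
w = blend_synthetic_gt(x, warped_y, mask)
```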

Note: isn't the cost of this generation too high? If $X$ has 1k images and $Y$ has 1k images, crossing every pair would require generating one million synthetic images.

The benefit: although the synthetic results cannot serve as the real ground truth of the final results, they can provide guidance to the makeup transfer network on what the generated results should look like.

As shown in Fig. 3, on the 512×512 reference image $y_j$, the warping result $W(x_i, y_j)$, and the generated result $\tilde{y}_i$, $K=12$ boxes are placed to crop out different patches. Each box is 102×102, and the box positions are tied to the facial landmarks.

Taking the red box as an example, (a)–(c) show the magnified patches.
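Extracting the landmark-anchored boxes can be sketched as below. Centering each 102×102 box on a landmark is an assumption about how the boxes are tied to the landmarks, and the landmark coordinates are toy values:

```python
import numpy as np

def extract_patches(img, landmarks, size=102):
    """Crop a size x size patch centered on each landmark, clamping the
    box to the image border. img is H x W x 3; landmarks is [(y, x), ...]."""
    h, w = img.shape[:2]
    half = size // 2
    patches = []
    for (cy, cx) in landmarks:
        y0 = int(np.clip(cy - half, 0, h - size))
        x0 = int(np.clip(cx - half, 0, w - size))
        patches.append(img[y0:y0 + size, x0:x0 + size])
    return patches

img = np.zeros((512, 512, 3))
toy_landmarks = [(120, 180), (120, 330), (260, 256), (400, 256)]  # hypothetical
patches = extract_patches(img, toy_landmarks)
```

Running the same crop on $y_j$, $W(x_i, y_j)$, and $\tilde{y}_i$ yields the aligned patch triplets fed to the local discriminators.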

The $K$ patches correspond to $K$ local discriminators $\{D_k^{local}\}_{k=1,\cdots,K}$.

Note that the local discriminators are overlapping (the boxes share common regions, hence the overlap).

The patch $p_k^Y$ from the reference image, the patch $p_k^W$ from the warping result, and the patch $\tilde{p}_k^Y$ from the generated result are combined into a positive pair $(p_k^Y, p_k^W)$ and a negative pair $(p_k^Y, \tilde{p}_k^Y)$, which are fed into the local discriminator $D_k^{local}$.

Define the local adversarial loss $L^{local}=\sum_{k}L_k^{local}$:
$$\begin{aligned} L_k^{local}&=\mathbb{E}_{x_i\sim P_X, y_j\sim P_Y}\left[\log D_k^{local}\left(p_k^Y, p_k^W\right)\right] \\ &+\mathbb{E}_{x_i\sim P_X, y_j\sim P_Y}\left[\log\left(1-D_k^{local}\left(p_k^Y, \tilde{p}_k^Y\right)\right)\right] \end{aligned} \qquad(3)$$

The min-max game between the local discriminators and the generator networks is defined as
$$\underset{D_k^{local}}{\max}\ \underset{E_X^c, E_Y^a, G_Y}{\min}\ L^{local} \qquad(4)$$
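Summing Eq. (3) over the $K$ patch discriminators can be sketched as follows. Each `D_k` here is a toy stand-in that maps a (reference-patch, candidate-patch) pair to a probability, not the paper's actual network:

```python
import numpy as np

def local_adv_loss(D_list, ref_patches, warp_patches, gen_patches):
    """L^local = sum_k [ log D_k(p_k^Y, p_k^W) + log(1 - D_k(p_k^Y, p~_k^Y)) ].
    D_list holds one callable per patch index (toy discriminators)."""
    total = 0.0
    for D_k, p_ref, p_w, p_gen in zip(D_list, ref_patches, warp_patches, gen_patches):
        total += np.log(D_k(p_ref, p_w))          # positive pair
        total += np.log(1.0 - D_k(p_ref, p_gen))  # negative pair
    return total

# Toy discriminator: similarity-based probability in (0, 1].
def toy_D(a, b):
    return 1.0 / (1.0 + np.mean(np.abs(a - b)))

K = 3
ref = [np.full((2, 2), 0.5) for _ in range(K)]
warp = [np.full((2, 2), 0.5) for _ in range(K)]   # identical to ref -> D = 1
gen = [np.full((2, 2), 1.5) for _ in range(K)]    # different -> D = 0.5
L = local_adv_loss([toy_D] * K, ref, warp, gen)
```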

In summary, the synthetic warping results and the local discriminators are introduced to provide extra guidance, helping the generator better capture makeup details from the makeup reference.

3.4. Asymmetric Losses

Extreme makeup styles (face-paint-like makeup) remain challenging, for two reasons:

  1. For makeup transfer, extreme makeup styles contain high-frequency components, and the network must separate them from the face's other high-frequency components (e.g., eyelashes).
  2. For makeup removal, the network must hallucinate the facial skin color without makeup, which requires a degree of imagination.

Therefore, a high-order loss $L^{ho}$ is introduced on the makeup-transfer branch to improve the transfer of high-frequency details, and a smooth loss $L^{smooth}$ is introduced on the de-makeup branch to improve reconstruction of the face's original skin color.

High-Order Loss
Apply Laplacian filters to $p_k^W$ and $\tilde{p}_k^Y$, and define the high-order loss as
$$L^{ho}=\sum_{k}h_k\left\| f\left(p_k^W\right) - f\left(\tilde{p}_k^Y\right)\right\|_1 \qquad(5)$$
where $h_k$ is the weight assigned to each local patch and $f$ is the Laplacian filter.
All local patches share the same $h_k$, except that the eye regions receive slightly larger weights.

Smooth Loss
In contrast to makeup transfer, we do not want high-frequency details to appear in the de-makeup result $\tilde{x}_j$; in other words, we want the local patches of $\tilde{x}_j$ to be smooth. The smooth loss over all patches is defined as
$$L^{smooth}=\sum_{k}s_k\left\| f\left(\tilde{p}_k^X\right)\right\|_1 \qquad(6)$$
where $\tilde{p}_k^X$ is a local patch of $\tilde{x}_j$, $s_k$ is the weight assigned to each local patch, and $f$ is the Laplacian filter.
The eye regions receive smaller weights, since we do not want to lose their high-frequency components; the cheek and nose regions receive larger weights.

In summary, the high-order loss and the smooth loss are collectively called asymmetric losses, reflecting their opposite goals: the high-order loss encourages the network to transfer high-frequency components, while the smooth loss pursues smooth results and suppresses high-frequency components.
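Both asymmetric losses reduce to an L1 norm on Laplacian-filtered patches. A numpy sketch with a standard 3×3 Laplacian kernel (the kernel choice and the per-patch weights here are illustrative assumptions):

```python
import numpy as np

LAP = np.array([[0, 1, 0],
                [1, -4, 1],
                [0, 1, 0]], dtype=float)  # standard 3x3 Laplacian kernel

def laplacian(p):
    """Valid-mode 2-D convolution of a grayscale patch with LAP."""
    h, w = p.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * LAP)
    return out

def high_order_loss(warp_patches, gen_patches, weights):
    # Eq. (5): sum_k h_k * || f(p_k^W) - f(p~_k^Y) ||_1
    return sum(h * np.abs(laplacian(pw) - laplacian(pg)).sum()
               for h, pw, pg in zip(weights, warp_patches, gen_patches))

def smooth_loss(demakeup_patches, weights):
    # Eq. (6): sum_k s_k * || f(p~_k^X) ||_1
    return sum(s * np.abs(laplacian(p)).sum()
               for s, p in zip(weights, demakeup_patches))
```

A flat patch has zero Laplacian response, so the smooth loss drives the de-makeup patches toward uniform skin regions, while the high-order loss penalizes any mismatch in edge structure between the warped and generated makeup patches.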

3.5. Other Loss Functions

Reconstruction Loss
Following the cycle consistency of CycleGAN (see Fig. 2): instead of crossing the disentangled attribute and content as before, feed $(A_i, C_i)$ into $G_X$ to generate $\tilde{x}_i^{self}$ and $(A_j, C_j)$ into $G_Y$ to generate $\tilde{y}_j^{self}$; then $\tilde{x}_i^{self}\approx x_i,\ \tilde{y}_j^{self}\approx y_j$.

On the other hand, for the generated results $\tilde{x}_j, \tilde{y}_i$, extract attribute and content again and cross them to generate $\tilde{x}_i^{cross}, \tilde{y}_j^{cross}$, which should equal $x_i, y_j$; this is the cross-cycle reconstruction loss.
Q: does this loss come from DRIT?

The full reconstruction loss is defined as
$$\begin{aligned} L^{recon} = &\left\| x_i - \tilde{x}_i^{self}\right\|_1 + 8\left\| x_i - \tilde{x}_i^{cross}\right\|_1 + \\ &\left\| y_j - \tilde{y}_j^{self}\right\|_1 + 8\left\| y_j - \tilde{y}_j^{cross}\right\|_1 \end{aligned} \qquad(7)$$
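Eq. (7) is a weighted sum of L1 terms. A minimal sketch, keeping the factor 8 on the cross-cycle terms exactly as the equation states:

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).sum()

def recon_loss(x, x_self, x_cross, y, y_self, y_cross):
    """Eq. (7): self-reconstruction plus cross-cycle reconstruction,
    with the cross terms weighted by 8 as in the write-up."""
    return (l1(x, x_self) + 8 * l1(x, x_cross)
            + l1(y, y_self) + 8 * l1(y, y_cross))

x = np.zeros((2, 2)); y = np.ones((2, 2))
L = recon_loss(x, x, x + 0.1, y, y, y)   # only the x cross term deviates
```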

KL Loss
The extracted attribute codes $\{A_i, A_j\}$ are encouraged to approximate a Gaussian distribution, via a KL loss $L^{KL}=L_i^{KL} + L_j^{KL}$, where
$$\begin{aligned} &L_i^{KL} = \mathbb{E}\left[D_{KL}\left(A_i \parallel N(0,1)\right)\right] \\ &L_j^{KL} = \mathbb{E}\left[D_{KL}\left(A_j \parallel N(0,1)\right)\right] \\ &D_{KL}(p \parallel q)=\int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx \end{aligned} \qquad(8)$$
Q: why does the attribute need to approximate a Gaussian, while the content does not?
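If the attribute encoder outputs a per-dimension mean and log-variance (a common parameterization, as in VAEs; the write-up does not state the encoder's exact output form), the KL term in Eq. (8) has the closed form below:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over
    latent dimensions. Assumes a diagonal-Gaussian attribute encoder."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# A code matching the standard normal has zero KL to N(0, I).
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))  # 0.0
```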

Total Loss
$$\begin{aligned} L^{total}=&\lambda_{local}L^{local} + \lambda_{domain}^{adv}L_{domain}^{adv} + \lambda_{recon}L^{recon} + \\ &\lambda_{KL}L^{KL} + \lambda_{ho}L^{ho} + \lambda_{smooth}L^{smooth} \end{aligned} \qquad(9)$$
