3. PSGAN
3.1. Formulation
Let $X$ denote the source image domain and $Y$ the reference image domain.
Domain $X$ contains $N$ samples, $\{x^n\}_{n=1,\cdots,N},\ x^n\in X$; domain $Y$ contains $M$ samples, $\{y^m\}_{m=1,\cdots,M},\ y^m\in Y$.
The distribution over domain $X$ is $\mathcal{P}_X$, and the distribution over domain $Y$ is $\mathcal{P}_Y$.
The goal is to learn a transfer function $G:\{x, y\}\rightarrow\tilde{x}$ such that $\tilde{x}$ carries the makeup style of $y$ while preserving the identity of $x$.
3.2. Framework
Overall
The PSGAN framework is shown in Fig. 2.
- Makeup distill network (MDNet): extracts the makeup style from the reference image $y$ as two components $\gamma, \beta$, called makeup matrices.
- Attentive makeup morphing module (AMM module): because the expression and pose of the source image $x$ and the reference image $y$ can differ greatly, the AMM module morphs the two makeup matrices $\gamma, \beta$ into two new matrices $\gamma', \beta'$, which are adaptive to the source image by considering the similarities between pixels of the source and reference.
- Makeup apply network (MANet): applies $\gamma', \beta'$ to the bottleneck feature map of MANet.
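The last step can be sketched roughly as follows, assuming the morphed matrices act as a per-pixel scale and shift that is broadcast across channels, i.e. $\mathbf{V}_x' = \gamma'\,\mathbf{V}_x + \beta'$. The function name `apply_makeup` and the toy shapes are illustrative assumptions, and NumPy stands in for the actual PyTorch code.

```python
import numpy as np

def apply_makeup(v_x, gamma_m, beta_m):
    """Scale and shift a (C, H, W) feature map with (1, H, W) makeup matrices."""
    # Broadcasting repeats gamma_m and beta_m along the channel axis.
    return gamma_m * v_x + beta_m

C, H, W = 4, 3, 3
v_x = np.ones((C, H, W))            # stand-in bottleneck feature map of the source
gamma_m = np.full((1, H, W), 2.0)   # stand-in for the morphed gamma'
beta_m = np.full((1, H, W), 0.5)    # stand-in for the morphed beta'
v_out = apply_makeup(v_x, gamma_m, beta_m)
print(v_out.shape)  # (4, 3, 3)
```

Every channel at a given pixel receives the same scale and shift, so the makeup matrices only need a single channel.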
Makeup distill network (MDNet)
MDNet uses the encoder-bottleneck part of the StarGAN architecture (bottleneck here means the residual blocks) and extracts the makeup-related features (lip gloss, eye shadow, etc.), which are represented as two makeup matrices $\gamma, \beta$.
As shown in Fig. 2(B), MDNet outputs a feature map $\mathbf{V}_\mathbf{y}\in\mathbb{R}^{C\times H\times W}$, which is fed to two parallel 1x1 conv layers to produce $\gamma\in\mathbb{R}^{1\times H\times W}$ and $\beta\in\mathbb{R}^{1\times H\times W}$.
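To make the shapes concrete, here is a minimal NumPy sketch of the two parallel 1x1 convolutions. A 1x1 convolution is a linear map over channels applied independently at every pixel; the helper `conv1x1`, the random weights, and the toy sizes are assumptions for illustration, not the repository's layers.

```python
import numpy as np

def conv1x1(feature_map, weight, bias):
    """1x1 convolution: a linear map over channels applied at every pixel.

    feature_map: (C, H, W), weight: (C_out, C), bias: (C_out,)
    """
    out = np.einsum("oc,chw->ohw", weight, feature_map)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
v_y = rng.standard_normal((C, H, W))         # stand-in for the MDNet output V_y
w_gamma, b_gamma = rng.standard_normal((1, C)), np.zeros(1)
w_beta, b_beta = rng.standard_normal((1, C)), np.zeros(1)

gamma = conv1x1(v_y, w_gamma, b_gamma)       # shape (1, H, W)
beta = conv1x1(v_y, w_beta, b_beta)          # shape (1, H, W)
print(gamma.shape, beta.shape)
```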
Attentive makeup morphing module (AMM module)
Because the expression and pose of the source image $x$ and the reference image $y$ can differ greatly, $\gamma, \beta$ cannot be applied directly to the source image $x$.
Q: Does this mean that $\gamma, \beta$ still contain information such as the expression and pose of the reference image $y$?
The AMM module computes an attentive matrix $A\in\mathbb{R}^{HW\times HW}$ to specify how a pixel in the source image $x$ is morphed from the pixels in the reference image $y$, where $A_{i,j}$ indicates the attentive value between the $i$-th pixel $x_i$ in image $x$ and the $j$-th pixel $y_j$ in image $y$.
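How $A$ is used can be sketched as follows, assuming each row of $A$ sums to 1 so that the morphed value at source pixel $i$ is a weighted average of the reference values, $\gamma'_i=\sum_j A_{i,j}\gamma_j$. The name `morph` and the toy permutation matrix are illustrative assumptions.

```python
import numpy as np

def morph(A, m):
    """Morph a (1, H, W) makeup matrix with an (HW, HW) attentive matrix."""
    _, H, W = m.shape
    flat = m.reshape(H * W)               # reference values, indexed by j
    return (A @ flat).reshape(1, H, W)    # source positions, indexed by i

H, W = 2, 2
gamma = np.array([[[1.0, 2.0], [3.0, 4.0]]])
# Toy attentive matrix: source pixel i copies reference pixel (HW - 1 - i).
A = np.eye(H * W)[::-1]
gamma_m = morph(A, gamma)
print(gamma_m)
```

With a one-hot row, the source pixel simply copies one reference pixel; a softer row blends several reference pixels.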
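Interpretation: suppose position $i$ in $x$ is the corner of an eye and position $j$ in $y$ is also the corner of an eye. Then $A_{i,j}$ should be large, meaning that the pixel at position $i$ in $\tilde{x}$ should draw on the pixel at position $j$ in $y$ in order to transfer the eye shadow well.
(One drawback: since $H$ and $W$ are multiplied together, some spatial information is lost.)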
68 facial landmarks are introduced as anchor points.
Taking the nose-tip landmark as an example: for every position $i$ in $x$, compute the signed offset from position $i$ to the nose tip along the x and y axes, which gives a 2-dimensional vector. Doing this for all 68 landmarks yields a 136-dimensional vector $\mathbf{p}_i\in\mathbb{R}^{136},\ i=1,\cdots,H\times W$, called the relative position features.
$$
\begin{aligned}
\mathbf{p}=&[ f(x_i)-f(l_1), f(x_i)-f(l_2),\cdots,f(x_i)-f(l_{68}), \\
&g(x_i)-g(l_1), g(x_i)-g(l_2),\cdots,g(x_i)-g(l_{68}) ] \qquad(1)
\end{aligned}
$$
where $f(\cdot)$ and $g(\cdot)$ indicate the coordinates on $x$ and $y$ axes, $l_i$ indicates the $i$-th facial landmark
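Eq. (1) can be sketched in NumPy as below. The pixel ordering (row-major, $i = \text{row}\cdot W + \text{col}$), the function name `relative_position_features`, and the random landmark coordinates are assumptions for illustration; real landmarks would come from a face landmark detector.

```python
import numpy as np

def relative_position_features(H, W, landmarks):
    """Return p with shape (H*W, 2*K) for K landmarks given as (x, y) pairs."""
    ys, xs = np.mgrid[0:H, 0:W]
    xs = xs.reshape(-1, 1).astype(float)    # f(x_i): x coordinate of pixel i
    ys = ys.reshape(-1, 1).astype(float)    # g(x_i): y coordinate of pixel i
    lx = landmarks[:, 0][None, :]           # f(l_k) for every landmark k
    ly = landmarks[:, 1][None, :]           # g(l_k) for every landmark k
    # All x offsets first, then all y offsets, as in Eq. (1).
    return np.concatenate([xs - lx, ys - ly], axis=1)

rng = np.random.default_rng(0)
H, W = 8, 8
landmarks = rng.uniform(0, 8, size=(68, 2))  # random stand-ins for detected landmarks
p = relative_position_features(H, W, landmarks)
print(p.shape)  # (64, 136)
```

Each row of `p` is the 136-dimensional vector of one pixel, so the stacked features for the whole image have $H\times W$ rows.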
Thought: shouldn't the dimension of $\mathbf{p}$ be $H\times W\times136$?
Because these features are built from landmarks, face sizes will inevitably differ between images, so $\mathbf{p}$ is normalized to unit length, i.e. $\frac{\mathbf{p}}{\left\|\mathbf{p}\right\|}$. (Why not rescale the coordinates to $[0, 1]$ instead?)
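A small check of why unit normalization removes the dependence on face size, assuming the normalization is applied per pixel to each 136-dimensional vector. The helper `normalize_rows` and the random features are illustrative assumptions.

```python
import numpy as np

def normalize_rows(p, eps=1e-12):
    """Divide each row p_i by its Euclidean norm."""
    norms = np.linalg.norm(p, axis=1, keepdims=True)
    return p / np.maximum(norms, eps)

rng = np.random.default_rng(1)
p = rng.standard_normal((16, 136))   # stand-in relative position features
p_big = 2.5 * p                      # same face, 2.5x larger in the image

same = np.allclose(normalize_rows(p), normalize_rows(p_big))
print(same)  # True
```

Scaling a face scales every offset by the same factor, and that factor cancels when dividing by the norm.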
Moreover, to avoid unreasonable sampling pixels with similar relative positions but different semantics, we also consider the visual similarities between pixels
Fig. 2(c) gives an example.
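One way position and visual cues might be combined is sketched below, assuming the attentive value is a row-wise softmax over the dot product of concatenated vectors $[w\,\mathbf{v}, \mathbf{p}]$, where $\mathbf{v}$ are per-pixel visual features and $w$ is a weighting scalar. The names `attentive_matrix`, `v_x`, `v_y`, the value of `w`, and the random inputs are assumptions, not the repository's code.

```python
import numpy as np

def attentive_matrix(v_x, p_x, v_y, p_y, w=0.01):
    """Row-wise softmax over similarities of concatenated features.

    v_*: (HW, C) visual features, p_*: (HW, 136) relative position features.
    """
    q = np.concatenate([w * v_x, p_x], axis=1)    # source pixels, index i
    k = np.concatenate([w * v_y, p_y], axis=1)    # reference pixels, index j
    logits = q @ k.T                              # (HW, HW) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
HW, C = 16, 8
v_x, v_y = rng.standard_normal((HW, C)), rng.standard_normal((HW, C))
p_x, p_y = rng.standard_normal((HW, 136)), rng.standard_normal((HW, 136))
A = attentive_matrix(v_x, p_x, v_y, p_y)
print(A.shape)  # (16, 16)
```

Each row of `A` is a distribution over reference pixels, so two pixels with similar relative positions but dissimilar visual features receive a lower weight than they would from position alone.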
【Source code】
Labels provided by the face parser tool
0: background, 1: face, 2: left-eyebrow, 3: right-eyebrow,
4: left-eye, 5: right-eye, 6: nose, 7: upper-lip, 8: teeth,
9: under-lip, 10: hair, 11: left-ear, 12: right-ear, 13: neck
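The label ids can be turned into region masks as in this sketch. The class lists `[7, 9]` and `[1, 6]` match the `LIP_CLASS` and `FACE_CLASS` values printed in the config later in these notes; the helper `region_mask` and the toy label map are assumptions for illustration.

```python
import numpy as np

# Label ids as listed above.
FACE_PARSER_LABELS = {
    0: "background", 1: "face", 2: "left-eyebrow", 3: "right-eyebrow",
    4: "left-eye", 5: "right-eye", 6: "nose", 7: "upper-lip", 8: "teeth",
    9: "under-lip", 10: "hair", 11: "left-ear", 12: "right-ear", 13: "neck",
}
LIP_CLASS = [7, 9]    # upper-lip, under-lip
FACE_CLASS = [1, 6]   # face, nose

def region_mask(label_map, class_ids):
    """Boolean mask of pixels whose label is one of class_ids."""
    return np.isin(label_map, class_ids)

label_map = np.array([[0, 1, 1],
                      [6, 7, 9],
                      [10, 8, 13]])   # toy 3x3 parsing result
lip_mask = region_mask(label_map, LIP_CLASS)
face_mask = region_mask(label_map, FACE_CLASS)
print(lip_mask.sum(), face_mask.sum())
```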
Walking through the source code
Run the demo for inference
python demo.py
python demo.py --device cuda --speed # use the GPU and measure inference time
【demo.py】
After args = parser.parse_args() runs, args contains:
args.config_file = 'configs/base.yaml'
args.device = 'cpu'
args.model_path = 'assets/models/G.pth'
args.opts = []
args.reference_dir = 'assets/images/makeup'
args.source_path = './assets/images/non-makeup/xfsy_0106.png'
args.speed = False
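A parser that produces a namespace like the one above might look like the following sketch. Only `--device` and `--speed` appear in the commands shown earlier; the other flag names and the trailing `opts` positional are assumptions inferred from the attribute names, not a copy of demo.py.

```python
import argparse

def build_parser():
    """Hypothetical reconstruction of the demo's command-line interface."""
    parser = argparse.ArgumentParser(description="PSGAN demo (sketch)")
    parser.add_argument("--config_file", default="configs/base.yaml")
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--model_path", default="assets/models/G.pth")
    parser.add_argument("--reference_dir", default="assets/images/makeup")
    parser.add_argument("--source_path",
                        default="./assets/images/non-makeup/xfsy_0106.png")
    parser.add_argument("--speed", action="store_true")
    # Free-form config overrides; assumed to be given as KEY VALUE pairs.
    parser.add_argument("opts", nargs="*", default=[])
    return parser

args = build_parser().parse_args([])   # same as running `python demo.py`
gpu_args = build_parser().parse_args(["--device", "cuda", "--speed"])
print(args.device, args.speed, gpu_args.device, gpu_args.speed)
```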
After config = setup_config(args) runs, config is a fvcore.common.config.CfgNode; printing it gives:
DATA:
  BATCH_SIZE: 1
  IMG_SIZE: 256
  NUM_WORKERS: 4
  PATH: ./data
LOG:
  LOG_PATH: log/
  LOG_STEP: 8
  SNAPSHOT_PATH: snapshot/
  SNAPSHOT_STEP: 1024
  VIS_PATH: visulization/
  VIS_STEP: 2048
LOSS:
  LAMBDA_A: 10.0
  LAMBDA_B: 10.0
  LAMBDA_CLS: 1
  LAMBDA_EYE: 1
  LAMBDA_HIS: 1
  LAMBDA_HIS_EYE: 1
  LAMBDA_HIS_LIP: 1
  LAMBDA_HIS_SKIN: 0.1
  LAMBDA_IDT: 0.5
  LAMBDA_REC: 10
  LAMBDA_SKIN: 0.1
  LAMBDA_VGG: 0.005
MODEL:
  D_CONV_DIM: 64
  D_REPEAT_NUM: 3
  G_CONV_DIM: 64
  G_REPEAT_NUM: 6
  NORM: SN
  WEIGHTS: assets/models
POSTPROCESS:
  WILL_DENOISE: False
PREPROCESS:
  DOWN_RATIO: 0.23529411764705885
  FACE_CLASS: [1, 6]
  LANDMARK_POINTS: 68
  LIP_CLASS: [7, 9]
  UP_RATIO: 0.7058823529411765
  WIDTH_RATIO: 0.23529411764705885
TRAINING:
  BETA1: 0.5
  BETA2: 0.999
  C_DIM: 2
  D_LR: 0.0002
  G_LR: 0.0002
  G_STEP: 1
  NUM_EPOCHS: 50
  NUM_EPOCHS_DECAY: 0
The parameters above come from psgan/config.py, configs/base.yaml, and args.
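The layering of these three sources can be illustrated with plain dictionaries. This is only a sketch, assuming the common order of code defaults first, then the YAML file, then command-line `opts` given as KEY VALUE pairs; the helpers `merge` and `apply_opts` and the override values are hypothetical and do not reproduce `setup_config`.

```python
import copy

def merge(base, override):
    """Recursively overlay `override` onto a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_opts(config, opts):
    """Apply a flat ['A.B', value, ...] list of dotted-key overrides."""
    result = copy.deepcopy(config)
    for dotted_key, value in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node[name]
        node[leaf] = value
    return result

defaults = {"DATA": {"BATCH_SIZE": 1, "IMG_SIZE": 256},
            "TRAINING": {"G_LR": 0.0002}}
from_yaml = {"TRAINING": {"G_LR": 0.0001}}   # hypothetical file contents
config = apply_opts(merge(defaults, from_yaml), ["DATA.IMG_SIZE", 512])
print(config)
```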
【psgan/inference.py】
source_input, face, crop_face = self.preprocess(source)