PSGAN: Pose and Expression Robust Spatial-Aware GAN for Customizable Makeup Transfer (CVPR 2020)

o0Helloworld0o

Article link: https://blog.csdn.net/o0Helloworld0o/article/details/105254362

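This post introduces PSGAN, a GAN-based makeup transfer model. Using MDNet, the AMM module, and MANet, it transfers makeup precisely while preserving the identity of the source image.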

3. PSGAN

3.1. Formulation

Let $X$ be the source image domain and $Y$ the reference image domain.

Domain $X$ contains $N$ samples, $\left \{ x^n \right \}_{n=1,\cdots,N}, x^n\in X$ ; domain $Y$ contains $M$ samples, $\left \{ y^m \right \}_{m=1,\cdots,M}, y^m\in Y$

The distribution over domain $X$ is $\mathcal{P}_X$ , and the distribution over domain $Y$ is $\mathcal{P}_Y$

The goal is to learn a transfer function $G:\left \{ x, y \right \}\rightarrow\tilde{x}$ such that $\tilde{x}$ carries the makeup style of $y$ and the identity of $x$

3.2. Framework

Overall

The PSGAN framework is shown in Fig. 2.

Makeup distill network (MDNet): extracts the makeup style from the reference image $y$ as two components $\gamma, \beta$ , called makeup matrices
Attentive makeup morphing module (AMM module): because the expression and pose of the source image $x$ and the reference image $y$ can differ greatly, the AMM module is proposed to morph the two makeup matrices $\gamma, \beta$ to two new matrices $\gamma', \beta'$ , which are adaptive to the source image by considering the similarities between pixels of the source and reference
Makeup apply network (MANet): applies $\gamma', \beta'$ to the bottleneck feature map of MANet

Makeup distill network (MDNet)

MDNet uses the encoder-bottleneck part of StarGAN (the bottleneck refers to the residual blocks). It extracts the makeup related features (lip gloss, eye shadow, and so on), which are represented as two makeup matrices $\gamma, \beta$

As shown in Fig. 2(B), the output of MDNet is a feature map $\mathbf{V}_\mathbf{y}\in\mathbb{R}^{C\times H\times W}$ , followed by two parallel 1x1 conv layers that produce $\gamma\in\mathbb{R}^{1\times H\times W}, \beta\in\mathbb{R}^{1\times H\times W}$
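To make the shapes concrete, here is a minimal NumPy sketch of the two parallel 1x1 conv heads. It is not the repository's code: the helper `conv1x1`, the random weights, and the toy sizes are assumptions for illustration. A 1x1 convolution with one output channel is just a per-pixel weighted sum over the $C$ channels.

```python
import numpy as np

def conv1x1(feature_map, weight, bias=0.0):
    """1x1 convolution with a single output channel.

    feature_map: (C, H, W), weight: (C,), bias: scalar.
    Returns an array of shape (1, H, W).
    """
    # Contract the channel axis: every pixel gets sum_c weight[c] * feature[c].
    out = np.tensordot(weight, feature_map, axes=([0], [0]))  # (H, W)
    return out[None] + bias                                   # (1, H, W)

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4                        # toy sizes (assumption)
V_y = rng.standard_normal((C, H, W))     # stand-in for the MDNet output
w_gamma = rng.standard_normal(C)         # hypothetical weights of head 1
w_beta = rng.standard_normal(C)          # hypothetical weights of head 2

gamma = conv1x1(V_y, w_gamma)
beta = conv1x1(V_y, w_beta)
print(gamma.shape, beta.shape)  # (1, 4, 4) (1, 4, 4)
```

Both heads read the same $\mathbf{V}_\mathbf{y}$ and differ only in their weights, so $\gamma$ and $\beta$ keep the spatial layout $H\times W$ of the reference feature map.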

Attentive makeup morphing module (AMM module)

Because the expression and pose of the source image $x$ and the reference image $y$ can differ greatly, $\gamma, \beta$ cannot be applied directly to the source image $x$
Q: Can we assume that $\gamma, \beta$ still contain information about the expression and pose of the reference image $y$ ?

The AMM module computes an attentive matrix $A\in\mathbb{R}^{HW\times HW}$ to specify how a pixel in the source image $x$ is morphed from the pixels in the reference image $y$ , where $A_{i,j}$ indicates the attentive value between the $i$ -th pixel $x_i$ in image $x$ and the $j$ -th pixel $y_j$ in image $y$
Interpretation: suppose position $i$ in $x$ is the corner of an eye and position $j$ in $y$ is also the corner of an eye. Then $A_{i,j}$ should be large, meaning that the pixel at position $i$ in $\tilde{x}$ should draw on the pixel at position $j$ in $y$ to transfer the eye shadow well
(One drawback: since $H$ and $W$ are multiplied together, some spatial information is lost to a certain extent)
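A rough sketch of how an $HW\times HW$ attentive matrix could morph a makeup matrix follows. It assumes the morphing is the weighted sum $\gamma'_i=\sum_j A_{i,j}\gamma_j$ over flattened pixels, which matches the description above but is not taken from the repository. The attention used here is a hand-built toy (the reference is treated as the source mirrored left to right), not a learned one, and `morph` is a hypothetical helper.

```python
import numpy as np

H, W = 2, 3
# Toy reference makeup matrix with distinct values so the morphing is visible.
gamma = np.arange(H * W, dtype=float).reshape(1, H, W)

# Hypothetical attention: source pixel (r, c) attends only to the mirrored
# reference pixel (r, W - 1 - c). Each row of A sums to 1.
A = np.zeros((H * W, H * W))
for r in range(H):
    for c in range(W):
        A[r * W + c, r * W + (W - 1 - c)] = 1.0

def morph(matrix, attention):
    """Morph a (1, H, W) makeup matrix with an (HW, HW) attention matrix."""
    _, h, w = matrix.shape
    flat = matrix.reshape(h * w)                 # row-major: index i = r * w + c
    return (attention @ flat).reshape(1, h, w)   # gamma'_i = sum_j A[i, j] * gamma_j

gamma_prime = morph(gamma, A)
print(gamma_prime[0])
```

In this toy case $\gamma'$ is simply $\gamma$ flipped horizontally. The flattening index $i=r\cdot W+c$ is invertible, so the pixel position itself stays recoverable after multiplying $H$ and $W$ together.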

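68 facial landmarks are introduced as anchor points
Take the nose-tip landmark as an example: for every position $i$ in $x$ , compute the signed offset (it can be positive or negative) from that position to the nose tip along the x and y axes, which gives a 2-dimensional vector. Doing this for all 68 landmarks gives a 136-dimensional vector $\mathbf{p}_i\in\mathbb{R}^{136}, i=1,\cdots,H\times W$ , called the relative position features
$\begin{aligned} \mathbf{p}_i=&[ f(x_i)-f(l_1), f(x_i)-f(l_2),\cdots,f(x_i)-f(l_{68}), \\ &g(x_i)-g(l_1), g(x_i)-g(l_2),\cdots,g(x_i)-g(l_{68}) ] \qquad(1) \end{aligned}$
where $f(\cdot)$ and $g(\cdot)$ indicate the coordinates on $x$ and $y$ axes, $l_k$ indicates the $k$ -th facial landmark
Thought: stacked over all pixels, $\mathbf{p}$ should have shape $H\times W\times136$

Because landmarks are used, face size inevitably differs between images, so $\mathbf{p}$ is normalized to unit length, i.e. $\frac{\mathbf{p}}{\left \| \mathbf{p} \right \|}$ (why not rescale the coordinates to $[0, 1]$ instead?)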
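Eq. (1) and the normalization can be sketched in NumPy as below. The function name `relative_position_features` and the random landmark coordinates are stand-ins chosen for illustration; a real pipeline would take the 68 points from a landmark detector.

```python
import numpy as np

def relative_position_features(height, width, landmarks):
    """Return unit-length relative position features of shape (H, W, 2K).

    landmarks: (K, 2) array of (x, y) pixel coordinates.
    Layout follows Eq. (1): the K x-offsets first, then the K y-offsets.
    """
    ys, xs = np.mgrid[0:height, 0:width]           # pixel coordinates, each (H, W)
    dx = xs[..., None] - landmarks[:, 0]           # f(x_i) - f(l_k), shape (H, W, K)
    dy = ys[..., None] - landmarks[:, 1]           # g(x_i) - g(l_k), shape (H, W, K)
    p = np.concatenate([dx, dy], axis=-1).astype(float)
    norm = np.linalg.norm(p, axis=-1, keepdims=True)
    return p / np.maximum(norm, 1e-12)             # guard against a zero norm

rng = np.random.default_rng(0)
landmarks = rng.uniform(0, 16, size=(68, 2))       # random stand-ins, not real landmarks
p = relative_position_features(16, 16, landmarks)
print(p.shape)  # (16, 16, 136)

# Same face at twice the size: pixel and landmark coordinates both double.
p_big = relative_position_features(32, 32, 2 * landmarks)
print(np.allclose(p[2, 3], p_big[4, 6]))  # True
```

The second check shows what the unit-length normalization buys in this toy setting: scaling the face by a constant factor scales every offset by the same factor, and dividing by the norm cancels it.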

Moreover, to avoid unreasonable sampling pixels with similar relative positions but different semantics, we also consider the visual similarities between pixels

Fig. 2(C) gives an example.

【Source code】
Labels provided by the face parser tool:
0: background, 1: face, 2: left-eyebrow, 3: right-eyebrow,
4: left-eye, 5: right-eye, 6: nose, 7: upper-lip, 8: teeth,
9: under-lip, 10: hair, 11: left-ear, 12: right-ear, 13: neck
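One plausible use of these labels is to turn a parsing map into region masks. The sketch below groups labels with the `LIP_CLASS: [7, 9]` and `FACE_CLASS: [1, 6]` values that appear in the config printed further down; the helper `region_mask` and the toy parsing map are made up for illustration and are not the repository's code.

```python
import numpy as np

# Class groups as listed under PREPROCESS in the printed config.
LIP_CLASS = [7, 9]    # upper-lip, under-lip
FACE_CLASS = [1, 6]   # face, nose

def region_mask(parsing, class_ids):
    """Return a boolean (H, W) mask of pixels whose label is in class_ids."""
    return np.isin(parsing, class_ids)

# Toy 3x4 parsing map with made-up values, standing in for a face parser output.
parsing = np.array([
    [0, 1, 1, 0],
    [1, 6, 6, 1],
    [1, 7, 9, 1],
])

lip_mask = region_mask(parsing, LIP_CLASS)
face_mask = region_mask(parsing, FACE_CLASS)
print(lip_mask.sum(), face_mask.sum())  # 2 8
```

Grouping by label keeps the lips and the skin separate, so each region can be handled on its own.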

Walking through the source code

Run the demo for inference

python demo.py
python demo.py --device cuda --speed	# use the GPU and measure inference time

【demo.py】
After args = parser.parse_args() runs, args contains:
args.config_file = 'configs/base.yaml'
args.device = 'cpu'
args.model_path = 'assets/models/G.pth'
args.opts = []
args.reference_dir = 'assets/images/makeup'
args.source_path = './assets/images/non-makeup/xfsy_0106.png'
args.speed = False
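For readers without the repository at hand, a parser that would yield the values above could look roughly like this. It is a reconstruction, not the actual demo.py: only `--device` and `--speed` are confirmed by the commands shown earlier, while the other flag names are guessed from the attribute names, and treating `opts` as a trailing list of config overrides is an assumption.

```python
import argparse

def build_parser():
    """Reconstructed sketch of the demo's argument parser (flag names partly assumed)."""
    parser = argparse.ArgumentParser(description="PSGAN demo (reconstructed sketch)")
    parser.add_argument("--config_file", default="configs/base.yaml")      # assumed flag name
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--model_path", default="assets/models/G.pth")     # assumed flag name
    parser.add_argument("--reference_dir", default="assets/images/makeup") # assumed flag name
    parser.add_argument("--source_path",                                   # assumed flag name
                        default="./assets/images/non-makeup/xfsy_0106.png")
    parser.add_argument("--speed", action="store_true")
    # Assumed: trailing tokens collected as a list of config overrides.
    parser.add_argument("opts", nargs=argparse.REMAINDER, default=[])
    return parser

args = build_parser().parse_args([])   # no command-line flags, as in `python demo.py`
print(vars(args))

gpu_args = build_parser().parse_args(["--device", "cuda", "--speed"])
print(gpu_args.device, gpu_args.speed)
```

With an empty argument list the defaults reproduce the dump above; passing `--device cuda --speed` flips the device and turns on the timing flag.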

After config = setup_config(args) runs, config is a fvcore.common.config.CfgNode; printing it gives:
DATA:
  BATCH_SIZE: 1
  IMG_SIZE: 256
  NUM_WORKERS: 4
  PATH: ./data
LOG:
  LOG_PATH: log/
  LOG_STEP: 8
  SNAPSHOT_PATH: snapshot/
  SNAPSHOT_STEP: 1024
  VIS_PATH: visulization/
  VIS_STEP: 2048
LOSS:
  LAMBDA_A: 10.0
  LAMBDA_B: 10.0
  LAMBDA_CLS: 1
  LAMBDA_EYE: 1
  LAMBDA_HIS: 1
  LAMBDA_HIS_EYE: 1
  LAMBDA_HIS_LIP: 1
  LAMBDA_HIS_SKIN: 0.1
  LAMBDA_IDT: 0.5
  LAMBDA_REC: 10
  LAMBDA_SKIN: 0.1
  LAMBDA_VGG: 0.005
MODEL:
  D_CONV_DIM: 64
  D_REPEAT_NUM: 3
  G_CONV_DIM: 64
  G_REPEAT_NUM: 6
  NORM: SN
  WEIGHTS: assets/models
POSTPROCESS:
  WILL_DENOISE: False
PREPROCESS:
  DOWN_RATIO: 0.23529411764705885
  FACE_CLASS: [1, 6]
  LANDMARK_POINTS: 68
  LIP_CLASS: [7, 9]
  UP_RATIO: 0.7058823529411765
  WIDTH_RATIO: 0.23529411764705885
TRAINING:
  BETA1: 0.5
  BETA2: 0.999
  C_DIM: 2
  D_LR: 0.0002
  G_LR: 0.0002
  G_STEP: 1
  NUM_EPOCHS: 50
  NUM_EPOCHS_DECAY: 0
The parameters above come from psgan/config.py, configs/base.yaml, and args
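The layering of these three sources can be pictured with plain dictionaries. The sketch below assumes the usual precedence (defaults in code, then the YAML file, then command-line overrides). That order is not confirmed by the post, and the helpers `merge` and `apply_opts` as well as the override values are hypothetical. fvcore's CfgNode builds on the yacs CfgNode, which provides `merge_from_file` and `merge_from_list` for this kind of layering; whether `setup_config` calls them is not shown here.

```python
import copy

def merge(base, override):
    """Recursively overlay `override` onto a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_opts(config, opts):
    """Apply a flat ['SECTION.KEY', value, ...] list of overrides."""
    config = copy.deepcopy(config)
    for dotted, value in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted.split(".")
        node = config
        for name in parents:
            node = node[name]
        node[leaf] = value
    return config

# A few defaults copied from the printed config.
defaults = {"DATA": {"BATCH_SIZE": 1, "IMG_SIZE": 256}, "TRAINING": {"NUM_EPOCHS": 50}}
yaml_values = {"DATA": {"IMG_SIZE": 128}}     # hypothetical file contents
cli_opts = ["TRAINING.NUM_EPOCHS", 10]        # hypothetical command-line overrides

config = apply_opts(merge(defaults, yaml_values), cli_opts)
print(config)
```

Later layers only touch the keys they name, so untouched entries such as `BATCH_SIZE` keep their default values.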

【psgan/inference.py】
source_input, face, crop_face = self.preprocess(source)