Image Style Transfer Using Convolutional Neural Networks (CVPR 2016)

Abstract

Previous work was not very successful because it lacked representations of an image's semantic information that could be used to separate the image's content from its style.

1. Introduction

Transferring the style from one image onto another can be considered a problem of texture transfer.
Style transfer is essentially texture transfer, so the goal of this paper is to synthesize texture in the style of the source image while preserving the semantic content of the target image.

2. Deep image representations

The features extracted from an image are the feature maps of a normalized version of VGG19: for an $H\times W\times C$ feature map, the activations of the $C$ filters are normalized.

We normalized the network by scaling the weights such that the mean activation of each convolutional filter over images and positions is equal to one.
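The idea can be sketched in a toy numpy example (the linear "filters" and random data below are made up for illustration, not the paper's procedure): measure each filter's mean ReLU activation over inputs and positions, then divide that filter's weights by it.

```python
import numpy as np

# Toy sketch of the weight normalization (hypothetical setup): a "layer"
# is a linear map followed by ReLU; each filter's weights are divided by
# its mean activation over the data, so the mean activation becomes one.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))              # 4 hypothetical filters
X = rng.normal(size=(100, 8))            # 100 inputs (images x positions)

acts = np.maximum(X @ W.T, 0)            # ReLU activations, shape (100, 4)
mean_act = acts.mean(axis=0)             # mean activation per filter
W_norm = W / mean_act[:, None]           # rescale each filter's weights

acts_norm = np.maximum(X @ W_norm.T, 0)
print(acts_norm.mean(axis=0))            # each entry is now 1
```

In the real VGG19, because ReLU is positively homogeneous, such per-filter scaling can be compensated in the following layer's weights so the network's overall outputs are preserved.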

2.1. Content representation

Let $F^l\in\mathbb{R}^{N_l\times M_l}$ be the feature map at layer $l$ of VGG19 for an image, where $N_l$ corresponds to the channel dimension $C$ and $M_l$ corresponds to the spatial dimensions $H\times W$. In $F_{ij}^l$, the index $i$ runs over the channel dimension $C$ and the index $j$ runs over the spatial dimension $H\times W$.

Let $\vec x$ denote a noise image and $\vec p$ a content image, with extracted features $F^l$ and $P^l$. The objective is
$$\mathcal{L}_\text{content}\left( \vec p, \vec x, l \right)=\frac{1}{2}\sum_{i,j}\left( F_{ij}^l-P_{ij}^l \right)^2 \qquad(1)$$
Gradient descent is used to iteratively update $\vec x$ so that Equation (1) is minimized.
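Equation (1) can be sketched in a few lines of numpy (the shapes and random data are illustrative; F and P stand for the $(N_l, M_l)$ feature matrices):

```python
import numpy as np

# Sketch of the content loss in Eq. (1); F and P are feature maps
# already reshaped to (N_l, M_l) = (channels, H*W). Data is random
# for illustration only.
def content_loss(F, P):
    return 0.5 * np.sum((F - P) ** 2)

rng = np.random.default_rng(0)
P = rng.normal(size=(64, 32 * 32))          # content-image features at layer l
F = P + 0.1 * rng.normal(size=P.shape)      # features of the image being optimized
print(content_loss(F, P))                   # > 0; shrinks to 0 as F -> P
```

In practice $F^l$ is a function of $\vec x$ through the network, and the gradient $\partial\mathcal{L}_\text{content}/\partial F^l = F^l - P^l$ is backpropagated to update $\vec x$.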
The bottom row of Fig. 1 shows the final optimized $\vec x$ for different choices of $l$:

(a) conv_1_2
(b) conv_2_2
(c) conv_3_2
(d) conv_4_2
(e) conv_5_2

From left to right, the content of $\vec x$ becomes increasingly abstract. In particular, (d) and (e) are already high-level: only rough content information is preserved, and the individual pixel values are no longer guaranteed to match $\vec p$.

Thus, higher layers in the network capture the high-level content in terms of objects and their arrangement in the input image but do not constrain the exact pixel values of the reconstruction very much (Fig 1, content reconstructions d, e).

For (a)-(c), each pixel value of $\vec x$ can still match that of $\vec p$.

In contrast, reconstructions from the lower layers simply reproduce the exact pixel values of the original image (Fig 1, content reconstructions a–c).

Therefore, to describe the content information of the content image, this paper uses a high-level feature map as the content representation.

2.2. Style representation

To obtain a representation of the style of an input image, we use a feature space designed to capture texture information [10].
It consists of the correlations between the different filter responses, where the expectation is taken over the spatial extent of the feature maps.
These feature correlations are given by the Gram matrix $G^l\in\mathbb{R}^{N_l\times N_l}$, where $G_{ij}^l$ is the inner product between the feature maps $i$ and $j$ in layer $l$:

$$G_{ij}^l=\sum_{k}F_{ik}^lF_{jk}^l \qquad(3)$$

In other words, for an $H\times W\times C$ feature map, the Gram matrix represents its style information as a $C\times C$ matrix; the spatial information is discarded.

In short, the Gram matrix is computed by taking the $i$-th and $j$-th feature maps and computing their correlation to obtain a scalar; all $(i, j)$ combinations together form a $C\times C$ matrix.

Style information can be understood as the pairwise relationships between channels, so discarding the spatial information is reasonable.
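A minimal numpy sketch of the Gram computation (shapes are illustrative):

```python
import numpy as np

# Sketch of Eq. (3): flatten an H x W x C feature map to F of shape
# (C, H*W), then G = F F^T is C x C -- the spatial axes are summed out.
def gram_matrix(fmap_hwc):
    H, W, C = fmap_hwc.shape
    F = fmap_hwc.reshape(H * W, C).T     # (N_l, M_l) = (C, H*W)
    return F @ F.T                       # G[i, j] = sum_k F[i, k] * F[j, k]

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 8, 16))       # toy H x W x C feature map
G = gram_matrix(fmap)
print(G.shape)                           # (16, 16): C x C, spatial info gone
```

Note that $G$ is symmetric by construction, since $G_{ij}^l = G_{ji}^l$.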

$l$ takes multiple values, yielding a multi-scale representation of the style image.

Let $\vec x$ denote a noise image and $\vec a$ a style image. Extract the layer-$l$ VGG feature map of each, convert both to Gram matrices, denoted $G^l$ and $A^l$, and minimize the error between them:
$$E_l=\frac{1}{4N_l^2M_l^2}\sum_{i,j}\left( G_{ij}^l-A_{ij}^l \right)^2 \qquad(4)$$
Question: given $G^l, A^l\in\mathbb{R}^{N_l\times N_l}$, why does the denominator include an extra $M_l^2$? (Presumably because each Gram entry is a sum over $M_l$ spatial positions, so its magnitude scales with $M_l$; dividing by $M_l^2$ keeps $E_l$ comparable across layers.)

Equation (4) is the style loss for a single layer; the weighted sum over multiple layers gives the full style loss:
$$\mathcal{L}_\text{style}\left( \vec a, \vec x \right)=\sum_{l=0}^{L}w_lE_l \qquad(5)$$
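Equations (4) and (5) can be sketched as follows (the layer shapes, weights, and random Gram matrices below are made up for illustration):

```python
import numpy as np

# Sketch of Eqs. (4)-(5): per-layer style loss E_l between Gram matrices,
# then a weighted sum over layers.
def layer_style_loss(G, A, N_l, M_l):
    return np.sum((G - A) ** 2) / (4 * N_l**2 * M_l**2)

def style_loss(grams_x, grams_a, shapes, weights):
    return sum(w * layer_style_loss(G, A, N, M)
               for G, A, (N, M), w in zip(grams_x, grams_a, shapes, weights))

rng = np.random.default_rng(0)
shapes = [(16, 64), (32, 16)]                       # (N_l, M_l) per toy layer
grams_a = [rng.normal(size=(N, N)) for N, _ in shapes]
grams_x = [A + 0.1 for A in grams_a]                # slightly perturbed Grams
weights = [0.5, 0.5]                                # the w_l in Eq. (5)
print(style_loss(grams_x, grams_a, shapes, weights))
```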

The top row of Fig. 1 shows the final optimized $\vec x$ for different sets of layers:

(a) conv_1_1
(b) conv_1_1, conv_2_1
(c) conv_1_1, conv_2_1, conv_3_1
(d) conv_1_1, conv_2_1, conv_3_1, conv_4_1
(e) conv_1_1, conv_2_1, conv_3_1, conv_4_1, conv_5_1

Panel (e) shows that using multi-level feature maps fully captures the style information.

2.3. Style transfer

Given a style image $\vec a$ and a content image $\vec p$, start from a noise image $\vec x$ and minimize the following loss function:
$$\mathcal{L}_\text{total}\left( \vec p, \vec a, \vec x \right) = \alpha\mathcal{L}_\text{content}\left( \vec p, \vec x \right)+\beta\mathcal{L}_\text{style}\left( \vec a, \vec x \right) \qquad(7)$$
where $\vec a$, $\vec p$, and $\vec x$ all have the same size.
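As a toy end-to-end illustration of Eq. (7), the sketch below replaces VGG with an identity "network" (so the content features are the pixels themselves and the style feature is a 1x1 Gram); it only demonstrates the $\alpha$/$\beta$-weighted loss and plain gradient descent on $\vec x$, not a real transfer:

```python
import numpy as np

# Toy sketch of Eq. (7) with an identity "network": the content term is
# 0.5*||x - p||^2 and the style term is a 1x1 Gram loss (x.x - a.a)^2 / (4 M^2).
rng = np.random.default_rng(0)
p = rng.normal(size=(16,))       # "content image"
a = rng.normal(size=(16,))       # "style image"
x = rng.normal(size=(16,))       # noise initialization, updated by descent
alpha, beta, lr = 1.0, 1e-3, 0.1
M = x.size

for _ in range(200):
    grad_c = x - p                        # gradient of the content term
    g, A = x @ x, a @ a                   # 1x1 "Gram matrices"
    grad_s = (g - A) * x / M**2           # gradient of (g - A)^2 / (4 M^2)
    x = x - lr * (alpha * grad_c + beta * grad_s)

total = alpha * 0.5 * np.sum((x - p)**2) + beta * (x @ x - a @ a)**2 / (4 * M**2)
print(total)                              # close to 0 after optimization
```

The paper itself optimizes this objective with L-BFGS rather than plain gradient descent, but the structure of the update is the same: backpropagate both loss terms to the pixels of $\vec x$.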

3. Results

The layer used for content is conv_4_2; the layers used for style are conv_1_1, conv_2_1, conv_3_1, conv_4_1, and conv_5_1, each with weight $w_l=1/5$.

Question: the authors do not say how many iterations were run.

The final results are shown in Fig. 3: the content image is panel A, and five different style images produce the results B-F.
The hyperparameter ratio $\alpha/\beta$ differs slightly between results (presumably, for each style image the hyperparameters need some tuning before the generated result looks good):

B: 1e-3
C: 8e-4
D: 5e-3
E: 5e-4
F: 5e-4
3.1. Tradeoff between content and style matching

Of course, image content and style cannot be completely disentangled.
The authors use careful wording here: content and style cannot be completely disentangled.

When synthesizing an image that combines the content of one image with the style of another, there usually does not exist an image that perfectly matches both constraints at the same time.
The authors find that it is generally impossible for a generated image to perfectly reflect both the target content and the target style at the same time.

Fig. 4 shows the results for different values of $\alpha/\beta$: smaller values favor style, larger values favor content.

For a specific pair of content and style images one can adjust the trade-off between content and style to create visually appealing images.
With this method, $\alpha/\beta$ must be tuned carefully to generate visually appealing images.

3.2. Effect of different layers of the Convolutional Neural Network

The next question is which layer best represents the content information. Fig. 5 experiments with conv_2_2 and conv_4_2. The results show that conv_2_2 preserves the content details too faithfully, whereas conv_4_2 strikes the right balance.

When matching the content on a lower layer of the network, the algorithm matches much of the detailed pixel information in the photograph and the generated image appears as if the texture of the artwork is merely blended over the photograph

In contrast, when matching the content features on a higher layer of the network, detailed pixel information of the photograph is not as strongly constrained and the texture of the artwork and the content of the photograph are properly merged.

3.3. Initialisation of gradient descent

The iterative optimization does not have to start from a noise image; it can also start from the style image or the content image. The results show that all three initializations converge to similar final results.

3.4. Photorealistic style transfer

The style image can also be a real natural photograph. Fig. 7 gives an example: the color and lighting of the generated image closely resemble those of the style image.

4. Discussion

Next, the limitations of the proposed algorithm are discussed.

The larger the image, the longer the synthesis takes; for example, generating a 512x512 image takes nearly an hour on a K40 GPU.

Another issue is that synthesized images are sometimes subject to some low-level noise.
The authors do not give an example, so it is not entirely clear what this limitation refers to.

The separation of image content from style is not necessarily a well defined problem.
Honestly, style and content indeed cannot be fully separated at the conceptual level.

This paper proposes representing style information via correlations between channels, but real style is far more abstract than that.

This is mostly because it is not clear what exactly defines the style of an image. It might be the brush strokes in a painting, the color map, certain dominant forms and shapes, but also the composition of a scene and the choice of the subject of the image – and probably it is a mixture of all of them and many more.
