
# StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery (Full Translation)

## Abstract

Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.


 

## Introduction

Generative Adversarial Networks (GANs) [goodfellow2014generative] have revolutionized image synthesis, with recent style-based generative models [karras2019style, karras2020analyzing, karras2020ada] boasting some of the most realistic synthetic imagery to date. Furthermore, the learnt intermediate latent spaces of StyleGAN have been shown to possess disentanglement properties [collins2020editing, shen2020interpreting, harkonen2020ganspace, tewari2020stylerig, wu2020stylespace], which enable utilizing pretrained models to perform a wide variety of image manipulations on synthetic, as well as real, images.

Harnessing StyleGAN's expressive power requires developing simple and intuitive interfaces for users to easily carry out their intent. Existing methods for semantic control discovery either involve manual examination (e.g., [harkonen2020ganspace, shen2020interpreting, wu2020stylespace]), a large amount of annotated data, or pretrained classifiers [shen2020interfacegan, abdal2020styleflow]. Furthermore, subsequent manipulations are typically carried out by moving along a direction in one of the latent spaces, using a parametric model, such as a 3DMM in StyleRig [tewari2020stylerig], or a trained normalized flow in StyleFlow [abdal2020styleflow]. Specific edits, such as virtual try-on [lewis2021vogue] and aging [alaluf2021matter], have also been explored. Thus, existing controls enable image manipulations only along preset semantic directions, severely limiting the user's creativity and imagination. Whenever an additional, unmapped, direction is desired, further manual effort and/or large quantities of annotated data are necessary.

In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to enable intuitive text-based semantic image manipulation that is neither limited to preset manipulation directions, nor requires additional manual effort to discover new controls. The CLIP model is pretrained on 400 million image-text pairs harvested from the Web, and since natural language is able to express a much wider set of visual concepts, combining CLIP with the generative power of StyleGAN opens fascinating avenues for image manipulation. The teaser figure shows several examples of unique manipulations produced using our approach. Specifically, in this paper we investigate three techniques that combine CLIP with StyleGAN:

1. Text-guided latent optimization, where a CLIP model is used as a loss network [johnson2016perceptual]. This is the most versatile approach, but it requires several minutes of optimization to apply a manipulation to an image (a minimal code sketch appears after this list).
2. A latent residual mapper, trained for a specific text prompt. Given a starting point in latent space (the input image to be manipulated), the mapper yields a local step in latent space, allowing faster and more stable manipulation.
3. A method for mapping a text prompt into an input-agnostic (global) direction in StyleGAN's style space, providing control over the manipulation strength as well as the degree of disentanglement.

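As an illustration of the first technique, the following is a minimal sketch of CLIP-guided latent optimization. It assumes a pretrained StyleGAN generator `G` that maps a latent code to an image, a source latent `w_source` obtained for the input image (e.g., via GAN inversion), and the publicly released CLIP package; the prompt, learning rate, step count, and loss weight are illustrative placeholders, and the full method described in the paper adds further regularization (e.g., an identity-preservation term for face images).

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained CLIP encoders; ViT-B/32 is one of the publicly released variants.
clip_model, _ = clip.load("ViT-B/32", device=device)
text_tokens = clip.tokenize(["a face with purple hair"]).to(device)  # illustrative prompt
with torch.no_grad():
    text_features = clip_model.encode_text(text_tokens)

# Assumed to exist: G (StyleGAN generator, latent -> image) and w_source
# (latent code of the image to manipulate, e.g., obtained by GAN inversion).
w = w_source.clone().detach().requires_grad_(True)
optimizer = torch.optim.Adam([w], lr=0.1)

for step in range(200):
    image = G(w)                            # synthesize an image from the current latent
    image = F.interpolate(image, size=224)  # CLIP's visual encoder expects 224x224 inputs
    image_features = clip_model.encode_image(image)

    # CLIP loss: pull the image embedding towards the text embedding.
    clip_loss = 1.0 - F.cosine_similarity(image_features, text_features).mean()
    # L2 regularizer: keep the edited latent close to the source latent.
    l2_loss = ((w - w_source) ** 2).sum()
    loss = clip_loss + 0.008 * l2_loss      # illustrative weighting

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

edited_image = G(w).detach()
```
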
The results in this paper and the supplementary material demonstrate a wide range of semantic manipulations on images of human faces, animals, cars, and churches. These manipulations range from abstract to specific, and from extensive to fine-grained. Many of them have not been demonstrated by any of the previous StyleGAN manipulation works, and all of them were easily obtained using a combination of pretrained StyleGAN and CLIP models.

## Related Work

### Vision and Language

#### Joint representations

Multiple works learn cross-modal Vision and Language (VL) representations [Desai2020VirTexLV, sariyildiz2020learning, Tan2019LXMERTLC, Lu2019ViLBERTPT, Li2019VisualBERTAS, Su2020VL-BERT, Li2020UnicoderVLAU, Chen2020UNITERUI, Li2020OscarOA] for a variety of tasks, such as language-based image retrieval, image captioning, and visual question answering. Following the success of BERT [Devlin2019BERTPO] in various language tasks, recent VL methods typically use Transformers [NIPS2017_3f5ee243] to learn the joint representations. A recent model, based on Contrastive Language-Image Pre-training (CLIP) [radford2021learning], learns a multi-modal embedding space, which may be used to estimate the semantic similarity between a given text and an image. CLIP was trained on 400 million text-image pairs, collected from a variety of publicly available sources on the Internet. The representations learned by CLIP have been shown to be extremely powerful, enabling state-of-the-art zero-shot image classification on a variety of datasets. We refer the reader to OpenAI's Distill article [distill2021multimodal] for an extensive exposition and discussion of the visual concepts learned by CLIP.

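To make the role of CLIP's joint embedding space concrete, here is a minimal sketch of text-image similarity scoring (zero-shot classification) with the publicly released CLIP package; the image path and prompts are illustrative placeholders rather than anything used in the paper.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and a few candidate text descriptions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)  # scaled cosine similarities between embeddings
    probs = logits_per_image.softmax(dim=-1)   # interpret as zero-shot class probabilities

print(probs)  # e.g., tensor([[0.97, 0.03]]) if the image depicts a cat
```
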
#### Text-guided image generation and manipulation

The pioneering work of Reed et al. [Reed2016GenerativeAT] approached text-guided image generation by training a conditional GAN [mirza2014conditional], conditioned on text embeddings obtained from a pretrained encoder. Zhang et al. [zhang2017stackgan, Zhang2019StackGANRI] improved image quality by using multi-scale GANs. AttnGAN [Xu2018AttnGANFT] incorporated an attention mechanism between the text and image features. Additional supervision was used in other works [Reed2016GenerativeAT, Li2019ObjectDrivenTS, Koh2020TexttoImageGG] to further improve the image quality.

A few studies focus on text-guided image manipulation. Some methods [Dong2017SemanticIS, Nam2018TextAdaptiveGA, Liu2020DescribeWT] use a GAN-based encoder-decoder architecture to disentangle the semantics of both input images and text descriptions. ManiGAN [li2020manigan] introduces a novel text-image combination module, which produces high-quality images. Differently from the aforementioned works, we propose a single framework that combines the high-quality images generated by StyleGAN with the rich multi-domain semantics learned by CLIP.

Recently, DALL·E [unpublished2021dalle, ramesh2021zeroshot], a 12-billion parameter version of GPT-3 [Brown2020LanguageMA], which at 16-bit precision requires over 24 GB of GPU memory (roughly 2 bytes per parameter for the weights alone), has shown a diverse set of capabilities in generating and applying transformations to images guided by text. In contrast, our approach is deployable even on a single commodity GPU.

A concurrent work to ours, TediGAN [xia2020tedigan], also uses StyleGAN for text-guided image generation and manipulation. By training an encoder to map text into the StyleGAN latent space, one can generate an image corresponding to a given text. To perform text-guided image manipulation, TediGAN encodes both the image and the text into the latent space, and then performs style-mixing to generate a corresponding image. In Section 7 we demonstrate that the manipulations achieved using our approach better reflect the semantics of the driving text.

In a recent online post, Perez [perez2021imagesfromprompts] describes a text-to-image approach that combines StyleGAN and CLIP in a manner similar to our latent optimizer in Section 4. Rather than synthesizing an image from scratch, our optimization scheme, as well as the other two approaches described in this work, focuses on image manipulation. While text-to-image generation is an intriguing and challenging problem, we believe that the image manipulation abilities we provide constitute a more useful tool for the typical workflow of creative artists.

Full translation link: https://aiqianji.com/blog/article/85

Translation collection: https://aiqianji.com/blog/articles
