7.7 - Taming Transformers for High-Resolution Image Synthesis

Entropy Coding series, table of contents:

Chapter 1: VQGAN

Chapter 2: McQuic


Note: this part covers VQGAN.


Summary

Concept supplement:

Transformers vs. CNNs: the key design difference is whether an inductive bias is built in. An inductive bias is a set of assumptions a learning algorithm makes about the target function; it can also be understood as guiding rules imposed on the model.

Concretely, an inductive bias is the preference or tendency a learning algorithm shows toward certain solutions, and it shapes what the algorithm learns. When an algorithm induces regularities from data, it relies on some prior assumption, preference, or constraint; that is its inductive bias. For example, if a classifier assumes that certain input features are highly correlated with the output label, it will pay more attention to those features during training and predict better on them. Such preferences help the algorithm reason and generalize on new data by favoring solutions with particular properties.

In machine learning, many algorithms make necessary assumptions about the target function of the learning problem; these assumptions are the inductive bias. Induction, a standard method in the natural sciences, means finding commonalities across examples and generalizing them into a broader rule; bias refers to the model's preference. An inductive bias can therefore be understood as rules induced from observed phenomena that then constrain the model, effectively acting as a form of model selection.

The purpose of an inductive bias is to give the learner the ability to generalize: with a suitable bias, general rules can be inferred from limited data and applied to new data, which lets the algorithm handle unseen inputs and make better predictions and decisions.

In short, inductive bias matters in machine learning. It helps an algorithm reason and generalize on new data by preferring solutions with certain properties, and a well-chosen inductive bias can improve a learning algorithm's performance.

nn.Parameter()

Parameters are a subclass of Tensor with one very special property when used with Modules: when assigned as a Module attribute, a Parameter is automatically added to the module's parameter list and will therefore appear, for example, in the parameters() iterator. Assigning a plain tensor has no such effect. This is deliberate, because one may want to cache some temporary state in a model, such as the last hidden state of an RNN; without a class like Parameter, these temporaries would also get registered.

In plain terms

torch.nn.Parameter() turns a non-trainable tensor into a trainable parameter and binds that parameter to the module, so the tensor is a learnable parameter from the moment the network is defined. The point of using it is to let certain variables keep being updated during training until they reach an optimal value.
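A minimal sketch of the difference between a registered Parameter and a plain tensor attribute (the module name Scaler and its fields are made up for this example):

import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        # assigned as nn.Parameter: automatically registered, shows up in parameters()
        self.scale = nn.Parameter(torch.ones(1))
        # plain tensor attribute: not registered, an optimizer will ignore it
        self.cache = torch.zeros(1)

    def forward(self, x):
        return x * self.scale

m = Scaler()
print([name for name, _ in m.named_parameters()])  # prints ['scale'] only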

Abstract

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://git.io/JnyvK.

Introduction

Transformers are on the rise—they are now the de-facto standard architecture for language tasks and are increasingly adapted in other areas such as audio and vision. In contrast to the predominant vision architecture, convolutional neural networks (CNNs), the transformer architecture contains no built-in inductive prior on the locality of interactions and is therefore free to learn complex relationships among its inputs. However, this generality also implies that it has to learn all relationships, whereas CNNs have been designed to exploit prior knowledge about strong local correlations within images. Thus, the increased expressivity of transformers comes with quadratically increasing computational costs, because all pairwise interactions are taken into account. The resulting energy and time requirements of state-of-the-art transformer models thus pose fundamental problems for scaling them to high-resolution images with millions of pixels.

Observations that transformers tend to learn convolutional structures [16] thus beg the question: Do we have to re-learn everything we know about the local structure and regularity of images from scratch each time we train a vision model, or can we efficiently encode inductive image biases while still retaining the flexibility of transformers? We hypothesize that low-level image structure is well described by a local connectivity, i.e. a convolutional architecture, whereas this structural assumption ceases to be effective on higher semantic levels. Moreover, CNNs not only exhibit a strong locality bias, but also a bias towards spatial invariance through the use of shared weights across all positions. This makes them ineffective if a more holistic understanding of the input is required.

Our key insight to obtain an effective and expressive model is that, taken together, convolutional and transformer architectures can model the compositional nature of our visual world [51]: We use a convolutional approach to efficiently learn a codebook of context-rich visual parts and, subsequently, learn a model of their global compositions. The long-range interactions within these compositions require an expressive transformer architecture to model distributions over their constituent visual parts. Furthermore, we utilize an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure to alleviate the need for modeling low-level statistics with the transformer architecture. Allowing transformers to concentrate on their unique strength — modeling long-range relations — enables them to generate high-resolution images as in Fig. 1, a feat which previously has been out of reach. Our formulation gives control over the generated images by means of conditioning information regarding desired object classes or spatial layouts. Finally, experiments demonstrate that our approach retains the advantages of transformers by outperforming previous codebook-based state-of-the-art approaches based on convolutional architectures.
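To make the two-stage idea concrete, here is a hedged sketch of the second stage: a transformer that autoregressively predicts the next codebook index given the previous ones. The class name, layer counts, and sizes are illustrative placeholders, not the paper's configuration (the paper uses a GPT-style architecture):

import torch
import torch.nn as nn

class IndexTransformer(nn.Module):
    # Autoregressive model over discrete codebook indices: for each position
    # it outputs logits over the K possible next indices.
    def __init__(self, K=1024, d_model=256, n_layers=4, n_heads=8, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(K, d_model)           # index -> vector
        self.pos = nn.Embedding(max_len, d_model)     # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, K)

    def forward(self, idx):
        # idx: (B, T) integer codebook indices, with T <= max_len
        T = idx.shape[1]
        pos = torch.arange(T, device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        x = self.blocks(x, mask=causal)                # causal self-attention
        return self.head(x)                            # (B, T, K) next-index logits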


Background recap:

1. What is VQ-VAE?


Supplementary background on VQ-VAE

The topics to cover are: the design details of VQGAN, the design of the transformer that generates the compressed image representation, how conditional image generation is implemented, and how high-resolution image generation is implemented.

The learning objective of VQ-VAE is to compress an image into discrete codes with an encoder and then reconstruct the original image as faithfully as possible with a decoder.

Informally, VQ-VAE compresses a real image into a small image. This small image shares some properties with the real one: its values are discrete, just like pixel values (integers from 0 to 255), and it is still two-dimensional, so it preserves some spatial information. It is therefore more intuitive to picture VQ-VAE as mapping the real image onto such a small discrete grid.

There is, however, one key difference between this small image and a real image: unlike pixel values, its discrete codes carry no notion of similarity. Real pixel values are a discrete sampling of a continuous color range, so neighboring values are similar; color 254, for instance, is close to colors 253 and 255. The codes of the small image have no such relationship: one cannot say that code 1 is similar to code 0 or code 2. Because neural networks do not handle such discrete quantities well, in practice the codes are not represented as integers but as embedding vectors, similar to word embeddings in NLP. VQ-VAE uses an embedding space (also called a codebook) to convert integer indices into vectors.

To map an arbitrary encoder output vector onto a fixed embedding vector, VQ-VAE uses a discretization strategy: each output vector is replaced by the nearest vector in the embedding space, and the discrete code at that position is the index of this nearest vector in the codebook. The idea is analogous to rounding an output color value of 254.9 to the integer color value 255.
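A minimal sketch of this nearest-neighbor quantization step (the tensor shapes and the function name are illustrative, not the reference implementation):

import torch

def quantize(z_e, codebook):
    # z_e: encoder output vectors, shape (N, D); codebook: embedding table, shape (K, D)
    # squared Euclidean distance from every output vector to every codebook vector
    dist = (z_e.pow(2).sum(1, keepdim=True)
            - 2 * z_e @ codebook.t()
            + codebook.pow(2).sum(1))
    indices = dist.argmin(dim=1)   # the discrete code: index of the nearest codebook vector
    z_q = codebook[indices]        # the quantized vectors that replace the encoder outputs
    return z_q, indices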

The VQ-VAE loss has two parts: a reconstruction loss and an embedding-space (codebook) loss.

The reconstruction loss is the mean squared error between the input and the output.

The embedding-space loss is the mean squared error between the encoder's output vectors and their corresponding vectors in the embedding space.

The authors also use a "stop-gradient" trick in this loss. The trick is carried over unchanged into VQGAN, so it is not elaborated here.
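A hedged sketch of these loss terms together with the stop-gradient trick; detach() plays the role of stop-gradient, and the commitment term with weight beta follows the original VQ-VAE paper rather than the short summary above:

import torch.nn.functional as F

def vqvae_loss(x, x_rec, z_e, z_q, beta=0.25):
    rec_loss = F.mse_loss(x_rec, x)                     # reconstruction error (input vs. output)
    codebook_loss = F.mse_loss(z_q, z_e.detach())       # move embeddings toward frozen encoder outputs
    commit_loss = beta * F.mse_loss(z_e, z_q.detach())  # move encoder outputs toward frozen embeddings
    return rec_loss + codebook_loss + commit_loss

# straight-through trick so gradients flow back through the quantization step:
# z_q_st = z_e + (z_q - z_e).detach()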

2. Approach

1. Introducing the codebook


This is where the codebook comes in: a convolutional model first learns a codebook of context-rich visual parts, and the transformer then models how these codebook entries are composed within a high-resolution image.
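A minimal sketch of what such a codebook might look like, as a plain embedding table; the size K=1024 and dimension D=256 are illustrative values, not necessarily the paper's configuration:

import torch
import torch.nn as nn

class Codebook(nn.Module):
    # K discrete entries, each a D-dimensional vector
    def __init__(self, K=1024, D=256):
        super().__init__()
        self.embedding = nn.Embedding(K, D)
        # small uniform initialization, as is common for VQ codebooks
        self.embedding.weight.data.uniform_(-1.0 / K, 1.0 / K)

    def forward(self, indices):
        # map integer codes back to their embedding vectors
        return self.embedding(indices)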


Conclusion

Summary of the article: VQGAN combines the inductive bias of CNNs with the expressivity of transformers. A convolutional, adversarially trained autoencoder learns a codebook of perceptually rich visual parts, and a transformer then models the global composition of these parts, which enables class- and layout-conditional synthesis of high-resolution images.
