7.7 - Taming Transformers for High-Resolution Image Synthesis

Entropy Coding series, table of contents:

Chapter 1: VQGAN

Chapter 2: McQuic


Note: this part covers VQGAN.


Summary

Concept supplement:

Transformers vs. CNNs: the key design difference is whether an inductive bias is built in. An inductive bias is a set of assumptions a learning algorithm makes about the target function; it can also be understood as guiding rules imposed on the model.

Concretely, an inductive bias is the preference or tendency a learning algorithm shows toward certain solutions, and it shapes what the algorithm learns. When an algorithm induces regularities from data, it relies on some prior assumption, preference, or constraint; that is its inductive bias. For example, if a classifier assumes that certain input features are highly correlated with the output label, it will pay more attention to those features during training and predict better on them. Such preferences help the algorithm reason and generalize on new data by favoring solutions with particular properties.

In machine learning, many algorithms make necessary assumptions about the target function of the learning problem; these assumptions are the inductive bias. Induction, a standard method in the natural sciences, means finding commonalities across examples and generalizing them into a broader rule; bias refers to the model's preference. An inductive bias can therefore be understood as rules induced from observed phenomena that then constrain the model, effectively acting as a form of model selection.

The purpose of an inductive bias is to give the learner the ability to generalize: with a suitable bias, general rules can be inferred from limited data and applied to new data, which lets the algorithm handle unseen inputs and make better predictions and decisions.

In short, inductive bias matters in machine learning. It helps an algorithm reason and generalize on new data by preferring solutions with certain properties, and a well-chosen inductive bias can improve a learning algorithm's performance.

nn.Parameter()

Parameters are a subclass of Tensor with one very special property when used with Modules: when assigned as a Module attribute, a Parameter is automatically added to the module's parameter list and will therefore appear, for example, in the parameters() iterator. Assigning a plain tensor has no such effect. This is deliberate, because one may want to cache some temporary state in a model, such as the last hidden state of an RNN; without a class like Parameter, these temporaries would also get registered.

In plain terms

torch.nn.Parameter() turns a non-trainable tensor into a trainable parameter and binds that parameter to the module, so the tensor is a learnable parameter from the moment the network is defined. The point of using it is to let certain variables keep being updated during training until they reach an optimal value.
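A minimal sketch of the difference between a registered Parameter and a plain tensor attribute (the module name Scaler and its fields are made up for this example):

import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        # assigned as nn.Parameter: automatically registered, shows up in parameters()
        self.scale = nn.Parameter(torch.ones(1))
        # plain tensor attribute: not registered, an optimizer will ignore it
        self.cache = torch.zeros(1)

    def forward(self, x):
        return x * self.scale

m = Scaler()
print([name for name, _ in m.named_parameters()])  # prints ['scale'] only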

Abstract

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://git.io/JnyvK.

Introduction

Transformers are on the rise—they are now the de-facto standard architecture for language tasks and are increasingly adapted in other areas such as audio and vision. In contrast to the predominant vision architecture, convolutional neural networks (CNNs), the transformer architecture contains no built-in inductive prior on the locality of interactions and is therefore free to learn complex relationships among its inputs. However, this generality also implies that it has to learn all relationships, whereas CNNs have been designed to exploit prior knowledge about strong local correlations within images. Thus, the increased expressivity of transformers comes with quadratically increasing computational costs, because all pairwise interactions are taken into account. The resulting energy and time requirements of state-of-the-art transformer models thus pose fundamental problems for scaling them to high-resolution images with millions of pixels.

Observations that transformers tend to learn convolutional structures [16] thus beg the question: Do we have to re-learn everything we know about the local structure and regularity of images from scratch each time we train a vision model, or can we efficiently encode inductive image biases while still retaining the flexibility of transformers? We hypothesize that low-level image structure is well described by a local connectivity, i.e. a convolutional architecture, whereas this structural assumption ceases to be effective on higher semantic levels. Moreover, CNNs not only exhibit a strong locality bias, but also a bias towards spatial invariance through the use of shared weights across all positions. This makes them ineffective if a more holistic understanding of the input is required.

Our key insight to obtain an effective and expressive model is that, taken together, convolutional and transformer architectures can model the compositional nature of our visual world [51]: We use a convolutional approach to efficiently learn a codebook of context-rich visual parts and, subsequently, learn a model of their global compositions. The long-range interactions within these compositions require an expressive transformer architecture to model distributions over their constituent visual parts. Furthermore, we utilize an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure to alleviate the need for modeling low-level statistics with the transformer architecture. Allowing transformers to concentrate on their unique strength — modeling long-range relations — enables them to generate high-resolution images as in Fig. 1, a feat which previously has been out of reach. Our formulation gives control over the generated images by means of conditioning information regarding desired object classes or spatial layouts. Finally, experiments demonstrate that our approach retains the advantages of transformers by outperforming previous codebook-based state-of-the-art approaches based on convolutional architectures.
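To make the two-stage idea concrete, here is a hedged sketch of the second stage: a transformer that autoregressively predicts the next codebook index given the previous ones. The class name, layer counts, and sizes are illustrative placeholders, not the paper's configuration (the paper uses a GPT-style architecture):

import torch
import torch.nn as nn

class IndexTransformer(nn.Module):
    # Autoregressive model over discrete codebook indices: for each position
    # it outputs logits over the K possible next indices.
    def __init__(self, K=1024, d_model=256, n_layers=4, n_heads=8, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(K, d_model)           # index -> vector
        self.pos = nn.Embedding(max_len, d_model)     # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, K)

    def forward(self, idx):
        # idx: (B, T) integer codebook indices, with T <= max_len
        T = idx.shape[1]
        pos = torch.arange(T, device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        x = self.blocks(x, mask=causal)                # causal self-attention
        return self.head(x)                            # (B, T, K) next-index logits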


Background recap:

1. What is VQ-VAE?


Supplementary background on VQ-VAE

The topics to cover are: the design details of VQGAN, the design of the transformer that generates the compressed image representation, how conditional image generation is implemented, and how high-resolution image generation is implemented.

The learning objective of VQ-VAE is to compress an image into discrete codes with an encoder and then reconstruct the original image as faithfully as possible with a decoder.

Informally, VQ-VAE compresses a real image into a small image. This small image shares some properties with the real one: its values are discrete, just like pixel values (integers from 0 to 255), and it is still two-dimensional, so it preserves some spatial information. It is therefore more intuitive to picture VQ-VAE as mapping the real image onto such a small discrete grid.

There is, however, one key difference between this small image and a real image: unlike pixel values, its discrete codes carry no notion of similarity. Real pixel values are a discrete sampling of a continuous color range, so neighboring values are similar; color 254, for instance, is close to colors 253 and 255. The codes of the small image have no such relationship: one cannot say that code 1 is similar to code 0 or code 2. Because neural networks do not handle such discrete quantities well, in practice the codes are not represented as integers but as embedding vectors, similar to word embeddings in NLP. VQ-VAE uses an embedding space (also called a codebook) to convert integer indices into vectors.

To map an arbitrary encoder output vector onto a fixed embedding vector, VQ-VAE uses a discretization strategy: each output vector is replaced by the nearest vector in the embedding space, and the discrete code at that position is the index of this nearest vector in the codebook. The idea is analogous to rounding an output color value of 254.9 to the integer color value 255.
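A minimal sketch of this nearest-neighbor quantization step (the tensor shapes and the function name are illustrative, not the reference implementation):

import torch

def quantize(z_e, codebook):
    # z_e: encoder output vectors, shape (N, D); codebook: embedding table, shape (K, D)
    # squared Euclidean distance from every output vector to every codebook vector
    dist = (z_e.pow(2).sum(1, keepdim=True)
            - 2 * z_e @ codebook.t()
            + codebook.pow(2).sum(1))
    indices = dist.argmin(dim=1)   # the discrete code: index of the nearest codebook vector
    z_q = codebook[indices]        # the quantized vectors that replace the encoder outputs
    return z_q, indices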

The VQ-VAE loss has two parts: a reconstruction loss and an embedding-space (codebook) loss.

The reconstruction loss is the mean squared error between the input and the output.

The embedding-space loss is the mean squared error between the encoder's output vectors and their corresponding vectors in the embedding space.

The authors also use a "stop-gradient" trick in this loss. The trick is carried over unchanged into VQGAN, so it is not elaborated here.
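A hedged sketch of these loss terms together with the stop-gradient trick; detach() plays the role of stop-gradient, and the commitment term with weight beta follows the original VQ-VAE paper rather than the short summary above:

import torch.nn.functional as F

def vqvae_loss(x, x_rec, z_e, z_q, beta=0.25):
    rec_loss = F.mse_loss(x_rec, x)                     # reconstruction error (input vs. output)
    codebook_loss = F.mse_loss(z_q, z_e.detach())       # move embeddings toward frozen encoder outputs
    commit_loss = beta * F.mse_loss(z_e, z_q.detach())  # move encoder outputs toward frozen embeddings
    return rec_loss + codebook_loss + commit_loss

# straight-through trick so gradients flow back through the quantization step:
# z_q_st = z_e + (z_q - z_e).detach()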

2. Approach

1. Introducing the codebook


This is where the codebook comes in: a convolutional model first learns a codebook of context-rich visual parts, and the transformer then models how these codebook entries are composed within a high-resolution image.
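A minimal sketch of what such a codebook might look like, as a plain embedding table; the size K=1024 and dimension D=256 are illustrative values, not necessarily the paper's configuration:

import torch
import torch.nn as nn

class Codebook(nn.Module):
    # K discrete entries, each a D-dimensional vector
    def __init__(self, K=1024, D=256):
        super().__init__()
        self.embedding = nn.Embedding(K, D)
        # small uniform initialization, as is common for VQ codebooks
        self.embedding.weight.data.uniform_(-1.0 / K, 1.0 / K)

    def forward(self, indices):
        # map integer codes back to their embedding vectors
        return self.embedding(indices)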


Conclusion

Summary of the article: VQGAN combines the inductive bias of CNNs with the expressivity of transformers. A convolutional, adversarially trained autoencoder learns a codebook of perceptually rich visual parts, and a transformer then models the global composition of these parts, which enables class- and layout-conditional synthesis of high-resolution images.
