Density Estimation Using Real NVP
https://arxiv.org/pdf/1605.08803.pdf
Abstract
Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task.
We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful, stably invertible, and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact and efficient sampling, exact and efficient inference of latent variables, and an interpretable latent space.
We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation, and latent variable manipulations.
Introduction
The domain of representation learning has undergone tremendous advances due to improved supervised learning techniques. However, unsupervised learning has the potential to leverage large pools of unlabeled data, and extend these advances to modalities that are otherwise impractical or impossible.
One principled approach to unsupervised learning is generative probabilistic modeling. Not only do generative probabilistic models have the ability to create novel content, they also have a wide range of reconstruction related applications including inpainting [61, 46, 59], denoising [3], colorization [71], and super-resolution [9].
[61] Generative image modeling using spatial LSTMs. 2015 NeurIPS
[46] Pixel recurrent neural networks. 2016
[59] Deep unsupervised learning using nonequilibrium thermodynamics. 2015 ICML
[3] Density modeling of images using a generalized normalization transformation. 2015
[71] Colorful image colorization. 2016
[9] Super-resolution with deep convolutional sufficient statistics. 2015
As data of interest are generally high-dimensional and highly structured, the challenge in this domain is building models that are powerful enough to capture its complexity yet still trainable. We address this challenge by introducing real-valued non-volume preserving (real NVP) transformations, a tractable yet expressive approach to modeling high-dimensional data.
This model can perform efficient and exact inference, sampling and log-density estimation of data points. Moreover, the architecture presented in this paper enables exact and efficient reconstruction of input images from the hierarchical features extracted by this model.
Related work
Substantial work on probabilistic generative models has focused on training models using maximum likelihood. One class of maximum likelihood models are those described by probabilistic undirected graphs, such as Restricted Boltzmann Machines [58] and Deep Boltzmann Machines [53]. These models are trained by taking advantage of the conditional independence property of their bipartite structure to allow efficient exact or approximate posterior inference on latent variables. However, because of the intractability of the associated marginal distribution over latent variables, their training, evaluation, and sampling procedures necessitate the use of approximations like Mean Field inference and Markov Chain Monte Carlo, whose convergence time for such complex models remains undetermined, often resulting in generation of highly correlated samples. Furthermore, these approximations can often hinder their performance [7].
Directed graphical models are instead defined in terms of an ancestral sampling procedure, which is appealing both for its conceptual and computational simplicity. They lack, however, the conditional independence structure of undirected models, making exact and approximate posterior inference on latent variables cumbersome [56].
Recent advances in stochastic variational inference [27] and amortized inference [13, 43, 35, 49], allowed efficient approximate inference and learning of deep directed graphical models by maximizing a variational lower bound on the log-likelihood [45]. In particular, the variational autoencoder algorithm [35, 49] simultaneously learns a generative network, that maps gaussian latent variables z to samples x, and a matched approximate inference network that maps samples x to a semantically meaningful latent representation z, by exploiting the reparametrization trick [68]. Its success in leveraging recent advances in backpropagation [51, 39] in deep neural networks resulted in its adoption for several applications ranging from speech synthesis [12] to language modeling [8]. Still, the approximation in the inference process limits its ability to learn high dimensional deep representations, motivating recent work in improving approximate inference [42, 48, 55, 63, 10, 59, 34].
Such approximations can be avoided altogether by abstaining from using latent variables. Autoregressive models [18, 6, 37, 20] can implement this strategy while typically retaining a great deal of flexibility. This class of algorithms tractably models the joint distribution by decomposing it into a product of conditionals using the probability chain rule according to a fixed ordering over dimensions, simplifying log-likelihood evaluation and sampling. Recent work in this line of research has taken advantage of recent advances in recurrent networks [51], in particular long-short term memory [26], and residual networks [25, 24] in order to learn state-of-the-art generative image models [61, 46] and language models [32]. The ordering of the dimensions, although often arbitrary, can be critical to the training of the model [66]. The sequential nature of this model limits its computational efficiency. For example, its sampling procedure is sequential and non-parallelizable, which can become cumbersome in applications like speech and music synthesis, or real-time rendering. Additionally, there is no natural latent representation associated with autoregressive models, and they have not yet been shown to be useful for semi-supervised learning.
Generative Adversarial Networks (GANs) [21] on the other hand can train any differentiable generative network by avoiding the maximum likelihood principle altogether. Instead, the generative network is associated with a discriminator network whose task is to distinguish between samples and real data. Rather than using an intractable log-likelihood, this discriminator network provides the training signal in an adversarial fashion. Successfully trained GAN models [21, 15, 47] can consistently generate sharp and realistically looking samples [38]. However, metrics that measure the diversity in the generated samples are currently intractable [62, 22, 30]. Additionally, instability in their training process [47] requires careful hyperparameter tuning to avoid diverging behavior.
Training such a generative network $g$ that maps a latent variable $z \sim p_Z$ to a sample $x \sim p_X$ does not in theory require a discriminator network as in GANs, or approximate inference as in variational autoencoders. Indeed, if $g$ is bijective, it can be trained through maximum likelihood using the change of variable formula:
$$p_X(x) = p_Z(z)\,\left|\det\!\left(\frac{\partial g(z)}{\partial z^T}\right)\right|^{-1} \tag{1}$$
This formula has been discussed in several papers including the maximum likelihood formulation of independent components analysis (ICA) [4, 28], gaussianization [14, 11] and deep density models [5, 50, 17, 3]. As the existence proof of nonlinear ICA solutions [29] suggests, auto-regressive models can be seen as tractable instance of maximum likelihood nonlinear ICA, where the residual corresponds to the independent components. However, naive application of the change of variable formula produces models which are computationally expensive and poorly conditioned, and so large scale models of this type have not entered general use.
Model definition
In this paper, we will tackle the problem of learning highly nonlinear models in high-dimensional continuous spaces through maximum likelihood. In order to optimize the log-likelihood, we introduce a more flexible class of architectures that enables the computation of log-likelihood on continuous data using the change of variable formula. Building on our previous work in [17], we define a powerful class of bijective functions which enable exact and tractable density evaluation and exact and tractable inference. Moreover, the resulting cost function does not rely on a fixed form reconstruction cost such as square error [38, 47], and generates sharper samples as a result. Also, this flexibility helps us leverage recent advances in batch normalization [31] and residual networks [24, 25] to define a very deep multi-scale architecture with multiple levels of abstraction.
[17] NICE: Non-linear independent components estimation. 2014
1. Change of variable formula
Given an observed data variable $x \in X$, a simple prior probability distribution $p_Z$ on a latent variable $z \in Z$, and a bijection $f : X \to Z$ (with $g = f^{-1}$), the change of variable formula defines a model distribution on $X$ by
$$p_X(x) = p_Z\big(f(x)\big)\,\left|\det\!\left(\frac{\partial f(x)}{\partial x^T}\right)\right| \tag{2}$$
$$\log\big(p_X(x)\big) = \log\Big(p_Z\big(f(x)\big)\Big) + \log\!\left(\left|\det\!\left(\frac{\partial f(x)}{\partial x^T}\right)\right|\right) \tag{3}$$
where $\frac{\partial f(x)}{\partial x^T}$ is the Jacobian of $f$ at $x$.
Exact samples from the resulting distribution can be generated by using the inverse transform sampling rule [16]. A sample $z \sim p_Z$ is drawn in the latent space, and its inverse image $x = f^{-1}(z) = g(z)$ generates a sample in the original space. Computing the density on a point $x$ is accomplished by computing the density of its image $f(x)$ and multiplying by the associated Jacobian determinant $\det\!\left(\frac{\partial f(x)}{\partial x^T}\right)$. See also Figure 1. Exact and efficient inference enables the accurate and fast evaluation of the model.
[16] Sample-based non-uniform random variate generation. 1986
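To make Equations (2)-(3) concrete, here is a minimal NumPy sketch using a toy element-wise affine bijection and a standard-normal prior; the names (a, b, log_prob_x) are illustrative assumptions and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
a = rng.uniform(0.5, 2.0, size=D)     # element-wise scale (nonzero, so f is a bijection)
b = rng.normal(size=D)                # element-wise shift

def f(x):                             # data -> latent
    return a * x + b

def g(z):                             # latent -> data, the exact inverse of f
    return (z - b) / a

def log_prior(z):                     # log p_Z(z) for a standard normal prior
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))

def log_prob_x(x):
    # Equation (3): log p_X(x) = log p_Z(f(x)) + log|det(df/dx)|.
    # The Jacobian of this toy f is diagonal, so the log-determinant is sum(log|a|).
    return log_prior(f(x)) + np.sum(np.log(np.abs(a)))

z = rng.normal(size=D)                # exact sampling: draw z from the prior ...
x = g(z)                              # ... and push it through the inverse map
print(log_prob_x(x))
```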
2. Coupling layers
Computing the Jacobian of functions with high-dimensional domain and codomain and computing the determinants of large matrices are in general computationally very expensive. This combined with the restriction to bijective functions makes Equation 2 appear impractical for modeling arbitrary distributions.
As shown however in [17], by careful design of the function $f$, a bijective model can be learned which is both tractable and extremely flexible. As computing the Jacobian determinant of the transformation is crucial to effectively train using this principle, this work exploits the simple observation that the determinant of a triangular matrix can be efficiently computed as the product of its diagonal terms.
We will build a flexible and tractable bijective function by stacking a sequence of simple bijections. In each simple bijection, part of the input vector is updated using a function which is simple to invert, but which depends on the remainder of the input vector in a complex way. We refer to each of these simple bijections as an affine coupling layer. Given a $D$ dimensional input $x$ and $d < D$, the output $y$ of an affine coupling layer follows the equations
$$y_{1:d} = x_{1:d} \tag{4}$$
$$y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}) \tag{5}$$
where $s$ and $t$ stand for scale and translation, and are functions from $\mathbb{R}^d \mapsto \mathbb{R}^{D-d}$, and $\odot$ is the Hadamard product or element-wise product (see Figure 2(a)).
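As a concrete illustration of Equations (4)-(5), below is a minimal NumPy sketch of the forward pass of one affine coupling layer; the functions s and t here are toy element-wise stand-ins (the paper uses deep convolutional networks), and d is taken to be half the dimensionality so their output sizes match.

```python
import numpy as np

def s(x1):                  # toy scale "network": R^d -> R^(D-d), assuming d = D - d
    return np.tanh(x1)

def t(x1):                  # toy translation "network"
    return 0.5 * x1

def coupling_forward(x, d):
    x1, x2 = x[:d], x[d:]
    y1 = x1                                  # Equation (4): first d components are copied
    y2 = x2 * np.exp(s(x1)) + t(x1)          # Equation (5): the rest are scaled and shifted
    return np.concatenate([y1, y2])

x = np.random.default_rng(0).normal(size=6)
y = coupling_forward(x, d=3)
```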
3. Properties
The Jacobian of this transformation is
$$\frac{\partial y}{\partial x^T} = \begin{bmatrix} \mathbb{I}_d & 0 \\ \frac{\partial y_{d+1:D}}{\partial x_{1:d}^T} & \operatorname{diag}\!\big(\exp[s(x_{1:d})]\big) \end{bmatrix} \tag{6}$$
where $\operatorname{diag}\!\big(\exp[s(x_{1:d})]\big)$ is the diagonal matrix whose diagonal elements correspond to the vector $\exp[s(x_{1:d})]$. Given the observation that this Jacobian is triangular, we can efficiently compute its determinant as $\exp\!\left[\sum_j s(x_{1:d})_j\right]$. Since computing the Jacobian determinant of the coupling layer operation does not involve computing the Jacobian of $s$ or $t$, those functions can be arbitrarily complex. We will make them deep convolutional neural networks. Note that the hidden layers of $s$ and $t$ can have more features than their input and output layers.
Another interesting property of these coupling layers in the context of defining probabilistic models is their invertibility. Indeed, computing the inverse is no more complex than the forward propagation (see Figure 2(b)),
$$x_{1:d} = y_{1:d} \tag{7}$$
$$x_{d+1:D} = \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\big(-s(y_{1:d})\big) \tag{8}$$
meaning that sampling is as efficient as inference for this model. Note again that computing the inverse of the coupling layer does not require computing the inverse of $s$ or $t$, so these functions can be arbitrarily complex and difficult to invert.
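These two properties are easy to check numerically. The following self-contained sketch (with the same toy s and t as above, not the paper's networks) computes the log-determinant as the sum of the outputs of s and inverts the layer without ever inverting s or t:

```python
import numpy as np

def s(x1):                  # same toy stand-ins as in the previous sketch
    return np.tanh(x1)

def t(x1):
    return 0.5 * x1

def coupling_forward(x, d):
    x1, x2 = x[:d], x[d:]
    return np.concatenate([x1, x2 * np.exp(s(x1)) + t(x1)])

def coupling_log_det(x, d):
    return np.sum(s(x[:d]))                  # log|det J| = sum_j s(x_{1:d})_j

def coupling_inverse(y, d):
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))       # Equations (7)-(8): undo the shift, then the scale
    return np.concatenate([y1, x2])

x = np.random.default_rng(1).normal(size=6)
y = coupling_forward(x, d=3)
assert np.allclose(coupling_inverse(y, d=3), x)   # exact reconstruction
```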
4. Masked convolution
Partitioning can be implemented using a binary mask $b$, and using the functional form for $y$
$$y = b \odot x + (1 - b) \odot \Big(x \odot \exp\big(s(b \odot x)\big) + t(b \odot x)\Big) \tag{9}$$
We use two partitionings that exploit the local correlation structure of images: spatial checkerboard patterns, and channel-wise masking (see Figure 3). The spatial checkerboard pattern mask has value 1 where the sum of spatial coordinates is odd, and 0 otherwise. The channel-wise mask is 1 for the first half of the channel dimensions and 0 for the second half. For the models presented here, both $s(\cdot)$ and $t(\cdot)$ are rectified convolutional networks.
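A small sketch of the two partitionings for an H × W × C tensor and of the masked functional form of Equation (9); the shapes and the toy s_fn/t_fn passed at the end are illustrative assumptions, not the paper's code:

```python
import numpy as np

def checkerboard_mask(h, w, c):
    # 1 where the sum of spatial coordinates is odd, 0 otherwise (same mask on every channel)
    i, j = np.indices((h, w))
    m = ((i + j) % 2 == 1).astype(np.float64)
    return np.repeat(m[:, :, None], c, axis=2)

def channel_mask(h, w, c):
    # 1 for the first half of the channel dimensions, 0 for the second half
    m = np.zeros((h, w, c))
    m[:, :, : c // 2] = 1.0
    return m

def masked_coupling(x, b, s_fn, t_fn):
    # Equation (9): y = b*x + (1 - b) * (x * exp(s(b*x)) + t(b*x))
    xb = b * x
    return xb + (1 - b) * (x * np.exp(s_fn(xb)) + t_fn(xb))

x = np.random.default_rng(0).normal(size=(4, 4, 2))
b = checkerboard_mask(4, 4, 2)
y = masked_coupling(x, b,
                    s_fn=lambda v: 0.1 * np.tanh(v.sum()) * np.ones_like(v),
                    t_fn=lambda v: v.mean() * np.ones_like(v))
```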
5. Combining coupling layers
Although coupling layers can be powerful, their forward transformation leaves some components unchanged. This difficulty can be overcome by composing coupling layers in an alternating pattern, such that the components that are left unchanged in one coupling layer are updated in the next (see Figure 4(a)).
The Jacobian determinant of the resulting function remains tractable, relying on the fact that
$$\frac{\partial (f_b \circ f_a)}{\partial x_a^T}(x_a) = \frac{\partial f_b}{\partial x_b^T}\big(x_b = f_a(x_a)\big) \cdot \frac{\partial f_a}{\partial x_a^T}(x_a) \tag{10}$$
$$\det(A \cdot B) = \det(A)\,\det(B) \tag{11}$$
Similarly, its inverse can be computed easily as
$$(f_b \circ f_a)^{-1} = f_a^{-1} \circ f_b^{-1} \tag{12}$$
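The sketch below stacks two masked coupling layers with complementary masks: components copied by the first layer are updated by the second, the per-layer log-determinants simply add (Equations 10-11), and the inverse applies the per-layer inverses in reverse order (Equation 12). Again, s_fn and t_fn are toy stand-ins for the convolutional networks.

```python
import numpy as np

s_fn = lambda v: 0.1 * np.tanh(v.sum()) * np.ones_like(v)   # toy stand-in for the scale net
t_fn = lambda v: v.mean() * np.ones_like(v)                 # toy stand-in for the translation net

def layer_forward(x, b):
    xb = b * x
    y = xb + (1 - b) * (x * np.exp(s_fn(xb)) + t_fn(xb))     # Equation (9)
    log_det = np.sum((1 - b) * s_fn(xb))                     # only transformed components contribute
    return y, log_det

def layer_inverse(y, b):
    yb = b * y                                               # equals b*x, so s and t can be recomputed
    return yb + (1 - b) * ((y - t_fn(yb)) * np.exp(-s_fn(yb)))

x = np.random.default_rng(0).normal(size=8)
b = (np.arange(8) % 2).astype(np.float64)                    # mask for the first layer
y1, ld1 = layer_forward(x, b)
y2, ld2 = layer_forward(y1, 1 - b)                           # complementary mask updates the rest
total_log_det = ld1 + ld2                                    # log-determinants add under composition
x_rec = layer_inverse(layer_inverse(y2, 1 - b), b)           # inverses applied in reverse order
assert np.allclose(x_rec, x)
```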
6. Multi-scale architecture
We implement a multi-scale architecture using a squeezing operation: for each channel, it divides the image into subsquares of shape 2 × 2 × c, then reshapes them into subsquares of shape 1 × 1 × 4c. The squeezing operation transforms an s × s × c tensor into an s/2 × s/2 × 4c tensor (see Figure 3), effectively trading spatial size for number of channels.
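A minimal sketch of the squeezing operation on a single s × s × c array, together with its exact inverse (the precise ordering of the 4c output channels is an assumption here; implementations may differ):

```python
import numpy as np

def squeeze(x):
    # s x s x c  ->  (s/2) x (s/2) x 4c: each 2x2 spatial block becomes 4 channels
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)            # gather each 2x2 block next to the channel axis
    return x.reshape(h // 2, w // 2, 4 * c)

def unsqueeze(y):
    # exact inverse of squeeze, so the overall model stays bijective
    h, w, c4 = y.shape
    c = c4 // 4
    y = y.reshape(h, w, 2, 2, c).transpose(0, 2, 1, 3, 4)
    return y.reshape(2 * h, 2 * w, c)

x = np.random.default_rng(0).normal(size=(4, 4, 3))
assert squeeze(x).shape == (2, 2, 12)
assert np.allclose(unsqueeze(squeeze(x)), x)
```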
At each scale, we combine several operations into a sequence: we first apply three coupling layers with alternating checkerboard masks, then perform a squeezing operation, and finally apply three more coupling layers with alternating channel-wise masking. The channel-wise masking is chosen so that the resulting partitioning is not redundant with the previous checkerboard masking (see Figure 3). For the final scale, we only apply four coupling layers with alternating checkerboard masks.
Propagating a $D$ dimensional vector through all the coupling layers would be cumbersome, in terms of computational and memory cost, and in terms of the number of parameters that would need to be trained. For this reason we follow the design choice of [57] and factor out half of the dimensions at regular intervals (see Equation 14). We can define this operation recursively (see Figure 4(b)),
$$h^{(0)} = x \tag{13}$$
$$(z^{(i+1)}, h^{(i+1)}) = f^{(i+1)}(h^{(i)}) \tag{14}$$
$$z^{(L)} = f^{(L)}(h^{(L-1)}) \tag{15}$$
$$z = (z^{(1)}, \dots, z^{(L)}) \tag{16}$$
In our experiments, we use this operation for $i < L$. The sequence of coupling-squeezing-coupling operations described above is performed per layer when computing $f^{(i)}$ (Equation 14). At each layer, as the spatial resolution is reduced, the number of hidden layer features in $s$ and $t$ is doubled. All variables which have been factored out at different scales are concatenated to obtain the final transformed output $z$ (Equation 16).
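The following sketch mirrors the recursion of Equations (13)-(16) on a flat vector: at every scale except the last, half of the current representation is factored out and only the remainder is propagated further. scale_fn is a placeholder assumption for the per-scale coupling-squeezing-coupling block.

```python
import numpy as np

def scale_fn(h, i):
    # placeholder for f^(i+1); in the paper this is the
    # coupling-squeeze-coupling sequence at scale i
    return h

def multiscale_forward(x, L):
    h = x                                    # h^(0) = x                        (13)
    zs = []
    for i in range(L - 1):
        h = scale_fn(h, i)
        z_i, h = np.split(h, 2)              # factor out half the dimensions   (14)
        zs.append(z_i)
    zs.append(scale_fn(h, L - 1))            # z^(L) = f^(L)(h^(L-1))           (15)
    return np.concatenate(zs)                # z = (z^(1), ..., z^(L))          (16)

x = np.random.default_rng(0).normal(size=16)
z = multiscale_forward(x, L=3)
assert z.shape == x.shape                    # the transformation preserves dimensionality
```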
As a consequence, the model must Gaussianize units which are factored out at a finer scale (in an earlier layer) before those which are factored out at a coarser scale (in a later layer). This results in the definition of intermediary levels of representation [53, 49] corresponding to more local, fine-grained features as shown in Appendix D.
Moreover, Gaussianizing and factoring out units in earlier layers has the practical benefit of distributing the loss function throughout the network, following the philosophy similar to guiding intermediate layers using intermediate classifiers [40]. It also reduces significantly the amount of computation and memory used by the model, allowing us to train larger models.