Generative Variational Autoencoder for High Resolution Image Synthesis


This article presents our research on high resolution image generation using a Generative Variational Autoencoder.

Important Points

Our work addresses both the mode collapse issue of GANs and the blurriness of images generated by VAEs within a single model architecture.
We use the encoder of the VAE as it is and replace the decoder with a discriminator.
The encoder is fed data from a normal distribution while the generator is fed data from a Gaussian distribution (the two terms name the same family of distributions, so these are presumably two separately drawn samples).
The outputs of both are then fed to a discriminator, which judges whether the generated images are real or not.
We evaluate our network on 3 different datasets: MNIST, CelebA-HQ and LSUN.
We outperform previous state-of-the-art methods on MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence as the evaluation metrics.

Introduction

The training of deep neural networks requires hundreds or even thousands of images. The lack of labelled datasets, especially for medical images, often hinders progress, so it becomes imperative to create additional training data. One actively researched way of doing this is to use generative adversarial networks (GANs) for image generation. With this technique, new images are generated by training on the images already present in the dataset. The new images are realistic but different from the original data. There are two main approaches to data augmentation with GANs: image-to-image translation and sampling from a random distribution. The main challenge with GANs is the mode collapse problem, i.e. the generated images are quite similar to each other and lack variety.

Another approach to image generation uses Variational Autoencoders (VAEs). This architecture contains an encoder, also known as the inference network, which takes an observation as input and outputs a set of parameters for the conditional distribution of the latent representation. The decoder, also known as the generative network, takes a latent encoding as input and outputs the parameters for a conditional distribution of the observation. During training, VAEs use the reparameterization trick: noise is sampled from a Gaussian distribution and then shifted and scaled by the encoder's mean and standard deviation, so that gradients can flow through the sampling step. The main challenge with VAEs is that they are not able to generate sharp images.
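The reparameterization trick mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in this work; the function name `reparameterize` and the example values of `mu` and `log_var` are assumptions standing in for real encoder outputs.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # standard deviation from log-variance
        eps = rng.gauss(0.0, 1.0)    # noise that does not depend on mu or sigma
        z.append(m + sigma * eps)
    return z

rng = random.Random(0)
mu = [0.5, -1.0]        # hypothetical encoder mean for a 2-d latent
log_var = [0.0, -2.0]   # hypothetical encoder log-variance
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
print(mean0, mean1)     # sample means should be close to mu
```

Because the randomness lives entirely in `eps`, the sample `z` is a deterministic function of `mu` and `log_var`, which is what lets a framework with automatic differentiation backpropagate through the sampling step.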

Dataset

The following datasets are used for training and evaluation:


MNIST: This is a large dataset of handwritten digits which has been used successfully for training image classification and image processing algorithms. It contains 60,000 training images and 10,000 test images.
LSUN dataset: This dataset contains millions of color images with 10 scene categories and 20 object categories. It is one of the most common datasets for training and testing GAN based neural networks.
CelebA-HQ dataset: This is a high-quality subset of CelebA, a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. CelebA-HQ contains 30,000 of these faces at resolutions up to 1024×1024. It is also one of the most common datasets for training and testing GAN based neural networks.

VAE vs Our Network

We show that, instead of performing inference as in the original VAE architecture, we can add the error vector to the original data and multiply by the standard deviation. The resulting term goes to the encoder and is converted to the latent space. Similarly, in the decoder the error vector is added to the latent vector and multiplied by the standard deviation. In this way, we use the encoder in a manner similar to the original VAE, while we replace the decoder with a discriminator and change the loss function accordingly. The comparison between the model architecture of the VAE and our architecture is shown in Fig 1.

Figure 1: Comparison between the standard VAE and our network, where e1 and e2 denote samples from some noise distribution, x denotes the image vector, z denotes the latent space vector, f and g denote the encoder and decoder functions respectively, and + and ∗ denote the addition and concatenation operators.

Our architecture can be seen as an extension of both the VAE and the GAN. Seeing it as the former is easy, since it only requires a change in the loss function for the decoder. The latter follows from recalling that a GAN essentially works as a zero-sum game in which the generator and discriminator are driven toward a Nash equilibrium. In our case, the encoder from the VAE and the discriminator from the GAN play a zero-sum game and compete with each other. As training proceeds, the loss of both decreases until it stabilizes.

Network Architecture

The network architecture used in this work is explained in the points below:

The discriminator and encoder networks have four convolution layers, each of which uses 3×3 filters.
We use Batch Normalization and Leaky Rectified Linear Unit (LeakyReLU) layers after each layer.
We found that our architecture suffers from instability during training. This was solved using the WGAN loss function, which measures the Wasserstein distance between two distributions.
We used the gradient penalty term to stabilize the training.
Our loss function has a total of 3 terms. While training, the encoder and the generator are considered as one network. Thus, we sum the loss functions of the two networks (the encoder-generator and the discriminator) into one and train the networks.
Two latent vectors are sampled, one from a normal distribution and the other from a Gaussian distribution (the two terms name the same family of distributions, so these are presumably two separately drawn samples). The first is fed to the encoder while the second is fed to the generator.
The outputs produced from both vectors are in turn fed to the discriminator, which tells whether the generated image is real or not.
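The WGAN loss with gradient penalty named in the points above can be illustrated on a toy critic. The sketch below is not the network from this work: the linear `critic`, the finite-difference `grad_norm` helper and the penalty weight `lam=10.0` (the value commonly used with WGAN-GP) are all assumptions chosen so the example runs in plain Python. A real implementation would use a convolutional discriminator and automatic differentiation.

```python
import math
import random

def critic(x, w, b):
    """Hypothetical linear critic f(x) = w . x + b (stand-in for the conv discriminator)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def grad_norm(f, x, h=1e-5):
    """L2 norm of df/dx at x, estimated with central finite differences."""
    total = 0.0
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        total += ((f(up) - f(down)) / (2 * h)) ** 2
    return math.sqrt(total)

def wgan_gp_critic_loss(f, real, fake, rng, lam=10.0):
    """Batch mean of f(fake) - f(real) + lam * (||grad f(x_hat)|| - 1)^2."""
    loss = 0.0
    for xr, xf in zip(real, fake):
        eps = rng.random()  # interpolation weight in [0, 1)
        x_hat = [eps * r + (1 - eps) * g for r, g in zip(xr, xf)]
        penalty = (grad_norm(f, x_hat) - 1.0) ** 2
        loss += f(xf) - f(xr) + lam * penalty
    return loss / len(real)

rng = random.Random(0)
real = [[1.0, 1.0], [2.0, 0.0]]   # toy "real" samples
fake = [[0.0, 0.0], [0.0, 1.0]]   # toy "generated" samples

# ||w|| = 1, so the gradient penalty vanishes and only the Wasserstein term remains.
loss_unit = wgan_gp_critic_loss(lambda x: critic(x, [0.6, 0.8], 0.0), real, fake, rng)
# ||w|| = 5, so the penalty lam * (5 - 1)^2 dominates the loss.
loss_steep = wgan_gp_critic_loss(lambda x: critic(x, [3.0, 4.0], 0.0), real, fake, rng)
print(loss_unit, loss_steep)
```

The penalty is evaluated at random interpolates between real and generated samples and pushes the critic's gradient norm toward 1, which is the mechanism that stabilizes WGAN training.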

Our network architecture is shown in Fig 2.


Architecture Details

The layerwise architecture details of the generator and discriminator are shown in Table 1 and Table 2 respectively. A ResNet block here consists of the following layers: a convolutional layer, a max pooling layer, 30 percent dropout between the layers, and a batch normalization layer.

Algorithm

The network in this work is trained using Stochastic Gradient Descent (SGD).

Experiments

All the generated samples are generator outputs from random latent vectors. We normalize all data into the range [-1, 1] and use two evaluation metrics to measure the performance of our network. The first measures the distribution distance between the real and generated samples using maximum mean discrepancy (MMD) scores. The second evaluates the generation diversity using the multi-scale structural similarity metric (MS-SSIM). Table 4 compares the MMD and MS-SSIM scores with previous state-of-the-art architectures.
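The normalization step and the MMD score can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which kernel or bandwidth was used, so the Gaussian (RBF) kernel, the `bandwidth=1.0` default, the biased estimator and the three-pixel toy "images" are all choices made here for the example.

```python
import math

def normalize(pixels):
    """Map 8-bit pixel values in [0, 255] to the range [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between sample sets xs and ys."""
    def mean_kernel(a, b):
        return sum(rbf(p, q, bandwidth) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy three-pixel "images" standing in for real and generated samples.
real = [normalize(img) for img in ([0, 255, 128], [10, 240, 120])]
same = [list(v) for v in real]
far = [normalize(img) for img in ([255, 0, 0], [250, 5, 10])]

print(mmd2(real, same))  # identical sample sets give a score of about zero
print(mmd2(real, far))   # dissimilar sample sets give a clearly larger score
```

A lower MMD means the generated samples are distributed more like the real ones, which is how the scores in the comparison tables should be read.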

We noticed that the model with a small latent vector size of 100 suffers from severe mode collapse. The best results are obtained with a moderately large latent vector size. Table 5 compares the effect of different latent variable sizes on the MMD and MS-SSIM scores.

As can be seen, a latent variable size of 1000 produces the best results among those compared. Mode collapse, one of the main challenges faced while training GANs, is seen at both low and high latent variable sizes.

Four evaluation metrics are commonly used in the literature to test the performance of generative models: log-likelihood, reconstruction error, ELBO and KL divergence.

The log-likelihood is calculated by finding the parameters that maximize the log-likelihood of the observed sample. The reconstruction error is the distance between the original data point and its projection onto a lower-dimensional subspace. The optimization problem in our model involves a KL divergence term that is intractable, so we maximize the ELBO instead of minimizing the KL divergence directly. KL divergence measures how far the generated probability distribution is from the true probability distribution, so lower values indicate more similar distributions. The comparison of our model with the original VAE architecture on the MNIST dataset using these evaluation metrics is shown in Table 6.
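To make the relation between the ELBO and the KL divergence concrete, here is a minimal sketch for the standard VAE setting. It assumes a diagonal Gaussian posterior, a standard normal prior and a Gaussian observation model with fixed standard deviation; these are textbook choices, not details confirmed by this work, and the function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_hat, sigma=1.0):
    """log p(x | z) for a Gaussian observation model with fixed std sigma."""
    n = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err / sigma ** 2 - 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)

def elbo(x, x_hat, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)

x = [0.1, 0.2]                # toy observation
x_hat = [0.1, 0.2]            # perfect reconstruction
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # posterior equals prior
print(kl_to_standard_normal([1.0], [0.0]))            # shifted mean is penalized
print(elbo(x, x_hat, [0.0, 0.0], [0.0, 0.0]))
```

The log-likelihood of the data equals the ELBO plus the KL divergence between the approximate and the true posterior. Since that KL term is non-negative and cannot be computed directly, raising the ELBO is the tractable way to tighten the bound.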

We compare our log probability distribution value with those obtained by previous state-of-the-art methods in Table 7. The log probability distribution is an important evaluation metric because it reflects the diversity of the generated samples.

Results

We present the generated images on all 3 datasets used for testing. The network was trained for 1000 iterations. The images generated using the CelebA-HQ dataset are shown in Fig 3.

The images generated using the LSUN BEDROOM dataset are shown in Fig 4.

The images generated from different LSUN categories are shown in Fig 5.

We compare our results with previous state-of-the-art networks on the MNIST dataset in Fig 6.

Conclusions

In this blog, we presented a new training procedure for Variational Autoencoders based on generative models. This makes the inference model much more flexible, allowing it to represent almost any posterior distribution over the latent variables. Our network was trained and tested on 3 publicly available datasets. Evaluated on MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence, our network beats the previous state-of-the-art algorithms. Using generative models to produce additional training data could be revolutionary, especially in fields like medical imaging, where there is a shortage of data for training deep convolutional neural network architectures.