A Hybrid Convolutional Variational Autoencoder for Text Generation (paper reading)

Abstract

In this paper we explore the effect of architectural choices on learning a variational autoencoder (VAE) for text generation. In contrast to the previously introduced VAE model for text, where both the encoder and decoder are RNNs, we propose a novel hybrid architecture that blends fully feed-forward convolutional and deconvolutional components with a recurrent language model. Our architecture exhibits several attractive properties, such as faster run time and convergence and the ability to better handle long sequences; more importantly, it helps to avoid the problem of the VAE collapsing to a deterministic model.

1 Introduction

①Generative models of texts are currently a cornerstone of natural language understanding, enabling recent breakthroughs in machine translation (Bahdanau et al., 2014; Wu et al., 2016), dialogue modelling (Serban et al., 2016), abstractive summarization (Rush et al., 2015), etc.

②Currently, RNN-based generative models hold state-of-the-art results in both unconditional (Jozefowicz et al., 2016; Ha et al., 2016) and conditional (Vinyals et al., 2014) text generation. At a high level, these models represent a class of autoregressive models that generate outputs sequentially, one step at a time, where the next predicted element is conditioned on the history of elements generated thus far.

③Variational autoencoders (VAE), recently introduced by (Kingma and Welling, 2013; Rezende et al., 2014), offer a different approach to generative modeling by integrating stochastic latent variables into the conventional autoencoder architecture. The primary purpose of learning VAE-based generative models is to be able to generate realistic examples as if they were drawn from the input data distribution by simply feeding noise vectors through the decoder. Additionally, the latent representations obtained by applying the encoder to input examples give a fine-grained control over the generation process that is harder to achieve with more conventional autoregressive models. Similar to compelling examples from image generation, where it is possible to condition generated human faces on various attributes such as hair, skin color and style (Yan et al., 2015; Larsen et al., 2015), in text generation it should be possible to also control various attributes of the generated sentences, such as, for example, sentiment or writing style.

④While training VAE-based models seems to pose little difficulty when applied to the tasks of generating natural images (Bachman, 2016; Gulrajani et al., 2016) and speech (Fraccaro et al., 2016), their application to natural text generation requires additional care (Bowman et al., 2016; Miao et al., 2015). As discussed by Bowman et al. (2016), the core difficulty of training VAE models is the collapse of the latent loss (represented by the KL divergence term) to zero. In this case the generator tends to completely ignore latent representations and reduces to a standard language model. This is largely due to the high modeling power of the RNN-based decoders which with sufficiently small history can achieve low reconstruction errors while not relying on the latent vector provided by the encoder.

⑤In this paper, we propose a novel VAE model for texts that is more effective at forcing the decoder to make use of latent vectors. Contrary to existing work, where both encoder and decoder layers are LSTMs, the core of our model is a feed-forward architecture composed of one-dimensional convolutional and deconvolutional (Zeiler et al., 2010) layers. This choice of architecture helps to gain more control over the KL term, which is crucial for training a VAE model. Given the difficulty of generating long sequences in a fully feed-forward manner, we augment our network with an RNN language model layer. To the best of our knowledge, this paper is the first work that successfully applies deconvolutions in the decoder of a latent variable generative model of natural text. We empirically verify that our model is easier to train than its fully recurrent alternative, which, in our experiments, fails to converge on longer texts. To better understand why training VAEs for texts is difficult we carry out detailed experiments, discuss optimization difficulties, and propose effective ways to address them. Finally, we demonstrate that sampling from our model yields realistic texts.

2 Related Work

①So far, the majority of neural generative models of text are built on the autoregressive assumption (Larochelle and Murray, 2011; van den Oord et al., 2016). Such models assume that the current data element can be accurately predicted given sufficient history of elements generated thus far. The conventional RNN-based language models fall into this category and currently dominate the language modeling and generation problem in NLP. Neural architectures based on recurrent (Jozefowicz et al., 2016; Zoph and Le, 2016; Ha et al., 2016) or convolutional decoders (Kalchbrenner et al., 2016; Dauphin et al., 2016) provide an effective solution to this problem.

②A recent work by Bowman et al. (2016) tackles the language generation problem within the VAE framework (Kingma and Welling, 2013; Rezende et al., 2014). The authors demonstrate that with some care it is possible to successfully learn a latent variable generative model of text. Although their model is slightly outperformed by a traditional LSTM (Hochreiter and Schmidhuber, 1997) language model, it achieves a similar effect as in computer vision, where one can (i) sample realistic sentences by feeding randomly generated novel latent vectors through the decoder and (ii) linearly interpolate between two points in the latent space. Miao et al. (2015) apply the VAE to bag-of-words representations of documents and to the answer selection problem, achieving good results on both tasks. Yang et al. (2017) discuss a VAE consisting of an RNN encoder and a CNN decoder whose receptive field is limited, and demonstrate that this allows for better control of the KL and reconstruction terms. Hu et al. (2017) build a VAE for text generation and design a cost function that encourages interpretability of the latent variables. Zhang et al. (2016), Serban et al. (2016) and Zhao et al. (2017) apply the VAE to sequence-to-sequence problems, improving over deterministic alternatives. Chen et al. (2016) propose a hybrid model combining autoregressive convolutional layers with the VAE. The authors make an argument based on bits-back coding (Hinton and van Camp, 1993) that when the decoder is powerful enough, the best thing for the encoder to do is to make the posterior distribution equivalent to the prior. While they experiment on images, this argument is very relevant to textual data. A recent work by Bousquet et al. (2017) approaches VAEs and GANs from the optimal transport point of view. The authors show that the commonly observed blurriness of samples from VAEs trained on image data is a necessary property of the model. While the implications of their argument for models combining latent variables and autoregressive layers trained on non-image data are still unclear, it supports the hypothesis of Chen et al. (2016) that the difficulty of training a hybrid model is not caused by a simple optimization difficulty but may rather be a more principled issue.

③Various techniques have been used so far to improve the training of VAE models, where the total cost represents a trade-off between the reconstruction cost and the KL term: KL-term annealing and input dropout (Bowman et al., 2016; Sønderby et al., 2016), imposing structured sparsity on latent variables (Yeung et al., 2016), and more expressive formulations of the posterior distribution (Rezende and Mohamed, 2015; Kingma et al., 2016). A work by Mescheder et al. (2017) follows the same motivation and combines GANs and VAEs, allowing a model to use arbitrarily complex formulations of both the prior and posterior distributions. In Section 3.4 we propose another efficient technique to control the trade-off between the KL and reconstruction terms.

3 Model

In this section we first briefly explain the VAE framework of Kingma and Welling (2013), then describe our hybrid architecture, in which the feed-forward part is composed of a fully convolutional encoder and a decoder that combines deconvolutional layers and a conventional RNN. Finally, we discuss optimization recipes that help the VAE respect the latent variables, which is critical for training a model with a meaningful latent space and for being able to sample realistic sentences.
[Figure 1: The LSTM VAE model of Bowman et al. (2016).]

3.1 Variational Autoencoder

The VAE is a recently introduced latent variable generative model which combines variational inference with deep learning. It modifies the conventional autoencoder framework in two key ways. Firstly, a deterministic internal representation z (provided by the encoder) of an input x is replaced with a posterior distribution q(z|x). Inputs are then reconstructed by sampling z from this posterior and passing it through a decoder. To make sampling easy, the posterior distribution is usually parametrized by a Gaussian with its mean and variance predicted by the encoder. Secondly, to ensure that we can sample from any point of the latent space and still generate valid and diverse outputs, the posterior q(z|x) is regularized with its KL divergence from a prior distribution p(z). The prior is typically also chosen to be a Gaussian with zero mean and unit variance, such that the KL term between posterior and prior can be computed in closed form (Kingma and Welling, 2013). The total VAE cost is composed of the reconstruction term, i.e., the negative log-likelihood of the data, and the KL regularizer:
$$
J_{vae} = -\,\mathbb{E}_{q(z|x)}\big[\log p(x \mid z)\big] + KL\big(q(z|x)\,\|\,p(z)\big) \tag{1}
$$
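To make the two ingredients concrete, here is a minimal PyTorch sketch (an illustration for this write-up, not code from the paper) of the reparameterized sampling of z and the closed-form Gaussian KL term from Eq (1):

```python
import torch

def sample_z(mu, logvar):
    # reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow through the predicted mean and log-variance
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_term(mu, logvar):
    # closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent
    # dimensions and averaged over the batch
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()
```

Predicting the log-variance rather than the variance keeps the standard deviation positive without an explicit constraint, which is the usual choice for Gaussian posteriors.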

Kingma and Welling (2013) show that the loss function from Eq (1) can be derived from a probabilistic model perspective and is an upper bound on the true negative log-likelihood of the data. One can view a VAE as a traditional autoencoder with some restrictions imposed on the internal representation space. Specifically, using a sample from q(z|x) to reconstruct the input, instead of a deterministic z, forces the model to map an input to a region of the space rather than to a single point. The most straightforward way to achieve a good reconstruction error in this case is to predict a very sharp probability distribution effectively corresponding to a single point in the latent space (Raiko et al., 2014). The additional KL term in Eq (1) prevents this behavior and forces the model to find a solution with, on one hand, a low reconstruction error and, on the other, predicted posterior distributions close to the prior. Thus, the decoder part of the VAE is capable of reconstructing a sensible data sample from every point in the latent space that has non-zero probability under the prior. This allows for straightforward generation of novel samples and linear operations on the latent codes. Bowman et al. (2016) demonstrate that this does not work in the fully deterministic autoencoder framework. In addition to regularizing the latent space, the KL term indicates how much information the VAE stores in the latent vector.

Bowman et al. (2016) propose a VAE model for text generation where both encoder and decoder are LSTM networks (Figure 1). We will refer to this model as LSTM VAE in the remainder of the paper. The authors show that adapting VAEs to text generation is more challenging, as the decoder tends to ignore the latent vector (the KL term is close to zero) and falls back to a language model. Two training tricks are required to mitigate this issue: (i) KL-term annealing, where the KL weight in Eq (1) gradually increases from 0 to 1 during training; and (ii) applying dropout to the inputs of the decoder to limit its expressiveness and thereby force the model to rely more on the latent variables. We will discuss these tricks in more detail in Section 3.4. Next we describe the deconvolutional layer, which is the core element of the decoder in our VAE model.

3.2 Deconvolutional Networks

A deconvolutional layer (also referred to as transposed convolution (Gulrajani et al., 2016) or fractionally strided convolution (Radford et al., 2015)) performs spatial up-sampling of its inputs and is an integral part of latent variable generative models of images (Radford et al., 2015; Gulrajani et al., 2016) and semantic segmentation algorithms (Noh et al., 2015). Its goal is to perform an "inverse" convolution operation and increase the spatial size of the input while decreasing the number of feature maps. This operation can be viewed as a backward pass of a convolutional layer and can be implemented by simply switching the forward and backward passes of the convolution operation. In the context of generative modeling based on global representations, deconvolutions are typically used as follows: the global representation is first linearly mapped to another representation with small spatial resolution and a large number of feature maps. A stack of deconvolutional layers is then applied to this representation, each layer progressively increasing spatial resolution and decreasing the number of feature channels. The output of the last layer is an image or, in our case, a text fragment. A notable example of such a model is the deep network of Radford et al. (2015) trained with an adversarial objective. Our model uses a similar approach but is trained with the VAE objective instead.
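As an illustration of this expansion step, the following sketch (assuming PyTorch; all layer sizes are made up for the example, not taken from the paper) maps a latent vector to a short sequence with many channels, then doubles the length at each stride-2 transposed convolution while shrinking the channel count:

```python
import torch
import torch.nn as nn

class DeconvDecoder(nn.Module):
    """Expands a latent vector into a feature sequence, as described above."""

    def __init__(self, latent_dim=64, seq0=4, channels=(256, 128, 64)):
        super().__init__()
        self.seq0, self.c0 = seq0, channels[0]
        # linear map to a short sequence with many feature maps
        self.fc = nn.Linear(latent_dim, seq0 * channels[0])
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # each stride-2 transposed convolution (kernel 4, padding 1)
            # doubles the length and reduces the channel count
            layers += [nn.ConvTranspose1d(c_in, c_out, 4, stride=2, padding=1),
                       nn.ReLU()]
        self.deconv = nn.Sequential(*layers)

    def forward(self, z):
        h = self.fc(z).view(z.size(0), self.c0, self.seq0)  # (B, 256, 4)
        return self.deconv(h)                               # (B, 64, 16)

print(DeconvDecoder()(torch.randn(2, 64)).shape)  # torch.Size([2, 64, 16])
```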

There are two primary motivations for choosing deconvolutional layers instead of the dominantly used recurrent ones: firstly, such layers have extremely efficient GPU implementations due to their fully parallel structure. Secondly, feed-forward architectures are typically easier to optimize than their recurrent counterparts, as the number of back-propagation steps is constant and potentially much smaller than in RNNs. Both points become significant as the length of the generated text increases. Next, we describe our VAE architecture, which blends deconvolutional and RNN layers in the decoder to allow for better control over the KL term.

3.3 Hybrid Convolutional-Recurrent VAE

Our model is composed of two relatively independent modules. The first component is a standard VAE where the encoder and decoder modules are parametrized by convolutional and deconvolutional layers respectively (see Figure 2(a)). This architecture is attractive for its computational efficiency and simplicity of training.
The other component is a recurrent language model consuming activations from the deconvolutional decoder, concatenated with the previous output characters. We consider two flavors of recurrent functions: a conventional LSTM network (Figure 2(b)) and a stack of masked convolutions, also known as the ByteNet decoder from Kalchbrenner et al. (2016) (Figure 2(c)). The primary reason for having a recurrent component in the decoder is to capture dependencies between elements of the text sequences – a hard task for a fully feed-forward architecture. Indeed, the conditional distribution P(x|z) = P(x1, . . . , xn|z) of generated sentences cannot be richly represented with a feed-forward network. Instead, it factorizes as:
$$
P(x_1, \ldots, x_n \mid z) = \prod_{i=1}^{n} P(x_i \mid z), \tag{2}
$$
where the components are independent of each other and are conditioned only on z. To minimize the reconstruction cost the model is forced to encode every detail of a text fragment. A recurrent language model instead models the full joint distribution of output sequences without having to make independence assumptions:

$$
P(x_1, \ldots, x_n \mid z) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1}, z).
$$
Thus, adding a recurrent layer on top of our fully feed-forward encoder-decoder architecture relieves it from encoding every aspect of a text fragment into the latent vector and allows it to instead focus on more high-level semantic and stylistic features.
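A minimal sketch of this hybrid decoder, under the same assumptions as the previous snippets (PyTorch, hypothetical sizes): at each position the LSTM consumes the deconvolutional activations concatenated with the embedding of the previous character, so the deconvolutional features carry the information from z while the recurrent state captures the history:

```python
import torch
import torch.nn as nn

class HybridRecurrentDecoder(nn.Module):
    """LSTM flavor of the recurrent component (Figure 2(b))."""

    def __init__(self, vocab_size=100, deconv_channels=64, embed_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # input: deconv features for this position + previous character
        self.lstm = nn.LSTM(deconv_channels + embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, deconv_feats, prev_chars):
        # deconv_feats: (B, C, L) activations from the deconvolutional stack
        # prev_chars:   (B, L) target characters shifted right (teacher forcing)
        feats = deconv_feats.transpose(1, 2)                    # (B, L, C)
        x = torch.cat([feats, self.embed(prev_chars)], dim=-1)  # (B, L, C+E)
        h, _ = self.lstm(x)
        return self.out(h)                                      # (B, L, vocab) logits
```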

Note that the feed-forward part of our model is different from the existing fully convolutional approaches of Dauphin et al. (2016) and Kalchbrenner et al. (2016) in two respects: firstly, while being fully parallelizable during training, these models still require predictions from previous time steps during inference and thus behave as a variant of recurrent networks. In contrast, expansion of the z vector is fully parallel in our model (except for the recurrent component). Secondly, our model down- and up-samples a text fragment during processing while the existing fully convolutional decoders do not. Preserving spatial resolution can be beneficial to the overall result, but comes at a higher computational cost. Lastly, we note that our model imposes an upper bound on the size of text samples it is able to generate. While it is possible to model short texts by adding special padding characters at the end of a sample, generating texts longer than certain thresholds is not possible by design. This is not an unavoidable restriction, since the model can be extended to generate variable sized text fragments by, for example, variable sized latent codes. These extensions however are out of scope of this work.

3.4 Optimization Difficulties

The addition of the recurrent component results in optimization difficulties that are similar to those described by Bowman et al. (2016). In most cases the model converges to a solution with a vanishingly small KL term, thus effectively falling back to a conventional language model. Bowman et al. (2016) have proposed to use input dropout and KL term annealing to encourage their model to encode meaningful representations into the z vector. We found that these techniques also help our model to achieve solutions with non-zero KL term.

KL term annealing can be viewed as a gradual transition from a conventional deterministic autoencoder to a full VAE. In this work we use linear annealing from 0 to 1. We have experimented with other schedules but did not find them to have a significant impact on the final result. As long as the KL term weight starts to grow sufficiently slowly, the exact shape and speed of its growth do not seem to affect the overall result. We have found the following heuristic to work well: we first run a model with the KL weight fixed to 0 to find the number of iterations it needs to converge. We then configure the annealing schedule to start after the unregularized model has converged and to last for no less than 20% of that amount.
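In code, this heuristic amounts to a simple piecewise-linear schedule; `start` and `duration` below are hypothetical values standing in for the convergence point of the unregularized run:

```python
def kl_weight(step, start, duration):
    # 0 until `start` (where the KL-weight-0 model converged),
    # then a linear ramp to 1 over `duration` iterations
    if step < start:
        return 0.0
    return min(1.0, (step - start) / duration)

start = 10_000               # e.g. the unregularized model converged here
duration = int(0.2 * start)  # ramp lasts no less than 20% of that amount
```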

While helping to regularize the latent vector, input dropout tends to slow down convergence. We propose an alternative technique to encourage the model to compress information into the latent vector: in addition to the reconstruction cost computed on the outputs of the recurrent language model, we also add an auxiliary reconstruction term computed from the activations of the last deconvolutional layer:
$$
J_{aux} = -\,\mathbb{E}_{q(z|x)}\big[\log p_{aux}(x \mid z)\big],
$$

where p_aux(x|z) denotes the historyless output distribution predicted directly from the activations of the last deconvolutional layer.
Since at this layer the model does not have access to previous output elements, it needs to rely on the z vector to produce a meaningful reconstruction. The final cost minimized by our model is:

$$
J = J_{vae} + \alpha \, J_{aux}, \tag{3}
$$
where α is a hyperparameter, J_aux is the intermediate reconstruction term and J_vae is the bound from Eq (1). Expanding the two terms from Eq (3) gives:

$$
J = -\,\mathbb{E}_{q(z|x)}\big[\log p(x \mid z)\big] \;-\; \alpha\,\mathbb{E}_{q(z|x)}\big[\log p_{aux}(x \mid z)\big] \;+\; KL\big(q(z|x)\,\|\,p(z)\big). \tag{4}
$$
The objective function from Eq (4) puts a mild constraint on the latent vector to produce features useful for historyless reconstruction. Since the autoregressive part reuses these features, it also improves the main reconstruction term. We are thus able to encode information in the latent vector without hurting expressiveness of the decoder.
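Putting the pieces together, a minimal sketch of the total cost from Eqs (3) and (4) might look as follows (PyTorch assumed; the logit tensors are hypothetical outputs of the recurrent layer and of a softmax head placed on the last deconvolutional layer, and `kl_w` is the annealing weight from the schedule above):

```python
import torch.nn.functional as F

def total_loss(lm_logits, aux_logits, targets, mu, logvar, alpha, kl_w):
    # lm_logits, aux_logits: (B, L, vocab); targets: (B, L) character ids
    rec = F.cross_entropy(lm_logits.flatten(0, 1), targets.flatten())   # main term
    aux = F.cross_entropy(aux_logits.flatten(0, 1), targets.flatten())  # J_aux, historyless
    kl = (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()
    return rec + alpha * aux + kl_w * kl  # J = J_vae + alpha * J_aux, with annealed KL
```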

One can view the objective function in Eq (4) as a joint objective for two VAEs that partially share parameters: one purely feed-forward, as in Figure 2(a), and the other combining feed-forward and recurrent parts, as in Figures 2(b) and 2(c). Since the feed-forward VAE is incapable of producing reasonable reconstructions without making use of the latent vector, the full architecture also gains access to the latent vector through the shared parameters. We note that this trick comes at the cost of a worse result on the density estimation task, since part of the parameters of the full model are trained to optimize an objective that does not capture all the dependencies that exist in the textual data. However, the gap between a purely deterministic LM and our model is small and easily controllable by the α hyperparameter. We refer the reader to Figure 4 for quantitative results regarding the effect of α on the performance of our model on the LM task.

4 Experiments

5 Conclusions

We have introduced a novel generative model of natural texts based on the VAE framework. Its core components are a convolutional encoder and a deconvolutional decoder combined with a recurrent layer. We have shown that the feed-forward part of our model architecture makes it easier to train a VAE and to avoid the problem of the KL term collapsing to zero, where the decoder falls back to a standard language model, thus inhibiting the sampling ability of the VAE. Additionally, we propose an efficient way to encourage the model to rely on the latent vector by introducing an additional cost term in the training objective. We observe that it works well on long sequences, which is hard to achieve with purely RNN-based VAEs using previously proposed tricks such as KL-term annealing and input dropout. Finally, we have extensively evaluated the trade-off between the KL term and the reconstruction loss. In particular, we investigated the effect of the receptive field size on the ability of the model to respect the latent vector, which is crucial for being able to generate realistic and diverse samples. In future work we plan to apply our VAE model to semi-supervised NLP tasks and to experiment with conditioning generation on text attributes such as sentiment and writing style.

