Paper Reading: Sequence Generation by Editing Prototype

Lcyztf

Article link: https://blog.csdn.net/Lcyztf/article/details/82256403

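I. Response Generation by Context-aware Prototype Editing

The method is a retrieval -> edit vector -> conditional generation pipeline. The goal is to address the safe-response problem and make generated replies more informative and engaging. The intuition is to compare the difference between the two contexts (c vs. c') and rewrite the prototype response accordingly. Two points to note: ① retrieval and generation are not trained end-to-end. ② The editing process is semi-explicit: the model explicitly creates an edit vector based on c-c' lexical similarity, and then uses this edit vector to implicitly guide the generation process.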

1. Model Details

(1) Prototype Selector

Essentially a pre-retrieval model.

① At test time, for each case in the test set, candidates are selected from the training set. A c-c matching index (built with PyLucene) is enough.

② At training time, for each case {c, r} in the training set, a pair {{c, r}, {c', r'}} is constructed. Specifically, an r-r matching index (built with PyLucene) first selects the top 30, and then Jaccard similarity (intersection over union of the two bags of words, a lexical similarity) is used as a filter, keeping only prototypes with a score in [0.3, 0.7]. The intuition: above 0.7, the generator would learn to simply "copy" the prototype; below 0.3, the problem is that "A neural editor model performs well only if a prototype is lexically similar (Guu et al., 2017) to its ground-truth." As for why c-c matching is not used here: because similar contexts may correspond to totally different responses, the so-called one-to-many phenomenon in dialogue generation, which impedes editor training due to the large lexicon gap.
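The Jaccard filter described in ② can be sketched as follows. This is a minimal illustration, not the authors' code: whitespace tokenization, the helper names `jaccard` and `filter_prototypes`, and the toy candidates are assumptions; in the paper the candidates would be the top-30 results of the Lucene r-r index.

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity of two bags of words: |A & B| / |A | B|."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_prototypes(response, candidates, low=0.3, high=0.7):
    """Keep (c', r') pairs whose r' is neither too close to r nor too far from it.

    `candidates` stands in for the top-30 retrieved pairs (hypothetical input).
    """
    target = response.split()  # assumption: whitespace tokenization
    kept = []
    for cand_context, cand_response in candidates:
        score = jaccard(target, cand_response.split())
        if low <= score <= high:
            kept.append((cand_context, cand_response, score))
    return kept


candidates = [
    ("ctx a", "i am fine thanks"),     # identical to r: score 1.0, dropped
    ("ctx b", "i am fine thank you"),  # score 3/6 = 0.5, kept
    ("ctx c", "no idea"),              # score 0.0, dropped
]
kept = filter_prototypes("i am fine thanks", candidates)
```

Only the middle candidate survives, which is the intended behavior: near-duplicates encourage copying and unrelated responses give the editor nothing to work with.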

(2) Context-Aware Neural Editor

① Edit Vector Generation (can be viewed as the explicit part)

1> Encode the prototype response with a BiRNN, giving [hf, hb]

2> Compute the context diff-vector

It is essentially the concatenation of an insertion vector and a deletion vector. Taking the insertion vector as an example: it is a weighted sum over the embeddings of the words in the insertion word set. The weights are computed attention-style: a concat + MLP scoring function, normalized over the insertion set.

Recap on attention: this paper's attention combines three or four variants. (Note: the first three are Luong's; the last one is Bahdanau's.)

We compute a context diff-vector diffc by an attention mechanism defined as follows:

3> Compute the edit vector

Then we compute the edit vector z by following transformation:

Equation 8 can be regarded as a mapping from context differences to response differences.
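The edit-vector construction above (attention-weighted insertion and deletion vectors, concatenated and then transformed) can be sketched in NumPy. This is a sketch under stated assumptions, not the paper's exact equations, which are not reproduced in these notes: the query vector `h` is assumed to be the final encoder state of the prototype response, the scoring function is assumed to be `v^T tanh(W [emb ; h])`, the final transformation is assumed to be a single `tanh` layer, and all parameter names and sizes are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - np.max(x)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()


def attend(word_embs, h, W, v):
    """Weighted sum of word embeddings; score_i = v^T tanh(W [emb_i ; h])."""
    n, d_emb = word_embs.shape
    if n == 0:  # empty insertion/deletion set contributes a zero vector
        return np.zeros(d_emb)
    feats = np.concatenate([word_embs, np.tile(h, (n, 1))], axis=1)
    scores = np.tanh(feats @ W.T) @ v   # one score per word
    return softmax(scores) @ word_embs  # weights normalized over the set


def edit_vector(ins_embs, del_embs, h, p):
    ins_vec = attend(ins_embs, h, p["W_ins"], p["v_ins"])
    del_vec = attend(del_embs, h, p["W_del"], p["v_del"])
    diff_c = np.concatenate([ins_vec, del_vec])  # context diff-vector
    return np.tanh(p["W_z"] @ diff_c + p["b_z"])  # assumed form of the mapping


rng = np.random.default_rng(0)
d_emb, d_h, d_att, d_z = 4, 6, 5, 3  # toy sizes, not the paper's
params = {
    "W_ins": rng.normal(size=(d_att, d_emb + d_h)), "v_ins": rng.normal(size=d_att),
    "W_del": rng.normal(size=(d_att, d_emb + d_h)), "v_del": rng.normal(size=d_att),
    "W_z": rng.normal(size=(d_z, 2 * d_emb)), "b_z": np.zeros(d_z),
}
ins_embs = rng.normal(size=(2, d_emb))  # embeddings of words only in c
del_embs = rng.normal(size=(3, d_emb))  # embeddings of words only in c'
h = rng.normal(size=d_h)
z = edit_vector(ins_embs, del_embs, h, params)
```

Which words count as insertions versus deletions (here: words only in the current context versus words only in the prototype context) is also an assumption made for the illustration.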

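(3) Prototype Editing (seq2seq augmented with edit vector z)

The difference from a plain seq2seq is that the word embedding and z are concatenated to form the input; otherwise it is no different from S2SA (seq2seq with attention).

Also, this seq2seq generates r from the prototype response r'! In essence it can be understood as "rewriting".

In the whole model only this seq2seq is trainable, and it is trained directly with MLE.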

2. Experiment

(1) Settings

train / validation / test: 20M, 10K, 10K

The average length of contexts and responses are 11.64 and 12.33 respectively.

batch_size 128

adam 1e-3, reduce it by half if perplexity on validation begins to increase. We will stop training if the perplexity on validation keeps increasing in two successive epochs.

word_embedding = edit vector = 512

1-layer GRU with hidden 1024

vocabulary 30K

beam_size 20

We remove $UNK$ from the target vocabulary, because it always causes fluency issue in evaluation. This one is quite interesting.
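The learning-rate rule in the settings above (halve on a rise in validation perplexity, stop after two successive rises) can be replayed over a list of per-epoch perplexities. A minimal sketch: the function name `schedule` is hypothetical, and "begins to increase" is interpreted here as "higher than the previous epoch", which is an assumption.

```python
def schedule(val_ppls, lr=1e-3):
    """Replay the halving / early-stopping rule; return (epoch, lr) per epoch run."""
    history = []
    rises = 0    # number of successive epochs with rising validation perplexity
    prev = None
    for epoch, ppl in enumerate(val_ppls, start=1):
        if prev is not None and ppl > prev:
            rises += 1
            lr *= 0.5  # reduce the learning rate by half
        else:
            rises = 0
        history.append((epoch, lr))
        if rises >= 2:  # perplexity kept increasing in two successive epochs
            break
        prev = ppl
    return history


history = schedule([50, 40, 42, 41, 43, 44, 30])
```

With the toy perplexities above, training stops after epoch 6 and the seventh value is never used.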

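(2) Models

Retrieval-default: the top-1 result returned directly by Lucene. Its edited counterpart is Edit-default.

Retrieval-Rerank: Lucene returns the top 20, and a trained dual-LSTM then scores and reranks these 20. Its edited counterpart is Edit-1-Rerank.

Edit-N-Rerank: generally the best-performing variant. All 20 prototypes returned by Lucene are edited, and the dual-LSTM then reranks the edited results.

(3) Evaluation Metrics

BLEU is not used (probably the scores could not compete = =); instead the paper uses relevance, distinct, and one self-defined metric.

① relevance: We employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) (Liu et al., 2016) to evaluate response relevance.

② diversity: We evaluate the response diversity based on the ratios of distinct unigrams and bigrams in generated responses. This is usually read as a proxy for rich meaning and high information content, i.e. informative and engaging. Higher is better, and retrieval models generally score considerably higher than generative ones.

③ originality: that is defined as the ratio of generated responses that do not appear in the training set (exactly the same). This metric is meant to show that the safe-response problem has been addressed.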
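The diversity and originality metrics in ② and ③ are simple enough to sketch directly. This is an illustrative version assuming whitespace tokenization and corpus-level counting of n-grams; the function names are hypothetical and the official evaluation script may differ in details.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over the generated responses."""
    total = 0
    unique = set()
    for resp in responses:
        tokens = resp.split()  # assumption: whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


def originality(responses, training_responses):
    """Ratio of generated responses that do not appear verbatim in the training set."""
    train = set(training_responses)
    if not responses:
        return 0.0
    return sum(r not in train for r in responses) / len(responses)


generated = ["i do not know", "i do not care", "see you soon"]
d1 = distinct_n(generated, 1)  # 8 distinct unigrams out of 11
d2 = distinct_n(generated, 2)  # 6 distinct bigrams out of 8
orig = originality(generated, ["i do not know"])  # 2 of 3 are unseen
```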

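(4) Some Results

The paper analyzes at length how each metric behaves and why; only a few interesting points are recorded here:

The model can perform deletion, replacement, insertion and similar operations; in 30% of cases it rewrites the prototype completely.

The matching model used for reranking works very well for topic relevance.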

II. Retrieve and Refine: Improved Sequence Generation Models For Dialogue

Although this one comes from FB, it honestly feels rather clumsy... not as elegant as Wu Yu's work.

The motivation is the same: Tuning informative and engaging retrieval responses to the specific context. The model is remarkably straightforward: they take a standard generative model and concatenate the output of a retrieval model to its usual input, and then generate as usual. It is an implicit approach and does not use a latent variable such as an edit vector.

One statement here is spot on:

In MT and summarization, generation is suitably constrained by the source sentence.

Dialogue is different: the context still allows many interpretations. In other words, in dialogue the source constrains the target relatively weakly.

1. Model

RetrieveNRefine, or RetNRef for short.

The input first goes through a retriever; the input and the retrieved result are then concatenated into a new input (with a special separator token in between), and a seq2seq generates the refined utterance.

Seq2Seq model: a 2-layer LSTM with attention.

Retriever: Key-Value Memory Network, which attends over the dialogue history, to learn input and candidate retrieval embeddings that match using cosine similarity.

When building the training set: precompute the retrieval result for every dialogue turn in the training set.

Instead of using the top ranking results, they rerank the top 100 predictions of each by their similarity to the label (in embedding space). This should help avoid the problem of the refinement being too far away from the original retrieval. (I do not understand this part.)

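Use Retriever More:

The authors found in their experiments that with simple concatenation the S2SA does not pay enough attention to the retrieval response (my impression is that they judged this purely from the unsatisfactory outputs rather than by inspecting the attention weights). So... the authors chose to... truncate the context history... They also mention the alternative of treating the two inputs as independent sources to do attention over.

Fix Retrieval Copy Errors

Our model learns to sometimes ignore the retrieval (when it is bad), sometimes use it partially, and other times simply copy it. However, when it is mostly copied but only changes a word or two, we observed it made mistakes more often than not, leading to less meaningful utterances. Put simply, when the model copies most of the retrieval and changes only a few words, the edit tends to make things worse. So the authors propose: if the generated response overlaps with the retrieved response by more than 60%, output the retrieved response directly; otherwise leave the generation unchanged.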
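The copy-error fix above is a simple post-processing rule and can be sketched as follows. The notes only say "overlap > 60%"; measuring overlap as the fraction of generated tokens that also occur in the retrieved response is an assumption, as are the function names and whitespace tokenization.

```python
def word_overlap(generated, retrieved):
    """Fraction of generated tokens that also occur in the retrieved response."""
    gen = generated.split()
    if not gen:
        return 0.0
    ret = set(retrieved.split())
    return sum(tok in ret for tok in gen) / len(gen)


def fix_copy_errors(generated, retrieved, threshold=0.6):
    """If the generation is mostly a copy, output the retrieval verbatim."""
    if word_overlap(generated, retrieved) > threshold:
        return retrieved
    return generated  # otherwise keep the generated utterance unchanged


retrieved = "i love hiking in the mountains"
near_copy = fix_copy_errors("i love hiking in the mountain", retrieved)
unrelated = fix_copy_errors("what do you do for work", retrieved)
```

Here the near-copy (5 of 6 tokens shared) is replaced by the retrieved response, while the unrelated generation is left alone.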

III. Exemplar Encoder-Decoder for Neural Conversation Generation (ACL 18)

No edit vector is used; instead, the retrieved responses are also modeled with an RNN and pass their information on implicitly.

1. Model:

(1) Retrieval of Similar Context-Response Pairs (this part is replaceable; Lucene alone can do it)

The context-response pairs in the training set are converted to tf-idf vectors. Then: Given an input context c, we construct a query that weighs the last utterance in the context twice as much as the rest of the context and use it to retrieve the top-k similar context-response pairs from the index based on a BM25 (Robertson et al., 2009) retrieval model.

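(2) Exemplar Encoder Network

There are two kinds of encoder: the same context encoder processes both the current context and the exemplar contexts, and a response encoder processes all exemplar responses (see the model figure in the paper). What is passed to the decoder is K vectors e(k), each the concatenation of the current context embedding and the k-th exemplar response embedding.

In short, the exemplar responses provide information to the decoder, while the exemplar contexts are used to compute the similarity scores.

(3) Exemplar Decoder Network

First compute the similarity score between every exemplar and the current context: take the dot product between the current context embedding and each exemplar context embedding, then normalize with a softmax (i.e. the simplest form of attention in Luong's taxonomy).

Q: The essence here is computing a matching score, so why not use a separate, somewhat more sophisticated retrieval-style model for it? A pretrained one, possibly fine-tuned, would do. Also, does such an implicit matching structure really help, without adding an explicit loss term at the end?

A: If suitable training data were available, that would of course be better. But the K exemplars here are all selected by tf-idf and are already fairly similar results, so there is no such data to pretrain a good retrieval model. It is better to treat the scores as attention and learn the matching implicitly; the result will not be bad. (In fact, training a retrieval model with negative sampling and using it for initialization should bring a slight improvement.)

The objective function is then: a weighted sum of the likelihoods of generating the response conditioned on each exemplar.

At test (inference) time, only the exemplar whose context has the highest score is selected and fed to the decoder.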
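The similarity scores, the weighted-likelihood objective and the test-time selection described above can be sketched in NumPy. This is a sketch under assumptions: the per-exemplar log-likelihoods `log_liks` stand in for the decoder output, taking the log of the weighted sum as the training loss is my reading of "weighted sum of the likelihood", and all function names are hypothetical.

```python
import numpy as np


def exemplar_weights(c_e, exemplar_ctx):
    """Softmax over dot products between the context and each exemplar context."""
    scores = exemplar_ctx @ c_e
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()


def training_loss(weights, log_liks):
    """-log sum_k w_k * p(r | e_k), computed in log space for stability."""
    a = np.log(weights) + log_liks
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum()))


def pick_exemplar(weights):
    """At inference, keep only the exemplar with the highest similarity score."""
    return int(np.argmax(weights))


c_e = np.array([1.0, 0.0])                                     # current context embedding
exemplar_ctx = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # K = 3 exemplar contexts
w = exemplar_weights(c_e, exemplar_ctx)
best = pick_exemplar(w)
loss = training_loss(np.array([0.5, 0.5]), np.log(np.array([0.2, 0.4])))
```

In the last line the weighted likelihood is 0.5 * 0.2 + 0.5 * 0.4 = 0.3, so the loss equals -log(0.3).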

(4) The Encoders and Decoder in Details

Detailed model structure:

① context encoder (multi-turn): HRED

utterance encoder: LSTM, last hidden state is referred to as utterance embedding

context encoder: LSTM, last hidden state is referred to as context embedding

② response encoder: LSTM, last hidden state is referred to as response embedding

2. Experimental Setup

Ubuntu 2.0 is used; the corresponding hyperparameters are as follows

① initialization:

We initialize the word embedding matrix as well as the weights of context and response encoders from the standard normal distribution with mean 0 and variance 0.01. The biases of the encoders and decoder are initialized with 0.

② The word embedding matrix is shared by the context and response encoders.

③ embedding_size and hidden_size:

We use a word embedding size of 600, whereas the size of the hidden layers of the LSTMs in context and response encoders and the decoder is fixed at 1200.

④ K=5 for Ubuntu

⑤ Adam with initial lr = 1e-4, early stopping

⑥ batch_size = 20

⑦ beam_size = 5

3. Evaluation

① For Ubuntu, the metrics are the Ubuntu-specific ones proposed by Serban et al. in "Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation", computed with the script below

https://github.com/julianser/Ubuntu-MultiresolutionTools/blob/master/ActEntRepresentation/evalfile.sh

② The paper also elaborates a set of standard Embedding Metrics, used as a replacement for the word-overlap-based BLEU

We use pre-trained Google news word embeddings similar to Serban et al. (2017b), for easy reproducibility as these metrics are sensitive to the word embeddings used.

GoogleNews-vectors-negative300.bin from https://code.google.com/archive/p/word2vec/

1> Average: Average word embedding vectors are computed for the candidate response and ground truth. The cosine similarity is computed between these averaged embeddings. High similarity gives an indication that ground truth and predicted response have similar words. That is, compute the cosine similarity between the averaged embedding vectors; this metric is the most reflective of performance.

2> Greedy: Greedy matching score finds the most similar word in predicted response to ground truth response using cosine similarity. Intuitively, each word is matched to its most similar word in the other sentence, and these best-match similarities are aggregated to approximate the similarity of the two sentences.

3> Extrema: Vector extrema score computes the maximum or minimum value of each dimension of word vectors in candidate response and ground truth.

Of these, the embedding average metric is the most reflective of performance for our setup. The extrema representation, for instance, is very sensitive to text length and becomes ineffective beyond single length sentences (Forgues et al., 2014). That is, for each dimension, the maximum (or minimum) value over the word vectors in a sentence is taken as the value of that dimension of the sentence vector, and cosine similarity is then computed between the sentence vectors.

https://github.com/julianser/hed-dlg-truncated/blob/master/Evaluation/embedding_metrics.py
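The three embedding metrics described above can be sketched in NumPy with a toy embedding table. This is an illustrative version of the commonly used definitions, not a copy of the linked script: the greedy score is averaged over both directions, the extrema vector keeps per dimension whichever of max/min has the larger magnitude, out-of-vocabulary words are skipped, and the tiny 2-dimensional `emb` table is made up for the example (the paper uses the 300-dimensional Google News vectors).

```python
import numpy as np


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def embed(tokens, emb):
    """Stack the vectors of in-vocabulary tokens (OOV words are skipped)."""
    return np.array([emb[t] for t in tokens if t in emb])


def average_score(hyp, ref, emb):
    H, R = embed(hyp, emb), embed(ref, emb)
    if len(H) == 0 or len(R) == 0:
        return 0.0
    return cosine(H.mean(axis=0), R.mean(axis=0))


def extrema_vector(X):
    """Per dimension, keep the min if its magnitude exceeds the max, else the max."""
    mx, mn = X.max(axis=0), X.min(axis=0)
    return np.where(np.abs(mn) > mx, mn, mx)


def extrema_score(hyp, ref, emb):
    H, R = embed(hyp, emb), embed(ref, emb)
    if len(H) == 0 or len(R) == 0:
        return 0.0
    return cosine(extrema_vector(H), extrema_vector(R))


def greedy_one_way(A, B):
    """For each word in A, the best cosine with any word in B, averaged over A."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)  # assumes non-zero vectors
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((An @ Bn.T).max(axis=1).mean())


def greedy_score(hyp, ref, emb):
    H, R = embed(hyp, emb), embed(ref, emb)
    if len(H) == 0 or len(R) == 0:
        return 0.0
    return 0.5 * (greedy_one_way(H, R) + greedy_one_way(R, H))


emb = {  # toy embeddings, hypothetical values
    "hello": np.array([1.0, 0.0]),
    "world": np.array([0.0, 1.0]),
    "hi": np.array([0.9, 0.1]),
}
same = ["hello", "world"]
avg_same = average_score(same, same, emb)
greedy_same = greedy_score(same, same, emb)
extrema_same = extrema_score(same, same, emb)
avg_diff = average_score(["hi"], ["world"], emb)
```

Identical sentences score (numerically) 1 on all three metrics, while "hi" versus "world" scores low on the average metric because their toy vectors are nearly orthogonal.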

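IV. Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization (ACL 18)

A concise method, elaborate experiments, released code: a pleasant piece of work.

The idea is the same as in the previous papers and matches the title: first retrieve candidate sentences with Lucene, then rerank them with a matching model, and finally use the sentence representation of the reranked top-1 to implicitly enhance generation.

1. Model

(1) Retrieve: Lucene with default settings selects 30 candidates.

(2) Rerank: a bilinear network does the matching: Bilinear outperforms multi-layer forward neural networks in relevance measurement.

First, a shared encoder produces the sentence representations.

Then bilinear matching produces the matching scores, and the candidate with the highest score is selected.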
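The rerank step above (shared-encoder representations followed by bilinear matching) can be sketched in NumPy. This is a minimal illustration: the sentence representations are taken as given vectors, squashing the bilinear form with a sigmoid is an assumption on my part, and the names `bilinear_scores`, `W` and `b` are hypothetical.

```python
import numpy as np


def bilinear_scores(h_x, H_r, W, b=0.0):
    """Score every candidate template r against the source x: sigmoid(h_r^T W h_x + b)."""
    logits = H_r @ W @ h_x + b           # one bilinear form per candidate row
    return 1.0 / (1.0 + np.exp(-logits))  # assumed squashing to (0, 1)


h_x = np.array([1.0, 0.0])  # source sentence representation
H_r = np.array([             # three candidate template representations
    [1.0, 0.0],
    [-1.0, 0.0],
    [0.0, 1.0],
])
W = np.eye(2)                # toy bilinear matrix
scores = bilinear_scores(h_x, H_r, W)
best = int(np.argmax(scores))  # index of the template passed on to rewriting
```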

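(3) Rewrite

This is the part I did not understand... what exactly does this concatenation mean?

2. Training objective: rerank + rewrite

(1) Rerank: We expect the predicted saliency s(r,x) close to the actual saliency s∗(r,y∗). The aim is to update the bilinear matching network more directly. Since s(r,x) is used to score all candidate templates, the scores should be accurate, i.e. good templates should get high scores. A good template is one that resembles the gold summary, i.e. one whose s∗(r,y∗) with the gold summary is high. So in essence the objective pushes the predicted saliency toward the actual saliency.

Notation: x is the source sentence, r is the soft template, y* is the gold summary.

(2) Rewrite: standard MLE

(3) Training: the paper mentions two training schemes. One is to train rerank and rewrite in pipeline (pretrain rerank first, then train rewrite separately). The other is to train them jointly as a whole, adding the two losses with weights 1:1. Experiments show that the pipeline ends up with lower ppl, while joint training gives better ROUGE.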

3. Experiment Setup

OpenNMT is used, mostly with default settings.

word_embedding = hidden_size = 500

2 layer Bi-LSTM encoder

Add the argument “-share_embeddings” to share the word embeddings between the encoder and decoder. This practice largely reduces model parameters for the monolingual task.

During training:

batch size: 64

dropout probability p = 0.3 for the RNN layers (2 layers)

learning rate decay: decay by 50% if the generation loss does not decrease on the validation set.

During testing:

beam_size = 5

add the argument “-replace_unk”

Since the generated summaries are often shorter than the actual ones, we introduce an additional length penalty argument “alpha 1” to encourage longer generation.

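4. Experiment Results and Discussions

(1) On ppl, the pipeline outperforms Re3Sum, probably because ppl is essentially exp(cross entropy), so the model that directly minimizes CE naturally gets the lowest ppl.

(2) The comparison with plain OpenNMT shows that the soft template is quite effective.

(3) Directly comparing the ROUGE of the templates shows that reranking helps (ROUGE-1 improves by 4 points), but the reranked result is still 10 ROUGE-1 points below the max result, which shows that the rerank matching model is still far from perfect.

(4) With beam search decoding, the top-n outputs of the original seq2seq model look nearly identical, whereas Re3Sum can produce very diverse results when given dissimilar templates. This approach can therefore also be used to improve diversity.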
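The remark in (1) that ppl is exp(cross entropy) can be checked with a tiny worked example. A minimal sketch using natural-log token probabilities; the function name `perplexity` and the toy probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)


# A model that assigns probability 0.25 to every gold token has perplexity 4.
uniform = perplexity([math.log(0.25)] * 4)
# Raising the probability of the gold tokens lowers both CE and perplexity.
better = perplexity([math.log(0.5)] * 4)
```

Because exp is monotonic, whichever model reaches the lowest cross entropy on the validation data also has the lowest perplexity, which is the point made in (1).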

V. An Ensemble of Retrieval-Based and Generation-Based Human-Computer Conversation Systems

Multiple retrieved results are used implicitly, through a multi-seq2seq, to help generation; the retrieved and generated results then go through a reranker, and the highest-scoring one is output.

1. Model

(1) Retrieval Module

A state-of-the-practice information retrieval system (Lucene) selects k candidate replies r by q-q matching.

(2) Generation Module

Multi-seq2seq model, which takes the original query q and k retrieved candidate replies r*_1, r*_2, ..., r*_k as input, and generates a new reply r+.

A very interesting finding: In open-domain conversation systems, if the query does not carry sufficient information, seq2seq tends to generate short and meaningless sentences.

The model uses k+1 Bi-GRU encoders and a standard decoder; attention is applied to the query and to all k candidate replies, and the copy mechanism covers all k candidates.

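① decoder initial state:

The last hidden states of the Bi-GRU encoders are concatenated as [q, r1, r2, ..., rk] and passed through a linear transformation to give the decoder initial state. (Is that the so-called sentence-level attention...?)

② Attention Mechanism

Put simply, the matching part uses a bilinear function, and attention is computed separately and independently for each source sentence; the resulting context vectors are then summed directly. (q and the candidate r presumably use encoders with different parameters, so does summing them really work? Does ELMo have a similar issue? And how is the scale problem handled with a direct sum?)

③ Copy Mechanism -> implicit keyword extraction

The final probability is the sum of the normal RNN output probability and the word-selection probabilities from the k candidates.

The computation of the candidate word-selection probability is analyzed below:

p_{r*_m} reflects the matching degree between the current state vector st and the corresponding states of yt in encoders.

It is the output of a simple bilinear matching followed by a sigmoid function.

(Q: does the probability not need to be normalized here?)

It's worth noting that:

1> If yt has not appeared in a retrieved reply r*_m, the corresponding probability p_{r*_m} would be zero. This copy mechanism is similar to Get to the Point: it can be seen as k+1 probability distributions stacked along the word axis, except that no probability normalization is performed...

2> We do not copy words from query as queries and replies are not sharing the same vocabulary and word embeddings. So even in a monolingual dialogue system the authors use two different embeddings for q and r. Even the vocabularies are different...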

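(3) Re-ranker

The highest-scoring reply among the k+1 retrieved and generated replies is chosen as the final response. It is essentially a matching model, but here the authors use a... Gradient Boosting Decision Tree that does classification on high-level manual features.

(4) Model training

Retrieval simply uses Lucene and needs no training.

The multi-seq2seq is trained with MLE; each training case contains <q, r> and the corresponding k candidates.

Negative examples for the reranker come from negative sampling.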

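2. Experiment Analysis

Judging from the results, both ensembles (multi-seq2seq, rerank) bring significant improvement (although measured in BLEU...).

The authors very considerately (carefully) report the proportion of retrieval outputs in the two ensemble models. After all, if the seq2seq is bad enough, the system degenerates into plain retrieval, and BLEU would certainly go up (by quite a lot).

It would be even better if the authors had studied how the choice of k affects the results. Here k=2, yet the BLEU of retrieval-2 is already quite poor; how much would performance degrade if only the top-1 were used?

Finally, retrieval is done with Lucene; judging from the model structure and the template ROUGE analysis of Re3Sum (Ziqiang Cao, ACL 18), adding a reranker here would probably help (maybe).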

VI. Skeleton-to-Response: Dialogue Generation Guided by Retrieval Memory

VII. Generating Natural Answers by Incorporating Copying and Retrieving Mechanisms in Sequence-to-Sequence Learning (ACL 17, Institute of Automation, CAS)

A QA paper that leans toward dialogue, to be continued...

The semantic units (words, phrases and entities) in a natural answer are dynamically predicted from the vocabulary, copied from the given question and/or retrieved from the corresponding knowledge base jointly.

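VIII. Generating Sentences by Editing Prototypes (2017, Guu, Liang)

to be continued...

Unconditional sequence generation. A rather elegant and smooth approach, semi-explicit.

The first paper above (Wu Yu's) borrows the edit vector construction from this paper. However, after building the vector from insertions/deletions, this paper also does some mathematical work, which is worth analyzing.

If the edit vector is treated as a latent variable that follows some distribution, then the same edit vector should correspond to the same edit operation, and the rewrite it applies to a sentence should be a small, controllable change.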

1 Introduction

Seq2seq models, i.e. NLMs, all generate from scratch, in a left-to-right manner. Naive strategies to increase diversity have been shown to compromise grammaticality, suggesting that current NLMs may lack the inductive bias to faithfully represent the full diversity of complex utterances.

The model in one sentence: It first samples a random prototype sentence from the training corpus, and then invokes a neural editor, which draws a random “edit vector” and generates a new sentence by attending to the prototype while conditioning on the edit vector.

2 Problem statement

(1) Prototype selector: randomly sample a prototype sentence x' from the training corpus (uniformly)

(2) Neural editor: First we draw an edit vector z, from an edit distribution p(z), which encodes the type of edit. The sentence x is then generated jointly from the prototype x' and the edit vector z: p(x | x', z).

Our formulation stems from the observation that many sentences in a large corpus can be represented as minor transformations of other sentences. This implies that a neural editor which models lexically similar sentences should be an effective generative model for large parts of the test set.

A secondary goal for the neural editor is to capture certain semantic properties:

① Semantic smoothness: an edit should be able to alter the semantics of a sentence by a small and well-controlled amount, while multiple edits should make it possible to accumulate a larger change. That is, edits should be smooth and controllable; small, smooth modifications are preferred.

② Consistent edit behavior: the edit vector z should model/control the variation in the type of edit that is performed. When we apply the same edit vector on different sentences, the neural editor should perform semantically analogous edits across the sentences. That is, the edit vector should be interpretable and represent an edit operation. In other words, the same edit vector applied to different prototypes should have a similar effect.

3 Approach

(1) training objective:

Maximize the likelihood p(x | x', z). Solving this exactly runs into two problems: ① the sum over all x' in the training set is computationally expensive; ② the integration over the latent edit vector z is intractable. The following approximations are made:

① For the sum over X: only summing over x' that are lexically similar (Jaccard distance) to x. (The set becomes smaller.)

② For the integration: they lower bound the integral over latent edit vectors by modeling z with a variational auto encoder, which admits tractable inference via the evidence lower bound (ELBO), which incidentally also provides additional semantic structure. For VAEs in sequence generation, see the paper Generating Sentences from a Continuous Space.
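The two approximations can be written out as a chain of lower bounds. This is a sketch reconstructed from the description above rather than copied from the paper: the neighborhood notation $\mathcal{N}(x)$ (prototypes lexically similar to $x$), the uniform prototype prior $p(x') = 1/|\mathcal{X}|$ and the variational posterior $q(z \mid x', x)$ are assumptions about notation.

```latex
% Step 1: restrict the sum over prototypes to the lexical neighborhood N(x),
% then apply Jensen's inequality to the average over N(x).
\log p(x)
  = \log \sum_{x' \in \mathcal{X}} p(x \mid x')\, p(x')
  \;\ge\; \log \sum_{x' \in \mathcal{N}(x)} p(x \mid x')\, p(x')
  \;\ge\; \frac{1}{|\mathcal{N}(x)|} \sum_{x' \in \mathcal{N}(x)} \log p(x \mid x')
          + \log \frac{|\mathcal{N}(x)|}{|\mathcal{X}|}

% Step 2: lower bound each term by the ELBO over the latent edit vector z.
\log p(x \mid x')
  = \log \int p(x \mid x', z)\, p(z)\, dz
  \;\ge\; \mathbb{E}_{z \sim q(z \mid x', x)}\big[\log p(x \mid x', z)\big]
          - \mathrm{KL}\big(q(z \mid x', x)\,\|\,p(z)\big)
```

The first bound drops non-negative terms from the sum, so it can only decrease the value; the second is the standard VAE bound, which replaces the intractable integral with an expectation under the approximate posterior.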