blender2ogre
Facebook’s open-sourced chatbot “Blender” is breaking the records previously set by Google’s “Meena”. In this post, we will go over the Poly-Encoder Transformer architecture that forms the crux of Blender.
You can read Part 1 of this series on TDS, where we went over the datasets on which the chatbot is trained.
I will assume the reader has a prior understanding of Attention, Transformers, BERT and Generative Language Models, and march forth.
Introduction:
Before seeing how the Poly-Encoder is used in the context of Blender, we will first understand it independently. The datasets and (fake) training tasks employed in pre-training and fine-tuning Blender (explained in detail in Part 1) should not be confused with the details I am about to explain below. The experimental settings given here are meant to explain, in a generic setting, a specific task called “Multi-Sentence Scoring” and the Encoder architectures trained for that task. Among the Encoder architectures trained for this task, we will then see how the Poly-Encoder is superior.
Task:
Multi-Sentence Scoring does a pairwise comparison between input and output sequences: given an input sequence, we score each label in a set of candidate labels.
From here on, we’ll represent the input-output pair by [INPUT, LABEL].
The goal is to find the best label from among a finite list of candidate labels. The Encoder used is BERT-Base, with 12 Encoder blocks, 12 Attention heads and a hidden size of 768.
Pre-Training:
Two versions of pre-training are done for this task:
- Pre-trained like BERT, on the Toronto Book Corpus and Wikipedia. Here, the [INPUT, LABEL] can be thought of as [Sentence A, Sentence B].
- Pre-trained on the public-domain social media conversations available from Reddit. Here, the [INPUT, LABEL] can be understood as [Context, Next Sentence].
Fake Training Tasks:
The training tasks are the same ones used in the pre-training of BERT.
MLM: Masked Language Model: Here a certain percentage of the input tokens are masked at random (replaced with the [MASK] token). The task then is to learn to predict the masked tokens.
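To make the masking step concrete, here is a minimal sketch of how MLM training examples could be built. The function name `mask_tokens`, the example sentence and the 15% default masking rate are illustrative assumptions (15% is the rate used in the BERT paper; the text above only says "a certain percentage"), and BERT's extra rule of sometimes keeping or randomly replacing a selected token is left out for brevity.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the targets the model has to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # prediction target for this position
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical example sentence
tokens = "the chatbot answers questions about open source software".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The loss would then be computed only at the positions stored in `targets`.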
NSP: Next Sentence Prediction: Here, given 2 sentences A and B, the task is to say whether B follows A (with Negative Sampling). Negative Sampling is implemented by taking a random sentence from the dataset as B, 50% of the time.
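The 50/50 negative sampling described above can be sketched as follows. The helper `make_nsp_pairs` and the toy sentences are hypothetical; the sketch also assumes the sentences are unique and that there are at least three of them, so that a random negative always exists.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (A, B, label) triples: label 1 if B really follows A, else 0."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B is the true next sentence
            pairs.append((a, sentences[i + 1], 1))
        else:
            # Negative example: B is a random sentence that is neither A
            # nor the true next sentence
            choices = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((a, rng.choice(choices), 0))
    return pairs

# Hypothetical toy corpus
sentences = [
    "I adopted a dog.",
    "He loves long walks.",
    "The weather is nice today.",
    "Let us go to the park.",
]
for a, b, label in make_nsp_pairs(sentences, seed=3):
    print(label, "|", a, "->", b)
```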
A little digression here. A trick that I use to remember the nature of these pre-training tasks in BERT is to draw a direct comparison with the fake training tasks used in generating the Word2Vec embeddings, namely: 1) CBOW 2) Skip-Gram. If you recall, in CBOW (Continuous Bag of Words), given a context, the task is to predict the target word, which is similar to the MLM task. In the Skip-Gram model, given the target word, the task is to predict the context. But instead of predicting the context/neighbouring word, we change the dataset and the task becomes: given the target word and another word, predict whether the other word is a neighbour of the target word or not (a binary classification problem). Since the initial dataset was formed only with target words and words in their context, the modified dataset contains only positive examples, so we introduce noise by negative sampling. This is very similar to the NSP task of BERT. (If you think there is any inconsistency in drawing such a comparison between the training tasks of BERT and Word Embeddings, do let me know in the comments. Thanks!)
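As a rough illustration of the modified Skip-Gram dataset described above, the sketch below emits (target, other word, label) triples: positives come from the context window, negatives are sampled from words outside it. The function name, the window size and the uniform sampling of negatives are assumptions made for illustration; Word2Vec itself samples negatives from a frequency-based noise distribution.

```python
import random

def skipgram_examples(tokens, window=1, negatives=1, seed=0):
    """Build (target, word, label) triples with label 1 for true neighbours."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = {tokens[j] for j in range(lo, hi) if j != i}
        # Words that are neither the target nor inside its window
        non_neighbours = [w for w in vocab if w not in context and w != target]
        for word in sorted(context):
            examples.append((target, word, 1))  # positive example
            for _ in range(negatives):
                if non_neighbours:
                    examples.append((target, rng.choice(non_neighbours), 0))
    return examples

tokens = "the quick brown fox jumps".split()
for example in skipgram_examples(tokens):
    print(example)
```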
Fine-Tuning:
The model is fine-tuned separately on the ConvAI2 dataset, which encourages it to learn the “Personality” trait, and on the Ubuntu chat logs, which help it learn “Domain Knowledge/Expertise”.
Architectures:
We will see 3 Encoder architectures to solve the “Multi-Sentence Scoring” task, namely,
- Bi-Encoder
- Cross-Encoder
- Poly-Encoder
The performance of an architecture during inference is measured both by the quality of the prediction and by the prediction speed.
Before proceeding, it is important to remember that this is a Retrieval and NOT Generative task: we only need to retrieve a correct label from a fixed set of candidate labels.
Bi-Encoder:
In Bi-Encoders, Self-Attention is performed over the Input and the Label separately. This is an instance of the more general concept of a Vector Space Model. This architecture has the advantage of being faster during inference, because we can pre-compute and cache the encodings of a large, fixed set of candidate labels. This is possible because the labels are encoded separately and have no dependency on the input context.
- Both the INPUT and LABEL are surrounded by a special token [S]. This is similar to the [CLS] token in BERT, which captures the features of the entire sentence.
- The embedding input to the Encoder is a combination of Token Embeddings + Segment Embeddings + Position Embeddings. The Segment Embedding is generally used to say whether a token belongs to Sentence A or Sentence B (in the context of BERT). Since the INPUT and LABEL are encoded separately here, the Segment Embedding is ‘0’ in both cases.
- Map the input and the candidate label separately to a common feature space: y_ctxt = red(T1(ctxt)) and y_cand = red(T2(cand)), where T1 and T2 are two separate Transformers (Encoders).
- The Encoder, after performing Self-Attention on the Input token embeddings, gives an encoder representation for every token.
A reduce function (red) is then used to reduce these per-token representations to a single embedding. The reduce function can be any of the following:
-> it can take the representation of the first token, which corresponds to the special token [S]
-> or it can take the average over all the output embeddings
-> or it can take the average over the first ‘m’ output embeddings (m < N, where N is the number of tokens)
- Once the INPUT and LABEL are represented in a common vector space, measure the similarity between them using a standard dot product or any other similarity function.
- We then minimize the Cross-Entropy loss function, where the logits are y_ctxt · y_cand1, ..., y_ctxt · y_candn, with cand1 being the correct label and the others being negatives.
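The Bi-Encoder steps above (reduce, dot-product scoring against cached candidates, cross-entropy over the logits) can be sketched in plain Python. The toy vectors stand in for real Transformer outputs, and all names (`reduce_first`, `cached_cands`, and so on) are hypothetical; this is an illustration under those assumptions, not the paper's implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# The three reduce options described in the text
def reduce_first(outputs):
    return outputs[0]            # representation of the [S] token

def reduce_mean(outputs):
    return mean(outputs)         # average over all outputs

def reduce_first_m(outputs, m):
    return mean(outputs[:m])     # average over the first m outputs

def cross_entropy(logits, correct_index):
    """-log softmax(logits)[correct_index], computed stably."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[correct_index]

# Toy encoder outputs for the context (N = 3 tokens, dimension 2)
ctxt_outputs = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
y_ctxt = reduce_first(ctxt_outputs)

# Candidate embeddings do not depend on the context, so they can be
# computed once and cached; cand1 is taken to be the correct label
cached_cands = {"cand1": [2.0, 0.0], "cand2": [0.0, 1.0], "cand3": [-1.0, 0.0]}

logits = [dot(y_ctxt, y_cand) for y_cand in cached_cands.values()]
loss = cross_entropy(logits, correct_index=0)
print(logits, loss)
```

At inference time only `y_ctxt` has to be computed for a new input; scoring every candidate is then a single dot product per cached embedding.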
Cross-Encoder:
- Here, the INPUT and the LABEL are concatenated and full Self-Attention is performed over the entire sequence of input and label. That is, every token of the input attends to every token of the label and vice versa. This gives rise to rich interactions between the input and the label.
- Even here, both the INPUT and LABEL are surrounded by a special token [S].
- Again, the embedding input to the Encoder is a combination of Token Embeddings + Segment Embeddings + Position Embeddings. Since the INPUT and LABEL are combined, the Segment Embedding is ‘0’ for an INPUT token and ‘1’ for a LABEL token.
- Cross-Encoders give higher accuracy than Bi-Encoders because of the full bi-directional attention between the input and the label. At the same time, they are extremely slow during inference, because each candidate label has to be concatenated with the input context and cannot be encoded separately as in the case of Bi-Encoders. Therefore candidate embeddings cannot be pre-computed and cached. When the number of candidate labels is huge (as it is in most real scenarios), Cross-Encoders do not scale.
- After Self-Attention, the Transformer gives the encoder representations for all the input tokens. We reduce these to a single representation by taking the embedding corresponding to the first token (i.e. the special token [S]): y_ctxt,cand = h_1 = first(T(ctxt, cand)). This embedding vector is then converted to a scalar score by a linear projection: s(ctxt, cand_i) = y_ctxt,cand_i W.
- The training objective here too is to minimize the Cross-Entropy loss function, with the logits s(ctxt, cand1), ..., s(ctxt, candn), where ‘cand1’ is the correct candidate and the others are negatives taken from the training set. One problem here is that, whereas in the Bi-Encoder we could use the other labels in the batch as negative training samples, here we cannot do that. We use external negatives provided in the training set. Because it is computationally heavy, the in-memory batch size of the Cross-Encoder is also very small.
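The sketch below illustrates why candidates cannot be cached in a Cross-Encoder: a separate concatenated sequence, with its own segment ids, has to be built and encoded for every (input, candidate) pair. The exact placement of the [S] tokens, the helper names and the toy numbers are assumptions made for illustration, and the Transformer itself is left out.

```python
def build_cross_encoder_input(input_tokens, label_tokens, special="[S]"):
    """Concatenate INPUT and LABEL into one sequence with segment ids.

    Segment id 0 marks INPUT positions, 1 marks LABEL positions.
    The [S] layout used here is an assumed one.
    """
    tokens = [special] + input_tokens + [special] + label_tokens + [special]
    n_input = len(input_tokens) + 2
    segments = [0] * n_input + [1] * (len(tokens) - n_input)
    return tokens, segments

def linear_score(first_token_embedding, w):
    """Project the first-token embedding to a scalar score."""
    return sum(a * b for a, b in zip(first_token_embedding, w))

context = "how are you".split()
candidates = ["fine thanks".split(), "blue sky".split()]

# One full sequence per candidate: each needs its own encoder pass
sequences = [build_cross_encoder_input(context, cand) for cand in candidates]
for tokens, segments in sequences:
    print(list(zip(tokens, segments)))

# Hypothetical first-token embedding and projection weights
print(linear_score([1.0, 2.0], [0.5, -1.0]))
```

With n candidates this means n forward passes per input, compared with a single pass for the Bi-Encoder.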
Poly-Encoder:
- The Poly-Encoder takes the best qualities of the Bi- and Cross-Encoders. It is faster during inference than the Cross-Encoder and has better accuracy than the Bi-Encoder.
- The Candidate Label is encoded separately.
- Given the tokens of the input context, we perform 3 types of Attention, as explained below.
- First, Self-Attention over the Input Context’s tokens gives one output vector per token (the Out’s referred to below).
- Second, we learn ‘m’ codes (or queries, in the parlance of Self-Attention), where m < N (N being the length of the INPUT). The number of codes to be learnt, ‘m’, is a hyperparameter. Each code Ci attends over all the outputs of the previous Self-Attention. The ‘m’ codes are randomly initialized.
- We first get the Attention weights (w’s) by performing dot-product attention (or multiplicative attention in general) between the ‘m’ codes, which serve as the “Queries”, and the previous Self-Attention outputs (Out’s), which serve as the “Keys”. We then use these attention weights to get a weighted sum of the previous Self-Attention outputs (Out’s), which serve as the “Values”. This yields ‘m’ global features of the Input Context.
- Think about why we are using this kind of Attention mechanism here. In a Bi-Encoder, the candidate label does not attend over the tokens of the input context. A Cross-Encoder, at the other extreme, makes the candidate label attend over every token of the input context. In the Poly-Encoder we are trying to find a middle ground, by making the candidate label embedding attend not over the entire input context, but over a subset of features learnt from the input context.
- The third kind of attention (alluded to in the previous paragraph) is between the ‘m’ global features of the Input Context and the embedding of the Candidate Label.
- Now we compute the Similarity score between the Input Context embedding and the Candidate Label embedding as their dot product, y_ctxt · y_cand.
- Once again, the training objective is to minimize the Cross-Entropy loss function given by the logits, as before.
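The second and third attention steps above can be sketched with plain dot-product attention. Everything here is a toy under stated assumptions: the self-attention outputs are hard-coded vectors rather than real Transformer outputs, the codes are drawn from a seeded random generator instead of being learnt, the candidate embedding is used as the query in the third step, and the names `attend` and `poly_encoder_score` are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention: weighted sum of values, weights from query·keys."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def poly_encoder_score(ctxt_outputs, codes, y_cand):
    # Step 2: each of the m codes attends over the self-attention outputs,
    # giving m global features of the input context
    global_feats = [attend(c, ctxt_outputs, ctxt_outputs) for c in codes]
    # Step 3: the candidate embedding attends over the m global features
    y_ctxt = attend(y_cand, global_feats, global_feats)
    # Final similarity score: dot product of context and candidate embeddings
    return dot(y_ctxt, y_cand)

# Step 1 stand-in: toy self-attention outputs (N = 3 tokens, dimension 2)
ctxt_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# m = 2 randomly initialised codes (they would be learnt during training)
rng = random.Random(0)
codes = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]

# Candidate embeddings are encoded separately and can be cached
cands = [[3.0, 0.0], [0.0, -1.0]]
logits = [poly_encoder_score(ctxt_outputs, codes, y) for y in cands]
print(logits)
```

Only the last, cheap attention step depends on the candidate, which is what keeps inference close to Bi-Encoder speed while still letting the candidate influence the context embedding.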
We saw three different Encoder architectures for the task of “Multi-Sentence Scoring” and saw how the Poly-Encoders were better. In the next part, we will see how the Poly-Encoders are used in the Blender and also about the different Model Architectures and training objectives. We will also touch upon the Evaluation methods used to compare the performance of Blender with that of the other Chatbots.
Note: All the notations, formulae and the Encoder block diagrams above are the same as used in the original paper mentioned in Ref.[1].
Translated from: https://towardsdatascience.com/blender-bot-part-2-the-transformer-2e4d960b149f