Paper: "ELMo: Deep contextualized word representations" (Translation and Interpretation)


Table of Contents

"ELMo: Deep contextualized word representations" (Translation and Interpretation)

Abstract

1 Introduction

2 Related work

3 ELMo: Embeddings from Language Models

3.1 Bidirectional language models

3.2 ELMo

3.3 Using biLMs for supervised NLP tasks

4 Evaluation

5 Analysis

5.1 Alternate layer weighting schemes

5.2 Where to include ELMo?

5.3 What information is captured by the biLM’s representations?

5.4 Sample efficiency

5.5 Visualization of learned weights

6 Conclusion


"ELMo: Deep contextualized word representations" (Translation and Interpretation)

Link: https://arxiv.org/abs/1802.05365

Date: February 2018

Authors: Matthew E. Peters†, Mark Neumann†, Mohit Iyyer†, Matt Gardner† ({matthewp,markn,mohiti,mattg}@allenai.org); Christopher Clark∗, Kenton Lee∗, Luke Zettlemoyer†∗ ({csquared,kentonl,lsz}@cs.washington.edu)
†Allen Institute for Artificial Intelligence; ∗Paul G. Allen School of Computer Science & Engineering, University of Washington

Abstract

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

1 Introduction

Pre-trained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component in many neural language understanding models. However, learning high quality representations can be challenging. They should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). In this paper, we introduce a new type of deep contextualized word representation that directly addresses both challenges, can be easily integrated into existing models, and significantly improves the state of the art in every considered case across a range of challenging language understanding problems.

Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence. We use vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus. For this reason, we call them ELMo (Embeddings from Language Models) representations. Unlike previous approaches for learning contextualized word vectors (Peters et al., 2017; McCann et al., 2017), ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM. More specifically, we learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer.

Combining the internal states in this manner allows for very rich word representations. Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging). Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.


Extensive experiments demonstrate that ELMo representations work extremely well in practice. We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis. The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions. For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder. Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM. Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.

2 Related work

Due to their ability to capture syntactic and semantic information of words from large scale unlabeled text, pretrained word vectors (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) are a standard component of most state-of-the-art NLP architectures, including for question answering (Liu et al., 2017), textual entailment (Chen et al., 2017) and semantic role labeling (He et al., 2017). However, these approaches for learning word vectors only allow a single context-independent representation for each word.

Previously proposed methods overcome some of the shortcomings of traditional word vectors by either enriching them with subword information (e.g., Wieting et al., 2016; Bojanowski et al., 2017) or learning separate vectors for each word sense (e.g., Neelakantan et al., 2014). Our approach also benefits from subword units through the use of character convolutions, and we seamlessly incorporate multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.


Other recent work has also focused on learning context-dependent representations. context2vec (Melamud et al., 2016) uses a bidirectional Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) to encode the context around a pivot word. Other approaches for learning contextual embeddings include the pivot word itself in the representation and are computed with the encoder of either a supervised neural machine translation (MT) system (CoVe; McCann et al., 2017) or an unsupervised language model (Peters et al., 2017). Both of these approaches benefit from large datasets, although the MT approach is limited by the size of parallel corpora. In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences (Chelba et al., 2014). We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.

Previous work has also shown that different layers of deep biRNNs encode different types of information. For example, introducing multi-task syntactic supervision (e.g., part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher level tasks such as dependency parsing (Hashimoto et al., 2017) or CCG super tagging (Søgaard and Goldberg, 2016). In an RNN-based encoder-decoder machine translation system, Belinkov et al. (2017) showed that the representations learned at the first layer in a 2-layer LSTM encoder are better at predicting POS tags than the second layer. Finally, the top layer of an LSTM for encoding word context (Melamud et al., 2016) has been shown to learn representations of word sense. We show that similar signals are also induced by the modified language model objective of our ELMo representations, and it can be very beneficial to learn models for downstream tasks that mix these different types of semi-supervision.

Dai and Le (2015) and Ramachandran et al. (2017) pretrain encoder-decoder pairs using language models and sequence autoencoders and then fine-tune with task-specific supervision. In contrast, after pretraining the biLM with unlabeled data, we fix the weights and add additional task-specific model capacity, allowing us to leverage large, rich and universal biLM representations for cases where downstream training data size dictates a smaller supervised model.


3 ELMo: Embeddings from Language Models

Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section. They are computed on top of two-layer biLMs with character convolutions (Sec. 3.1), as a linear function of the internal network states (Sec. 3.2). This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale (Sec. 3.4) and easily incorporated into a wide range of existing neural NLP architectures (Sec. 3.3).

3.1 Bidirectional language models

Given a sequence of $N$ tokens, $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1}).$$

Recent state-of-the-art neural language models (Józefowicz et al., 2016; Melis et al., 2017; Merity et al., 2017) compute a context-independent token representation $x_k^{LM}$ (via token embeddings or a CNN over characters), then pass it through $L$ layers of forward LSTMs. At each position $k$, each LSTM layer outputs a context-dependent representation $\overrightarrow{h}_{k,j}^{LM}$, where $j = 1, \ldots, L$. The top-layer LSTM output, $\overrightarrow{h}_{k,L}^{LM}$, is used to predict the next token $t_{k+1}$ with a Softmax layer.



A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N).$$

It can be implemented in a manner analogous to a forward LM, with each backward LSTM layer $j$ in an $L$-layer deep model producing representations $\overleftarrow{h}_{k,j}^{LM}$ of $t_k$ given $(t_{k+1}, \ldots, t_N)$. A biLM combines both a forward and backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big).$$

We tie the parameters for both the token representation ($\Theta_x$) and Softmax layer ($\Theta_s$) in the forward and backward direction, while maintaining separate parameters for the LSTMs in each direction. Overall, this formulation is similar to the approach of Peters et al. (2017), with the exception that we share some weights between directions instead of using completely independent parameters. In the next section, we depart from previous work by introducing a new approach for learning word representations that are a linear combination of the biLM layers.
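
A minimal, self-contained PyTorch sketch of this joint objective is shown below. It is not the paper's implementation: it uses a plain token embedding in place of the character CNN and toy dimensions, but it illustrates the shared token-representation and Softmax parameters ($\Theta_x$, $\Theta_s$) alongside direction-specific LSTMs.

```python
# Sketch only: toy biLM with a shared embedding/Softmax and separate
# forward/backward LSTMs, trained to maximize the joint log likelihood.
import torch
import torch.nn as nn

class TinyBiLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # Theta_x, shared by both directions
        self.fwd = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.bwd = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.softmax = nn.Linear(dim, vocab_size)    # Theta_s, shared by both directions

    def forward(self, tokens):                       # tokens: (batch, N) integer ids
        x = self.embed(tokens)
        h_fwd, _ = self.fwd(x)                              # left-to-right states
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))        # right-to-left states
        h_bwd = torch.flip(h_bwd, dims=[1])                 # realign to positions 1..N
        loss = nn.CrossEntropyLoss()
        # forward LM: state at position k predicts token k+1
        fwd_logits = self.softmax(h_fwd[:, :-1])
        fwd_nll = loss(fwd_logits.reshape(-1, fwd_logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
        # backward LM: state at position k predicts token k-1
        bwd_logits = self.softmax(h_bwd[:, 1:])
        bwd_nll = loss(bwd_logits.reshape(-1, bwd_logits.size(-1)),
                       tokens[:, :-1].reshape(-1))
        return fwd_nll + bwd_nll   # minimizing this maximizes the joint log likelihood

tokens = torch.randint(0, 1000, (4, 12))
print(TinyBiLM()(tokens))
```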



3.2 ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations:

$$R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \,\} = \{\, h_{k,j}^{LM} \mid j = 0, \ldots, L \,\},$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in $R$ into a single vector, $\mathrm{ELMo}_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer, $E(R_k) = h_{k,L}^{LM}$, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017). More generally, we compute a task-specific weighting of all biLM layers:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}. \qquad (1)$$

In (1), $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma$ is of practical importance to aid the optimization process (see the supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
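
Below is a minimal sketch of the combination in Eq. (1), assuming the $L + 1$ biLM layer activations (forward and backward states already concatenated) are stacked into a single tensor. The module name and shapes are illustrative, not part of the paper.

```python
# Sketch only: task-specific scalar mix of biLM layers, as in Eq. (1).
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # pre-softmax weights s^task
        self.gamma = nn.Parameter(torch.ones(1))        # scale gamma^task

    def forward(self, layer_activations):
        # layer_activations: (num_layers, batch, seq_len, dim)
        weights = torch.softmax(self.s, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed

# For L = 2 biLSTM layers, the mix runs over 3 layers: the token layer plus two biLSTM layers.
layers = torch.randn(3, 2, 5, 1024)    # illustrative shapes
elmo_k = ScalarMix(num_layers=3)(layers)
print(elmo_k.shape)                    # torch.Size([2, 5, 1024])
```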



3.3 Using biLMs for supervised NLP tasks

Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations, as described below.

First consider the lowest layers of the supervised model without the biLM. Most supervised NLP models share a common architecture at the lowest layers, allowing us to add ELMo in a consistent, unified manner. Given a sequence of tokens $(t_1, \ldots, t_N)$, it is standard to form a context-independent token representation $x_k$ for each token position using pre-trained word embeddings and, optionally, character-based representations. Then, the model forms a context-sensitive representation $h_k$, typically using either bidirectional RNNs, CNNs, or feed-forward networks.

To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector $\mathrm{ELMo}_k^{task}$ with $x_k$ and pass the ELMo-enhanced representation $[x_k; \mathrm{ELMo}_k^{task}]$ into the task RNN. For some tasks (e.g., SNLI, SQuAD), we observe further improvements by also including ELMo at the output of the task RNN, by introducing another set of output-specific linear weights and replacing $h_k$ with $[h_k; \mathrm{ELMo}_k^{task}]$. As the remainder of the supervised model remains unchanged, these additions can happen within the context of more complex neural models. For example, see the SNLI experiments in Sec. 4, where a bi-attention layer follows the biLSTMs, or the coreference resolution experiments, where a clustering model is layered on top of the biLSTMs.
Finally, we found it beneficial to add a moderate amount of dropout to ELMo (Srivastava et al., 2014) and, in some cases, to regularize the ELMo weights by adding $\lambda \lVert \mathbf{w} \rVert_2^2$ to the loss. This imposes an inductive bias on the ELMo weights to stay close to an average of all biLM layers.
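
A minimal sketch of this wiring is shown below, with hypothetical module and dimension names: the frozen ELMo vector is concatenated with $x_k$, dropout is applied, and the pre-softmax mixing weights can be penalized so they stay close to an average over layers.

```python
# Sketch only: concatenating ELMo with the token representation x_k,
# plus dropout and an L2 penalty on the layer-mixing weights.
import torch
import torch.nn as nn

class ELMoEnhancedEncoder(nn.Module):
    def __init__(self, word_dim=300, elmo_dim=1024, hidden=256, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.LSTM(word_dim + elmo_dim, hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, x_k, elmo_k):
        enhanced = torch.cat([x_k, self.dropout(elmo_k)], dim=-1)  # [x_k; ELMo_k^task]
        h_k, _ = self.rnn(enhanced)                                # task RNN
        return h_k

def elmo_weight_penalty(mix_weights, lam=1e-3):
    # lambda * ||w||^2 on the mixing weights; pulling them toward zero
    # pulls the softmax toward a uniform average of the biLM layers
    return lam * mix_weights.pow(2).sum()

x = torch.randn(2, 7, 300)     # illustrative x_k batch
e = torch.randn(2, 7, 1024)    # illustrative ELMo vectors
print(ELMoEnhancedEncoder()(x, e).shape)   # torch.Size([2, 7, 512])
```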

3.4 Pre-trained bidirectional language model architecture

The pre-trained biLMs in this paper are similar to the architectures in Józefowicz et al. (2016) and Kim et al. (2015), but modified to support joint training of both directions and to add a residual connection between LSTM layers. We focus on large scale biLMs in this work, as Peters et al. (2017) highlighted the importance of using biLMs over forward-only LMs and of large scale training.

To balance overall language model perplexity with model size and computational requirements for downstream tasks, while maintaining a purely character-based input representation, we halved all embedding and hidden dimensions from the single best model CNN-BIG-LSTM in Józefowicz et al. (2016). The final model uses L = 2 biLSTM layers with 4096 units and 512 dimension projections and a residual connection from the first to second layer. The context insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers (Srivastava et al., 2015) and a linear projection down to a 512 representation. As a result, the biLM provides three layers of representations for each input token, including those outside the training set due to the purely character input. In contrast, traditional word embedding methods only provide one layer of representation for tokens in a fixed vocabulary.
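
For reference, the biLM hyperparameters stated above can be collected into a small configuration sketch (a summary of the numbers in this paragraph, not a runnable model definition):

```python
# Summary of the pre-trained biLM architecture described in Sec. 3.4.
BILM_CONFIG = {
    "biLSTM_layers": 2,               # L = 2
    "lstm_units": 4096,               # per layer, per direction
    "projection_dim": 512,            # output projection of each LSTM layer
    "residual_connection": True,      # between the first and second LSTM layers
    "char_ngram_filters": 2048,       # character n-gram convolutional filters
    "highway_layers": 2,              # after the character CNN
    "token_representation_dim": 512,  # after the final linear projection
    "representations_per_token": 3,   # token layer + two biLSTM layers
}
```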


After training for 10 epochs on the 1B Word Benchmark (Chelba et al., 2014), the average of the forward and backward perplexities is 39.7, compared to 30.0 for the forward CNN-BIG-LSTM. Generally, we found the forward and backward perplexities to be approximately equal, with the backward value slightly lower.

Once pretrained, the biLM can compute representations for any task. In some cases, fine tuning the biLM on domain specific data leads to significant drops in perplexity and an increase in downstream task performance. This can be seen as a type of domain transfer for the biLM. As a result, in most cases we used a fine-tuned biLM in the downstream task. See supplemental material for details.


4 Evaluation

Table 1 shows the performance of ELMo across a diverse set of six benchmark NLP tasks. In every task considered, simply adding ELMo establishes a new state-of-the-art result, with relative error reductions ranging from 6% to 20% over strong base models. This is a very general result across a diverse set of model architectures and language understanding tasks. In the remainder of this section we provide high-level sketches of the individual task results; see the supplemental material for full experimental details.

Question answering The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains 100K+ crowd-sourced question-answer pairs where the answer is a span in a given Wikipedia paragraph. Our baseline model (Clark and Gardner, 2017) is an improved version of the Bidirectional Attention Flow model in Seo et al. (BiDAF; 2017). It adds a self-attention layer after the bidirectional attention component, simplifies some of the pooling operations and substitutes the LSTMs for gated recurrent units (GRUs; Cho et al., 2014). After adding ELMo to the baseline model, test set F1 improved by 4.7% from 81.1% to 85.8%, a 24.9% relative error reduction over the baseline, and improving the overall single model state-of-the-art by 1.4%. An 11-member ensemble pushes F1 to 87.4, the overall state-of-the-art at the time of submission to the leaderboard. The increase of 4.7% with ELMo is also significantly larger than the 1.8% improvement from adding CoVe to a baseline model (McCann et al., 2017).
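
As a quick check of the relative error reduction figure: the baseline error is $100\% - 81.1\% = 18.9\%$ and the ELMo error is $100\% - 85.8\% = 14.2\%$, so the relative reduction is $(18.9 - 14.2)/18.9 \approx 24.9\%$.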


 Table 1: Test set comparison of ELMo enhanced neural models with state-of-the-art single model baselines across six benchmark NLP tasks. The performance metric varies across tasks – accuracy for SNLI and SST-5; F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small test sizes for NER and SST-5, we report the mean and standard deviation across five runs with different random seeds. The “increase” column lists both the absolute and relative improvements over our baseline.



Textual entailment Textual entailment is the task of determining whether a “hypothesis” is true, given a “premise”. The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) provides approximately 550K hypothesis/premise pairs. Our baseline, the ESIM sequence model from Chen et al. (2017), uses a biLSTM to encode the premise and hypothesis, followed by a matrix attention layer, a local inference layer, another biLSTM inference composition layer, and finally a pooling operation before the output layer. Overall, adding ELMo to the ESIM model improves accuracy by an average of 0.7% across five random seeds. A five member ensemble pushes the overall accuracy to 89.3%, exceeding the previous ensemble best of 88.9% (Gong et al., 2018).

Semantic role labeling A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”. He et al. (2017) modeled SRL as a BIO tagging problem and used an 8-layer deep biLSTM with forward and backward directions interleaved, following Zhou and Xu (2015). As shown in Table 1, when adding ELMo to a re-implementation of He et al. (2017) the single model test set F1 jumped 3.2% from 81.4% to 84.6% – a new state-of-the-art on the OntoNotes benchmark (Pradhan et al., 2013), even improving over the previous best ensemble result by 1.2%.

Coreference resolution Coreference resolution is the task of clustering mentions in text that refer to the same underlying real world entities. Our baseline model is the end-to-end span-based neural model of Lee et al. (2017). It uses a biLSTM and attention mechanism to first compute span representations and then applies a softmax mention ranking model to find coreference chains. In our experiments with the OntoNotes coreference annotations from the CoNLL 2012 shared task (Pradhan et al., 2012), adding ELMo improved the average F1 by 3.2% from 67.2 to 70.4, establishing a new state of the art, again improving over the previous best ensemble result by 1.6% F1.


Named entity extraction The CoNLL 2003 NER task (Sang and Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). Following recent state-of-the-art systems (Lample et al., 2016; Peters et al., 2017), the baseline model uses pre-trained word embeddings, a character-based CNN representation, two biLSTM layers and a conditional random field (CRF) loss (Lafferty et al., 2001), similar to Collobert et al. (2011). As shown in Table 1, our ELMo enhanced biLSTM-CRF achieves 92.22% F1 averaged over five runs. The key difference between our system and the previous state of the art from Peters et al. (2017) is that we allowed the task model to learn a weighted average of all biLM layers, whereas Peters et al. (2017) only use the top biLM layer. As shown in Sec. 5.1, using all layers instead of just the last layer improves performance across multiple tasks.

Sentiment analysis The fine-grained sentiment classification task in the Stanford Sentiment Treebank (SST-5; Socher et al., 2013) involves selecting one of five labels (from very negative to very positive) to describe a sentence from a movie review. The sentences contain diverse linguistic phenomena such as idioms and complex syntactic constructions such as negations that are difficult for models to learn. Our baseline model is the biattentive classification network (BCN) from McCann et al. (2017), which also held the prior state-of-the-art result when augmented with CoVe embeddings. Replacing CoVe with ELMo in the BCN model results in a 1.0% absolute accuracy improvement over the state of the art.


 Table 3: Development set performance for SQuAD, SNLI and SRL when including ELMo at different locations in the supervised model.


5 Analysis

This section provides an ablation analysis to validate our chief claims and to elucidate some interesting aspects of ELMo representations. Sec. 5.1 shows that using deep contextual representations in downstream tasks improves performance over previous work that uses just the top layer, regardless of whether they are produced from a biLM or MT encoder, and that ELMo representations provide the best overall performance. Sec. 5.3 explores the different types of contextual information captured in biLMs and uses two intrinsic evaluations to show that syntactic information is better represented at lower layers while semantic information is captured at higher layers, consistent with MT encoders. It also shows that our biLM consistently provides richer representations than CoVe. Additionally, we analyze the sensitivity to where ELMo is included in the task model (Sec. 5.2), training set size (Sec. 5.4), and visualize the ELMo learned weights across the tasks (Sec. 5.5).

5.1 Alternate layer weighting schemes

There are many alternatives to Equation 1 for combining the biLM layers. Previous work on contextual representations used only the last layer, whether it be from a biLM (Peters et al., 2017) or an MT encoder (CoVe; McCann et al., 2017). The choice of the regularization parameter λ is also important, as large values such as λ = 1 effectively reduce the weighting function to a simple average over the layers, while smaller values (e.g., λ = 0.001) allow the layer weights to vary.
Table 2 compares these alternatives for SQuAD, SNLI and SRL. Including representations from all layers improves overall performance over just using the last layer, and including contextual representations from the last layer improves performance over the baseline. For example, in the case of SQuAD, using just the last biLM layer improves development F1 by 3.9% over the baseline. Averaging all biLM layers instead of using just the last layer improves F1 another 0.3% (comparing "Last Only" to the λ=1 columns), and allowing the task model to learn individual layer weights improves F1 another 0.2% (λ=1 vs. λ=0.001). A small λ is preferred in most cases with ELMo, although for NER, a task with a smaller training set, the results are insensitive to λ (not shown).
The overall trend is similar with CoVe but with smaller increases over the baseline. For SNLI, averaging all layers with λ=1 improves development accuracy from 88.2 to 88.7% over using just the last layer. SRL F1 increased a marginal 0.1% to 82.2 for the λ=1 case compared to using the last layer only.

5.2 Where to include ELMo?

All of the task architectures in this paper include word embeddings only as input to the lowest layer biRNN. However, we find that including ELMo at the output of the biRNN in task-specific architectures improves overall results for some tasks. As shown in Table 3, including ELMo at both the input and output layers for SNLI and SQuAD improves over just the input layer, but for SRL (and coreference resolution, not shown) performance is highest when it is included at just the input layer. One possible explanation for this result is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM's internal representations. In the SRL case, the task-specific context representations are likely more important than those from the biLM.

Table 5: All-words fine grained WSD F1. For CoVe and the biLM, we report scores for both the first and second layer biLSTMs.

Table 6: Test set POS tagging accuracies for PTB. For CoVe and the biLM, we report scores for both the first and second layer biLSTMs.



5.3 What information is captured by the biLM’s representations?

Since adding ELMo improves task performance over word vectors alone, the biLM's contextual representations must encode information generally useful for NLP tasks that is not captured in word vectors. Intuitively, the biLM must be disambiguating the meaning of words using their context. Consider "play", a highly polysemous word. The top of Table 4 lists nearest neighbors to "play" using GloVe vectors. They are spread across several parts of speech (e.g., "played", "playing" as verbs, and "player", "game" as nouns) but concentrated in the sports-related senses of "play". In contrast, the bottom two rows show nearest neighbor sentences from the SemCor dataset (see below) using the biLM's context representation of "play" in the source sentence. In these cases, the biLM is able to disambiguate both the part of speech and word sense in the source sentence.

These observations can be quantified using an intrinsic evaluation of the contextual representations similar to Belinkov et al. (2017). To isolate the information encoded by the biLM, the representations are used to directly make predictions for a fine grained word sense disambiguation (WSD) task and a POS tagging task. Using this approach, it is also possible to compare to CoVe, and across each of the individual layers.

Word sense disambiguation Given a sentence, we can use the biLM representations to predict the sense of a target word using a simple 1-nearest neighbor approach, similar to Melamud et al. (2016). To do so, we first use the biLM to compute representations for all words in SemCor 3.0, our training corpus (Miller et al., 1994), and then take the average representation for each sense. At test time, we again use the biLM to compute representations for a given target word and take the nearest neighbor sense from the training set, falling back to the first sense from WordNet for lemmas not observed during training.
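
A minimal sketch of this 1-nearest-neighbor procedure follows; the rep(sentence, index) helper, the tuple format of the training data, and the use of Euclidean distance are assumptions for illustration, not details from the paper.

```python
# Sketch only: average each sense's biLM representations over the training
# corpus, then assign the nearest sense centroid at test time.
import numpy as np
from collections import defaultdict

def build_sense_centroids(train_data, rep):
    # train_data: iterable of (sentence, token_index, sense_label)
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    for sentence, idx, sense in train_data:
        sums[sense] = sums[sense] + rep(sentence, idx)
        counts[sense] += 1
    return {sense: sums[sense] / counts[sense] for sense in sums}

def predict_sense(sentence, idx, candidate_senses, centroids, rep,
                  first_sense_fallback=None):
    v = rep(sentence, idx)
    seen = [s for s in candidate_senses if s in centroids]
    if not seen:               # lemma not observed in training: WordNet first sense
        return first_sense_fallback
    return min(seen, key=lambda s: np.linalg.norm(centroids[s] - v))
```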


Table 5 compares WSD results using the evaluation framework from Raganato et al. (2017b) across the same suite of four test sets in Raganato et al. (2017a). Overall, the biLM top layer representations have F1 of 69.0 and are better at WSD than the first layer. This is competitive with a state-of-the-art WSD-specific supervised model using hand-crafted features (Iacobacci et al., 2016) and a task-specific biLSTM that is also trained with auxiliary coarse-grained semantic labels and POS tags (Raganato et al., 2017a). The CoVe biLSTM layers follow a similar pattern to those from the biLM (higher overall performance at the second layer compared to the first); however, our biLM outperforms the CoVe biLSTM, which trails the WordNet first sense baseline.

POS tagging To examine whether the biLM captures basic syntax, we used the context representations as input to a linear classifier that predicts POS tags with the Wall Street Journal portion of the Penn Treebank (PTB) (Marcus et al., 1993). As the linear classifier adds only a small amount of model capacity, this is a direct test of the biLM's representations. Similar to WSD, the biLM representations are competitive with carefully tuned, task-specific biLSTMs (Ling et al., 2015; Ma and Hovy, 2016). However, unlike WSD, accuracies using the first biLM layer are higher than the top layer, consistent with results from deep biLSTMs in multi-task training (Søgaard and Goldberg, 2016; Hashimoto et al., 2017) and MT (Belinkov et al., 2017). CoVe POS tagging accuracies follow the same pattern as those from the biLM, and just like for WSD, the biLM achieves higher accuracies than the CoVe encoder.
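
A minimal sketch of such a linear probe is shown below, using scikit-learn's logistic regression on placeholder feature vectors; in practice the rows of X would be frozen biLM (or CoVe) context representations for each token and y the corresponding PTB tags.

```python
# Sketch only: a linear classifier probing frozen context representations for POS tags.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.random.randn(500, 1024)           # placeholder token representations
y_train = np.random.randint(0, 45, size=500)   # placeholder ids for a 45-tag tagset
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("train accuracy:", probe.score(X_train, y_train))
```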


Implications for supervised tasks Taken together, these experiments confirm that different layers in the biLM represent different types of information and explain why including all biLM layers is important for the highest performance in downstream tasks. In addition, the biLM's representations are more transferable to WSD and POS tagging than those in CoVe, helping to illustrate why ELMo outperforms CoVe in downstream tasks.

5.4 Sample efficiency

Adding ELMo to a model increases the sample efficiency considerably, both in terms of the number of parameter updates needed to reach state-of-the-art performance and the overall training set size. For example, the SRL model reaches a maximum development F1 after 486 epochs of training without ELMo. After adding ELMo, the model exceeds the baseline maximum at epoch 10, a 98% relative decrease in the number of updates needed to reach the same level of performance.
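
As a check of the 98% figure from the stated epoch counts: $1 - 10/486 \approx 0.979$, i.e., roughly a 98% relative decrease.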
In addition, ELMo-enhanced models use smaller training sets more efficiently than models without ELMo. Figure 1 compares the performance of baseline models with and without ELMo as the percentage of the full training set is varied from 0.1% to 100%. Improvements with ELMo are largest for smaller training sets and significantly reduce the amount of training data needed to reach a given level of performance. In the SRL case, the ELMo model with 1% of the training set has about the same F1 as the baseline model with 10% of the training set.

 Figure 1: Comparison of baseline vs. ELMo performance for SNLI and SRL as the training set size is varied from 0.1% to 100%.

Figure 2: Visualization of softmax-normalized biLM layer weights across tasks and ELMo locations. Normalized weights less than 1/3 are hatched with horizontal lines and those greater than 2/3 are speckled.



5.5 Visualization of learned weights

Figure 2 visualizes the softmax-normalized learned layer weights. At the input layer, the task model favors the first biLSTM layer. For coreference and SQuAD, this is strongly favored, but the distribution is less peaked for the other tasks. The output layer weights are relatively balanced, with a slight preference for the lower layers.

6 Conclusion

We have introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and shown large improvements when applying ELMo to a broad range of NLP tasks. Through ablations and other controlled experiments, we have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance.
