Paper reading: Deep contextualized word representations

NAACL 18 Best Paper

This paper is another reminder that the essence of deep learning lies in representation, and that NLP has yet to get its most fundamental representations, embeddings and language models, right (no one had really poured serious effort into them). A good low-level representation can bring improvements far beyond our wildest expectations. The ELMo (Embeddings from Language Models) proposed here is a softmax-normalized weighted sum of the activations of the different layers of an LM, so it is a function of the entire input sentence. By combining a deep biLM trained on a large monolingual corpus with combination weights learned on the downstream task, ELMo marries rich, universal biLM knowledge with task-specific representation. The experiments sweep a whole range of benchmarks; the paper is a joy to read and truly deserves best paper.

Intuition: in a multilayer LSTM, the activations of the lower layers mainly model syntactic information, while the higher layers model semantic information. For example, adding syntactic supervision such as POS tagging at the lower layers improves the performance of higher-level tasks that rely on semantics. If we take the activations of different layers of a deep LM as contextualized embeddings for downstream tasks, the lower layers work better for syntactic tasks and the higher layers work better for semantic tasks.

Different layers of deep biRNNs encode different types of information. Higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging).

Earlier approaches to contextualized embeddings all take the activation of the top layer of an LM or MT encoder as the contextualized embedding.

Other approaches for learning contextual embeddings include the pivot word itself in the representation and are computed with the encoder of either a supervised neural machine translation (MT) system (CoVe; McCann et al., 2017) or an unsupervised language model (Peters et al., 2017).  

Model structure:

3.1 Bidirectional language models

A biLM combines both a forward and backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions:
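In the paper's notation, for a token sequence $(t_1, \ldots, t_N)$, the joint objective is

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$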

We tie the parameters for both the token representation (Θx) and Softmax layer (Θs) in the forward and backward direction while maintaining separate parameters for the LSTMs in each direction. That is, the biLM's token embedding and output fc layer are shared between the two directions, while the forward and backward LSTM cells have their own parameters.
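A minimal PyTorch-style sketch of this tying (module and dimension names are mine, not the paper's; the real model uses a character CNN for Θx and exposes every LSTM layer's states, which this sketch glosses over):

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    """Sketch of the tying: shared token embedding (Θx) and shared Softmax
    projection (Θs), with separate forward and backward LSTM stacks."""
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)         # Θx, shared
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.softmax_proj = nn.Linear(hidden_dim, vocab_size)  # Θs, shared

    def forward(self, tokens):                                 # tokens: (batch, seq)
        x = self.embed(tokens)
        fwd_out, _ = self.fwd_lstm(x)                          # left-to-right
        bwd_out, _ = self.bwd_lstm(torch.flip(x, dims=[1]))    # right-to-left
        bwd_out = torch.flip(bwd_out, dims=[1])                # re-align to positions
        # both directions go through the same output projection
        return self.softmax_proj(fwd_out), self.softmax_proj(bwd_out)
```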

3.2 ELMo 

ELMo is a task specific combination of the intermediate layer representations in the biLM. 

s^task are softmax-normalized weights and the scalar parameter γ^task allows the task model to scale the entire ELMo vector. γ is of practical importance to aid the optimization process. Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
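Concretely, with $\mathbf{h}^{LM}_{k,j}$ the biLM's layer-$j$ activation for token $k$ (including the $j = 0$ token layer) and $L$ LSTM layers, the combination from the paper is

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}, \qquad s^{task} = \mathrm{softmax}(w^{task})$$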

A rather naive question here:

Different layers encode different kinds of information, so their activations presumably follow different distributions and live in different vector spaces. Wouldn't summing them directly mix things up? (Should go back and read the layer normalization paper.)
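A minimal sketch of the softmax-normalized mixing, with the optional per-layer layer normalization the paper mentions (PyTorch is my assumption; the class and argument names are not from the paper):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-normalized per-layer weights s^task plus a global scale gamma^task,
    optionally layer-normalizing each biLM layer first so that layers with
    different activation distributions become comparable before summing."""
    def __init__(self, num_layers, dim, use_layer_norm=False):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))   # softmax(w) -> s^task
        self.gamma = nn.Parameter(torch.ones(1))         # gamma^task
        self.norms = (nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])
                      if use_layer_norm else None)

    def forward(self, layer_acts):
        # layer_acts: list of (batch, seq_len, dim) tensors, one per biLM layer
        if self.norms is not None:
            layer_acts = [ln(h) for ln, h in zip(self.norms, layer_acts)]
        s = torch.softmax(self.w, dim=0)
        mixed = sum(s_j * h for s_j, h in zip(s, layer_acts))
        return self.gamma * mixed

# e.g. elmo_k = ScalarMix(num_layers=3, dim=1024)(list_of_bilm_layer_activations)
```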

3.3 Using biLMs for supervised NLP tasks

We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations.

In short: the biLM's parameters are frozen and it serves as a fixed source of embeddings. For an input sequence, first run it through the biLM to get an ELMo vector (the weighted sum), then concatenate the ELMo vector with the original embedding vector and feed this ELMo-enhanced embedding into the task model.

For some tasks, we can additionally concatenate ELMo with the activation h at the top layer of the task model, before the output fc layer.

It also helps to apply some dropout to the biLM outputs (a multilayer LSTM generally wants dropout anyway) and to put an L2 penalty λ‖w‖² on the linear mixing weights. The size of λ controls what combination of layer weights the model learns: with λ = 1 the weights end up roughly uniform, while with λ = 0.001 the learned weights can vary a lot more across layers.
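A sketch of that wiring plus the weight penalty, reusing the ScalarMix from the sketch above (PyTorch, a generic BiLSTM task encoder, and a bilm that returns the list of per-layer activations are all my assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class ELMoEnhancedEncoder(nn.Module):
    """Frozen biLM -> ScalarMix -> concatenated with static word embeddings at
    the input; optionally a second ScalarMix is concatenated with the RNN output h."""
    def __init__(self, bilm, static_emb, num_bilm_layers, elmo_dim,
                 hidden_dim, use_output_elmo=False):
        super().__init__()
        self.bilm = bilm
        for p in self.bilm.parameters():
            p.requires_grad = False               # biLM stays fixed
        self.static_emb = static_emb              # context-independent embeddings
        self.input_mix = ScalarMix(num_bilm_layers, elmo_dim)
        self.output_mix = ScalarMix(num_bilm_layers, elmo_dim) if use_output_elmo else None
        self.dropout = nn.Dropout(0.5)
        self.rnn = nn.LSTM(static_emb.embedding_dim + elmo_dim, hidden_dim,
                           batch_first=True, bidirectional=True)

    def forward(self, tokens):
        layer_acts = self.bilm(tokens)            # assumed: list of per-layer activations
        elmo_in = self.dropout(self.input_mix(layer_acts))
        x = torch.cat([self.static_emb(tokens), elmo_in], dim=-1)
        h, _ = self.rnn(x)
        if self.output_mix is not None:           # ELMo at the RNN output as well
            h = torch.cat([h, self.dropout(self.output_mix(layer_acts))], dim=-1)
        return h

def elmo_weight_penalty(model, lam=1e-3):
    # lambda * ||w||^2 on the mixing weights; add this term to the task loss
    return lam * sum((m.w ** 2).sum() for m in model.modules()
                     if isinstance(m, ScalarMix))
```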

Q: Is the task model here already pretrained? If so, this whole procedure changes the dimensionality of the LSTM's input matrix... are the extra parameters learned from scratch? And if it is not pretrained, why keep the traditional context-independent embedding at all?

3.4 Pre-trained bidirectional language model architecture

For the task model, fine-tuning the biLM improves results further. This can be seen as a type of domain transfer for the biLM.

Once pretrained, the biLM can compute representations for any task. In some cases, fine tuning the biLM on domain specific data leads to significant drops in perplexity and an increase in downstream task performance. This can be seen as a type of domain transfer for the biLM. As a result, in most cases we used a fine-tuned biLM in the downstream task.

Analysis

5.1 Alternate layer weighting schemes

The choice of the regularization parameter λ is also important, as large values such as λ = 1 effectively reduce the weighting function to a simple average over the layers, while smaller values (e.g., λ = 0.001) allow the layer weights to vary.

A small λ is preferred in most cases with ELMo, although for NER, a task with a smaller training set, the results are insensitive to λ (not shown).

5.2 Where to include ELMo?

In general, using the ELMo-enhanced embedding at the input is helpful, but for some tasks it also helps to concatenate ELMo with the activation h at the output of the task RNN. Roughly, it comes down to whether broad general knowledge or task-specific knowledge matters more at that point, since for those tasks the decoder applies attention right after that layer.

5.4 Sample efficiency

Two aspects:

① With the same amount of data, models using ELMo converge faster.

② ELMo-enhanced models use smaller training sets more efficiently than models without ELMo.

5.5 Visualization of learned weights

At the input layer, the task model favors the first biLSTM layer.

The output layer weights are relatively balanced, with a slight preference for the lower layers.

 

https://blog.csdn.net/weixin_37947156/article/details/83146349