Deep Contextualized Word Representations
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, et al., Deep Contextualized Word Representations, NAACL (2018)
Abstract
Deep contextualized word representations model:
(1) complex characteristics of word use, such as syntax and semantics;
(2) how these uses vary across linguistic contexts, i.e., polysemy.
The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).
Evaluation tasks: question answering, textual entailment, sentiment analysis, and others.
1 Introduction
Ideally, pre-trained word representations should model:
(1) complex characteristics of word use, such as syntax and semantics;
(2) how these uses vary across linguistic contexts, i.e., polysemy.
This paper proposes a deep contextualized word representation:
(1) each token is assigned a representation that is a function of the entire input sentence;
(2) the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.
The method is called ELMo (Embeddings from Language Models). The representations are deep in the sense that they are a function of all of the internal layers of the biLM. In addition, each end task can learn its own linear combination of the vectors stacked above each input word.
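The task-specific combination described here (softmax-normalized scalar weights over the stacked layer vectors, times a task-specific scale) can be sketched as follows. The function and variable names are illustrative, not from the paper's released code.

```python
import numpy as np

def elmo_vector(layer_states, s_logits, gamma=1.0):
    """Combine biLM layer activations for one token into a single ELMo vector.

    layer_states: (L+1, dim) array -- the token embedding plus each biLM
                  layer's hidden state for one token.
    s_logits:     (L+1,) array of learned task-specific scalars,
                  softmax-normalized below.
    gamma:        learned task-specific scaling factor.
    """
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                          # softmax weights over layers
    return gamma * (s[:, None] * layer_states).sum(axis=0)

# With uniform logits, the result is gamma times the mean over layers.
states = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
v = elmo_vector(states, np.zeros(3), gamma=1.0)
```

In the downstream task, only `s_logits` and `gamma` are trained; the biLM weights stay frozen.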
Higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.
■ In this paper, biLM refers to the ELMo model, and biRNN refers to the downstream task model. ■
2 Related Work
Because they capture syntactic and semantic information of words from large-scale unlabeled text, pretrained word vectors are a standard component of most state-of-the-art NLP architectures, e.g., for question answering, textual entailment, and semantic role labeling. However, earlier approaches assign each word a single context-independent representation.
To overcome this limitation, methods based on subword information, or on learning separate vectors for each word sense, have been proposed. This work benefits from subword units through the use of character convolutions, and seamlessly incorporates multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.
Context-dependent representations: context2vec uses a bidirectional LSTM to encode the context around a pivot word.
3 ELMo: Embeddings from Language Models
ELMo word representations are functions of the entire input sentence:
(1) computed on top of two-layer biLMs with character convolutions;
(2) formed as a linear function of the internal network states.
Pretraining the biLM at large scale enables semi-supervised learning, and the representations are easily incorporated into a wide range of existing neural NLP architectures.
3.1 Bidirectional Language Models
Given a sequence of $N$ tokens, $(t_1, t_2, \dots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of each token $t_k$ given its history $(t_1, \dots, t_{k-1})$:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$
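The factorization above can be checked with a tiny sketch in log space; `cond_prob` here is a hypothetical stand-in for the model's conditional distribution $p(t_k \mid t_1, \dots, t_{k-1})$.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Forward LM factorization: log p(t_1, ..., t_N) is the sum of
    conditional log-probabilities, one per token given its history."""
    return sum(math.log(cond_prob(tokens[:k], tokens[k]))
               for k in range(len(tokens)))

# Toy model assigning probability 0.5 to every token regardless of history:
lp = sequence_log_prob(["t1", "t2", "t3"], lambda history, token: 0.5)
# lp == 3 * log(0.5)
```

The backward LM used by the biLM is the same factorization run in reverse, conditioning each token on its future context $(t_{k+1}, \dots, t_N)$.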