Paper Reading - Deep Contextualized Word Representations

Deep Contextualized Word Representations


M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, et al., Deep Contextualized Word Representations, NAACL (2018)


Abstract

Deep contextualized word representations model:

(1) complex characteristics of word use, such as syntax and semantics;

(2) how these uses vary across linguistic contexts, i.e., polysemy.

The word vectors in this paper are learned functions of the internal states of a deep bidirectional language model (biLM).

Evaluation tasks: question answering, textual entailment, sentiment analysis, etc.

1 Introduction

Pre-trained word representations should ideally model:

(1) complex characteristics of word use, such as syntax and semantics;

(2) how these uses vary across linguistic contexts, i.e., polysemy.

This paper proposes a deep contextualized word representation in which:

(1) each token is assigned a representation that is a function of the entire input sentence;

(2) the vectors are derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus.

The representations are called ELMo (Embeddings from Language Models). They are "deep" in the sense that they are a function of all of the internal layers of the biLM. In addition, for each end task, a different linear combination of the vectors stacked above each input word can be learned (see the sketch below).
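To make the per-task combination concrete, here is a minimal PyTorch-style sketch of softmax-normalized layer weights plus a task-specific scale. The class and parameter names (ScalarMix, gamma) are my own illustration, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific linear combination of biLM layer outputs (illustrative sketch)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.scalars, dim=0)
        stacked = torch.stack(layer_outputs, dim=0)            # (L+1, batch, seq, dim)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return self.gamma * mixed

# Example: mix the token embedding layer plus two LSTM layers of a biLM
mix = ScalarMix(num_layers=3)
layers = [torch.randn(2, 7, 1024) for _ in range(3)]
elmo_vectors = mix(layers)   # (2, 7, 1024)
```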

Higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.

■ In this article, biLM refers to the ELMo model, and biRNN refers to the downstream task model. ■

2 Related Work

Because they capture syntactic and semantic information of words from large-scale unlabeled text, pretrained word vectors have become a standard component of most state-of-the-art NLP architectures, e.g. for question answering, textual entailment, and semantic role labeling. However, these earlier approaches only allow a single context-independent representation for each word.

To overcome this limitation, previous work has proposed enriching vectors with subword information or learning separate vectors for each word sense. ELMo benefits from subword units through the use of character convolutions (sketched below) and seamlessly incorporates multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.
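As a rough illustration of how character convolutions produce a context-independent word representation, the sketch below uses a single filter width with max-over-time pooling. The actual biLM uses multiple filter widths, highway layers, and a linear projection, which are omitted here; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    """Minimal sketch: context-independent word vectors built from characters."""
    def __init__(self, char_vocab: int, char_dim: int = 16,
                 num_filters: int = 128, kernel_size: int = 3):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_chars) integer character ids
        b, s, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * s, c))         # (b*s, chars, char_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))       # (b*s, filters, chars-k+1)
        x, _ = x.max(dim=-1)                                # max-over-time pooling
        return x.view(b, s, -1)                             # (batch, seq_len, filters)

embedder = CharCNNEmbedder(char_vocab=262)
word_vecs = embedder(torch.randint(1, 262, (2, 7, 20)))     # (2, 7, 128)
```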

Context-dependent representations: context2vec uses a bidirectional LSTM to encode the context around a pivot word.

3 ELMo: Embeddings from Language Models

ELMo word representations are functions of the entire input sentence:

(1) they are computed on top of a two-layer biLM with character convolutions;

(2) they are a linear function of the internal network states.

This setup allows semi-supervised learning, where the biLM is pretrained at large scale, and the resulting representations are easily incorporated into a wide range of existing neural NLP architectures (see the sketch below).
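In the paper, the simplest way to plug ELMo into an existing architecture is to concatenate the ELMo vector with the task model's own context-independent token representation before its biRNN. A minimal sketch, assuming a frozen pretrained biLM has already produced `elmo_vectors` elsewhere (e.g. via the ScalarMix sketch above); the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TaskEncoderWithELMo(nn.Module):
    """Sketch: feed [x_k ; ELMo_k] into an existing task biRNN."""
    def __init__(self, vocab: int, word_dim: int = 100,
                 elmo_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, word_dim)
        self.birnn = nn.LSTM(word_dim + elmo_dim, hidden,
                             batch_first=True, bidirectional=True)

    def forward(self, token_ids, elmo_vectors):
        x = self.word_emb(token_ids)                 # (batch, seq, word_dim)
        x = torch.cat([x, elmo_vectors], dim=-1)     # enhance with ELMo
        out, _ = self.birnn(x)
        return out                                    # (batch, seq, 2*hidden)

enc = TaskEncoderWithELMo(vocab=5000)
out = enc(torch.randint(0, 5000, (2, 7)), torch.randn(2, 7, 1024))
```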

3.1 Bidirectional Language Models

Given a sequence of $N$ tokens $(t_1, t_2, \dots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of token $t_k$ given its history $(t_1, \dots, t_{k-1})$:

$$p(t_1, t_2, \dots, t_N) = \prod_{k = 1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$

A backward language model is analogous, running over the sequence in reverse and predicting the previous token given the future context:

$$p(t_1, t_2, \dots, t_N) = \prod_{k = 1}^{N} p(t_k \mid t_{k+1}, \dots, t_N)$$
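As a toy illustration of the forward factorization above (not the paper's biLM, which uses character-CNN inputs and two large LSTM layers), the sketch below scores a sequence with a left-to-right LSTM by summing per-token conditional log-probabilities; the backward direction would simply process the sequence in reverse. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardLM(nn.Module):
    """Toy forward LM: p(t_k | t_1 .. t_{k-1}) from a left-to-right LSTM."""
    def __init__(self, vocab: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def sequence_log_prob(self, tokens):
        # tokens: (batch, N); predict t_k from the prefix t_1 .. t_{k-1}
        h, _ = self.lstm(self.emb(tokens[:, :-1]))       # hidden states over prefixes
        log_p = F.log_softmax(self.out(h), dim=-1)       # (batch, N-1, vocab)
        tgt = tokens[:, 1:].unsqueeze(-1)                 # the next tokens t_2 .. t_N
        # log p(t_1..t_N) = sum_k log p(t_k | t_1..t_{k-1})
        # (the t_1 term is covered if the sequence starts with a BOS token)
        return log_p.gather(-1, tgt).squeeze(-1).sum(dim=-1)

lm = ForwardLM(vocab=1000)
print(lm.sequence_log_prob(torch.randint(0, 1000, (2, 8))))  # (batch,) log-probabilities
```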
