Deep Contextualized Word Representations
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, et al., Deep Contextualized Word Representations, NAACL (2018)
Abstract
Deep contextualized word representations model:
(1) complex characteristics of word use, such as syntax and semantics;
(2) how these uses vary across linguistic contexts, i.e., polysemy.
The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).
Evaluation tasks: question answering, textual entailment, sentiment analysis, and others.
1 Introduction
Ideally, pre-trained word representations should model:
(1) complex characteristics of word use, such as syntax and semantics;
(2) how these uses vary across linguistic contexts, i.e., polysemy.
This paper proposes a deep contextualized word representation:
(1) each token is assigned a representation that is a function of the entire input sentence;
(2) the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.
The method is called ELMo (Embeddings from Language Models). The representations are deep in the sense that they are a function of all of the internal layers of the biLM. In addition, each end task can learn its own linear combination of the vectors stacked above each input word.
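The task-specific combination described here (softmax-normalized scalar weights over the stacked layer vectors, times a task-specific scale) can be sketched as follows. The function and variable names are illustrative, not from the paper's released code.

```python
import numpy as np

def elmo_vector(layer_states, s_logits, gamma=1.0):
    """Combine biLM layer activations for one token into a single ELMo vector.

    layer_states: (L+1, dim) array -- the token embedding plus each biLM
                  layer's hidden state for one token.
    s_logits:     (L+1,) array of learned task-specific scalars,
                  softmax-normalized below.
    gamma:        learned task-specific scaling factor.
    """
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                          # softmax weights over layers
    return gamma * (s[:, None] * layer_states).sum(axis=0)

# With uniform logits, the result is gamma times the mean over layers.
states = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
v = elmo_vector(states, np.zeros(3), gamma=1.0)
```

In the downstream task, only `s_logits` and `gamma` are trained; the biLM weights stay frozen.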
Higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.
■ In this paper, biLM refers to the ELMo model, and biRNN refers to the downstream task model. ■
2 Related Work
Because they capture syntactic and semantic information of words from large-scale unlabeled text, pretrained word vectors are a standard component of most state-of-the-art NLP architectures, e.g., for question answering, textual entailment, and semantic role labeling. However, earlier approaches assign each word a single context-independent representation.
To overcome this limitation, methods based on subword information, or on learning separate vectors for each word sense, have been proposed. This work benefits from subword units through the use of character convolutions, and seamlessly incorporates multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.
Context-dependent representations: context2vec uses a bidirectional LSTM to encode the context around a pivot word.
3 ELMo: Embeddings from Language Models
ELMo word representations are functions of the entire input sentence:
(1) computed on top of two-layer biLMs with character convolutions;
(2) formed as a linear function of the internal network states.
Pretraining the biLM at large scale enables semi-supervised learning, and the representations are easily incorporated into a wide range of existing neural NLP architectures.
3.1 Bidirectional Language Models
Given a sequence of $N$ tokens, $(t_1, t_2, \dots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of each token $t_k$ given its history $(t_1, \dots, t_{k-1})$:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$
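The factorization above can be checked with a tiny sketch in log space; `cond_prob` here is a hypothetical stand-in for the model's conditional distribution $p(t_k \mid t_1, \dots, t_{k-1})$.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Forward LM factorization: log p(t_1, ..., t_N) is the sum of
    conditional log-probabilities, one per token given its history."""
    return sum(math.log(cond_prob(tokens[:k], tokens[k]))
               for k in range(len(tokens)))

# Toy model assigning probability 0.5 to every token regardless of history:
lp = sequence_log_prob(["t1", "t2", "t3"], lambda history, token: 0.5)
# lp == 3 * log(0.5)
```

The backward LM used by the biLM is the same factorization run in reverse, conditioning each token on its future context $(t_{k+1}, \dots, t_N)$.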