Paper Reading - GloVe: Global Vectors for Word Representation

GloVe is a global word-vector representation model that combines the strengths of global matrix factorization and local context window methods, using word co-occurrence statistics to produce meaningful word vectors. By fitting a weighted least-squares regression to the statistics of a word-word co-occurrence matrix, it addresses shortcomings of existing models; the resulting vectors suit tasks such as word similarity and word analogy.

GloVe: Global Vectors for Word Representation


J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. EMNLP, 2014.


Abstract

Existing methods for learning vector space representations of words capture fine-grained semantic and syntactic regularities through vector arithmetic, but the origin of these regularities has remained opaque.

The paper analyzes and makes explicit the model properties needed for such regularities to emerge in word vectors, arriving at a global log-bilinear regression model. This model combines the advantages of global matrix factorization and local context window methods.

The model efficiently leverages statistical information by training only on the nonzero elements of a word-word co-occurrence matrix, and it produces a vector space with meaningful substructure.

1 Introduction

Semantic vector space models of language represent each word with a real-valued vector.

Evaluating word representations: most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of a set of word representations.
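As a minimal sketch of this evaluation style (the 4-dimensional vectors below are invented purely for illustration, not taken from any trained model):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings, invented for illustration.
vec = {
    "king":  np.array([0.8, 0.3, 0.1, 0.9]),
    "queen": np.array([0.7, 0.9, 0.1, 0.8]),
    "man":   np.array([0.9, 0.2, 0.0, 0.3]),
    "woman": np.array([0.8, 0.8, 0.0, 0.2]),
}

# Intrinsic quality: the angle between semantically related words should be small.
print(cosine_similarity(vec["king"], vec["queen"]))

# Analogy via vector arithmetic: king - man + woman should land near queen.
target = vec["king"] - vec["man"] + vec["woman"]
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine_similarity(target, vec[w]))
print(best)  # "queen" (the only remaining candidate in this toy vocabulary)
```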

Two families of methods for learning word vectors: (1) global matrix factorization methods, such as latent semantic analysis (LSA); (2) local context window methods, such as skip-gram.

Global matrix factorization methods leverage statistical information well but perform relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Local context window methods do better on word analogies but poorly utilize the statistics of the corpus, since they train on separate local context windows instead of on global co-occurrence counts.

2 Related Work

Matrix factorization methods: decompose large matrices that capture statistical information about a corpus, using low-rank approximations to generate low-dimensional word representations.

The corpus-statistics matrix comes in two main forms: (1) term-document matrices, where rows correspond to words or terms and columns correspond to different documents in the corpus; (2) term-term matrices, where both rows and columns correspond to words and each entry is the number of times a given word occurs in the context of another given word. A toy example of form (1) follows below.
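A sketch of a term-document count matrix over a three-document corpus invented here for illustration; form (2), the term-term matrix, is exactly the matrix $\mathbf{X}$ built in the Section 3 sketch below:

```python
from collections import Counter

# Three toy "documents", invented for illustration.
docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]

vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Term-document matrix: rows = terms, columns = documents;
# entry (i, j) = how often term i occurs in document j.
term_doc = [[0] * len(docs) for _ in vocab]
for j, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        term_doc[row[w]][j] = c

for w in vocab:
    print(f"{w:5s} {term_doc[row[w]]}")
```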

Shallow window-based methods: learn word representations that aid in making predictions within local context windows, e.g. skip-gram and CBOW (continuous bag-of-words), and the closely related vector log-bilinear models vLBL and ivLBL.

The objective of skip-gram and ivLBL is to predict a word's context given the word itself; the objective of CBOW and vLBL is to predict a word given its context.
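A minimal sketch of how the two prediction directions carve the same token stream into training examples (window size and sentence are arbitrary choices for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center -> context word) pair per neighbor."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def cbow_pairs(tokens, window=2):
    """CBOW: one (context words -> center) example per position."""
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        yield ctx, center

tokens = "the quick brown fox jumps".split()
print(list(skipgram_pairs(tokens))[:4])  # [('the', 'quick'), ('the', 'brown'), ...]
print(list(cbow_pairs(tokens))[:2])      # [(['quick', 'brown'], 'the'), ...]
```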

3 The GloVe Model

Statistics of word occurrences in a corpus are the primary source of information available to unsupervised methods for learning word representations. Two questions are central: (1) how meaning is generated from these statistics; (2) how the resulting word vectors might represent that meaning.

The GloVe model: a word vector model trained directly on the global corpus statistics.

Let $\mathbf{X}$ be the matrix of word-word co-occurrence counts, where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$. Let $X_{i} = \sum_{k} X_{ik}$ be the total number of context occurrences for word $i$, and let $P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_{i}}$ be the probability that word $j$ appears in the context of word $i$.
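A small sketch of these definitions on an invented toy corpus (for simplicity this counts raw occurrences in a symmetric window; the paper additionally decays each count by $1/d$ with the word distance $d$, omitted here):

```python
from collections import defaultdict

# Toy corpus invented for illustration.
tokens = "ice is solid and cold while steam is gas and hot".split()
window = 2

# X[i][j]: number of times word j occurs in the symmetric window context of word i.
X = defaultdict(lambda: defaultdict(float))
for i, wi in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            X[wi][tokens[j]] += 1.0

def P(j_word, i_word):
    """P(j | i) = X_ij / X_i: probability of j in the context of i."""
    X_i = sum(X[i_word].values())  # X_i = sum_k X_ik
    return X[i_word][j_word] / X_i

print(P("solid", "ice"))  # relative frequency of "solid" near "ice"
```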

[Table 1 of the paper: co-occurrence probabilities and their ratios for the target words "ice" and "steam" with probe words such as "solid", "gas", "water", "fashion".]
Table 1 suggests that ratios of co-occurrence probabilities, rather than the raw probabilities themselves, are the appropriate starting point for learning word vectors. The most general model therefore takes the form

$$F( \mathbf{w}_{i}, \mathbf{w}_{j}, \tilde{\mathbf{w}}_{k} ) = \frac{P_{ik}}{P_{jk}} \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^{d}$ are word vectors and $\tilde{\mathbf{w}} \in \mathbb{R}^{d}$ are separate context word vectors.
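A quick numeric check of equation (1), using probability values approximately as reported in Table 1 of the paper for target words $i =$ ice and $j =$ steam:

```python
# Probabilities approximately as reported in Table 1 of the paper
# (target word i = "ice", target word j = "steam"; probe words k below).
P_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
P_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for k in P_ice:
    r = P_ice[k] / P_steam[k]  # the ratio P_ik / P_jk of equation (1)
    print(f"{k:8s} P_ik/P_jk = {r:.2f}")
# solid   -> large  (k related to ice but not steam)
# gas     -> small  (k related to steam but not ice)
# water   -> near 1 (k related to both)
# fashion -> near 1 (k related to neither)
```

The ratio cleanly separates relevant probe words from irrelevant ones, which raw probabilities alone do not.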

  1. The function $F$
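A sketch of how the paper pins down $F$ and reaches its training objective; all steps and formulas below follow the original paper ($b_{i}$, $\tilde{b}_{k}$ are the bias terms it introduces). Since vector spaces are inherently linear, $F$ is first restricted to depend on the difference of the two target word vectors, with its arguments combined by a dot product:

$$F\big( (\mathbf{w}_{i} - \mathbf{w}_{j})^{T} \tilde{\mathbf{w}}_{k} \big) = \frac{P_{ik}}{P_{jk}}$$

Requiring $F$ to be a homomorphism between $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$ forces $F = \exp$, so that

$$\mathbf{w}_{i}^{T} \tilde{\mathbf{w}}_{k} = \log P_{ik} = \log X_{ik} - \log X_{i}$$

Absorbing $\log X_{i}$ into a bias $b_{i}$, and adding $\tilde{b}_{k}$ to restore symmetry, gives

$$\mathbf{w}_{i}^{T} \tilde{\mathbf{w}}_{k} + b_{i} + \tilde{b}_{k} = \log X_{ik}$$

which GloVe fits as a weighted least-squares problem over all co-occurring pairs:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big( \mathbf{w}_{i}^{T} \tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij} \big)^{2}$$

where $V$ is the vocabulary size and $f(x) = (x/x_{\max})^{\alpha}$ for $x < x_{\max}$ (else $1$) is the weighting function; the paper uses $x_{\max} = 100$ and $\alpha = 3/4$.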
