Paper Reading - From Word Embeddings To Document Distances

K5niper

Article link: https://blog.csdn.net/zhaoyin214/article/details/102773677

From Word Embeddings To Document Distances

M. J. Kusner, Y. Sun, N. I. Kolkin, K. Q. Weinberger, From Word Embeddings To Document Distances, ICML (2015)

Abstract

Word embedding: learns semantically meaningful representations for words from their local co-occurrences in sentences.

Word Mover's Distance (WMD): a distance function between text documents, built on word embeddings. WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document.

The WMD metric has no hyperparameters.

1 Introduction

The two most common ways to represent documents:

Bag of words (BOW);
Term frequency - inverse document frequency (TF-IDF).

Because the BOW (TF-IDF) representations of different documents are frequently near-orthogonal, neither is well suited to measuring document distances. In addition, neither captures the distance between individual words.
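The near-orthogonality problem can be seen in a minimal sketch. The two example sentences say nearly the same thing, yet share no words once stop words are removed, so their BOW cosine similarity is zero. The stop-word list and the helper names `bow` and `cosine` are assumptions made for this illustration.

```python
import math
from collections import Counter

# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"the", "to", "in"}

def bow(text):
    """Count non-stop-word tokens in a lower-cased, whitespace-split text."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc_a = bow("Obama speaks to the media in Illinois")
doc_b = bow("The President greets the press in Chicago")
similarity = cosine(doc_a, doc_b)
print(similarity)  # 0.0
```

The BOW vectors are exactly orthogonal here, although a reader would judge the two sentences to be close in meaning.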

Latent low-dimensional representations of documents:

Latent Semantic Indexing (LSI): eigendecomposes the BOW feature space;
Latent Dirichlet Allocation (LDA), a topic model: probabilistically groups similar words into topics and represents documents as distributions over these topics.

Semantic relationships are often preserved in vector operations on word vectors; that is, distances between embedded word vectors are to some degree semantically meaningful. The paper represents a text document as a weighted point cloud of embedded words. The Word Mover's Distance (WMD) between two text documents $A$ and $B$ is defined as the minimum cumulative distance that words from document $A$ need to travel to match the point cloud of document $B$ (Fig. 1).

The WMD optimization problem is a special case of the Earth Mover's Distance (EMD) transportation problem. The paper gives several lower bounds, which can be used to approximate WMD or to prune away documents that are provably not among the $k$-nearest neighbors of a query.

Properties of WMD: (1) hyperparameter free; (2) highly interpretable, since the distance between two documents can be broken down and explained as the sparse distances between a few individual words; (3) high retrieval accuracy.

2 Related Work

Okapi BM25

LDA

LSI

TextTiling-EMD

Stacked Denoising Autoencoders （SDA）、mSDA

Componential Counting Grid

3 Word2Vec Embedding

word2vec: a word-embedding procedure that uses a (shallow) neural network language model to learn vector representations of words.

Skip-gram model: consists of an input layer, a projection layer, and an output layer, and is used to predict nearby words. The word vectors are trained by maximizing the log probability of neighboring words in a corpus. Given a sequence of words $w_{1}, \cdots, w_{T}$, the objective is:

$\frac{1}{T} \sum_{t = 1}^{T} \sum_{j \in nb(t)} \log p(w_{j} | w_{t})$

where $nb(t)$ is the set of neighboring words of word $w_{t}$, and $p(w_{j} | w_{t})$ is the hierarchical softmax of the associated word vectors $\mathbf{v}_{w_{j}}$ and $\mathbf{v}_{w_{t}}$. Thanks to its simple architecture and the use of the hierarchical softmax, the skip-gram model can be trained on a single machine on billions of words per hour using a conventional desktop computer, so it can learn complex word relationships.
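To make the objective concrete, here is a rough sketch that evaluates it on a toy sequence. It departs from the model described above in two labeled ways: it uses a plain softmax over dot products instead of the hierarchical softmax, and it keeps a single vector per word. The vectors, the `window` parameter, and the function names are made up for illustration.

```python
import math

def softmax_prob(target, context, vectors):
    """p(context | target) under a plain softmax (assumption: not hierarchical)."""
    def score(w):
        return sum(a * b for a, b in zip(vectors[w], vectors[target]))
    denom = sum(math.exp(score(w)) for w in vectors)
    return math.exp(score(context)) / denom

def skipgram_objective(words, vectors, window=1):
    """Average over t of the summed log p(w_j | w_t) for neighbors j of t."""
    total = 0.0
    for t, w_t in enumerate(words):
        lo, hi = max(0, t - window), min(len(words), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # nb(t): positions within `window` of t, excluding t
                total += math.log(softmax_prob(w_t, words[j], vectors))
    return total / len(words)

# Toy 2-dimensional vectors, made up for illustration.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
objective = skipgram_objective(["a", "b", "c", "a"], vectors)
print(objective)
```

Training would adjust the vectors to push this value up. The plain softmax sums over the whole vocabulary for every prediction, which is the cost the hierarchical softmax is meant to avoid.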

4 Word Mover's Distance

Let $\mathbf{X} \in \R^{d \times n}$ be a word2vec embedding matrix for a vocabulary of $n$ words, whose $i$-th column $\mathbf{x}_{i} \in \R^{d}$ is the embedding of word $i$ in $d$-dimensional space. Text documents are assumed to be represented as normalized bag-of-words (nBOW) vectors $\mathbf{d} \in \R^{n}$: if word $i$ appears $c_{i}$ times in the document, then $d_{i} = \frac{c_{i}}{\sum_{j = 1}^{n} c_{j}}$. An nBOW vector $\mathbf{d}$ is typically very sparse.
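The nBOW definition above translates directly into a few lines of code. This is a minimal sketch: the vocabulary, the token list, and the function name `nbow` are made up, and it assumes the document contains at least one in-vocabulary word.

```python
from collections import Counter

def nbow(tokens, vocabulary):
    """Return the nBOW vector d with d_i = c_i / sum_j c_j over `vocabulary`."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())  # assumed to be nonzero
    return [counts[w] / total for w in vocabulary]

vocabulary = ["obama", "president", "speaks", "greets", "media", "press"]
d = nbow(["obama", "speaks", "media", "obama"], vocabulary)
print(d)  # [0.5, 0.0, 0.25, 0.0, 0.25, 0.0]
```

The entries are nonnegative and sum to 1. With a realistic vocabulary, almost all of them would be zero, which is the sparsity noted above.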

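nBOW representation

The vector $\mathbf{d}$ can be viewed as a point on the $(n - 1)$-dimensional simplex of word distributions. Two documents with different unique words lie in different regions of this simplex, yet the two documents may still be semantically close.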

Word travel cost

The paper incorporates the semantic similarity between individual word pairs into the document distance metric. Word dissimilarity is measured by the Euclidean distance in the word2vec embedding space. The distance between word $i$ and word $j$ is $\| \mathbf{x}_{i} - \mathbf{x}_{j} \|_{2}$, which represents the cost associated with "traveling" from one word to another.
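As a small illustration of the travel cost, the sketch below computes the Euclidean distance between two embedding vectors. The 2-dimensional embeddings are invented so the arithmetic is easy to check; they are not real word2vec output, and `travel_cost` is a name chosen here.

```python
import math

def travel_cost(x_i, x_j):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Toy 2-dimensional embeddings, made up for illustration.
embeddings = {
    "obama": [0.0, 0.0],
    "president": [3.0, 4.0],
    "banana": [6.0, 8.0],
}
cost = travel_cost(embeddings["obama"], embeddings["president"])
print(cost)  # 5.0
```

The cost is symmetric, and a word costs nothing to move to itself.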

Document distance

(1) Let $\mathbf{d}$ and $\mathbf{d}^{\prime}$ be the nBOW representations of two documents on the $(n - 1)$-dimensional simplex.

(2) Assume that each word $i$ in $\mathbf{d}$ can be mapped to $\mathbf{d}^{\prime}$ in whole or in part.
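One way to picture "in whole or in part" is a nonnegative flow matrix, called `T` here, where `T[i][j]` is the share of word $i$ in $\mathbf{d}$ that moves to word $j$ in $\mathbf{d}^{\prime}$. The sketch below checks the usual transportation constraints (row $i$ sums to $d_{i}$, column $j$ sums to $d^{\prime}_{j}$) and computes the cumulative distance of one flow. The notes break off before stating these constraints, so treat them, the toy data, and all function names as assumptions. The sketch does not solve the minimization itself.

```python
import math

def travel_cost(x_i, x_j):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def is_valid_flow(T, d, d_prime, tol=1e-9):
    """Check T >= 0, row i sums to d[i], and column j sums to d_prime[j]."""
    rows_ok = all(abs(sum(T[i]) - d[i]) <= tol for i in range(len(d)))
    cols_ok = all(abs(sum(row[j] for row in T) - d_prime[j]) <= tol
                  for j in range(len(d_prime)))
    nonneg = all(v >= 0 for row in T for v in row)
    return rows_ok and cols_ok and nonneg

def flow_cost(T, X):
    """Cumulative distance: sum over i, j of T[i][j] * ||x_i - x_j||_2."""
    n = len(X)
    return sum(T[i][j] * travel_cost(X[i], X[j])
               for i in range(n) for j in range(n))

# Toy embeddings for a 3-word vocabulary (made up for illustration).
X = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
d = [1.0, 0.0, 0.0]        # document containing only word 0
d_prime = [0.0, 0.5, 0.5]  # document containing words 1 and 2 equally

# Word 0 is split: half of its mass goes to word 1, half to word 2.
T = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
cost = flow_cost(T, X)
print(is_valid_flow(T, d, d_prime), cost)  # True 3.0
```

In this toy case the constraints admit only this one flow, so 3.0 is also the minimum cumulative distance. In general, many flows are valid and finding the cheapest one requires a transportation (linear programming) solver.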
