Paper reading notes: "A Simple but Tough-to-Beat Baseline for Sentence Embeddings"

chloe_au_yeung

Article link: https://blog.csdn.net/chloe_ou/article/details/82851473

"A Simple but Tough-to-Beat Baseline for Sentence Embeddings" was published at ICLR 2017.

ICLR papers are consistently full of fresh ideas, and I expect the conference's standing to keep growing.

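The paper proposes a simple sentence embedding method that nonetheless outperforms many existing methods. The authors call it the WR method: W stands for weighted average, and R stands for removing a special direction that is derived from a generative model of text. In short, the model takes pretrained word embeddings as input. Given those word embeddings and a sentence s, it computes the embedding of s as a weighted average of the word vectors in s, and then uses principal component analysis to remove the special direction.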
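The two steps can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation (their code is linked at the end of these notes): the function names, the `word_vectors` and `word_prob` dictionaries, and the default `a=1e-3` are all hypothetical, words without a frequency estimate are given weight 1, and the "special direction" is taken to be the first right singular vector of the uncentered sentence matrix.

```python
import numpy as np

def weighted_average(tokens, word_vectors, word_prob, a=1e-3):
    """W step: average the word vectors of one sentence with weight a / (a + p(w))."""
    known = [w for w in tokens if w in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not known:
        return np.zeros(dim)
    # Assumption: a word with no frequency estimate is treated as p(w) = 0 (weight 1).
    weights = np.array([a / (a + word_prob.get(w, 0.0)) for w in known])
    vectors = np.array([word_vectors[w] for w in known], dtype=float)
    return weights @ vectors / len(known)

def remove_first_component(embeddings):
    """R step: subtract each row's projection onto the dominant direction."""
    # Rows of vt are right singular vectors; vt[0] is the dominant one.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    u = vt[0]
    return embeddings - np.outer(embeddings @ u, u)

def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Apply the W step to every tokenized sentence, then the R step to the batch."""
    raw = np.array([weighted_average(s, word_vectors, word_prob, a) for s in sentences])
    return remove_first_component(raw)
```

The direction is estimated from the whole batch of sentences, so the embedding of a single sentence depends on which other sentences it is processed with.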

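The sentence embedding is computed as the maximum likelihood estimate (MLE) of c_s. In the authors' words, c_s "represents what is being talked about"; my understanding is that it is a vector carrying semantic content that captures the main topic of the sentence. To model sentence semantics better, the authors add two "smoothing terms" to the generative model: an additive term proportional to the word frequency p(w), which lets a word appear regardless of the topic, and a common vector c_0 that is shared across sentences.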
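For reference, the smoothed model can be written out. The formulas below are reconstructed from the paper rather than from these notes, so the notation (alpha, beta, c_0, Z) should be checked against the original before relying on it:

```latex
\Pr[\, w \text{ emitted in } s \mid c_s \,]
  = \alpha\, p(w) + (1-\alpha)\, \frac{\exp(\langle \tilde{c}_s, v_w \rangle)}{Z_{\tilde{c}_s}},
\qquad
\tilde{c}_s = \beta c_0 + (1-\beta)\, c_s, \quad c_0 \perp c_s

% Approximate MLE of \tilde{c}_s, which gives the weighted average:
\tilde{c}_s \;\propto\; \sum_{w \in s} \frac{a}{p(w) + a}\, v_w,
\qquad a = \frac{1-\alpha}{\alpha Z}
```

The first smoothing term, alpha p(w), is what produces the weight a/(a+p(w)); the second, the common vector c_0, is the direction that the R step removes.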

Section 3.1, on why the method is reasonable:

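This section shows that word2vec with sub-sampling in effect puts a weight on the direction (the gradient) in which the vector of w is updated, and experiments show that the word2vec weight (w = sigma(q*v)) is similar to the weight in this paper's model (w = a/(a+p(w))). (But one is a sentence embedding and the other is a word embedding, so can they be compared directly? A word embedding is a weighted average of the embeddings of the preceding n-1 words, and a sentence is made up of words, so the two can be compared by analogy. That is roughly how I understand it.)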
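The similarity can be checked numerically in a rough way. The sketch below assumes the keep-probability min(1, sqrt(t/p) + t/p) for word2vec sub-sampling, which differs from the shorthand used in the note above; the values `a=1e-3` and `t=1e-5` and the function names are illustrative assumptions, not the settings behind the paper's comparison.

```python
import numpy as np

def sif_weight(p, a=1e-3):
    """Weight from this paper's model: a / (a + p(w))."""
    return a / (a + p)

def word2vec_keep_prob(p, t=1e-5):
    """Assumed sub-sampling keep-probability: min(1, sqrt(t/p) + t/p)."""
    return np.minimum(1.0, np.sqrt(t / p) + t / p)

# Unigram probabilities from very rare (1e-7) to very frequent (1e-1).
probs = np.logspace(-7, -1, 7)
for p, s, k in zip(probs, sif_weight(probs), word2vec_keep_prob(probs)):
    print(f"p(w)={p:.0e}  sif_weight={s:.4f}  word2vec_keep={k:.4f}")
```

Both quantities stay close to 1 for rare words and fall toward 0 for frequent ones, which is the qualitative behaviour the comparison relies on; how closely the two curves match depends on the choice of a and t.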

For details on word2vec, see https://www.cnblogs.com/peghoty/p/3857839.html (very useful).

Original code: https://github.com/PrincetonML/SIF