- Introduction
Distributed Representations of Sentences and Documents is another influential work by Mikolov following word2vec: it represents whole pieces of text as vectors.
Representing text as vectors is a prerequisite for many text-processing algorithms (text classification, clustering, and so on). The simplest and most intuitive approach is bag-of-words (BOW): split the text into words, treat each word as one dimension of the vector space, and use each word's frequency in the text as the value of the corresponding dimension. BOW has two drawbacks: it ignores the order in which words appear in the text, and it captures no semantic information about the words. Another approach, bag-of-n-grams, does take word order into account, but it inflates the dimensionality and worsens data sparsity.
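The BOW construction above can be sketched in a few lines of pure Python (a toy illustration; the whitespace tokenizer and the `bow_vectors` helper are my own, not from the paper):

```python
from collections import Counter

def bow_vectors(docs):
    """Map each document to a term-frequency vector over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({w for toks in tokenized for w in toks})  # one dimension per word
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # value of each dimension = frequency of that word in the document
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vecs = bow_vectors(["the cat sat", "the cat the dog"])
# vocab: ['cat', 'dog', 'sat', 'the']
# vecs:  [[1, 0, 1, 1], [1, 1, 0, 2]]
```

Note that any permutation of a document's words yields the same vector, which is exactly the loss of word order mentioned above.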
To bring semantics in directly, suppose word vectors are already available and encode word meaning. A straightforward method is then to take a weighted average of the vectors of the words contained in a document, and use the resulting vector as the document vector (Mitchell & Lapata, 2010; Zanzotto et al., 2010; Yessenalina & Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013c).
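A minimal sketch of that weighted average, assuming a pre-trained word-vector lookup table is given (the `word_vecs` dict, the uniform default weights, and the tiny 2-d toy vectors are all illustrative assumptions):

```python
def average_word_vectors(words, word_vecs, weights=None):
    """Weighted average of the vectors of the words in a document.

    word_vecs: dict mapping word -> vector (assumed pre-trained, e.g. by word2vec).
    weights:   optional dict of per-word weights; defaults to a plain average.
    """
    weights = weights or {}
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    norm = 0.0
    for w in words:
        if w in word_vecs:  # out-of-vocabulary words are simply skipped
            wt = weights.get(w, 1.0)
            total = [t + wt * v for t, v in zip(total, word_vecs[w])]
            norm += wt
    return [t / norm for t in total]

toy_vecs = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}  # toy 2-d "embeddings"
doc_vec = average_word_vectors(["good", "movie"], toy_vecs)
# doc_vec: [0.5, 0.5]
```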
A more sophisticated approach organizes a sentence as a matrix rather than a vector, following the word order given by the sentence's parse tree (Socher et al., 2011b). Because the method hinges on parsing the sentence, it is limited to sentences and does not extend to documents.
- PV-DM
Figure 2. A framework for learning paragraph vector. This framework is similar to the framework presented in Figure 1; the only change is the additional paragraph token that is mapped to a vector via matrix D. In this model, the concatenation or average of this vector with a context of three words is used to predict the fourth word. The paragraph vector represents the missing information from the current context and can act as a memory of the topic of the paragraph.
- PV-DBOW
Figure 3. Distributed Bag of Words version of paragraph vectors. In this version, the paragraph vector is trained to predict the words in a small window.
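In PV-DBOW the paragraph vector is the only input: at each step a word is sampled from a text window of the paragraph and the model is trained to predict it from the paragraph vector alone. A minimal one-step sketch with a full softmax (toy random weights and the learning rate are my assumptions; the paper trains with hierarchical softmax or negative sampling instead):

```python
import math
import random

random.seed(1)

VOCAB = ["the", "cat", "sat", "on", "mat"]
DIM = 4

d = [random.uniform(-0.5, 0.5) for _ in range(DIM)]  # the paragraph vector
U = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in VOCAB}  # output weights

def softmax_probs(vec):
    scores = {w: sum(a * b for a, b in zip(vec, U[w])) for w in VOCAB}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

def dbow_step(target, lr=0.1):
    """One SGD step: the paragraph vector alone predicts a sampled window word."""
    global d
    probs = softmax_probs(d)
    # gradient of cross-entropy w.r.t. d:  sum_w (p_w - [w == target]) * U[w]
    grad = [0.0] * DIM
    for w in VOCAB:
        err = probs[w] - (1.0 if w == target else 0.0)
        grad = [g + err * u for g, u in zip(grad, U[w])]
    d = [di - lr * g for di, g in zip(d, grad)]
    return probs[target]

before = dbow_step("cat")           # p("cat") before the update
after = softmax_probs(d)["cat"]     # p("cat") after one gradient step
```

Repeating such steps over all windows of a paragraph forces the paragraph vector to encode which words the paragraph contains.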
- References
- Le, Quoc & Mikolov, Tomas. Distributed Representations of Sentences and Documents. ICML 2014.