"Distributed Representations of Sentences and Documents"
Quoc Le and Tomas Mikolov, 2014
1. Distributed Memory Model of Paragraph Vectors (PV-DM)
1.1 Model Architecture
PV-DM is similar to the CBOW model in word2vec: it predicts the current word from its context.
In the PV-DM model, the matrix W is the word-vector matrix and the matrix D is the paragraph-vector matrix.
Each paragraph is mapped to a unique vector in D, and each word is likewise mapped to a unique vector in W.
The paragraph vector and the context word vectors are combined, either by averaging or by concatenation, to predict the target word.
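To make the combination step concrete, here is a minimal pure-Python sketch of one PV-DM forward pass under the averaging variant. The toy vocabulary, dimensions, and the softmax output layer with tied weights are illustrative assumptions, not gensim's actual implementation:

```python
import math
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)
DIM = 4                                     # embedding dimensionality
N_PARAGRAPHS = 2

# W: one row per word; D: one row per paragraph (randomly initialized here).
W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
D = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(N_PARAGRAPHS)]

def pv_dm_predict(paragraph_id, context_word_ids):
    """Average the paragraph vector with the context word vectors,
    then softmax over the vocabulary to score each candidate target word."""
    vectors = [D[paragraph_id]] + [W[i] for i in context_word_ids]
    hidden = [sum(col) / len(vectors) for col in zip(*vectors)]  # average
    # Output layer: dot product with every word vector (weight tying is an
    # illustrative simplification), then softmax.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

context = [VOCAB.index("the"), VOCAB.index("cat"), VOCAB.index("on")]
probs = pv_dm_predict(0, context)
print(probs)
```

In the concatenation variant, `hidden` would instead be the vectors joined end to end (so the output layer needs a wider weight matrix); gensim selects between the two behaviors via configuration rather than separate code paths.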
The context here is a fixed-length span sampled from a sliding window over the current paragraph (in the code this sampling appears to be implemented via reduced_window; see the code reading below). The paragraph vector is shared only within the same paragraph, while the word vectors are shared across all paragraphs.
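The reduced_window trick shrinks the effective window for each target position by a random amount, which is equivalent to weighting nearby words more heavily than distant ones. A rough sketch of the indexing (variable names and the toy sentence are chosen for illustration; gensim's pure-Python path does essentially this):

```python
import random

random.seed(42)

doc_words = ["i", "like", "paragraph", "vectors", "a", "lot"]
window = 3  # maximum window size on each side

for pos, word in enumerate(doc_words):
    # Shrink the window by a random amount in [0, window - 1]; words close
    # to the target are therefore included more often than distant ones.
    reduced_window = random.randint(0, window - 1)
    start = max(0, pos - window + reduced_window)
    end = pos + window + 1 - reduced_window
    context = [w for i, w in enumerate(doc_words[start:end], start) if i != pos]
    print(word, "->", context)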
1.2 Reading the Relevant Code
The relevant Doc2Vec DM-model code in gensim 3.8.0 is walked through below. (If you have previously studied the Word2vec source code, the doc2vec source will be easier to follow.)
- Computing the context vector by averaging
def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
learn_doctags=True, learn_words=True, learn_hidden=True,
word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
"""Update distributed memory model ("PV-DM") by training on a single document.
Called internally from :meth:`~gensim.models.doc2vec.Doc2Vec.train` and
:meth:`~gensim.models.doc2vec.Doc2Vec.infer_vector`. This method implements
the DM model with a projection (input) layer that is either the sum or mean of
the context vectors, depending on the model's `dm_mean` configuration field.
Notes
-----
This is the non-optimized, Python version. If you have cython installed, gensim
will use the optimized version from :mod:`gensim.models.doc2vec_inner` instead.
Parameters
----------
model : :class:`~gensim.models.doc2vec.Doc2Vec`
The model to train.
doc_words : list of str
The input document as a list of words to be used for training. Each word will be looked up in
the model's vocabulary.
doctag_indexes : list of int
Indices into `doctag_vectors` used to obtain the tags of the document.
alpha : float
Learning rate.
work : object
UNUSED.
neu1 : object
UNUSED.
learn_doctags : bool, optional
Whether the tag vectors should be updated.
learn_words : bool, optional
Word vectors will be updated exactly as per Word2Vec skip-gram training only if **both**
`learn_words` and `train_words` are set to True.
learn_hidden : bool, optional
Whether or not the weights of the hidden layer will be updated.
word_vectors : iterable of list of float, optional
Vector representations of each word in the model's vocabulary.
word_locks : list of float, optional
Lock factors for each word in the vocabulary.
doctag_vectors : list of list of float, optional
Vector representations of the tags. If None, these will be retrieved from the model.
doctag_locks : list of float, optional
The lock factors for each tag.
Returns
-------
int
Number of words in the input document that were actually used for training (they were found in the
vocabulary and they were not discarded by negative sampling).