Doc2vec: Paper Reading and Source Code Notes

"Distributed Representations of Sentences and Documents"

Quoc Le and Tomas Mikolov, 2014

1. Distributed Memory Model of Paragraph Vectors (PV-DM).

1.1 Model Architecture

It is somewhat similar to the CBOW model in word2vec: the current word is predicted from its context.
[Figure: PV-DM model architecture, from the paper]

In the PV-DM model, the matrix W is the word-vector matrix and the matrix D is the paragraph-vector matrix.

Each paragraph is mapped to a unique vector in D, and each word is likewise mapped to a unique vector in W.

The paragraph vector and the context word vectors are combined, either by averaging (average) or by concatenation (concatenate), and the combined vector is used to predict the target word.
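As a rough sketch of the two combination schemes (toy dimensions and indices here are illustrative, not the gensim implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, n_paragraphs = 4, 10, 3

W = rng.normal(size=(vocab_size, dim))    # word-vector matrix W
D = rng.normal(size=(n_paragraphs, dim))  # paragraph-vector matrix D

paragraph_id = 0
context_word_ids = [2, 5, 7]  # word indices inside the sampled window

# "average": mean of the paragraph vector and the context word vectors
h_avg = np.mean(np.vstack([D[paragraph_id], W[context_word_ids]]), axis=0)

# "concatenate": one long input vector, so the classifier's input size
# grows with the window length
h_cat = np.concatenate([D[paragraph_id], W[context_word_ids].ravel()])

print(h_avg.shape)  # (4,)
print(h_cat.shape)  # (16,)
```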

The context here is of fixed length, sampled from a sliding window over the current paragraph (in the code this sampling appears to be implemented via reduced_window; see the code reading below). The paragraph vector is shared only within the same paragraph, while word vectors are shared across paragraphs.
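A simplified sketch of that reduced_window sampling (the function name and toy sentence are illustrative; gensim draws the shrink amount from the model's own random state):

```python
import random

def sample_context(words, pos, window):
    """Return context indices around `pos`, with the effective window
    shrunk by a random amount -- the reduced_window trick."""
    reduced_window = random.randrange(window)       # in [0, window)
    start = max(0, pos - window + reduced_window)
    stop = pos + window + 1 - reduced_window
    return [i for i in range(start, min(stop, len(words))) if i != pos]

words = "the quick brown fox jumps over the lazy dog".split()
print(sample_context(words, pos=4, window=3))
```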

1.2 Reading the Related Code

A reading of the Doc2vec DM-model code in gensim 3.8.0 follows (if you have studied the Word2vec source code before, the doc2vec source will be easier to understand).

  • Computing the context vector by averaging
    def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
                          learn_doctags=True, learn_words=True, learn_hidden=True,
                          word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
        """Update distributed memory model ("PV-DM") by training on a single document.

        Called internally from :meth:`~gensim.models.doc2vec.Doc2Vec.train` and
        :meth:`~gensim.models.doc2vec.Doc2Vec.infer_vector`. This method implements
        the DM model with a projection (input) layer that is either the sum or mean of
        the context vectors, depending on the model's `dm_mean` configuration field.

        Notes
        -----
        This is the non-optimized, Python version. If you have cython installed, gensim
        will use the optimized version from :mod:`gensim.models.doc2vec_inner` instead.

        Parameters
        ----------
        model : :class:`~gensim.models.doc2vec.Doc2Vec`
            The model to train.
        doc_words : list of str
            The input document as a list of words to be used for training. Each word will be looked up in
            the model's vocabulary.
        doctag_indexes : list of int
            Indices into `doctag_vectors` used to obtain the tags of the document.
        alpha : float
            Learning rate.
        work : object
            UNUSED.
        neu1 : object
            UNUSED.
        learn_doctags : bool, optional
            Whether the tag vectors should be updated.
        learn_words : bool, optional
            Word vectors will be updated exactly as per Word2Vec skip-gram training only if **both**
            `learn_words` and `train_words` are set to True.
        learn_hidden : bool, optional
            Whether or not the weights of the hidden layer will be updated.
        word_vectors : iterable of list of float, optional
            Vector representations of each word in the model's vocabulary.
        word_locks : list of float, optional
            Lock factors for each word in the vocabulary.
        doctag_vectors : list of list of float, optional
            Vector representations of the tags. If None, these will be retrieved from the model.
        doctag_locks : list of float, optional
            The lock factors for each tag.

        Returns
        -------
        int
            Number of words in the input document that were actually used for training (they were found in the
            vocabulary and they were not discarded by negative sampling).

        """
        # (Body completed from gensim 3.8.0's pure-Python implementation in
        # doc2vec.py; `np_sum` is numpy.sum and `train_cbow_pair` comes from
        # gensim.models.word2vec, both module-level imports there.)
        if word_vectors is None:
            word_vectors = model.wv.vectors
        if word_locks is None:
            word_locks = model.trainables.vectors_lockf
        if doctag_vectors is None:
            doctag_vectors = model.docvecs.vectors_docs
        if doctag_locks is None:
            doctag_locks = model.trainables.vectors_docs_lockf

        # Keep in-vocabulary words; frequent words may be randomly dropped
        # by down-sampling via each word's sample_int threshold.
        word_vocabs = [model.wv.vocab[w] for w in doc_words if w in model.wv.vocab
                       and model.wv.vocab[w].sample_int > model.random.rand() * 2 ** 32]

        for pos, word in enumerate(word_vocabs):
            # Shrink the window by a random amount: this `reduced_window` is the
            # fixed-length context sampling mentioned in section 1.1.
            reduced_window = model.random.randint(model.window)
            start = max(0, pos - model.window + reduced_window)
            window_pos = enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start)
            word2_indexes = [word2.index for pos2, word2 in window_pos if pos2 != pos]

            # Projection layer: sum the context word vectors and the doctag
            # vectors, then average if `cbow_mean` (i.e. dm_mean) is set.
            l1 = np_sum(word_vectors[word2_indexes], axis=0) + np_sum(doctag_vectors[doctag_indexes], axis=0)
            count = len(word2_indexes) + len(doctag_indexes)
            if model.cbow_mean and count > 1:
                l1 /= count

            neu1e = train_cbow_pair(model, word, word2_indexes, l1, alpha,
                                    learn_vectors=False, learn_hidden=learn_hidden)
            if not model.cbow_mean and count > 1:
                neu1e /= count

            # Back-propagate the error into the doctag and word vectors,
            # scaled by their per-vector lock factors.
            if learn_doctags:
                for i in doctag_indexes:
                    doctag_vectors[i] += neu1e * doctag_locks[i]
            if learn_words:
                for i in word2_indexes:
                    word_vectors[i] += neu1e * word_locks[i]

        return len(word_vocabs)
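For reference, here is a minimal end-to-end use of the DM mode (a sketch; the toy corpus and hyperparameters are illustrative). In the Doc2Vec constructor, dm=1 selects PV-DM, and dm_mean=1 corresponds to the `cbow_mean` flag checked in `train_document_dm` above, switching the projection layer from sum to average:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["machine", "learning", "with", "vectors"], tags=[0]),
    TaggedDocument(words=["paragraph", "vectors", "extend", "word2vec"], tags=[1]),
]

# dm=1 -> PV-DM; dm_mean=1 -> average (not sum) the context/doctag vectors
model = Doc2Vec(corpus, dm=1, dm_mean=1, vector_size=50, window=3,
                min_count=1, epochs=40)

# infer_vector runs the same DM training routine, but updates only the
# new document's vector while freezing the word vectors
vec = model.infer_vector(["paragraph", "vectors"])
print(vec.shape)  # (50,)
```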