Doc2vec: Paper Reading and Source Code Notes

"Distributed Representations of Sentences and Documents"

Quoc Le and Tomas Mikolov, 2014

1. Distributed Memory Model of Paragraph Vectors (PV-DM).

1.1 Model Architecture

It is somewhat similar to the CBOW model in word2vec: the current word is predicted from its context.
[Figure: PV-DM model architecture, from the paper]

In the PV-DM model, the matrix W is the word-vector matrix and the matrix D is the paragraph-vector matrix.

Each paragraph is mapped to a unique vector in D, and each word is likewise mapped to a unique vector in W.

The paragraph vector and the context word vectors are combined, either by averaging (average) or by concatenation (concatenate), and the combined vector is used to predict the target word.
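As a rough sketch of the two combination schemes (toy dimensions and indices here are illustrative, not the gensim implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, n_paragraphs = 4, 10, 3

W = rng.normal(size=(vocab_size, dim))    # word-vector matrix W
D = rng.normal(size=(n_paragraphs, dim))  # paragraph-vector matrix D

paragraph_id = 0
context_word_ids = [2, 5, 7]  # word indices inside the sampled window

# "average": mean of the paragraph vector and the context word vectors
h_avg = np.mean(np.vstack([D[paragraph_id], W[context_word_ids]]), axis=0)

# "concatenate": one long input vector, so the classifier's input size
# grows with the window length
h_cat = np.concatenate([D[paragraph_id], W[context_word_ids].ravel()])

print(h_avg.shape)  # (4,)
print(h_cat.shape)  # (16,)
```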

The context here is of fixed length, sampled from a sliding window over the current paragraph (in the code this sampling appears to be implemented via reduced_window; see the code reading below). The paragraph vector is shared only within the same paragraph, while word vectors are shared across paragraphs.
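A simplified sketch of that reduced_window sampling (the function name and toy sentence are illustrative; gensim draws the shrink amount from the model's own random state):

```python
import random

def sample_context(words, pos, window):
    """Return context indices around `pos`, with the effective window
    shrunk by a random amount -- the reduced_window trick."""
    reduced_window = random.randrange(window)       # in [0, window)
    start = max(0, pos - window + reduced_window)
    stop = pos + window + 1 - reduced_window
    return [i for i in range(start, min(stop, len(words))) if i != pos]

words = "the quick brown fox jumps over the lazy dog".split()
print(sample_context(words, pos=4, window=3))
```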

1.2 Reading the Related Code

A reading of the Doc2vec DM-model code in gensim 3.8.0 follows (if you have studied the Word2vec source code before, the doc2vec source will be easier to understand).

  • Computing the context vector by averaging
    def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
                          learn_doctags=True, learn_words=True, learn_hidden=True,
                          word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
        """Update distributed memory model ("PV-DM") by training on a single document.

        Called internally from :meth:`~gensim.models.doc2vec.Doc2Vec.train` and
        :meth:`~gensim.models.doc2vec.Doc2Vec.infer_vector`. This method implements
        the DM model with a projection (input) layer that is either the sum or mean of
        the context vectors, depending on the model's `dm_mean` configuration field.

        Notes
        -----
        This is the non-optimized, Python version. If you have cython installed, gensim
        will use the optimized version from :mod:`gensim.models.doc2vec_inner` instead.

        Parameters
        ----------
        model : :class:`~gensim.models.doc2vec.Doc2Vec`
            The model to train.
        doc_words : list of str
            The input document as a list of words to be used for training. Each word will be looked up in
            the model's vocabulary.
        doctag_indexes : list of int
            Indices into `doctag_vectors` used to obtain the tags of the document.
        alpha : float
            Learning rate.
        work : object
            UNUSED.
        neu1 : object
            UNUSED.
        learn_doctags : bool, optional
            Whether the tag vectors should be updated.
        learn_words : bool, optional
            Word vectors will be updated exactly as per Word2Vec skip-gram training only if **both**
            `learn_words` and `train_words` are set to True.
        learn_hidden : bool, optional
            Whether or not the weights of the hidden layer will be updated.
        word_vectors : iterable of list of float, optional
            Vector representations of each word in the model's vocabulary.
        word_locks : list of float, optional
            Lock factors for each word in the vocabulary.
        doctag_vectors : list of list of float, optional
            Vector representations of the tags. If None, these will be retrieved from the model.
        doctag_locks : list of float, optional
            The lock factors for each tag.

        Returns
        -------
        int
            Number of words in the input document that were actually used for training (they were found in the
            vocabulary and they were not discarded by negative sampling).

        """
        # (Body completed from gensim 3.8.0's pure-Python implementation in
        # doc2vec.py; `np_sum` is numpy.sum and `train_cbow_pair` comes from
        # gensim.models.word2vec, both module-level imports there.)
        if word_vectors is None:
            word_vectors = model.wv.vectors
        if word_locks is None:
            word_locks = model.trainables.vectors_lockf
        if doctag_vectors is None:
            doctag_vectors = model.docvecs.vectors_docs
        if doctag_locks is None:
            doctag_locks = model.trainables.vectors_docs_lockf

        # Keep in-vocabulary words; frequent words may be randomly dropped
        # by down-sampling via each word's sample_int threshold.
        word_vocabs = [model.wv.vocab[w] for w in doc_words if w in model.wv.vocab
                       and model.wv.vocab[w].sample_int > model.random.rand() * 2 ** 32]

        for pos, word in enumerate(word_vocabs):
            # Shrink the window by a random amount: this `reduced_window` is the
            # fixed-length context sampling mentioned in section 1.1.
            reduced_window = model.random.randint(model.window)
            start = max(0, pos - model.window + reduced_window)
            window_pos = enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start)
            word2_indexes = [word2.index for pos2, word2 in window_pos if pos2 != pos]

            # Projection layer: sum the context word vectors and the doctag
            # vectors, then average if `cbow_mean` (i.e. dm_mean) is set.
            l1 = np_sum(word_vectors[word2_indexes], axis=0) + np_sum(doctag_vectors[doctag_indexes], axis=0)
            count = len(word2_indexes) + len(doctag_indexes)
            if model.cbow_mean and count > 1:
                l1 /= count

            neu1e = train_cbow_pair(model, word, word2_indexes, l1, alpha,
                                    learn_vectors=False, learn_hidden=learn_hidden)
            if not model.cbow_mean and count > 1:
                neu1e /= count

            # Back-propagate the error into the doctag and word vectors,
            # scaled by their per-vector lock factors.
            if learn_doctags:
                for i in doctag_indexes:
                    doctag_vectors[i] += neu1e * doctag_locks[i]
            if learn_words:
                for i in word2_indexes:
                    word_vectors[i] += neu1e * word_locks[i]

        return len(word_vocabs)
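For reference, here is a minimal end-to-end use of the DM mode (a sketch; the toy corpus and hyperparameters are illustrative). In the Doc2Vec constructor, dm=1 selects PV-DM, and dm_mean=1 corresponds to the `cbow_mean` flag checked in `train_document_dm` above, switching the projection layer from sum to average:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["machine", "learning", "with", "vectors"], tags=[0]),
    TaggedDocument(words=["paragraph", "vectors", "extend", "word2vec"], tags=[1]),
]

# dm=1 -> PV-DM; dm_mean=1 -> average (not sum) the context/doctag vectors
model = Doc2Vec(corpus, dm=1, dm_mean=1, vector_size=50, window=3,
                min_count=1, epochs=40)

# infer_vector runs the same DM training routine, but updates only the
# new document's vector while freezing the word vectors
vec = model.infer_vector(["paragraph", "vectors"])
print(vec.shape)  # (50,)
```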