Doc2vec

6 篇文章 0 订阅

Doc2Vec 是基于Word2Vec的思想。
只是巧妙的加了一个paragraph id 以此来表示整个文档的向量分布。

在这里插入图片描述
假设一篇文章共有N个段落,每个段落都会有个一个paragraph_id,然后每个段落做PV-DM模型。
和word2vec的主要不同点即为,paragraph_Id 由一个矩阵D构成,他和其他word vector 组成的W矩阵进行concate连接,用来预测中心词。

还有另外一种形式的Doc2vec,即为给定一个paragraph_id,不加入任何单词文本信息,去预测该段落拥有的一些单词。(类似于word2vec 中的Skip-gram模型)称之为Distributed Bag of Words version of paragraph vectors.(PV-DBOW).
在这里插入图片描述

Doc2vec的优点:
1.首先,它是一种无监督的算法,并不需要标注数据就可以进行一些训练。
2.其次,他继承了word2vec的语义上的优点。即相似的词,向量距离相近。
3.另外,他还在训练时,考虑了上下文的单词顺序关系。因此,会拥有一些单词顺序上的语义理解。但是又不会像n-gram那样生成很高维度的数据。

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
Python Doc2Vec is an algorithm for generating vector representations of documents. It is an extension of the Word2Vec algorithm, which generates vector representations of words. Doc2Vec is used for tasks such as text classification, document similarity, and clustering. The basic idea behind Doc2Vec is to train a neural network to predict the probability distribution of words in a document. The network takes both the document and a context word as input, and predicts the probability of each word in the vocabulary being the next word in the document. The output of the network is a vector representation of the document. Doc2Vec can be implemented using the Gensim library in Python. The Gensim implementation of Doc2Vec has two modes: Distributed Memory (DM) and Distributed Bag of Words (DBOW). In DM mode, the algorithm tries to predict the next word in the document using both the context words and the document vector. In DBOW mode, the algorithm only uses the document vector to predict the next word. To use Doc2Vec with Gensim, you need to first create a corpus of documents. Each document should be represented as a list of words. You can then create a Doc2Vec model and train it on the corpus. Once the model is trained, you can use it to generate vector representations of new documents. Here's an example of training a Doc2Vec model using Gensim: ``` from gensim.models.doc2vec import Doc2Vec, TaggedDocument from nltk.tokenize import word_tokenize # create a corpus of documents doc1 = TaggedDocument(words=word_tokenize("This is the first document."), tags=["doc1"]) doc2 = TaggedDocument(words=word_tokenize("This is the second document."), tags=["doc2"]) doc3 = TaggedDocument(words=word_tokenize("This is the third document."), tags=["doc3"]) corpus = [doc1, doc2, doc3] # create a Doc2Vec model and train it on the corpus model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4, epochs=50) # generate vector representations of new documents new_doc = word_tokenize("This is a new document.") vector = model.infer_vector(new_doc) ``` In this example, we create a corpus of three documents and train a Doc2Vec model with a vector size of 100, a window size of 5, a minimum word count of 1, and 50 epochs. We then generate a vector representation of a new document using the `infer_vector` method.

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值