Introduction to Doc2Vec (Paragraph Vector)

1 Introduction

This article is a translated summary of the 2014 paper "Distributed Representations of Sentences and Documents" (Le and Mikolov).

Bag-of-words methods have two major weaknesses: they lose the order of the words, and they ignore the semantics of the words. For example, "powerful", "strong", and "Paris" are all treated as equally distant from each other, even though "powerful" should be closer to "strong" than to "Paris".

The paper proposes Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. Paragraph Vector overcomes both weaknesses of bag-of-words: "powerful" ends up closer to "strong" than to "Paris", and word order is taken into account, at least within a small context.

2 Algorithms

2.1 Word vector

As shown in the figure below, the task is to predict the fourth word "on" given "the", "cat", "sat". Each word corresponds to a column of a matrix W; these columns are then concatenated or averaged to predict "on".
[Figure: framework for learning word vectors by predicting the next word from its context]

More formally, given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective is to maximize the average log probability of predicting each word from its surrounding words:

$$\frac{1}{T}\sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})$$

The prediction task is usually posed as multiclass classification, e.g. via a softmax:

$$p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

Each output value $y_i$ is the unnormalized log-probability of word $i$, where $U$ and $b$ are the softmax parameters and $h$ is constructed from the word vectors in $W$ by concatenation or averaging:

$$y = b + U\,h(w_{t-k}, \ldots, w_{t+k}; W) \qquad (1)$$
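To make the pipeline concrete, here is a minimal NumPy sketch of this forward pass; the symbols follow the formulas above, but the sizes and random initialization are illustrative rather than taken from the paper:

```python
import numpy as np

V, d = 10000, 300            # vocabulary size, embedding dimension (illustrative)
W = np.random.randn(V, d)    # word vector matrix: one vector per vocabulary word
U = np.random.randn(V, d)    # softmax weight matrix
b = np.zeros(V)              # softmax bias

def predict_next(context_ids):
    """Predict p(w_t | context) by averaging the context word vectors."""
    h = W[context_ids].mean(axis=0)   # average (concatenation is the other option)
    y = b + U @ h                     # unnormalized log-probabilities, Eq. (1)
    e = np.exp(y - y.max())           # numerically stable softmax
    return e / e.sum()

probs = predict_next([1, 2, 3])       # e.g. the ids of "the", "cat", "sat"
```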

2.2 Paragraph Vector

As shown in the figure below, a matrix D is added on top of the word vector framework, with one column per paragraph. The only change is that $h$ in Eq. (1) is now constructed from both W and D.
The paragraph token can be thought of as a memory that remembers what is missing from the current context, such as the topic of the paragraph. For this reason the model is called the Distributed Memory Model of Paragraph Vectors (PV-DM).

The paragraph vectors in D are paragraph-specific: each paragraph gets its own vector. The word vectors in W, by contrast, are shared across all paragraphs.
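As a rough sketch of this modification (again with illustrative NumPy names and sizes, not the paper's implementation), $h$ now concatenates the paragraph's vector from D with the averaged context word vectors:

```python
import numpy as np

V, d, N = 10000, 300, 5000     # vocab size, embedding dim, number of paragraphs
W = np.random.randn(V, d)      # word vectors, shared across all paragraphs
D = np.random.randn(N, d)      # paragraph vectors, one per paragraph
U = np.random.randn(V, 2 * d)  # softmax weights (h is a concatenation, hence 2*d)
b = np.zeros(V)

def predict_next_pvdm(paragraph_id, context_ids):
    # h is built from BOTH the paragraph vector D[paragraph_id]
    # and the (averaged) context word vectors from W
    h = np.concatenate([D[paragraph_id], W[context_ids].mean(axis=0)])
    y = b + U @ h                # unnormalized log-probabilities
    e = np.exp(y - y.max())      # numerically stable softmax
    return e / e.sum()
```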

Both the paragraph vectors and the word vectors are trained with stochastic gradient descent, with gradients obtained via backpropagation.

At prediction time, the word vectors are held fixed, and the paragraph vector for a new paragraph is inferred on the fly by gradient descent.

The paragraph vectors in D can then be used as features for downstream prediction tasks such as classification, as sketched after the figure below.
[Figure: PV-DM framework for learning paragraph vectors]
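For example, with gensim the trained paragraph vectors can be read out and fed to any off-the-shelf classifier. A minimal sketch, assuming a hypothetical `tagged_docs` (a list of TaggedDocument objects tagged with integer ids) and `labels` (the class labels):

```python
from gensim.models.doc2vec import Doc2Vec
from sklearn.linear_model import LogisticRegression

# train paragraph vectors (PV-DM), then use them as classifier features
model = Doc2Vec(tagged_docs, dm=1, vector_size=100, epochs=20)
X = [model.dv[i] for i in range(len(tagged_docs))]   # one trained vector per document
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```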

2.3 Paragraph Vector without word ordering: Distributed bag of words

As shown in the figure below, this variant ignores the order of the input words: the paragraph vector alone is trained to predict words sampled from the paragraph. It is called the Distributed Bag of Words version of Paragraph Vector (PV-DBOW).

[Figure: PV-DBOW framework]
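In gensim, switching between the two variants is controlled by the `dm` flag; a minimal sketch, again assuming a hypothetical `tagged_docs` corpus:

```python
from gensim.models.doc2vec import Doc2Vec

# dm=0 selects PV-DBOW; dm=1 (the default) selects PV-DM
# dbow_words=1 additionally trains word vectors alongside the paragraph vectors
model_dbow = Doc2Vec(tagged_docs, dm=0, dbow_words=1, vector_size=100, epochs=20)
```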

3 Experimental Results

[Figure: experimental results from the original paper]

4 Usage

```python
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

VECTOR_SIZE = 100  # dimensionality of the paragraph vectors (assumed; defined elsewhere in the original)

# msgs_df is assumed to be a pandas DataFrame with a 'msgs' text column
tail_msg_list = msgs_df['msgs'].tolist()
tokenized_sent = [word_tokenize(s.lower()) for s in tail_msg_list]

print('Training the embedding model (Doc2Vec)...')
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]

# dm=1 selects PV-DM (dm=0 would select PV-DBOW)
model_d2v = Doc2Vec(dm=1, alpha=0.1, vector_size=VECTOR_SIZE, min_alpha=0.025)
model_d2v.build_vocab(tagged_data)
# Train
model_d2v.train(tagged_data, total_examples=model_d2v.corpus_count, epochs=20)
# Inference: word vectors stay fixed; the paragraph vector is fit by gradient descent
v = model_d2v.infer_vector(word_tokenize(row['msgs'].lower()), epochs=10)
```
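The inferred vector can then be compared against the trained paragraph vectors, for example to retrieve the most similar training documents (gensim 4.x API):

```python
# the 5 training documents whose paragraph vectors are closest to v
similar_docs = model_d2v.dv.most_similar([v], topn=5)
```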

Doc2Vec is an algorithm for generating vector representations of documents. It is an extension of the Word2Vec algorithm, which generates vector representations of words, and is used for tasks such as text classification, document similarity, and clustering. The basic idea behind Doc2Vec is to train a neural network to predict the probability distribution of words in a document: the network takes the document (and, depending on the mode, context words) as input and predicts the probability of each word in the vocabulary appearing next. A byproduct of this training is a vector representation of the document.

Doc2Vec can be implemented using the Gensim library in Python, which supports the two modes described above: Distributed Memory (DM) and Distributed Bag of Words (DBOW). In DM mode, the algorithm predicts the next word using both the context words and the document vector; in DBOW mode, it uses the document vector alone to predict words sampled from the document.

To use Doc2Vec with Gensim, first create a corpus of documents, each represented as a list of words; then create a Doc2Vec model and train it on the corpus. Once the model is trained, it can generate vector representations of new documents. Here's an example of training a Doc2Vec model using Gensim:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# create a corpus of documents
doc1 = TaggedDocument(words=word_tokenize("This is the first document."), tags=["doc1"])
doc2 = TaggedDocument(words=word_tokenize("This is the second document."), tags=["doc2"])
doc3 = TaggedDocument(words=word_tokenize("This is the third document."), tags=["doc3"])
corpus = [doc1, doc2, doc3]

# create a Doc2Vec model and train it on the corpus
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4, epochs=50)

# generate a vector representation of a new document
new_doc = word_tokenize("This is a new document.")
vector = model.infer_vector(new_doc)
```

In this example, we create a corpus of three documents and train a Doc2Vec model with a vector size of 100, a window size of 5, a minimum word count of 1, and 50 epochs. We then generate a vector representation of a new document using the `infer_vector` method.
