Introduction
Comparison with other methods
bag of words (BOW): ignores the order in which words appear.
Latent Dirichlet Allocation (LDA): better suited to extracting topics/keywords out of texts, but its hyperparameters are hard to tune and the quality of the resulting model is hard to evaluate.
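The order-blindness of BOW can be seen concretely: two sentences with opposite meanings get identical representations. A minimal sketch (the sentences are illustrative):

```python
from collections import Counter

def bow(sentence):
    """Represent a sentence as an unordered bag of word counts."""
    return Counter(sentence.lower().split())

# Opposite meanings, identical bags of words
a = bow("man bites dog")
b = bow("dog bites man")
print(a == b)  # True: word order is discarded
```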
The cornerstone: word2vec
Word2vec is a computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically, the two are similar, except that CBOW predicts a target word (e.g. "mat") from its source context words ("the cat sits on the"), whereas skip-gram predicts the source context words from the target word. This inversion may seem like an arbitrary choice, but statistically it lets CBOW smooth over a lot of distributional information (by treating the entire context as one observation), which tends to help on small datasets. Skip-gram, by contrast, treats each context-target pair as a new observation, and tends to perform better when we work with large datasets.
CBOW: the Continuous Bag-of-Words model slides a window over the text and predicts the current word from its "context", the surrounding words. Each word is represented as a feature vector; after training, these vectors become the word vectors.
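The difference between the two training setups can be sketched by generating the (input, target) pairs each model trains on. A minimal illustration, using a hypothetical window size of 2:

```python
def cbow_pairs(tokens, window=2):
    """CBOW: the whole context window is one input; the center word is the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (target, single context word) pair is a separate observation."""
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

sent = "the cat sits on the mat".split()
print(cbow_pairs(sent)[5])        # (('on', 'the'), 'mat')
print(len(skipgram_pairs(sent)))  # 18: one observation per context-target pair
```

Note how CBOW produces one training example per position while skip-gram produces one per context-target pair, which is why skip-gram extracts more observations from the same corpus.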
How Doc2Vec is formed
word2vec + a document-unique feature vector: while the word vectors W are being trained, the document vector D is trained as well, and at the end of training it holds a numeric representation of the document.
This is the Distributed Memory version of Paragraph Vector (PV-DM). The document vector acts as a memory that remembers what is missing from the current context, or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector is intended to represent the concept of a document.
Another algorithm, similar to skip-gram, may be used instead: the Distributed Bag of Words version of Paragraph Vector (PV-DBOW).
Parameters