Doc2Vec Model (Part 1)

桃子小迷妹

Original article: https://blog.csdn.net/weixin_43846270/article/details/109214898

Tutorial of Doc2Vec Model

This tutorial introduces Documents, Corpora, Vectors, and Models: the basic concepts and terms needed to understand and use gensim.

import pprint

The core concepts of gensim are:

Document: some text.
Corpus: a collection of documents.
Vector: a mathematically convenient representation of a document.
Model: an algorithm for transforming vectors from one representation to another.

Let's examine each of these in more detail.

Document

In Gensim, a document is an object of the text sequence type (commonly known as str in Python 3). A document could be anything from a short 140-character tweet to a single paragraph (e.g., a journal article abstract), a news article, or a book.

document = "Human machine interface for lab abc computer applications"

Corpus

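A corpus is a collection of document objects. Corpora serve two roles in Gensim:

Input for training a model. During training, the model uses this training corpus to look for common themes and topics, initializing its internal model parameters. Gensim focuses on unsupervised models, so no human intervention, such as costly annotation or tagging documents by hand, is required.
Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus).

Such corpora can be indexed for similarity queries, queried by semantic similarity, clustered, and so on.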

Below is an example corpus. It consists of 9 documents, where each document is a string consisting of a single sentence.

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

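The example above loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be impossible. Gensim handles such corpora by streaming them one document at a time. See Corpus Streaming – One Document at a Time for details.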
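
The streaming idea can be sketched in plain Python. This is a minimal illustration, assuming the corpus is stored as a text file with one document per line; the file name mycorpus.txt and the helper stream_documents are hypothetical names chosen for this example and are not part of Gensim.

```python
import os
import tempfile


def stream_documents(path):
    """Yield one document (one non-empty line of text) at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip empty lines
                yield line


# Build a small demo file so the sketch runs end to end.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mycorpus.txt")  # hypothetical file name
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("Human machine interface for lab abc computer applications\n")
    handle.write("A survey of user opinion of computer system response time\n")
    handle.write("The EPS user interface management system\n")

# The generator reads lazily; it is turned into a list here only for the demo.
streamed = list(stream_documents(demo_path))
for document in streamed:
    print(document)
```

Because the generator reads the file lazily, a loop over stream_documents(...) never needs more than one line in memory, which is the property that matters for large corpora.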

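This is a particularly small example corpus, chosen for illustration purposes. Other examples could be a list of all the plays written by Shakespeare, a list of all Wikipedia articles, or all tweets by a particular person.

After collecting our corpus, there are typically a number of preprocessing steps to undertake. We will keep it simple and just remove some commonly used English words (such as "the") and words that occur only once in the corpus. In the process, we will tokenize our data. Tokenization breaks the documents up into words (in this case, using whitespace as the delimiter).

There are better ways to perform preprocessing than just lowercasing and splitting on whitespace. Effective preprocessing is beyond the scope of this tutorial; if you are interested, see the gensim.utils.simple_preprocess() function.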

stoplist = set('for a of the and to in'.split(' '))  # stop word list

# Lowercase each document, split it on whitespace, and filter out stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]
pprint.pprint(texts)  
       
>> [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]
 
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

>> [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
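
Splitting on whitespace leaves punctuation attached to words, which the punctuation-free example corpus happens to hide. As one possible refinement, the sketch below tokenizes with a regular expression instead. It is an illustrative stand-in built on the standard re module, not the implementation of gensim.utils.simple_preprocess(); the helper name tokenize and the sample sentence are made up for this example.

```python
import re

STOPLIST = frozenset("for a of the and to in".split())


def tokenize(document, stoplist=STOPLIST):
    """Lowercase the text, keep runs of ASCII letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [token for token in tokens if token not in stoplist]


tokens = tokenize("Graph minors: a survey (part IV).")
print(tokens)  # ['graph', 'minors', 'survey', 'part', 'iv']
```

A plain split() on the same sentence would have produced tokens such as 'minors:' and '(part', which would then be counted as different words from 'minors' and 'part'.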

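Before proceeding, we want to associate each word in the corpus with a unique integer ID. We can do this using the gensim.corpora.Dictionary class. This dictionary defines the vocabulary of all words that our processing knows about.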

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

>>Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

Because our corpus is small, there are only 12 different tokens in this gensim.corpora.Dictionary. For larger corpora, dictionaries containing hundreds of thousands of tokens are quite common.
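
Conceptually, such a dictionary is a mapping from each distinct token to an integer ID. The pure-Python sketch below builds a mapping of this kind for the first three processed documents. It is only an analogy: the helper build_token2id is hypothetical, and the IDs it assigns (in order of first appearance) need not match the ones chosen by gensim.corpora.Dictionary.

```python
def build_token2id(tokenized_docs):
    """Assign consecutive integer IDs to tokens in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


docs = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
]
token2id = build_token2id(docs)
print(len(token2id), "unique tokens")
print(token2id)
```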

Vector

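To infer the latent structure in our corpus, we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector of features. For example, a single feature may be thought of as a question-answer pair:

How many times does the word splonge appear in the document? Zero.
How many paragraphs does the document consist of? Two.
How many fonts does the document use? Five.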

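A question is usually represented only by its integer ID (such as 1, 2, and 3), so the representation of the document becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0). This is known as a dense vector, because it contains an explicit answer to each of the questions above.

If we know all the questions in advance, we may leave them implicit and simply represent the document as (0, 2, 5). This sequence of answers is the vector for our document (in this case, a 3-dimensional dense vector). For practical purposes, Gensim only allows questions whose answer is (or can be converted to) a single floating-point number.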

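In practice, vectors often consist of many zero values. To save memory, Gensim omits all vector elements with the value 0.0. The above example thus becomes (2, 2.0), (3, 5.0). This is known as a sparse vector or bag-of-words vector. In this sparse representation, the values of all missing features can be unambiguously resolved to zero (0.0).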
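
The conversion between the two forms can be sketched in a few lines of plain Python. The helpers to_sparse and to_dense are hypothetical names for this illustration, and the 1-based feature IDs simply follow the question numbering used in the text.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (feature_id, value) pairs."""
    return [(i, value) for i, value in enumerate(dense, start=1) if value != 0.0]


def to_dense(sparse, num_features):
    """Rebuild the dense vector; features missing from `sparse` become 0.0."""
    dense = [0.0] * num_features
    for feature_id, value in sparse:
        dense[feature_id - 1] = value
    return dense


dense_vector = [0.0, 2.0, 5.0]  # answers to questions 1, 2 and 3
sparse_vector = to_sparse(dense_vector)
print(sparse_vector)               # [(2, 2.0), (3, 5.0)]
print(to_dense(sparse_vector, 3))  # [0.0, 2.0, 5.0]
```

The round trip only works because the number of features is known; the sparse form on its own does not record how many zero entries were dropped.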

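Assuming the questions are the same, we can compare the vectors of two different documents to each other. For example, suppose we are given two vectors (0.0, 2.0, 5.0) and (0.1, 1.9, 4.9). Because the vectors are very similar to each other, we can conclude that the documents corresponding to those vectors are similar, too. Of course, the correctness of that conclusion depends on how well we picked the questions in the first place.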
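
The text does not say how "similar" is measured. One common choice, assumed here purely for illustration, is cosine similarity, which is 1.0 for vectors pointing in exactly the same direction. The helper cosine_similarity below is written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similarity = cosine_similarity([0.0, 2.0, 5.0], [0.1, 1.9, 4.9])
print(round(similarity, 4))
```

For the two vectors from the text, the result is very close to 1.0, matching the intuition that the corresponding documents are nearly alike with respect to these three questions.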

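Another approach to representing a document as a vector is the bag-of-words model. Under the bag-of-words model, each document is represented by a vector containing the frequency counts of each word in the dictionary. For example, suppose we have a dictionary containing the words ['coffee', 'milk', 'sugar', 'spoon']. A document consisting of the string "coffee milk coffee" would then be represented by the vector [2, 1, 0, 0], where the entries of the vector are (in order) the number of occurrences of "coffee", "milk", "sugar", and "spoon" in the document. The length of the vector is the number of entries in the dictionary. One of the main properties of the bag-of-words model is that it completely ignores the order of the tokens in the document that is encoded, which is where the name bag-of-words comes from.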
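
The coffee example can be reproduced with a short sketch. The helper bag_of_words is a hypothetical name, and this version returns a dense count vector for clarity rather than the sparse form that Gensim uses.

```python
from collections import Counter

vocabulary = ["coffee", "milk", "sugar", "spoon"]


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


vector = bag_of_words("coffee milk coffee", vocabulary)
reordered = bag_of_words("milk coffee coffee", vocabulary)
print(vector)     # [2, 1, 0, 0]
print(reordered)  # [2, 1, 0, 0]
```

The second call shuffles the words and still yields the same vector, which shows concretely that token order is discarded.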

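Our processed corpus has 12 unique words in it, which means that each document will be represented by a 12-dimensional vector under the bag-of-words model. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors. We can see what these IDs correspond to: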

pprint.pprint(dictionary.token2id)

>>{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}

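For example, suppose we want to vectorize the phrase "Human computer interaction" (note that this phrase was not in our original corpus). We can create the bag-of-words representation for a document using the doc2bow method of the dictionary, which returns a sparse representation of the word counts: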

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

>>[(0, 1), (1, 1)]

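The first entry in each tuple corresponds to the ID of the token in the dictionary, and the second corresponds to the count of this token.
Note that "interaction" did not occur in the original corpus and so was not included in the vectorization. Also note that this vector only contains entries for words that actually appeared in the document. Because any given document will contain only a few of the many words in the dictionary, words that do not appear in the vectorization are represented as implicitly zero, as a space-saving measure.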
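
To make the behavior concrete, here is a pure-Python approximation of what doc2bow does for this example. It is a sketch of the observable behavior only (count known tokens, ignore unknown ones, return sorted pairs), not Gensim's actual implementation; the helper name doc2bow_sketch is made up, and the token2id mapping is copied from the output shown earlier.

```python
from collections import Counter

# Token-to-ID mapping as printed by dictionary.token2id above.
token2id = {
    "computer": 0, "human": 1, "interface": 2, "response": 3,
    "survey": 4, "system": 5, "time": 6, "user": 7,
    "eps": 8, "trees": 9, "graph": 10, "minors": 11,
}


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


new_vec = doc2bow_sketch("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```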

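We can convert our entire original corpus to a list of vectors:

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)
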
>> [[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

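Note that while this list lives entirely in memory, in most applications you will want a more scalable solution. Luckily, gensim allows you to use any iterator that returns a single document vector at a time. See the documentation for more details.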

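The distinction between a document and a vector is that the former is text, and the latter is a mathematically convenient representation of the text. Sometimes people use the terms interchangeably: for example, given some arbitrary document D, instead of saying "the vector that corresponds to document D", they will just say "the vector D" or "the document D". This achieves brevity at the cost of ambiguity.
As long as you remember that documents exist in document space and vectors exist in vector space, the above ambiguity is acceptable.

Depending on how the representation was obtained, two different documents may have the same vector representation.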

Model

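Now that we have vectorized our corpus, we can begin to transform it using models. We use model as an abstract term referring to a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation during training, when it reads the training corpus.

One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

Here is a simple example. Let's initialize the tf-idf model, train it on our corpus, and transform the string "system minors":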

from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

>> [(5, 0.5898341626740045), (11, 0.8075244024440723)]

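The tf-idf model again returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf weighting. Note that the ID corresponding to "system" (which occurred 4 times in the original corpus) has been weighted lower than the ID corresponding to "minors" (which occurred only twice).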
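
The two weights can be reproduced by hand. The sketch below assumes the weighting scheme is raw term count times log2(N / df), followed by scaling the vector to unit length, where N is the number of documents and df is the number of documents containing the token. That formula is an assumption about the model's default settings rather than something stated in this tutorial, although it does match the numbers printed above. The helper names are hypothetical.

```python
import math

# Bag-of-words corpus, copied from the vectorized corpus shown earlier.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def document_frequencies(corpus):
    """Count, for each token ID, how many documents contain it."""
    df = {}
    for doc in corpus:
        for token_id, _count in doc:
            df[token_id] = df.get(token_id, 0) + 1
    return df


def tfidf_vector(bow, df, num_docs):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weights = [(token_id, count * math.log2(num_docs / df[token_id]))
               for token_id, count in bow if token_id in df]
    norm = math.sqrt(sum(weight * weight for _, weight in weights))
    return [(token_id, weight / norm) for token_id, weight in weights] if norm else []


df = document_frequencies(bow_corpus)
weights = tfidf_vector([(5, 1), (11, 1)], df, len(bow_corpus))  # "system minors"
print(weights)
```

Under this scheme, "system" (ID 5) appears in 3 of the 9 documents and "minors" (ID 11) in 2, so "minors" is the rarer token and receives the larger weight.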

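You can save trained models to disk and later load them back, either to continue training on new training documents or to transform new documents.

gensim offers a number of different models (transformations). For more information, see Topics and Transformations.

Once you have created a model, you can do all sorts of cool things with it. For example, to transform the whole corpus via tf-idf and index it in preparation for similarity queries: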

from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

And to query the similarity of our query document (query_document) against every document in the corpus:

query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

>> [(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]

How do we read this output? Document 3 has a similarity score of 0.718 = 72%, document 2 has a similarity score of 42%, and so on. We can make this more readable by sorting:

for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

>> 3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
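
These scores are cosine similarities between tf-idf vectors, and the ranking can be checked without Gensim. The sketch below rests on the same assumption about the weighting as before (term count times log2(N / df), scaled to unit length); the helper names are hypothetical. Since "engineering" is not in the dictionary, the query reduces to the single token "system" (ID 5).

```python
import math

# Bag-of-words corpus, copied from the vectorized corpus shown earlier.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_docs = len(bow_corpus)

# Document frequency: in how many documents does each token ID occur?
df = {}
for doc in bow_corpus:
    for token_id, _count in doc:
        df[token_id] = df.get(token_id, 0) + 1


def tfidf_unit_vector(bow):
    """Sparse tf-idf vector as {token_id: weight}, scaled to unit length."""
    weights = {token_id: count * math.log2(num_docs / df[token_id])
               for token_id, count in bow if token_id in df}
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {token_id: weight / norm for token_id, weight in weights.items()} if norm else {}


def sparse_dot(u, v):
    """Dot product of two sparse vectors; equals the cosine for unit vectors."""
    return sum(weight * v.get(token_id, 0.0) for token_id, weight in u.items())


query = tfidf_unit_vector([(5, 1)])  # "system engineering" -> only "system" is known
sims = [sparse_dot(query, tfidf_unit_vector(doc)) for doc in bow_corpus]
ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
for document_number, score in ranked:
    print(document_number, round(score, 4))
```

Document 3 ranks first because "system" occurs twice in it and makes up a large share of its vector; documents without "system" score exactly 0.0.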
