Gensim Topic Modeling Guide: Corpora and Vector Spaces

IT_ER

Category: Topic Models. Tags: corpus, topic model, vector space, Gensim

Article link: https://blog.csdn.net/IT_ER/article/details/82346651

Corpora and Vector Spaces

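Once gensim is installed, you have a tool built for very large text collections, so it is time to put it to work.
If you want to see logging output, don't forget to configure it first: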

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
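
To see what that format string produces without touching the global logging configuration, here is a minimal sketch that uses only the standard library. The logger name `corpus_demo` and the log messages are made up for illustration; gensim's own log messages will differ.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("corpus_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output away from the root logger

# Attach the same format string as above to an in-memory stream.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s : %(levelname)s : %(message)s")
)
logger.addHandler(handler)

logger.info("loaded %d documents", 9)                    # emitted
logger.debug("this line is filtered out at INFO level")  # dropped

log_output = stream.getvalue()
print(log_output, end="")
```

Each emitted line has the shape `<timestamp> : INFO : <message>`. Messages below the configured level (here `DEBUG`) are dropped, which is why `level=logging.INFO` is a reasonable default when watching a long-running job.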

From Strings to Vectors

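This time, we start from documents represented as strings. The corpus contains nine documents, each consisting of a single sentence.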

>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
...              "A survey of user opinion of computer system response time",
...              "The EPS user interface management system",
...              "System and human system engineering testing of EPS",
...              "Relation of user perceived response time to error measurement",
...              "The generation of random binary unordered trees",
...              "The intersection graph of paths in trees",
...              "Graph minors IV Widths of trees and well quasi ordering",
...              "Graph minors A survey"]

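First, we tokenize the documents, remove common words using a small stoplist, and drop words that appear only once in the corpus: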

>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
...     for token in text:
...         frequency[token] += 1
...
>>> texts = [[token for token in text if frequency[token] > 1]
...          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'