Corpora and Vector Spaces
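With gensim installed, you have a tool for working with very large amounts of text, so let's put it to use.
If you want to see logging output, don't forget to enable it first: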
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
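To see what that format string produces, here is a minimal sketch using only the standard library. It routes messages into an in-memory buffer instead of the console. The logger name `demo` and the message texts are made up for illustration; gensim's own log messages will differ.

```python
import io
import logging

# Capture log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Same format string as in the basicConfig call above.
handler.setFormatter(logging.Formatter('%(asctime)s : %(levelname)s : %(message)s'))

# 'demo' is a hypothetical logger name used only for this illustration.
demo_logger = logging.getLogger('demo')
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the example isolated from the root logger

demo_logger.info('building dictionary')       # emitted: level is INFO
demo_logger.debug('detail below INFO level')  # dropped: DEBUG < INFO

log_output = stream.getvalue()
print(log_output, end='')
```

Each emitted line has the shape `<timestamp> : INFO : <message>`. Messages below the configured level, such as the DEBUG call here, are not written.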
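From Strings to Vectors
This time, we start from documents represented as strings. There are nine documents, each consisting of a single sentence.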
>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
...              "A survey of user opinion of computer system response time",
...              "The EPS user interface management system",
...              "System and human system engineering testing of EPS",
...              "Relation of user perceived response time to error measurement",
...              "The generation of random binary unordered trees",
...              "The intersection graph of paths in trees",
...              "Graph minors IV Widths of trees and well quasi ordering",
...              "Graph minors A survey"]
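First, we tokenize the documents and remove common words, using a small stoplist, along with words that appear only once in the corpus: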
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
...     for token in text:
...         frequency[token] += 1
...
>>> texts = [[token for token in text if frequency[token] > 1]
...          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'