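1. corpora and dictionary
corpora is a basic concept in gensim: it is how a collection of documents is represented. A corpus is essentially a two-dimensional matrix with one row per document and one column per dictionary term; each row is stored sparsely as a list of (token_id, count) pairs.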
# -*- coding: utf-8 -*-
from collections import defaultdict
from gensim import corpora, models
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
# Remove stop words
raw_data = []
for words in documents:
    tmp = []
    for word in words.lower().split():
        if word not in stoplist:
            tmp.append(word)
    raw_data.append(tmp)
# Remove words that occur only once
fre = defaultdict(int)  # missing keys default to 0, similar to map.getOrDefault() in Java 8
for item in raw_data:
    for token in item:
        fre[token] += 1
new_data = []
for item in raw_data:
    tmp = []
    for token in item:
        if fre[token] > 1:
            tmp.append(token)
    new_data.append(tmp)
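# Build the dictionary (maps each token to an integer id)
dictionary = corpora.Dictionary(new_data)
# Iterate over the dictionary
for k, v in dictionary.items():
    print(k, v)
# Build the bag-of-words vectors, one per document
corpus = [dictionary.doc2bow(text) for text in new_data]
print(corpus)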
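To make the "two-dimensional matrix" view concrete, here is a minimal pure-Python sketch of what a bag-of-words corpus looks like and how it expands into a dense matrix. It does not use gensim. The toy documents and the helpers `to_bow` and `to_dense` are hypothetical names for illustration, and assigning ids in order of first appearance is an assumption that may differ from the ids `corpora.Dictionary` assigns.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not the data used above).
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system"],
    ["system", "human", "system"],
]

# Assumption: give each distinct token an id in order of first appearance.
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens not in token2id are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def to_dense(bow, num_terms):
    """Expand a sparse (token_id, count) list into a full row of counts."""
    row = [0] * num_terms
    for token_id, count in bow:
        row[token_id] = count
    return row


bow_corpus = [to_bow(text, token2id) for text in texts]
matrix = [to_dense(bow, len(token2id)) for bow in bow_corpus]

print(bow_corpus)
for row in matrix:
    print(row)
```

Each row of `matrix` corresponds to one document and each column to one token id. The sparse (token_id, count) form stores only the nonzero entries of a row, which is why it is preferred for large vocabularies.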
Save the dictionary to disk:
dictionary.save('../dictionary.dict')
Serialize the bag-of-words corpus to disk:
corpora.MmCorpus.serialize('../dictionary.mm', corpus)
Load the dictionary and the corpus:
dictionary = corpora.Dictionary.load('../dictionary.dict')
corpus = corpora.MmCorpus('../dictionary.mm')
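The models module can process a corpus further, for example with the LSI, LDA, and TF-IDF models.
Build the TF-IDF model from the corpus:
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]
Print the TF-IDF result:
print(list(corpus_tfidf))
The output of the models below has a similar form.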
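As a rough illustration of what a TF-IDF transformation does to each bag-of-words vector, here is a pure-Python sketch. It assumes the weighting commonly described as the default for gensim's `TfidfModel` (raw count times log2(N / df), followed by L2 normalization); treat that as an assumption to check against the gensim documentation. The toy corpus and the helpers `document_frequencies` and `tfidf` are hypothetical.

```python
import math

# Toy bag-of-words corpus (hypothetical): lists of (token_id, count) pairs.
bow_corpus = [
    [(0, 1), (1, 1)],
    [(1, 2), (2, 1)],
    [(2, 1)],
]


def document_frequencies(corpus):
    """Count, for each token id, how many documents contain it."""
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    return df


def tfidf(doc, df, num_docs):
    """Weight counts by log2(num_docs / df), then scale the vector to unit length."""
    weights = [(tid, count * math.log2(num_docs / df[tid])) for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []  # every token occurs in all documents, so nothing is informative
    return [(tid, w / norm) for tid, w in weights if w != 0]


df = document_frequencies(bow_corpus)
tfidf_corpus = [tfidf(doc, df, len(bow_corpus)) for doc in bow_corpus]
for vec in tfidf_corpus:
    print(vec)
```

Under this weighting, a token that appears in every document gets weight zero and drops out of the vector, while tokens that are rare across the corpus dominate the vectors of the documents containing them.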
Build the LSI model, which can be used for clustering and classification:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]
Print the meaning of each topic:
print(lsi_model.print_topics(2))
The output is:
[(0, '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'), (1, '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]