gensim

1. corpora and dictionary

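corpora is a basic concept in gensim: it is how a collection of documents is represented. A corpus can be viewed as a two-dimensional matrix, with one row per document and one column per word in the dictionary.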
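To make the matrix view concrete, here is a minimal sketch in plain Python (gensim is not needed to run it). It assumes the bag-of-words form used later in this post, where each document is a list of (token_id, count) pairs. The helper name `bow_to_dense` and the sample data are made up for illustration.

```python
def bow_to_dense(corpus, num_terms):
    """Expand sparse (token_id, count) documents into dense rows."""
    matrix = []
    for doc in corpus:
        row = [0] * num_terms          # one column per dictionary word
        for token_id, count in doc:
            row[token_id] = count
        matrix.append(row)             # one row per document
    return matrix

# Hypothetical corpus: 3 documents over a 4-word vocabulary
sample_corpus = [[(0, 1), (2, 2)], [(1, 1)], [(0, 1), (3, 1)]]
dense = bow_to_dense(sample_corpus, num_terms=4)
for row in dense:
    print(row)
```

gensim stores only the non-zero entries, which is why the sparse pair form is used in practice. For a real conversion, `gensim.matutils.corpus2dense` does a similar job, although it returns the matrix with terms as rows and documents as columns.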

# -*- coding: utf-8 -*-

from collections import defaultdict
from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

stoplist = set('for a of the and to in'.split())
# Remove stop words
raw_data = []
for words in documents:
    tmp = []
    for word in words.lower().split():
        if word not in stoplist:
            tmp.append(word)
    raw_data.append(tmp)

# Remove words that appear only once
fre = defaultdict(int)  # missing keys default to 0, similar to map.getOrDefault() in Java 8
for item in raw_data:
    for token in item:
        fre[token] += 1

new_data = []
for item in raw_data:
    tmp = []
    for token in item:
        if fre[token] > 1:
            tmp.append(token)
    new_data.append(tmp)
# Build the dictionary (maps each word to an integer id)
dictionary = corpora.Dictionary(new_data)
# Iterate over the dictionary
for k,v in dictionary.items():
    print(k,v)
# Convert each document to a bag-of-words vector of (word id, count) pairs
corpus = [dictionary.doc2bow(text) for text in new_data]
print(corpus)
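As a rough illustration of what `doc2bow` does for a single document, the sketch below counts the tokens that are present in the vocabulary and ignores the rest. It is plain Python, not gensim's implementation; `doc2bow_sketch` and the small `token2id` mapping are hypothetical stand-ins for `dictionary.doc2bow` and `dictionary.token2id`.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; with gensim the real ids come from dictionary.token2id
token2id = {"computer": 0, "human": 1, "interface": 2}
new_doc = "Human computer interaction with a computer".lower().split()
bow = doc2bow_sketch(new_doc, token2id)
print(bow)  # [(0, 2), (1, 1)]
```

Words such as "interaction" are not in the vocabulary, so they simply do not appear in the result.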

Save the dictionary to disk

dictionary.save('../dictionary.dict')

Serialize the bag-of-words corpus to disk

corpora.MmCorpus.serialize('../dictionary.mm',corpus)
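The `.mm` file is in Matrix Market coordinate format. The sketch below shows roughly what that layout looks like for a tiny corpus: a header line, a line with the number of documents, terms, and non-zero entries, and then one line per non-zero entry with 1-based indices. It is only an illustration in plain Python; `to_matrix_market_lines` is a hypothetical helper, and the file gensim actually writes may differ in details such as padding.

```python
def to_matrix_market_lines(corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate lines."""
    num_nnz = sum(len(doc) for doc in corpus)
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{len(corpus)} {num_terms} {num_nnz}",   # docs, terms, non-zero entries
    ]
    for doc_no, doc in enumerate(corpus, start=1):
        for token_id, weight in doc:
            # Both the document number and the term id are 1-based in the file
            lines.append(f"{doc_no} {token_id + 1} {weight}")
    return lines

# Hypothetical corpus: 2 documents over a 3-word vocabulary
mm_lines = to_matrix_market_lines([[(0, 1), (2, 2)], [(1, 1)]], num_terms=3)
print("\n".join(mm_lines))
```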

Load the dictionary and the corpus

dictionary = corpora.Dictionary.load('../dictionary.dict')
corpus = corpora.MmCorpus('../dictionary.mm')

The models module can process the corpus further, for example with the LSI, LDA, and TF-IDF models.
Build the TF-IDF model

tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]

Print the TF-IDF result

print(list(corpus_tfidf))
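To show where these weights come from, here is a minimal plain-Python sketch of the weighting. It assumes what I understand to be gensim's default settings: the raw count multiplied by log2(N / df), followed by L2 normalization of each document. `tfidf_sketch` and the sample corpus are made up for illustration and are not gensim's code.

```python
import math

def tfidf_sketch(corpus):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalize."""
    num_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    result = []
    for doc in corpus:
        weighted = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in doc]
        # A word that occurs in every document gets weight 0 and is dropped
        weighted = [(tid, w) for tid, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        result.append([(tid, w / norm) for tid, w in weighted])
    return result

# Hypothetical corpus: word 0 is common, words 1 and 2 are rarer
sample = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(1, 1), (2, 1)], [(0, 1)]]
weights = tfidf_sketch(sample)
for doc in weights:
    print(doc)
```

In the first sample document the rarer word 1 ends up with a larger weight than the common word 0, which is the point of the inverse document frequency term.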

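The vectors produced by the following models are printed in the same way.
Build the LSI model, which can be used for clustering and classification

lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)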
corpus_lsi = lsi_model[corpus_tfidf]

Print the meaning of each topic

print(lsi_model.print_topics(2))

The output is:

[(0, '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'), (1, '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

Reference:
https://www.cnblogs.com/keye/p/9190304.html
