Two Ways to Implement a Corpus (Including a Corpus Stream) When Computing LSI with Gensim

Latest recommended article published on 2024-02-25 17:39:35

蛐蛐蛐

Category: Python技巧 (Python Tips)

Article link: https://blog.csdn.net/qysh123/article/details/81917222

I am really not that familiar with natural language processing, so treat this post as my own study notes.

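Recently I needed to represent each object under analysis (a software project or a piece of source code) as a document and then compute the similarity between those documents from a natural language processing perspective, which inevitably calls for a method like LSI. There are already plenty of online introductions to the LSI implementation in Gensim, and many tutorials point out that when the corpus is very large it is impractical to load all of it into memory at once; that is where the idea of a corpus stream comes in. A few articles already cover this: https://blog.csdn.net/kl28978113/article/details/77881458, https://blog.csdn.net/qq_30868235/article/details/80628719, https://blog.csdn.net/zlbflying/article/details/49329339. Unfortunately, none of them seems to explain it very clearly, so I put together a simple example that builds on those articles and shows the two ways of implementing the corpus side by side, first as an in-memory list and then as a stream: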

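# -*- coding: utf-8 -*-

from gensim import corpora, models


class MyCorpus(object):
    """Streamed corpus: yields one bag-of-words vector per line of in_file."""

    def __init__(self, in_file):
        self.in_file = in_file

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        # Relies on the module-level `dictionary` that is built further down.
        with open(self.in_file) as f:
            for line in f:
                yield dictionary.doc2bow(line.lower().split())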


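FILE_STRING = "corpus_file"  # placeholder: path to the corpus file, one document per line
NUM_TOPICS = 128


# Build the dictionary by streaming the file, so the raw text never sits in memory as a whole.
with open(FILE_STRING) as f:
    dictionary = corpora.Dictionary(line.lower().split() for line in f)
print(dictionary)
# dfs maps token id -> document frequency, so this drops tokens that occur in only one document.
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(once_ids)
dictionary.compactify()  # reassign ids to close the gaps left by the removed tokens
print(dictionary)  # the second print shows a much smaller dictionary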

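# Approach 1: the corpus is an in-memory list with one sparse bag-of-words vector per document.
with open(FILE_STRING) as f:
    corpus = [dictionary.doc2bow(line.lower().split()) for line in f]
print(corpus[1])
# Approach 2: the corpus is an object that yields one vector at a time.
corpus = MyCorpus(in_file=FILE_STRING)
for index, vector in enumerate(corpus):
    if index == 1:
        print(vector)  # same vector as corpus[1] printed above
        break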

# The models only iterate over the corpus, so the streamed version works here unchanged.
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=NUM_TOPICS)
corpus_lsi = lsi_model[corpus_tfidf]
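
One detail that is easy to miss in the listing above: the streamed corpus is iterated more than once (the `enumerate` loop, `TfidfModel(corpus)`, and then again when `LsiModel` walks `corpus_tfidf`). That is why it is written as a class with `__iter__` rather than as a plain generator. The sketch below tries to show the difference without needing Gensim at all. Note that `to_bow`, `StreamingCorpus`, `one_shot_corpus`, the `token2id` mapping, and the three sample sentences are all made up for illustration; `to_bow` is only a rough stand-in for `dictionary.doc2bow`, not Gensim's actual implementation.

```python
import os
import tempfile
from collections import Counter


def to_bow(tokens, token2id):
    """Hypothetical stand-in for dictionary.doc2bow: sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


class StreamingCorpus:
    """Re-iterable corpus: every call to __iter__ reopens the file from the start."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield to_bow(line.lower().split(), self.token2id)


def one_shot_corpus(path, token2id):
    """Generator version: also streams, but is exhausted after a single pass."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield to_bow(line.lower().split(), token2id)


# A tiny made-up corpus, written to a temporary file with one document per line.
docs = ["human machine interface", "graph of trees", "graph minors survey trees"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(docs))
    corpus_path = tmp.name

# Assign ids in order of first appearance (a simplification of what a real dictionary does).
token2id = {}
for doc in docs:
    for tok in doc.split():
        token2id.setdefault(tok, len(token2id))

streamed = StreamingCorpus(corpus_path, token2id)
first_pass = list(streamed)    # first full pass over the file
second_pass = list(streamed)   # second pass works because the file is reopened

gen = one_shot_corpus(corpus_path, token2id)
gen_first = list(gen)          # consumes the generator
gen_second = list(gen)         # nothing left to yield

os.remove(corpus_path)
```

Under these assumptions, both passes over `streamed` give the same vectors, while the second pass over the generator comes back empty. If my understanding is right, that is the reason a class like `MyCorpus` is a safer fit for a multi-step pipeline such as TF-IDF followed by LSI.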