Getting Started with Gensim for Natural Language Processing: Building and Saving Models

North_D

Last modified: 2024-02-25 17:48:17


First published: 2024-02-25 17:39:35

Article link: https://blog.csdn.net/qq_39813001/article/details/136284922


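Gensim basics

Gensim is a Python library for topic modeling and similarity retrieval on large text collections.
MmCorpus is gensim's class for reading and writing corpora in the Matrix Market format, a sparse-matrix file format that lets a large dataset be read from disk one document at a time.
TF-IDF is a common text representation that weights each term's frequency by its inverse document frequency, so that terms concentrated in a few documents stand out.
LSI, LDA, and RP are dimensionality-reduction or topic-extraction methods, often used for information retrieval, text classification, and clustering.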
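To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes the common form `count * log2(N / df)` followed by scaling each document to unit length; the function name `tfidf_weights` and the toy corpus are made up for illustration, and this is not gensim's implementation.

```python
import math
from collections import Counter


def tfidf_weights(bow_corpus):
    """Weight each (term_id, count) pair by log2(N / df), then L2-normalize per document."""
    num_docs = len(bow_corpus)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(t, count * math.log2(num_docs / doc_freq[t])) for t, count in doc]
        # A term that occurs in every document gets weight 0 and is dropped.
        vec = [(t, w) for t, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted


# Hypothetical toy corpus: term 0 occurs in every document, term 3 in only one.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 2), (2, 1), (3, 1)],
]
weighted = tfidf_weights(corpus)
```

In the last document the rare term 3 ends up with a larger weight than term 2, while the ubiquitous term 0 disappears entirely.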

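The script in section 8 uses gensim to build a topic model: given a language and a method on the command line, it trains the chosen model on the corresponding corpus and saves the trained model to a file. The core logic is broken down below.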

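1. Module imports

logging is imported to record the program's run log.
sys is imported to read the command-line arguments and the program name.
os.path is imported for file-path handling.
From gensim.corpora, the script imports dmlcorpus (a module for loading corpora in one specific format) and MmCorpus (the class that stores a document-term matrix as sparse vectors).
From gensim.models, it imports four modules: lsimodel, ldamodel, tfidfmodel, and rpmodel, which correspond to Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), the TF-IDF transformation, and Random Projections (RP).
The script also imports gensim_build, a separate module that supplies the PREFIX and RESULT_DIR constants used when the corpus configuration is created.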

2. Internal parameters

DIM_RP, DIM_LSI, and DIM_LDA set the number of dimensions (topics) for the RP, LSI, and LDA models: 300, 200, and 100 respectively.

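3. Main entry point (`if __name__ == '__main__':`)

Configures the log output format and sets the log level to INFO.
Checks that at least two arguments (language and method) were supplied; otherwise it prints the usage text and exits with status 1.
Reads the language and method arguments; the method is stripped of surrounding whitespace and lowercased.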
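The argument handling described above can be sketched as a small standalone function. The name `parse_args` and the choice to return None on too few arguments are assumptions made for illustration; the script itself does this inline and calls sys.exit(1).

```python
import os.path


def parse_args(argv):
    """Return (program, language, method), or None if too few arguments were given."""
    if len(argv) < 3:
        return None  # the caller would print the usage text and exit
    program = os.path.basename(argv[0])
    language = argv[1]
    # Normalize the method name so that ' LSI ' and 'lsi' are treated alike.
    method = argv[2].strip().lower()
    return program, language, method


parsed = parse_args(["./gensim_genmodel.py", "any", " LSI "])
missing = parse_args(["./gensim_genmodel.py", "any"])
```

Keeping the parsing in a function that takes the argument list makes it easy to check without touching sys.argv.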

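4. Loading the corpus mappings

Creates a DmlConfig object from the language argument; it holds the corpus configuration, such as the directory where results are stored.
Loads the vocabulary file wordids.txt into the id2word dictionary, so that word IDs can be mapped back to the actual words when the models are built.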
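The article does not show the layout of wordids.txt. As a rough illustration only, the sketch below assumes one tab-separated `id<TAB>word` pair per line; that layout, the function name `load_id2word`, and the sample words are all assumptions, not the documented dmlcorpus format.

```python
import io


def load_id2word(lines):
    """Build an {id: word} mapping from lines of the assumed form 'id<TAB>word'."""
    id2word = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        word_id, word = line.split("\t", 1)
        id2word[int(word_id)] = word
    return id2word


# Hypothetical file contents, supplied through an in-memory buffer.
sample = io.StringIO("0\tmatrix\n1\tgroup\n2\ttheorem\n")
id2word = load_id2word(sample)
```

A mapping of this shape is what the models receive as their id2word argument, so that topics can be printed as words rather than integer IDs.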

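5. Loading the corpus

Uses MmCorpus to load bow.mm, a Matrix Market file (a plain-text format, not a binary one) holding the document-term matrix, in which each document is a sparse bag-of-words vector. The script does no further preprocessing at this point; it expects the bag-of-words corpus to have been built already.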
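To show what such a file contains, here is a minimal sketch that parses Matrix Market coordinate text into sparse documents, treating rows as documents and columns as terms. The function name `read_mm` and the sample data are hypothetical, and the sketch ignores everything MmCorpus does beyond basic parsing (for example streaming from disk).

```python
import io


def read_mm(lines):
    """Parse Matrix Market coordinate text into a list of sparse documents."""
    size_seen = False
    docs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        parts = line.split()
        if not size_seen:
            # Size line: number of rows (documents), columns (terms), non-zero entries.
            docs = [[] for _ in range(int(parts[0]))]
            size_seen = True
            continue
        doc_no, term_no, value = int(parts[0]), int(parts[1]), float(parts[2])
        # Indices in the file are 1-based; convert to 0-based term IDs.
        docs[doc_no - 1].append((term_no - 1, value))
    return docs


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 3\n"
    "1 1 2\n"
    "1 3 1\n"
    "2 2 4\n"
)
docs = read_mm(sample)
```

Each parsed document is a list of (term_id, value) pairs, the same sparse shape that gensim models consume.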

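6. Choosing the model from the method argument

If the method is 'tfidf', the script trains and saves a TF-IDF model, which reweights the raw term counts by an inverse-document-frequency factor.
If the method is 'lda', it trains an LDA model, a probabilistic topic model that describes the corpus through per-document topic distributions and per-topic word distributions.
If the method is 'lsi', it first converts the corpus with a TF-IDF model and then trains an LSI model on the result. LSI is a linear-algebra method (a truncated singular value decomposition) that finds a latent topic space in the text.
If the method is 'rp', it likewise converts the corpus to TF-IDF first and then trains an RP model, which reduces dimensionality by random projection.
For any other method, it raises a ValueError.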
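The branching above can also be summarized as a lookup table. This is only an illustrative restructuring, not how the script is written: the names `METHODS` and `plan_for` are made up, and the dimensions are the DIM_* values from section 2.

```python
# method -> (train on TF-IDF vectors first?, number of output dimensions or None)
METHODS = {
    "tfidf": (False, None),
    "lda": (False, 100),
    "lsi": (True, 200),
    "rp": (True, 300),
}


def plan_for(method):
    """Look up how a method is trained, or raise ValueError for an unknown one."""
    try:
        return METHODS[method]
    except KeyError:
        raise ValueError("unknown topic extraction method: %r" % method) from None


needs_tfidf, num_topics = plan_for("lsi")
```

The table makes the one structural difference between the branches visible at a glance: LSI and RP are trained on TF-IDF vectors, while TF-IDF and LDA are trained on the raw counts.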

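7. Saving the model and the transformed corpus

After training, the model is saved to a file named after the method (for example model_lda.pkl or model_lsi.pkl).
The corpus transformed by the trained model (that is, its representation in the model's output space) is saved as a new Matrix Market file whose name reflects the method. As written, the script applies the model to the original bag-of-words corpus (model[corpus]) in every branch, including LSI and RP, which were trained on the TF-IDF representation.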
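For a sense of what the saved corpus file looks like, the sketch below writes sparse vectors as Matrix Market coordinate text. The function name `write_mm`, its parameters, and the sample vectors are hypothetical; the script itself delegates this step to MmCorpus rather than writing the file by hand.

```python
import io


def write_mm(out, docs, num_features):
    """Write sparse documents as Matrix Market coordinate text (1-based indices)."""
    num_nonzero = sum(len(doc) for doc in docs)
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write("%d %d %d\n" % (len(docs), num_features, num_nonzero))
    for doc_no, doc in enumerate(docs, start=1):
        for feature_id, weight in sorted(doc):
            out.write("%d %d %s\n" % (doc_no, feature_id + 1, weight))


# Two hypothetical documents expressed in a 2-dimensional topic space.
buffer = io.StringIO()
write_mm(buffer, [[(0, 0.5), (1, 0.25)], [(1, 1.0)]], num_features=2)
text = buffer.getvalue()
```

This version needs the number of non-zero entries before it writes the size line, so it only suits data that is already in memory.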

8. Code

#!/usr/bin/env python
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Licensed under the GNU LGPL v2.1 - https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html

"""
USAGE: %(program)s LANGUAGE METHOD
    Generate topic models for the specified subcorpus. METHOD is currently one \
of 'tfidf', 'lsi', 'lda', 'rp'.

Example: ./gensim_genmodel.py any lsi
"""


import logging
import sys
import os.path

from gensim.corpora import dmlcorpus, MmCorpus
from gensim.models import lsimodel, ldamodel, tfidfmodel, rpmodel

import gensim_build


# internal method parameters
DIM_RP = 300  # dimensionality for random projections
DIM_LSI = 200  # for latent semantic indexing
DIM_LDA = 100  # for latent dirichlet allocation


if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logging.info("running %s", ' '.join(sys.argv))

    program = os.path.basename(sys.argv[0])

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    language = sys.argv[1]
    method = sys.argv[2].strip().lower()

    logging.info("loading corpus mappings")
    config = dmlcorpus.DmlConfig('%s_%s' % (gensim_build.PREFIX, language),
                                 resultDir=gensim_build.RESULT_DIR, acceptLangs=[language])

    logging.info("loading word id mapping from %s", config.resultFile('wordids.txt'))
    id2word = dmlcorpus.DmlCorpus.loadDictionary(config.resultFile('wordids.txt'))
    logging.info("loaded %i word ids", len(id2word))

    corpus = MmCorpus(config.resultFile('bow.mm'))

    if method == 'tfidf':
        model = tfidfmodel.TfidfModel(corpus, id2word=id2word, normalize=True)
        model.save(config.resultFile('model_tfidf.pkl'))
    elif method == 'lda':
        model = ldamodel.LdaModel(corpus, id2word=id2word, num_topics=DIM_LDA)
        model.save(config.resultFile('model_lda.pkl'))
    elif method == 'lsi':
        # first, transform word counts to tf-idf weights
        tfidf = tfidfmodel.TfidfModel(corpus, id2word=id2word, normalize=True)
        # then find the transformation from tf-idf to latent space
        model = lsimodel.LsiModel(tfidf[corpus], id2word=id2word, num_topics=DIM_LSI)
        model.save(config.resultFile('model_lsi.pkl'))
    elif method == 'rp':
        # first, transform word counts to tf-idf weights
        tfidf = tfidfmodel.TfidfModel(corpus, id2word=id2word, normalize=True)
        # then build the random projection from tf-idf space to a lower-dimensional space
        model = rpmodel.RpModel(tfidf[corpus], id2word=id2word, num_topics=DIM_RP)
        model.save(config.resultFile('model_rp.pkl'))
    else:
        raise ValueError('unknown topic extraction method: %s' % repr(method))

    MmCorpus.saveCorpus(config.resultFile('%s.mm' % method), model[corpus])

    logging.info("finished running %s", program)