TypeError: doc2bow expects an array of unicode tokens on input, not a single string

ov大鱼vo

Category: Python

Article link: https://blog.csdn.net/flowingfog/article/details/94627449

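When building the dictionary, the input must not be the raw document collection (a list of strings). It must be a list of token lists, that is, every document has to be tokenized first.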

import nltk
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import datapath


class LDA:
    def getDataArray(self, src):
        """Read one document per line and return a list of token lists."""
        dataset = []  # input for Gensim: one list of tokens per document
        with open(src, 'r', encoding='UTF-8') as file:
            for count, text in enumerate(file):
                print(count)
                # word_tokenize needs the NLTK Punkt tokenizer data to be installed
                tokens = nltk.word_tokenize(text)
                dataset.append(tokens)
        print('Get data array complete!')
        return dataset


if __name__ == "__main__":
    lda = LDA()
    path = 'your_file_path'  # replace with the path to your input file
    dicFile = 'your_dictionary_path'  # replace with the path for the saved dictionary
    dataset = lda.getDataArray(path)
    dictory = Dictionary(dataset)  # takes a list of token lists, not raw strings
    dictory.save_as_text(dicFile)  # save the dictionary as a text file
    corpus = [dictory.doc2bow(text) for text in dataset]  # convert to bag-of-words vectors
    # train the model on the prepared corpus
    model = LdaModel(corpus, id2word=dictory, num_topics=5, alpha='auto', eval_every=5)

    # save model (datapath resolves inside gensim's test-data directory)
    temp_file = datapath('model')
    model.save(temp_file)
