LDA topic modeling on English text

data.txt holds the data after preprocessing. Each line is one document:

in conjunction with the release of the the allen institute for ai partnered with
the recent outbreak of the deadly and highly infectious covid disease caused by 
coronaviruses is related illness that vary from a common cold more severe 
it is shown that the evaporation rate of a liquid sample containing the 
covid illness an on going epidemic started in wuhan city china in december 
in the beginning of december covid virus that slipped from animals humans in 
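
The preprocessing step itself is not shown in this post. Below is a minimal sketch of one way to produce lines like the ones above, assuming the cleaning amounts to lowercasing and keeping only runs of ASCII letters. The function name `preprocess_line`, the cleaning rules, and the raw input sentence are assumptions made for illustration, not the steps actually used to build data.txt.

```python
import re


def preprocess_line(text):
    """Lowercase the text and keep only runs of ASCII letters (assumed rules)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(tokens)


# Made-up raw sentence, chosen only to illustrate the cleaning
raw = "COVID-19 illness, an on-going epidemic, started in Wuhan City, China in December 2019."
print(preprocess_line(raw))
# -> covid illness an on going epidemic started in wuhan city china in december
```

Digits and punctuation are dropped entirely, which is why "COVID-19" becomes "covid" and "on-going" becomes two tokens under these rules.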

Modeling code:

from gensim import corpora
import gensim  # pip install gensim


def lda_analyze(all_contents, num_topic=10):
    """Core routine: build the dictionary and bag-of-words corpus, then train LDA."""
    dictionary = corpora.Dictionary(all_contents)
    corpus = [dictionary.doc2bow(sentence) for sentence in all_contents]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topic)  # core call
    return lda


def get_topic(all_contents, num_topic=10):
    # num_topic: number of topics the LDA model is trained with
    # all_contents: list of token lists, one inner list per document
    try:
        lda = lda_analyze(all_contents, num_topic=num_topic)
        for topic in lda.print_topics(num_words=20):  # print the top 20 words of each topic
            print(topic[1])
        # save the model, e.g. lda_10.model
        lda.save('lda_' + str(num_topic) + '.model')
    except Exception as e:
        print(e)


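# Load the data and run the training
with open('data.txt', encoding='utf-8') as f:
    data = [line.split() for line in f if line.strip()]  # tokenize on whitespace, skip blank lines
for num_topic in range(1, 17):
    get_topic(data, num_topic)  # train with 1 through 16 topics and save each model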

 
