LDA Topic Perplexity and Topic Distance Calculation (Part 1)

When mining topics with LDA, perplexity is commonly used to select the optimal number of topics. sklearn.decomposition.LatentDirichletAllocation provides a perplexity() method, but in practice its value increases irregularly as the number of topics grows. According to issues filed against the LatentDirichletAllocation source on GitHub, this is a bug in that implementation. To compute perplexity properly, you can refer to https://blog.csdn.net/xxidaojia/article/details/102702492?utm_source=distribute.pc_relevant.none-task, which performs the LDA analysis with gensim. The two tools do produce somewhat different results, and in my experience the topics produced by LatentDirichletAllocation are slightly better. The code below therefore adapts the approach from that post for use alongside LatentDirichletAllocation.
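For reference, the perplexity computed throughout this post follows the standard definition: for a test set of M documents, where N_d is the number of tokens in document d,

\[ \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\left( -\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right), \qquad p(w) = \sum_{z} p(z \mid d)\, p(w \mid z). \]

A lower perplexity means the model predicts held-out text better.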

import math

def docperplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """Compute the perplexity of a gensim LDA model on a held-out test set."""
    print('the info of this ldamodel: \n')
    print('num of topics: %s' % num_topics)
    prob_doc_sum = 0.0
    # p(w|z) for every topic, stored as a word -> probability dict
    topic_word_list = []
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        dic = {}
        for word, probability in topic_word:
            dic[word] = probability
        topic_word_list.append(dic)
    # p(z|d) for every test document (minimum_probability=0 keeps every topic,
    # so the list below can be indexed directly by topic_id)
    doc_topics_list = []
    for doc in testset:
        doc_topics_list.append(ldamodel.get_document_topics(doc, minimum_probability=0))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of the doc
        doc = testset[i]
        doc_word_num = 0  # number of tokens in the doc
        for word_id, num in dict(doc).items():
            prob_word = 0.0
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum_z p(z|d) * p(w|z)
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # weight by token count: log p(d) = sum_w n_w * log p(w)
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(log p(d)) / sum(N_d))
    print("Model perplexity: %s" % prep)
    return prep



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_excel('all-excel/2019-1.xls', sheet_name='Sheet1')
df = pd.DataFrame(df['comment'].astype(str))

# apply the jieba word-segmentation function, here named chinese_word_cut
df['content_cutted'] = df['comment'].apply(chinese_word_cut)
n_features = 500  # number of tf-idf features; choose according to corpus size

# rebuild each document as a list of (uppercased) tokens --
# the form gensim's Dictionary/doc2bow expects
documents = df['content_cutted'].values.tolist()
docList = []
for x in documents:
    xlist = [xl.upper() for xl in x.split()]
    docList.append(xlist)
documents = docList

# split into LDA training and test sets
train, test = train_test_split(list(df['content_cutted'].values), test_size=0.2)
tf_vectorizer = TfidfVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                max_df=0.99,
                                min_df=0.002)  # drop terms that occur in too many or too few documents
doc_train = tf_vectorizer.fit_transform(train)

features = tf_vectorizer.get_feature_names()  # get_feature_names_out() in sklearn >= 1.0
doc_test = tf_vectorizer.transform(test)  # transform only -- refitting would change the vocabulary

perplexity = []
alpha = 0.1
beta = 0.1

for topics in range(1, 21, 2):
    # fit LDA to the training data with the current number of topics
    LDA = LatentDirichletAllocation(n_components=topics, doc_topic_prior=alpha, topic_word_prior=beta)
    news_lda = LDA.fit(doc_train)
    perplexity.append(news_lda.perplexity(doc_test))

print(perplexity)

For topic mining on this corpus, the perplexity values computed above can be plotted against the number of topics:
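A minimal plotting sketch, assuming matplotlib, to visualize the perplexity list produced by the loop above:

import matplotlib.pyplot as plt

topic_range = list(range(1, 21, 2))  # same grid as the training loop
plt.plot(topic_range, perplexity, marker='o')
plt.xlabel('Number of topics')
plt.ylabel('Perplexity')
plt.title('Perplexity vs. number of topics')
plt.show()

The topic count at the elbow (or minimum) of this curve is the usual choice for the final model.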

Topic distance calculation (Part 2): https://blog.csdn.net/snail82/article/details/104553900
