When mining topics with LDA, topic perplexity is commonly used to select the optimal number of topics. sklearn.decomposition.LatentDirichletAllocation provides a perplexity function, but in practice its value increases erratically as the number of topics grows. According to issues filed against the LatentDirichletAllocation source code on GitHub, this is a known bug in that implementation. If you want to compute topic perplexity yourself, you can refer to the blog post https://blog.csdn.net/xxidaojia/article/details/102702492?utm_source=distribute.pc_relevant.none-task, which performs the LDA analysis with gensim. The two tools do produce somewhat different results; in my experience LatentDirichletAllocation yields slightly better topics. The function below, adapted from that post, computes perplexity manually so it can be used alongside LatentDirichletAllocation.
import math

def docperplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """Compute the perplexity of a gensim LDA model on a held-out bag-of-words test set."""
    print('the info of this ldamodel: \n')
    print('num of topics: %s' % num_topics)
    prep = 0.0
    prob_doc_sum = 0.0
    topic_word_list = []  # per-topic {word: p(w|z)} dictionaries
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        dic = {}
        for word, probability in topic_word:
            dic[word] = probability
        topic_word_list.append(dic)
    doc_topics_list = []  # per-document topic distributions p(z|d)
    for doc in testset:
        doc_topics_list.append(ldamodel.get_document_topics(doc, minimum_probability=0))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of this document
        doc = testset[i]
        doc_word_num = 0
        for word_id, num in dict(doc).items():
            prob_word = 0.0
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w|d) = sum_z p(z|d) * p(w|z)
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # weight by the token count num, so it matches the denominator below
            prob_doc += num * math.log(prob_word)  # log p(d) = sum_w n_w * log p(w|d)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(log p(d)) / N)
    print("Model perplexity: %s" % prep)
    return prep
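The arithmetic inside the function can be checked in isolation. The sketch below uses made-up numbers (not from any trained model): two topics over a tiny vocabulary and one test document, computing p(w|d) = Σ_z p(z|d)·p(w|z) and then the per-token perplexity exp(-Σ n_w·log p(w|d) / N):

```python
import math

# Illustrative numbers only: two topics, a three-word vocabulary,
# and a single test document with bag-of-words counts.
topic_word = [                             # p(w|z) for each topic
    {'price': 0.6, 'screen': 0.3, 'battery': 0.1},
    {'price': 0.1, 'screen': 0.2, 'battery': 0.7},
]
doc_topics = [0.4, 0.6]                    # p(z|d) for the test document
doc_counts = {'price': 2, 'battery': 3}    # token counts n_w

log_prob, n_tokens = 0.0, 0
for word, num in doc_counts.items():
    # p(w|d) = sum_z p(z|d) * p(w|z)
    p_w = sum(doc_topics[z] * topic_word[z][word] for z in range(2))
    log_prob += num * math.log(p_w)
    n_tokens += num

perplexity = math.exp(-log_prob / n_tokens)
print(perplexity)
```

A uniform model over the three words would give perplexity 3; values closer to 1 mean the model assigns higher probability to the observed tokens.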
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_excel('all-excel/2019-1.xls', sheet_name='Sheet1')
df = pd.DataFrame(df['comment'].astype(str))
# chinese_word_cut wraps jieba segmentation and returns a space-separated string
df['content_cutted'] = df['comment'].apply(chinese_word_cut)
n_features = 500  # number of tf-idf features; choose according to corpus size
documents = df['content_cutted'].values.tolist()
# rebuild each document as a list of upper-cased tokens
docList = []
for x in documents:
    xlist = [xl.upper() for xl in x.split()]
    docList.append(xlist)
documents = docList
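The loop above simply whitespace-splits each segmented string and upper-cases the tokens (so that, for example, Latin-script brand names are case-normalized). A standalone sketch of the same transformation, with made-up sample comments:

```python
# Hypothetical segmented comments (output of jieba, joined by spaces)
documents = ['iphone 屏幕 不错', 'Huawei 电池 耐用']

docList = []
for x in documents:
    # split on whitespace, then upper-case each token
    docList.append([token.upper() for token in x.split()])

print(docList)  # → [['IPHONE', '屏幕', '不错'], ['HUAWEI', '电池', '耐用']]
```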
# split the segmented documents into LDA training and test sets
train, test = train_test_split(list(df['content_cutted'].values), test_size=0.2)
tf_vectorizer = TfidfVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                max_df=0.99,    # drop words that appear in almost every document
                                min_df=0.002)   # drop words that appear in too few documents
doc_train = tf_vectorizer.fit_transform(train)
features = tf_vectorizer.get_feature_names()
# use transform (not fit_transform) so the test set shares the training vocabulary
doc_test = tf_vectorizer.transform(test)
perplexity = []
alpha = 0.1  # doc_topic_prior
beta = 0.1   # topic_word_prior
for topics in range(1, 21, 2):
    # fit LDA with the candidate number of topics
    LDA = LatentDirichletAllocation(n_components=topics,
                                    doc_topic_prior=alpha,
                                    topic_word_prior=beta)
    news_lda = LDA.fit(doc_train)
    perplexity.append(news_lda.perplexity(doc_test))
print(perplexity)
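Once the perplexity list is filled, the usual rule is to pick the candidate with the lowest perplexity. A minimal sketch with hypothetical perplexity values (the real numbers depend on your corpus):

```python
# Candidate topic counts matching range(1, 21, 2)
candidate_topics = list(range(1, 21, 2))
# Hypothetical perplexity values for illustration only
perplexity = [310.2, 295.7, 288.1, 284.6, 286.9,
              291.3, 297.0, 302.8, 309.4, 315.1]

# index of the smallest perplexity, and the corresponding topic count
best_index = min(range(len(perplexity)), key=perplexity.__getitem__)
best_topics = candidate_topics[best_index]
print(best_topics)  # → 7
```

In practice the curve is often noisy, so plotting perplexity against the topic count and looking for an elbow is more robust than blindly taking the minimum.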
For topic mining on a corpus, the perplexity is defined as perplexity = exp(-Σ_d log p(d) / Σ_d N_d), where p(d) is the probability the model assigns to document d and N_d is the number of tokens in d.
Reference: "Methods for computing topic distance (Part 2)" https://blog.csdn.net/snail82/article/details/104553900