When mining topics with LDA, topic perplexity is commonly used to select the optimal number of topics. sklearn.decomposition.LatentDirichletAllocation provides a perplexity function, but in practice its value increases erratically as the number of topics grows. According to issues filed against the LatentDirichletAllocation source code on GitHub, this is a known bug in that implementation. If you want to compute topic perplexity yourself, you can refer to the blog post https://blog.csdn.net/xxidaojia/article/details/102702492?utm_source=distribute.pc_relevant.none-task, which performs the LDA analysis with gensim. The two tools do produce somewhat different results; in my experience LatentDirichletAllocation yields slightly better topics. The function below, adapted from that post, computes perplexity manually so it can be used alongside LatentDirichletAllocation.
import math

def docperplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """Compute the perplexity of a gensim LDA model on a held-out bag-of-words test set."""
    print('the info of this ldamodel: \n')
    print('num of topics: %s' % num_topics)
    prep = 0.0
    prob_doc_sum = 0.0
    topic_word_list = []  # per-topic {word: p(w|z)} dictionaries
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        dic = {}
        for word, probability in topic_word:
            dic[word] = probability
        topic_word_list.append(dic)
    doc_topics_list = []  # per-document topic distributions p(z|d)
    for doc in testset:
        doc_topics_list.append(ldamodel.get_document_topics(doc, minimum_probability=0))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of this document
        doc = testset[i]
        doc_word_num = 0
        for word_id, num in dict(doc).items():
            prob_word = 0.0
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w|d) = sum_z p(z|d) * p(w|z)
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # weight by the token count num, so it matches the denominator below
            prob_doc += num * math.log(prob_word)  # log p(d) = sum_w n_w * log p(w|d)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(log p(d)) / N)
    print("Model perplexity: %s" % prep)
    return prep
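The arithmetic inside the function can be checked in isolation. The sketch below uses made-up numbers (not from any trained model): two topics over a tiny vocabulary and one test document, computing p(w|d) = Σ_z p(z|d)·p(w|z) and then the per-token perplexity exp(-Σ n_w·log p(w|d) / N):

```python
import math

# Illustrative numbers only: two topics, a three-word vocabulary,
# and a single test document with bag-of-words counts.
topic_word = [                             # p(w|z) for each topic
    {'price': 0.6, 'screen': 0.3, 'battery': 0.1},
    {'price': 0.1, 'screen': 0.2, 'battery': 0.7},
]
doc_topics = [0.4, 0.6]                    # p(z|d) for the test document
doc_counts = {'price': 2, 'battery': 3}    # token counts n_w

log_prob, n_tokens = 0.0, 0
for word, num in doc_counts.items():
    # p(w|d) = sum_z p(z|d) * p(w|z)
    p_w = sum(doc_topics[z] * topic_word[z][word] for z in range(2))
    log_prob += num * math.log(p_w)
    n_tokens += num

perplexity = math.exp(-log_prob / n_tokens)
print(perplexity)
```

A uniform model over the three words would give perplexity 3; values closer to 1 mean the model assigns higher probability to the observed tokens.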
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_excel('all-excel/2019-1.xls', sheet_name='Sheet1')
df = pd.DataFrame(df['comment'].astype(str))
# chinese_word_cut wraps jieba segmentation and returns a space-separated string
df['content_cutted'] = df['comment'].apply(chinese_word_cut)
n_features = 500  # number of tf-idf features; choose according to corpus size
documents = df['content_cutted'].values.tolist()
# rebuild each document as a list of upper-cased tokens
docList = []
for x in documents:
    xlist = [xl.upper() for xl in x.split()]
    docList.append(xlist)
documents = docList
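The loop above simply whitespace-splits each segmented string and upper-cases the tokens (so that, for example, Latin-script brand names are case-normalized). A standalone sketch of the same transformation, with made-up sample comments:

```python
# Hypothetical segmented comments (output of jieba, joined by spaces)
documents = ['iphone 屏幕 不错', 'Huawei 电池 耐用']

docList = []
for x in documents:
    # split on whitespace, then upper-case each token
    docList.append([token.upper() for token in x.split()])

print(docList)  # → [['IPHONE', '屏幕', '不错'], ['HUAWEI', '电池', '耐用']]
```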
# split the segmented documents into LDA training and test sets
train, test = train_test_split(list(df['content_cutted'].values), test_size=0.2)
tf_vectorizer = TfidfVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                max_df=0.99,    # drop words that appear in almost every document
                                min_df=0.002)   # drop words that appear in too few documents
doc_train = tf_vectorizer.fit_transform(train)
features = tf_vectorizer.get_feature_names()
# use transform (not fit_transform) so the test set shares the training vocabulary
doc_test = tf_vectorizer.transform(test)
perplexity = []
alpha = 0.1  # doc_topic_prior
beta = 0.1   # topic_word_prior
for topics in range(1, 21, 2):
    # fit LDA with the candidate number of topics
    LDA = LatentDirichletAllocation(n_components=topics,
                                    doc_topic_prior=alpha,
                                    topic_word_prior=beta)
    news_lda = LDA.fit(doc_train)
    perplexity.append(news_lda.perplexity(doc_test))
print(perplexity)
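Once the perplexity list is filled, the usual rule is to pick the candidate with the lowest perplexity. A minimal sketch with hypothetical perplexity values (the real numbers depend on your corpus):

```python
# Candidate topic counts matching range(1, 21, 2)
candidate_topics = list(range(1, 21, 2))
# Hypothetical perplexity values for illustration only
perplexity = [310.2, 295.7, 288.1, 284.6, 286.9,
              291.3, 297.0, 302.8, 309.4, 315.1]

# index of the smallest perplexity, and the corresponding topic count
best_index = min(range(len(perplexity)), key=perplexity.__getitem__)
best_topics = candidate_topics[best_index]
print(best_topics)  # → 7
```

In practice the curve is often noisy, so plotting perplexity against the topic count and looking for an elbow is more robust than blindly taking the minimum.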
For topic mining on a corpus, the perplexity is defined as perplexity = exp(-Σ_d log p(d) / Σ_d N_d), where p(d) is the probability the model assigns to document d and N_d is the number of tokens in d.
Reference: "Methods for computing topic distance (Part 2)" https://blog.csdn.net/snail82/article/details/104553900