Computing and Plotting the Perplexity of an LDA Language Model in Python

Latest recommended article published on 2023-05-31 21:33:38

weixin_39603598


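In natural language processing, perplexity is commonly used to measure the quality of a trained language model. When LDA is used to cluster topics and words, D. Blei, the original author of LDA, used perplexity to choose the number of topics. The formula given in the paper is: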

$$\mathrm{perplexity} = \exp\left( -\frac{\sum_{w} \log p(w)}{N} \right)$$

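Here $p(w)$ is the probability of each word occurrence in the test set. For an LDA model, $p(w) = \sum_{z} p(z|d)\,p(w|z)$, where $z$ ranges over the trained topics and $d$ is the test-set document containing the word. The denominator $N$ is the number of words in the test set, that is, its total length, with repeated words counted every time they occur.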
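To make the formula concrete, here is a minimal sketch that evaluates it directly on toy data. The function name `lda_perplexity`, the layout of `theta` and `phi`, and the toy values are assumptions made for illustration; skipping words the model has never seen is also an assumption, not something the formula prescribes.

```python
import math


def lda_perplexity(docs, theta, phi):
    """Evaluate exp(-sum(log p(w)) / N) over the tokens of a test set.

    docs  : list of documents, each a list of word tokens
    theta : theta[d][z] = p(z|d), one row per document (hypothetical layout)
    phi   : phi[z][w]   = p(w|z), one dict per topic   (hypothetical layout)
    """
    log_sum = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # p(w) = sum over topics z of p(z|d) * p(w|z)
            p_w = sum(theta[d][z] * phi[z].get(w, 0.0) for z in range(len(phi)))
            if p_w > 0.0:  # assumption: tokens unknown to the model are skipped
                log_sum += math.log(p_w)
                n_tokens += 1
    if n_tokens == 0:
        raise ValueError("no test-set token has a non-zero probability")
    return math.exp(-log_sum / n_tokens)


# Toy model: two topics, each spreading its mass evenly over two words.
phi = [{"apple": 0.5, "pear": 0.5}, {"cpu": 0.5, "gpu": 0.5}]
# Each toy document belongs entirely to one topic.
theta = [[1.0, 0.0], [0.0, 1.0]]
docs = [["apple", "pear"], ["cpu", "gpu", "cpu"]]

toy_perplexity = lda_perplexity(docs, theta, phi)
print(toy_perplexity)
```

Every token in this toy set has probability 0.5, so the perplexity is the reciprocal of that value, 2. This illustrates the usual reading of perplexity as the inverse geometric mean of the per-word probabilities.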

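The Python program therefore needs to cover the following steps:

1. Convert the topic-word distribution file of the trained LDA model into a dictionary so that word probabilities can be looked up easily. This gives the numerator of the perplexity formula.

2. Count the length of the test set. This gives the denominator of the perplexity formula.

3. Compute the perplexity.

4. Compute the perplexity of models trained with different numbers of topics and plot the results as a line chart.

The Python code is as follows: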

# -*- coding: UTF-8 -*-
import math

import matplotlib.pyplot as plt


def dictionary_found(wordlist):
    """Convert the flat [word, probability, word, probability, ...] list read
    from the model file into a dictionary keyed by word. The probabilities of
    a word that appears under several topics are summed."""
    word_dictionary1 = {}
    for i in range(0, len(wordlist), 2):
        word = wordlist[i]
        word_dictionary1[word] = word_dictionary1.get(word, 0.0) + float(wordlist[i + 1])
    return word_dictionary1


def look_into_dic(dictionary, testset):
    """Look up every distinct word of the test set in the dictionary and
    return the sum of the probabilities found. Words missing from the
    dictionary are skipped."""
    seen = set()
    a = 0.0
    for letter in testset.split():
        if letter not in seen:
            seen.add(letter)
            letter_frequency = dictionary.get(letter)
            if letter_frequency is not None:
                a += float(letter_frequency)
    return a


def f_testset_word_count(testset):
    """Return the number of words in the test set, which is the denominator
    of the perplexity formula."""
    testset_clean = testset.split()
    # Whitespace-separated token count minus the number of newline characters.
    return len(testset_clean) - testset.count("\n")


def f_perplexity(word_frequency, word_count):
    """Calculate the perplexity of the LDA model for one topic number T."""
    duishu = -math.log(word_frequency)  # negative logarithm
    kuohaoli = duishu / word_count      # the exponent inside the brackets
    perplexity = math.exp(kuohaoli)
    return perplexity


def graph_draw(topic, perplexity):
    """Plot perplexity against the number of topics as a line chart."""
    x = topic
    y = perplexity
    plt.plot(x, y, color="red", linewidth=2)
    plt.xlabel("Number of Topic")
    plt.ylabel("Perplexity")
    plt.show()


topic = []
perplexity_list = []
with open('/home/alber/lda/GibbsLDA/jd/test.txt', 'r') as f1:  # test set path
    testset = f1.read()
# Call the function to count the words in the test set.
testset_word_count = f_testset_word_count(testset)

for i in range(14):
    topic.append(5 * (i + 1))  # topic number, matching the model file name below
    trace = "/home/alber/lda/GibbsLDA/jd/stats/model-final-" + str(5 * (i + 1)) + ".txt"  # model file path
    with open(trace, 'r') as f:
        text = f.readlines()
    word_list = []
    for line in text:
        if "Topic" not in line:  # skip the topic header lines
            word_list.extend(line.split())
    word_dictionary = dictionary_found(word_list)
    frequency = look_into_dic(word_dictionary, testset)
    perplexity = f_perplexity(frequency, testset_word_count)
    perplexity_list.append(perplexity)

graph_draw(topic, perplexity_list)
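The parsing step is easier to follow on a small input. The sketch below assumes the layout that the script's parsing implies: header lines containing the word "Topic", each followed by lines holding a word and its probability. The function name `parse_topic_word_lines` and the sample values are hypothetical. Unlike the script, it keeps one dictionary per topic, then shows how merging them reproduces the summed dictionary.

```python
def parse_topic_word_lines(lines):
    """Return a list with one {word: probability} dict per topic."""
    topics = []
    for line in lines:
        if "Topic" in line:
            # A header line starts a new topic.
            topics.append({})
            continue
        parts = line.split()
        if len(parts) == 2 and topics:
            word, prob = parts
            topics[-1][word] = float(prob)
    return topics


# Made-up sample in the assumed layout, not taken from a real model file.
sample_lines = [
    "Topic 0th:",
    "\tphone 0.04",
    "\tscreen 0.02",
    "Topic 1th:",
    "\tprice 0.05",
    "\tphone 0.01",
]

topics = parse_topic_word_lines(sample_lines)

# Merging the per-topic dicts sums the probabilities of words that
# appear under several topics.
merged = {}
for topic_words in topics:
    for word, prob in topic_words.items():
        merged[word] = merged.get(word, 0.0) + prob

print(topics)
print(merged)
```

Keeping the topics separate preserves each p(w|z), which is what the formula needs when it is combined with per-document topic weights p(z|d). The merged dictionary discards that information.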

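The script draws a line chart of perplexity against the number of topics. Look for the inflection point of the curve and fine-tune the parameters around it to find the optimal number of topics; where the inflection point falls depends on the test set. Experiments show that as long as the number of topics is chosen close to that point, the extracted topics are generally good.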
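The inflection point can also be located numerically instead of by eye. The sketch below uses one simple heuristic, the largest second difference of the curve, which is an assumption of this example and not the method used in the article. It also assumes evenly spaced topic numbers, as in the script's steps of 5. The function name `find_elbow` and the perplexity values are made up for illustration and are not measured results.

```python
def find_elbow(topic_counts, perplexities):
    """Return the topic count where the curve bends most sharply.

    The bend at point i is the second difference
    perplexities[i-1] - 2 * perplexities[i] + perplexities[i+1],
    which is large where a steep drop turns into a flat stretch.
    """
    if len(topic_counts) != len(perplexities):
        raise ValueError("both lists must have the same length")
    if len(topic_counts) < 3:
        raise ValueError("at least three points are needed")
    best_index = 1
    best_bend = float("-inf")
    for i in range(1, len(perplexities) - 1):
        bend = perplexities[i - 1] - 2 * perplexities[i] + perplexities[i + 1]
        if bend > best_bend:
            best_bend = bend
            best_index = i
    return topic_counts[best_index]


# Hypothetical values, chosen only to show a curve that flattens out.
topic_counts = [5, 10, 15, 20, 25]
perplexities = [900.0, 600.0, 400.0, 380.0, 370.0]

elbow = find_elbow(topic_counts, perplexities)
print(elbow)
```

A result like this is only a starting point: the text's advice still applies, so models with topic numbers near the reported value are worth training and comparing.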