The LDA program used here is JGibbLDA. Working from its output files, a senior labmate gave me the following function for computing perplexity:
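/**
 * Computes perplexity from JGibbLDA output. Requires: import java.util.ArrayList;
 *
 * @param tw_list rows of the topic-word matrix (the .phi file), one row per topic
 * @param dt_list rows of the document-topic matrix (the .theta file), one row per document
 * @param as_list rows of the .tassign file, one row per document
 */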
public double compute(ArrayList<String[]> tw_list, ArrayList<String[]> dt_list, ArrayList<String[]> as_list) {
    double sum_ln_pt = 0.0; // total log-likelihood over all documents
    int sum_t = 0;          // total number of word tokens
    for (int i = 0; i < as_list.size(); i++) { // each row corresponds to one document
        String[] as_arr = as_list.get(i);
        double pt = 0.0; // accumulates the log-probability of document i
        for (String as : as_arr) { // for each word token
            if (!as.isEmpty()) {
                sum_t++; // count only non-empty tokens so the denominator matches the sum
                double pz = 0.0;
                String[] tz_arr = as.split(":"); // each token is wordId:topicId
                int t_id = Integer.parseInt(tz_arr[0]); // t_id is the word id
                for (int j = 0; j < tw_list.size(); j++) { // iterate over topics
                    double p_tz = Double.parseDouble(tw_list.get(j)[t_id]); // P(word t_id | topic j)
                    double p_zd = Double.parseDouble(dt_list.get(i)[j]);    // P(topic j | document i)
                    pz += p_tz * p_zd;
                }
                pt += Math.log(pz);
            }
        }
        sum_ln_pt += pt;
    }
    // perplexity = exp(-(total log-likelihood) / (total token count))
    return Math.exp(-sum_ln_pt / sum_t);
}
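Written as a formula, the function above appears to evaluate the usual LDA perplexity. The notation here is my own assumption, not something taken from JGibbLDA: M is the number of documents, N_d the number of tokens in document d, K the number of topics, w_{d,n} the word id of the n-th token in document d, \phi the .phi matrix and \theta the .theta matrix.

```latex
\mathrm{perplexity}
  = \exp\!\left(
      -\,\frac{\sum_{d=1}^{M} \sum_{n=1}^{N_d}
               \ln \sum_{k=1}^{K} \phi_{k,\,w_{d,n}}\, \theta_{d,k}}
              {\sum_{d=1}^{M} N_d}
    \right)
```

The inner sum over k corresponds to the loop over j in the code, the logarithm to pt += Math.log(pz), and the denominator to sum_t.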
Take the test file test_input.txt as an example (for the parameters and so on, see the previous post, "JGibbLDA output explained with examples"):
4
sport Spanish football association competition club tickets scored win winners keeper shots best goal campaign season's Champions League
France team France Football Federation president national team training session Champions record European competition without recording a single victory
quit my job to travel passport world travel is a luxury for the privileged the rich or the retired travel stories Have a long-term plan visa-free destinations Central Station
City of London dry gin drinking building older foundations River Fleet flavour gin and tonic be served with cubed ice fruit floral spicy earthy savoury citrus
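To call compute on a trained model, the three output files first have to be turned into its ArrayList<String[]> arguments. The following is only a minimal sketch of one way to do that, assuming the values on each line are separated by whitespace. The class name LdaOutputReader and both method names are hypothetical helpers, not part of JGibbLDA.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns the lines of a .phi, .theta or .tassign file
// into the row lists that compute(...) expects.
public class LdaOutputReader {

    /** Splits every line on whitespace; a blank line becomes an empty row. */
    public static ArrayList<String[]> parseRows(List<String> lines) {
        ArrayList<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Keep blank lines as empty rows so that row i still refers to
            // document i in both the .theta and the .tassign file.
            rows.add(trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+"));
        }
        return rows;
    }

    /** Reads a whole file as UTF-8 and parses it with parseRows. */
    public static ArrayList<String[]> readRows(Path file) throws IOException {
        return parseRows(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

With this in place, a call might look like compute(LdaOutputReader.readRows(phiPath), LdaOutputReader.readRows(thetaPath), LdaOutputReader.readRows(tassignPath)), where the three Path variables are placeholders for wherever the model files were written.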
Choosing different numbers of topics gives different results, as follows:
Topic No. = 1 perplexity = 76.64751047832225
Topic No. = 2 perplexity = 47.257884731936784
Topic No. = 3 perplexity = 43.45364540615705
Topic No. = 4 perplexity = 48.62134551577707
So a topic number of 3 is the better choice here, since it gives the lowest perplexity of the four values tried.
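When more topic numbers are tried, picking the lowest perplexity by eye gets tedious. A small sketch of automating the comparison is given below. The class TopicCountSelector and its method are hypothetical names introduced only for illustration, and the input map is assumed to hold one perplexity value per topic number.

```java
import java.util.Map;

// Hypothetical helper: picks the topic number with the lowest perplexity.
public class TopicCountSelector {

    /**
     * Returns the key whose perplexity value is smallest.
     * Throws if the map is empty; returns -1 if no value is a finite number.
     */
    public static int bestTopicCount(Map<Integer, Double> perplexityByTopicCount) {
        if (perplexityByTopicCount.isEmpty()) {
            throw new IllegalArgumentException("no perplexity values given");
        }
        int best = -1;
        double bestPerplexity = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, Double> entry : perplexityByTopicCount.entrySet()) {
            if (entry.getValue() < bestPerplexity) {
                bestPerplexity = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Filling the map with the four results listed above (topic numbers 1 to 4) and calling bestTopicCount returns 3, matching the conclusion drawn by hand.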