JGibbLDA produces the following output files:
The .others file stores the LDA model parameters, such as alpha and beta.
The .phi file stores the topic-word distribution; each element is p(word|topic). Each row is a topic and each column is a word. Note that the columns cover the full vocabulary, not just the top words set by the twords parameter.
The .theta file stores the document-topic distribution; each element is p(topic|document). Each row is a document and each column is a topic probability.
The .tassign file records the topic assignment of every word token in the training corpus; each line corresponds to one document.
The .twords file lists the top words under each topic together with their weights.
wordmap.txt lists every distinct word in the corpus. Word IDs are assigned in order of first appearance, but the file's lines are written neither in ID order nor alphabetically, as the listing below shows; they follow the word map's internal iteration order.
The following example illustrates these outputs.
test_input.txt contains 4 documents; the first two are about sport (football), the last two about travel. Its contents are:
4
sport Spanish football association competition club tickets scored win winners keeper shots best goal campaign season's Champions League
France team France Football Federation president national team training session Champions record European competition without recording a single victory
quit my job to travel passport world travel is a luxury for the privileged the rich or the retired travel stories Have a long-term plan visa-free destinations Central Station
City of London dry gin drinking building older foundations River Fleet flavour gin and tonic be served with cubed ice fruit floral spicy earthy savoury citrus
The LDA model parameters are:
alpha 0.5
beta 0.1
topicNum 2
niters 1000
savestep 1000
twords 10
Two topics are configured, with 10 top words listed per topic. First look at wordmap.txt: since test_input.txt contains 81 distinct words, the file's first line is the total word count, and word IDs start from 0:
81
competition 4
Central 54
ice 74
earthy 78
without 28
building 62
passport 38
Federation 21
record 26
club 5
Spanish 1
plan 51
floral 76
League 17
goal 13
drinking 61
Fleet 66
keeper 10
destinations 53
foundations 64
is 40
Have 49
dry 59
City 56
spicy 77
European 27
my 34
privileged 44
Station 55
savoury 79
served 71
London 58
campaign 14
tonic 69
shots 11
job 35
tickets 6
be 70
season's 15
session 25
fruit 75
for 42
association 3
recording 29
best 12
training 24
gin 60
world 39
and 68
of 57
national 23
River 65
retired 47
older 63
France 18
win 8
winners 9
a 30
or 46
stories 48
flavour 67
cubed 73
victory 32
rich 45
football 2
team 19
Football 20
citrus 80
single 31
the 43
Champions 16
with 72
scored 7
luxury 41
quit 33
to 36
visa-free 52
travel 37
sport 0
president 22
long-term 50
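Reading this file back is straightforward. A minimal sketch, assuming the format shown above (first line is the vocabulary size, then one "word id" pair per line); `load_wordmap` is a hypothetical helper, not part of JGibbLDA:

```python
def load_wordmap(lines):
    """Parse JGibbLDA wordmap.txt lines into word->id and id->word dicts."""
    n = int(lines[0])  # first line: vocabulary size
    word2id = {}
    for line in lines[1:]:
        word, wid = line.rsplit(" ", 1)  # split on the last space only
        word2id[word] = int(wid)
    assert len(word2id) == n
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

# A few entries from the listing above:
sample = ["3", "competition 4", "Central 54", "ice 74"]
w2i, i2w = load_wordmap(sample)
print(i2w[54])  # -> Central
```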
.tassign
Each line of this file corresponds to one document. Each element is word_id:topic_id, meaning the token for word word_id is assigned to topic topic_id:
0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:1
18:0 19:0 18:0 20:0 21:0 22:0 23:0 19:0 24:0 25:0 16:0 26:0 27:0 4:0 28:0 29:0 30:0 31:0 32:0
33:1 34:1 35:1 36:1 37:1 38:0 39:1 37:1 40:0 30:1 41:1 42:1 43:1 44:1 43:1 45:1 46:1 43:1 47:1 37:1 48:1 49:1 30:1 50:1 51:1 52:1 53:1 54:1 55:0
56:1 57:1 58:0 59:1 60:1 61:1 62:1 63:1 64:1 65:1 66:1 67:1 60:1 68:1 69:0 70:1 71:1 72:1 73:1 74:0 75:1 76:1 77:1 78:1 79:0 80:0
The .twords file simply lists the highest-probability words under each topic together with their weights:
Topic 0th:
competition 0.04030710172744722
Champions 0.04030710172744722
France 0.04030710172744722
team 0.04030710172744722
sport 0.02111324376199616
Spanish 0.02111324376199616
football 0.02111324376199616
association 0.02111324376199616
club 0.02111324376199616
tickets 0.02111324376199616
Topic 1th:
travel 0.05525846702317291
the 0.05525846702317291
a 0.03743315508021391
gin 0.03743315508021391
League 0.0196078431372549
quit 0.0196078431372549
my 0.0196078431372549
job 0.0196078431372549
to 0.0196078431372549
world 0.0196078431372549
Now the two most important output files, .phi and .theta.
.phi is the topic-word matrix. This test has only 2 topics, so the matrix has 2 rows. The number of columns is not the topic word count set in the parameters; that parameter only controls how many words are displayed in .twords. The computation uses every word, so the columns are all words in wordmap.txt, giving 81 columns. The .phi file looks like this:
0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.040307 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.040307 | 0.001919 | 0.040307 | 0.040307 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.021113 |
0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.019608 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.037433 | 0.001783 | 0.001783 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.055258 | 0.001783 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.055258 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.037433 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.001783 |
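The values in these rows can be reproduced with the standard collapsed-Gibbs point estimate phi[k][w] = (n_kw + beta) / (n_k + V*beta), where n_kw is the number of tokens of word w assigned to topic k, n_k is the total number of tokens assigned to topic k, and V is the vocabulary size. Counting the .tassign lines above gives n_0 = 44 tokens on topic 0; with V = 81 and beta = 0.1, row 0's values fall out directly. A sketch (the formula is the standard estimator; the count n_0 is read off this run's .tassign):

```python
V, beta = 81, 0.1
n_0 = 44  # tokens assigned to topic 0, counted from the .tassign lines above

def phi(n_kw, n_k):
    """Smoothed topic-word probability: (n_kw + beta) / (n_k + V*beta)."""
    return (n_kw + beta) / (n_k + V * beta)

print(round(phi(0, n_0), 6))  # word never assigned to topic 0 -> 0.001919
print(round(phi(1, n_0), 6))  # assigned once                  -> 0.021113
print(round(phi(2, n_0), 6))  # assigned twice                 -> 0.040307
```

Note that the beta smoothing is why words never assigned to a topic still get a small nonzero probability (0.001919 here) rather than zero.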
.theta is the document-topic matrix. With 4 documents and 2 topics in this test, it has 4 rows and 2 columns:
0.921053 | 0.078947 |
0.975 | 0.025 |
0.116667 | 0.883333 |
0.203704 | 0.796296 |
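These rows follow from the analogous estimate theta[d][k] = (n_dk + alpha) / (n_d + K*alpha), where n_dk is the number of tokens of document d assigned to topic k and n_d is the document's length. With alpha = 0.5, K = 2, and counts read off the .tassign lines above (document 0: 18 tokens, 17 on topic 0; document 2: 29 tokens, only 3 on topic 0), the table is reproduced exactly:

```python
alpha, K = 0.5, 2

def theta(n_dk, n_d):
    """Smoothed document-topic probability: (n_dk + alpha) / (n_d + K*alpha)."""
    return (n_dk + alpha) / (n_d + K * alpha)

# document 0: 18 tokens, 17 assigned to topic 0
print(round(theta(17, 18), 6), round(theta(1, 18), 6))   # 0.921053 0.078947
# document 2: 29 tokens, 3 assigned to topic 0
print(round(theta(3, 29), 6), round(theta(26, 29), 6))   # 0.116667 0.883333
```

This also explains why document 1, whose tokens were all assigned to topic 0, still shows 0.025 on topic 1: the alpha prior keeps every topic's probability above zero.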
---- end -----