JGibbLDA produces the following output files:
The .others file stores the LDA model parameters, such as alpha and beta.
The .phi file stores the topic-word distribution; each element is p(word|topic). Each row is a topic and each column is a word. Note that the columns cover the full vocabulary, not just the top words set by the twords parameter.
The .theta file stores the document-topic distribution; each element is p(topic|document). Each row is a document and each column is a topic probability.
The .tassign file records the topic assignment of every word token in the training corpus; each line corresponds to one document.
The .twords file lists the top words under each topic together with their weights.
wordmap.txt lists every distinct word in the corpus. Word IDs are assigned in order of first appearance, but the file's lines are written neither in ID order nor alphabetically, as the listing below shows; they follow the word map's internal iteration order.
The following example illustrates these outputs.
test_input.txt contains 4 documents; the first two are about sport (football), the last two about travel. Its contents are:
4
sport Spanish football association competition club tickets scored win winners keeper shots best goal campaign season's Champions League
France team France Football Federation president national team training session Champions record European competition without recording a single victory
quit my job to travel passport world travel is a luxury for the privileged the rich or the retired travel stories Have a long-term plan visa-free destinations Central Station
City of London dry gin drinking building older foundations River Fleet flavour gin and tonic be served with cubed ice fruit floral spicy earthy savoury citrus
The LDA model parameters are:
alpha 0.5
beta 0.1
topicNum 2
niters 1000
savestep 1000
twords 10
Two topics are configured, with 10 top words listed per topic. First look at wordmap.txt: since test_input.txt contains 81 distinct words, the file's first line is the total word count, and word IDs start from 0:
81
competition 4
Central 54
ice 74
earthy 78
without 28
building 62
passport 38
Federation 21
record 26
club 5
Spanish 1
plan 51
floral 76
League 17
goal 13
drinking 61
Fleet 66
keeper 10
destinations 53
foundations 64
is 40
Have 49
dry 59
City 56
spicy 77
European 27
my 34
privileged 44
Station 55
savoury 79
served 71
London 58
campaign 14
tonic 69
shots 11
job 35
tickets 6
be 70
season's 15
session 25
fruit 75
for 42
association 3
recording 29
best 12
training 24
gin 60
world 39
and 68
of 57
national 23
River 65
retired 47
older 63
France 18
win 8
winners 9
a 30
or 46
stories 48
flavour 67
cubed 73
victory 32
rich 45
football 2
team 19
Football 20
citrus 80
single 31
the 43
Champions 16
with 72
scored 7
luxury 41
quit 33
to 36
visa-free 52
travel 37
sport 0
president 22
long-term 50
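Reading this file back is straightforward. A minimal sketch, assuming the format shown above (first line is the vocabulary size, then one "word id" pair per line); `load_wordmap` is a hypothetical helper, not part of JGibbLDA:

```python
def load_wordmap(lines):
    """Parse JGibbLDA wordmap.txt lines into word->id and id->word dicts."""
    n = int(lines[0])  # first line: vocabulary size
    word2id = {}
    for line in lines[1:]:
        word, wid = line.rsplit(" ", 1)  # split on the last space only
        word2id[word] = int(wid)
    assert len(word2id) == n
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

# A few entries from the listing above:
sample = ["3", "competition 4", "Central 54", "ice 74"]
w2i, i2w = load_wordmap(sample)
print(i2w[54])  # -> Central
```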
.tassign
Each line of this file corresponds to one document. Each element is word_id:topic_id, meaning the token for word word_id is assigned to topic topic_id:
0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:1
18:0 19:0 18:0 20:0 21:0 22:0 23:0 19:0 24:0 25:0 16:0 26:0 27:0 4:0 28:0 29:0 30:0 31:0 32:0
33:1 34:1 35:1 36:1 37:1 38:0 39:1 37:1 40:0 30:1 41:1 42:1 43:1 44:1 43:1 45:1 46:1 43:1 47:1 37:1 48:1 49:1 30:1 50:1 51:1 52:1 53:1 54:1 55:0
56:1 57:1 58:0 59:1 60:1 61:1 62:1 63:1 64:1 65:1 66:1 67:1 60:1 68:1 69:0 70:1 71:1 72:1 73:1 74:0 75:1 76:1 77:1 78:1 79:0 80:0
The .twords file simply lists the highest-probability words under each topic together with their weights:
Topic 0th:
competition 0.04030710172744722
Champions 0.04030710172744722
France 0.04030710172744722
team 0.04030710172744722
sport 0.02111324376199616
Spanish 0.02111324376199616
football 0.02111324376199616
association 0.02111324376199616
club 0.02111324376199616
tickets 0.02111324376199616
Topic 1th:
travel 0.05525846702317291
the 0.05525846702317291
a 0.03743315508021391
gin 0.03743315508021391
League 0.0196078431372549
quit 0.0196078431372549
my 0.0196078431372549
job 0.0196078431372549
to 0.0196078431372549
world 0.0196078431372549
Now the two most important output files, .phi and .theta.
.phi is the topic-word matrix. This test has only 2 topics, so the matrix has 2 rows. The number of columns is not the topic word count set in the parameters; that parameter only controls how many words are displayed in .twords. The computation uses every word, so the columns are all words in wordmap.txt, giving 81 columns. The .phi file looks like this:
0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.040307 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.040307 | 0.001919 | 0.040307 | 0.040307 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.001919 | 0.001919 | 0.001919 | 0.001919 | 0.021113 | 0.021113 |
0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.019608 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.001783 | 0.037433 | 0.001783 | 0.001783 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.055258 | 0.001783 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.055258 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.037433 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.019608 | 0.019608 | 0.019608 | 0.019608 | 0.001783 | 0.001783 |
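The values in these rows can be reproduced with the standard collapsed-Gibbs point estimate phi[k][w] = (n_kw + beta) / (n_k + V*beta), where n_kw is the number of tokens of word w assigned to topic k, n_k is the total number of tokens assigned to topic k, and V is the vocabulary size. Counting the .tassign lines above gives n_0 = 44 tokens on topic 0; with V = 81 and beta = 0.1, row 0's values fall out directly. A sketch (the formula is the standard estimator; the count n_0 is read off this run's .tassign):

```python
V, beta = 81, 0.1
n_0 = 44  # tokens assigned to topic 0, counted from the .tassign lines above

def phi(n_kw, n_k):
    """Smoothed topic-word probability: (n_kw + beta) / (n_k + V*beta)."""
    return (n_kw + beta) / (n_k + V * beta)

print(round(phi(0, n_0), 6))  # word never assigned to topic 0 -> 0.001919
print(round(phi(1, n_0), 6))  # assigned once                  -> 0.021113
print(round(phi(2, n_0), 6))  # assigned twice                 -> 0.040307
```

Note that the beta smoothing is why words never assigned to a topic still get a small nonzero probability (0.001919 here) rather than zero.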
.theta is the document-topic matrix. With 4 documents and 2 topics in this test, it has 4 rows and 2 columns:
0.921053 | 0.078947 |
0.975 | 0.025 |
0.116667 | 0.883333 |
0.203704 | 0.796296 |
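These rows follow from the analogous estimate theta[d][k] = (n_dk + alpha) / (n_d + K*alpha), where n_dk is the number of tokens of document d assigned to topic k and n_d is the document's length. With alpha = 0.5, K = 2, and counts read off the .tassign lines above (document 0: 18 tokens, 17 on topic 0; document 2: 29 tokens, only 3 on topic 0), the table is reproduced exactly:

```python
alpha, K = 0.5, 2

def theta(n_dk, n_d):
    """Smoothed document-topic probability: (n_dk + alpha) / (n_d + K*alpha)."""
    return (n_dk + alpha) / (n_d + K * alpha)

# document 0: 18 tokens, 17 assigned to topic 0
print(round(theta(17, 18), 6), round(theta(1, 18), 6))   # 0.921053 0.078947
# document 2: 29 tokens, 3 assigned to topic 0
print(round(theta(3, 29), 6), round(theta(26, 29), 6))   # 0.116667 0.883333
```

This also explains why document 1, whose tokens were all assigned to topic 0, still shows 0.025 on topic 1: the alpha prior keeps every topic's probability above zero.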
---- end -----