GibbsLDA++

3.4 Case Study

         Suppose we want to estimate an LDA model for a collection of documents stored in the file models/casestudy/trndocs.dat, and then use that model to perform inference for new data stored in the file models/casestudy/newdocs.dat.

         We want to estimate 100 topics with alpha = 0.5 and beta = 0.1. We want to perform 1,000 Gibbs sampling iterations, saving the model every 100 iterations, and printing out the 20 most likely words of each topic each time the model is saved. Assuming we are at the root directory of GibbsLDA++, we run the following command to estimate an LDA model from scratch:

 

         $ src/lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile models/casestudy/trndocs.dat

 

         Now take a look at the models/casestudy directory to see the outputs.

 

Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:

 

<model_name>.others

<model_name>.phi

<model_name>.theta

<model_name>.tassign

<model_name>.twords

 

in which:

 

<model_name>: the name of an LDA model corresponding to the time step at which it was saved to disk. For example, the model saved at the 400th Gibbs sampling iteration is named model-00400; similarly, the model saved at the 1200th iteration is model-01200. The model saved at the last Gibbs sampling iteration is named model-final.
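The naming convention above can be sketched as a tiny helper (model_name is a hypothetical function for illustration, not part of GibbsLDA++):

```python
# Sketch of the model naming convention described above: the iteration
# number is zero-padded to five digits, and the last model is "model-final".
# model_name is a hypothetical helper, not part of GibbsLDA++.
def model_name(iteration, final=False):
    if final:
        return "model-final"
    return "model-%05d" % iteration

print(model_name(400))   # model-00400
print(model_name(1200))  # model-01200
```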

 

<model_name>.others: This file contains some parameters of the LDA model, such as:

 

alpha=?

beta=?

ntopics=? # i.e., number of topics

ndocs=? # i.e., number of documents

nwords=? # i.e., the vocabulary size

liter=? # i.e., the Gibbs sampling iteration at which the model was saved
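Since the .others file is a plain list of key=value lines as shown above, it can be read with a few lines of Python (a sketch; parse_others is a hypothetical helper, and the sample values below are made up for illustration):

```python
# Sketch of parsing a <model_name>.others file, assuming one "key=value"
# pair per line as listed above. parse_others is a hypothetical helper.
def parse_others(text):
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params

# Made-up sample content for illustration:
sample = "alpha=0.5\nbeta=0.1\nntopics=100\nndocs=1000\nnwords=20000\nliter=1000\n"
print(parse_others(sample)["ntopics"])  # 100
```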

 

<model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic, and each column is a word in the vocabulary.

 

<model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document, and each column is a topic.
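Both .phi and .theta are plain-text matrices, one row per line, so they can be loaded as nested lists (a sketch assuming space-separated columns; load_matrix is a hypothetical helper, and the sample values are made up):

```python
# Sketch of loading a <model_name>.theta (or .phi) file: each line is one
# row of probabilities. Whether a row is a document (.theta) or a topic
# (.phi) follows the descriptions above. Space-separated columns are an
# assumption; load_matrix is a hypothetical helper.
def load_matrix(text):
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]

# Made-up sample: 2 documents x 3 topics.
theta = load_matrix("0.7 0.2 0.1\n0.1 0.3 0.6\n")
print(theta[0])  # topic distribution of document 0: [0.7, 0.2, 0.1]
```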

 

<model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document consisting of a list of <word_ij>:<topic of word_ij> pairs, where word_ij is the j-th word of document i.
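A line of .tassign can be split into (word ID, topic ID) pairs (a sketch; parse_tassign_line is a hypothetical helper, and the word IDs refer to the integer IDs in wordmap.txt):

```python
# Sketch of parsing one line of <model_name>.tassign, assuming each
# whitespace-separated token has the form "<word_id>:<topic_id>" with
# integer IDs. parse_tassign_line is a hypothetical helper.
def parse_tassign_line(line):
    pairs = []
    for token in line.split():
        word_id, topic_id = token.split(":")
        pairs.append((int(word_id), int(topic_id)))
    return pairs

print(parse_tassign_line("0:12 5:3 0:12"))  # [(0, 12), (5, 3), (0, 12)]
```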


<model_name>.twords: This file contains the twords most likely words of each topic. twords is specified in the command line (see Sections 3.1.1 and 3.1.2).
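Assuming the .twords layout is a "Topic <k>th:" header followed by indented "word probability" lines (an assumption worth checking against your own output; parse_twords is a hypothetical helper), it can be parsed as:

```python
# Hedged sketch of parsing a <model_name>.twords file. The assumed layout
# is "Topic <k>th:" headers followed by indented "word probability" lines;
# verify this against your actual output. parse_twords is hypothetical.
def parse_twords(text):
    topics = {}
    current = None
    for line in text.splitlines():
        if line.startswith("Topic"):
            current = int(line.split()[1].rstrip("th:"))
            topics[current] = []
        elif line.strip() and current is not None:
            word, prob = line.split()
            topics[current].append((word, float(prob)))
    return topics

# Made-up sample content for illustration:
sample = "Topic 0th:\n\tmodel 0.02\n\tdata 0.01\nTopic 1th:\n\tword 0.03\n"
print(parse_twords(sample)[0])  # [('model', 0.02), ('data', 0.01)]
```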

 

GibbsLDA++ also saves a file called wordmap.txt that contains the mapping between words and their integer IDs. This is because GibbsLDA++ works internally with the integer IDs of words/terms rather than with text strings.
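Assuming wordmap.txt starts with the vocabulary size on its first line followed by one "word id" pair per line (an assumption worth checking against your own file; parse_wordmap is a hypothetical helper), it can be read as:

```python
# Sketch of reading wordmap.txt, assuming a vocabulary-size header line
# followed by "word id" pairs. This layout is an assumption; check your
# own wordmap.txt. parse_wordmap is a hypothetical helper.
def parse_wordmap(text):
    word2id = {}
    for line in text.splitlines()[1:]:  # skip the vocabulary-size header
        if line.strip():
            word, wid = line.split()
            word2id[word] = int(wid)
    return word2id

# Made-up sample content for illustration:
m = parse_wordmap("3\nmodel 0\ntopic 1\nword 2\n")
print(m["topic"])  # 1
```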

 

         Now we want to continue performing another 800 Gibbs sampling iterations from the previously estimated model model-01000, with savestep = 100 and twords = 30. We run the following command:

 

         $ src/lda -estc -dir models/casestudy/ -model model-01000 -niters 800 -savestep 100 -twords 30

 

         Now take a look at the casestudy directory to see the outputs.

 

         Now, if we want to perform inference (30 Gibbs sampling iterations) for the new data newdocs.dat using a previously estimated LDA model, say model-01800, we run the following command:

 

         $ src/lda -inf -dir models/casestudy/ -model model-01800 -niters 30 -twords 20 -dfile newdocs.dat

        

         Now, looking at the casestudy directory, we can see the outputs of the inference:

 

newdocs.dat.others

newdocs.dat.phi

newdocs.dat.tassign

newdocs.dat.theta

newdocs.dat.twords
