Learning the LDA Model (Code)


beck_zhou


Article link: https://blog.csdn.net/zhoubl668/article/details/8365710


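This post describes how to apply the LDA model to text clustering, the probability theory and mathematics involved along the way, and an attempt to implement it with the Mallet library that was abandoned in favor of other source code because Mallet offered too few examples. In the end, the text is tokenized and term frequencies are counted to build the document matrix that the algorithm operates on.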

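To make the "tokenize, count, build the document matrix" step more concrete, here is a minimal sketch of how raw text might be turned into the `int[][] documents` term-index arrays and vocabulary size `V` that the sampler below declares. It rests on several assumptions: `CorpusBuilder`, `build`, and `termCounts` are hypothetical names that do not appear in the original code, and splitting on whitespace stands in for a real word segmenter (Chinese text would need one).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the LdaGibbsSampler source.
class CorpusBuilder {

    /** Maps each distinct term to an integer id, in order of first appearance. */
    final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** Converts raw texts into arrays of term ids, one array per document. */
    int[][] build(List<String> texts) {
        int[][] documents = new int[texts.size()][];
        for (int m = 0; m < texts.size(); m++) {
            String trimmed = texts.get(m).trim().toLowerCase();
            // Assumption: whitespace tokenization; swap in a real segmenter as needed.
            String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            documents[m] = new int[tokens.length];
            for (int n = 0; n < tokens.length; n++) {
                Integer id = vocabulary.get(tokens[n]);
                if (id == null) {
                    id = vocabulary.size();          // next unused id
                    vocabulary.put(tokens[n], id);
                }
                documents[m][n] = id;
            }
        }
        return documents;
    }

    /** Counts how often each term id occurs in each document (an M x V matrix). */
    static int[][] termCounts(int[][] documents, int vocabularySize) {
        int[][] counts = new int[documents.length][vocabularySize];
        for (int m = 0; m < documents.length; m++) {
            for (int termId : documents[m]) {
                counts[m][termId]++;
            }
        }
        return counts;
    }
}
```

After calling `build`, `vocabulary.size()` would play the role of `V`, and the returned array has the same shape as the `documents` field in the sampler. The sampler itself consumes the term-id sequences; the count matrix from `termCounts` is only a convenient way to inspect word frequencies.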

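I really racked my brains trying to use the LDA algorithm for text clustering. Besides working through the basic mathematics that gives me a headache (probability theory, stochastic processes, calculus), I also searched online for source code that already implements it.

Mallet was the first glimmer of hope. I studied it for about a week and finally decided to give up, because the examples provided by Mallet's authors were simply too few.

So I went back to this piece of source code that I had found online: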

 
 
 
  /*  
   * (C) Copyright 2005, Gregor Heinrich (gregor :: arbylon : net) (This file is  
   * part of the org.knowceans experimental software packages.)  
   */ 
  /*  
   * LdaGibbsSampler is free software; you can redistribute it and/or modify it  
   * under the terms of the GNU General Public License as published by the Free  
   * Software Foundation; either version 2 of the License, or (at your option) any  
   * later version.  
   */ 
  /*  
   * LdaGibbsSampler is distributed in the hope that it will be useful, but  
   * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or  
   * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more  
   * details.  
   */ 
  /*  
   * You should have received a copy of the GNU General Public License along with  
   * this program; if not, write to the Free Software Foundation, Inc., 59 Temple  
   * Place, Suite 330, Boston, MA 02111-1307 USA  
   */ 
   
  /*  
   * Created on Mar 6, 2005  
   */ 
  package com.xh.lda;  
   
  import java.text.DecimalFormat;  
  import java.text.NumberFormat;  
   
  /**  
   * Gibbs sampler for estimating the best assignments of topics for words and  
   * documents in a corpus. The algorithm is introduced in Tom Griffiths' paper  
   * "Gibbs sampling in the generative model of Latent Dirichlet Allocation"  
   * (2002).  
   *   
   * @author heinrich  
   */ 
  public class LdaGibbsSampler {  
   
      /**  
       * document data (term lists)  
       */ 
      int[][] documents;  
   
      /**  
       * vocabulary size  
       */ 
      int V;  
   
      /**  
       * number of topics  
       */ 
      int K;  
   
      /**  
       * Dirichlet parameter (document--topic associations)  
       */ 
      double alpha;  
   
      /**  
       * Dirichlet parameter (topic--term associations)  
       */ 
      double beta;  
   
      /**  
       * topic assignments for each word.  
       */ 
      int[][] z;  
   
      /**  
       * nw[i][j] number of instances of word (term) i assigned to topic j.  
       */ 
      int[][] nw;  
   
      /**  
       * nd[i][j] number of words in document i assigned to topic j.  
       */ 
      int[][] nd;  
   
      /**  
       * nwsum[j] total number of words assigned to topic j.  
       */ 
      int[] nwsum;  
   
      /**  
       * ndsum[i] total number of words in document i.  
       */ 
      int[] ndsum;  
   
      /**  
       * cumulative statistics of theta  
       */ 
      double[][] thetasum;  
   
      /**  
       * cumulative statistics of phi  
       */ 
      double[][] phisum;  
   
      /**  
       * size of statistics  
       */ 
      int numstats;  
   
      /**  
       * sampling lag (?)  
       */ 
      private static int THIN_INTERVAL = 20;  
   
      /**  
       * burn-in period  
       */ 
      private static