Probabilistic Language Models and Their Variants (5): A Java Implementation of LDA Gibbs Sampling

LarryNLPIR

Article link: https://blog.csdn.net/yangliuy/article/details/8457329

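This series of posts introduces common probabilistic language models and their variants, focusing on PLSA, LDA, variants of LDA, and their parameter inference methods. The planned contents are:

Part 1: PLSA and the EM algorithm

Part 2: LDA and Gibbs Sampling

Part 3: LDA variants: Twitter LDA, TimeUserLDA, ATM, Labeled-LDA, MaxEnt-LDA, etc.

Part 4: A categorized summary (bibliography) of papers based on LDA variants

Part 5: A Java implementation of LDA Gibbs Sampling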

Part 5: A Java Implementation of LDA Gibbs Sampling

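In the first two posts of this series, we systematically introduced PLSA, LDA, and their parameter inference methods, focusing on model representation and the derivation of the formulas. A scholar once said that research should "reach the sky and stand on the ground": models and theory alone are not enough, and they must be backed by solid code and experimental results on real data. This post therefore walks through a Java implementation of LDA Gibbs Sampling and presents the topic modeling results obtained by applying it to the Newsgroup 18828 news corpus.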

The project is on GitHub: https://github.com/yangliuy/LDAGibbsSampling

1. Preprocessing the document collection

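To model topics in text with LDA, the text must first be preprocessed: tokenization, stopword removal, stemming, removal of noise words, removal of low-frequency words, and so on. When the corpus is large, stemming can be skipped. The text is then converted to a term-index representation, because the LDA implementation later needs to map between terms and indices frequently. The Documents class listed below implements this; it defines an inner class, Document, that describes a single document in the collection.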
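Before reading the full class, it may help to see the term-to-index idea in isolation. The following is a minimal sketch, not the project's code: the class name TermIndexSketch, the toy stopword list, and the regex tokenizer are all assumptions made for illustration, and stemming and low-frequency filtering are left out. It only shows how tokens can be turned into integer indices while keeping both directions of the mapping and a running term count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; not part of the LDAGibbsSampling project. */
class TermIndexSketch {

    // Hypothetical toy stopword list (the project ships its own Stopwords class).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is"));

    final Map<String, Integer> termToIndex = new HashMap<>(); // term -> index
    final List<String> indexToTerm = new ArrayList<>();       // index -> term
    final Map<String, Integer> termCount = new HashMap<>();   // term -> occurrences

    /** Lowercase the text, split on non-letters, drop empty tokens and stopwords. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Replace each token by its index, assigning a fresh index on first sight. */
    int[] encode(List<String> tokens) {
        int[] ids = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String term = tokens.get(i);
            Integer id = termToIndex.get(term);
            if (id == null) {
                id = indexToTerm.size();   // next unused index
                termToIndex.put(term, id);
                indexToTerm.add(term);
            }
            termCount.merge(term, 1, Integer::sum);
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        TermIndexSketch vocab = new TermIndexSketch();
        int[] ids = vocab.encode(tokenize("The topic of a topic model is a topic"));
        System.out.println(Arrays.toString(ids)); // prints [0, 0, 1, 0]
        System.out.println(vocab.indexToTerm);    // prints [topic, model]
    }
}
```

Keeping the index-to-term direction in a list means the list position is the index, so looking up a term for printing topic words is a plain get(i). The author's own Documents class follows.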

package liuyang.nlp.lda.main;

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import liuyang.nlp.lda.com.FileUtil;
import liuyang.nlp.lda.com.Stopwords;

/**Class for corpus which consists of M documents
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail yangliuyx@gmail.com
 */

public class Documents {
	
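	ArrayList<Document> docs; // all documents in the corpus
	Map<String, Integer> termToIndexMap; // term -> integer index
	ArrayList<String> indexToTermMap; // integer index -> term (list position is the index)
	Map<String,Integer> termCountMap; // term -> occurrence count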
	
	public Documents(){
		docs = new ArrayList<Document>();
		termToIndexMap = new HashMap<String, Integer>();
		indexToTermMap = new ArrayList<String>();
		termCountMap = new HashMap<String, Integer>();
	}
	
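	public void readDocs(String docsPath){
		File[] docFiles = new File(docsPath).listFiles();
		if(docFiles == null){
			// listFiles() returns null when docsPath is not a readable directory
			throw new IllegalArgumentException("Not a readable directory: " + docsPath);
		}
		for(File docFile : docFiles){
			Document doc = new Document(docFile.getAbsolutePath(), termToIndexMap, indexToTermMap, termCountMap);
			docs.add(doc);
		}
	}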
	
	public static class Document {	
		private String docName;
		int[] docWords;
		
		public Document(String docName, Map<String, Integer> termToIndexMap, ArrayList<String> indexToTermMap, Map<String, Integer> termCountMap){
			this.d