Understanding LDA and Source Code Analysis (Part 2) - CSDN Blog

Article link: https://blog.csdn.net/pirage/article/details/50239209


The LDA series is spread across several posts. The overall outline is:

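Background for LDA
- What conjugacy is
- The multinomial distribution
- The Dirichlet distribution
LDA in text
- The probabilistic graphical model of LDA
- Parameter derivation for LDA
- Pseudocode
GibbsLDA++-0.2 source code analysis
A Python implementation of GibbsLDA
References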

GibbsLDA++-0.2 Source Code Analysis

The GibbsLDA++-0.2 package can be downloaded here: Download

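The docs folder of the package contains the manual, GibbsLDA++Manual.pdf. Compile the code as the manual describes and the tool is ready to use. (Detailed usage is given later.)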

The code is in the src folder. It consists of the classes dataset, model, strtokenizer, and utils, plus the files lda.cpp and constants.h.

dataset
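
The dataset code is built around two maps between words and integer ids, and a wordmap file that stores them as "word id" lines. As a minimal sketch of that idea (not the GibbsLDA++ implementation), the snippet below reuses the two typedef names from the listing and adds three hypothetical helpers, encode, to_wordmap_text, and parse_wordmap_text. It assumes ids are assigned in order of first appearance and that the file holds nothing but "word id" pairs. The real file format may differ in its details.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Same alias names as in the dataset listing.
typedef std::map<std::string, int> mapword2id;
typedef std::map<int, std::string> mapid2word;

// Hypothetical helper (not part of GibbsLDA++): turn a whitespace-separated
// line into word ids, giving each unseen word the next free id.
std::vector<int> encode(const std::string& line, mapword2id& word2id) {
    std::vector<int> ids;
    std::istringstream in(line);
    std::string word;
    while (in >> word) {
        mapword2id::iterator it = word2id.find(word);
        if (it == word2id.end()) {
            // Assumption: the new id is the current vocabulary size.
            int id = static_cast<int>(word2id.size());
            it = word2id.insert(std::make_pair(word, id)).first;
        }
        ids.push_back(it->second);
    }
    return ids;
}

// Hypothetical helper: serialize the map as one "word id" pair per line.
std::string to_wordmap_text(const mapword2id& word2id) {
    std::ostringstream out;
    for (mapword2id::const_iterator it = word2id.begin();
         it != word2id.end(); ++it) {
        out << it->first << ' ' << it->second << '\n';
    }
    return out.str();
}

// Hypothetical helper: parse "word id" lines back into an id -> word map.
mapid2word parse_wordmap_text(const std::string& text) {
    mapid2word id2word;
    std::istringstream in(text);
    std::string word;
    int id;
    while (in >> word >> id) {
        id2word[id] = word;
    }
    return id2word;
}
```

Calling encode("topic model topic word", w2i) on an empty map yields the id sequence 0, 1, 0, 2, which is the kind of int array a document stores in words. Note that std::map iterates in key order, so the serialized lines come out sorted by word rather than by id.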

    // Two type aliases (typedefs, not variables): word -> id and id -> word maps.
    typedef map<string, int> mapword2id;
    typedef map<int, string> mapid2word;
    // The document class
    class document {
    public:
        // id of each word in the document
        int * words; 
        string rawstr;
        // total number of words in the document
        int length; 
        document() {}
        document(int length) {}
        document(int length, int * words) {}
        document(int length, int * words, string rawstr) {}
        document(vector<int> & doc) {}
        document(vector<int> & doc, string rawstr) {}
        ~document() {}
    };
    class dataset {
    public:
        document ** docs; 
        document ** _docs; // used only for inference
        map<int, int> _id2id; // also used only for inference
        int M; // total number of documents
        int V; // total number of words (vocabulary size)
        dataset() {}
        dataset(int M) {}   
        ~dataset() {}
        void deallocate() {}
        void add_doc(document * doc, int idx) {}   
        void _add_doc(document * doc, int idx) {}       
        // write the wordmap file from pword2id; each line has the form "word id"
        static int write_wordmap(string wordmapfile, mapword2id * pword2id);
        // read the wordmap file and store its contents in pword2id
        static int read_wordmap(string wordmapfile, mapword2id * pword2id);
        static int read_wordmap(string wordmapfile, mapid2word * pid2word);