An analysis of GibbsLDA dataset.h

hello_pig1995

Article link: https://blog.csdn.net/Zhaohui1995_Yang/article/details/51771084

Intuitively, this header probably has a lot to do with wordmap.txt, because the source file defines two map types, mapid2word and mapword2id. Let's read through it.

// map of words/terms [string => int]
typedef map<string, int> mapword2id;
// map of words/terms [int => string]
typedef map<int, string> mapid2word;
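
To make the role of the two typedefs concrete, here is a minimal sketch of how a vocabulary could be built in both directions. The helper names build_word2id and invert are my own and are not part of GibbsLDA++, and handing out ids in order of first appearance is an assumption made only for this example.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using namespace std;

// Same typedefs as in dataset.h.
typedef map<string, int> mapword2id;
typedef map<int, string> mapid2word;

// Hypothetical helper: give every distinct word the next free id,
// in order of first appearance.
mapword2id build_word2id(const vector<string> & tokens) {
    mapword2id word2id;
    for (size_t i = 0; i < tokens.size(); i++) {
        if (word2id.find(tokens[i]) == word2id.end()) {
            // Read the size first so the new id does not depend on
            // the evaluation order of the assignment.
            int next_id = static_cast<int>(word2id.size());
            word2id[tokens[i]] = next_id;
        }
    }
    return word2id;
}

// Hypothetical helper: flip word => id into id => word.
mapid2word invert(const mapword2id & word2id) {
    mapid2word id2word;
    for (mapword2id::const_iterator it = word2id.begin();
         it != word2id.end(); ++it) {
        id2word[it->second] = it->first;
    }
    return id2word;
}
```

Keeping both directions around means a lookup is cheap whichever way it goes: word to id when encoding text, id to word when printing results.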

There are two classes, document and dataset.

class document:

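This class is essentially a plain data holder with a set of constructors. It has three members. The first is an int pointer, (int *) words, and it is the key one: GibbsLDA represents each original string by the integer it was mapped to, so my guess is that this array holds the document's words as int ids, one per word. The other two are the original text, (string) rawstr, and the number of words, (int) length.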
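
As a rough illustration of "words as ints", the sketch below turns a whitespace-separated line into the kind of id sequence that words would point to. The function name encode is hypothetical, and skipping tokens that are missing from the map is an assumption of mine, not something dataset.h confirms.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

typedef map<string, int> mapword2id;

// Hypothetical helper: split rawstr on whitespace and look each token up.
// Tokens that are not in the map are skipped (an assumption for this sketch).
vector<int> encode(const string & rawstr, const mapword2id & word2id) {
    vector<int> ids;
    istringstream in(rawstr);
    string token;
    while (in >> token) {
        mapword2id::const_iterator it = word2id.find(token);
        if (it != word2id.end()) {
            ids.push_back(it->second);
        }
    }
    return ids;
}
```

A vector produced this way has the shape that the vector<int> constructors of document expect.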

Constructor 1: the default constructor, which sets everything to empty.

document() {
    words = NULL;
    rawstr = "";
    length = 0;
}

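Constructor 2: takes only a length. Given the length, it simply allocates an array: words points to the array that will hold the document as ints, and length records its size. The array contents are left uninitialized.

document(int length) {
    this->length = length;
    rawstr = "";
    words = new int[length];
}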

Constructor 3: takes a length and an existing int array, which it copies element by element.

document(int length, int * words) {
    this->length = length;
    rawstr = "";
    this->words = new int[length];
    for (int i = 0; i < length; i++) {
        this->words[i] = words[i];
    }
}

Constructor 4: the same, but also takes the original string.

document(int length, int * words, string rawstr) {
    this->length = length;
    this->rawstr = rawstr;
    this->words = new int[length];
    for (int i = 0; i < length; i++) {
        this->words[i] = words[i];
    }
}

Constructor 5: takes a vector<int> that holds the document as int ids.

document(vector<int> & doc) {
    this->length = doc.size();
    rawstr = "";
    this->words = new int[length];
    for (int i = 0; i < length; i++) {
        this->words[i] = doc[i];
    }
}

Constructor 6: takes the vector together with the original string.

document(vector<int> & doc, string rawstr) {
    this->length = doc.size();
    this->rawstr = rawstr;
    this->words = new int[length];
    for (int i = 0; i < length; i++) {
        this->words[i] = doc[i];
    }
}

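Destructor: frees the pointer.

~document() {
    if (words) {
        // words comes from new int[length], so the array form delete[] is the matching one
        delete[] words;
    }
}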

So class document does not do very much: it just stores one document of the raw data, where rawstr is the string and words is the int pointer.
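
One thing worth noticing is that the snippets quoted here show a raw owning pointer but no copy constructor, so copying such an object would leave two objects pointing at the same array. As a hedged comparison, here is a minimal sketch of the same data holder built on std::vector. The name doc_sketch is hypothetical and this is not how GibbsLDA++ itself is written.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using namespace std;

// Hypothetical stand-in for class document, not the GibbsLDA++ type.
struct doc_sketch {
    vector<int> words;  // word ids, owned by the vector
    string rawstr;      // original text, may be empty

    // Mirrors constructor 1: everything empty.
    doc_sketch() {}

    // Mirrors constructor 2, except that the ids start at 0
    // instead of being left uninitialized.
    explicit doc_sketch(int length)
        : words(static_cast<size_t>(length), 0) {}

    // Mirrors constructors 5 and 6: copy the ids, optionally keep the text.
    doc_sketch(const vector<int> & doc, const string & raw = "")
        : words(doc), rawstr(raw) {}

    // The length is derived from the vector, so it cannot drift out of sync.
    int length() const { return static_cast<int>(words.size()); }
};
```

With the vector owning the storage, no destructor is needed, and a copy gets its own independent array.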

class dataset

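This one is mostly construction, insertion and the like, so it is not very complicated either.

docs is a pointer to an array of document pointers, (document **) docs.

_docs is the corresponding pointer used in inference mode.

_id2id is likewise used only in inference mode.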

M = number of documents.

V = number of words.

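Constructor 1: the default constructor.

Constructor 2: the constructor for M documents.

Destructor: mainly frees pointers, namely docs and _docs.

The deallocate function is a curious one. I have not yet figured out when it is supposed to be called, but it clears out everything in docs and _docs.

add_doc puts a document at position idx.

void add_doc(document * doc, int idx) {
    if (0 <= idx && idx < M) {
        docs[idx] = doc;
    }
}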
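
Note that add_doc as quoted silently does nothing when idx is out of range. The sketch below reproduces the same bounds check on a stripped-down container so the behavior can be tried out. The names doc_stub and dataset_sketch are hypothetical, and returning a bool is my own addition: the real add_doc returns void.

```cpp
#include <cstddef>
#include <vector>

using namespace std;

// Hypothetical placeholder for class document.
struct doc_stub {
    int length;
};

// Hypothetical stand-in for class dataset: M slots, all empty at first.
struct dataset_sketch {
    int M;                    // number of documents
    vector<doc_stub *> docs;  // one slot per document, NULL until filled

    explicit dataset_sketch(int num_docs)
        : M(num_docs),
          docs(static_cast<size_t>(num_docs), static_cast<doc_stub *>(NULL)) {}

    // Same range check as add_doc, but reports whether the slot was filled.
    bool add_doc(doc_stub * doc, int idx) {
        if (0 <= idx && idx < M) {
            docs[idx] = doc;
            return true;
        }
        return false;
    }
};
```

The container only stores the pointer it is given. In this sketch the caller keeps ownership of the document, whereas the text above suggests the real dataset frees its docs in the destructor.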

_add_doc does the same thing, except that it is used in inference mode.

So for now, all we know is that class dataset can be constructed and destructed.

static int write_wordmap(string wordmapfile, mapword2id * pword2id);
static int read_wordmap(string wordmapfile, mapword2id * pword2id);
static int read_wordmap(string wordmapfile, mapid2word * pid2word);

int read_trndata(string dfile, string wordmapfile);
int read_newdata(string dfile, string wordmapfile);
int read_newdata_withrawstrs(string dfile, string wordmapfile);

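Of these functions, read_wordmap is probably tied to wordmap.txt, that is, to the mapping back and forth between str(word) and int(word).

read_trndata is probably the function that reads the training data trndata.dat, and read_newdata is probably the one used at inference time. dfile is the path to the data file, and wordmapfile is, I assume, wordmap.txt.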
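
Since the header only declares these functions, here is a guess at what reading a word map might look like. Everything in this sketch is an assumption: the name read_wordmap_sketch, the file layout (an entry count on the first line, then one "word id" pair per line), and the convention of returning 0 on success and 1 on failure. It reads from a stream rather than a file path so that it can be tried without a file on disk.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

using namespace std;

typedef map<string, int> mapword2id;

// Hypothetical reader, not the real read_wordmap. Assumed layout:
//   line 1: number of entries
//   then:   "<word> <id>" on each following line
// Returns 0 on success and 1 on a malformed or truncated stream.
int read_wordmap_sketch(istream & in, mapword2id * pword2id) {
    int nwords = 0;
    if (!(in >> nwords) || nwords < 0) {
        return 1;
    }
    pword2id->clear();
    for (int i = 0; i < nwords; i++) {
        string word;
        int id;
        if (!(in >> word >> id)) {
            return 1;  // fewer entries than the header promised
        }
        (*pword2id)[word] = id;
    }
    return 0;
}
```

For example, feeding it an istringstream containing "2\nhello 0\nworld 1\n" should fill the map with two entries.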