SRILM Reading Notes 14

kevinfight

Article link: https://blog.csdn.net/kevinfight/article/details/6423926

NgramStats.h NgramStats.cc
Document author: jianzhu
Created: 08.09.18

--------------------------------------
1. Overview
--------------------------------------
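 These two files implement the functions for counting ngrams, and also
define how the collected ngram counts are written to a file, in text or
binary form, and read back from a file.
 NgramCounts class
 This class counts the ngrams in a text file, reads and writes the
 collected counts, and prunes ngrams whose count falls below a threshold.
 It provides the following functions:
 a) constructor
 b) destructor
 c) findCount
 d) insertCount
 e) removeCount
 f) countSentence
 g) read
 h) write
 i) writeBinary
 j) parseNgram
 k) readNgram
 l) writeNgram
 m) sumCounts
 n) pruneCounts
 o) setCounts
 p) dump
 q) clear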

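 NgramCountsIter class
 This class iterates over all ngrams of a given order held by an
 NgramCounts object. It does so by wrapping a TrieIter2 iterator.
 It provides the following functions:
 a) constructor
 b) init
 c) next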

 NgramStats class
 This class is a subclass of a specialized NgramCounts.

 NgramsIter class
 This class is a specialized NgramCountsIter with the same functionality.
 It is used to iterate over an NgramStats object.

--------------------------------------
2. Function Explanations
--------------------------------------
a) Constructor
<src>
0 template <class CountT>
1 NgramCounts<CountT>::NgramCounts(Vocab &vocab, unsigned int maxOrder)
2 : LMStats(vocab), order(maxOrder), intersect(false)
3 {
4 }
</src>
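 Purpose: initializes the member variables of the object.

 Details: line 2 uses the member initializer list to initialize the base class
 LMStats, and sets the ngram order (order) and the intersect flag.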

c) findCount
 The class provides two overloaded versions of findCount.
i)
<src>
0 CountT *findCount(const VocabIndex *words)
1 {
2 return counts.find(words);
3 }
</src>
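 Purpose: returns a pointer to the count stored for the given ngram.

 Details: line 2 calls find on the member counts to look up the count of the
 ngram. Note that the ngram here is an array of word indices and counts is a
 trie, so this amounts to looking up the ngram's count in the trie.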
ii)
<src>
0 CountT *findCount(const VocabIndex *words, VocabIndex word1)
1 {
2 NgramNode *node = counts.findTrie(words);
3 return node ? node->find(word1) : 0;
4 }
</src>
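 Purpose: returns a pointer to the count stored for the given ngram.

 Details: this function does almost the same as the previous one. The only
 difference is that the ngram is passed in two parts: a prefix and the final
 word. Line 2 calls the trie member function findTrie to locate the subtrie
 for the prefix, and line 3 calls find on that subtrie to look up the final
 word. If the word is found, a pointer to its count is returned. If the
 prefix or the word is missing, 0 is returned.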

d) insertCount
 The class provides two overloaded versions of insertCount.
i)
<src>
0 CountT *insertCount(const VocabIndex *words)
1 {
2 return counts.insert(words);
3 }
</src>
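 Purpose: adds the given ngram to the trie and returns a pointer to its data value.

 Details: line 2 calls the trie's insert function, which adds the ngram to the
 trie and returns a pointer to the data value stored for it.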

ii)
<src>
0 CountT *insertCount(const VocabIndex *words, VocabIndex word1)
1 {
2 NgramNode *node = counts.insertTrie(words);
3 return node->insert(word1);
4 }
</src>
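 Purpose: adds the given ngram to the trie and returns a pointer to its data value.

 Details: this function does almost the same as the previous one. The only
 difference is that the ngram is passed in two parts: a prefix and the final
 word. Line 2 calls the trie member function insertTrie, which adds the prefix
 to the trie and returns the corresponding subtrie. Line 3 then calls insert
 on that subtrie to add the final word, and returns a pointer to the data
 value of the ngram formed by the prefix and the final word.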
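
The relationship between the whole-ngram overloads and the prefix-plus-last-word overloads can be sketched with a toy trie. Everything below (`ToyNode`, `insertPath`, `findPath`, the `std::map` layout) is a hypothetical stand-in chosen for illustration, not SRILM's `Trie` class.

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical toy node: a count plus children keyed by word index.
struct ToyNode {
    long count = 0;
    std::map<int, std::unique_ptr<ToyNode>> children;
};

// Walk `words` from `node`, creating missing nodes (the role of insertTrie).
ToyNode *insertPath(ToyNode *node, const std::vector<int> &words) {
    for (int w : words) {
        std::unique_ptr<ToyNode> &slot = node->children[w];
        if (!slot) slot = std::make_unique<ToyNode>();
        node = slot.get();
    }
    return node;
}

// Walk `words` from `node` without creating anything (the role of findTrie).
ToyNode *findPath(ToyNode *node, const std::vector<int> &words) {
    for (int w : words) {
        auto it = node->children.find(w);
        if (it == node->children.end()) return nullptr;
        node = it->second.get();
    }
    return node;
}

// Whole-ngram lookup.
long *findCount(ToyNode &root, const std::vector<int> &words) {
    ToyNode *node = findPath(&root, words);
    return node ? &node->count : nullptr;
}

// Prefix-plus-last-word lookup: find the prefix subtrie, then take one more step.
long *findCount(ToyNode &root, const std::vector<int> &prefix, int last) {
    ToyNode *node = findPath(&root, prefix);
    return node ? findCount(*node, {last}) : nullptr;
}

// Prefix-plus-last-word insertion: insert the prefix, then the last word below it.
long *insertCount(ToyNode &root, const std::vector<int> &prefix, int last) {
    return &insertPath(insertPath(&root, prefix), {last})->count;
}
```

In this sketch both lookup forms end at the same node, so `findCount(root, {1, 2, 3})` and `findCount(root, {1, 2}, 3)` return the same pointer. The two-part form is convenient when a caller already holds the history and only the final word varies.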

e) removeCount
 The class provides two overloaded versions of removeCount.
i)
<src>
0 CountT *removeCount(const VocabIndex *words)
1 {
2 return counts.remove(words);
3 }
</src>
 Purpose: removes the given ngram from the trie.

 Details: line 2 calls the trie's remove function to delete the entry for the ngram.

ii)
<src>
0 CountT *removeCount(const VocabIndex *words, VocabIndex word1)
1 {
2 NgramNode *node = counts.findTrie(words);
3 return node ? node->remove(word1) : 0;
4 }
</src>
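 Purpose: removes the given ngram from the trie.

 Details: this function does almost the same as the previous one. The only
 difference is that the ngram is passed in two parts: a prefix and the final
 word. Line 2 calls the trie member function findTrie to locate the subtrie
 for the prefix, and line 3 calls remove on that subtrie to delete the final
 word and return a pointer to its data value. After the call, the ngram formed
 by the prefix and the final word is no longer in the trie.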

f) countSentence
 The class provides four overloaded versions of countSentence.
i)
<src>
0 virtual unsigned countSentence(const VocabString *word)
1 {
2 return countSentence(word, (CountT)1);
3 }
</src>
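 Purpose: stores all ngrams of a string-based sentence in the trie.

 Details: line 2 calls the overloaded string-based countSentence with an
 increment of 1, which stores all ngrams of the sentence in the trie.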

ii)
<src>
0 template <class CountT>
1 unsigned int
2 NgramCounts<CountT>::countSentence(const VocabString *words, CountT factor)
3 {
4 static VocabIndex wids[maxWordsPerLine + 3];
5 unsigned int howmany;
6
7 if (openVocab) {
8 howmany = vocab.addWords(words, wids + 1, maxWordsPerLine + 1);
9 } else {
10 howmany = vocab.getIndices(words, wids + 1, maxWordsPerLine + 1,
11 vocab.unkIndex());
12 }
13
14 /*
15 * Check for buffer overflow
16 */
17 if (howmany == maxWordsPerLine + 1) {
18 return 0;
19 }
20
21 /*
22 * update OOV count
23 */
24 if (!openVocab) {
25 for (unsigned i = 1; i <= howmany; i++) {
26 if (wids[i] == vocab.unkIndex()) {
27 stats.numOOVs ++;
28 }
29 }
30 }
31
32 /*
33 * Insert begin/end sentence tokens if necessary
34 */
35 VocabIndex *start;
36
37 if (addSentStart && wids[1] != vocab.ssIndex()) {
38 wids[0] = vocab.ssIndex();
39 start = wids;
40 } else {
41 start = wids + 1;
42 }
43
44 if (addSentEnd && wids[howmany] != vocab.seIndex()) {
45 wids[howmany + 1] = vocab.seIndex();
46 wids[howmany + 2] = Vocab_None;
47 }
48
49 return countSentence(start, factor);
50 }
</src>
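 Purpose: converts each word of the string-based sentence to its index, then
 calls the index-based countSentence to store all ngrams of the sentence in the trie.

 Details: lines 7-9 handle an open vocabulary: each word of the sentence is
 added to the vocabulary and its index is stored in the wids array.
 Lines 9-12 handle a closed vocabulary: the index of each word is looked up
 and stored in the wids array, with unknown words mapped to vocab.unkIndex().
 Lines 17-19 return 0 without counting if the sentence fills the whole buffer.
 Lines 24-30, for a closed vocabulary, count the out-of-vocabulary words
 over all processed sentences.
 Lines 37-42 prepend the sentence-start marker if it is requested and not already present.
 Lines 44-47 append the sentence-end marker if it is requested and not already present.
 Line 49 calls the index-based countSentence to store all ngrams of the
 current sentence in the trie.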
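
The word-to-index step and the handling of the sentence markers can be sketched as follows. `ToyVocab`, `toIndices`, and the fixed indices for `<unk>`, `<s>` and `</s>` are assumptions made for this example; the sketch also always adds both markers, whereas the original only does so when `addSentStart` and `addSentEnd` are set.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a vocabulary: word string -> integer index.
struct ToyVocab {
    std::map<std::string, int> index{{"<unk>", 0}, {"<s>", 1}, {"</s>", 2}};
    int unk = 0, ss = 1, se = 2;   // assumed indices of the special tokens

    // Open vocabulary: unseen words are added. Closed: they map to <unk>.
    int lookup(const std::string &w, bool openVocab) {
        auto it = index.find(w);
        if (it != index.end()) return it->second;
        if (!openVocab) return unk;
        int id = static_cast<int>(index.size());
        index[w] = id;
        return id;
    }
};

// Map a tokenized sentence to indices, adding <s> and </s> only if missing,
// and counting out-of-vocabulary words when the vocabulary is closed.
std::vector<int> toIndices(ToyVocab &vocab, const std::vector<std::string> &words,
                           bool openVocab, unsigned &numOOVs) {
    std::vector<int> ids;
    for (const std::string &w : words) {
        int id = vocab.lookup(w, openVocab);
        if (!openVocab && id == vocab.unk) ++numOOVs;
        ids.push_back(id);
    }
    if (ids.empty() || ids.front() != vocab.ss) ids.insert(ids.begin(), vocab.ss);
    if (ids.back() != vocab.se) ids.push_back(vocab.se);
    return ids;
}
```

The original reserves slot 0 of `wids` up front so that the start marker can be added without shifting the array; the sketch uses `insert` at the front for brevity.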

iii)
<src>
0 virtual unsigned countSentence(const VocabIndex *word)
1 {
2 return countSentence(word, (CountT)1);
3 }
</src>
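 Purpose: stores all ngrams of an index-based sentence in the trie.

 Details: line 2 calls the overloaded index-based countSentence with an
 increment of 1, which stores all ngrams of the sentence in the trie.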

iv)
<src>
0 template <class CountT>
1 unsigned int
2 NgramCounts<CountT>::countSentence(const VocabIndex *words, CountT factor)
3 {
4 unsigned int start;
5
6 for (start = 0; words[start] != Vocab_None; start++) {
7 incrementCounts(words + start, 1, factor);
8 }
9
10 /*
11 * keep track of word and sentence counts
12 */
13 stats.numWords += start;
14 if (words[0] == vocab.ssIndex()) {
15 stats.numWords --;
16 }
17 if (start > 0 && words[start-1] == vocab.seIndex()) {
18 stats.numWords --;
19 }
20
21 stats.numSentences ++;
22
23 return start;
24 }
</src>
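 Purpose: counts all ngrams of the given sentence and stores the counts in the trie.

 Details: lines 6-8 call the private function incrementCounts once for every
 start position in the sentence, which counts all ngrams of the sentence. Note
 that these three lines take O(N*M) time, where N is the ngram order being
 counted and M is the number of words in the sentence. Where N comes from
 becomes clear in the analysis of incrementCounts below.
 Lines 13-19 accumulate the number of words over all processed sentences,
 leaving out the sentence-start and sentence-end markers;
 line 21 increments the sentence count;
 line 23 returns the number of tokens in the current sentence.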

 The incrementCounts function
 <src>
 0 template <class CountT>
 1 void
 2 NgramCounts<CountT>::incrementCounts(const VocabIndex *words,
 3 unsigned minOrder, CountT factor)
 4 {
 5 NgramNode *node = &counts;
 6
 7 for (int i = 0; i < order; i++) {
 8 VocabIndex wid = words[i];
 9
 10 /*
 11 * check of end-of-sentence
 12 */
 13 if (wid == Vocab_None) {
 14 break;
 15 } else {
 16 node = node->insertTrie(wid);
 17 if (i + 1 >= minOrder) {
 18 node->value() += factor;
 19 }
 20 }
 21 }
 22 }
 </src>
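 Purpose: stores each word of the current ngram in the trie and updates the
 count at every order.

 Details: line 5 first takes the root node of the trie.
 Lines 7-21 loop over the words of the current ngram and store them in the trie.
 E.g. with order equal to 3, line 8 fetches each word in turn and line 13 checks
 that it is a valid word (not Vocab_None). If it is, line 16 inserts the word
 into the trie and obtains the subtrie for it. Line 17 then decides whether the
 count for the current ngram length has to be updated, and if so line 18 adds
 factor to the count stored in the trie.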
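
How the two loops combine can be seen in a small self-contained sketch. It assumes a flat `std::map` keyed by the ngram (`ToyCounts`) in place of the trie, and it takes the sentence as a `std::vector<int>` rather than a `Vocab_None`-terminated array; both are simplifications for illustration.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Ngram = std::vector<int>;
using ToyCounts = std::map<Ngram, long>;   // flat stand-in for the trie

// Count every ngram of length 1..order that starts at position `start`.
void incrementCounts(ToyCounts &counts, const Ngram &sent, std::size_t start,
                     unsigned order, long factor) {
    Ngram prefix;
    for (unsigned i = 0; i < order && start + i < sent.size(); i++) {
        prefix.push_back(sent[start + i]);   // one trie level deeper per word
        counts[prefix] += factor;
    }
}

// Slide the start position over the sentence, as countSentence does.
std::size_t countSentence(ToyCounts &counts, const Ngram &sent, unsigned order) {
    for (std::size_t start = 0; start < sent.size(); start++) {
        incrementCounts(counts, sent, start, order, 1);
    }
    return sent.size();
}
```

With `order` equal to 2 and the sentence `{1, 5, 5, 2}`, the sketch ends up with three unigram entries and three bigram entries, and the unigram `{5}` has count 2. The outer loop runs M times and the inner loop at most N times, which is the N*M bound mentioned above.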

g) read
 The class provides two overloaded versions of read.
i)
<src>
0 Boolean read(File &file)
1 {
2 return read(file, order);
3 }
</src>
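 Purpose: reads every ngram stored in the file and stores it in the trie.

 Details: line 2 calls the read overload that takes an order argument, which
 reads each ngram from file and stores it in the trie.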

ii)
<src>
0 template <class CountT>
1 Boolean
2 NgramCounts<CountT>::read(File &file, unsigned int order, Boolean limitVocab)
3 {
4 VocabString words[maxNgramOrder + 1];
5 VocabIndex wids[maxNgramOrder + 1];
6 CountT count;
7 unsigned int howmany;
8
9 /*
10 * Check for binary format
11 */
12 char *firstLine = file.getline();
13
14 if (!firstLine) {
15 return true;
16 } else {
17 if (strcmp(firstLine, NgramStats_BinaryFormatString) == 0) {
18 File binaryFile(file.name, "rb");
19 return readBinary(binaryFile, order, limitVocab);
20 } else {
21 file.ungetline();
22 }
23 }
24
25 while (howmany = readNgram(file, words, maxNgramOrder + 1, count)) {
26 /*
27 * Skip this entry if the length of the ngram exceeds our
28 * maximum order
29 */
30 if (howmany > order) {
31 continue;
32 }
33 /*
34 * Map words to indices
35 */
36 if (limitVocab) {
37 /*
38 * skip ngram if not in-vocabulary
39 */
40 if (!vocab.checkWords(words, wids, maxNgramOrder)) {
41 continue;
42 }
43 } else if (openVocab) {
44 vocab.addWords(words, wids, maxNgramOrder);
45 } else {
46 vocab.getIndices(words, wids, maxNgramOrder, vocab.unkIndex());
47 }
48
49 /*
50 * Update the count
51 */
52 CountT *cnt = intersect ?
53 counts.find(wids) :
54 counts.insert(wids);
55
56 if (cnt) {
57 *cnt += count;
58 }
59 }
60 /*
61 * XXX: always return true for now, should return false if there was
62 * a format error in the input file.
63 */
64 return true;
65 }
</src>
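 Purpose: reads every ngram in the file and stores it in the trie.

 Details: line 12 first reads the first line of the file.
 Lines 17-19 handle a binary file: readBinary is called to read every ngram
 in the file and store it in the trie. Otherwise lines 20-22 are executed.
 Lines 20-22 handle a text file: the line that was just read is pushed back with ungetline.
 Lines 25-59 loop over the ngrams in the text file and store them in the trie.
 readNgram first reads one ngram and its count, and ngrams longer than order
 are skipped (lines 30-32). Lines 36-47 then convert the string-based ngram
 into an index-based ngram.
 Lines 52-54 either look up the data value of the current ngram (when
 intersect is set) or insert the ngram into the trie.
 Lines 56-58 add the count that was read to the count stored in the trie.

 The readBinary function
 Note: this function is described after writeBinary.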

h) write
 The class provides two overloaded versions of write.
i)
<src>
0 void write(File &file)
1 {
2 write(file, order);
3 }
</src>
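 Purpose: writes the ngrams to a file by calling the overloaded write.

 Details: line 2 calls the overloaded write with the object's order to write the ngrams to the file.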

ii)
<src>
0 template <class CountT>
1 void
2 NgramCounts<CountT>::write(File &file, unsigned int order, Boolean sorted)
3 {
4 static char buffer[maxLineLength];
5 writeNode(counts, file, buffer, buffer, 1, order, sorted);
6 }
</src>
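 Purpose: writes all ngrams of the given order to a file.

 Details: line 4 declares a static character buffer (static storage, not the
 heap) that holds the ngram being written.
 Line 5 calls the recursive function writeNode to write all ngrams of the
 given order to the file. If sorted is set, the ngrams are written in sorted order.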

 The writeNode function
 <src>
 0 template <class CountT>
 1 void
 2 NgramCounts<CountT>::writeNode(
 3 NgramNode &node, /* the trie node we're at */
 4 File &file, /* output file */
 5 char *buffer, /* output buffer */
 6 char *bptr, /* pointer into output buffer */
 7 unsigned int level, /* current trie level */
 8 unsigned int order, /* target trie level */
 9 Boolean sorted) /* produce sorted output */
 10 {
 11 NgramNode *child;
 12 VocabIndex wid;
 13
 14 TrieIter<VocabIndex,CountT> iter(node, sorted ? vocab.compareIndex() : 0);
 15
 16 /*
 17 * Iterate over the child nodes at the current level,
 18 * appending their word strings to the buffer
 19 */
 20 while (!file.error() && (child = iter.next(wid))) {
 21 VocabString word = vocab.getWord(wid);
 22
 23 if (word == 0) {
 24 cerr << "undefined word index " << wid << "\n";
 25 continue;
 26 }
 27
 28 unsigned wordLen = strlen(word);
 29
 30 if (bptr + wordLen + 1 > buffer + maxLineLength) {
 31 *bptr = '0';
 32 cerr << "ngram ["<< buffer << word
 33 << "] exceeds write buffer\n";
 34 continue;
 35 }
 36
 37 strcpy(bptr, word);
 38
 39 /*
 40 * If this is the final level, print out the ngram and the count.
 41 * Otherwise set up another level of recursion.
 42 */
 43 if (order == 0 || level == order) {
 44 fprintf(file, "%s\t%s\n", buffer, countToString(child->value()));
 45 }
 46
 47 if (order == 0 || level < order) {
 48 *(bptr + wordLen) = ' ';
 49 writeNode(*child, file, buffer, bptr + wordLen + 1, level + 1,
 50 order, sorted);
 51 }
 52 }
 53 }
 </src>
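 Purpose: recursively collects all ngrams of length order from the given trie and writes them to the file.

 Details: line 14 constructs a TrieIter over the children of the current node, then
 lines 20-52 are executed.
 Line 20 calls the iterator's next function to visit each child of the current node, keeping a pointer to the child's subtrie;
 line 21 calls the vocab's getWord function to convert the word index into the word itself, then lines 28-35 follow;
 lines 28-35 check whether the word still fits into the remaining buffer space;
 line 37 appends the word after the preceding words of the ngram, then lines 43-45 follow;
 lines 43-45 check whether the current depth equals the length of the ngrams to be written (or order is 0).
 If so, the ngram and its count are written to the file;
 lines 47-51 handle the case where the current length is smaller than the length to be written (or order is 0):
 writeNode is called recursively, with the buffer pointer moved past the current word for the next level.
 Overall, writeNode calls itself until it reaches the trie level that corresponds to the ngram length. After
 processing all nodes on that level it backtracks one level up and processes the remaining nodes there, and so on,
 until all ngrams of length order in the trie have been handled.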
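
The recursion described above can be sketched on a toy trie keyed directly by word strings. `ToyNode` and `insertPath` are hypothetical helpers for this example. The sketch passes the ngram prefix as a `std::string` instead of writing into a shared `char` buffer, which is simpler but copies more than the original does.

```cpp
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical toy node keyed by word string rather than word index.
struct ToyNode {
    long count = 0;
    std::map<std::string, std::unique_ptr<ToyNode>> children;
};

// Create (or find) the node for `words` and return it.
ToyNode &insertPath(ToyNode &root, const std::vector<std::string> &words) {
    ToyNode *node = &root;
    for (const std::string &w : words) {
        std::unique_ptr<ToyNode> &slot = node->children[w];
        if (!slot) slot = std::make_unique<ToyNode>();
        node = slot.get();
    }
    return *node;
}

// Emit "w1 w2 ... wN<TAB>count" for every ngram at depth `order`.
// order == 0 means "all depths", as in the function discussed above.
void writeNode(const ToyNode &node, std::ostream &out, const std::string &prefix,
               unsigned level, unsigned order) {
    for (const auto &entry : node.children) {
        std::string ngram = prefix.empty() ? entry.first : prefix + " " + entry.first;
        if (order == 0 || level == order) {
            out << ngram << '\t' << entry.second->count << '\n';
        }
        if (order == 0 || level < order) {
            writeNode(*entry.second, out, ngram, level + 1, order);
        }
    }
}
```

Calling `writeNode(root, out, "", 1, 2)` on a `std::ostringstream` prints only the bigrams, while passing 0 as the last argument prints every ngram in depth-first order. Because `std::map` keeps its keys ordered, the toy output is always sorted; the original sorts only when `sorted` is set.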

i) writeBinary
<src>
0 template <class CountT>
1 Boolean
2 NgramCounts<CountT>::writeBinary(File &file, unsigned int order)
3 {
4 /*
5 * Magic string
6 */
7 fprintf(file, "%s", NgramStats_BinaryFormatString);
8
9 /*
10 * Maximal count order
11 */
12 fprintf(file, "maxorder %u\n", order > 0 ? order : this->order);
13
14 /*
15 * Vocabulary index
16 */
17 vocab.writeIndexMap(file);
18
19 long long offset = ftello(file);
20
21 // detect if file is not seekable
22 if (offset < 0) {
23 file.position() << strerror(errno) << endl;
24 return false;
25 }
26
27 /*
28 * Count data
29 */
30 return writeBinaryNode(counts, 1, order, file, offset);
31 }
</src>
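 Purpose: writes all ngrams of length order in the trie to a file in binary form.

 Details: line 7 first writes the binary-format identification string;
 line 12 writes the order of the ngrams being written;
 line 17 calls the vocab member function writeIndexMap, which writes the
 words of the vocabulary to the file as
 index word
 pairs.
 Line 19 records the current file offset, and lines 22-25 fail if the file is not seekable.
 Line 30 calls writeBinaryNode to recursively write all ngrams of length order
 in the trie together with their counts. These ngrams are stored as word indices.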

 The writeBinaryNode function
 <src>
 0 template <class CountT>
 1 Boolean
 2 NgramCounts<CountT>::writeBinaryNode(NgramNode &node,
 3 unsigned level, unsigned order,
 4 File &file, long long &offset)
 5 {
 6 unsigned effectiveOrder = order > 0 ? order : this->order;
 7
 8 if (level > effectiveOrder) {
 9 // when reaching the maximal order don't write an offset to save space
 10 return true;
 11 } else {
 12 // guess number of bytes needed for storing subtrie rooted at node
 13 // based on its depth (if we guess wrong we need to redo the whole
 14 // subtrie later)
 15 unsigned subtrieDepth = effectiveOrder - level;
 16 unsigned offsetBytes = subtrieDepth == 0 ? 2 :
 17 subtrieDepth <= 3 ? 4 : 8;
 18
 19 long long startOffset = offset; // remember start offset
 20 long long endOffset = offset;
 21 while (true)
 22 {
 23 // write placeholder value
 24 unsigned nbytes = writeBinaryCount(file, (unsigned long long)0,
 25 offsetBytes);
 26 if (!nbytes) return false;
 27 offset += nbytes;
 28
 29 if (order == 0 || level <= order) {
 30 NgramNode *child;
 31 TrieIter<VocabIndex,CountT> iter(node);
 32 VocabIndex wid;
 33
 34 while (child = iter.next(wid)) {
 35 nbytes = writeBinaryCount(file, wid);
 36 if (!nbytes) return false;
 37 offset += nbytes;
 38
 39 if (order > 0 && level < order) {
 40 nbytes = writeBinaryCount(file, (CountT)0);
 41 } else {
 42 nbytes = writeBinaryCount(file, child->value());
 43 }
 44 if (!nbytes) return false;
 45 offset += nbytes;
 46
 47 if (!writeBinaryNode(*child, level + 1, order, file, offset)) {
 48 return false;
 49 }
 50 }
 51 }
 52
 53 endOffset = offset;
 54
 55 if (fseeko(file, startOffset, SEEK_SET) < 0) {
 56 file.offset() << strerror(errno) << endl;
 57 return false;
 58 }
 59
 60 // don't update offset since we're skipping back in file
 61 nbytes = writeBinaryCount(file,
 62 (unsigned long long)(endOffset-startOffset),
 63 offsetBytes);
 64 if (!nbytes) return false;
 65
 66 // now check that the number of bytes used for offset was actually ok
 67 if (nbytes > offsetBytes) {
 68 file.offset() << "increasing offset bytes from " << offsetBytes
 69 << " to " << nbytes
 70 << " (order " << effectiveOrder << ","
 71 << " level " << level << ")\n";
 72
 73 offsetBytes = nbytes;
 74
 75 if (fseeko(file, startOffset, SEEK_SET) < 0) {
 76 file.offset() << strerror(errno) << endl;
 77 return false;
 78 }
 79 offset = startOffset;
 80 }
 81 else
 82 {
 83 break;
 84 }
 85 }
 86 if (fseeko(file, endOffset, SEEK_SET) < 0) {
 87 file.offset() << strerror(errno) << endl;
 88 return false;
 89 }
 90
 91 return true;
 92 }
 93 }
 </src>
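 Purpose: recursively writes all ngrams of length order in the trie, with their counts, in binary form.

 Details: lines 15-17 first guess how many bytes are needed to store the size
 of the subtrie that is about to be written. Lines 24-25 then write a
 placeholder of that many bytes, so that once the subtrie has been written
 the function can seek back to this position and fill in the real size.
 Lines 29-51 recursively write, in binary form, all ngrams of length order below the current node
 together with their counts. Line 31 builds an iterator over the current node, lines 34-50 loop over
 its children, and the recursive call on line 47 writes every ngram of length order.
 For each ngram, the entries shorter than order get a count of 0 (when an explicit order is given),
 while the entries of length order have their count written directly in binary form.
 Lines 55-58 move the file pointer back to the position that records the size of the current subtrie (the placeholder),
 then lines 61-85 are executed.
 Lines 61-85 first write the size of the subtrie, then check that the number of bytes needed for the size
 does not exceed the number of placeholder bytes written at the start. If it does not, the subtrie has been
 written successfully. Otherwise the function returns to the top of the while loop and writes the subtrie again.
 The function recurses until all ngrams of length order have been written. Every node is written
 together with the size of its subtrie. The purpose of this size is to keep the data aligned
 when the trie is later read back from the file.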

 The layout of the binary file is as follows:
 binary_identify_string
 ngram_order
 id->word map
 id1 w1
 id2 w2
 ... ...
 .(end_identification)
 trie_length
 w11 0
   trie_length
   w21 0
     trie_length
     w31 count
     w32 count
     ... ...
   w22 0
     trie_length
     w31 count
     w32 count
     ... ...
   ... 0 ... ...

 w12 0
   trie_length
   w21 0
     trie_length
     w31 count
     w32 count
     ... ...
   w22 0
     trie_length
     w31 count
     w32 count
     ... ...
   ... 0 ... ...
 ... . ... . ... ...
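
The placeholder-then-back-patch idea behind this layout can be sketched in memory. The sketch makes several simplifying assumptions that are not part of the original: it writes into a `std::vector` of bytes instead of a file, uses fixed 4-byte little-endian fields (so the "placeholder too small, retry" loop is not needed), and requires an explicit `order` greater than 0. `ToyNode`, `putU32` and `appendU32` are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct ToyNode {
    std::uint32_t count = 0;
    std::map<std::uint32_t, std::unique_ptr<ToyNode>> children;
};

using Bytes = std::vector<std::uint8_t>;

// Store v as 4 little-endian bytes at position pos (which must already exist).
void putU32(Bytes &buf, std::size_t pos, std::uint32_t v) {
    for (int i = 0; i < 4; i++) buf[pos + i] = static_cast<std::uint8_t>(v >> (8 * i));
}

// Append v as 4 little-endian bytes.
void appendU32(Bytes &buf, std::uint32_t v) {
    buf.resize(buf.size() + 4);
    putU32(buf, buf.size() - 4, v);
}

// Per node: [subtrie length] then, for each child, [wid][count][child subtrie].
// The length counts from the start of the length field itself.
void writeBinaryNode(const ToyNode &node, unsigned level, unsigned order, Bytes &buf) {
    if (level > order) return;                 // nothing is written below the target order
    std::size_t start = buf.size();
    appendU32(buf, 0);                         // placeholder for the length
    for (const auto &entry : node.children) {
        appendU32(buf, entry.first);
        appendU32(buf, level < order ? 0 : entry.second->count);
        writeBinaryNode(*entry.second, level + 1, order, buf);
    }
    putU32(buf, start, static_cast<std::uint32_t>(buf.size() - start));   // back-patch
}
```

For a trie holding the single bigram (1, 2) with count 4, written with `order` 2, the sketch produces 24 bytes: the outer length 24, word 1 with count 0, the inner length 12, then word 2 with count 4. The original uses variable-width fields, which is why it has to guess the width of the length field and redo the subtrie when the guess is too small.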

 The readBinary function
 <src>
 0 template <class CountT>
 1 Boolean
 2 NgramCounts<CountT>::readBinary(File &file, unsigned order, Boolean limitVocab)
 3 {
 4 char *firstLine = file.getline();
 5
 6 if (!firstLine || strcmp(firstLine, NgramStats_BinaryFormatString) != 0) {
 7 file.position() << "bad binary format\n";
 8 return false;
 9 }
 10
 11 /*
 12 * Maximal count order
 13 */
 14 unsigned maxOrder;
 15 if (fscanf(file, "maxorder %u", &maxOrder) != 1) {
 16 file.position() << "could not read ngram order\n";
 17 return false;
 18 }
 19
 20 /*
 21 * Vocabulary map
 22 */
 23 Array<VocabIndex> vocabMap;
 24
 25 if (!vocab.readIndexMap(file, vocabMap, limitVocab)) {
 26 return false;
 27 }
 28
 29 long long offset = ftello(file);
 30
 31 // detect if file is not seekable
 32 if (offset < 0) {
 33 file.position() << strerror(errno) << endl;
 34 return false;
 35 }
 36
 37 /*
 38 * Count data
 39 */
 40 return readBinaryNode(counts, order, maxOrder, file, offset, limitVocab, vocabMap);
 41 }
 </src>
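 Purpose: reads from a binary file the trie made up of all ngrams of length order.

 Details: lines 6-9 first check that the file is in binary format, and lines 14-18 read the order stored in the file.
 Lines 25-27 then build the mapping between the word indices stored in the file and the words of the current vocabulary.
 Because the vocabulary used for writing may differ from the one used for reading, the vocabMap object
 maps each old id (the id stored in the binary file) to a new id, so that later code can translate
 old ids into new ids that stand for the same words.
 Line 29 obtains the current file offset, i.e. the position where the count data starts;
 line 40 calls readBinaryNode to recursively read the whole binary trie.


 Note: when the file was written with an explicit order, only the highest-order ngrams in the trie read here carry counts, and all their prefix ngrams have count 0.
 By the nature of ngram counts, the counts of all prefix ngrams can be recovered by calling sumCounts.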

 The readBinaryNode function
 <src>
 0 template <class CountT>
 1 Boolean
 2 NgramCounts<CountT>::readBinaryNode(NgramNode &node,
 3 unsigned order, unsigned maxOrder,
 4 File &file, long long &offset,
 5 Boolean limitVocab,
 6 Array<VocabIndex> &vocabMap)
 7 {
 8 if (maxOrder == 0) {
 9 return true;
 10 } else {
 11 unsigned long long endOffset, trieLength;
 12 unsigned nbytes;
 13
 14 nbytes = readBinaryCount(file, trieLength);
 15 if (!nbytes) {
 16 return false;
 17 }
 18 endOffset = offset + trieLength;
 19 offset += nbytes;
 20
 21
 22 if (order == 0) {
 23 if (fseeko(file, endOffset, SEEK_SET) < 0) {
 24 file.offset() << strerror(errno) << endl;
 25 return false;
 26 }
 27 offset = endOffset;
 28 } else {
 29 while (offset < endOffset) {
 30 VocabIndex oldWid;
 31
 32 nbytes = readBinaryCount(file, oldWid);
 33 if (!nbytes) {
 34 return false;
 35 }
 36 offset += nbytes;
 37
 38 if (oldWid >= vocabMap.size()) {
 39 file.offset() << "word index " << oldWid
 40 << " out of range\n";
 41 return false;
 42 }
 43 VocabIndex wid = vocabMap[oldWid];
 44 NgramNode *child = 0;
 45
 46 if (wid != Vocab_None) {
 47 child = intersect ?
 48 node.findTrie(wid) :
 49 node.insertTrie(wid);
 50 }
 51
 52 if (child == 0) {
 53 // skip count value and subtrie
 54 CountT dummy;
 55 nbytes = readBinaryCount(file, dummy);
 56 if (!nbytes) {
 57 return false;
 58 }
 59 offset += nbytes;
 60
 61 if (!readBinaryNode(node, 0, maxOrder-1, file, offset,
 62 limitVocab, vocabMap)) {
 63 return false;
 64 }
 65 } else {
 66 // read count value and subtrie
 67 CountT count;
 68 nbytes = readBinaryCount(file, count);
 69 if (!nbytes) {
 70 return false;
 71 }
 72 child->value() += count;
 73 offset += nbytes;
 74
 75 if (!readBinaryNode(*child, order-1, maxOrder-1,
 76 file, offset, limitVocab, vocabMap)) {
 77 return false;
 78 }
 79 }
 80 }
 81
 82 if (offset != endOffset) {
 83 file.offset() << "data misaligned\n";
 84 return false;
 85 }
 86 }
 87
 88 return true;
 89 }
 90 }
 </src>
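 Purpose: recursively reads all ngrams of length order in the trie, with their counts, in binary form.

 Details: the trie is written to the file in the following layout, so reading has to follow the same layout:
 trie_length
 w11 0
   trie_length
   w21 0
     trie_length
     w31 count
     w32 count
     ... ...
   w22 0
     trie_length
     w31 count
     w32 count
     ... ...
   ... 0 ... ...

 w12 0
   trie_length
   w21 0
     trie_length
     w31 count
     w32 count
     ... ...
   w22 0
     trie_length
     w31 count
     w32 count
     ... ...
   ... 0 ... ...
 ... . ... . ... ...

 Line 14 reads the size of the current subtrie, line 18 records where the subtrie ends, and line 19
 advances the offset past the size field, to where the entries of the subtrie start;
 when order is 0 (lines 22-27), the whole subtrie is skipped by seeking to its end;
 lines 29-80 read the current subtrie recursively until the end position is reached;
 line 32 reads the word index stored in the file for each entry of the subtrie, and line 43 maps it
 to the new word index;
 lines 46-50 look up or insert the new word index in the subtrie that is being built. If no child node
 results, lines 52-64 still have to read everything stored under this entry in the file, and
 discard it. Otherwise lines 65-79 are executed;
 line 68 reads the count of the ngram that ends in the current word index, and line 72 adds it
 to the data value of the current node. Then lines 75-76 follow.
 Lines 75-76 call the function recursively until the complete subtrie has been read.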
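
A reader for a simplified version of this layout shows how the stored subtrie length is used both to skip unwanted entries and to detect misalignment. The sketch assumes fixed 4-byte little-endian fields in an in-memory byte vector, a flat `std::map` in place of the trie, and a `vocabMap` in which -1 marks a word that is to be dropped; none of these choices come from the original code.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Ngram = std::vector<int>;
using ToyCounts = std::map<Ngram, std::uint32_t>;   // flat stand-in for the trie

// Read 4 little-endian bytes at pos and advance pos; false on truncated input.
bool getU32(const Bytes &buf, std::size_t &pos, std::uint32_t &v) {
    if (pos + 4 > buf.size()) return false;
    v = 0;
    for (int i = 0; i < 4; i++) v |= static_cast<std::uint32_t>(buf[pos + i]) << (8 * i);
    pos += 4;
    return true;
}

// Per node: [length] then [wid][count][subtrie] per child; `depth` levels remain.
bool readBinaryNode(const Bytes &buf, std::size_t &pos, unsigned depth,
                    const std::vector<int> &vocabMap, Ngram &prefix, ToyCounts &counts) {
    if (depth == 0) return true;               // nothing is stored below the file's order
    std::size_t start = pos;
    std::uint32_t length = 0;
    if (!getU32(buf, pos, length)) return false;
    std::size_t end = start + length;
    if (end > buf.size()) return false;

    while (pos < end) {
        std::uint32_t oldWid = 0, count = 0;
        if (!getU32(buf, pos, oldWid) || !getU32(buf, pos, count)) return false;
        if (oldWid >= vocabMap.size()) return false;

        if (vocabMap[oldWid] < 0) {            // dropped word: jump over its subtrie
            if (depth > 1) {
                std::size_t subStart = pos;
                std::uint32_t subLength = 0;
                if (!getU32(buf, pos, subLength)) return false;
                pos = subStart + subLength;
            }
            continue;
        }
        prefix.push_back(vocabMap[oldWid]);
        counts[prefix] += count;
        bool ok = readBinaryNode(buf, pos, depth - 1, vocabMap, prefix, counts);
        prefix.pop_back();
        if (!ok) return false;
    }
    return pos == end;                         // otherwise the data is misaligned
}
```

The final comparison plays the role of the "data misaligned" check in the original: after all entries of a subtrie have been consumed, the read position must land exactly on the recorded end.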

j) parseNgram
<src>
0 template <class CountT>
1 unsigned int
2 NgramCounts<CountT>::parseNgram(char *line,
3 VocabString *words,
4 unsigned int max,
5 CountT &count)
6 {
7 unsigned howmany = Vocab::parseWords(line, words, max);
8
9 if (howmany == max) {
10 return 0;
11 }
12
13 /*
14 * Parse the last word as a count
15 */
16 if (!stringToCount(words[howmany - 1], count)) {
17 return 0;
18 }
19
20 howmany --;
21 words[howmany] = 0;
22
23 return howmany;
24 }
</src>
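 Purpose: parses one line into an ngram and its count. The line may hold at most max-1 fields, the count included.

 Details: line 7 calls the static Vocab function parseWords to split the line into words, which are
 stored in words. Lines 9-11 return 0 if the line fills the whole words array;
 lines 16-18 convert the last word, i.e. the ngram count, to a number and store it in count. Then
 lines 20-21 set the slot after the last word of the ngram to null, and line 23 returns the length of the ngram.

 Note: parseNgram is declared static. It uses no per-object state and relies only on the
 static Vocab function parseWords.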
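
The same parsing steps can be sketched with the standard library. This version is an illustration under assumptions: it takes a `std::string` line, fills a `std::vector` instead of a fixed array, uses `long` for the count, and parses the count with `std::strtol` rather than SRILM's `stringToCount`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a count-file line into words; the last field is the count.
// Returns the ngram length, or 0 if the line is malformed or has max or more fields.
std::size_t parseNgram(const std::string &line, std::size_t max,
                       std::vector<std::string> &words, long &count) {
    std::istringstream in(line);
    words.clear();
    for (std::string tok; in >> tok; ) words.push_back(tok);

    if (words.empty() || words.size() >= max) return 0;

    // Parse the last field as the count; reject it if it is not a whole number.
    const std::string &last = words.back();
    char *end = nullptr;
    long value = std::strtol(last.c_str(), &end, 10);
    if (end == last.c_str() || *end != '\0') return 0;

    count = value;
    words.pop_back();                          // drop the count, keep the ngram words
    return words.size();
}
```

For the line `"a b c\t5"` with `max` equal to 10 the sketch returns 3 and sets `count` to 5. A line whose last field is not a number, or one with `max` or more fields, yields 0.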

k) readNgram
<src>
0 template <class CountT>
1 unsigned int
2 NgramCounts<CountT>::readNgram(File &file,
3 VocabString *words,
4 unsigned int max,
5 CountT &count)
6 {
7 char *line;
8
9 /*
10 * Read next ngram count from file, skipping blank lines
11 */
12 line = file.getline();
13 if (line == 0) {
14 return 0;
15 }
16
17 unsigned howmany = parseNgram(line, words, max, count);
18
19 if (howmany == 0) {
20 file.position() << "malformed N-gram count or more than " << max - 1 << " words per line\n";
21 return 0;
22 }
23
24 return howmany;
25 }
</src>
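 Purpose: reads the file line by line, returning on each call one ngram of fewer than max words together with its count.

 Details: line 12 reads the next line, and lines 13-15 return 0 at the end of the file. Line 17 calls
 parseNgram to split the line into the ngram and its count. If parsing fails, lines 19-22 report
 the error and return 0. Otherwise line 24 returns the length of the ngram.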

l) writeNgram
<src>
0 template <class CountT>
1 unsigned int
2 NgramCounts<CountT>::writeNgram(File &file,
3 const VocabString *words,
4 CountT count)
5 {
6 unsigned int i;
7
8 if (words[0]) {
9 fprintf(file, "%s", words[0]);
10 for (i = 1; words[i]; i++) {
11 fprintf(file, " %s", words[i]);
12 }
13 }
14 fprintf(file, "\t%s\n", countToString(count));
15
16 return i;
17 }
</src>
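 Purpose: writes a string ngram and its count to the file.

 Details: lines 8-13 write the words of the ngram to the file:
 line 9 writes the first word, then the for loop writes
 each remaining word, preceded by a space;
 line 14 calls fprintf to write the count of the ngram,
 separated from the ngram itself by a tab.
 Line 16 returns the number of words written.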
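
The output format is easy to reproduce with streams. The following sketch assumes a `std::ostream` and a `std::vector` of words in place of `File` and a null-terminated array, and `long` as the count type.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Write "w1 w2 ... wN<TAB>count\n"; returns the number of words written.
std::size_t writeNgram(std::ostream &out, const std::vector<std::string> &words, long count) {
    for (std::size_t i = 0; i < words.size(); i++) {
        if (i > 0) out << ' ';                 // single space between words
        out << words[i];
    }
    out << '\t' << count << '\n';              // tab separates the ngram from its count
    return words.size();
}
```

Writing `{"a", "b"}` with count 3 to a `std::ostringstream` gives the line `a b`, a tab, `3`, and a newline, which is the same line shape that the text-format reader splits back into words and a count.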

m) sumCounts
 The class defines two overloaded versions of sumCounts.
i)
<src>
0 CountT sumCounts()
1 {
2 return sumCounts(order);
3 }
</src>
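 Purpose: calls sumCounts with the object's own order to recompute the counts
 of all ngrams of order lower than order.

 Details: line 2 passes the order member of the current object to sumCounts,
 which recomputes the counts of all ngrams of order lower than order.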
ii)
<src>
0 template <class CountT>
1 CountT
2 NgramCounts<CountT>::sumCounts(unsigned int order)
3 {
4 return sumNode(counts, 1, order);
5 }
</src>
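 Purpose: recomputes the counts of all ngrams of order lower than order by calling sumNode.

 Details: line 4 calls the private function sumNode, which recursively
 recomputes the counts of all ngrams of order lower than order.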

 The sumNode function
 <src>
 0 template <class CountT>
 1 CountT
 2 NgramCounts<CountT>::sumNode(NgramNode &node, unsigned level, unsigned order)
 3 {
 4 /*
 5 * For leaf nodes, or nodes beyond the maximum level we are summing,
 6 * return their count, leaving it unchanged.
 7 * For nodes closer to the root, replace their counts with the
 8 * the sum of the counts of all the children.
 9 */
 10 if (level > order || node.numEntries() == 0) {
 11 return node.value();
 12 } else {
 13 NgramNode *child;
 14 TrieIter<VocabIndex,CountT> iter(node);
 15 VocabIndex wid;
 16
 17 CountT sum = 0;
 18
 19 while (child = iter.next(wid)) {
 20 sum += sumNode(*child, level + 1, order);
 21 }
 22
 23 node.value() = sum;
 24
 25 return sum;
 26 }
 27 }
 </src>
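 Purpose: recursively computes the counts of all ngrams of order lower than order.

 Details: lines 19-21 loop over the children of the current node and recurse down to the
 highest-order ngrams. Once all children have been processed, the recursion returns to the
 parent, and line 23 stores the sum of the children's counts in the parent node (i.e. the lower-order ngram).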
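
The bottom-up summation can be sketched on a toy trie. `ToyNode` and `insertPath` are hypothetical helpers for the example, and `long` stands in for the count type.

```cpp
#include <map>
#include <memory>
#include <vector>

struct ToyNode {
    long count = 0;
    std::map<int, std::unique_ptr<ToyNode>> children;
};

// Create (or find) the node for `words` and return it.
ToyNode &insertPath(ToyNode &root, const std::vector<int> &words) {
    ToyNode *node = &root;
    for (int w : words) {
        std::unique_ptr<ToyNode> &slot = node->children[w];
        if (!slot) slot = std::make_unique<ToyNode>();
        node = slot.get();
    }
    return *node;
}

// Leaves and nodes beyond `order` keep their count; every other node
// has its count replaced by the sum over its children.
long sumNode(ToyNode &node, unsigned level, unsigned order) {
    if (level > order || node.children.empty()) return node.count;
    long sum = 0;
    for (auto &entry : node.children) {
        sum += sumNode(*entry.second, level + 1, order);
    }
    node.count = sum;
    return sum;
}
```

With bigram counts (1, 2) = 3, (1, 3) = 4 and (5, 6) = 10, calling `sumNode(root, 1, 2)` sets the unigram counts to 7 and 10 and returns 17. This is the step that restores lower-order counts after reading a file in which only the highest-order ngrams carry counts.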

n) pruneCounts
<src>
0 template <class CountT>
1 unsigned
2 NgramCounts<CountT>::pruneCounts(CountT minCount)
3 {
4 unsigned npruned = 0;
5 makeArray(VocabIndex, ngram, order + 1);
6
7 for (unsigned i = 1; i <= order; i++) {
8 CountT *count;
9 NgramCountsIter<CountT> countIter(*this, ngram, i);
10
11 /*
12 * This enumerates all ngrams
13 */
14 while (count = countIter.next()) {
15 if (*count < minCount) {
16 removeCount(ngram);
17 npruned ++;
18 }
19 }
20 }
21 return npruned;
22 }
</src>
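 Purpose: prunes from the ngram trie all ngrams whose count is below minCount.

 Details: lines 7-20 use NgramCountsIter to process all ngrams of orders 1 to order
 and prune those whose count is below minCount.
 Line 9 first constructs an iterator over the current counts, passing the order of the ngrams to iterate over;
 lines 14-19 use it (a TrieIter2 underneath) to fetch every ngram of that order
 and remove from the trie those whose count is below minCount.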
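
Pruning by a count threshold can be sketched on a flat map. The `ToyCounts` map is an assumption made for brevity: in a real trie, removing an ngram also detaches everything stored below it, which a flat map does not model.

```cpp
#include <map>
#include <vector>

using ToyCounts = std::map<std::vector<int>, long>;   // flat stand-in for the trie

// Remove every ngram whose count is below minCount; return how many were removed.
unsigned pruneCounts(ToyCounts &counts, long minCount) {
    unsigned npruned = 0;
    for (auto it = counts.begin(); it != counts.end(); ) {
        if (it->second < minCount) {
            it = counts.erase(it);             // erase returns the next valid iterator
            npruned++;
        } else {
            ++it;
        }
    }
    return npruned;
}
```

The sketch uses the iterator returned by `erase` to keep iterating safely while entries are removed.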

o) setCounts
<src>
0 template <class CountT>
1 void
2 NgramCounts<CountT>::setCounts(CountT value)
3 {
4 makeArray(VocabIndex, ngram, order + 1);
5
6 for (unsigned i = 1; i <= order; i++) {
7 CountT *count;
8 NgramCountsIter<CountT> countIter(*this, ngram, i);
9
10 /*
11 * This enumerates all ngrams
12 */
13 while (count = countIter.next()) {
14 *count = value;
15 }
16 }
17 }
</src>
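 Purpose: sets the count of every ngram in the ngram trie to value.

 Details: lines 6-16 use NgramCountsIter to process all ngrams of orders 1 to order
 and set the count of each one to value.
 Line 8 first constructs an iterator over the current counts, passing the order of the ngrams to iterate over;
 lines 13-15 use it (a TrieIter2 underneath) to fetch every ngram of that order
 and set its count to value.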

p) dump
<src>
0 template <class CountT>
1 void
2 NgramCounts<CountT>::dump()
3 {
4 cerr << "order = " << order << endl;
5 counts.dump();
6 }
</src>
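 Purpose: prints all ngrams in the trie by calling the trie's dump function.

 Details: line 4 prints the order to cerr, and line 5 calls the trie's dump function to print all ngrams in the trie.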

q) clear
<src>
0 void clear()
1 {
2 counts.clear();
3 }
</src>
 Purpose: removes all ngrams.

 Details: line 2 calls the trie's clear function to remove all ngrams from the trie.


NgramCountsIter class
a) Constructor
 NgramCountsIter provides two overloaded constructors.
i)
<src>
0 NgramCountsIter(NgramCounts<CountT> &ngrams, VocabIndex *keys,
1 unsigned order = 1,
3 int (*sort)(VocabIndex, VocabIndex) = 0)
4 : myIter(ngrams.counts, keys, order, sort)
5 {
6 }
</src>
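 Purpose: initializes the iterator so that it iterates over all ngrams of order order.

 Details: line 4 uses the member initializer list to initialize the private member of type TrieIter2.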

ii)
<src>
0 NgramCountsIter(NgramCounts<CountT> &ngrams, const VocabIndex *start,
1 VocabIndex *keys, unsigned order = 1,
2 int (*sort)(VocabIndex, VocabIndex) = 0)
3 : myIter(*(ngrams.counts.insertTrie(start)), keys, order, sort)
4 {
5 }
</src>
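 Purpose: initializes the iterator so that it iterates over all ngrams of order order below the node for start.

 Details: line 3 calls the trie's insertTrie function to obtain the node that corresponds to the prefix start,
 and uses it in the member initializer list to initialize the private member of type TrieIter2.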
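
What the two constructors set up can be illustrated by a function that enumerates every ngram exactly `order` levels below a given node. This is a sketch, not the TrieIter2 interface: it collects all results eagerly into a vector instead of handing them out one by one through `next`, and `ToyNode`, `insertPath`, `Entry` and `collect` are hypothetical names.

```cpp
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct ToyNode {
    long count = 0;
    std::map<int, std::unique_ptr<ToyNode>> children;
};

// Create (or find) the node for `words` and return it.
ToyNode &insertPath(ToyNode &root, const std::vector<int> &words) {
    ToyNode *node = &root;
    for (int w : words) {
        std::unique_ptr<ToyNode> &slot = node->children[w];
        if (!slot) slot = std::make_unique<ToyNode>();
        node = slot.get();
    }
    return *node;
}

using Entry = std::pair<std::vector<int>, long>;   // (keys, count)

// Visit every path of exactly `order` words below `node`. `keys` holds the
// words of the path being built, like the keys array passed to the iterator.
void collect(const ToyNode &node, unsigned order, std::vector<int> &keys,
             std::vector<Entry> &out) {
    if (order == 0) {
        out.emplace_back(keys, node.count);
        return;
    }
    for (const auto &entry : node.children) {
        keys.push_back(entry.first);
        collect(*entry.second, order - 1, keys, out);
        keys.pop_back();
    }
}
```

Starting `collect` at the root with `order` 2 lists all bigrams, which corresponds to the first constructor. Starting it at the node returned by `insertPath(root, {1})` with `order` 1 lists only the words that follow word 1, which corresponds to the constructor that takes a `start` prefix.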
