SRILM Reading Notes 12

kevinfight

Article link: https://blog.csdn.net/kevinfight/article/details/6423914

LM.h LM.cc
Document author: jianzhu
Created: 08.10.03

--------------------------------------
1. Overview
--------------------------------------
 These two files define the most basic language model interface and some
 general-purpose functionality.
 LM class
 This class implements the basic language model interface and some general-purpose functionality.
 It provides the following functions:
 a) constructor
 b) destructor
 c) wordProb
 d) wordProbRecompute
 e) sentenceProb
 f) contextProb
 g) countsProb
 h) pplCountsFile
 i) pplFile
 j) rescoreFile
 k) setState
 l) wordProbSum
 m) generateWord
 n) generateSentence
 o) contextID
 p) contextBOW
 q) addUnkWords
 r) isNonWord
 s) read
 t) write
 u) writeBinary
 v) running
 w) followIter
 x) memStats
 y) removeNoise

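 TruncatedContext class
 This class temporarily sets the element at a given position of a VocabIndex
 array to Vocab_None, and restores the original value when it is destroyed.
 It provides the following functions:
 a) constructor
 b) destructor
 c) cast operator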

--------------------------------------
2. Function-by-function explanation
--------------------------------------
LM class
a) Constructor
<src>
0 LM::LM(Vocab &vocab)
1 : vocab(vocab), noiseVocab(vocab)
2 {
3 _running = false;
4 reverseWords = false;
5 addSentStart = true;
6 addSentEnd = true;
7 stateTag = defaultStateTag;
8 writeInBinary = false;
9 }
</src>
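 Purpose: constructor; initializes the member variables.

 Details: line 1 initializes vocab and noiseVocab through the member initializer list;
 the function body initializes the remaining member variables.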

c) wordProb function
 The class declares two wordProb functions, which overload each other.
i)
<src>
0 virtual LogP wordProb(VocabIndex word, const VocabIndex *context) = 0;
</src>
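 Purpose: declares the VocabIndex-based wordProb function.

 Details: this is a pure virtual function, so it only declares an interface; the
 definition must be supplied by the subclasses derived from LM.
 Note: wordProb computes the probability of word given context, i.e. P(word|context),
 returned as a log probability (LogP) rather than as a plain probability.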
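
 To make the log-probability convention concrete, here is a small sketch. It does
 not use SRILM's own LogP type or helpers; ToyLogP, toyLogPtoProb, toyProbToLogP
 and toyJoint are hypothetical names, and base-10 logarithms are assumed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins, not SRILM's actual declarations.
typedef double ToyLogP;   // log10 of a probability (assumed base)
typedef double ToyProb;   // plain probability in [0, 1]

// Convert a base-10 log probability back to a probability.
inline ToyProb toyLogPtoProb(ToyLogP lp) { return std::pow(10.0, lp); }

// Convert a probability to its base-10 logarithm.
inline ToyLogP toyProbToLogP(ToyProb p) { return std::log10(p); }

// Multiplying probabilities corresponds to adding their logs,
// which is why the functions below accumulate with "+=".
inline ToyLogP toyJoint(ToyLogP a, ToyLogP b) { return a + b; }
```

 Working in the log domain keeps long products of small probabilities from
 underflowing, at the cost of a conversion whenever plain probabilities must be summed.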

ii)
<src>
0 LogP LM::wordProb(VocabString word, const VocabString *context)
1 {
2 unsigned int len = vocab.length(context);
3 makeArray(VocabIndex, cids, len + 1);
4
5 if (addUnkWords()) {
6 vocab.addWords(context, cids, len + 1);
7 } else {
8 vocab.getIndices(context, cids, len + 1, vocab.unkIndex());
9 }
10
11 LogP prob = wordProb(vocab.getIndex(word, vocab.unkIndex()), cids);
12
13 return prob;
14 }
</src>
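 Purpose: computes the log probability of the ngram formed by context and word.

 Details: lines 2-9 convert context to the corresponding VocabIndex values.
 Line 11 then calls vocab's getIndex to convert word to its VocabIndex (falling back
 to the <unk> index if the word is unknown) and calls the VocabIndex-based wordProb
 to compute P(word|context).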
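
 The two branches for mapping strings to indices can be sketched with a toy
 vocabulary. ToyVocab, addWord and mapContext are hypothetical names chosen for
 this illustration; only the idea (add unknown words, or fall back to <unk>)
 comes from the listing.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

typedef unsigned ToyIndex;

// Hypothetical miniature vocabulary; index 0 is reserved for <unk>.
struct ToyVocab {
    std::unordered_map<std::string, ToyIndex> ids;
    ToyVocab() { ids["<unk>"] = 0; }

    ToyIndex unkIndex() const { return 0; }

    // Look a word up; unknown words map to the fallback index.
    ToyIndex getIndex(const std::string &w, ToyIndex fallback) const {
        auto it = ids.find(w);
        return it == ids.end() ? fallback : it->second;
    }

    // Look a word up, adding it to the vocabulary if it is new.
    ToyIndex addWord(const std::string &w) {
        auto it = ids.find(w);
        if (it != ids.end()) return it->second;
        ToyIndex id = static_cast<ToyIndex>(ids.size());
        ids[w] = id;
        return id;
    }
};

// Map a context to indices, either growing the vocabulary
// (the addUnkWords() == true branch) or falling back to <unk>.
std::vector<ToyIndex> mapContext(ToyVocab &v,
                                 const std::vector<std::string> &context,
                                 bool addUnknown) {
    std::vector<ToyIndex> out;
    for (const std::string &w : context) {
        out.push_back(addUnknown ? v.addWord(w)
                                 : v.getIndex(w, v.unkIndex()));
    }
    return out;
}
```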

d) wordProbRecompute function
<src>
0 LogP LM::wordProbRecompute(VocabIndex word, const VocabIndex *context)
1 {
2 return wordProb(word, context);
3 }
</src>
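 Purpose: computes the log probability of the ngram formed by context and word.

 Details: this function does the same job as wordProb. The difference is that it is
 meant to be called when context has not changed since the last call to wordProb,
 which lets subclasses reuse context-dependent work and speed up the computation of
 the conditional probability. The base class version shown here simply forwards to
 wordProb.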

e) sentenceProb function
 The class defines two sentenceProb functions, which overload each other.
i)
<src>
0 LogP LM::sentenceProb(const VocabIndex *sentence, TextStats &stats)
1 {
2 unsigned int len = vocab.length(sentence);
3 makeArray(VocabIndex, reversed, len + 2 + 1);
4 int i;
5
6 /*
7 * Indicate to lm methods that we're in sequential processing
8 * mode.
9 */
10 Boolean wasRunning = running(true);
11
12 /*
13 * Contexts are represented most-recent-word-first.
14 * Also, we have to prepend the sentence-begin token,
15 * and append the sentence-end token.
16 */
17 len = prepareSentence(sentence, reversed, len);
18
19 LogP totalProb = 0.0;
20 unsigned totalOOVs = 0;
21 unsigned totalZeros = 0;
22
23 for (i = len; i >= 0; i--) {
24 LogP probSum;
25
26 if (debug(DEBUG_PRINT_WORD_PROBS)) {
27 dout() << "\tp( " << vocab.getWord(reversed[i]) << " | "
28 << (reversed[i+1] != Vocab_None ?
29 vocab.getWord(reversed[i+1]) : "")
30 << (i < len ? " ..." : " ") << ") \t= " ;
31
32 if (debug(DEBUG_PRINT_PROB_SUMS)) {
33 /*
34 * XXX: because wordProb can change the state of the LM
35 * we need to compute wordProbSum first.
36 */
37 probSum = wordProbSum(&reversed[i + 1]);
38 }
39 }
40
41 /*
42 * <s> 中国人民解放军 </s>
43 * </s> 解放军人民中国 <s>
44 */
45 LogP prob = wordProb(reversed[i], &reversed[i + 1]);
46
47 if (debug(DEBUG_PRINT_WORD_PROBS)) {
48 dout() << " " << LogPtoProb(prob) << " [ " << prob << " ]";
49 if (debug(DEBUG_PRINT_PROB_SUMS)) {
50 dout() << " / " << probSum;
51 if (fabs(probSum - 1.0) > 0.0001) {
52 cerr << "\nwarning: word probs for this context sum to "
53 << probSum << " != 1 : "
54 << (vocab.use(), &reversed[i + 1]) << endl;
55 }
56 }
57 dout() << endl;
58 }
59 /*
60 * If the probability returned is zero but the
61 * word in question is <unk> we assume this is closed-vocab
62 * model and count it as an OOV. (This allows open-vocab
63 * models to return regular probabilities for <unk>.)
64 * If this happens and the word is not <unk> then we are
65 * dealing with a broken language model that return
66 * zero probabilities for known words, and we count them
67 * as a "zeroProb".
68 */
69 if (prob == LogP_Zero) {
70 if (reversed[i] == vocab.unkIndex()) {
71 totalOOVs ++;
72 } else {
73 totalZeros ++;
74 }
75 } else {
76 totalProb += prob;
77 }
78 }
79
80 running(wasRunning);
81
82 /*
83 * Update stats with this sentence
84 */
85 if (reversed[0] == vocab.seIndex()) {
86 stats.numSentences ++;
87 stats.numWords += len;
88 } else {
89 stats.numWords += len + 1;
90 }
91 stats.numOOVs += totalOOVs;
92 stats.zeroProbs += totalZeros;
93 stats.prob += totalProb;
94
95 return totalProb;
96 }
</src>
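 Purpose: computes the sum of the log probabilities of all ngrams in the sentence, and
 records the sentence's statistics in stats.

 Details: line 17 stores the words of the sentence in reversed, in reverse order, and
 sets len to the number of real words in reversed (that is, not counting <s> and </s>).
 Lines 23-78 then loop over all ngrams in the sentence, adding each log probability to
 totalProb and collecting the sentence's statistics along the way.
 Example:
 <s> 中国 人民 解放军 </s>
 After line 17 this becomes:
 </s> 解放军 人民 中国 <s>
 and len is set to 3.
 The for loop in lines 23-78 therefore processes the ngrams in this order:
 P(中国|<s>)
 P(人民|中国 <s>)
 P(解放军|人民 中国 <s>)
 P(</s>|解放军 人民 中国 <s>)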
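
 The visiting order in the example can be reproduced with a few lines of code. The
 helper below is a hypothetical illustration (scoringOrder is not an SRILM function);
 it only lists the events and does not score them.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Given a reversed buffer such as {"</s>", "w3", "w2", "w1", "<s>"} and
// len = 3 (the position of the first real word in reading order), list the
// events in the order the loop visits them. Each event is written as
// "word | most-recent-first history".
std::vector<std::string> scoringOrder(const std::vector<std::string> &reversed,
                                      int len) {
    std::vector<std::string> events;
    for (int i = len; i >= 0; i--) {
        std::string e = reversed[i] + " |";
        // Everything after position i is the history of reversed[i].
        for (std::size_t k = static_cast<std::size_t>(i) + 1;
             k < reversed.size(); k++) {
            e += " " + reversed[k];
        }
        events.push_back(e);
    }
    return events;
}
```

 Because the history of position i is simply the tail of the array starting at i + 1,
 no copying is needed to hand a context to the model; a pointer into the buffer suffices.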

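 Line 37 calls wordProbSum to obtain the sum of the probabilities of all vocabulary
 words given the current history, and lines 51-55 check whether that sum is 1; if it
 is not, a warning is printed, since the model's probabilities are not normalized.
 Line 45 calls wordProb to compute the log probability of each ngram, and lines 69-77
 update the counters according to the result: a zero probability counts as an OOV if
 the word is <unk> and as a zeroprob otherwise; any other value is added to totalProb.
 Lines 85-90 update the counts depending on whether reversed[0] is the VocabIndex of
 </s>. If it is, the sentence counter is incremented and len is added to the word
 counter; otherwise len+1 is added to the word counter.
 Lines 91-93 add the sentence's OOV and zeroprob counts to stats, along with the
 sentence's log probability.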

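 Note: read together with the overloaded operator<< of the TextStats class, the value
 accumulated in stats.prob is the total log probability of the text, i.e. the
 perplexity before the N-th root (per-token normalization) is taken. For comparing
 models on the same text it ranks them in the same way.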
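
 As a worked illustration of the step from the accumulated log probability to a
 perplexity, here is a sketch. ToyStats and toyPerplexity are hypothetical names, and
 both the base-10 logarithm and the choice of denominator (scored words plus one
 end-of-sentence event per sentence) are assumptions, not taken from the listing above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the fields sentenceProb() accumulates.
struct ToyStats {
    double prob = 0.0;        // total log10 probability (assumed base)
    unsigned numSentences = 0;
    unsigned numWords = 0;
    unsigned numOOVs = 0;
    unsigned zeroProbs = 0;
};

// Perplexity as 10^(-logprob / N), where N counts the scored tokens:
// words plus one </s> per sentence, minus OOVs and zero-prob words.
// Returns a negative value when there is nothing to normalize by.
double toyPerplexity(const ToyStats &s) {
    double denom = double(s.numWords) + s.numSentences
                 - s.numOOVs - s.zeroProbs;
    if (denom <= 0) return -1.0;
    return std::pow(10.0, -s.prob / denom);
}
```

 Dividing by N in the exponent is the "N-th root": 10^(-logprob / N) equals the N-th
 root of 1 / P(text).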

 prepareSentence function
 <src>
 0 unsigned
 1 LM::prepareSentence(const VocabIndex *sentence, VocabIndex *reversed,
 2 unsigned len)
 3 {
 4 unsigned i, j = 0;
 5
 6 /*
 7 * Add </s> token if not already there.
 8 */
 9 if (len == 0 || sentence[reverseWords ? 0 : len - 1] != vocab.seIndex()) {
 10 if (addSentEnd) {
 11 reversed[j++] = vocab.seIndex();
 12 }
 13 }
 14
 15 for (i = 1; i <= len; i++) {
 16 VocabIndex word = sentence[reverseWords ? i - 1 : len - i];
 17
 18 if (word == vocab.pauseIndex() || noiseVocab.getWord(word)) {
 19 continue;
 20 }
 21
 22 reversed[j++] = word;
 23 }
 24
 25 /*
 26 * Add <s> token if not already there
 27 */
 28 if (len == 0 || sentence[reverseWords ? len - 1 : 0] != vocab.ssIndex()) {
 29 if (addSentStart) {
 30 reversed[j++] = vocab.ssIndex();
 31 } else {
 32 reversed[j++] = Vocab_None;
 33 }
 34 }
 35 reversed[j] = Vocab_None;
 36
 37 return j - 2;
 38 }
 </src>
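 Purpose: if the sentence is stored in forward order, stores it in reversed in reverse
 order; otherwise copies it to reversed unchanged. Returns the number of real words in
 the sentence.

 Details: lines 9-13 store </s> in position 0 of reversed. The reverseWords flag in the
 conditional expression on line 9 describes the order of the words in sentence: if
 sentence is already reversed, position 0 is checked, otherwise the last position.
 Lines 15-23 likewise copy the contents of sentence into reversed in reverse order,
 filtering out pause and noise words.
 Lines 28-34 append the VocabIndex of the sentence-start tag <s> to reversed (or
 Vocab_None if addSentStart is false), and line 35 sets the following element to
 Vocab_None to mark the end.
 Line 37 returns j - 2, which is the number of real words in the sentence, i.e. excluding
 <s> and </s> as well as pause and noise words.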
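
 A simplified sketch of the same buffer layout, for a forward-ordered sentence only.
 The index constants and toyPrepare are hypothetical; the sketch assumes both boundary
 tags are wanted and treats only the pause token as filterable.

```cpp
#include <cassert>
#include <vector>

typedef int ToyIndex;
const ToyIndex TOY_NONE = -1;   // plays the role of Vocab_None
const ToyIndex TOY_SS = 0;      // <s>
const ToyIndex TOY_SE = 1;      // </s>
const ToyIndex TOY_PAUSE = 2;   // pause token, filtered out

// Build the most-recent-word-first buffer for a forward-ordered sentence,
// adding </s> and <s> when they are missing.
// Returns j - 2, as the original does.
unsigned toyPrepare(const std::vector<ToyIndex> &sentence,
                    std::vector<ToyIndex> &reversed) {
    reversed.clear();
    if (sentence.empty() || sentence.back() != TOY_SE) {
        reversed.push_back(TOY_SE);
    }
    for (std::size_t i = sentence.size(); i > 0; i--) {
        if (sentence[i - 1] == TOY_PAUSE) continue;   // drop pauses
        reversed.push_back(sentence[i - 1]);
    }
    if (sentence.empty() || sentence.front() != TOY_SS) {
        reversed.push_back(TOY_SS);
    }
    unsigned j = static_cast<unsigned>(reversed.size());
    reversed.push_back(TOY_NONE);                      // terminator
    return j - 2;
}
```

 With both tags present, the returned value is also the position in reversed of the
 first real word of the sentence, which is why sentenceProb starts its loop there.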

ii)
<src>
0 LogP LM::sentenceProb(const VocabString *sentence, TextStats &stats)
1 {
2 unsigned int len = vocab.length(sentence);
3 makeArray(VocabIndex, wids, len + 1);
4
5 if (addUnkWords()) {
6 vocab.addWords(sentence, wids, len + 1);
7 } else {
8 vocab.getIndices(sentence, wids, len + 1, vocab.unkIndex());
9 }
10
11 LogP prob = sentenceProb(wids, stats);
12
13 return prob;
14 }
</src>
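 Purpose: string-based sentence probability function.

 Details: lines 2-9 convert the string-based sentence into an index-based sentence;
 line 11 then calls the overloaded index-based sentenceProb to compute the sentence
 probability.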

f) contextProb function
<src>
0 LogP LM::contextProb(const VocabIndex *context, unsigned clength)
1 {
2 unsigned useLength = Vocab::length(context);
3 LogP jointProb = LogP_One;
4
5 if (clength < useLength) {
6 useLength = clength;
7 }
8
9 /*
10 * If the context is empty there is nothing left to do: return LogP_One
11 */
12 if (useLength > 0) {
13 /*
14 * Turn off debugging for contextProb computation
15 */
16 Boolean wasRunning = running(false);
17
18 TruncatedContext usedContext(context, useLength);
19
20 /*
21 * Accumulate conditional probs for all words in used context
22 */
23 for (unsigned i = useLength; i > 0; i--) {
24 VocabIndex word = usedContext[i - 1];
25 /*
26 * <s> 中国人民解放军 </s>
27 * </s> 解放军人民中国 <s>
28 */
29 /*
30 * If we're computing the marginal probability of the unigram
31 * <s> context we have to look up </s> instead since the former
32 * has prob = 0.
33 */
34 if (i == useLength && word == vocab.ssIndex()) {
35 word = vocab.seIndex();
36 }
37
38 LogP wprob = wordProb(word, &usedContext[i]);
39
40 /*
41 * If word is a non-event it has probability zero in the model,
42 * so the best we can do is to skip it.
43 * Note that above mapping turns <s> into a non-non-event, so
44 * it will be included.
45 */
46 if (wprob != LogP_Zero || !vocab.isNonEvent(word)) {
47 jointProb += wprob;
48 }
49 }
50 running(wasRunning);
51 }
52
53 return jointProb;
54 }
</src>
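 Purpose: computes the joint log probability of context, i.e. the sum of the log
 probabilities of all ngrams formed by its words.

 Details: lines 12-51 compute that sum when the context length is non-zero;
 otherwise execution goes straight to line 53, which returns LogP_One (the initial
 value of jointProb).
 Line 18 calls the TruncatedContext constructor to build the usedContext object, which
 sets the element at position useLength to Vocab_None.
 Lines 23-49 loop over all ngrams in usedContext and add their log probabilities to jointProb.
 Line 53 returns the accumulated result.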
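
 The accumulation is the chain rule applied to a most-recent-first array. The sketch
 below is a hypothetical illustration: toyContextProb and ToyWordProb are made-up
 names, the model is passed in as a callback, and the special handling of <s> and of
 non-event words in the original is left out.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

typedef int ToyIndex;
const ToyIndex TOY_NONE = -1;

// Model callback: log10 P(word | context), with context most-recent-first
// and terminated by TOY_NONE. Purely illustrative.
typedef std::function<double(ToyIndex, const ToyIndex *)> ToyWordProb;

// Joint log probability of the first `length` words of a reversed context.
// The oldest word is scored first, each word using the elements after it
// in the array as its history.
double toyContextProb(const ToyWordProb &wordProb,
                      std::vector<ToyIndex> context, unsigned length) {
    if (length > context.size()) {
        length = static_cast<unsigned>(context.size());
    }
    context.resize(length);
    context.push_back(TOY_NONE);          // truncate, on a private copy
    double joint = 0.0;                   // log10(1)
    for (unsigned i = length; i > 0; i--) {
        joint += wordProb(context[i - 1], &context[i]);
    }
    return joint;
}
```

 The sketch truncates a private copy of the context; the original avoids the copy by
 temporarily overwriting one element through TruncatedContext.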

g) countsProb function
<src>
0 LogP LM::countsProb(NgramStats &counts, TextStats &stats, unsigned countorder,
1 Boolean entropy)
2 {
3 makeArray(VocabIndex, ngram, countorder + 1);
4
5 LogP totalProb = 0.0;
6
7 /*
8 * Indicate to lm methods that we're in sequential processing
9 * mode.
10 */
11 Boolean wasRunning = running(true);
12
13 /*
14 * Enumerate all counts up the order indicated
15 */
16 for (unsigned i = 1; i <= countorder; i++ ) {
17 // use sorted enumeration in debug mode only
18 NgramsIter ngramIter(counts, ngram, i,
19 !debug(DEBUG_PRINT_WORD_PROBS) ? 0 :
20 vocab.compareIndex());
21
22 NgramCount *count;
23
24 /*
25 * This enumerates all ngrams of the given order
26 */
27 while (count = ngramIter.next()) {
28 TextStats ngramStats;
29
29 /*
30 * Skip zero counts since they don't contribute anything to
31 * the probability
32 */
33 if (*count == 0) {
34 continue;
35 }
36
37 /*
38 * reverse ngram for lookup
39 */
40 Vocab::reverse(ngram);
41
42 /*
43 * The rest of this loop is patterned after LM::sentenceProb()
44 */
45
46 if (debug(DEBUG_PRINT_WORD_PROBS)) {
47 dout() << "\tp( " << vocab.getWord(ngram[0]) << " | "
48 << (vocab.use(), &ngram[1])
49 << " ) \t= " ;
50 }
51 LogP prob = wordProb(ngram[0], &ngram[1]);
52
53 LogP jointProb = !entropy ? LogP_One :
54 contextProb(ngram, countorder);
55 Prob weight = *count * LogPtoProb(jointProb);
56
57 if (debug(DEBUG_PRINT_WORD_PROBS)) {
58 dout() << " " << LogPtoProb(prob) << " [ " << prob;
59
60 /*
61 * Include ngram count if not unity, so we can compute the
62 * aggregate log probability from the output
63 */
64 if (weight != 1.0) {
65 dout() << " *" << weight;
66 }
67 dout() << " ]";
68
69 if (debug(DEBUG_PRINT_PROB_SUMS)) {
70 Prob probSum = wordProbSum(&ngram[1]);
71 dout() << " / " << probSum;
72 if (fabs(probSum - 1.0) > 0.0001) {
73 cerr << "\nwarning: word probs for this context sum to "
74 << probSum << " != 1 : "
75 << (vocab.use(), &ngram[1]) << endl;
76 }
77 }
78 dout() << endl;
79 }
80
81 /*
82 * ngrams ending in </s> are counted as sentences, all others
83 * as words. This keeps the output compatible with that of
84 * LM::pplFile().
85 */
86 if (ngram[0] == vocab.seIndex()) {
87 ngramStats.numSentences = *count;
88 } else {
89 ngramStats.numWords = *count;
90 }
91
92 /*
93 * If the probability returned is zero but the
94 * word in question is <unk> we assume this is closed-vocab
95 * model and count it as an OOV. (This allows open-vocab
96 * models to return regular probabilities for <unk>.)
97 * If this happens and the word is not <unk> then we are
98 * dealing with a broken language model that return
99 * zero probabilities for known words, and we count them
100 * as a "zeroProb".
101 */
102 if (prob == LogP_Zero) {
103 if (ngram[0] == vocab.unkIndex()) {
104 ngramStats.numOOVs = *count;
105 } else {
106 ngramStats.zeroProbs = *count;
107 }
108 } else {
109 totalProb +=
110 (ngramStats.prob = weight * prob);
111 }
112
113 stats.increment(ngramStats);
114
115 Vocab::reverse(ngram);
116 }
117 }
118
119 running(wasRunning);
120
121 /*
122 * If computing entropy set total number of events to 1 so that
123 * ppl computation reflects entropy.
124 */
125 if (entropy) {
126 stats.numSentences = 0;
127 stats.numWords = 1;
128 }
129
130 return totalProb;
131 }
</src>
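 Purpose: computes the probabilities of all ngrams in counts whose order is at most
 countorder, records their statistics in stats, and returns their total weighted
 log probability.

 Details: lines 16-117 loop over the ngrams in counts of order 1 through countorder.
 Lines 18-20 call the NgramsIter constructor to build an iterator over the ngrams of
 one particular order in counts.
 Lines 27-116 process each ngram the iterator yields:
 line 51 computes the conditional log probability of the ngram;
 lines 53-55 compute the ngram's weight: its count, multiplied by the joint
 probability returned by contextProb when entropy is true;
 lines 86-90 count sentences (ngrams ending in </s>) or words;
 lines 102-111 record OOV counts, zeroprob counts, and the weighted log probability of
 ngrams with non-zero probability;
 line 113 adds the current ngram's statistics to the overall stats;
 line 130 returns the total weighted log probability over all ngrams.

 Note: when entropy is false, each ngram contributes *count * logProb to totalProb.
 When entropy is true, each ngram contributes *count * contextProb * logProb.
 The latter resembles the entropy formula, in which the terms Pi*logPi (i=0,...,n)
 are summed and the sign of the result is flipped.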
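
 The two weighting modes can be compared on a handful of made-up events. ToyEvent and
 toyCountsProb are hypothetical names, and the numbers used with them are invented for
 illustration; base-10 logs are assumed.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One n-gram event: its count, log10 P(word | context), and the
// log10 joint probability of the whole n-gram under the model.
struct ToyEvent {
    double count;
    double logProb;
    double logJoint;
};

// Sum of weight * logProb, where the weight is the raw count, or
// count * P(ngram) when entropy is requested.
double toyCountsProb(const std::vector<ToyEvent> &events, bool entropy) {
    double total = 0.0;
    for (const ToyEvent &e : events) {
        double weight =
            e.count * (entropy ? std::pow(10.0, e.logJoint) : 1.0);
        total += weight * e.logProb;
    }
    return total;
}
```

 In count mode the result is the log probability of the counted data; in entropy mode
 each term has the shape P * log P, so the negated sum behaves like an entropy estimate.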

h) pplCountsFile function
<src>
0 unsigned int
1 LM::pplCountsFile(File &file, unsigned order, TextStats &stats,
2 const char *escapeString, Boolean entropy)
3 {
4 char *line;
5 unsigned escapeLen = escapeString ? strlen(escapeString) : 0;
6 unsigned stateTagLen = stateTag ? strlen(stateTag) : 0;
7
8 VocabString words[maxNgramOrder + 1];
9 makeArray(VocabIndex, wids, order + 1);
10 NgramStats *counts = 0;
11 TextStats sentenceStats;
12
13 while (line = file.getline()) {
14
15 if (escapeString && strncmp(line, escapeString, escapeLen) == 0) {
16 /*
17 * Output sentence-level statistics before each escaped line
18 */
19 if (counts) {
20 countsProb(*counts, sentenceStats, order, entropy);
21
22 if (debug(DEBUG_PRINT_SENT_PROBS)) {
23 dout() << sentenceStats << endl;
24 }
25
26 stats.increment(sentenceStats);
27 sentenceStats.reset();
28
29 delete counts;
30 counts = 0;
31 }
32 dout() << line;
33 continue;
34 }
35
36 /*
37 * check for directives to change the global LM state
38 */
39 if (stateTag && strncmp(line, stateTag, stateTagLen) == 0) {
40 /*
41 * pass the state info the lm to let it do whatever
42 * it wants with it
43 */
44 setState(&line[stateTagLen]);
45 continue;
46 }
47
48 if (!counts) {
49 counts = new NgramStats(vocab, order);
50 assert(counts != 0);
51 }
52
53 NgramCount count;
54 unsigned howmany =
55 counts->parseNgram(line, words, maxNgramOrder + 1, count);
56
57 /*
58 * Skip this entry if the length of the ngram exceeds our
59 * maximum order
60 */
61 if (howmany == 0) {
62 file.position() << "malformed N-gram count or more than "
63 << maxNgramOrder << " words per line\n";
64 continue;
65 } else if (howmany > order) {
66 continue;
67 }
68
69 /*
70 * Map words to indices
71 */
72 vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
73
74 /*
75 * Update the counts
76 */
77 *counts->insertCount(wids) += count;
78 }
79
80 /*
81 * Output and update final sentence-level statistics
82 */
83 if (counts) {
84 countsProb(*counts, sentenceStats, order, entropy);
85
86 if (debug(DEBUG_PRINT_SENT_PROBS)) {
87 dout() << sentenceStats << endl;
88 }
89
90 stats.increment(sentenceStats);
91 delete counts;
92 }
93
94 return stats.numWords;
95 }
</src>
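 Purpose: loads all ngram counts from the file and computes their statistics.

 Details: lines 13-78 loop over all ngrams in the file and record each one in counts;
 whenever an escape line is encountered, countsProb is called on the counts built so
 far to compute their statistics, and counts is then discarded.
 Lines 54-55 call counts' member function parseNgram to parse the ngram and its count
 from one line of the file.
 Line 72 calls vocab's member function getIndices to convert the string-based ngram
 into an index-based ngram.
 Line 77 records the ngram and its count in counts.
 Lines 83-92 call countsProb on the ngrams remaining in counts and add the resulting
 statistics to stats.
 Line 94 returns stats.numWords, the total number of words counted.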

i) pplFile function
<src>
0 unsigned int
1 LM::pplFile(File &file, TextStats &stats, const char *escapeString)
2 {
3 char *line;
4 unsigned escapeLen = escapeString ? strlen(escapeString) : 0;
5 unsigned stateTagLen = stateTag ? strlen(stateTag) : 0;
6 VocabString sentence[maxWordsPerLine + 1];
7 unsigned totalWords = 0;
8 unsigned sentNo = 0;
9 TextStats documentStats;
10 Boolean printDocumentStats = false;
11
12 while (line = file.getline()) {
13
14 if (escapeString && strncmp(line, escapeString, escapeLen) == 0) {
15 if (sentNo > 0 && debuglevel() == DEBUG_PRINT_DOC_PROBS) {
16 dout() << documentStats << endl;
17 documentStats.reset();
18 printDocumentStats = true;
19 }
20 dout() << line;
21 continue;
22 }
23
24 /*
25 * check for directives to change the global LM state
26 */
27 if (stateTag && strncmp(line, stateTag, stateTagLen) == 0) {
28 /*
29 * pass the state info the lm to let it do whatever
30 * it wants with it
31 */
32 setState(&line[stateTagLen]);
33 continue;
34 }
35
36 sentNo ++;
37
38 unsigned int numWords =
39 vocab.parseWords(line, sentence, maxWordsPerLine + 1);
40
41 if (numWords == maxWordsPerLine + 1) {
42 file.position() << "too many words per sentence\n";
43 } else {
44 TextStats sentenceStats;
45
46 if (debug(DEBUG_PRINT_SENT_PROBS)) {
47 dout() << sentence << endl;
48 }
49 LogP prob = sentenceProb(sentence, sentenceStats);
50
51 totalWords += numWords;
52
53 if (debug(DEBUG_PRINT_SENT_PROBS)) {
54 dout() << sentenceStats << endl;
55 }
56
57 stats.increment(sentenceStats);
58 documentStats.increment(sentenceStats);
59 }
60 }
61
62 if (printDocumentStats) {
63 dout() << documentStats << endl;
64 }
65
66 return totalWords;
67 }
</src>
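 Purpose: reads all lines of the file and accumulates statistics for the whole file.

 Details: lines 12-60 loop over the lines of the file and compute the probability
 statistics of each one.
 Lines 38-39 call vocab's parseWords to split the line into words.
 Line 49 calls sentenceProb to compute the statistics of the line and records them in
 sentenceStats.
 Line 57 adds the statistics in sentenceStats to stats.
 Line 66 returns the total number of words parsed from the file.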
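
 The line loop with escape handling can be sketched on a standard stream. This is a
 hypothetical stand-in (toyCountWords is not an SRILM function): it only counts words
 and skips escaped lines, whereas the original also echoes escaped lines, handles
 state tags and scores each sentence.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Count whitespace-separated words on the lines of `in`, skipping any
// line that starts with `escape` (when `escape` is non-empty).
unsigned toyCountWords(std::istream &in, const std::string &escape) {
    unsigned total = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (!escape.empty() &&
            line.compare(0, escape.size(), escape) == 0) {
            continue;                      // escaped line: not scored
        }
        std::istringstream words(line);
        std::string w;
        while (words >> w) total++;        // one increment per word
    }
    return total;
}
```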

j) rescoreFile function
<src>
0 unsigned
1 LM::rescoreFile(File &file, double lmScale, double wtScale,
2 LM &oldLM, double oldLmScale, double oldWtScale,
3 const char *escapeString)
4 {
5 char *line;
6 unsigned escapeLen = escapeString ? strlen(escapeString) : 0;
7 unsigned stateTagLen = stateTag ? strlen(stateTag) : 0;
8 unsigned sentNo = 0;
9
10 while (line = file.getline()) {
11
12 if (escapeString && strncmp(line, escapeString, escapeLen) == 0) {
13 fputs(line, stdout);
14 continue;
15 }
16
17 /*
18 * check for directives to change the global LM state
19 */
20 if (stateTag && strncmp(line, stateTag, stateTagLen) == 0) {
21 /*
22 * pass the state info the lm to let it do whatever
23 * it wants with it
24 */
25 setState(&line[stateTagLen]);
26 continue;
27 }
28
29 sentNo ++;
30
31 /*
32 * parse an n-best hyp from this line
33 */
34 NBestHyp hyp;
35
36 if (!hyp.parse(line, vocab)) {
37 file.position() << "bad n-best hyp format\n";
38 } else {
39 hyp.decipherFix(oldLM, oldLmScale, oldWtScale);
40 hyp.rescore(*this, lmScale, wtScale);
41 // hyp.write((File)stdout, vocab);
42 /*
43 * Instead of writing only the total score back to output,
44 * keep all three scores: acoustic, LM, word transition penalty.
45 * Also, write this in straight log probs, not bytelog.
46 */
47 fprintf(stdout, "%g %g %d",
48 hyp.acousticScore, hyp.languageScore, hyp.numWords);
49 for (unsigned i = 0; hyp.words[i] != Vocab_None; i++) {
50 fprintf(stdout, " %s", vocab.getWord(hyp.words[i]));
51 }
52 fprintf(stdout, "\n");
53 }
54 }
55 return sentNo;
56 }
</src>
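 Purpose: reads every line of a text file of N-best hypotheses, recomputes the scores of
 each hypothesis, and prints them.

 Details: lines 10-54 loop over all N-best-format lines in the file, recompute their
 scores with the new language model, and print the results.
 Line 36 calls hyp's member function parse to extract the information in the N-best line.
 Line 39 calls hyp's member function decipherFix to obtain the acousticScore value.
 Line 40 calls hyp's member function rescore to obtain the new languageScore and the new totalScore.
 Lines 47-52 print the recomputed result for each line in N-best 3.0 format.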
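
 How a rescored total might be formed from the three kept scores is sketched below.
 ToyHyp and toyRescore are hypothetical, and the linear combination of the scores is
 an assumption made for illustration; the body of NBestHyp::rescore is not shown in
 the listing.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the three scores kept per hypothesis.
struct ToyHyp {
    double acousticScore;
    double languageScore;
    unsigned numWords;
};

// Replace the LM score and combine the scores linearly:
// acoustic + lmScale * LM + wtScale * number of words (assumed form).
double toyRescore(ToyHyp &hyp, double newLmScore,
                  double lmScale, double wtScale) {
    hyp.languageScore = newLmScore;
    return hyp.acousticScore + lmScale * hyp.languageScore
         + wtScale * hyp.numWords;
}
```

 Keeping the three components separate, as the output on lines 47-52 does, lets a later
 stage try other scale factors without rerunning the language model.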

l) wordProbSum function
<src>
0 Prob
1 LM::wordProbSum(const VocabIndex *context)
2 {
3 double total = 0.0;
4 VocabIter iter(vocab);
5 VocabIndex wid;
6 Boolean first = true;
7 /*
8 * prob summing interrupts sequential processing mode
9 */
10 Boolean wasRunning = running(false);
11
12 while (iter.next(wid)) {
13 if (!isNonWord(wid)) {
14 total += LogPtoProb(first ?
15 wordProb(wid, context) :
16 wordProbRecompute(wid, context));
17 first = false;
18 }
19 }
20
21 running(wasRunning);
22 return total;
23 }
</src>
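 Purpose: computes the sum of the conditional probabilities of all vocabulary words given
 the current context.

 Details: line 4 builds an iterator over vocab;
 lines 12-19 visit every word in vocab, compute its log probability given context,
 convert it to a plain probability, and add it to total (wordProb is used for the
 first word and wordProbRecompute for the rest, since the context stays the same);
 line 22 returns the resulting sum.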
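
 A sketch of the sum and of the consistency check that sentenceProb applies to it.
 toyProbSum and toySumsToOne are hypothetical names; the tolerance 0.0001 is the one
 used in the listings, and base-10 logs are assumed.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum the probabilities behind a list of log10 values, as wordProbSum()
// does over the vocabulary for one context.
double toyProbSum(const std::vector<double> &logProbs) {
    double total = 0.0;
    for (double lp : logProbs) total += std::pow(10.0, lp);
    return total;
}

// The check applied to that sum: within 0.0001 of 1.
bool toySumsToOne(double probSum) {
    return std::fabs(probSum - 1.0) <= 0.0001;
}
```

 The logs must be converted back before summing; adding log values would multiply the
 probabilities instead.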

m) generateWord function
<src>
0 VocabIndex
1 LM::generateWord(const VocabIndex *context)
2 {
3 /*
4 * Algorithm: generate random number between 0 and 1, and partition
5 * the interval 0..1 into pieces corresponding to the word probs.
6 * Chose the word whose interval contains the random value.
7 */
8 Prob rval = drand48();
9 Prob totalProb = 0.0;
10
11 VocabIter iter(vocab);
12 VocabIndex wid;
13
14 while (totalProb <= rval && iter.next(wid)) {
15 if (!isNonWord(wid)) {
16 totalProb += LogPtoProb(wordProb(wid, context));
17 }
18 }
19 return wid;
20 }
</src>
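 Purpose: randomly generates a word given context.

 Details: line 8 draws a random number between 0 and 1;
 line 11 builds an iterator over vocab;
 lines 14-18 iterate over the words in vocab, computing each word's probability and
 adding it to totalProb until totalProb exceeds rval; the word wid reached at that
 point is the generated word;
 line 19 returns that word.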
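
 The interval-partition idea can be tried in isolation. In this hypothetical sketch
 (toyGenerateWord is a made-up name) the random value is passed in as a parameter
 instead of being drawn with drand48(), so that the choice is reproducible.

```cpp
#include <cassert>
#include <vector>

// Pick the index whose probability interval contains rval, where the
// intervals partition [0, 1) in vocabulary order.
// Mirrors the original loop: stop once the running total exceeds rval.
// Returns -1 for an empty distribution.
int toyGenerateWord(const std::vector<double> &probs, double rval) {
    double total = 0.0;
    int chosen = -1;
    for (std::size_t i = 0; i < probs.size() && total <= rval; i++) {
        total += probs[i];
        chosen = static_cast<int>(i);
    }
    return chosen;
}
```

 A word with probability p owns an interval of width p, so it is selected with
 probability p when rval is uniform on [0, 1).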

n) generateSentence function
The class defines two generateSentence functions, which overload each other.
i)
<src>
0 VocabIndex *
1 LM::generateSentence(unsigned maxWords, VocabIndex *sentence)
2 {
3 static unsigned defaultResultSize = 0;
4 static VocabIndex *defaultResult = 0;
5
6 /*
7 * If no result buffer is supplied use our own.
8 */
9 if (sentence == 0) {
10 if (maxWords + 1 > defaultResultSize) {
11 defaultResultSize = maxWords + 1;
12 if (defaultResult) {
13 delete defaultResult;
14 }
15 defaultResult = new VocabIndex[defaultResultSize];
16 assert(defaultResult != 0);
17 }
18 sentence = defaultResult;
19 }
20
21 /*
22 * Since we need to add the begin/end sentences tokens, and
23 * partial contexts are represented in reverse we use a second
24 * buffer for partial sentences.
25 */
26 makeArray(VocabIndex, genBuffer, maxWords + 3);
27
28 unsigned last = maxWords + 2;
29 genBuffer[last] = Vocab_None;
30 genBuffer[--last] = vocab.ssIndex();
31
32 /*
33 * Generate words one-by-one until hitting an end-of-sentence.
34 */
35 while (last > 0 && genBuffer[last] != vocab.seIndex()) {
36 last --;
37 genBuffer[last] = generateWord(&genBuffer[last + 1]);
38 }
39
40 /*
41 * Copy reversed sentence to output buffer
42 */
43 unsigned i, j;
44 for (i = 0, j = maxWords; j > last; i++, j--) {
45 sentence[i] = genBuffer[j];
46 }
47 sentence[i] = Vocab_None;
48
49 return sentence;
50 }
</src>
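 Purpose: randomly generates a sentence.

 Details: lines 35-38 call generateWord to generate one word at a time and store it in
 genBuffer, until the end-of-sentence tag is generated or genBuffer is full.
 Because the sentence in genBuffer is stored in reverse order, lines 44-46 copy it into
 sentence in forward order.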
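
 The back-to-front buffer can be exercised with a deterministic word source. This is
 a hypothetical sketch: toyGenerateSentence and the index constants are made up, and a
 callback replaces generateWord so the result does not depend on random numbers.

```cpp
#include <cassert>
#include <functional>
#include <vector>

typedef int ToyIndex;
const ToyIndex TOY_NONE = -1;
const ToyIndex TOY_SS = 0;   // <s>
const ToyIndex TOY_SE = 1;   // </s>

// Fill a buffer from the back so that &buf[last + 1] is always the
// most-recent-first history, then copy the words out in reading order.
// <s> and the final buf[last] (normally </s>) are not copied.
std::vector<ToyIndex> toyGenerateSentence(
        unsigned maxWords,
        const std::function<ToyIndex(const ToyIndex *)> &nextWord) {
    std::vector<ToyIndex> buf(maxWords + 3);
    unsigned last = maxWords + 2;
    buf[last] = TOY_NONE;
    buf[--last] = TOY_SS;
    while (last > 0 && buf[last] != TOY_SE) {
        last--;
        buf[last] = nextWord(&buf[last + 1]);
    }
    std::vector<ToyIndex> sentence;
    for (unsigned j = maxWords; j > last; j--) {
        sentence.push_back(buf[j]);
    }
    return sentence;
}
```

 Writing from the high end of the buffer means each new word is placed directly in
 front of its history, so the context pointer never needs to be rebuilt.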

ii)
<src>
0 VocabString *
1 LM::generateSentence(unsigned maxWords, VocabString *sentence)
2 {
3 static unsigned defaultResultSize = 0;
4 static VocabString *defaultResult = 0;
5
6 /*
7 * If no result buffer is supplied use our own.
8 */
9 if (sentence == 0) {
10 if (maxWords + 1 > defaultResultSize) {
11 defaultResultSize = maxWords + 1;
12 if (defaultResult) {
13 delete defaultResult;
14 }
15 defaultResult = new VocabString[defaultResultSize];
16 assert(defaultResult != 0);
17 }
18 sentence = defaultResult;
19 }
20
21 /*
22 * Generate words indices, then map them to strings
23 */
24 vocab.getWords(generateSentence(maxWords, (VocabIndex *)0),
25 sentence, maxWords + 1);
26
27 return sentence;
28 }
</src>
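 Purpose: randomly generates a sentence.

 Details: line 24 calls the index-based generateSentence to generate a random sentence,
 then uses vocab's member function getWords to convert it into a sentence of word
 strings, which is stored in sentence.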

w) followIter function
<src>
0 virtual _LM_FollowIter *followIter(const VocabIndex *context)
1 {
2 return new _LM_FollowIter(*this, context);
3 }
</src>
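 Purpose: creates an iterator for the current LM object and the given context, and
 returns a pointer to it.

 Details: line 2 uses new to call the _LM_FollowIter constructor, creating an
 iterator for the current LM object, and returns its address.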

y) removeNoise function
<src>
0 VocabIndex *
1 LM::removeNoise(VocabIndex *words)
2 {
3 unsigned from, to;
4
5 for (from = 0, to = 0; words[from] != Vocab_None; from ++) {
6 if (words[from] != vocab.pauseIndex() &&
7 !noiseVocab.getWord(words[from]))
8 {
9 words[to++] = words[from];
10 }
11 }
12 words[to] = Vocab_None;
13
14 return words;
15 }
</src>
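 Purpose: removes noise words from words.

 Details: lines 5-11 loop over words, keeping only the entries that are neither the
 pause token nor noise words.
 Line 12 sets the position after the last kept word to Vocab_None.
 Line 14 returns the address of words, now free of noise words.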
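
 The in-place compaction uses a read cursor and a write cursor. In this hypothetical
 sketch (toyRemoveNoise is a made-up name) a std::set of indices stands in for the
 pause token and noiseVocab.

```cpp
#include <cassert>
#include <set>

typedef int ToyIndex;
const ToyIndex TOY_NONE = -1;

// Compact a TOY_NONE-terminated array in place, dropping every index
// found in `noise`. The write cursor never overtakes the read cursor,
// so no second buffer is needed.
ToyIndex *toyRemoveNoise(ToyIndex *words, const std::set<ToyIndex> &noise) {
    unsigned from, to;
    for (from = 0, to = 0; words[from] != TOY_NONE; from++) {
        if (noise.count(words[from]) == 0) {
            words[to++] = words[from];
        }
    }
    words[to] = TOY_NONE;   // new terminator after the last kept word
    return words;
}
```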

TruncatedContext class
a) Constructor
<src>
0 TruncatedContext(const VocabIndex *context, unsigned len)
1 : myContext(context), contextLength(len)
2 {
3 saved = myContext[contextLength];
4 ((VocabIndex *)myContext)[contextLength] = Vocab_None;
5 }
</src>
 Purpose: sets the element at position len of context to Vocab_None.

 Details: line 3 saves the current content of the element at position len;
 line 4 sets that element to Vocab_None.

b) Destructor
<src>
0 ~TruncatedContext()
1 {
2 ((VocabIndex *)myContext)[contextLength] = saved;
3 }
</src>
 Purpose: restores the element at position len of context to its previous value.

 Details: line 2 sets the element at position len of context to saved.

c) cast operator
<src>
0 inline operator const VocabIndex *()
1 {
2 return myContext;
3 }
</src>
 Purpose: returns the array held by the TruncatedContext object.

 Details: line 2 returns the address of the first element of the context array.
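
 Taken together, the three members form a scope guard. The sketch below is a
 hypothetical re-creation (ToyTruncated, toyLength); unlike the original it takes a
 non-const pointer rather than casting const away, and it disables copying so the
 saved element cannot be restored twice.

```cpp
#include <cassert>

typedef int ToyIndex;
const ToyIndex TOY_NONE = -1;

// Scope guard: terminate an array early on construction and restore
// the overwritten element on destruction.
class ToyTruncated {
public:
    ToyTruncated(ToyIndex *context, unsigned len)
        : myContext(context), contextLength(len), saved(context[len]) {
        myContext[contextLength] = TOY_NONE;
    }
    ~ToyTruncated() { myContext[contextLength] = saved; }

    // Lets the guard be passed wherever a context pointer is expected.
    operator const ToyIndex *() const { return myContext; }

private:
    ToyTruncated(const ToyTruncated &);             // non-copyable
    ToyTruncated &operator=(const ToyTruncated &);

    ToyIndex *myContext;
    unsigned contextLength;
    ToyIndex saved;
};

// Count elements up to the terminator.
unsigned toyLength(const ToyIndex *p) {
    unsigned n = 0;
    while (p[n] != TOY_NONE) n++;
    return n;
}
```

 Because the restore happens in the destructor, the array is repaired on every exit
 path from the enclosing scope. The array is modified while the guard is alive, so it
 must not be read concurrently from elsewhere during that time.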
