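Language models come in two kinds: process language models and sequence language models.
Sequence language model: the probabilities of all possible strings, taken over all lengths together, sum to 1.
Process language model: for any one fixed length, the probabilities of all possible strings of that length sum to 1.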
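The two definitions can also be written as formulas. The notation is an assumption made here for illustration, not something taken from LingPipe: \Sigma is the character set, \Sigma^n is the set of strings of length n, and P(s) is the probability the model assigns to string s.

```latex
% Sequence language model: one distribution over strings of every length
\sum_{n \ge 0} \; \sum_{s \in \Sigma^{n}} P_{\mathrm{seq}}(s) = 1

% Process language model: a separate distribution for each fixed length n
\sum_{s \in \Sigma^{n}} P_{\mathrm{proc}}(s) = 1 \quad \text{for every } n \ge 0
```

One consequence is that summing a process model over several lengths gives more than 1, so its estimates are only comparable between strings of the same length.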
Below is a demo based on LingPipe's NGramProcessLM, showing that the probabilities of all strings of length 1 sum to 1.
package chapter6;

import com.aliasi.lm.NGramProcessLM;

public class ProcessLmDemo {
    public static void main(String[] args) {
        int ngram = 1;  // unigram model: each character is estimated without context
        String textTrain = "ababababab";
        double probSum = 0;
        NGramProcessLM lm = new NGramProcessLM(ngram);
        lm.handle(textTrain);  // train the model on the sample text
        // Enumerate every one-character string, i.e. every 16-bit char value.
        for (int i = 0x0000; i <= 0xFFFF; i++) {
            String test = String.valueOf((char) i);
            double log2Prob = lm.log2Estimate(test);  // log base 2 of P(test)
            double prob = Math.pow(2, log2Prob);      // convert back to a probability
            probSum += prob;
        }
        System.out.printf("Sum of prob=%.3f\n", probSum);
    }
}
Output:
Sum of prob=1.000
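
The demo above only checks length 1, because enumerating longer strings over all 65,536 char values quickly becomes impractical. The following is a minimal sketch that does not use LingPipe. It assumes a hypothetical toy model over the two-letter alphabet {a, b} with unsmoothed probabilities of 0.5 each, plus an assumed stop probability of 0.2 for the sequence variant. The class name ToyLmSums and all of its methods are made up for illustration. It sums the process model over every string of each length, and keeps a running total of the sequence model across lengths.

```java
// Hypothetical toy model for illustration only; this is not LingPipe code.
final class ToyLmSums {
    static final char[] ALPHABET = {'a', 'b'};
    // Assumed chance of ending the string at each position (sequence model only).
    static final double STOP_PROB = 0.2;

    // Assumed unsmoothed unigram estimate: 'a' and 'b' are equally likely.
    static double charProb(char c) {
        return (c == 'a' || c == 'b') ? 0.5 : 0.0;
    }

    // Process model: product of the character probabilities, no end-of-string term.
    static double processProb(String s) {
        double p = 1.0;
        for (char c : s.toCharArray()) {
            p *= charProb(c);
        }
        return p;
    }

    // Sequence model: each character also pays (1 - STOP_PROB) to continue,
    // and the string pays STOP_PROB once to end.
    static double sequenceProb(String s) {
        return processProb(s) * Math.pow(1.0 - STOP_PROB, s.length()) * STOP_PROB;
    }

    // Sum of process-model probabilities over all strings of exactly `length` chars.
    static double processSum(int length) {
        return sum("", length, false);
    }

    // Sum of sequence-model probabilities over all strings of exactly `length` chars.
    static double sequenceSum(int length) {
        return sum("", length, true);
    }

    // Recursively enumerates every string that extends `prefix` by `remaining` chars.
    private static double sum(String prefix, int remaining, boolean sequence) {
        if (remaining == 0) {
            return sequence ? sequenceProb(prefix) : processProb(prefix);
        }
        double total = 0.0;
        for (char c : ALPHABET) {
            total += sum(prefix + c, remaining - 1, sequence);
        }
        return total;
    }

    public static void main(String[] args) {
        double sequenceRunningTotal = 0.0;
        for (int n = 0; n <= 12; n++) {
            sequenceRunningTotal += sequenceSum(n);
            System.out.printf("length=%d processSum=%.6f sequenceRunningTotal=%.6f%n",
                    n, processSum(n), sequenceRunningTotal);
        }
    }
}
```

Under these assumptions the process sum is 1 at every length, while the sequence model's running total after length n is 1 - 0.8^(n+1), which only approaches 1 as longer strings are included.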