This time I want to translate English to Chinese, so I choose Chinese for the language model. Go to the directory /home/tianliang/mosesdecoder/srilm/bin/i686-gcc4; we will use "ngram-count" to build a 5-gram model. The process is as follows:
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4$ mkdir test
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4$ cd test
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4/test$ ../ngram-count -text clean.chn -lm chinese.gz -order 5 -unk -wbdiscount -interpolate
This means: we build the Chinese file "clean.chn" into a 5-gram language model, chinese.gz, using Witten-Bell discounting with interpolated estimates; the -unk flag maps out-of-vocabulary words to an <unk> token. chinese.gz is a gzipped ARPA-format file, and its header looks like:
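Here is a minimal sketch of the header of such an ARPA-format model, viewed with zcat; every count except the 3-gram line (which is the figure from my run) is an illustrative placeholder, not real output:
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4/test$ zcat chinese.gz | head
\data\
ngram 1=327
ngram 2=481
ngram 3=594
ngram 4=612
ngram 5=618
\1-grams: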
It shows the number of n-grams of each order in the model. For example, there are 594 distinct 3-grams in our corpus.
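As a quick sanity check, you can score a held-out file with SRILM's ngram tool and look at the reported perplexity; test.chn here is a hypothetical held-out Chinese file prepared the same way as clean.chn:
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4/test$ ../ngram -lm chinese.gz -order 5 -unk -ppl test.chn
Lower perplexity on held-out text generally means the language model fits the domain better.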
The Moses toolkit does a great job of wrapping calls to mkcls and GIZA++ inside a training script and outputting the phrase and reordering tables needed for decoding. The script that does this is called train-factored-phrase-model.perl. In my setup, train-factored-phrase-model.perl is located at