1. Build and install sunpinyin
# sudo scons install
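If the source tree has not been compiled yet, run scons before installing. A minimal sketch, assuming the sunpinyin sources are checked out into a directory named sunpinyin:
# cd sunpinyin
# scons                # compile
# sudo scons install   # install the libraries, tools and data system-wide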
2. Create a working directory slmdata
# mkdir slmdata
3. Download the dictionary file dict.utf8-20120823.tar.bz2 from http://sourceforge.net/projects/open-gram/files/ and extract it into slmdata; this produces the file dict.utf8.
# tar -jxvf dict.utf8-20120823.tar.bz2
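Alternatively, tar's -C option extracts straight into slmdata, and a quick head confirms the dictionary decoded correctly (paths assumed relative to the current directory):
# tar -jxvf dict.utf8-20120823.tar.bz2 -C slmdata
# head -n 3 slmdata/dict.utf8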
4. Prepare the corpus file. Here we download Sogou's text-classification corpus from http://www.sogou.com/labs/dl/c.html; after extraction, rename it to corpus.utf8 and put it in the slmdata directory.
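Note that the Sogou corpus is distributed in GBK encoding, while the toolchain expects UTF-8. A sketch of the conversion, assuming the archive extracts into a directory named Reduced containing .txt files (adjust the names to what you actually get):
# find Reduced -name '*.txt' -exec cat {} + \
    | iconv -f GBK -t UTF-8 -c > slmdata/corpus.utf8   # -c skips undecodable bytes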
5. Copy the build file into slmdata (see the sketch after this list), so that the directory contains:
Makefile
corpus.utf8
dict.utf8
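The Makefile is the one from the open-gram source tree, whose targets appear in the build log below; a sketch, with the checkout path left as a placeholder:
# cp /path/to/open-gram/Makefile slmdata/
# ls slmdata
Makefile  corpus.utf8  dict.utf8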
6. Start the build
# sudo make
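The default target chains three stages, which can also be invoked one at a time (the target names are visible in the log below):
# make mmseg_trigram   # segment the corpus by dictionary-based maximum matching, build a first trigram model
# make slm_trigram     # re-segment with that trigram model and rebuild (a bootstrap pass)
# make lexicon3        # generate the pinyin lookup table pydict_sc.bin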
The build log follows:
make mmseg_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
mmseg -f bin -s 10 -a 9 -d dict.utf8 corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 750 words, 42 ambiguious. Done!
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...703 ngrams.
Counting Nr...
Appending psuedo tail node for each level...
Cuting according freq...
Cut level 3 with threshold 2...
Cut level 2 with threshold 2...
Cut level 1 with threshold 0...
Discounting...
Initializing level 3's Absolution discount method: parameter c=0.920721
Initializing level 2's Absolution discount method: parameter c=0.868512
Initializing level 1's Absolution discount method: Using given parameter c=0.000500
Discounting level 3 ...
Discounting level 2 ...
Discounting level 1 ...
Giving psuedo root level 0 a distribution...
Calculating Back-Off Weight...
Processing level 0 (2 items)...
Processing level 1 (352 items)...
Processing level 2 (16 items)...
Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
Level 3 (4 items), no need to cut as your command!
Level 2 (15 items), no need to cut as your command!
Level 1 (351 items), no need to cut as your command!
Updating back-off weight
Level 0...
Level 1...
Level 2...
Writing target language model lm_sc.3gm...done!
slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...52 float values ==> 52 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
tslminfo -p -v -l dict.utf8 lm_sc.t3g > lm_sc.t3g.arpa
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make slm_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
slmseg -f bin -s 10 \
-d dict.utf8 -m lm_sc.t3g corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 794 words, 0 ambiguious. Done!
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make slm_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...749 ngrams.
Counting Nr...
Appending psuedo tail node for each level...
Cuting according freq...
Cut level 3 with threshold 2...
Cut level 2 with threshold 2...
Cut level 1 with threshold 0...
Discounting...
Initializing level 3's Absolution discount method: parameter c=0.927677
Initializing level 2's Absolution discount method: parameter c=0.880512
Initializing level 1's Absolution discount method: Using given parameter c=0.000500
Discounting level 3 ...
Discounting level 2 ...
Discounting level 1 ...
Giving psuedo root level 0 a distribution...
Calculating Back-Off Weight...
Processing level 0 (2 items)...
Processing level 1 (387 items)...
Processing level 2 (17 items)...
Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
Level 3 (4 items), no need to cut as your command!
Level 2 (16 items), no need to cut as your command!
Level 1 (386 items), no need to cut as your command!
Updating back-off weight
Level 0...
Level 1...
Level 2...
Writing target language model lm_sc.3gm...done!
slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...56 float values ==> 56 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
slmseg -f bin -s 10 \
-d dict.utf8 -m lm_sc.t3g corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 794 words, 0 ambiguious. Done!
tslminfo -p -v -l dict.utf8 lm_sc.t3g > lm_sc.t3g.arpa
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make lexicon3
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...749 ngrams.
Counting Nr...
Appending psuedo tail node for each level...
Cuting according freq...
Cut level 3 with threshold 2...
Cut level 2 with threshold 2...
Cut level 1 with threshold 0...
Discounting...
Initializing level 3's Absolution discount method: parameter c=0.927677
Initializing level 2's Absolution discount method: parameter c=0.880512
Initializing level 1's Absolution discount method: Using given parameter c=0.000500
Discounting level 3 ...
Discounting level 2 ...
Discounting level 1 ...
Giving psuedo root level 0 a distribution...
Calculating Back-Off Weight...
Processing level 0 (2 items)...
Processing level 1 (387 items)...
Processing level 2 (17 items)...
Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
Level 3 (4 items), no need to cut as your command!
Level 2 (16 items), no need to cut as your command!
Level 1 (386 items), no need to cut as your command!
Updating back-off weight
Level 0...
Level 1...
Level 2...
Writing target language model lm_sc.3gm...done!
slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...56 float values ==> 56 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
genpyt -i dict.utf8 -s lm_sc.t3g \
-l pydict_sc.log -o pydict_sc.bin
Opening language model...done!
Adding pinyin and corresponding words...
Warning! unrecognized syllable hng
57445 primitive nodes
79221 total nodes
Writing out...done!
Printing the lexicon out to log_file...done!
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
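To summarize, each make target drives the same pipeline of sunpinyin tools, as seen in the log:
mmseg/slmseg  corpus.utf8   -> lm_sc.ids      # segment the corpus into word IDs
ids2ngram     lm_sc.ids     -> lm_sc.id.3gm   # collect and merge trigram counts
slmbuild      lm_sc.id.3gm  -> lm_sc.3gm.raw  # build the back-off trigram model
slmprune      lm_sc.3gm.raw -> lm_sc.3gm      # prune items by entropy distance
slmthread     lm_sc.3gm     -> lm_sc.t3g      # compress and thread the model
genpyt        dict.utf8 + lm_sc.t3g -> pydict_sc.bin   # pinyin lexicon for sunpinyin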
7. Resulting files:
$ ls -1
Makefile
corpus.utf8
dict.utf8
lm_sc.3gm
lm_sc.3gm.raw
lm_sc.id.3gm
lm_sc.ids
lm_sc.t3g
lm_sc.t3g.arpa
pydict_sc.bin
pydict_sc.log
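The two files sunpinyin actually consumes are lm_sc.t3g and pydict_sc.bin. To try the new model, back up and overwrite the copies installed in step 1; the data directory varies by distribution, so treat the path below as an assumption:
# cp /usr/share/sunpinyin/data/lm_sc.t3g{,.bak}        # back up first (path is an assumption)
# cp /usr/share/sunpinyin/data/pydict_sc.bin{,.bak}
# cp lm_sc.t3g pydict_sc.bin /usr/share/sunpinyin/data/
Then restart the input-method framework (e.g. ibus) to pick up the new model.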