An Example of SLM Training-Data Generation with Sunpinyin

This article shows how to generate training data for sunpinyin's SLM (Statistical Language Model): building and installing sunpinyin, creating a data directory, downloading the lexicon and corpus, running the build, and inspecting the generated results.

1. Build and install sunpinyin

# sudo scons install

2. Create a working directory slmdata

# mkdir slmdata

3. Download the lexicon archive dict.utf8-20120823.tar.bz2 from http://sourceforge.net/projects/open-gram/files/ and extract it into slmdata; this yields the file dict.utf8

# tar -jxvf dict.utf8-20120823.tar.bz2

4. Prepare the corpus file. Here we download Sogou's text-classification corpus from http://www.sogou.com/labs/dl/c.html, extract it, rename the text to corpus.utf8, and place it in the slmdata directory
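A practical note: the Sogou classification corpus is commonly distributed in GB2312/GBK encoding, while the sunpinyin tools expect UTF-8 (hence the name corpus.utf8). If your download is not already UTF-8, a minimal re-encoding sketch (the function name and the gb18030 source encoding are my assumptions, not part of the original steps):

```python
def to_utf8(src, dst, src_enc="gb18030"):
    """Re-encode a text file to UTF-8.

    gb18030 is a superset of GB2312/GBK, so it safely decodes either;
    undecodable bytes are replaced rather than aborting the conversion.
    """
    with open(src, "r", encoding=src_enc, errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
```

For example, `to_utf8("extracted_sogou_file.txt", "corpus.utf8")` (the input filename here is a placeholder for whatever the archive actually contains).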

5. Copy the build file into slmdata

# cp /usr/local/share/doc/sunpinyin/SLM-train.mk Makefile

6. Once everything is in place, the slmdata directory should contain:
$ ls -1
Makefile
corpus.utf8
dict.utf8

7. Run the build

# sudo make
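The Makefile chains the tools that appear in the log below: mmseg segments the corpus into word ids, ids2ngram counts trigrams, slmbuild and slmprune build and prune the model, slmthread compresses it, and genpyt emits the pinyin lookup table. As a toy illustration (not sunpinyin code), the counting stage performed by ids2ngram amounts to sliding an n-sized window over the id stream:

```python
from collections import Counter

def count_ngrams(ids, n=3):
    """Toy version of the ids2ngram step: slide an n-window over the
    word-id sequence produced by the segmenter and count occurrences."""
    counts = Counter()
    for i in range(len(ids) - n + 1):
        counts[tuple(ids[i:i + n])] += 1
    return counts
```

The real tool additionally spills partial counts to a temp file (`-s lm_sc.id.3gm.tmp`) and merges them, which is why "Merging...Done" shows up in the log.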

The build log looks like this:

make mmseg_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
mmseg -f bin -s 10 -a 9 -d dict.utf8 corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 750 words, 42 ambiguious. Done!
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...703 ngrams.


Counting Nr...


Appending psuedo tail node for each level...


Cuting according freq...
    Cut level 3 with threshold 2...
    Cut level 2 with threshold 2...
    Cut level 1 with threshold 0...


Discounting...
    Initializing level 3's Absolution discount method: parameter c=0.920721
    Initializing level 2's Absolution discount method: parameter c=0.868512
    Initializing level 1's Absolution discount method: Using given parameter c=0.000500


    Discounting level 3 ...
    Discounting level 2 ...
    Discounting level 1 ...
    Giving psuedo root level 0 a distribution...


Calculating Back-Off Weight...
    Processing level 0 (2 items)...
    Processing level 1 (352 items)...
    Processing level 2 (16 items)...


Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
  Level 3 (4 items), no need to cut as your command!
  Level 2 (15 items), no need to cut as your command!
  Level 1 (351 items), no need to cut as your command!


Updating back-off weight
    Level 0...
    Level 1...
    Level 2...
Writing target language model lm_sc.3gm...done!


slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...52 float values ==> 52 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
tslminfo -p -v -l dict.utf8 lm_sc.t3g > lm_sc.t3g.arpa
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make slm_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
slmseg -f bin -s 10 \
-d dict.utf8 -m lm_sc.t3g corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 794 words, 0 ambiguious. Done!
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make slm_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...749 ngrams.


Counting Nr...


Appending psuedo tail node for each level...


Cuting according freq...
    Cut level 3 with threshold 2...
    Cut level 2 with threshold 2...
    Cut level 1 with threshold 0...


Discounting...
    Initializing level 3's Absolution discount method: parameter c=0.927677
    Initializing level 2's Absolution discount method: parameter c=0.880512
    Initializing level 1's Absolution discount method: Using given parameter c=0.000500


    Discounting level 3 ...
    Discounting level 2 ...
    Discounting level 1 ...
    Giving psuedo root level 0 a distribution...


Calculating Back-Off Weight...
    Processing level 0 (2 items)...
    Processing level 1 (387 items)...
    Processing level 2 (17 items)...


Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
  Level 3 (4 items), no need to cut as your command!
  Level 2 (16 items), no need to cut as your command!
  Level 1 (386 items), no need to cut as your command!


Updating back-off weight
    Level 0...
    Level 1...
    Level 2...
Writing target language model lm_sc.3gm...done!


slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...56 float values ==> 56 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
slmseg -f bin -s 10 \
-d dict.utf8 -m lm_sc.t3g corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 794 words, 0 ambiguious. Done!
tslminfo -p -v -l dict.utf8 lm_sc.t3g > lm_sc.t3g.arpa
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make lexicon3
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...749 ngrams.


Counting Nr...


Appending psuedo tail node for each level...


Cuting according freq...
    Cut level 3 with threshold 2...
    Cut level 2 with threshold 2...
    Cut level 1 with threshold 0...


Discounting...
    Initializing level 3's Absolution discount method: parameter c=0.927677
    Initializing level 2's Absolution discount method: parameter c=0.880512
    Initializing level 1's Absolution discount method: Using given parameter c=0.000500


    Discounting level 3 ...
    Discounting level 2 ...
    Discounting level 1 ...
    Giving psuedo root level 0 a distribution...


Calculating Back-Off Weight...
    Processing level 0 (2 items)...
    Processing level 1 (387 items)...
    Processing level 2 (17 items)...


Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
  Level 3 (4 items), no need to cut as your command!
  Level 2 (16 items), no need to cut as your command!
  Level 1 (386 items), no need to cut as your command!


Updating back-off weight
    Level 0...
    Level 1...
    Level 2...
Writing target language model lm_sc.3gm...done!


slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...56 float values ==> 56 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
genpyt -i dict.utf8 -s lm_sc.t3g \
-l pydict_sc.log -o pydict_sc.bin
Opening language model...done!
Adding pinyin and corresponding words...
Warning! unrecognized syllable hng
    57445 primitive nodes
    79221 total nodes
Writing out...done!
Printing the lexicon out to log_file...done!

make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
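The `-d ABS` flags passed to slmbuild select absolute discounting, which the log reports as "Absolution discount method: parameter c=...". Conceptually, a small constant is subtracted from every observed n-gram count and the freed probability mass is handed to the back-off distribution. A toy sketch of the idea (not slmbuild's actual implementation):

```python
def abs_discount(counts, c):
    """Absolute discounting sketch: subtract a constant c from each
    observed count, normalize, and report the reserved mass that the
    back-off model redistributes to unseen events."""
    total = sum(counts.values())
    probs = {w: (n - c) / total for w, n in counts.items() if n > c}
    reserved = 1.0 - sum(probs.values())  # mass left for back-off
    return probs, reserved
```

With counts {a: 3, b: 1} and c = 0.5, for instance, a quarter of the probability mass is reserved for the back-off distribution.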


8. Generated results

$ ls -1
Makefile
corpus.utf8
dict.utf8
lm_sc.3gm
lm_sc.3gm.raw
lm_sc.id.3gm
lm_sc.ids
lm_sc.t3g
lm_sc.t3g.arpa
pydict_sc.bin
pydict_sc.log
