An Example of SLM Training-Data Generation with Sunpinyin

This article shows how to generate training data for sunpinyin's SLM (Statistical Language Model): building and installing sunpinyin, creating a data directory, downloading the lexicon and corpus, running the build, and inspecting the generated results.

1. Build and install sunpinyin

# sudo scons install

2. Create a working directory slmdata

# mkdir slmdata

3. Download the lexicon archive dict.utf8-20120823.tar.bz2 from http://sourceforge.net/projects/open-gram/files/ and extract it into slmdata; this yields the file dict.utf8

# tar -jxvf dict.utf8-20120823.tar.bz2

4. Prepare the corpus file. Here we download Sogou's text-classification corpus from http://www.sogou.com/labs/dl/c.html, extract it, rename the text to corpus.utf8, and place it in the slmdata directory
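A practical note: the Sogou classification corpus is commonly distributed in GB2312/GBK encoding, while the sunpinyin tools expect UTF-8 (hence the name corpus.utf8). If your download is not already UTF-8, a minimal re-encoding sketch (the function name and the gb18030 source encoding are my assumptions, not part of the original steps):

```python
def to_utf8(src, dst, src_enc="gb18030"):
    """Re-encode a text file to UTF-8.

    gb18030 is a superset of GB2312/GBK, so it safely decodes either;
    undecodable bytes are replaced rather than aborting the conversion.
    """
    with open(src, "r", encoding=src_enc, errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
```

For example, `to_utf8("extracted_sogou_file.txt", "corpus.utf8")` (the input filename here is a placeholder for whatever the archive actually contains).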

5. Copy the build file into slmdata

# cp /usr/local/share/doc/sunpinyin/SLM-train.mk Makefile

6. Once everything is in place, the slmdata directory should contain:
$ ls -1
Makefile
corpus.utf8
dict.utf8

7. Run the build

# sudo make
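The Makefile chains the tools that appear in the log below: mmseg segments the corpus into word ids, ids2ngram counts trigrams, slmbuild and slmprune build and prune the model, slmthread compresses it, and genpyt emits the pinyin lookup table. As a toy illustration (not sunpinyin code), the counting stage performed by ids2ngram amounts to sliding an n-sized window over the id stream:

```python
from collections import Counter

def count_ngrams(ids, n=3):
    """Toy version of the ids2ngram step: slide an n-window over the
    word-id sequence produced by the segmenter and count occurrences."""
    counts = Counter()
    for i in range(len(ids) - n + 1):
        counts[tuple(ids[i:i + n])] += 1
    return counts
```

The real tool additionally spills partial counts to a temp file (`-s lm_sc.id.3gm.tmp`) and merges them, which is why "Merging...Done" shows up in the log.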

The build log looks like this:

make mmseg_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
mmseg -f bin -s 10 -a 9 -d dict.utf8 corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 750 words, 42 ambiguious. Done!
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...703 ngrams.


Counting Nr...


Appending psuedo tail node for each level...


Cuting according freq...
    Cut level 3 with threshold 2...
    Cut level 2 with threshold 2...
    Cut level 1 with threshold 0...


Discounting...
    Initializing level 3's Absolution discount method: parameter c=0.920721
    Initializing level 2's Absolution discount method: parameter c=0.868512
    Initializing level 1's Absolution discount method: Using given parameter c=0.000500


    Discounting level 3 ...
    Discounting level 2 ...
    Discounting level 1 ...
    Giving psuedo root level 0 a distribution...


Calculating Back-Off Weight...
    Processing level 0 (2 items)...
    Processing level 1 (352 items)...
    Processing level 2 (16 items)...


Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
  Level 3 (4 items), no need to cut as your command!
  Level 2 (15 items), no need to cut as your command!
  Level 1 (351 items), no need to cut as your command!


Updating back-off weight
    Level 0...
    Level 1...
    Level 2...
Writing target language model lm_sc.3gm...done!


slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...52 float values ==> 52 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
tslminfo -p -v -l dict.utf8 lm_sc.t3g > lm_sc.t3g.arpa
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make slm_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
slmseg -f bin -s 10 \
-d dict.utf8 -m lm_sc.t3g corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 794 words, 0 ambiguious. Done!
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make slm_trigram
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...749 ngrams.


Counting Nr...


Appending psuedo tail node for each level...


Cuting according freq...
    Cut level 3 with threshold 2...
    Cut level 2 with threshold 2...
    Cut level 1 with threshold 0...


Discounting...
    Initializing level 3's Absolution discount method: parameter c=0.927677
    Initializing level 2's Absolution discount method: parameter c=0.880512
    Initializing level 1's Absolution discount method: Using given parameter c=0.000500


    Discounting level 3 ...
    Discounting level 2 ...
    Discounting level 1 ...
    Giving psuedo root level 0 a distribution...


Calculating Back-Off Weight...
    Processing level 0 (2 items)...
    Processing level 1 (387 items)...
    Processing level 2 (17 items)...


Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
  Level 3 (4 items), no need to cut as your command!
  Level 2 (16 items), no need to cut as your command!
  Level 1 (386 items), no need to cut as your command!


Updating back-off weight
    Level 0...
    Level 1...
    Level 2...
Writing target language model lm_sc.3gm...done!


slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...56 float values ==> 56 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
slmseg -f bin -s 10 \
-d dict.utf8 -m lm_sc.t3g corpus.utf8 > lm_sc.ids
Loading lexicon...done
Processing corpus.utf8...@Offset 3945, 794 words, 0 ambiguious. Done!
tslminfo -p -v -l dict.utf8 lm_sc.t3g > lm_sc.t3g.arpa
make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
make lexicon3
make[1]: Entering directory `/media/sf_share/open-gram/open-gram'
ids2ngram -n 3 -p 20000000 -s lm_sc.id.3gm.tmp -o lm_sc.id.3gm lm_sc.ids
Processing lm_sc.ids:.
Merging...Done
rm -f lm_sc.id.3gm.tmp
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
-o lm_sc.3gm.raw lm_sc.id.3gm
Reading and Processing raw idngram...749 ngrams.


Counting Nr...


Appending psuedo tail node for each level...


Cuting according freq...
    Cut level 3 with threshold 2...
    Cut level 2 with threshold 2...
    Cut level 1 with threshold 0...


Discounting...
    Initializing level 3's Absolution discount method: parameter c=0.927677
    Initializing level 2's Absolution discount method: parameter c=0.880512
    Initializing level 1's Absolution discount method: Using given parameter c=0.000500


    Discounting level 3 ...
    Discounting level 2 ...
    Discounting level 1 ...
    Giving psuedo root level 0 a distribution...


Calculating Back-Off Weight...
    Processing level 0 (2 items)...
    Processing level 1 (387 items)...
    Processing level 2 (17 items)...


Writing result file...
slmprune lm_sc.3gm.raw lm_sc.3gm R 100000 2500000 1000000
Reading language model lm_sc.3gm.raw...done!
Erasing items using Entropy distance
  Level 3 (4 items), no need to cut as your command!
  Level 2 (16 items), no need to cut as your command!
  Level 1 (386 items), no need to cut as your command!


Updating back-off weight
    Level 0...
    Level 1...
    Level 2...
Writing target language model lm_sc.3gm...done!


slmthread lm_sc.3gm lm_sc.t3g
Loading original slm...
first pass...
Compressing pr values...56 float values ==> 56 values
Compressing bow values...39 float values ==> 39 values
Threading the new model...
Writing out...done!
genpyt -i dict.utf8 -s lm_sc.t3g \
-l pydict_sc.log -o pydict_sc.bin
Opening language model...done!
Adding pinyin and corresponding words...
Warning! unrecognized syllable hng
    57445 primitive nodes
    79221 total nodes
Writing out...done!
Printing the lexicon out to log_file...done!

make[1]: Leaving directory `/media/sf_share/open-gram/open-gram'
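The `-d ABS` flags passed to slmbuild select absolute discounting, which the log reports as "Absolution discount method: parameter c=...". Conceptually, a small constant is subtracted from every observed n-gram count and the freed probability mass is handed to the back-off distribution. A toy sketch of the idea (not slmbuild's actual implementation):

```python
def abs_discount(counts, c):
    """Absolute discounting sketch: subtract a constant c from each
    observed count, normalize, and report the reserved mass that the
    back-off model redistributes to unseen events."""
    total = sum(counts.values())
    probs = {w: (n - c) / total for w, n in counts.items() if n > c}
    reserved = 1.0 - sum(probs.values())  # mass left for back-off
    return probs, reserved
```

With counts {a: 3, b: 1} and c = 0.5, for instance, a quarter of the probability mass is reserved for the back-off distribution.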


8. Generated results

$ ls -1
Makefile
corpus.utf8
dict.utf8
lm_sc.3gm
lm_sc.3gm.raw
lm_sc.id.3gm
lm_sc.ids
lm_sc.t3g
lm_sc.t3g.arpa
pydict_sc.bin
pydict_sc.log
