Using Language Models

alicexc++

Category: NLP. Tags: language model

Article link: https://blog.csdn.net/xiuchixc/article/details/8224576

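A labmate has recently been working with SRILM and asked me how to handle large-scale training. Embarrassingly, I have not used it in a long time.

Here is what I wrote back then: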

# Split inputfile into small files under outputfiledir, using the file-name prefix outputfile_prefix
split -l 100 inputfile outputfiledir/outputfile_prefix
# Enter outputfiledir
cd outputfiledir
# Build the list of files, file-list
find . -not -name 'file-list' -type f > file-list
# Put ngram-count, ngram-merge, get-gt-counts, make-gt-discounts, make-kn-counts and make-kn-discounts on PATH
export PATH=$PATH:/home/xiuchi/baseline500/srilm/bin
# Run the following from the current directory (outputfiledir)
../srilm/bin/make-batch-counts file-list 10 cat counts -order 5 -interpolate -kndiscount
../srilm/bin/merge-batch-counts counts
# In this step, adding -kndiscount raised an error for me; without -order the default order is 3
# merge-iter6-1.ngrams.gz was the final merged file in my run; the name depends on how many merge iterations ran
../srilm/bin/make-big-lm -read counts/merge-iter6-1.ngrams.gz -lm ../lm/europarl.lm -order 5 -interpolate

References
http://www.speech.sri.com/projects/srilm/manpages/training-scripts.1.html
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html
http://hi.baidu.com/wit_yd/blog/item/a6655b122033ebcbc2fd782f.html

The following is reposted from http://www.cnblogs.com/lacozhang/archive/2012/10/24/2737679.html

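The problem with using SRILM to build a language model is that memory runs short. I had at most 3 GB, and another program was running at the same time, so training in a single pass was impossible. The SRILM FAQ at http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html gives a workaround: split the large file into many small files, then merge the counts from the small files to complete the training. The steps are as follows: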

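First, use split to break the large file into at most 26*26 files (alphabetic suffixes, the default) or 100 files (numeric suffixes, which requires -d). These limits follow from the default suffix length of 2; GNU split's -a option lengthens the suffix. The file can be split by lines (-l num) or by size (-b size), and you can supply a prefix for the output files (the default is x). When splitting by lines, each run of num lines goes into its own file, and the files are produced in alphabetical order. For language-model training the command is: split [ -d ] -l NUM_LINES BigFile [ Prefix ]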
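The split step can be tried on a toy file before touching a real corpus. This is a minimal sketch, assuming GNU split; the scratch directory, the file name corpus.txt and the prefix part_ are made up for illustration.

```shell
#!/bin/sh
# Sketch: split a toy 250-line "corpus" into 100-line pieces (GNU split assumed).
workdir=$(mktemp -d)                  # hypothetical scratch directory
seq 1 250 > "$workdir/corpus.txt"     # stand-in for the real corpus
mkdir "$workdir/parts"

# -d: numeric suffixes; -l 100: 100 lines per piece; "part_" is a made-up prefix
split -d -l 100 "$workdir/corpus.txt" "$workdir/parts/part_"

ls "$workdir/parts"                   # part_00 part_01 part_02
n_parts=$(ls "$workdir/parts" | wc -l | tr -d ' ')
last_lines=$(wc -l < "$workdir/parts/part_02" | tr -d ' ')
echo "$n_parts pieces, last piece has $last_lines lines"
```

The last piece simply holds whatever lines remain, so uneven corpus sizes need no special handling.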

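Next, run the make-batch-counts script: make-batch-counts file-list 5 cat counts -order 5 -sort. Here file-list is a file that lists the names of the small files produced by the split; 5 means that every 5 small files are counted together in one ngram-count run, producing one count file per batch; cat is the filter command that the input text is piped through before counting, and cat passes it through unchanged; counts is the directory where the count files are written; the remaining arguments are passed to ngram-count, so you can set them as needed.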
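To see how a batch size of 5 groups the entries of file-list, the grouping alone can be reproduced with xargs. This is only an illustration of the batching idea, not how make-batch-counts is implemented, and it runs no SRILM tool; the empty part_NN files are placeholders.

```shell
#!/bin/sh
# Sketch: show how 12 listed files fall into batches of 5 (grouping only).
workdir=$(mktemp -d)                  # hypothetical scratch directory
cd "$workdir" || exit 1

# Create 12 empty stand-in files: part_01 ... part_12
for i in $(seq -w 1 12); do : > "part_$i"; done

# Build the file list the same way as for the real pipeline
find . -name 'part_*' -type f | sort > file-list

# xargs -n 5 with no command echoes its arguments, 5 per line: one line per batch
xargs -n 5 < file-list > batches.txt
cat batches.txt
n_batches=$(wc -l < batches.txt | tr -d ' ')
echo "$n_batches batches"
```

With 12 files the batches hold 5, 5 and 2 files, so a smaller batch size lowers the memory needed per counting run at the cost of more count files to merge.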

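Then use merge-batch-counts to merge all the small count files into one large count file: merge-batch-counts [ -l N ] counts [ filename-list ]. This merges all the files in the counts directory into a single file. If some files should not take part in the merge, append a filename-list; only the files listed in it are merged. The -l N option specifies that N files are merged at a time.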
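Merging works because n-gram counts are additive: the count of an n-gram in the whole corpus is the sum of its counts in each piece. The sketch below shows that arithmetic with awk on two toy count files in the "n-gram, TAB, count" layout. It is an assumption-laden illustration, not SRILM's merge implementation, which operates on sorted, gzipped count files.

```shell
#!/bin/sh
# Sketch: sum the counts of identical n-grams across two toy count files.
workdir=$(mktemp -d)                  # hypothetical scratch directory

# Each line: n-gram words, a TAB, then the count
printf 'the cat\t2\nthe dog\t1\n' > "$workdir/a.counts"
printf 'the cat\t3\na dog\t4\n'   > "$workdir/b.counts"

# Accumulate counts per n-gram, then sort for a stable order
merged=$(cat "$workdir/a.counts" "$workdir/b.counts" |
  awk -F'\t' '{ c[$1] += $2 } END { for (k in c) printf "%s\t%d\n", k, c[k] }' |
  LC_ALL=C sort)

printf '%s\n' "$merged"
```

Here "the cat" appears in both files with counts 2 and 3, so the merged count is 5, while the n-grams seen in only one file keep their original counts.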

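Finally, run the make-big-lm script, whose arguments resemble those of ngram-count: make-big-lm -read *.gz -order 5 -lm my.lm, where *.gz stands for the single merged count file produced in the previous step.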