Understanding N-gram Statistical Language Models and Using SRILM

wenstery

Installing SRILM and basic use of ngram-count

Category: Speech recognition/understanding, 2013-02-05 18:15

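SRILM is a toolkit for building and analyzing statistical language models. It provides command-line tools such as ngram and ngram-count, which make it easy to estimate N-gram language models.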

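1. Download

I first tried downloading from this page, which was very slow: http://www.speech.sri.com/projects/srilm/download.html. I then switched to another site and downloaded a 1.5 release directly:

wget ftp://ftp.speech.sri.com/pub/people/stolcke/srilm/srilm-1.5.7.tar.gz

This release is not particularly old; the latest version at the time of writing is 1.7.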

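2. Installation

My machine is 64-bit.

SRILM depends on Tcl, which can be downloaded from http://www.tcl.tk/software/tcltk/download.html (installing it is routine: unpack the archive and go into the unix directory, which contains the configure script).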

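Steps to install SRILM:

export SRILM=`pwd`
make MACHINE_TYPE=i686-m64
If you get an error saying the Tcl library cannot be found, edit the Makefile, which has two variables, TCL_INCLUDE and TCL_LIBRARY. They can be set, for example, to -I/usr/local/include and -L/usr/local/tcl8.5 respectively.
Go into the test directory to try the build: cd test ; make all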

That completes the build. The command-line programs are now in the ./bin/i686-m64/ directory, and I simply added that path to PATH.
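One way to add the directory to PATH is sketched below. It assumes SRILM still points at the top of the source tree (as set by the export step above) and that MACHINE_TYPE matches the value passed to make; adjust both if your setup differs.

```shell
# Assumption: SRILM was exported in the top-level source directory, as in the
# build steps; fall back to the current directory if it is not set.
: "${SRILM:=$PWD}"
MACHINE_TYPE=i686-m64   # must match the value passed to make

# Put the SRILM binaries first on the search path for this shell session.
export PATH="$SRILM/bin/$MACHINE_TYPE:$PATH"

# Report where the shell now finds ngram-count, if anywhere.
command -v ngram-count || echo "ngram-count not found under $SRILM/bin/$MACHINE_TYPE"
```

This only affects the current shell session. To make it permanent, the export line can be added to a shell startup file such as ~/.bashrc.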

3. Testing

Create a text file, for example source.txt, with any content you like, such as:

 [root@localhost lm]# cat source.txt   
 If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.  
 [root@localhost lm]#  
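
ngram-count treats each whitespace-separated string as a word, so punctuation stays attached to tokens such as it, and list. in this sentence. As a rough cross-check (standard Unix tools only, not how SRILM itself is implemented), the sketch below counts the distinct unigrams and bigrams in the sentence after adding the <s> and </s> sentence boundary markers. The variable names are illustrative.

```shell
text='If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.'

# One token per line, with the sentence boundary markers added.
tokens=$(printf '%s\n' "<s> $text </s>" | tr -s ' ' '\n')

# Distinct unigrams (word types, including <s> and </s>).
unigram_types=$(printf '%s\n' "$tokens" | sort -u | wc -l | tr -d ' ')

# Distinct bigrams: pair each token with the token before it.
bigram_types=$(printf '%s\n' "$tokens" \
    | awk 'NR > 1 { print prev, $0 } { prev = $0 }' \
    | sort -u | wc -l | tr -d ' ')

echo "unigram types: $unigram_types"
echo "bigram types:  $bigram_types"
```

Both counts come out as 22: 20 distinct words plus <s> and </s>, and 22 distinct adjacent pairs. These match the ngram 1=22 and ngram 2=22 header lines of the model file shown further down.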

Then run the command:

 ngram-count -text source.txt -lm source.lm  

This builds a statistical language model from source.txt and stores it in source.lm, which looks like this:

 [root@localhost lm]# cat source.lm   
   
 \data\  
 ngram 1=22  
 ngram 2=22  
 ngram 3=0  
   
 \1-grams:  
 -1.341524   </s>  
 -99 <s>   -99  
 -1.341524   If  -99  
 -1.050479   SRILM   -7.440329  
 -1.341524   are -99  
 -1.341524   consider    -99  
 -1.341524   do  -99  
 -1.341524   generally   -99  
 -1.341524   in  -99  
 -1.341524   interested  -99  
 -1.341524   it, -99  
 -1.341524   joining -99  
 -1.341524   list.   -99  
 -1.341524   mailing -99  
 -1.341524   or  -99  
 -1.341524   please  -99  
 -1.341524   the -99  
 -1.341524   to  -99  
 -1.341524   use -99  
 -1.341524   user    -99  
 -1.341524   want    -99  
 -1.341524   you -99  
   
 \2-grams:  
 0   <s> If  
 0   If you  
 -0.30103    SRILM or  
 -0.30103    SRILM user  
 0   are generally  
 0   consider joining  
 0   do want  
 0   generally interested  
 0   in it,  
 0   interested in  
 0   it, please  
 0   joining the  
 0   list. </s>  
 0   mailing list.  
 0   or are  
 0   please consider  
 0   the SRILM  
 0   to use  
 0   use SRILM  
 0   user mailing  
 0   want to  
 0   you do  
   
 \3-grams:  
   
 \end\  
 [root@localhost lm]#   
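
The numbers in the first column of source.lm are base-10 logarithms of probabilities, and the optional last column is a base-10 backoff weight (this is the ARPA n-gram file format). A minimal sketch for converting a score back to a probability, using a hypothetical helper named log10_to_prob built on awk:

```shell
# Convert a base-10 log probability, as stored in the model file, to a probability.
# log10_to_prob is an illustrative helper, not part of SRILM.
log10_to_prob() {
    awk -v lp="$1" 'BEGIN { printf "%.4f\n", exp(lp * log(10)) }'
}

log10_to_prob -0.30103   # score of the bigram "SRILM or"
log10_to_prob 0          # score of the bigram "<s> If"
```

The score -0.30103 corresponds to a probability of 0.5: SRILM occurs twice in the text, followed once by or and once by user. A score of 0 corresponds to probability 1, which is what every other bigram in this one-sentence corpus gets.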

If you want to count only specific words, create a word list file, for example source.dict:

 [root@localhost lm]# cat source.dict   
 you  
 are  
 list  
 please  
 [root@localhost lm]#  

With this list, only these four words will be counted. Run the command:

 ngram-count -text source.txt -lm source.lm -vocab source.dict   

The result is as follows:

 [root@localhost lm]# cat source.lm   
   
 \data\  
 ngram 1=6  
 ngram 2=0  
 ngram 3=0  
   
 \1-grams:  
 -0.60206    </s>  
 -99 <s>  
 -0.60206    are  
 -7.180781   list  
 -0.60206    please  
 -0.60206    you  
   
 \2-grams:  
   
 \3-grams:  
   
 \end\  
 [root@localhost lm]#  
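
To see why the four listed words do not all get the same score, it helps to count how often each one occurs as a whole token in the text. The sketch below does this with standard Unix tools; it is an illustration under the assumption of whitespace tokenization, not SRILM's own counting code, and count_vocab_tokens is a hypothetical helper name.

```shell
text='If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.'
vocab='you are list please'

# Print each vocabulary word with the number of times it occurs as a whole token.
count_vocab_tokens() {
    for w in $vocab; do
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
        n=$(printf '%s\n' "$text" | tr -s ' ' '\n' | grep -cxF -- "$w" || true)
        echo "$w $n"
    done
}

count_vocab_tokens
```

The word list gets a count of 0 because the text contains list. with a trailing period, which is consistent with its much lower score (-7.180781) in the model above. The three words that do occur, plus </s>, each appear once, and -0.60206 is log10(1/4).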

There are no 2-grams, because no two of the listed words are adjacent in the text. Modify source.dict so that a 2-gram can appear, as follows:

 [root@localhost lm]# cat source.dict   
 you  
 do  
 mailing  
 are  
 list  
 please  
 [root@localhost lm]#   

Run ngram-count again. The result is as follows:

 [root@localhost lm]# cat source.lm   
   
 \data\  
 ngram 1=8  
 ngram 2=1  
 ngram 3=0  
   
 \1-grams:  
 -0.7781513  </s>  
 -99 <s>  
 -0.7781513  are  
 -0.7781513  do  
 -7.269613   list  
 -0.7781513  mailing  
 -0.7781513  please  
 -0.7781513  you -99  
   
 \2-grams:  
 0   you do  
   
 \3-grams:  
   
 \end\  
 [root@localhost lm]#
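
The single 2-gram can be reproduced by hand. The sketch below lists every pair of adjacent tokens in the text where both tokens are in the word list. It uses standard Unix tools under the assumption of whitespace tokenization, ignores the <s> and </s> markers for simplicity, and adjacent_vocab_pairs is a hypothetical helper name rather than anything provided by SRILM.

```shell
text='If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.'
vocab='you do mailing are list please'

# Print each pair of adjacent tokens in which both tokens are vocabulary words.
adjacent_vocab_pairs() {
    printf '%s\n' "$text" | tr -s ' ' '\n' | awk -v vocab="$vocab" '
        BEGIN {
            n = split(vocab, words, " ")
            for (i = 1; i <= n; i++) in_vocab[words[i]] = 1
        }
        NR > 1 && (prev in in_vocab) && ($0 in in_vocab) { print prev, $0 }
        { prev = $0 }'
}

adjacent_vocab_pairs
```

The only pair printed is you do. The pair mailing list does not qualify because the token in the text is list. with a trailing period. In the 1-gram section, -0.7781513 is log10(1/6), matching the six tokens that are counted once each: you, do, mailing, are, please and </s>.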