Building Language Model

There are two types of models that describe language - grammars and statistical language models. Grammars describe very simple types of languages for command and control, and they are usually written by hand or generated automatically with plain code.

 

There are many ways to build statistical language models. When your data set is large, it makes sense to use the CMU language modeling toolkit. When the model is small, you can use a quick online web service. And if you need specific options, or you just want to use your favorite toolkit that builds ARPA models, you can do that instead.

 

Building a grammar

Grammars are usually written manually in JSGF format:

#JSGF V1.0;
/**
 * JSGF Grammar for Hello World example
 */
grammar hello;
public <greet> = (Good morning | Hello) ( Bhiksha | Evandro | Paul | Philip | Rita | Will );
 

For more information on the JSGF format, see the documentation:

http://docs.oracle.com/cd/E17802_01/products/products/java-media/speech/forDevelopers/jsapi-doc/javax/speech/recognition/RuleGrammar.html
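
Once you also have a pronunciation dictionary covering the words in the grammar, you can usually try it directly with the PocketSphinx command-line decoder, assuming your PocketSphinx build supports the -jsgf option. A minimal sketch (hello.gram and hello.dic are hypothetical file names):

pocketsphinx_continuous -jsgf hello.gram -dict hello.dic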

 

Building a Statistical Language Model Using CMUCLMTK

Required Software

You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.

 

Text preparation

First of all you need to clean up the text: expand abbreviations, convert numbers to words, and remove non-word items. For example, to clean a Wikipedia XML dump you can use special python scripts. To clean HTML pages you can try http://code.google.com/p/boilerpipe/, a nice package specifically created to extract text from HTML.
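
A minimal sketch of such a cleanup with standard Unix tools might look like the following; raw.txt and clean.txt are hypothetical file names, and a real corpus usually needs more careful, language-specific normalization (abbreviation expansion, number-to-word conversion, and so on):

# lowercase, replace anything that is not a letter, digit or space with a space,
# then squeeze repeated spaces (one sentence per line is assumed)
tr 'A-Z' 'a-z' < raw.txt | sed 's/[^a-z0-9 ]/ /g' | tr -s ' ' > clean.txt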

 

For an example of how to create a language model from Wikipedia texts, please see

http://trulymadlywordly.blogspot.ru/2011/03/creating-text-corpus-from-wikipedia.html

Once you have gone through the language model building process, please submit your language model to the CMUSphinx project; we'd be glad to share it!

 

Language modeling for Mandarin is largely the same as for English, with one additional consideration: the input text must be word-segmented. A segmentation tool and an associated word list are provided to accomplish this.

 

ARPA model training

The process for creating a language model is as follows:

 

1) Prepare a reference text that will be used to generate the language model. The language model toolkit expects its input to be in the form of normalized text files, with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be the set of sentences that are bounded by the start and end sentence markers: <s> and </s>. Here's an example:

 

<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be 
light and patchy but heavier rain may develop in the west later </s>
 

More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.
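
Before training it is worth a quick sanity check that your own corpus looks as expected, assuming one <s> ... </s> utterance per line:

# count the utterances and inspect the first few lines
wc -l weather.txt
head -3 weather.txt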

 

2) Generate the vocabulary file. This is a list of all the words in the file:

    text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab
 

3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names).  If you find misspellings, it is a good idea to fix them in the input transcript.
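
Part of this cleanup can be scripted. For example, a hypothetical quick pass that drops purely numeric “words” and writes the edited vocabulary to the weather.vocab file used in step 5:

grep -v '^[0-9]*$' weather.tmp.vocab > weather.vocab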

 

4) If you want a closed vocabulary language model (a language model that has no provisions for unknown words), then you should remove sentences from your input transcript that contain words that are not in your vocabulary file.
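
One possible sketch of this filtering, assuming the vocabulary is in weather.vocab and the normalized transcript in weather.txt, and producing the weather.closed.txt file used in step 5:

# read the vocabulary (skipping comment lines), then print only those sentences
# whose words, apart from the <s> and </s> markers, are all in the vocabulary
awk 'NR==FNR { if ($0 !~ /^#/) vocab[$1]=1; next }
     { for (i = 1; i <= NF; i++)
           if ($i != "<s>" && $i != "</s>" && !($i in vocab)) next;
       print }' weather.vocab weather.txt > weather.closed.txt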

 

5) Generate the ARPA format language model with the commands:

% text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt
% idngram2lm -vocab_type 0 -idngram weather.idngram -vocab \
     weather.vocab -arpa weather.arpa
 

6) Generate the CMU binary form (DMP)

sphinx_lm_convert -i weather.arpa -o weather.lm.DMP
 

The CMUCLMTK tools and commands are documented at The CMU-Cambridge Language Modeling Toolkit page.

 

Using other Language Model Toolkits

You can also use any other toolkit that generates ARPA text files. However, the resulting files must be sorted in order to work with the Sphinx decoders. You can use the sphinx_lm_sort utility included with SphinxBase to sort an ARPA format language model file for use with Sphinx, e.g.:

sphinx_lm_sort < unsorted.arpa > sorted.arpa
 

Then you can convert the model to DMP format and use it as usual.
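
For example, reusing the sphinx_lm_convert invocation shown earlier (sorted.arpa being the output of the sorting step above):

sphinx_lm_convert -i sorted.arpa -o sorted.lm.DMP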

There are several other toolkits you can try as well; they are usually pretty convenient to use.

 

Building a simple language model using a web service

 

If your language is English and the text is small, it's sometimes more convenient to use the web service to build it. Language models built in this way are quite functional for simple command and control tasks. First of all you need to create a corpus.

 

The “corpus” is just a list of sentences that you will use to train the language model. As an example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:

open browser
new e-mail
forward
backward
next window
last window
open music player
 

Then go to the page http://www.speech.cs.cmu.edu/tools/lmtool-new.html. Simply click on the “Browse…” button, select the corpus.txt file you created, then click “COMPILE KNOWLEDGE BASE”.

 

The legacy version is also still available online here: http://www.speech.cs.cmu.edu/tools/lmtool.html

 

You should see a page with some status messages, followed by a page entitled “Sphinx knowledge base”. This page will contain links entitled “Dictionary” and “Language Model”. Download these files and make a note of their names (they should consist of a 4-digit number followed by the extensions .dic and .lm). You can now test your newly created language model with PocketSphinx.

 

Converting a model into DMP format

 

To quickly load large models you will probably want to convert them to a binary format, which saves decoder initialization time. That's not necessary with small models. Pocketsphinx and sphinx3 can handle both formats with the -lm option. Sphinx4 requires you to pass a DMP model to the TrigramModel component and an ARPA model to the SimpleNGramModel component.

 

The ARPA format and the DMP format are mutually convertible. You can produce one from the other with the sphinx_lm_convert command from sphinxbase:

 

sphinx_lm_convert -i model.lm -o model.dmp
sphinx_lm_convert -i model.dmp -ifmt dmp -o model.lm -ofmt arpa
 

Using your language model

This section will show you how to use, test, and improve the language model you created.

Using your language model with PocketSphinx

 

If you have installed PocketSphinx, you will have a program called pocketsphinx_continuous which can be run from the command-line to recognize speech. Assuming it is installed under /usr/local, and your language model and dictionary are called 8521.dic and 8521.lm, try running the following command:

 

pocketsphinx_continuous -lm 8521.lm -dict 8521.dic
 

You will see a lot of diagnostic messages, followed by a pause, then “READY…”. Now you can try speaking some of the commands. It should be able to recognize them with complete accuracy. If not, you may have problems with your microphone or sound card.

 

Using your language model with Sphinx4

 

You just need to edit the configuration file and put a proper file name there. Sphinx-4 mostly works with DMP format. See the documentation for details:

http://cmusphinx.sourceforge.net/sphinx4/doc/UsingSphinxTrainModels.html

 
