Building a Language Model


There are two types of models that describe language: grammars and statistical language models. Grammars describe very simple types of languages for command and control, and they are usually written by hand or generated automatically with plain code.


There are many ways to build statistical language models. When your data set is large, it makes sense to use the CMU language modeling toolkit. When the model is small, you can use a quick online web service. When you need specific options, or you just want to use your favorite toolkit that builds ARPA models, you can use that instead.



Building a grammar


Grammars are usually written manually in JSGF format:


#JSGF V1.0;
/**
 * JSGF Grammar for Hello World example
 */
grammar hello;
public <greet> = (Good morning | Hello) ( Bhiksha | Evandro | Paul | Philip | Rita | Will );

For more information on the JSGF format, see the documentation.
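
As a quick illustration of what the hello grammar accepts, the sketch below enumerates every phrase the <greet> rule can produce. The two Python lists simply restate the alternatives from the rule by hand; nothing here parses JSGF, and the function name is made up for this example.

```python
from itertools import product

# Alternatives copied by hand from the <greet> rule of the hello grammar.
GREETINGS = ["Good morning", "Hello"]
NAMES = ["Bhiksha", "Evandro", "Paul", "Philip", "Rita", "Will"]


def expand_greet():
    """Return every phrase the <greet> rule can produce (greeting followed by name)."""
    return [f"{greeting} {name}" for greeting, name in product(GREETINGS, NAMES)]
```

Two greetings times six names gives twelve phrases in total, which shows why grammars suit small command-and-control tasks: the whole language can be listed.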



Building a Statistical Language Model Using CMUCLMTK


Required Software


You need to download and install cmuclmtk. See the CMU Sphinx Downloads page for details.


Text preparation

First of all, you need to clean up the text: expand abbreviations, convert numbers to words, and remove non-word items. For example, to clean a Wikipedia XML dump you can use special Python scripts. To clean HTML pages you can try a package created specifically to extract text from HTML.
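
The cleanup steps can be sketched in a few lines of Python. This is only a minimal illustration, not one of the scripts mentioned above: the abbreviation table is a hypothetical sample, numbers are spelled out only from 0 to 99, and everything that is not a plain word (markup, URLs, larger numbers) is simply dropped.

```python
import re

# Hypothetical lookup table; a real project would need a much larger one.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "approx.": "approximately"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(line):
    """Lowercase, expand abbreviations, spell out small numbers, drop non-word items."""
    words = []
    for token in line.lower().split():
        token = ABBREVIATIONS.get(token, token)
        token = token.strip(".,;:!?\"()[]")  # remove punctuation at the edges
        if re.fullmatch(r"[0-9]+", token) and int(token) < 100:
            words.append(number_to_words(int(token)))
        elif re.fullmatch(r"[a-z]+(?:['-][a-z]+)*", token):
            words.append(token)
        # Anything else is dropped in this sketch.
    return " ".join(words)
```

For instance, normalize("Dr. Smith lives at 42 Main St.") gives "doctor smith lives at forty two main street".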



For an example of how to create a language model from Wikipedia texts, please see


Once you have gone through the language model process, please submit your language model to the CMUSphinx project. We'd be glad to share it!



Language modeling for Mandarin is largely the same as for English, with one additional consideration: the input text must be word segmented. A segmentation tool and an associated word list are provided to accomplish this.
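
To show what word segmentation means in practice, here is a minimal sketch of greedy forward maximum matching against a word list. It is not the segmentation tool mentioned above, only a common textbook approach used for illustration; the function name, the max_len parameter, and the tiny word list in the usage note are all assumptions.

```python
def segment(text, word_list, max_len=4):
    """Greedy forward maximum matching: take the longest listed word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer is listed.
            if size == 1 or candidate in word_list:
                words.append(candidate)
                i += size
                break
    return words
```

With the word list {"今天", "天气", "很", "好"}, segment("今天天气很好", ...) returns ["今天", "天气", "很", "好"]; joining the result with spaces gives text in the space-separated form used for English.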



ARPA model training


The process for creating a language model is as follows:



1) Prepare a reference text that will be used to generate the language model. The language model toolkit expects its input to be normalized text files with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be a set of sentences, each bounded by the start and end sentence markers. Here's an example:



<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>

More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.
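
If your normalized sentences do not carry the markers yet, adding them is straightforward. The following is a small sketch that assumes one sentence per input string; the function name is hypothetical.

```python
def add_sentence_markers(sentences):
    """Wrap each non-empty sentence in <s> ... </s>, one sentence per output line."""
    lines = []
    for sentence in sentences:
        words = sentence.split()  # also collapses repeated whitespace
        if words:
            lines.append("<s> " + " ".join(words) + " </s>")
    return lines
```

Writing "\n".join(add_sentence_markers(sentences)) to a file gives input in the layout shown in the example above.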



2) Generate the vocabulary file. This is a list of all the words in the input file:

    text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab
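
Conceptually, this step counts how often each word occurs and then keeps the distinct words as the vocabulary. The Python sketch below only illustrates that idea; it is not a reimplementation of text2wfreq or wfreq2vocab, its output format is not the toolkit's, and the top parameter is an assumption added for the example.

```python
from collections import Counter


def build_vocabulary(lines, top=None):
    """Return the distinct words in sorted order, optionally only the `top` most frequent."""
    counts = Counter()
    for line in lines:
        # Sentence markers such as <s> are counted like ordinary tokens here.
        counts.update(line.split())
    return sorted(word for word, _ in counts.most_common(top))
```

Looking at the least frequent entries of such a count is often the quickest way to spot misspellings before moving on.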

3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names). If you find misspellings, it is a good idea to fix them in the input transcript. Step 5 below expects the edited vocabulary under the name weather.vocab.


4) If you want a closed vocabulary language model (a language model that has no provisions for unknown words), then you should remove sentences from your input transcript that contain words that are not in your vocabulary file. The command in step 5 below reads this filtered transcript as weather.closed.txt.
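
The filtering in step 4 can be sketched as follows. This is a minimal illustration that assumes the transcript and vocabulary are already loaded as Python lists of strings; the function name is hypothetical.

```python
def keep_in_vocabulary(sentences, vocabulary):
    """Keep only the sentences whose words all appear in the vocabulary."""
    vocab = set(vocabulary)
    return [s for s in sentences if all(word in vocab for word in s.split())]
```

If the sentences carry <s> and </s> markers, the markers must be in the vocabulary list too, otherwise every sentence is rejected.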


5) Generate the ARPA format language model with the commands below (-vocab_type 0 selects a closed vocabulary; the output name weather.arpa is just a file name of your choice):

% text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt
% idngram2lm -vocab_type 0 -idngram weather.idngram -vocab \
     weather.vocab -arpa weather.arpa

6) Generate the CMU binary form (DMP) from the ARPA file written in step 5:

sphinx_lm_convert -i weather.arpa -o weather.lm.DMP
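
Because the ARPA format is plain text, you can sanity-check the model from step 5 by reading the n-gram counts from its header. The sketch below assumes the usual ARPA layout, a \data\ line followed by lines such as "ngram 1=1234"; the function name is made up for this example.

```python
import re


def read_ngram_counts(lines):
    """Return {order: count} from the header that follows the \\data\\ line."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                break  # the first section marker (e.g. \1-grams:) ends the header
    return counts
```

Calling it on an open file handle, as in read_ngram_counts(open(path)), works because the function only iterates over lines. A unigram count far from the size of your vocabulary file suggests the wrong vocabulary was used.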

The CMUCLMTK tools and commands are documented on the CMU-Cambridge Language Modeling Toolkit page.


Using other Language Model Toolkits


You can also use any other toolkit that generates ARPA text files. However, the resulting files must be sorted in order to work with the Sphinx decoders. You can use the sphinx_lm_sort utility included with SphinxBase to sort an ARPA format language model file for use with Sphinx. It reads the model from standard input and writes the sorted model to standard output, e.g. (the file names are placeholders):


sphinx_lm_sort < unsorted.lm > sorted.lm

Then you can convert the model to DMP format and use it as usual.


Some toolkits you can try:


They are usually pretty convenient to use.



Building a simple language model using a web service



If your language is English and your text is small, it is sometimes more convenient to use a web service to build the model. Language models built this way are quite functional for simple command and control tasks. First of all, you need to create a corpus.



The “corpus” is just a list of sentences that you will use to train the language model. As an example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:


open browser
new e-mail
next window
last window
open music player
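
If the command list comes from a program rather than a text editor, the corpus file can be written with a few lines of Python. This is just a sketch: the COMMANDS list repeats the sentences shown above, and the function name is hypothetical.

```python
from pathlib import Path

# The same sentences as in the corpus.txt listing above.
COMMANDS = ["open browser", "new e-mail", "next window", "last window",
            "open music player"]


def write_corpus(path, sentences):
    """Write one sentence per line, the layout used by the corpus file."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
```

Calling write_corpus("corpus.txt", COMMANDS) recreates the file shown above.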

Then go to the web service page. Click the “Browse…” button, select the corpus.txt file you created, then click “COMPILE KNOWLEDGE BASE”.


The legacy version is also still available online here:


You should see a page with some status messages, followed by a page entitled “Sphinx knowledge base”. This page will contain links entitled “Dictionary” and “Language Model”. Download these files and make a note of their names (they should consist of a 4-digit number followed by the extensions .dic and .lm). You can now test your newly created language model with PocketSphinx.


Converting a model into DMP format



To load large models quickly, you will probably want to convert them to binary format, which shortens decoder initialization. This is not necessary for small models. PocketSphinx and Sphinx3 can handle both formats with the -lm option. Sphinx4 requires a DMP model for the TrigramModel component and an ARPA model for the SimpleNGramModel component.



The ARPA and DMP formats are mutually convertible. You can produce either file from the other with the sphinx_lm_convert command from sphinxbase:



sphinx_lm_convert -i model.lm -o model.dmp
sphinx_lm_convert -i model.dmp -ifmt dmp -o model.lm -ofmt arpa
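
When conversion is part of a larger build script, the two invocations above can be generated from Python. The sketch below only builds the argument list, using the same options as the commands above; the function name and the to_arpa parameter are assumptions for this example.

```python
def convert_command(src, dst, to_arpa=False):
    """Build the sphinx_lm_convert argument list for ARPA to DMP, or DMP to ARPA."""
    command = ["sphinx_lm_convert", "-i", src]
    if to_arpa:
        command += ["-ifmt", "dmp"]  # tell the tool the input is a DMP file
    command += ["-o", dst]
    if to_arpa:
        command += ["-ofmt", "arpa"]
    return command
```

To actually run the conversion, pass the list to subprocess.run(convert_command("model.lm", "model.dmp"), check=True); this assumes sphinx_lm_convert is installed and on your PATH.
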

Using your language model


This section will show you how to use, test, and improve the language model you created.


Using your language model with PocketSphinx



If you have installed PocketSphinx, you will have a program called pocketsphinx_continuous which can be run from the command line to recognize speech. Assuming it is installed under /usr/local, and your language model and dictionary are called 8521.dic and 8521.lm, try running the following command:



pocketsphinx_continuous -lm 8521.lm -dict 8521.dic

You will see a lot of diagnostic messages, followed by a pause, then “READY…”. Now you can try speaking some of the commands. It should be able to recognize them with complete accuracy. If not, you may have problems with your microphone or sound card.



Using your language model with Sphinx4



You just need to edit the configuration file and put the proper file name there. Sphinx4 mostly works with the DMP format. See the documentation for details:



