Building phonetic dictionary

http://cmusphinx.sourceforge.net/wiki/tutorialdict

Introduction

A phonetic dictionary gives the system the data it needs to map vocabulary words to sequences of phonemes. It looks like this:

hello H EH L OW
world W ER L D

A dictionary can also contain alternative pronunciations. In that case, designate them with a number in parentheses:

the TH IH
the(2) TH AH
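The format above is simple enough to load with a few lines of Python. The following is a minimal sketch, not part of CMUSphinx: the function name `parse_dictionary` is hypothetical, and it assumes one entry per line with whitespace-separated fields.

```python
import re
from collections import defaultdict

# Matches a word with an alternative-pronunciation suffix such as "the(2)".
_VARIANT = re.compile(r"^(?P<word>.+?)\((?P<index>\d+)\)$")


def parse_dictionary(lines):
    """Return {word: [phone list, ...]} built from dictionary lines."""
    entries = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and entries without phones
        head, phones = fields[0], fields[1:]
        match = _VARIANT.match(head)
        word = match.group("word") if match else head
        entries[word].append(phones)
    return dict(entries)


sample = ["hello H EH L OW", "world W ER L D", "the TH IH", "the(2) TH AH"]
lexicon = parse_dictionary(sample)
print(lexicon["the"])  # [['TH', 'IH'], ['TH', 'AH']]
```

Grouping the variants under the base word keeps all pronunciations of a word together, which makes later checks and merges easier.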

There are various phonesets for representing phones, such as IPA or SAMPA. CMUSphinx does not yet require you to use any well-known phoneset. However, it prefers letter-only phone names without special symbols. This convention simplifies some processing algorithms; for example, it lets you create files named after phones.
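If you are adapting a dictionary from another source, a quick check for phone names that are not letter-only can save trouble later. This is a small sketch under the assumption that "letter-only" means ASCII letters; the helper name `non_letter_phones` is hypothetical.

```python
def non_letter_phones(lines):
    """Return the set of phone names containing anything but ASCII letters."""
    bad = set()
    for line in lines:
        # The first field is the word; the remaining fields are phones.
        for phone in line.split()[1:]:
            if not (phone.isascii() and phone.isalpha()):
                bad.add(phone)
    return bad


entries = ["hello H EH L OW", "cat K AE1 T", "x @ k"]
print(sorted(non_letter_phones(entries)))  # ['@', 'AE1']
```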

The dictionary should contain all the words you are interested in; otherwise the recognizer will not be able to recognize them. However, having the words in the dictionary is not sufficient: the recognizer looks for each word both in the dictionary and in the language model. A word that is missing from the language model will not be recognized even if you have added it to the dictionary.

There is no need to remove unused words from the dictionary unless you want to save memory. Extra words in the dictionary do not affect accuracy.

Using existing dictionaries

A number of dictionaries cover the languages we support: CMUDict for US English, as well as dictionaries for French, German, Russian, Dutch, Italian, Spanish, and Mandarin. Other dictionaries can be found on the web. If a dictionary has the proper format, you can use it.

If a dictionary does not cover all the words you are interested in, you can extend it with a g2p (grapheme-to-phoneme) tool.

Using G2P-seq2seq to extend the dictionary

There are various tools that help you extend an existing dictionary with new words or build a new dictionary from scratch, such as Phonetisaurus and Sequitur.

We recommend our latest tool, g2p-seq2seq. It is based on neural networks implemented in the Tensorflow framework and provides state-of-the-art conversion accuracy.

An English model, a 2-layer LSTM with 256 hidden units, is available for download on the CMUSphinx website. Unpack the model after downloading it. It is trained on the CMU English dictionary, so it works only for English. For other languages, you first need to bootstrap a dictionary as described below and then use the G2P tool to extend it.

The easiest way to check how the tool works is to run it in interactive mode with the model above and type in some words:

  g2p-seq2seq --interactive --model model_folder_path
  > hello
  HH EH L OW

To generate pronunciations for an English word list with a trained model, run

  g2p-seq2seq --decode your_wordlist --model model_folder_path

The wordlist is a text file with words, one word per line.
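One way to produce such a wordlist is to collect the words in your vocabulary that the dictionary does not yet cover. The sketch below assumes the dictionary format shown in the introduction; the function names `missing_words` and `format_wordlist` are hypothetical.

```python
def missing_words(vocabulary, dictionary_lines):
    """Return the sorted vocabulary words that have no dictionary entry."""
    known = set()
    for line in dictionary_lines:
        fields = line.split()
        if fields:
            # Drop an alternative-pronunciation suffix such as "(2)".
            known.add(fields[0].split("(")[0])
    return sorted(set(vocabulary) - known)


def format_wordlist(words):
    """Return wordlist text with one word per line."""
    return "".join(word + "\n" for word in words)


dictionary = ["hello H EH L OW", "the TH IH", "the(2) TH AH"]
todo = missing_words(["hello", "sphinx", "the"], dictionary)
print(format_wordlist(todo), end="")  # sphinx
```

Writing the returned text to a file gives you the input for the decode command above.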

To train G2P, you need a dictionary with one word and its phone sequence per line, in the standard form. To run the training:

  g2p-seq2seq --train train_dictionary.dic --model model_folder_path

For more information on the tool see the corresponding page.

Bootstrapping dictionary for other languages

If you do not have a dictionary for your language, there are usually several ways to obtain one.

Dictionaries are usually bootstrapped with hand-written rules. You can find a list of phonemes for your language on the Wikipedia page about the language and write a simple Python script that maps words to phonemes. Rules cannot produce the best possible dictionary, though: most languages have quite irregular pronunciation, which may not be obvious to a newcomer even for languages that are conventionally thought to be spoken as written. This is due to coarticulation effects in human speech. For a basic dictionary, however, rules are good enough.
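A rule-based script of this kind can be as simple as a table of graphemes and a longest-match lookup. The sketch below is only an illustration: the rule table `RULES` is a made-up toy, not the phoneme inventory of any real language, and the function names are hypothetical.

```python
# Hypothetical toy rules; replace them with the graphemes and phones
# of your own language.
RULES = {
    "ch": ["CH"], "sh": ["SH"],
    "a": ["A"], "e": ["E"], "i": ["I"], "o": ["O"], "u": ["U"],
    "k": ["K"], "m": ["M"], "n": ["N"], "s": ["S"], "t": ["T"],
}


def word_to_phones(word, rules=RULES):
    """Convert a word to phones, trying the longest grapheme first."""
    max_len = max(len(grapheme) for grapheme in rules)
    word = word.lower()
    phones, pos = [], 0
    while pos < len(word):
        for size in range(min(max_len, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in rules:
                phones.extend(rules[chunk])
                pos += size
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule for {word[pos]!r} in {word!r}")
    return phones


def make_entry(word, rules=RULES):
    """Return one dictionary line: the word followed by its phones."""
    return f"{word} {' '.join(word_to_phones(word, rules))}"


print(make_entry("shika"))  # shika SH I K A
```

Raising an error on unknown letters, rather than skipping them, makes gaps in the rule table visible early.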

You can crawl Wiktionary to get pronunciations for the significant number of words it covers.

You can use TTS tools such as OpenMary (written in Java) or espeak (written in C) to create a phonetic dictionary for the languages they support.

Many languages that do not use the Latin alphabet, such as Korean or Japanese, have specialized software such as Mecab https://sourceforge.net/projects/mecab to romanize their words. You can use Mecab to build a phonetic dictionary by converting words to romanized form and then applying simple rules to turn them into phones.

It is enough to transcribe a few thousand of the most common words to bootstrap the dictionary.

Once the dictionary is bootstrapped, you can extend it to a larger vocabulary with the g2p-seq2seq tool as described in the previous section.
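When new pronunciations are added to a bootstrapped dictionary, the alternative-pronunciation numbering has to stay consistent. The following sketch merges two lists of entries; it assumes that the generated entries use the same "word phones" line format as the dictionary, and the function name `merge_entries` is hypothetical. Entries are sorted only for readability.

```python
def merge_entries(base_lines, new_lines):
    """Merge dictionary lines, numbering variants as word, word(2), word(3)."""
    pronunciations = {}
    for line in list(base_lines) + list(new_lines):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and entries without phones
        word = fields[0].split("(")[0]  # drop a suffix such as "(2)"
        phones = " ".join(fields[1:])
        variants = pronunciations.setdefault(word, [])
        if phones not in variants:  # ignore exact duplicates
            variants.append(phones)
    merged = []
    for word in sorted(pronunciations):
        for index, phones in enumerate(pronunciations[word], start=1):
            head = word if index == 1 else f"{word}({index})"
            merged.append(f"{head} {phones}")
    return merged


base = ["the TH IH", "the(2) TH AH"]
generated = ["hello HH EH L OW", "the TH AH"]
for entry in merge_entries(base, generated):
    print(entry)
```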
