The Kaldi Speech Recognition Toolkit

Arnab Ghoshal and Daniel Povey

SLTC Newsletter, February 2012

Kaldi is a free open-source toolkit for speech recognition research. It is written in C++ and provides a speech recognition system based on finite-state transducers, using the freely available OpenFst, together with detailed documentation and scripts for building complete recognition systems. The tools compile on commonly used Unix-like systems and on Microsoft Windows. The goal of Kaldi is to have modern and flexible code that is easy to understand, modify, and extend. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users. Kaldi is available from SourceForge.

WHY ANOTHER SPEECH TOOLKIT?

The work on Kaldi [1] started during the 2009 Johns Hopkins University summer workshop project titled "Low Development Cost, High Quality Speech Recognition for New Languages and Domains," where we were working on acoustic modeling using the subspace Gaussian mixture model (SGMM) [2]. To develop and test a new acoustic modeling technique, we needed a toolkit that was simple to understand and extend; had extensive linear algebra support; and came with a nonrestrictive license that allowed us to share our work with other researchers in academia or industry. We also preferred to use a finite-state transducer (FST) based framework.

While there were several potential choices for open-source ASR toolkits -- for example, HTK and Julius (both written in C), Sphinx-4 (written in Java), and the RWTH ASR toolkit (written in C++, and closest to Kaldi in terms of design and features) -- our specific requirements meant that we had to write many of the components, including the decoder, ourselves. Given the amount of effort invested and the continued use of the tools after the JHU workshop, it was a logical choice to extend the codebase into a full-featured toolkit. We held two follow-up summer workshops at the Brno University of Technology, Czech Republic, in 2010 and 2011, and further development of Kaldi is ongoing.

DESIGN OF KALDI

Important aspects of Kaldi include:

  • Integration with Finite State Transducers: We compile against the OpenFst toolkit, using it as a library (a small example follows this list).
  • Extensive linear algebra support: We include a matrix library that wraps standard BLAS and LAPACK routines (see the matrix example below).
  • Extensible design: We attempt to provide our algorithms in the most generic form possible. For instance, our decoders work with an interface that provides a score for a particular frame and FST input symbol, so the decoder can work from any suitable source of scores (see the decoder sketch below).
  • Open license: The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.
  • Complete recipes: We make available complete recipes for building state-of-the-art speech recognition systems that work from widely available databases, such as those provided by the Linguistic Data Consortium (LDC).
  • Thorough testing: The goal is for all or nearly all the code to have corresponding test routines.
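
To make the OpenFst integration concrete, here is a small sketch of using OpenFst as a library, in the way Kaldi compiles against it: building a two-state transducer in the tropical semiring and writing it to disk. The file name is illustrative; the OpenFst calls themselves are standard.

```cpp
// Minimal OpenFst usage sketch: construct a tiny transducer and save it.
#include <fst/fstlib.h>

int main() {
  fst::StdVectorFst fst;
  int s0 = fst.AddState();  // start state
  int s1 = fst.AddState();  // final state
  fst.SetStart(s0);
  // Arc with input label 1, output label 2, weight 0.5, destination s1.
  fst.AddArc(s0, fst::StdArc(1, 2, 0.5, s1));
  fst.SetFinal(s1, fst::TropicalWeight::One());
  fst.Write("example.fst");  // binary FST, readable by the OpenFst tools
  return 0;
}
```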
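
As a brief illustration of the linear algebra support, the following sketch multiplies two matrices with the Kaldi matrix library. The type and method names (Matrix, SetRandn, AddMatMat, kNoTrans) follow Kaldi's matrix/ directory, but exact headers and signatures may differ across versions, and a BLAS/LAPACK-linked build is assumed.

```cpp
// Sketch of the Kaldi matrix library: C = A * B via a wrapped BLAS call.
#include "matrix/kaldi-matrix.h"

int main() {
  using namespace kaldi;
  Matrix<BaseFloat> A(4, 8), B(8, 5), C(4, 5);
  A.SetRandn();  // fill with Gaussian random values
  B.SetRandn();
  // C = 1.0 * A * B + 0.0 * C; this dispatches to the BLAS gemm routine.
  C.AddMatMat(1.0, A, kNoTrans, B, kNoTrans, 0.0);
  return 0;
}
```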
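
To make the decoder interface concrete, here is a simplified sketch of the idea, not Kaldi's actual header: an abstract class that yields a log-likelihood for a (frame, input-symbol) pair, plus a toy implementation backed by a precomputed score matrix. The names mirror Kaldi's decodable-itf.h but are illustrative here; any score source (GMM, neural network, or otherwise) could stand behind the same interface.

```cpp
#include <vector>

typedef float BaseFloat;
typedef int int32;

// The interface the decoders consume: a score for a frame and an FST
// input symbol, plus enough information to know when the input ends.
class DecodableInterface {
 public:
  virtual BaseFloat LogLikelihood(int32 frame, int32 index) = 0;
  virtual bool IsLastFrame(int32 frame) const = 0;
  virtual int32 NumIndices() const = 0;
  virtual ~DecodableInterface() {}
};

// Toy implementation backed by a precomputed matrix of log-likelihoods.
class DecodableMatrix : public DecodableInterface {
 public:
  explicit DecodableMatrix(const std::vector<std::vector<BaseFloat> > &loglikes)
      : loglikes_(loglikes) {}
  virtual BaseFloat LogLikelihood(int32 frame, int32 index) {
    return loglikes_[frame][index - 1];  // input symbols are one-based; 0 is epsilon
  }
  virtual bool IsLastFrame(int32 frame) const {
    return frame == static_cast<int32>(loglikes_.size()) - 1;
  }
  virtual int32 NumIndices() const {
    return loglikes_.empty() ? 0 : static_cast<int32>(loglikes_[0].size());
  }
 private:
  const std::vector<std::vector<BaseFloat> > &loglikes_;
};
```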

Kaldi has an open and distributed development model, with a growing community of users and contributors. The original authors moderate contributions to the project.

FEATURES SUPPORTED IN KALDI

We intend Kaldi to support all commonly used techniques in speech recognition. The toolkit currently supports:

  • MFCC and PLP front-ends, with cepstral mean and variance normalization (CMVN), LDA, STC/MLLT, HLDA, VTLN, etc. (see the sketch after this list).
  • Modeling of context-dependent phones of arbitrary context lengths.
  • HMM/GMM acoustic models; phonetic decision trees.
  • Semi-continuous hidden Markov models [4].
  • Subspace Gaussian mixture models [2].
  • Speaker adaptation and adaptive training.
  • WFST-based decoders with lattice generation [3].
  • Lattice rescoring with acoustic and language models.
  • Discriminative training with MMI, boosted MMI (fMPE under development).
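
As an example of the front-end, here is a hedged sketch of computing MFCC features in C++. It assumes an API along the lines of Kaldi's feat/feature-mfcc.h; the MfccOptions and Mfcc names exist in Kaldi, but exact signatures vary across versions, and the silent one-second waveform is purely illustrative.

```cpp
#include "feat/feature-mfcc.h"
#include "matrix/kaldi-matrix.h"
#include "matrix/kaldi-vector.h"

int main() {
  using namespace kaldi;
  MfccOptions opts;
  opts.num_ceps = 13;                 // the standard 13 cepstral coefficients
  opts.frame_opts.samp_freq = 16000;  // 16 kHz audio
  Mfcc mfcc(opts);

  Vector<BaseFloat> waveform(16000);  // one second of (here, silent) audio
  Matrix<BaseFloat> features;
  mfcc.Compute(waveform, 1.0 /* no VTLN warp */, &features);
  // 'features' now holds one row of MFCCs per frame; CMVN, deltas, and
  // LDA/MLLT are applied downstream as further matrix transforms.
  return 0;
}
```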

There is currently no language modeling code, but we support converting ARPA-format LMs to FSTs. In the recipes released with Kaldi, we use the freely available IRSTLM toolkit; however, one could potentially use a more fully featured toolkit such as SRILM. Current strands of development include discriminative training with MPE, an interface for transparent computation on CPUs and GPUs, hybrid ANN/HMM systems, etc.

ACKNOWLEDGEMENTS

The contributors to the Kaldi project are: Gilles Boulianne, Lukas Burget, Arnab Ghoshal, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Navdeep Jaitly, Stefan Kombrink, Petr Motlicek, Daniel Povey, Yanmin Qian, Korbinian Riedhammer, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely, and Chao Weng.

We would like to thank Michael Riley, who visited us in Brno to deliver lectures on finite state transducers and helped us understand OpenFst; Henrique (Rico) Malvar of Microsoft Research for allowing the use of his FFT code; and Patrick Nguyen for help with WSJ recipes and introducing the participants in the JHU workshop of 2009. We would like to acknowledge the help with coding and documentation from Sandeep Boda and Sandeep Reddy (sponsored by Go-Vivace Inc.) and Haihua Xu. We thank Pavel Matejka (and Phonexia s.r.o.) for allowing the use of feature processing code.

We would like to acknowledge the support of Geoffrey Zweig and Alex Acero at Microsoft Research, and Dietrich Klakow at Saarland University. We are grateful to Jan (Honza) Cernocky for helping us organize the workshop at the Brno University of Technology during August 2010 and 2011. Thanks to Tomas Kasparek for system support and Renata Kohlova for administrative support.

Finally, we would like to acknowledge participants and collaborators in the 2009 Johns Hopkins University Workshop, including Mohit Agarwal, Pinar Akyazi, Martin Karafiat, Feng Kai, Ariya Rastrow, Richard C. Rose and Samuel Thomas; and faculty and staff at JHU for their help during that workshop, including Sanjeev Khudanpur, Desiree Cleves, and the late Fred Jelinek.

REFERENCES

[1] D. Povey, A. Ghoshal, et al., "The Kaldi Speech Recognition Toolkit," in IEEE ASRU, 2011. 
[2] D. Povey, L. Burget, et al., "The subspace Gaussian mixture model -- A structured model for speech recognition," Computer Speech & Language, vol. 25, no. 2, pp. 404-439, April 2011.
[3] D. Povey, M. Hannemann, et al., "Generating Exact Lattices in the WFST Framework," in IEEE ICASSP, 2012 (to appear). 
[4] K. Riedhammer, T. Bocklet, A. Ghoshal and D. Povey, "Revisiting Semi-Continuous Hidden Markov Models," in IEEE ICASSP, 2012 (to appear).

Kaldi page on SourceForge

Arnab Ghoshal is a Research Associate at The University of Edinburgh. Email: a.ghoshal@ed.ac.uk

Daniel Povey is an Associate Research Scientist at The Johns Hopkins University Human Language Technology Center of Excellence. Email: dpovey@gmail.com
