I borrowed the aishell recipe and dictionary and swapped in my own audio and transcription files. My corpus is fairly small, only 6000+ sentences. While running run.sh, language-model training failed with the following error:
Not creating raw N-gram counts ngrams.gz and heldout_ngrams.gz since they already exist in data/local/lm/3gram-mincount (remove them if you want them regenerated)
Iteration 1/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=0.825000 phi=2.000000
interpolate_ngrams: 2846 words in wordslist
discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.100000 phi=2.000000
interpolate_ngrams: 2846 words in wordslist
interpolate_ngrams: 2846 words in wordslist
compute_perplexity: for history-state "", no total-count % is seen (perhaps you didn't put the training n-grams through interpolate_ngrams?)
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.485000 phi=2.000000
real 0m0.007s
user 0m0.008s
sys 0m0.001s
compute_perplexity: for history-state "", no total-count % is seen (perhaps you didn't put the training n-grams through interpolate_ngrams?)
real 0m0.006s
user 0m0.008s
sys 0m0.000s
compute_perplexity: for history-state "", no total-count % is seen (perhaps you didn't put the training n-grams through interpolate_ngrams?)
real 0m0.011s
user 0m0.008s
sys 0m0.000s
Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/ml/kaldi/tools/kaldi_lm/optimize_alpha.pl line 23.
Digging into the problem, I found that aishell_train_lms.sh calls the kaldi-master/tools/kaldi_lm/train_lm.sh script to train the language model. That script notes that it expects a training corpus of at least 10k sentences, and its heldout_sent parameter defaults to 10000. I guessed that lowering heldout_sent would fix the error, but no matter how small I made the value, the same error kept appearing. I then searched the Google group for related issues.
Dan's answer: for a small corpus, set heldout_sent to about 1/10 of the total number of sentences. Changing the heldout_sent value at line 68 of aishell_train_lms.sh has no effect, because that line comes after exit 0 (line 63), so the rest of the script never executes. Instead, edit heldout_sent directly in tools/kaldi_lm/train_lm.sh, trying different values as needed. One thing to watch out for: after changing the value, you must delete all the generated files under the data directory before re-running run.sh (this is probably why my earlier attempts failed); otherwise the same error may keep appearing. Dan also recommended not using Kaldi's bundled tool for language-model training at all, but instead installing SRILM and training the language model with it, which gives better results.
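The fix above boils down to two shell steps: patch the default in train_lm.sh, then wipe the stale LM outputs. A minimal sketch, run here against a stand-in copy of the script rather than a real Kaldi checkout; the sed pattern and the temp-file setup are my own assumptions, and 600 is simply 1/10 of my ~6000-sentence corpus:

```shell
# Stand-in for tools/kaldi_lm/train_lm.sh with its heldout_sent default.
demo=$(mktemp -d)
printf 'heldout_sent=10000\n' > "$demo/train_lm.sh"

# Set heldout_sent to roughly 1/10 of the corpus size (6000 sentences -> 600).
sed -i 's/^heldout_sent=.*/heldout_sent=600/' "$demo/train_lm.sh"
grep '^heldout_sent=' "$demo/train_lm.sh"   # heldout_sent=600
```

In a real checkout you would run the same sed on tools/kaldi_lm/train_lm.sh, then remove the previously generated LM files (here that was data/local/lm/3gram-mincount, as the "Not creating raw N-gram counts ... since they already exist" message indicates) before re-running run.sh.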