Kaldi decoding: troubleshooting silence that gets recognized as "啊"

phoenix-bai

Last modified 2022-06-30 17:54:21

First published 2022-06-30 17:49:38

Article link: https://blog.csdn.net/weixin_40103562/article/details/125545279

label: 对二幺五嗯好好谢谢嗯
asr pred: Demo 啊对哎呀捂好谢谢
phone sequence: Demo a1_S sil d_B w E4 Y_E A1_B Y y A1_E w_B u3_E sil h_B a3 W_E x_B y E4 x y E4_E sil

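Check the acoustic model (AM) first

There are several ways to do this:

First, decode the audio and extract the 1-best phone sequence from the lattice to see whether it matches expectations.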

wav_file=test.wav
# Extract 40-dim fbank features for the utterance "Demo" and apply the global CMVN stats
echo "Demo $wav_file" | compute-fbank-feats --num-mel-bins=40 --sample-frequency=8000 scp:- ark,t:- | apply-cmvn-online asr_model_online/online_model/20211212/global_cmvn.stats ark,t:- ark,t:- > feat.txt
# Decode and write the lattice to test.lats
nnet3-latgen-faster-batch --num-threads=8 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=0 --extra-right-context-final=0 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=0.1 --allow-partial=true --word-symbol-table=graph/graph_zjtb/words.txt exp/exp_zjtb/tdnn_0730/final.mdl graph/graph_zjtb/HCLG.fst ark,t:feat.txt "ark:test.lats"
# Keep only the best path, convert it to a phone lattice, and print the phone sequence as symbols
lattice-1best --acoustic-scale=0.1 ark:test.lats ark:test_1best.lats
lattice-to-phone-lattice --replace-words=true exp/exp_zjtb/tdnn_0730/final.mdl ark:test_1best.lats ark:test_phones.lats
lattice-best-path ark:test_phones.lats 'ark,t:| utils/int2sym.pl -f 2- lang/lang_zjtb/phones.txt > test_phone_sequence.txt' ark:test.ali
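
To compare the phone sequence with the recognized words more easily, the phones can be grouped by their word-position suffixes. The following is a minimal sketch, assuming the suffix convention visible in the phone sequence at the top of this post (`_B` starts a word, `_E` ends it, `_S` is a single-phone word, unsuffixed phones are word-internal) and assuming `sil` is the only silence phone; the function name is hypothetical.

```python
def group_phones_into_words(line, silence_phones=("sil",)):
    """Split one '<utt_id> <phone> <phone> ...' line into per-word phone groups."""
    utt_id, *phones = line.split()
    groups, current = [], []
    for phone in phones:
        if phone in silence_phones or phone.endswith("_S"):
            # Silence or a single-phone word forms a group by itself.
            if current:  # flush an unterminated word, if any
                groups.append(current)
                current = []
            groups.append([phone])
        elif phone.endswith("_B"):
            # Word-begin phone: start a new group.
            if current:
                groups.append(current)
            current = [phone]
        elif phone.endswith("_E"):
            # Word-end phone: close the current group.
            current.append(phone)
            groups.append(current)
            current = []
        else:
            # Word-internal phone.
            current.append(phone)
    if current:
        groups.append(current)
    return utt_id, groups


line = ("Demo a1_S sil d_B w E4 Y_E A1_B Y y A1_E w_B u3_E sil "
        "h_B a3 W_E x_B y E4 x y E4_E sil")
utt_id, groups = group_phones_into_words(line)
for group in groups:
    print(" ".join(group))
```

For the sequence in this post, the first group is the single-phone word `a1_S`, immediately followed by `sil`, which is where the unexpected "啊" comes from.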

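Next, look at the frame-level predictions. Dump the per-frame network output with nnet3-compute, open the matrix in Excel, and use Home / Conditional Formatting / Color Scales to color the cells by probability. In each row, the most strongly colored cell (for example, red) marks the most probable pdf; its column number equals pdf_id + 1, because pdf_id is 0-based. Then look that pdf_id up in the show-transitions output (the `pdf = ...` field of the transition model) to find the corresponding phone and decide whether the AM is at fault.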
```
# Dump the frame-level network output: one row per frame, one column per pdf_id
nnet3-compute --apply-exp=true exp/exp_zjtb/tdnn_0730/final.mdl "ark,t:feat.txt" ark,t:- > test_frame_res.txt
show-transitions graph/graph_zjtb/phones.txt exp/exp_zjtb/tdnn_0730/final.mdl > test_occs.txt
```
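
The same lookup can be done without Excel. The sketch below assumes that `test_frame_res.txt` is a Kaldi text-format matrix archive (a `<utt_id>  [` header, one row of numbers per frame, and a closing `]` on the last row), which is what `ark,t:` writes; the function names are hypothetical.

```python
def read_text_matrices(lines):
    """Yield (utt_id, rows) from a Kaldi text-format matrix archive (ark,t)."""
    utt_id, rows = None, []
    for raw in lines:
        tokens = raw.split()
        if not tokens:
            continue
        if utt_id is None:
            # Header line of the form "<utt_id>  [".
            utt_id, tokens = tokens[0], tokens[1:]
            if tokens and tokens[0] == "[":
                tokens = tokens[1:]
        closing = bool(tokens) and tokens[-1] == "]"
        if closing:
            tokens = tokens[:-1]
        if tokens:
            rows.append([float(t) for t in tokens])
        if closing:
            yield utt_id, rows
            utt_id, rows = None, []


def best_pdf_per_frame(rows):
    """Return the 0-based pdf_id with the highest score for each frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


sample = ["Demo  [", "  0.1 0.7 0.2", "  0.6 0.3 0.1", "  0.2 0.2 0.6 ]"]
matrices = list(read_text_matrices(sample))
best = best_pdf_per_frame(matrices[0][1])
print(matrices[0][0], best)
```

To run it on the real output, pass `open("test_frame_res.txt")` instead of `sample`. The returned indices are already 0-based pdf_ids, so no "+1" column adjustment is needed.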

Because this audio happens to be part of the training set, we can also check whether its alignment during AM training was wrong, for example:

$ cd exp/exp_zjtb/tri4b_ali_0730
$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- > ali.1.txt
$ grep "0000bac252011e49a2949a6167d8bce9_r3.wav" ali.1.txt > test.ali

# Result: utt_id followed by the per-frame pdf_id sequence
0000bac252011e49a2949a6167d8bce9_r3.wav 0 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 178 2887 2887 2887 2504 2504 2504 2177 2177 339 477 784 1986 1986 1986 1986 1543 1543 1543 1543 1543 1543 1543 1543 1543 1543 1543 3449 3449 1233 1233 1275 354 354 354 354 354 354 1388 1388 1388 1388 1388 828 877 877 877 877 2088 2088 2088 1992 1992 2669 2196 2196 1982 1982 2385 2385 1469 1469 1469 1469 1750 1750 1516 490 490 490 490 490 490 490 490 490 490 490 1719 1719 1719 608 608 1217 1217 1217 1217 720 720 720 720 720 720 720 720 720 720 720 720 720 153 153 153 153 153 153 153 153 153 153 153 153 0 0 0 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 179 179 179 179 179 179 179 179 179 179 179 179 179 179 179 179 178 178 178 178 178 178 178 178 178 178 178 178 178 178 178 178 178 178 225 225 225 225 381 572 55 55 1545 210 43 43 43 43 43 43 43 43 269 269 988 988 988 988 2878 2878 2878 534 534 534 534 406 406 3422 2457 2426 2426 1592 1592 1592 1592 2364 2364 3077 3077 3077 3583 3583 2302 2783 1870 1386 779 779 22 537 3426 3426 1884 1884 1884 1884 895 895 895 469 469 2231 2231 2736 2736 1548 1533 1533 1533 1533 1390 1390 1390 1390 835 835 835 835 835 835 835 835 0 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 177 178 178 178 178 178 178 178 178
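# Look up each pdf_id in the show-transitions output to find its phone and check whether the alignment is wrong. Run show-transitions on the same final.mdl that produced the alignment, since pdf_ids are specific to a model.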
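
A long alignment line like the one above is easier to read once consecutive identical pdf_ids are collapsed into runs. This is a minimal sketch; the function name is hypothetical, and converting frames to seconds assumes Kaldi's default 10 ms frame shift, which this post does not state.

```python
from itertools import groupby


def run_lengths(ali_line):
    """Collapse '<utt_id> <pdf_id> ...' into (pdf_id, start_frame, num_frames) runs."""
    utt_id, *ids = ali_line.split()
    segments, start = [], 0
    for pdf_id, run in groupby(int(x) for x in ids):
        n = sum(1 for _ in run)
        segments.append((pdf_id, start, n))
        start += n
    return utt_id, segments


FRAME_SHIFT_S = 0.01  # assumption: default 10 ms frame shift

utt_id, segments = run_lengths("utt1 0 177 177 177 178 2887 2887")
for pdf_id, start, n in segments:
    print(f"pdf {pdf_id}: frames {start}-{start + n - 1} ({n * FRAME_SHIFT_S:.2f} s)")
```

Applied to `test.ali`, long runs of the same pdf_id (such as 177 at the start of the utterance above) stand out immediately and can then be checked against the show-transitions output.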

The relevant commands in detail:

● Extract the audio features and apply CMVN:

echo "Demo $wav_file" | compute-fbank-feats --num-mel-bins=40 --sample-frequency=8000 scp:- ark,t:- | apply-cmvn-online asr_model_online/online_model/20211212/global_cmvn.stats ark,t:- ark,t:- > feat.txt

● Decode the audio with nnet3-latgen-faster / nnet3-latgen-faster-batch, which produces a lattice along with the recognized text, then visualize the lattice with show_lattice.sh:

nnet3-latgen-faster --frames-per-chunk=50 --extra-left-context=10 --extra-right-context=0 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=0.1 --allow-partial=true --word-symbol-table=graph/graph_zjtb/words.txt exp/exp_zjtb/tdnn_0730/final.mdl graph/graph_zjtb/HCLG.fst ark,t:feat.txt "ark:test.lats"
# Demo is the utterance ID; it must match the utt_id stored in the lattice archive.
# show_lattice.sh reads the lattice through gunzip, so the archive passed to it must be gzipped.
./utils/show_lattice.sh --mode save Demo asr/lattice_test/test.lats.gz asr/lattice_test/words.txt

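● Use nnet3-compute to dump the frame-level predictions, i.e. the probabilities of all pdfs for every frame (pdf_ids are 0-based). Open the whole matrix in Excel and apply a color scale to the values: the pdf with the highest probability in a row is the prediction for that frame.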

# Dump the frame-level network output: one row per frame, one column per pdf_id
nnet3-compute --apply-exp=true exp/exp_zjtb/tdnn_0730/final.mdl "ark,t:feat.txt" ark,t:- > test_frame_res.txt
show-transitions graph/graph_zjtb/phones.txt exp/exp_zjtb/tdnn_0730/final.mdl > test_occs.txt

● Use show-transitions to print the HMM transition model in human-readable form:

show-transitions graph/graph_zjtb/phones.txt exp/exp_zjtb/tdnn_0730/final.mdl > test_occs.txt

Sample output; the value after `pdf =` is the pdf_id:

Transition-state 1744: phone = d_B hmm-state = 2 pdf = 3184
Transition-id = 3487 p = 0.01 [self-loop]
Transition-id = 3488 p = 0.99 [2 -> 3]
Transition-state 1745: phone = d_B hmm-state = 2 pdf = 3351
Transition-id = 3489 p = 0.311331 [self-loop]
Transition-id = 3490 p = 0.688669 [2 -> 3]
Transition-state 1746: phone = e1 hmm-state = 0 pdf = 71
Transition-id = 3491 p = 0.01 [self-loop]
Transition-id = 3492 p = 0.990001 [0 -> 1]
Transition-state 1747: phone = e1 hmm-state = 0 pdf = 1136
Transition-id = 3493 p = 0.01 [self-loop]
Transition-id = 3494 p = 0.99 [0 -> 1]
Transition-state 1748: phone = e1 hmm-state = 0 pdf = 1675
Transition-id = 3495 p = 0.0854856 [self-loop]
Transition-id = 3496 p = 0.914514 [0 -> 1]
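
To avoid searching this file by hand, the pdf_id-to-phone mapping can be built once. The sketch below only parses `Transition-state` lines in the format shown above (`phone = ... hmm-state = ... pdf = ...`); the function name is hypothetical.

```python
import re
from collections import defaultdict

STATE_RE = re.compile(
    r"Transition-state\s+\d+:\s+phone\s*=\s*(\S+)\s+hmm-state\s*=\s*(\d+)\s+pdf\s*=\s*(\d+)"
)


def pdf_to_phones(lines):
    """Map each pdf_id to the set of (phone, hmm_state) pairs that use it."""
    mapping = defaultdict(set)
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            phone, hmm_state, pdf_id = match.group(1), int(match.group(2)), int(match.group(3))
            mapping[pdf_id].add((phone, hmm_state))
    return dict(mapping)


sample = [
    "Transition-state 1744: phone = d_B hmm-state = 2 pdf = 3184",
    "Transition-id = 3487 p = 0.01 [self-loop]",
    "Transition-id = 3488 p = 0.99 [2 -> 3]",
    "Transition-state 1746: phone = e1 hmm-state = 0 pdf = 71",
]
table = pdf_to_phones(sample)
print(table)
```

For the real file, pass `open("test_occs.txt")` instead of `sample`. Because of state tying, one pdf_id may map to several phones, which is why each value is a set.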

Check how the LM scores the competing sentences:

● Background on the n-gram language models that SRILM builds: https://web.stanford.edu/~jurafsky/slp3/3.pdf
● The higher the probability, the lower the perplexity.

ngram -lm zhijian_lm.gz -ppl test.txt -debug 2

# test.txt contains:
啊对哎呀捂好谢谢
对哎呀捂好谢谢

Output:

reading 890937 1-grams
reading 14733024 2-grams
reading 13701101 3-grams
啊对哎呀捂好谢谢
p( 啊 | <s> ) = [2gram] 0.1468152 [ -0.8332291 ]
p( 对 | 啊 ...) = [3gram] 0.06473259 [ -1.188877 ]
p( 哎呀 | 对 ...) = [3gram] 0.0001231799 [ -3.90946 ]
p( 捂 | 哎呀 ...) = [2gram] 3.734564e-05 [ -4.42776 ]
p( 好 | 捂 ...) = [2gram] 0.04002296 [ -1.397691 ]
p( 谢谢 | 好 ...) = [2gram] 0.04092193 [ -1.388044 ]
p( </s> | 谢谢 ...) = [3gram] 0.1421139 [ -0.8473634 ]
1 sentences, 6 words, 0 OOVs
0 zeroprobs, logprob= -13.99242 ppl= 99.75112 ppl1= 214.818

对哎呀捂好谢谢
p( 对 | <s> ) = [2gram] 0.02417497 [ -1.616634 ]
p( 哎呀 | 对 ...) = [3gram] 0.0001068959 [ -3.971039 ]
p( 捂 | 哎呀 ...) = [2gram] 3.734564e-05 [ -4.42776 ]
p( 好 | 捂 ...) = [2gram] 0.04002296 [ -1.397691 ]
p( 谢谢 | 好 ...) = [2gram] 0.04092193 [ -1.388044 ]
p( </s> | 谢谢 ...) = [3gram] 0.1421139 [ -0.8473634 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -13.64853 ppl= 188.2588 ppl1= 536.6687

file test.txt: 2 sentences, 11 words, 0 OOVs
0 zeroprobs, logprob= -27.64096 ppl= 133.7295 ppl1= 325.6973
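
The `ppl` and `ppl1` figures can be reproduced from `logprob` (a base-10 log probability): `ppl` normalizes by words plus end-of-sentence tokens, `ppl1` by words only, with OOVs and zeroprobs excluded from the word count. The sketch below recomputes the numbers above; the function name is hypothetical.

```python
def srilm_perplexities(logprob10, words, sentences, oovs=0, zeroprobs=0):
    """Recompute the ppl and ppl1 values printed by `ngram -ppl`."""
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob10 / (scored + sentences))   # counts </s> tokens
    ppl1 = 10 ** (-logprob10 / scored)                # words only
    return ppl, ppl1


with_a = srilm_perplexities(-13.99242, words=6, sentences=1)     # 啊对哎呀捂好谢谢
without_a = srilm_perplexities(-13.64853, words=5, sentences=1)  # 对哎呀捂好谢谢
gap = -13.99242 - (-13.64853)  # total log10 cost of the extra "啊"

print(with_a, without_a, round(gap, 5))
```

Note that the sentence starting with "啊" has the lower perplexity, but its total log probability is only about 0.34 (log10) lower than that of the sentence without it, so the LM charges very little for inserting the extra word.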

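Conclusion

The extra "啊" is introduced by the LM: the n-gram training data does contain a very large number of sentences that begin with 啊, and in the scores above the sentence with the leading 啊 is only about 0.34 (log10) less probable than the one without it. The next step is to focus on improving the language model:
● Compute how often each word occurs at the start of a sentence. For sentences that start with an interjection or filler (啊, 嗯, 哦, 呃, 我, etc.), keep the leading word in 30% of them and strip it from the rest.
● Clean out meaningless English tokens in which the same letter repeats many times in a row.
● Run several rounds of WER evaluation to check the effect.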
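
The first two cleanup steps could be sketched as follows. This is an illustration, not the script used for this post: it assumes the LM corpus is already word-segmented into token lists, picks which sentences keep their leading filler at random, and treats three or more identical letters in a row as "meaningless"; all function names and thresholds are hypothetical.

```python
import random
import re
from collections import Counter

# Assumption: 3+ identical consecutive letters marks a junk English token.
REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")


def first_word_counts(sentences):
    """Count how often each word appears at the start of a sentence."""
    return Counter(words[0] for words in sentences if words)


def thin_leading_fillers(sentences, fillers, keep_ratio=0.3, rng=None):
    """Keep the leading filler in roughly keep_ratio of sentences, strip it elsewhere."""
    if rng is None:
        rng = random.Random(0)
    result = []
    for words in sentences:
        # Never strip the only word of a sentence.
        if len(words) > 1 and words[0] in fillers and rng.random() >= keep_ratio:
            words = words[1:]
        result.append(list(words))
    return result


def drop_repeated_letter_tokens(words):
    """Remove ASCII-letter tokens that contain a long run of one letter."""
    return [
        w for w in words
        if not (w.isascii() and w.isalpha() and REPEATED_LETTER.search(w))
    ]


corpus = [["啊", "对"], ["啊", "好", "谢谢"], ["对", "哎呀"], ["嗯", "好"]]
fillers = {"啊", "嗯", "哦", "呃", "我"}
print(first_word_counts(corpus).most_common())
print(thin_leading_fillers(corpus, fillers, keep_ratio=0.3))
print(drop_repeated_letter_tokens(["hello", "aaaa", "谢谢", "hmmmm"]))
```

After rebuilding the LM from the cleaned corpus, the `ngram -ppl` comparison above and the WER evaluation can be rerun to see whether the leading "啊" still wins.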