Speech recognition with PaddleSpeech: a walkthrough of the pipeline

lalahappy

Last modified: 2024-05-15 09:57:42


Tags: speech recognition, artificial intelligence

First published: 2024-05-13 16:36:18

Article link: https://blog.csdn.net/qq_42563807/article/details/138806966


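The last time I looked into speech recognition was at the end of 2021. As I recall, I first used an ASR model in an application and then retrained the model. Two years on, I had forgotten most of the ASR workflow, so this time I walked through the pipeline using the PaddleSpeech code. I am still studying some of the details; these are my notes so far. The language-to-model mapping is: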

'zh:[conformer_wenetspeech-zh-16k], '
'en:[transformer_librispeech-en-16k], '
'zh_en:[conformer_talcs-codeswitch_zh_en-16k]'

This test uses the Chinese, non-streaming model: model = conformer_wenetspeech

For speech recognition, the input can be a .wav file and the output is the corresponding Chinese text.

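For the model used in this test, the code can be roughly divided into three parts:

Init model and other resources from a specific path;
Preprocess the input .wav: wav -> vector/tensor;
Run prediction and output the result.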

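The second part mainly involves reading the file and extracting features.

Relevant keywords include:
reading the .wav file, waveform transforms, MFCC, pcm16 -> pcm32, fbank, and so on.

Libraries involved: soundfile, librosa, python_speech_features, and others.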
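As a rough illustration of the preprocessing step, here is a minimal standard-library sketch. It assumes that "pcm16 -> pcm32" means scaling 16-bit integer samples to floats in [-1, 1), and that fbank features use a 25 ms window with a 10 ms hop at 16 kHz. These are common defaults, not values confirmed from the PaddleSpeech config, and the function names are hypothetical.

```python
import struct


def pcm16_bytes_to_float(raw: bytes) -> list:
    """Decode little-endian 16-bit PCM bytes and scale to floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    return [s / 32768.0 for s in samples]


def num_frames(num_samples: int, sample_rate: int = 16000,
               win_ms: int = 25, hop_ms: int = 10) -> int:
    """Number of full analysis windows that fit into the signal (assumed framing)."""
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


# Four hand-made samples stand in for the payload of a real .wav file.
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
floats = pcm16_bytes_to_float(raw)
frames = num_frames(79920)  # about 5 seconds of 16 kHz audio
```

Under these assumptions, a clip of roughly 5 seconds gives 498 frames, which is consistent with the 498 time steps of the [1, 498, 80] encoder input shape in this post (80 being the feature dimension).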

The third part can be described in three steps:

# paddlespeech.s2t.models.u2, line 876: U2Model(U2DecodeModel)
def _init_from_config(cls, configs: dict):
    """
    Init sub module for model.
    Returns:
        vocab size(int), encoder(nn.Layer), decoder(nn.Layer), ctc(nn.Layer)
    """
    # U2 Encoder type: conformer -> ConformerEncoder
    #     (paddlespeech.s2t.modules.encoder.py)
    # U2 Decoder type: bitransformer -> BiTransformerDecoder (error)
    # U2 Decoder type: transformer
    #     (paddlespeech.s2t.modules.decoder.py)
    # CTC decoder and CTC loss -> CTCDecoderBase
    #     (paddlespeech.s2t.modules.ctc.py)

Step 1: the code calls the Conformer encoder to encode the features:
Input: (batch, max_len, feat_dim), here [1, 498, 80]
Output: (B, maxlen, encoder_dim), here [1, 123, 512]
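The drop from 498 input frames to 123 encoder frames matches a 4x convolutional subsampling front end. The sketch below assumes two stacked convolutions with kernel size 3, stride 2 and no padding along the time axis. That is a typical Conformer setup, but it is an assumption here and was not checked against the model config.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2) -> int:
    """Output length of an unpadded 1-D convolution along the time axis."""
    return (length - kernel) // stride + 1


def subsample4(length: int) -> int:
    """Two stride-2 convolutions in a row give roughly a 4x reduction."""
    return conv_out_len(conv_out_len(length))


after_first_conv = conv_out_len(498)   # 248
encoder_len = subsample4(498)          # 123
```

With these assumed layer settings, the arithmetic reproduces the 123 time steps of the encoder output shape.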

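Step 2: CTCDecoderBase plus CTC prefix beam search is applied to encoder-out and produces beam_size candidate results; beam_size is set to 10 in this program.
Input: (B, maxlen, encoder_dim), here [1, 123, 512]
Output: a list of length beam_size; each item holds one candidate token sequence and its score.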

[
    ((1719, 4412, 66, 4641, 2397, 2139, 4935, 4381, 3184, 1286, 2084, 3642,
      1719, 1411, 2180, 98, 4698, 205, 309, 1458), -0.0025442275918039605),
    ((1719, 4412, 66, 4641, 2397, 2139, 4935, 4381, 3184, 1286, 2084, 3642,
      1719, 1411, 2180, 4698, 205, 309, 1458), -7.808644069258369),
    ----
]

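Step 3: the TransformerDecoder makes the final correction and prediction. Its inputs are the encoder-out from step 1 and the candidate results from step 2. In the output below, the recognized sentence means roughly "I think the most important thing about running is that it has brought me good health".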

(
	['我认为跑步最重要的就是给我带来了身体健康'], 
    [(1719, 4412, 66, 4641, 2397, 2139, 4935, 4381, 3184, 1286, 2084, 3642, 
      1719, 1411, 2180, 98, 4698, 205, 309, 1458)]
)
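One plausible way step 3 could pick the final hypothesis is attention rescoring: the decoder assigns each CTC candidate a log-likelihood, which is added to the weighted CTC score. The sketch below rests on that assumption. The decoder scores, the ctc_weight value and the rescore function are hypothetical; only the two CTC scores come from the step 2 output, and the token sequences are truncated for brevity.

```python
def rescore(hyps, decoder_scores, ctc_weight=0.5):
    """Return (index, score) of the hypothesis with the best combined score."""
    best_index, best_score = -1, -float("inf")
    for i, ((tokens, ctc_score), dec_score) in enumerate(zip(hyps, decoder_scores)):
        score = dec_score + ctc_weight * ctc_score
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# CTC scores are taken from the step 2 output; token tuples are truncated.
hyps = [
    ((1719, 4412, 66), -0.0025442275918039605),
    ((1719, 4412), -7.808644069258369),
]
decoder_scores = [-1.2, -3.4]  # hypothetical decoder log-likelihoods
best_index, best_score = rescore(hyps, decoder_scores)
```

Under these made-up decoder scores the first candidate wins, which agrees with the final output shown above, where the top CTC candidate is returned unchanged.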

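More on step 2, CTCDecoderBase + CTC prefix beam search:

For CTCDecoderBase:
Input: (B, maxlen, encoder_dim), here [1, 123, 512]

 ctc_probs = self.ctc.log_softmax(encoder_out)

Output: (1, maxlen, vocab_size), here [1, 123, 5537]

encoder_out is passed through a linear layer whose output has shape [1, maxlen, vocab_size], and a log-softmax is then applied, giving a log-probability distribution over the vocabulary for every frame.

Prefix beam search is then run on this output, yielding a list of length beam_size in which each item holds one candidate token sequence and its score.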
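To make the linear projection plus log-softmax concrete, here is a toy pure-Python sketch for a single frame. The dimensions (encoder_dim = 2, vocab_size = 3) and the weights are made up for illustration; the real model uses encoder_dim = 512 and vocab_size = 5537, and its CTC layer is not reproduced here.

```python
import math


def linear(x, weight, bias):
    """y[j] = sum_i x[i] * weight[i][j] + bias[j]."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the log-sum-exp of the logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


frame = [1.0, -2.0]                              # one encoder output frame (toy)
weight = [[0.5, 1.0, -1.0], [0.25, 0.0, 2.0]]    # shape (encoder_dim, vocab_size)
bias = [0.0, 0.1, -0.1]
log_probs = log_softmax(linear(frame, weight, bias))
best_token = max(range(len(log_probs)), key=lambda j: log_probs[j])
```

Each entry of log_probs is at most 0, and exponentiating the entries gives probabilities that sum to 1. This is the per-frame vector that prefix beam search reads as logp.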

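About prefix beam search:
Initialize cur_hyps = [((), (0.0, -inf))]. The two values are log-probabilities:
blank_ending_score, none_blank_ending_score

blank_ending_score (score of the prefix when it ends in blank):
The score of the prefix over alignments that currently end in the blank symbol (the special CTC token that emits no character and separates repeated characters). Later steps use it to handle repeated characters and the start of a new character.

none_blank_ending_score (score of the prefix when it ends in non-blank):
The score of the prefix over alignments that currently end in a non-blank symbol (an actual vocabulary token). The total score of a prefix combines both values with a log-sum.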

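① After step t-1 there are beam_size candidates (cur_hyps); each candidate is a prefix decoded from the first t-1 frames.

② At the current step t there are vocab_size probabilities over the vocabulary; the beam_size largest are selected.

③ For the beam_size tokens selected at step t, first create next_hyps, a dict whose missing entries default to (-inf, -inf). Then, for each token s, loop over the beam_size candidates; that is, for every prefix already present in cur_hyps, do the following:

If s is blank: keep the prefix as it is and update its probability to account for the appended blank, i.e. update n_pb;
If s equals the last token of the prefix (the current token repeats the previous one):
…
If s differs from the last token of the prefix: create a new prefix (append s to the current prefix) and update n_pnb.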

from collections import defaultdict

# ctc_probs, maxlen, beam_size, blank_id and log_add come from the surrounding decoder code
cur_hyps = [(tuple(), (0.0, -float('inf')))]

for t in range(0, maxlen):
    logp = ctc_probs[t]  # (vocab_size,)

    # key: prefix, value (pb, pnb), default value (-inf, -inf)
    next_hyps = defaultdict(lambda: (-float('inf'), -float('inf')))

    # keep only the beam_size most likely tokens of this frame
    top_k_logp, top_k_index = logp.topk(beam_size)

    for s in top_k_index:
        s = s.item()
        ps = logp[s].item()

        for prefix, (pb, pnb) in cur_hyps:
            last = prefix[-1] if len(prefix) > 0 else None

            if s == blank_id:  # blank: prefix unchanged, only pb is updated
                n_pb, n_pnb = next_hyps[prefix]
                n_pb = log_add([n_pb, pb + ps, pnb + ps])
                next_hyps[prefix] = (n_pb, n_pnb)

            elif s == last:
                # Update *ss -> *s;
                n_pb, n_pnb = next_hyps[prefix]
                n_pnb = log_add([n_pnb, pnb + ps])
                next_hyps[prefix] = (n_pb, n_pnb)

                # Update *s-s -> *ss, - is for blank
                n_prefix = prefix + (s,)
                n_pb, n_pnb = next_hyps[n_prefix]
                n_pnb = log_add([n_pnb, pb + ps])
                next_hyps[n_prefix] = (n_pb, n_pnb)

            else:
                # new token: () -> (s1,) -> (s1, s2)
                n_prefix = prefix + (s,)
                n_pb, n_pnb = next_hyps[n_prefix]
                n_pnb = log_add([n_pnb, pb + ps, pnb + ps])
                next_hyps[n_prefix] = (n_pb, n_pnb)

    # rank prefixes by total score log_add([pb, pnb]) and keep the best beam_size
    next_hyps = sorted(
        next_hyps.items(),
        key=lambda x: log_add(list(x[1])),
        reverse=True
    )

    cur_hyps = next_hyps[:beam_size]
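The loop above depends on tensors and helpers from the decoder. As a self-contained sketch, the same update rules can be run on a toy three-frame distribution in plain Python. The log_add below is a stand-in written for this example (a log-sum-exp over a list), and the vocabulary {0: blank, 1: "a", 2: "b"} and its probabilities are made up.

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")


def log_add(args):
    """log(sum(exp(a) for a in args)), returning -inf if every term is -inf."""
    if all(a == NEG_INF for a in args):
        return NEG_INF
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))


def ctc_prefix_beam_search(log_probs, beam_size, blank_id=0):
    cur_hyps = [(tuple(), (0.0, NEG_INF))]
    for logp in log_probs:
        next_hyps = defaultdict(lambda: (NEG_INF, NEG_INF))
        # indices of the beam_size most likely tokens in this frame
        top_k = sorted(range(len(logp)), key=lambda i: logp[i], reverse=True)[:beam_size]
        for s in top_k:
            ps = logp[s]
            for prefix, (pb, pnb) in cur_hyps:
                last = prefix[-1] if prefix else None
                if s == blank_id:
                    n_pb, n_pnb = next_hyps[prefix]
                    next_hyps[prefix] = (log_add([n_pb, pb + ps, pnb + ps]), n_pnb)
                elif s == last:
                    # repeat without a blank in between: prefix unchanged
                    n_pb, n_pnb = next_hyps[prefix]
                    next_hyps[prefix] = (n_pb, log_add([n_pnb, pnb + ps]))
                    # repeat after a blank: prefix grows by one token
                    n_prefix = prefix + (s,)
                    n_pb, n_pnb = next_hyps[n_prefix]
                    next_hyps[n_prefix] = (n_pb, log_add([n_pnb, pb + ps]))
                else:
                    n_prefix = prefix + (s,)
                    n_pb, n_pnb = next_hyps[n_prefix]
                    next_hyps[n_prefix] = (n_pb, log_add([n_pnb, pb + ps, pnb + ps]))
        ranked = sorted(next_hyps.items(), key=lambda kv: log_add(list(kv[1])), reverse=True)
        cur_hyps = ranked[:beam_size]
    return [(prefix, log_add([pb, pnb])) for prefix, (pb, pnb) in cur_hyps]


probs = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
results = ctc_prefix_beam_search(log_probs, beam_size=10)
best_prefix, best_score = results[0]
```

In this toy case nothing is pruned before the last frame, so the top score equals the exact sum over all alignments that collapse to (1, 2): 0.168 + 0.336 + 0.056 + 0.008 + 0.021 = 0.589.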

Step 1: the 10 candidate tokens selected are:
[0, 1719, 847 , 4850, 4764, 1265, 782 , 1076, 216 , 2084]
After the loop, cur_hyps is updated to:

[((), (-2.4914430468925275e-05, -inf)), 
((1719,), (-inf, -12.919618606567383)), 
((847,), (-inf, -13.054508209228516)), 
((4850,), (-inf, -13.208122253417969)), 
((4764,), (-inf, -13.351343154907227)), 
((1265,), (-inf, -13.604446411132812)), 
((782,), (-inf, -13.606643676757812)), 
((1076,), (-inf, -13.751394271850586)), 
((216,), (-inf, -13.80009651184082)), 
((2084,), (-inf, -14.129714965820312))]

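Step 2: the 10 candidate tokens selected are:
[0, 3184, 29 , 98 , 337 , 1719, 216 , 37 , 72 , 2084]

After step 2, len(next_hyps) = 97 rather than 10*10, because different (token, prefix) pairs can produce the same prefix. For example, step-2-1-10 (blank appended to (2084,)) and step-2-10-1 (2084 appended to ()) both give (2084,), so the existing entry's probability is updated instead of a new entry being created; the same applies to the other overlaps.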

[
(), (1719,), (847,), (4850,), (4764,), (1265,), (782,), (1076,), (216,), (2084,), 

(3184,), (1719, 3184), (847, 3184),  (4850, 3184), (4764, 3184), (1265, 3184), (782, 3184), (1076, 3184), (216, 3184), (2084, 3184),

(29,), (1719, 29), (847, 29), (4850, 29), (4764, 29), (1265, 29), (782, 29), (1076, 29), (216, 29), (2084, 29),

(98,), (1719, 98), (847, 98), (4850, 98), (4764, 98), (1265, 98), (782, 98), (1076, 98), (216, 98), (2084, 98),

..........

(2084,), (1719, 2084), (847, 2084), (4850, 2084), (4764, 2084), 
(1265, 2084), (782, 2084), (1076, 2084), (216, 2084), (2084, 2084)
]
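The count of 97 can be checked by replaying only the prefix bookkeeping of the loop, without any probabilities. The sketch below uses the step 1 prefixes and the step 2 candidate tokens listed above, with blank_id = 0.

```python
prev_prefixes = [(), (1719,), (847,), (4850,), (4764,), (1265,),
                 (782,), (1076,), (216,), (2084,)]
tokens = [0, 3184, 29, 98, 337, 1719, 216, 37, 72, 2084]
blank_id = 0

keys = set()
for s in tokens:
    for prefix in prev_prefixes:
        last = prefix[-1] if prefix else None
        if s == blank_id:
            keys.add(prefix)            # blank keeps the prefix
        elif s == last:
            keys.add(prefix)            # repeat merged into the same prefix
            keys.add(prefix + (s,))     # repeat after a blank extends it
        else:
            keys.add(prefix + (s,))     # new token extends the prefix

num_keys = len(keys)
```

The blank contributes the 10 existing prefixes, the six tokens not seen in step 1 contribute 10 new prefixes each, and each of 1719, 216 and 2084 contributes 9 new prefixes because (s,) already exists, giving 10 + 60 + 27 = 97.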

After the loop, cur_hyps is updated to:

[((), (-0.00012599880028574262, -inf)), 
((3184,), (-inf, -11.88964664904961)), 
((29,), (-inf, -11.929930805848926)), 
((1719,), (-12.9197196909372, -12.443134838204823)), 
((98,), (-inf, -12.162696003602832)), 
((216,), (-13.800197596210637, -12.56428247431168)), 
((337,), (-inf, -12.326809048341602)), 
((37,), (-inf, -12.565277218507617)), 
((2084,), (-14.12981605019013, -12.851693495566257)), 
((72,), (-inf, -12.808426022218555))]

Step 3: the 10 candidate tokens selected are:
[0, 3184, 29 , 98 , 337 , 1719, 216 , 37 , 72 , 2084]

step-3-1, s = 0: n_pb is updated with n_pb = log_add([n_pb, pb + ps, pnb + ps]).

After looping over the 10 candidates, next_hyps is:

{
 (): (-0.00021540177294809837, -inf),
 (3184,): (-11.889736052022272, -inf),
 (29,): (-11.930020208821588, -inf),
 (1719,): (-11.960242542061046, -inf),
 (98,): (-12.162785406575495, -inf),
 (216,): (-12.309288876340837, -inf),
 (337,): (-12.326898451314264, -inf),
 (37,): (-12.56536662148028, -inf),
 (2084,): (-12.60604861765945, -inf),
 (72,): (-12.808515425191217, -inf)
}

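probabilityWithBlank: the probability that the last symbol is blank;
probabilityNoBlank: the probability that the last symbol is not blank.

In other words, the pb and pnb from cur_hyps, together with the probability of s itself, are used to update the probability that the symbol at time t is blank. Blank frames do occur among the maxlen frames, but what is being built here is the final output sequence: appending a blank leaves the output prefix unchanged, so only pb needs to change and pnb is left alone.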
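The blank update can be replayed with the numbers printed above. In this sketch, the blank log-probability ps is not shown in the post; it is recovered from the () entry, whose pnb is -inf, so its new pb is simply pb + ps. The log_add helper is a stand-in written for this example.

```python
import math

NEG_INF = -float("inf")


def log_add(args):
    """log(sum(exp(a) for a in args)), returning -inf if every term is -inf."""
    if all(a == NEG_INF for a in args):
        return NEG_INF
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))


def blank_update(pb, pnb, ps):
    """New pb of an unchanged prefix; n_pb starts at the default -inf."""
    n_pb = NEG_INF
    return log_add([n_pb, pb + ps, pnb + ps])


# log p(blank) at step 3, derived from the () entry before and after the update
ps = -0.00021540177294809837 - (-0.00012599880028574262)

# (pb, pnb) values of two prefixes in cur_hyps after step 2
pb_3184 = blank_update(NEG_INF, -11.88964664904961, ps)
pb_1719 = blank_update(-12.9197196909372, -12.443134838204823, ps)
```

Both values agree with the next_hyps printed for step-3-1: about -11.8897 for (3184,) and about -11.9602 for (1719,). For (1719,) both pb and pnb contribute, which is why its new pb is higher than either input.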

step-3-2, s = 3184

step-3-2-1: n_prefix = prefix + (s,), i.e. n_prefix = (3184,). For the prefix (), the path becomes (3184,), so n_pnb is updated.

Using n_pnb = log_add([n_pnb, pb + ps, pnb + ps]), n_pnb becomes:

(3184,): (-11.889736052022272, -12.013047332113274)

step-3-2-2, s == last:

Prefix unchanged, which counts the token repeating across consecutive frames:
n_pnb = log_add([n_pnb, pnb + ps]) updates n_pnb:
(3184,): (-11.889736052022272, -12.013040470198828)
Prefix extended, which counts as a new path:
n_pnb = log_add([n_pnb, pb + ps]) updates n_pnb:
(3184, 3184): (-inf, -inf)
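The two updates of the repeated-token case can also be replayed with the printed numbers. In this sketch, the log-probability of token 3184 is not shown in the post; it is recovered from step-3-2-1, where the prefix () has pnb = -inf, so the new n_pnb of (3184,) is pb + ps. The log_add helper is again a stand-in.

```python
import math

NEG_INF = -float("inf")


def log_add(args):
    """log(sum(exp(a) for a in args)), returning -inf if every term is -inf."""
    if all(a == NEG_INF for a in args):
        return NEG_INF
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))


# log p(3184) at step 3, derived from step-3-2-1 (prefix () with pb = -0.000126)
ps = -12.013047332113274 - (-0.00012599880028574262)

# prefix (3184,) in cur_hyps after step 2
pb, pnb = NEG_INF, -11.88964664904961

# prefix unchanged: only alignments ending in the same token may repeat it
stay = log_add([-12.013047332113274, pnb + ps])

# prefix extended to (3184, 3184): needs a blank in between, so it uses pb
extend = log_add([NEG_INF, pb + ps])
```

Because pb of (3184,) is still -inf after step 2, no alignment of that prefix ends in blank yet, so (3184, 3184) receives no probability at this frame and stays at -inf.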

Extending the prefix uses pb, while leaving it unchanged uses pnb. This follows from how CTC works:

If Y has two of the same character in a row, then a valid alignment must have an ϵ between them.
The role of the blank: