Far-field Multi-array Speech Recognition

1. Overview
  1. This post introduces an attention-based beamformer (multi-microphone beamforming) front-end.
  2. It reproduces the core code from the paper.
  3. Only the beamformer is reproduced here; the code that integrates it into ASR (wenet) will be open-sourced on my GitHub later.
2. Paper details
  1. Citation: Gong, R., et al. "Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition." 2021.
  2. arXiv: https://arxiv.org/abs/2109.04783
3. Paper walkthrough
  1. Network structure (the idea is quite simple)
    [Figure: overall network structure from the paper]
    (1) Bottom: per-channel features similar to a mel spectrogram.
    (2) Right: cross-attention is computed across the channels and passed through a softmax to assign each channel a weight.
    (3) Top: the mel spectrograms of the multichannel signals are combined as a weighted sum and fed to the ASR model as input (see the minimal sketch at the end of this section).

  2. Results
    [Figure: results table from the paper]
    (1) The authors claim it is the SOTA among beamformers, but the channel weighting is still fairly coarse-grained (making the weighting finer-grained could push the results even further).
    (2) The improvement is an absolute 1-2%.

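Before diving into the full implementation below, here is a minimal sketch of steps (2)-(3): the per-channel softmax weighting and the weighted sum. The shapes and the scores tensor are illustrative assumptions, not values from the paper.

import torch

# Toy example for a single frame: 8 channels, 40-dim fbank (shapes are assumptions).
fbank = torch.randn(8, 40)                  # per-channel fbank features for one frame
scores = torch.randn(8)                     # one attention-derived score per channel (assumed given)
weights = torch.softmax(scores, dim=0)      # softmax over channels, so the weights sum to 1
combined = (weights.unsqueeze(-1) * fbank).sum(dim=0)  # weighted sum over channels -> shape [40]
print(weights.sum(), combined.shape)        # weights sum to 1; combined is a single 40-dim fbank frame
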
4. Code (beamformer part only; for the integration into Wenet ASR, see my later GitHub updates)
import torch
import torch.nn as nn
import numpy as np
class beamformer_attention(nn.Module):
    """Self-attention channel combinator: assigns each channel a weight and
    returns the weighted sum of the per-channel fbank features."""
    def __init__(self, d_model, d_k, d_v, n_heads):
        super(beamformer_attention, self).__init__()
        self.multiheadatt = MultiHeadAttention(d_model, d_k, d_v, n_heads)
    def forward(self, x):
        # x: [seq_len, channels, fbank]
        att = self.multiheadatt(x, x, x)              # [seq_len, channels, d_v]
        # softmax over the channel dimension -> one weight per channel
        softmax_att = nn.Softmax(dim=1)(att)
        # weighted sum of the channel fbanks: [seq_len, fbank, d_v]
        output = torch.matmul(x.transpose(-1, -2), softmax_att)
        length = output.size(0)
        output = output.view(length, -1)              # [seq_len, fbank * d_v]
        return output
class ScaledDotProductAttentionMask(nn.Module):
    """Scaled dot-product attention (no mask is actually applied here)."""
    def __init__(self, d_k):
        super(ScaledDotProductAttentionMask, self).__init__()
        self.d_k = d_k
    def forward(self, Q, K, V):
        # scores: [batch_size x n_heads x len_q x len_k]
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(self.d_k)
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)               # [batch_size x n_heads x len_q x d_v]
        return context, attn
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v, n_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.n_heads = n_heads
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.layer_norm = nn.LayerNorm(d_model)
        self.concat = nn.Linear(n_heads * d_v, d_v)
    def forward(self, Q, K, V):

        batch_size = Q.size(0)

        ## Project first, then split into heads; Q and K both use d_k here
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1,2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1,2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1,2)  # v_s: [batch_size x n_heads x len_k x d_v]

        ## If an attn_mask of shape [batch_size x len_q x len_k] were passed in, the line below would expand it to [batch_size x n_heads x len_q x len_k], i.e. repeat the padding info across the heads (not needed for channel combination)
        #attn_mask = attn_mask.unsqueeze(1).repeat(1, self.n_heads, 1, 1)

        context, attn = ScaledDotProductAttentionMask(self.d_k)(q_s, k_s, v_s)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v) # context: [batch_size x len_q x n_heads * d_v]
        #output = self.layer_norm(context)
        output = self.concat(context)
        return output # output: [batch_size x len_q x d_v]

if __name__ == '__main__':
    # d_model=40 (fbank dim), d_k=64, d_v=1, n_heads=6
    beamformer = beamformer_attention(40, 64, 1, 6)
    input = torch.ones((10, 8, 40), dtype=torch.float32)  # [seq_len, channels, fbank]
    print(input)
    output = beamformer(input)
    print(output)
    print(input.size(), '[seq_length, channels, fbank]')
    print(output.size(), '[seq_length, fbank]')
    print('succeed')
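
The demo above treats seq_len as the batch dimension of the attention, so each frame gets its own set of channel weights. To run the front-end on a batch of multichannel utterances, one option is to fold batch and time together before calling the module. A minimal sketch, assuming an input of shape [batch, seq_len, channels, fbank] and reusing the beamformer instance from the demo:

# Assumed batched input: [batch, seq_len, channels, fbank]
batch, seq_len, channels, n_fbank = 4, 10, 8, 40
batched = torch.randn(batch, seq_len, channels, n_fbank)
flat = batched.view(batch * seq_len, channels, n_fbank)   # fold batch and time together
combined = beamformer(flat)                               # [batch * seq_len, fbank]
combined = combined.view(batch, seq_len, n_fbank)         # single-channel features for the ASR model
print(combined.size())                                    # torch.Size([4, 10, 40])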