1. Overview
Affiliations: Alibaba Group, Nanyang Technological University
Summary: ACA-Net is a lightweight speaker verification model that builds speaker features from global context-aware information. ACA-Net uses Asymmetric Cross Attention (ACA) in place of the commonly used temporal pooling layer; see the model diagram below. By performing fast queries against the large key and value matrices, ACA distils a variable-length sequence into a smaller, fixed-size latent sequence.
In ACA-Net, we use ACA to build a Multi-Layer Aggregation (MLA) block that generates a fixed-size identity vector from variable-length input. Through its global attention mechanism, ACA-Net acts as an efficient global feature extractor that adapts to variation in temporal length. Existing speaker verification models pool over the time dimension with a fixed function, which can lose information carried by non-stationary signals. Our experiments on the WSJ0-1talker dataset show that ACA-Net achieves a 5% relative improvement over the best baseline model while using only 20% of its parameters.
Paper preprint:
https://arxiv.org/abs/2305.12121
Code:
github.com/Yip-Jia-Qi/ACA-Net
Figure 1. ACA-Net model architecture.
ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification.
Problem statement:
1. The pooling method may obscure variability across time steps that may be important for discriminating between speakers.
2. Statistics pooling assumes that the speech signal has statistical properties that remain stationary over time, which does not always hold (see the sketch below).
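For concreteness, here is a minimal sketch (not the paper's code) of the conventional statistics pooling these points refer to; tensor shapes are illustrative. Mean and standard deviation over the time axis collapse any non-stationary structure into two numbers per channel:

```python
import torch

def statistics_pooling(x: torch.Tensor) -> torch.Tensor:
    """Collapse (batch, channels, time) to a fixed (batch, 2 * channels)."""
    mean = x.mean(dim=-1)            # per-channel mean over all frames
    std = x.std(dim=-1)              # per-channel std over all frames
    return torch.cat([mean, std], dim=-1)

frames = torch.randn(8, 512, 300)         # 300 frames; any length collapses the same way
print(statistics_pooling(frames).shape)   # torch.Size([8, 1024])
```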
Solution:
We propose ACA-Net, which uses Asymmetric Cross Attention (ACA) to avoid the high computational cost of self-attention while eliminating the need for temporal pooling.
2. Methodology
The model takes audio input processed through a filterbank and consists of a single 1x1 TDNN block, followed by the Multi-Layer Aggregation (MLA) block, and a final 1x1 convolution that reduces the channel dimension back to 1 for the final embedding.
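A hedged sketch of this top-level layout in PyTorch; module and parameter names are placeholders rather than the repository's API, and the MLA block is stubbed with adaptive pooling purely so the snippet runs (a sketch of the real MLA block appears in Sec. 2.4):

```python
import torch
import torch.nn as nn

class ACANetSketch(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 512, embed_size: int = 512):
        super().__init__()
        self.tdnn = nn.Conv1d(n_mels, channels, kernel_size=1)  # 1x1 TDNN block
        # Stand-in for the MLA block so the sketch runs: (B, C, T) -> (B, C, E).
        self.mla = nn.AdaptiveAvgPool1d(embed_size)
        self.head = nn.Conv1d(channels, 1, kernel_size=1)       # channel dim -> 1

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:
        # fbank: (batch, n_mels, time); time may vary per utterance
        latent = self.mla(self.tdnn(fbank))                     # (batch, channels, E)
        return self.head(latent).squeeze(1)                     # (batch, E) embedding

print(ACANetSketch()(torch.randn(2, 80, 317)).shape)            # torch.Size([2, 512])
```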
2.1 TDNN Block
The TDNN block in ACA-Net consists of a single depth-wise 1D convolutional layer, followed by ReLU activation and 1D batch normalization.
This block acts as an additional feature extractor and decouples the number of filterbank channels from the number of input channels to the MLA block.
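A minimal sketch of such a block; this version keeps the channel count fixed so the 1x1 convolution can be strictly depth-wise (groups == channels), while the channel decoupling described above would map filterbank channels to the MLA input width. All sizes are illustrative:

```python
import torch
import torch.nn as nn

def tdnn_block(channels: int = 512) -> nn.Sequential:
    return nn.Sequential(
        # groups == channels makes the 1x1 convolution depth-wise
        nn.Conv1d(channels, channels, kernel_size=1, groups=channels),
        nn.ReLU(),
        nn.BatchNorm1d(channels),
    )

x = torch.randn(4, 512, 200)          # (batch, channels, time)
print(tdnn_block()(x).shape)          # torch.Size([4, 512, 200])
```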
2.2 Asymmetric Cross Attention
ACA computes attention between a small latent query and a large feature sequence as the key and value matrices. This distils the temporal dimension of the feature input down to the embedding dimension.
The ACA sub-block makes use of standard Multi-Head Attention (MHA):

$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{softmax}\left(\frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}}\right) V W_i^V$

where $d$ denotes the number of channels and $Q$, $K$, $V$ denote the Query, Key, and Value of the MHA, respectively; $W^O$, $W_i^Q$, $W_i^K$, and $W_i^V$ are the projection parameter matrices.
The standard transformer uses MHA with $Q, K, V \in \mathbb{R}^{T \times d}$ (self-attention), whereas ACA uses a latent query $Q \in \mathbb{R}^{E \times d}$ with $K, V \in \mathbb{R}^{T \times d}$, where $E \ll T$.
Here, $E$ is the embedding size, a fixed hyperparameter, while $T$ is the time dimension, which varies with the input length.
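A hedged sketch of ACA using PyTorch's built-in `nn.MultiheadAttention` (the paper's exact projection layout may differ): a small learned latent of length E queries the full T-frame sequence, so the output length is E no matter how long the input is:

```python
import torch
import torch.nn as nn

class ACASketch(nn.Module):
    def __init__(self, d: int = 512, embed_size: int = 512, n_heads: int = 8):
        super().__init__()
        # Randomly initialised latent query of fixed length E (E << T in practice).
        self.latent = nn.Parameter(torch.randn(embed_size, d))
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, d) with variable T; it serves as both K and V
        q = self.latent.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.mha(q, feats, feats)    # cross-attention -> (batch, E, d)
        return out

print(ACASketch()(torch.randn(2, 1000, 512)).shape)   # torch.Size([2, 512, 512])
```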
2.3 ACA and Latent Sub-Blocks
The ACA sub-block and the latent sub-block share the same block architecture.
The K and V for the ACA sub-block are the feature sequence, while Q comes from random initialization (the feature input of size $T \times d$ is reduced to the embedding size, giving a latent of size $E \times d$).
The Q, K, and V for the latent sub-block are all the latent produced by the ACA sub-block (self-attention on the same latent, with no change in dimensions).
Additionally, for the ACA sub-block, sinusoidal positional encoding is added to the feature sequence before being passed into the MHA layer.
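The following sketch shows one plausible shared sub-block (MHA with a residual connection and layer norm, an assumption, since the internals are not spelled out here) together with the sinusoidal positional encoding; only the source of Q/K/V differs between the two sub-block types:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(T: int, d: int) -> torch.Tensor:
    """Standard fixed sinusoidal positional encoding, shape (T, d)."""
    pos = torch.arange(T).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2) * (-math.log(10000.0) / d))
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class SubBlock(nn.Module):
    """Shared sub-block architecture; only the Q/K/V wiring differs per use."""
    def __init__(self, d: int = 512, n_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        out, _ = self.mha(q, kv, kv)
        return self.norm(q + out)             # residual + layer norm (assumed)

block = SubBlock()
latent = torch.randn(2, 512, 512)             # (batch, E, d) latent
feats = torch.randn(2, 1000, 512) + sinusoidal_pe(1000, 512)  # PE on features
latent = block(latent, feats)                 # ACA sub-block: cross-attention
latent = block(latent, latent)                # latent sub-block: self-attention
```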
2.4 The MLA Block
It consists of a single ACA sub-block (ACA-Sub-B) followed by a variable number of latent sub-blocks.
The latent is refined by passing it through the latent sub-blocks in sequence; the output of each latent sub-block is aggregated by concatenation along the channel dimension, and the concatenated outputs are passed through a depth-wise 1D convolution and batch normalization layer that return the channel dimension to its original size.
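Putting the pieces together, a hedged sketch of the MLA block that reuses the `SubBlock` from the previous sketch; the grouped 1x1 convolution fusing the concatenated outputs is one plausible reading of the depth-wise convolution described above:

```python
import torch
import torch.nn as nn

class MLASketch(nn.Module):
    def __init__(self, d: int = 512, embed_size: int = 512, n_latent: int = 3):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(embed_size, d))
        self.aca = SubBlock(d)                                  # from the sketch above
        self.latent_blocks = nn.ModuleList(SubBlock(d) for _ in range(n_latent))
        # Grouped 1x1 conv + batch norm fuse n_latent * d channels back to d.
        self.fuse = nn.Sequential(
            nn.Conv1d(n_latent * d, d, kernel_size=1, groups=d),
            nn.BatchNorm1d(d),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, d); the latent stays (batch, E, d) throughout
        z = self.aca(self.latent.expand(feats.size(0), -1, -1), feats)
        outs = []
        for blk in self.latent_blocks:
            z = blk(z, z)                     # refine the latent by self-attention
            outs.append(z)
        cat = torch.cat(outs, dim=-1)         # concat channels: (B, E, n_latent * d)
        return self.fuse(cat.transpose(1, 2)).transpose(1, 2)   # back to (B, E, d)

print(MLASketch()(torch.randn(2, 1000, 512)).shape)   # torch.Size([2, 512, 512])
```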
3. Experiment
Dataset: WSJ0-1talker speaker verification dataset
While the training (20,000 utterances) and development (5,000 utterances) sets share speakers but contain different utterances, the testing set (3,000 utterances) consists of 18 separate speakers unseen during training. The verification pairs for testing are randomly selected from the testing set.
All utterances were down-sampled from 48 kHz to 8 kHz.
ACA-Net achieves the lowest EER and minDCF of all the models while using only 1/5 of the parameters of ECAPA-TDNN and RawNet3.
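For reference, EER (the headline metric here) is the operating point where false-accept and false-reject rates coincide; a minimal, illustrative computation from verification trial scores (not the paper's scoring script):

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER from trial scores; labels: 1 = same speaker, 0 = impostor."""
    order = np.argsort(scores)[::-1]                 # sweep thresholds high -> low
    labels = labels[order]
    far = np.cumsum(1 - labels) / max(1, int((1 - labels).sum()))  # false accepts
    frr = 1 - np.cumsum(labels) / max(1, int(labels.sum()))        # false rejects
    i = np.argmin(np.abs(far - frr))                 # point where the rates cross
    return float((far[i] + frr[i]) / 2)

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(f"EER = {compute_eer(scores, labels):.1%}")    # EER = 33.3%
```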
Ablation study
Three latent sub-blocks appears to be the optimal number for ACA-Net.
Varying the embedding size (E = 256, 512, 1024) shows that embedding sizes both larger and smaller than the 512 used by the base ACA-Net degrade verification performance.