Speaker Recognition CAM Paper Reading: CAM: CONTEXT-AWARE MASKING FOR ROBUST SPEAKER VERIFICATION

Axyzstra

Last modified 2024-07-26 18:42:34


Category: Voiceprint Recognition. Tags: paper reading, deep learning

First published 2024-07-26 16:05:35

Article link: https://blog.csdn.net/m0_52423355/article/details/140718096


CAM

CAM: CONTEXT-AWARE MASKING FOR ROBUST SPEAKER VERIFICATION

ICASSP 2021

Abstract

In this paper, we propose context-aware masking (CAM), a novel method to extract robust speaker embedding. CAM enables the speaker embedding network to “focus” on the speaker of interest and “blur” unrelated noise. The threshold of masking is dynamically controlled by an auxiliary context embedding that captures speaker and noise characteristics.

Proposes context-aware masking (CAM);
Lets the network focus on the speech of interest while blurring unrelated noise;

Introduction

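Traditional approaches to handling noise:

Apply a denoising transform to the speaker embedding: a statistical or neural-network back end converts the noisy embedding into an enhanced embedding; this loses information;
Filter the noise with a speech enhancement model, then extract robust speaker embeddings from the enhanced features.
- Supervised training on noisy-clean speech pairs; the pairs are synthetic and do not strictly follow the real distribution;
- GANs used for unsupervised speech enhancement;
- The paper argues that a separate speech enhancement step may be redundant!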

The approach proposed in the paper:

First, the denoising step should be earlier than the aggregation step to avoid information loss. Second, we do not use a separate speech enhancement model to pre-process input features.

Denoising happens before aggregation;
No separate speech enhancement model is used;

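Two contributions:

A new way to extract robust speaker embeddings: it focuses on the speech segments of interest and blurs unrelated noise, at a much lower computational cost than speech enhancement;
The idea of an auxiliary context embedding that captures speaker and noise characteristics;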

Method

D-TDNN

The model is built on D-TDNN; see D-TDNN for reference.

Context-Aware Masking

[Figure not preserved in source]

The parts to focus on are the transition layer and the masking;

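If the input acoustic features are $\mathbf{x}$, the output of a particular hidden layer is:
$g(\mathcal{F}(\mathbf{x}))$
where $g(\cdot)$ is the transformation of the selected hidden layer and $\mathcal{F}(\cdot)$ is the transformation of all layers before it;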

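With speech enhancement, the enhanced features are $\tilde{\mathbf{x}}$, and the output of the selected hidden layer becomes:
$g(\mathcal{F}(\tilde{\mathbf{x}}))$
The paper aims for an effect similar to speech enhancement. This requires finding a mask $\tilde{\mathbf{M}}$ such that:
$g(\mathcal{F}(\mathbf{x}))\odot\tilde{\mathbf{M}}\propto g(\mathcal{F}(\tilde{\mathbf{x}}))$
[Figure not preserved in source]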

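Panel (b) of the figure above shows this process: once the mask $\mathbf{M}$ has been found, compute:

$\tilde{\mathbf{F}}=g(\mathcal{F}(\mathbf{x}))\odot\mathbf{M}$
[Figure not preserved in source]

How the mask $\mathbf{M}$ is obtained is detailed in panel (c) of the figure above: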
$\begin{aligned}\mathbf{F}&=\mathcal{F}(\mathbf{x})\\\mathbf{M}_{*t}&=\sigma(\mathbf{W}_2^\top\omega(\mathbf{W}_1^\top\mathbf{F}_{*t}+\mathbf{e})+\mathbf{b}_2)\end{aligned}$
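Here $\mathbf{F}_{*t}$ denotes the $t$-th time frame of the feature map, so each frame receives its own weights, which is a form of attention. The vector $\mathbf{e}$ is produced by an ASP pooling layer, computed as follows: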
$\begin{aligned}&\boldsymbol{\mu}=\frac1T\sum_{t=1}^T\mathbf{F}_{*t},\\&\boldsymbol{\sigma}=\sqrt{\frac1T\sum_{t=1}^T\mathbf{F}_{*t}\odot\mathbf{F}_{*t}-\boldsymbol{\mu}\odot\boldsymbol{\mu}},\end{aligned}$

$\mathbf{e}=\mathbf{W}_3^\top[\boldsymbol{\mu},\boldsymbol{\sigma}]+\mathbf{b}_3$

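The resulting $\mathbf{e}$ therefore carries information about which frames are important and which are not; adding it inside the mask computation makes important frames more important and unimportant frames less so.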
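The pooling formulas above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `context_embedding`, the `(channels, frames)` layout of `F`, the `(2C, E)` shape of `W3`, and the clamp on the variance are all choices made here, not details taken from the paper.

```python
import numpy as np

def context_embedding(F, W3, b3):
    """Context embedding e = W3^T [mu, sigma] + b3.

    F  : (C, T) feature map, one column per time frame
    W3 : (2C, E) projection matrix (shape convention assumed here)
    b3 : (E,) bias
    """
    mu = F.mean(axis=1)                      # per-channel mean over frames
    var = (F * F).mean(axis=1) - mu * mu     # E[F^2] - mu^2, per channel
    sigma = np.sqrt(np.maximum(var, 0.0))    # clamp tiny negatives caused by rounding
    return W3.T @ np.concatenate([mu, sigma]) + b3

# Toy input: 2 channels, 2 frames, so mu = [2, 2] and sigma = [1, 0]
F = np.array([[1.0, 3.0],
              [2.0, 2.0]])
W3 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 0.0],
               [0.0, 1.0]])
b3 = np.zeros(2)
e = context_embedding(F, W3, b3)
```

With this toy `W3`, each entry of `e` is simply the mean plus the standard deviation of one channel, which makes the result easy to check by hand.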

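Applying CAM (the method described above) helps the speaker verification task suppress speech from interfering speakers as well as non-speech noise.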
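As a rough illustration of the mask formula and the element-wise masking step, here is a NumPy sketch. The notes do not define $\omega$ or $\sigma$; this sketch assumes $\omega$ is ReLU and $\sigma$ is the logistic sigmoid. The function names, the random weights, and the linear stand-in `Wg` for the hidden layer $g$ are hypothetical and exist only to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cam_mask(F, e, W1, W2, b2):
    """Per-frame mask M[:, t] = sigmoid(W2^T relu(W1^T F[:, t] + e) + b2).

    F : (C_in, T) output of the layers before the selected one
    e : (E,) context embedding, shared by all frames
    W1: (C_in, E), W2: (E, C_out), b2: (C_out,)
    """
    hidden = np.maximum(W1.T @ F + e[:, None], 0.0)   # (E, T); ReLU is an assumption
    return sigmoid(W2.T @ hidden + b2[:, None])       # (C_out, T), values in [0, 1]

rng = np.random.default_rng(0)
C_in, C_out, E, T = 4, 6, 3, 5
F = rng.standard_normal((C_in, T))
e = rng.standard_normal(E)
W1 = rng.standard_normal((C_in, E))
W2 = rng.standard_normal((E, C_out))
b2 = np.zeros(C_out)

Wg = rng.standard_normal((C_in, C_out))   # hypothetical linear stand-in for g
G = Wg.T @ F                              # plays the role of g(F(x)), shape (C_out, T)

M = cam_mask(F, e, W1, W2, b2)
F_tilde = G * M                           # element-wise masking
```

Because every mask value lies between 0 and 1, masking can only shrink the magnitude of each activation, which matches the "blur unrelated noise" description.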

CAM can be applied multiple times to different layers. In our experiment, we apply it to the first position-wise FC layer in the vanilla TDNN and the transition layer in each block of D-TDNN. The context embedding size is half of the output size of the selected hidden layer.

CAM can be applied at different layers;
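To make the sizing rule in the quote concrete, the helper below lists the parameter shapes one CAM module would need if the context embedding is half the output size of the selected layer. The helper names and the layer sizes 512 and 128 are made up for illustration and are not taken from the paper; the shapes follow the conventions of the formulas above ($\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_2$, $\mathbf{W}_3$, $\mathbf{b}_3$).

```python
import math

def cam_param_shapes(c_in, c_out):
    """Shapes of the CAM parameters for a layer mapping c_in to c_out channels."""
    e_dim = c_out // 2                 # context embedding is half the output size
    return {
        "W1": (c_in, e_dim),           # projects each frame F[:, t]
        "W3": (2 * c_in, e_dim),       # projects the pooled [mu, sigma]
        "b3": (e_dim,),
        "W2": (e_dim, c_out),          # maps to one mask value per output channel
        "b2": (c_out,),
    }

def count_params(shapes):
    return sum(math.prod(shape) for shape in shapes.values())

shapes = cam_param_shapes(c_in=512, c_out=128)   # hypothetical layer sizes
n_params = count_params(shapes)
```

For these example sizes the context embedding has 64 dimensions, and applying CAM at several layers would repeat this set of parameters once per layer.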