Paper Close Reading: stGCF

Abstract

Multi-modal fusion is proven to be an effective method to improve the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine the heterogeneous information and exploit the complementarity of multi-modal signals remains a challenging issue. In this paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities. Specifically, a novel acoustic map based on spatial-temporal Global Coherence Field (stGCF) is first constructed for heterogeneous signal fusion, which employs a camera model to map audio cues to the localization space consistent with the visual cues. Then a multi-modal perception attention network is introduced to derive the perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise. Moreover, a unique cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between different modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, which demonstrates its robustness under adverse conditions and outperforms the current state-of-the-art methods.


Summary:

Proposes a novel acoustic map based on the spatial-temporal Global Coherence Field (stGCF), using a camera model for the feature mapping.

Proposes a cross-modal self-supervised learning method.

Introduction

Speaker tracking is the foundation task for intelligent systems to implement behavior analysis and human-computer interaction. To enhance the accuracy of the tracker, multimodal sensors are utilized to capture richer information (Kılıç and Wang 2017). Among them, auditory and visual sensors have received extensive attention from researchers as the main senses for humans to understand the surrounding environment and interact with others. Similar to the process of human multi-modal perception, the advantage of integrating auditory and visual information is that they can provide necessary supplementary cues (Xuan et al 2020). Compared with the single-modal case, exploiting the complementarity of audio-visual signals contributes to improving tracking accuracy and robustness, particularly when dealing with complicated situations such as target occlusion, limited view of cameras, illumination changes, and room reverberation (Katsaggelos, Bahaadini, and Molina 2015). Furthermore, multi-modal fusion shows distinct advantages when the information of one modality is missing, or neither modality is able to provide a reliable observation. As a result, it is critical to develop a multi-modal tracking method that is capable of fusing heterogeneous signals and dealing with intermittent noisy audio-visual data.


Current speaker tracking methods are generally based on probabilistic generation models due to their ability to process multi-modal information. The representative method is Particle Filter (PF), which can recursively approximate the filtering distribution of tracking targets in nonlinear and non-Gaussian systems. Based on PF implementation, the Direction of Arrival (DOA) angle of the audio source is projected onto the image plane to reshape the typical Gaussian noise distribution of particles and increase the weights of particles near the DOA line (Kılıç et al 2015). A two-layered PF is proposed to implement feature fusion and decision fusion of audio-visual sources through the hierarchical structure (Liu, Li, and Yang 2019). Moreover, a face detector is employed to geometrically estimate the 3D position of the target to assist in the calculation of the acoustic map (Qian et al 2021). However, these methods prefer to use the detection results of a single modality to assist the other modality to obtain more accurate observations, while neglecting to fully utilize the complementarity and redundancy of audio-visual information. In addition, most of the existing audio-visual trackers use generation algorithms (Ban et al 2019; Schymura and Kolossa 2020; Qian et al 2017), which are difficult to adapt to random and diverse changes of target appearance. Furthermore, the likelihood calculation based on the color histogram or Euclidean distance is susceptible to interference from observation noise, which limits the performance of the fusion likelihood.


Summary:

Existing methods focus on using one modality to assist the other modality's prediction, rather than on compensating for the other modality's missing information.

To solve those limitations, we propose to adopt an attention mechanism to measure the confidence of multiple modalities, which determines the effectiveness of the fusion algorithm. The proposed idea is inspired by the human brain's perception mechanism for multi-modal sensory information, which integrates the data and optimizes the decision-making through two key steps: estimating the reliability of various sources and weighting the evidence based on its reliability (Zhang et al 2016). Take an intuitive experience as an example: when determining a speaker's position in a noisy and bright environment, we mainly use eyes; conversely, in a quiet and dim situation, we rely on sounds. Based on this phenomenon, we propose a multi-modal perception attention network to simulate the human perception system that is capable of selectively capturing valuable event information from multiple modalities. Figure 1 depicts the working process of the proposed network, in which the first two rows show the complementarity and consistency of audio and video modalities. In the third row, the image frame is obscured by an artificial mask to show the supplementary effect of the auditory modality when the visual modality is unreliable. Different from existing end-to-end models, the specialized network focuses on perceiving the reliability of observations from different modalities. However, the perception process is usually abstract, making it difficult to manually label quantitative tags. Due to the natural correspondence between sound and vision, necessary supervision is provided for audio-visual learning (Hu et al 2020; Afouras et al 2021). Therefore, we design a cross-modal self-supervised learning method, which exploits the complementarity and consistency of multi-modal data to generate weight labels of perception.


Summary:

Unlike existing models, this network focuses on perceiving the reliability of observations from the different modalities and weighting them according to that reliability.

The perception process is abstract and hard to label manually, so a self-supervised method is proposed.

Neural networks have been widely used in multi-modal fusion tasks, represented by Audio-Visual Speech Recognition (AVSR) (Baltrušaitis, Ahuja, and Morency 2018). However, except for preprocessing works such as target detection and feature extraction, neural networks are rarely introduced into multi-modal tracking. This is because the positive samples in the tracking task are simply random targets in the initial frame, resulting in a shortage of data to train a high-performing classifier. Therefore, using an attention network specifically to train the middle perception component provides a completely new approach to this problem. Another reason is that the heterogeneity of audio and video data makes it difficult to accomplish unity in the early stage of the network. Therefore, we propose the spatial-temporal Global Coherence Field (stGCF) map, which maps the audio cues to the image feature space through the projection operator of a camera model. To generate a fusion map, the integrated audio-visual cues are weighted by the perception weights estimated by the network. Finally, a PF-based tracker improved with the fusion map is employed to ensure smooth tracking of multi-modal observations.


Summary:

In the multi-modal tracking task, the positive samples are just the target in the initial frame, and everything afterwards has to be tracked and predicted, so positive samples and training data are scarce; an attention mechanism is therefore used to extract global information, so that later frames can still refer back to the earlier initial frame.

To unify the feature spaces of the audio and visual modalities, stGCF is proposed: audio cues are mapped through the projection operator of a camera model, then weighted following the GCF (SRP-PHAT) approach to obtain a fused audio-image space, and tracking is finally performed with a PF.

All these components make up our Multi-modal Perception Tracker (MPT), and experimental results demonstrate that the proposed MPT achieves significantly better results than the current state-of-the-art methods. In summary, the contributions of this paper are as follows:

• A novel tracking architecture, termed Multi-modal Perception Tracker (MPT), is proposed for the challenging audio-visual speaker tracking task. Moreover, we propose a new multi-modal perception attention network for the first time to estimate the confidence and availability of observations from multi-modal data.

• A novel acoustic map, termed stGCF map, is proposed, which utilizes a camera model to establish a mapping relationship between audio and visual localization space. Benefiting from the complementarity and consistency of audio-visual modalities, a new cross-modal self-supervised learning method is further introduced.

• Experimental results on the standard and occluded datasets demonstrate the superiority and robustness of the proposed methods, especially under noisy conditions.


Summary:

Here the attention mechanism is used to estimate the confidence of the multi-modal data.

Related Works

SSL (Sound Source Localization)

We improve the Global Coherence Field (GCF) method to extract audio features with both spatial and temporal cues under the guidance of visual information.


Audio-Visual Tracking

Commonly used methods are state-space approaches based on the Bayesian framework. Many works improve the PF architecture to integrate data streams from different modalities into a unified tracking framework. Among them, multi-modal observations are fused in a joint observation model, which is represented by improved likelihoods (Qian et al 2019; Kılıç et al 2015; Brutti and Lanz 2010). The tracking framework based on Extended Kalman Filter (EKF) realizes the fusion of an arbitrary number of multi-modal observations through dynamic weight flow (Schymura and Kolossa 2020). Probability Hypothesis Density (PHD) filter is introduced for tracking an unknown and variable number of speakers with the theory of Random Finite Sets (RFSs). The analytical solution is derived by introducing a Sequential Monte Carlo (SMC) implementation (Liu et al 2019). By analyzing the task as a generative audio-visual association model formulated as a latent-variable temporal graphical model, a variational inference model is proposed to approximate the joint distribution (Ban et al 2019). An end-to-end trained audio-visual object tracking network based on Single Shot Multibox Detector (SSD) is proposed, where visual and audio inputs are fused by an add merge layer (Wilson and Lin 2020). Deep learning methods are less utilized in the audio-visual tracking task, leaving room for further research.


Summary:

Deep learning methods are rarely used here; most existing approaches are probabilistic generative models.

Attention-Based Models

Recently, the attention mechanism has been widely used in several tasks (Duan et al 2021b; Tang et al 2021; Yang et al 2021; Liu et al 2021; Duan et al 2021a; Tang et al 2019; Xu et al 2018). In visual object tracking, the Siamese network-based tracker is further developed by designing various attention mechanisms (Wang et al 2018; Yu et al 2020). Based on the MDNet architecture, two modules of spatial attention and channel attention are employed to increase the discriminative properties of tracking (Zeng, Wang, and Lu 2019). In audio-visual analysis, a cross-modal attention framework for exploring the potential hidden correlations of same-modal and cross-modal signals is proposed for audio-visual event localization (Xuan et al 2020). For video emotion recognition, (Zhao et al 2020) integrates spatial, channel and temporal attention into visual CNN, and temporal attention into audio CNN. In audio-visual speech separation, the attention mechanism is used to help the model measure the differences and similarities between the visual representations of different speakers (Li and Qian 2020). To the best of our knowledge, attention has not been studied on the audio-visual speaker tracking task. In this paper, a self-supervised multi-modal perception attention network is introduced to investigate the perceptive ability of different modalities on the tracking scene.


Proposed Method

In this work, we propose a novel tracking architecture with a multi-modal perception attention network for audio-visual speaker tracking. Figure 2 shows the overall framework of the proposed MPT, which consists of four main modules: audio-visual (AV) measurements, multi-modal perception attention network, cross-modal self-supervised learning, and PF-based multi-modal tracker.


Audio-Visual Measurements

Through audio-visual measurements, the corresponding cues are extracted from audio signals and video frames. To integrate multi-modal cues in the same state space, we map the audio cues to the same localization plane as visual cues.


Audio Measurement

For a microphone pair (i, k), Ω denotes the set of M microphone pairs, the delay τ equals the actual TDOA, and r denotes the GCC-PHAT value.

Here, the GCF value is the average of the GCC-PHAT values over all microphone pairs.
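For reference, the standard GCC-PHAT and GCF definitions that this description corresponds to can be written as follows, with X_i(f) the spectrum of microphone i and τ_ik(p) the TDOA that a candidate point p would produce at pair (i, k); this is a sketch in common GCF-literature notation rather than a verbatim copy of the paper's equations.

```latex
% GCC-PHAT for microphone pair (i, k) at time t:
r_{ik}(t,\tau) = \int \frac{X_i(f)\,X_k^{*}(f)}{\bigl|X_i(f)\,X_k^{*}(f)\bigr|}\, e^{\,j 2\pi f \tau}\, df

% GCF at time t for a candidate 3D point p, averaged over the M pairs in \Omega:
\mathrm{GCF}(t,p) = \frac{1}{M} \sum_{(i,k)\in\Omega} r_{ik}\bigl(t,\, \tau_{ik}(p)\bigr)
```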

To construct the spatial grid, a pinhole camera model is used to project 2D points on the image plane into a series of 3D points in world coordinates at different depths, where depth refers to the perpendicular distance from the 3D point to the optical center of the camera.

In other words, each 2D point is back-projected into 3D by attaching a depth value.
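A minimal numpy sketch of this back-projection, assuming a known intrinsic matrix K and camera extrinsics (R, t) with x_cam = R x_world + t; the calibration values below are made-up placeholders, not the AV16.3 calibration.

```python
import numpy as np

def backproject(u, v, depths, K, R, t):
    """Back-project an image point (u, v) to 3D world points at several depths.

    Assumes x_cam = R @ x_world + t and depth measured along the camera's
    optical axis, as described above.
    """
    pix = np.array([u, v, 1.0])
    ray = np.linalg.inv(K) @ pix              # direction in camera coordinates (z = 1)
    points = []
    for d in depths:
        x_cam = d * ray                        # 3D point at depth d in the camera frame
        x_world = R.T @ (x_cam - t)            # transform back to world coordinates
        points.append(x_world)
    return np.stack(points)

# Illustrative placeholder calibration (not the real AV16.3 values):
K = np.array([[500.0, 0.0, 180.0],
              [0.0, 500.0, 144.0],
              [0.0,   0.0,   1.0]])
R_cam, t_cam = np.eye(3), np.zeros(3)
grid_3d = backproject(u=100, v=80, depths=[1.0, 1.5, 2.0, 2.5, 3.0, 3.5], K=K, R=R_cam, t=t_cam)
```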

Here i is the vertical index and j the horizontal index of the 2D sampling grid on the image plane, and R is the resulting GCF map.

When the peak of the GCF map is attained at depth k_max, the sGCF map at time t is defined as the GCF map at that depth.

Owing to the intermittency of speech and the continuity of speaker motion, the speech signal over a period of time provides a reference for the audio cues at the current moment. Considering the signals within the time interval [t - m1, t], the m2 frames with the largest peak values are selected from the m1 + 1 frames as sGCF maps. The stGCF map at time t is defined as follows:

where T denotes the set of time indices of the m2 selected frames.
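A plausible reconstruction of the sGCF and stGCF definitions described above, written in the notation of the surrounding text (p_ijk denotes the 3D point back-projected from grid position (i, j) at depth k); this is a sketch rather than the paper's exact formulas.

```latex
% GCF map over the 2D sampling grid (i, j) at depth k and time t:
R^{\mathrm{GCF}}_{t}(i, j, k) = \mathrm{GCF}\bigl(t,\, p_{ijk}\bigr)

% sGCF map: the slice at the depth k_{\max} where the GCF map peaks:
R^{\mathrm{sGCF}}_{t}(i, j) = R^{\mathrm{GCF}}_{t}(i, j, k_{\max}),
\qquad k_{\max} = \arg\max_{k}\,\max_{i,j}\, R^{\mathrm{GCF}}_{t}(i, j, k)

% stGCF map: the m_2 sGCF maps with the largest peaks in [t - m_1, t]:
R^{\mathrm{stGCF}}_{t} = \bigl\{\, R^{\mathrm{sGCF}}_{t'} : t' \in T \,\bigr\}
```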

Visual Measurement

The tracking task aims to locate an arbitrary target selected in the first frame of the video, which makes it impossible to collect data in advance to train a target-specific tracker. The tracking problem is therefore treated as a similarity measurement between the known target and the search region.

This module adopts a pretrained Siamese network (Bertinetto et al 2016), which uses cross-correlation, implemented as a convolution operation, as its metric function. The output response map is used as the visual cue and can be expressed as:

where I_t is the current video frame, I_ref is the reference template of the tracking target defined by the user in the first frame, and I is the set of reference templates at different scales. f(·) denotes the metric function that outputs a representative score map.

S(I_t) reflects the probability of the tracking target being at each position in the search image, which is consistent with the meaning of the stGCF map derived from the audio cues.
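To illustrate what the response map looks like, the toy sketch below cross-correlates raw grayscale patches with scipy as a stand-in for the Siamese metric f(·); the real visual measurement correlates deep features from the pretrained SiamFC network, which is omitted here.

```python
import numpy as np
from scipy.signal import correlate2d

def response_map(search_img, template):
    """Toy stand-in for the Siamese metric f(.): cross-correlate template with search image.

    MPT's visual measurement correlates deep features from a pretrained SiamFC
    network; raw grayscale patches are used here purely for illustration.
    """
    t = template - template.mean()
    s = search_img - search_img.mean()
    score = correlate2d(s, t, mode='same')                       # one response per position
    score = (score - score.min()) / (score.max() - score.min() + 1e-8)  # normalize to [0, 1]
    return score

rng = np.random.default_rng(0)
frame = rng.random((288, 360))          # I_t, the current frame (AV16.3 image size)
ref = frame[100:140, 150:190].copy()    # I_ref, user-defined template from the first frame
S = response_map(frame, ref)            # S(I_t): probability-like map of the target location
```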

Multi-Modal Perception Attention Network

Given the extracted audio and visual cues, the multi-modal perception attention network (see Figure 2) generates a confidence score map as a speaker location representation. The brain's attention mechanism selectively enhances the transmission of information that attracts human attention, picking out from abundant inputs the specific information that is most critical to the current task goal. Inspired by this signal processing mechanism, a neural attention mechanism is exploited in this module to learn to measure the plausibility of multiple modalities.


To integrate the audio-visual cues, the stGCF map R^stGCF_Ω and the visual response map S(I_t) are normalized and reshaped into 3D matrix form:

where U denotes the size of each input video frame, U = H × W; D_a is the dimension of the audio cues, which depends on m2, the number of temporal reference cues; and D_v is the dimension of the visual cues, which is determined by the number of reference templates in I.

The architecture of a channel attention module (Woo et al 2018) is used to produce a positive score α_i for each channel, measuring the confidence of the observation on the i-th channel.

The score α_i, called the perception weight, reflects the confidence of the multi-modal cues measured in the previous section. α_i is low for ambiguous observations disturbed by background noise, room reverberation, visual occlusion, background clutter, and so on, and high for reliable observations. This behaviour comes from the statistical features that the network learns from the observation maps; through it the network exhibits a perceptual ability over multi-modal observations, which accounts for the interpretability of the proposed network.
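As an illustration of the channel-attention idea (Woo et al 2018) applied to the stacked cues, here is a minimal PyTorch sketch that squeezes the spatial dimensions, passes the pooled descriptors through a shared MLP, and outputs one sigmoid score per channel; the authors' actual network uses a MobileNetV3-Large backbone, so this is only a sketch of the mechanism, not their architecture.

```python
import torch
import torch.nn as nn

class PerceptionWeights(nn.Module):
    """Channel attention (CBAM-style): one positive score alpha_i per observation channel."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, cues: torch.Tensor) -> torch.Tensor:
        # cues: (batch, Da + Dv, h, w) stacked audio/visual observation maps
        avg = cues.mean(dim=(2, 3))                            # average-pooled descriptor
        mx, _ = cues.flatten(2).max(dim=2)                     # max-pooled descriptor
        alpha = torch.sigmoid(self.mlp(avg) + self.mlp(mx))    # perception weights in (0, 1)
        return alpha

cues = torch.rand(1, 5 + 2, 16, 20)            # e.g. Da = 5 audio maps, Dv = 2 visual maps
alphas = PerceptionWeights(channels=7)(cues)   # shape (1, 7): one confidence per channel
```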

Cross-Modal Self-Supervised Learning

The self-supervision involves a temporal factor and a spatial factor, which account for the temporal continuity of the moving target and the positional consistency across the multi-modal observations, respectively.

Spatial factor

For the i-th channel, let p^max_{t,i} be the peak position of the feature map at time t; the corresponding spatial factor of the observation on channel i is then defined by averaging operations within and across modalities. The cross-modal spatial factor is defined as:

where S denotes the normalized visual response value at position p, and R is the normalized sGCF value at position p, with j the depth index.

Temporal factor

It is obtained by performing the above averaging operation over a time interval centered at time t:

where V denotes either the audio map or the visual map.

As shown in Figure 3, the self-supervised label integrates evaluations from different modalities over a time interval. When the target drifts in one observation, the complementarity between modalities and the continuity of the target motion mean that the other observation provides a more accurate reference, and the drifting observation receives a lower value. Conversely, when the peaks of all observations fall in the same region, the value increases accordingly. A conventional L2 loss is used between the generated labels and the attention measurements.
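A rough numpy sketch of this labeling idea, under the assumption that the spatial factor of channel i is obtained by reading every normalized observation map at channel i's peak position and averaging, and that the temporal factor averages these values over an n-frame window; the paper's exact averaging scheme may differ.

```python
import numpy as np

def spatial_factor(maps: np.ndarray) -> np.ndarray:
    """Rough sketch of the cross-modal spatial factor.

    maps: (C, h, w) normalized observation maps (audio sGCF channels + visual channels).
    A channel whose peak agrees with the other maps gets a high factor; a drifting one, a low factor.
    """
    C, h, w = maps.shape
    peaks = [np.unravel_index(np.argmax(maps[i]), (h, w)) for i in range(C)]
    factors = np.empty(C)
    for i in range(C):
        p = peaks[i]
        factors[i] = np.mean([maps[j][p] for j in range(C)])  # agreement at channel i's peak
    return factors

def temporal_factor(factor_seq: np.ndarray, t: int, n: int) -> np.ndarray:
    """Average the spatial factors over a window of n frames centred at t."""
    lo, hi = max(0, t - n // 2), min(len(factor_seq), t + n // 2 + 1)
    return factor_seq[lo:hi].mean(axis=0)

# toy usage: 7 channels, 16x20 maps, 11-frame sequence
maps_seq = np.random.rand(11, 7, 16, 20)
spatial_seq = np.stack([spatial_factor(m) for m in maps_seq])
labels_t5 = temporal_factor(spatial_seq, t=5, n=6)   # self-supervised weight labels at frame 5
```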

Multi-Modal Tracker

The fusion map, obtained by taking a weighted average of the audio-visual cues V using the attention measurements output by the network, is expressed as:

The perception attention values of different modalities are fused in the map and used to weight particles in the update step of the PF. After diffusion, the value of the fusion map at the particle position is set as the new particle weight. Moreover, in order to utilize the global information of the fusion map, we simply improve the resampling step as well. At the beginning of each iteration, a group of the particles is reset to the peak position of the fusion map. Through the correction of the peak value, the tracking drift problem caused by the observation noise of some frames is avoided. The method is outstanding when the observation is severely disturbed by the environment noise.
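The following numpy sketch condenses this improved SIR-PF step as described above; the diffusion noise level and the fraction of particles reset to the fusion-map peak are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def pf_step(particles, fusion_map, reset_frac=0.25, sigma=5.0):
    """One improved SIR-PF iteration: peak reset, diffusion, fusion-map weighting, resampling."""
    n, (h, w) = len(particles), fusion_map.shape
    # 1) reset a group of particles to the global peak of the fusion map
    peak = np.array(np.unravel_index(np.argmax(fusion_map), (h, w)), dtype=float)
    particles[: int(reset_frac * n)] = peak
    # 2) diffuse particles with Gaussian noise, keeping them inside the image
    particles = particles + rng.normal(0.0, sigma, particles.shape)
    particles[:, 0] = particles[:, 0].clip(0, h - 1)
    particles[:, 1] = particles[:, 1].clip(0, w - 1)
    # 3) weight each particle by the fusion-map value at its position
    weights = fusion_map[particles[:, 0].astype(int), particles[:, 1].astype(int)]
    weights = weights / (weights.sum() + 1e-12)
    # 4) state estimate and resampling
    estimate = (weights[:, None] * particles).sum(axis=0)
    idx = rng.choice(n, size=n, p=weights)
    return estimate, particles[idx]

particles = rng.uniform(0, 1, (100, 2)) * np.array([288, 360])   # 100 particles in (row, col)
fusion_map = rng.random((288, 360))                               # stand-in for the weighted AV map
est, particles = pf_step(particles, fusion_map)
```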


Experiments and Discussions

Datasets

In this section, the proposed tracker is evaluated on the AV16.3 corpus (Lathoud, Odobez, and Gatica-Perez 2004), which provides true 3D mouth location derived from calibrated cameras and 2D measurements on the various images for systematic assessment. The audio data is recorded at the sampling rate of 16 kHz by two circular eight-element microphone arrays placed 0.8m apart on the table. The images are captured by 3 monocular color cameras installed in 3 corners of the room at 25Hz with size H×W = 288×360.

The experiments are tested on seq08, 11, and 12, where a single participant wandered around, moved quickly, and spoke intermittently. Each set of experiments uses signals from two microphone arrays and an individual camera.


Implementation Details

Visual cues are generated by a pretrained Siamese network (Bertinetto et al 2016) based on AlexNet backbone. Reference image set I contains two target rectangles with scales of 1 and 1.25, which are defined by users in the first frame. For audio measurement, the number of 2D sampling points in the horizontal and vertical directions on the image plane are w = 20 and h = 16. A 0.8m high table is placed in a (3.6×8.2×2.4) m room. Therefore, the sampling points located outside the room range and below the desktop are removed, which is in accord with the real situation and avoids the ambiguity caused by the symmetry of the circular microphone. The depths number of projected 3D points is set to d = 6. The speech signal is enframed to 40ms by a Hamming window with a frame shift of 1/2.

The parameters to calculate stGCF are set to M = 120, m1 = 15, m2 = 5. Backbone of the attention network is MobileNetv3-large (Howard et al 2019). The network is trained on single speaker sequences seq01, 02, 03, which contain more than 4500 samples. The parameters to generate self-supervised label are set to Da = 5, Dv = 2, n = 6.

All models are trained for 20 epochs with batch size 16 and learning rate 0.01. Our method and comparison methods are based on Sampling Importance Resampling (SIR)-PF for tracking. The number of particles is set to 100. Our source codes are available at https://github.com/liyidi/MPT.


Evaluation Metrics

Mean Absolute Error (MAE) and Accuracy (ACC) are used to evaluate the performance of tracking methods. MAE calculates the Euclidean distance in pixels between the estimated position and the ground truth (GT), divided by the number of frames. ACC measures the percentage of correct estimates, whose error distance in pixels does not exceed 1/2 of the diagonal of the bounding-box of GT.
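To make the two metrics concrete, here is a small numpy sketch implementing them as described above; the per-frame GT bounding-box diagonal is assumed to be available from the annotations.

```python
import numpy as np

def mae_and_acc(est, gt, gt_diag):
    """est, gt: (N, 2) pixel positions; gt_diag: (N,) GT bounding-box diagonals in pixels."""
    err = np.linalg.norm(est - gt, axis=1)       # per-frame Euclidean error in pixels
    mae = err.mean()                             # Mean Absolute Error
    acc = (err <= 0.5 * gt_diag).mean()          # fraction of frames within half the GT diagonal
    return mae, acc

# toy usage
est = np.array([[100.0, 120.0], [105.0, 118.0]])
gt = np.array([[102.0, 121.0], [140.0, 150.0]])
mae, acc = mae_and_acc(est, gt, gt_diag=np.array([40.0, 40.0]))
```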


Ablation Study and Analysis

(AM: audio measurement, AN: attention network, TR: tracker, AvgAtt: average attention, MPAtt: multi-modal perception attention, IPF: improved PF, Org: original dataset)

The stGCF method is affected by the geometric configuration of the camera and microphone arrays, especially when the speaker is located on the line connecting the camera and a microphone array.

Due to the directionality of the sound signal, the peak usually appears within a large highlighted region of the stGCF map, which yields only a coarse search result.

The improved resampling method uses the global maximum of the fusion map, preventing the particles from being trapped in a local optimum when the target is missing in individual frames.

Visualization Analysis

The correct region is highlighted in the fusion map, even though the first sample is disturbed by noise from a chair and the speaker's face is completely occluded in the second sample.

When the speaker walks into the occluded area, the tracker can still roughly estimate the speaker's position, which facilitates re-tracking once the target becomes visible again.

Conclusions

In this paper, we propose a novel multi-modal perception tracker for the challenging audio-visual speaker tracking task. We also propose a new multi-modal perception attention network and a new acoustic map extraction method. The proposed tracker utilizes the complementarity and consistency of multiple modalities to learn the availability and reliability of observations between various modalities in a self-supervised fashion. Extensive experiments demonstrate that the proposed tracker is superior to the current state-of-the-art counterparts, especially showing sufficient robustness under adverse conditions. Lastly, the intermediate process is visualized to demonstrate the interpretability of the proposed tracker network.

