

Multi-modal fusion has proven effective for improving the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine heterogeneous information and exploit the complementarity of multi-modal signals remains a challenging issue. In this paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities. Specifically, a novel acoustic map based on the spatial-temporal Global Coherence Field (stGCF) is first constructed for heterogeneous signal fusion; it employs a camera model to map audio cues to a localization space consistent with the visual cues. Then a multi-modal perception attention network is introduced to derive perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise. Moreover, a cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, demonstrating its robustness under adverse conditions and outperforming current state-of-the-art methods.







Speaker tracking is a foundational task that enables intelligent systems to perform behavior analysis and human-computer interaction. To enhance tracking accuracy, multi-modal sensors are used to capture richer information (Kılıç and Wang 2017). Among them, auditory and visual sensors have received extensive attention because hearing and vision are the main senses humans use to understand their surroundings and interact with others. Similar to human multi-modal perception, integrating auditory and visual information has the advantage that each modality supplies cues the other lacks (Xuan et al 2020). Compared with the single-modal case, exploiting the complementarity of audio-visual signals improves tracking accuracy and robustness, particularly in complicated situations such as target occlusion, limited camera view, illumination changes, and room reverberation (Katsaggelos, Bahaadini, and Molina 2015). Furthermore, multi-modal fusion shows distinct advantages when the information of one modality is missing, or when neither modality alone can provide a reliable observation. As a result, it is critical to develop a multi-modal tracking method that is capable of fusing heterogeneous signals and dealing with intermittent noisy audio-visual data.



Current speaker tracking methods are generally based on probabilistic generative models because of their ability to process multi-modal information. The representative method is the Particle Filter (PF), which recursively approximates the filtering distribution of tracking targets in nonlinear and non-Gaussian systems. Building on a PF implementation, the Direction of Arrival (DOA) angle of the audio source is projected onto the image plane to reshape the typical Gaussian noise distribution of particles and increase the weights of particles near the DOA line (Kılıç et al 2015). A two-layered PF is proposed to implement feature fusion and decision fusion of audio-visual sources through a hierarchical structure (Liu, Li, and Yang 2019). Moreover, a face detector is employed to geometrically estimate the 3D position of the target to assist in the calculation of the acoustic map (Qian et al 2021). However, these methods tend to use the detection results of one modality to help the other obtain more accurate observations, while neglecting to fully exploit the complementarity and redundancy of audio-visual information. In addition, most existing audio-visual trackers use generative algorithms (Ban et al 2019; Schymura and Kolossa 2020; Qian et al 2017), which are difficult to adapt to random and diverse changes in target appearance. Furthermore, likelihood calculation based on color histograms or Euclidean distance is susceptible to observation noise, which limits the performance of the fusion likelihood.





To address these limitations, we propose to adopt an attention mechanism to measure the confidence of multiple modalities, which determines the effectiveness of the fusion algorithm. The idea is inspired by the human brain's perception mechanism for multi-modal sensory information, which integrates data and optimizes decision-making through two key steps: estimating the reliability of the various sources and weighting the evidence according to that reliability (Zhang et al 2016). Take intuitive experience as an example: when determining a speaker's position in a noisy and bright environment, we mainly use our eyes; conversely, in a quiet and dim situation, we rely on sound. Based on this phenomenon, we propose a multi-modal perception attention network that simulates the human perception system and is capable of selectively capturing valuable event information from multiple modalities. Figure 1 depicts the working process of the proposed network, in which the first two rows show the complementarity and consistency of the audio and video modalities. In the third row, the image frame is obscured by an artificial mask to show the supplementary effect of the auditory modality when the visual modality is unreliable. Different from existing end-to-end models, this specialized network focuses on perceiving the reliability of observations from different modalities. However, the perception process is abstract, which makes it difficult to manually assign quantitative labels. The natural correspondence between sound and vision provides the necessary supervision for audio-visual learning (Hu et al 2020; Afouras et al 2021). Therefore, we design a cross-modal self-supervised learning method, which exploits the complementarity and consistency of multi-modal data to generate the perception weight labels.






Neural networks have been widely used in multi-modal fusion tasks, represented by Audio-Visual Speech Recognition (AVSR) (Baltrušaitis, Ahuja, and Morency 2018). However, apart from preprocessing steps such as target detection and feature extraction, neural networks are rarely introduced to multi-modal tracking. One reason is that the positive samples in a tracking task are simply random targets in the initial frame, resulting in a shortage of data to train a high-performing classifier. Using an attention network specifically to train the intermediate perception component therefore offers a new approach to this problem. Another reason is that the heterogeneity of audio and video data makes it difficult to unify them in the early stages of the network. We therefore propose the spatial-temporal Global Coherence Field (stGCF) map, which maps the audio cues to the image feature space through the projection operator of a camera model. To generate a fusion map, the integrated audio-visual cues are weighted by the perception weights estimated by the network. Finally, a PF-based tracker improved with the fusion map is employed to ensure smooth tracking of multi-modal observations.






All these components make up our Multi-modal Perception Tracker (MPT), and experimental results demonstrate that the proposed MPT achieves significantly better results than the current state-of-the-art methods. In summary, the contributions of this paper are as follows:

• A novel tracking architecture, termed Multi-modal Perception Tracker (MPT), is proposed for the challenging audio-visual speaker tracking task. Moreover, we propose a new multi-modal perception attention network for the first time to estimate the confidence and availability of observations from multi-modal data.

• A novel acoustic map, termed stGCF map, is proposed, which utilizes a camera model to establish a mapping relationship between the audio and visual localization spaces. Benefiting from the complementarity and consistency of audio-visual modalities, a new cross-modal self-supervised learning method is further introduced.

• Experimental results on the standard and occluded datasets demonstrate the superiority and robustness of the proposed methods, especially under noisy conditions.







Related Works


We improve the Global Coherence Field (GCF) method to extract audio features with both spatial and temporal cues under the guidance of visual information.



Audio-Visual Tracking

Commonly used methods are state-space approaches based on the Bayesian framework. Many works improve the PF architecture to integrate data streams from different modalities into a unified tracking framework. Among them, multi-modal observations are fused in a joint observation model, which is represented by improved likelihoods (Qian et al 2019; Kılıç et al 2015; Brutti and Lanz 2010). The tracking framework based on the Extended Kalman Filter (EKF) realizes the fusion of an arbitrary number of multi-modal observations through dynamic weight flow (Schymura and Kolossa 2020). The Probability Hypothesis Density (PHD) filter is introduced for tracking an unknown and variable number of speakers with the theory of Random Finite Sets (RFSs). The analytical solution is derived by introducing a Sequential Monte Carlo (SMC) implementation (Liu et al 2019). By analyzing the task as a generative audio-visual association model formulated as a latent-variable temporal graphical model, a variational inference model is proposed to approximate the joint distribution (Ban et al 2019). An end-to-end trained audio-visual object tracking network based on the Single Shot Multibox Detector (SSD) is proposed, where visual and audio inputs are fused by an add merge layer (Wilson and Lin 2020). Deep learning methods remain underused in audio-visual tracking, which leaves room for further research.





Attention-Based Models

Recently, the attention mechanism has been widely used in several tasks (Duan et al 2021b; Tang et al 2021; Yang et al 2021; Liu et al 2021; Duan et al 2021a; Tang et al 2019; Xu et al 2018). In visual object tracking, the Siamese network-based tracker is further developed by designing various attention mechanisms (Wang et al 2018; Yu et al 2020). Based on the MDNet architecture, spatial attention and channel attention modules are employed to increase the discriminative power of tracking (Zeng, Wang, and Lu 2019). In audio-visual analysis, a cross-modal attention framework that explores the hidden correlations of same-modal and cross-modal signals is proposed for audio-visual event localization (Xuan et al 2020). For video emotion recognition, Zhao et al (2020) integrate spatial, channel, and temporal attention into a visual CNN, and temporal attention into an audio CNN. In audio-visual speech separation, the attention mechanism is used to help the model measure the differences and similarities between the visual representations of different speakers (Li and Qian 2020). To the best of our knowledge, attention has not been studied for the audio-visual speaker tracking task. In this paper, a self-supervised multi-modal perception attention network is introduced to investigate the perceptive ability of different modalities in the tracking scene.



Proposed Method

In this work, we propose a novel tracking architecture with a multi-modal perception attention network for audio-visual speaker tracking. Figure 2 shows the overall framework of the proposed MPT, which consists of four main modules: audio-visual (AV) measurements, multi-modal perception attention network, cross-modal self-supervised learning, and PF-based multi-modal tracker.



Audio-Visual Measurements

Through audio-visual measurements, the corresponding cues are extracted from audio signals and video frames. To integrate multi-modal cues in the same state space, we map the audio cues to the same localization plane as visual cues.
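The mapping between the 3D space in which audio is localized and the image plane can be illustrated with a pinhole camera model. The sketch below is only an illustration under stated assumptions: lens distortion is ignored, and the function names `project` and `back_project`, the intrinsic matrix `K`, and the pose `R`, `t` are hypothetical values chosen for the example, not the calibration used in the paper.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project world points (N, 3) to pixel coordinates (N, 2), pinhole model."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uvw = cam @ K.T                    # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division

def back_project(pixels, depths, K, R, t):
    """Lift pixels (N, 2) to world points (N, D, 3), one per camera-frame depth."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    rays = homog @ np.linalg.inv(K).T               # camera-frame rays with z = 1
    cam = rays[:, None, :] * depths[None, :, None]  # scale each ray to each depth
    return (cam - t) @ R                            # camera frame -> world frame

# Hypothetical calibration, for illustration only.
K = np.array([[500.0, 0.0, 180.0],
              [0.0, 500.0, 144.0],
              [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, -0.2, 0.5])

# A 20 x 16 grid of image points, each lifted to 6 candidate depths.
us, vs = np.meshgrid(np.linspace(0, 359, 20), np.linspace(0, 287, 16))
pixels = np.column_stack([us.ravel(), vs.ravel()])
depths = np.linspace(1.0, 6.0, 6)
world = back_project(pixels, depths, K, R, t)   # shape (320, 6, 3)
```

Each image point corresponds to a ray of candidate 3D positions, so an acoustic response evaluated at those 3D positions can be written back to the pixel they project to.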



Audio Measurement

Here (i, k) denotes a microphone pair, Ω is the set of M microphone pairs, the delay τ equals the actual TDOA, and r denotes the GCC-PHAT value.
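The quantities above (microphone pairs, TDOA, GCC-PHAT) can be illustrated with a minimal sketch of the standard GCF computation, which averages the GCC-PHAT value at the theoretical TDOA over all pairs. This is an assumption-laden illustration rather than the paper's implementation: the function names `gcc_phat` and `gcf`, the speed of sound `c`, the rounding of the TDOA to an integer lag, and the default `max_lag` are all choices made for this example.

```python
import numpy as np

def gcc_phat(x_i, x_k, max_lag):
    """GCC-PHAT of two frames; a positive lag means x_k is delayed w.r.t. x_i.

    max_lag must be at least 1.
    """
    n_fft = 2 * max(len(x_i), len(x_k))          # zero-pad to avoid wrap-around
    X_i = np.fft.rfft(x_i, n_fft)
    X_k = np.fft.rfft(x_k, n_fft)
    cross = X_k * np.conj(X_i)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase only
    r = np.fft.irfft(cross, n_fft)
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, r

def gcf(point, mic_pos, pairs, frames, fs, c=343.0, max_lag=64):
    """Average GCC-PHAT value at the TDOA that `point` would produce."""
    total = 0.0
    for i, k in pairs:
        # Theoretical TDOA: extra travel time to mic k relative to mic i.
        tau = (np.linalg.norm(point - mic_pos[k])
               - np.linalg.norm(point - mic_pos[i])) / c
        lag = int(np.clip(round(tau * fs), -max_lag, max_lag))
        _, r = gcc_phat(frames[i], frames[k], max_lag)
        total += r[lag + max_lag]
    return total / len(pairs)
```

Evaluating `gcf` on a grid of candidate 3D points yields an acoustic map whose peak indicates the most coherent source position.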






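Because speech is intermittent while speaker motion is continuous, the speech signal over a period of time provides a reference for the audio cue at the current moment. Considering the signal in the time interval [t − m1, t], the m2 frames with the largest peak values are selected from the m1 + 1 frames as sGCF maps. The stGCF map at time t is defined as follows: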
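The temporal selection step described above can be sketched as follows. This is a minimal illustration, assuming the m1 + 1 spatial maps are already stacked in one array; the function name `select_top_frames` is hypothetical, and returning the selected frames in temporal order is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def select_top_frames(sgcf_maps, m2):
    """Keep the m2 maps with the largest peak value.

    sgcf_maps: array of shape (m1 + 1, h, w), one spatial map per frame.
    Returns the selected maps (m2, h, w) and their frame indices,
    both in temporal order (an assumption of this sketch).
    """
    peaks = sgcf_maps.reshape(len(sgcf_maps), -1).max(axis=1)
    top = np.sort(np.argsort(peaks)[-m2:])
    return sgcf_maps[top], top
```

With m1 = 15 and m2 = 5, for example, the input holds 16 maps and the output keeps the 5 frames in which the acoustic response is strongest, which tends to discard frames recorded during silence.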


Visual Measurement



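Here, I_t is the current video frame, I_ref is the reference template of the tracking target defined by the user in the first frame, and I is the set of reference templates at different scales. f(·) denotes the metric function that outputs a representative score map.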


Multi-Modal Perception Attention Network

Given the extracted audio and visual cues, the multi-modal perception attention network (see Figure 2) generates a confidence score map as a speaker location representation. The brain’s attention mechanism is able to selectively improve the transmission of information that attracts human attention, weighing the specific information that is more critical to the current task goal from abundant information. Inspired by this signal processing mechanism, a neural attention mechanism is exploited in this module to learn to measure the plausibility of multiple modalities.



To integrate the audio-visual cues, the stGCF map and the visual response map S(I_t) are normalized and reshaped into three-dimensional matrix form:

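Here, U denotes the size of each input video frame, U = H × W; Da is the dimension of the audio cues, which depends on the number m2 of reference temporal cues; and Dv is the dimension of the visual cues, which is determined by the number of reference templates in I.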



Cross-Modal Self-Supervised Learning



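For the i-th channel, let p_max(t, i) be the position of the feature-map peak at time t. The spatial factor of the observation on channel i is then defined through averaging operations within and across modalities. The cross-modal spatial factor is defined as: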






Multi-Modal Tracker


The perception attention values of the different modalities are fused into a single map, which is used to weight particles in the update step of the PF. After diffusion, the value of the fusion map at each particle position is set as the new particle weight. Moreover, to make use of the global information in the fusion map, we also modify the resampling step: at the beginning of each iteration, a group of particles is reset to the peak position of the fusion map. This peak-based correction counteracts the tracking drift caused by observation noise in some frames, and it is particularly effective when the observation is severely disturbed by environmental noise.
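One iteration of such a tracker can be sketched as below. This is a simplified illustration and not the released implementation: the function name `pf_step`, the random-walk diffusion with standard deviation `sigma`, the fraction `reset_frac` of particles moved to the peak, and the multinomial resampling are all assumptions made for the example.

```python
import numpy as np

def pf_step(particles, fusion_map, rng, sigma=5.0, reset_frac=0.1):
    """One PF iteration driven by a fusion map.

    particles: float array (N, 2) of (x, y) pixel positions.
    fusion_map: non-negative array (H, W).
    Returns the resampled particles and the weighted position estimate.
    """
    h, w = fusion_map.shape
    n = len(particles)
    particles = particles.astype(float)   # astype returns a copy

    # Reset a group of particles to the peak of the fusion map.
    peak_y, peak_x = np.unravel_index(np.argmax(fusion_map), fusion_map.shape)
    idx = rng.choice(n, size=int(reset_frac * n), replace=False)
    particles[idx] = (peak_x, peak_y)

    # Diffusion: random-walk motion model (assumed), clipped to the image.
    particles += rng.normal(0.0, sigma, particles.shape)
    particles[:, 0] = np.clip(particles[:, 0], 0, w - 1)
    particles[:, 1] = np.clip(particles[:, 1], 0, h - 1)

    # Update: the fusion-map value at each particle becomes its weight.
    xi = np.rint(particles[:, 0]).astype(int)
    yi = np.rint(particles[:, 1]).astype(int)
    weights = fusion_map[yi, xi] + 1e-12
    weights /= weights.sum()
    estimate = weights @ particles

    # Resampling: draw particles in proportion to their weights.
    resampled = particles[rng.choice(n, size=n, p=weights)]
    return resampled, estimate
```

Because some particles always start at the map peak, the filter can recover even when all previous particles have drifted to a region where the fusion map is close to zero.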



Experiments and Discussions


In this section, the proposed tracker is evaluated on the AV16.3 corpus (Lathoud, Odobez, and Gatica-Perez 2004), which provides the true 3D mouth location derived from calibrated cameras and 2D measurements on the various images for systematic assessment. The audio data is recorded at a sampling rate of 16 kHz by two circular eight-element microphone arrays placed 0.8 m apart on the table. The images are captured at 25 Hz by three monocular color cameras installed in three corners of the room, with size H × W = 288 × 360.

Experiments are conducted on seq08, seq11, and seq12, in which a single participant wanders around, moves quickly, and speaks intermittently. Each set of experiments uses signals from the two microphone arrays and an individual camera.




Implementation Details

Visual cues are generated by a pretrained Siamese network (Bertinetto et al 2016) based on the AlexNet backbone. The reference image set I contains two target rectangles with scales of 1 and 1.25, which are defined by the user in the first frame. For audio measurement, the numbers of 2D sampling points in the horizontal and vertical directions on the image plane are w = 20 and h = 16. A 0.8 m high table is placed in a (3.6 × 8.2 × 2.4) m room. Therefore, the sampling points located outside the room or below the tabletop are removed, which is consistent with the real setup and avoids the ambiguity caused by the symmetry of the circular microphone array. The number of depths of the projected 3D points is set to d = 6. The speech signal is split into 40 ms frames by a Hamming window with a frame shift of 1/2.

The parameters for calculating stGCF are set to M = 120, m1 = 15, m2 = 5. The backbone of the attention network is MobileNetV3-Large (Howard et al 2019). The network is trained on the single-speaker sequences seq01, seq02, and seq03, which contain more than 4500 samples. The parameters for generating the self-supervised labels are set to Da = 5, Dv = 2, n = 6.

All models are trained for 20 epochs with batch size 16 and learning rate 0.01. Our method and the comparison methods are based on the Sampling Importance Resampling (SIR) PF for tracking. The number of particles is set to 100. Our source code is available at https://github.com/liyidi/MPT.




Evaluation Metrics

Mean Absolute Error (MAE) and Accuracy (ACC) are used to evaluate the performance of the tracking methods. MAE is the Euclidean distance in pixels between the estimated position and the ground truth (GT), summed over all frames and divided by the number of frames. ACC is the percentage of correct estimates, that is, estimates whose error distance in pixels does not exceed 1/2 of the diagonal of the GT bounding box.
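The two metrics can be computed as in the following sketch, which assumes per-frame 2D estimates, GT positions, and GT bounding-box sizes are available as arrays. The function name `mae_and_acc` and the argument layout are hypothetical.

```python
import numpy as np

def mae_and_acc(est, gt, bbox_wh):
    """Compute MAE (pixels) and ACC (percent) over a sequence.

    est, gt: arrays (N, 2) of estimated and ground-truth (x, y) positions.
    bbox_wh: array (N, 2) with the width and height of each GT bounding box.
    """
    err = np.linalg.norm(est - gt, axis=1)                  # per-frame error
    half_diag = 0.5 * np.hypot(bbox_wh[:, 0], bbox_wh[:, 1])
    mae = err.mean()
    acc = 100.0 * np.mean(err <= half_diag)                 # share of correct frames
    return mae, acc
```

Since the threshold scales with the GT bounding box, an error of a given size counts as correct for a close, large target but may count as a miss for a distant, small one.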



Ablation Study and Analysis

(AM: audio measurement, AN: attention network, TR: tracker, AvgAtt: average attention, MPAtt: multi-modal perception attention, IPF: improved PF, Org: original dataset)




Visualization Analysis




In this paper, we propose a novel multi-modal perception tracker for the challenging audio-visual speaker tracking task. We also propose a new multi-modal perception attention network and a new acoustic map extraction method. The proposed tracker utilizes the complementarity and consistency of multiple modalities to learn the availability and reliability of observations from the different modalities in a self-supervised fashion. Extensive experiments demonstrate that the proposed tracker is superior to current state-of-the-art counterparts and is notably robust under adverse conditions. Lastly, the intermediate process is visualized to demonstrate the interpretability of the proposed tracker network.







