Paper Close Reading: Segment Beyond View

Abstract

Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have a limited field-of-view (FoV) with front or downward perspectives. Addressing this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses the information beyond the FoV, with auditory information, using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings.

Summary:

Auditory information is used to supplement what lies beyond the field of view, enabling localisation and segmentation of out-of-view objects.

The teacher models provide global (panoramic, omnidirectional) information to the student model, and the student performs the task.

Introduction

A few research works address the road safety issue by leveraging the cameras on mobile devices and HMDs. For example, Wang et al (2012) and Tong, Jia, and Bao (2021) used onboard cameras to detect and predict vehicle trajectories and warn the user of potential collisions. Kang, Lee, and Han (2019) presented a system for detecting ground obstacles along the path of pedestrians. Nonetheless, the focus on compactness and user comfort in mobile devices limits the placement of the camera system, resulting in a restricted field of view (FoV) that only marginally exceeds human vision. As a consequence, both the user and the AI model are blind to potential road hazards, which frequently originate from areas outside the current FoV. Researchers have also explored the use of audio signals to infer out-of-view objects (Manori et al 2018; Mizumachi et al 2014; Rovetta, Mnasri, and Masulli 2020), yet these approaches often lack precision in locating the position of oncoming vehicles.

Summary:

Existing methods are not enough: trajectory prediction still requires the vehicle to be within the field of view and cannot infer hazards outside it.

Recognising this challenge, we introduce a new semantic segmentation task for objects beyond the field of view, with a benchmark that focuses on identifying oncoming vehicles for HMD users' safety. This introduces a novel partially missing modality problem, where the model only has access to partial information within a specific modality, such as a constrained FoV or monaural audio of the surrounding environment. This problem formulation diverges from both the conventional multi-modality problem, which aims to enhance downstream tasks, such as tracking or segmentation accuracy, by leveraging multiple modalities (Valverde, Hurtado, and Valada 2021; Chakravarthula et al 2023; Zürn and Burgard 2022; Kim and Kim 2023), and the cross-modality problem, which focuses on knowledge transfer from one modality to another in cases of absent information (Gan et al 2019; Dai et al 2022). In contrast, our focus is specifically on situations where a modality is only partially missing, which provides the opportunity to utilise the available signal in conjunction with data from other modalities. It should be noted that the partially missing modality problem can be seen as a specific, yet significant, case within the cross-modality problem, where one modality is partially missing (e.g., limited FoV vs panorama).

Summary:

The goal differs from existing multimodal tasks: one modality is used to complete the partially missing part of another.

To tackle this task, we propose Segment Beyond View (SBV), an audio-visual semantic segmentation method that supplements the visual modality, which partially misses the information beyond the FoV, with auditory information. SBV is driven by a teacher-student distillation model, which we term Omni2Ego, comprising a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. Fig. 1 shows an illustration of our task. Adapting the Omni Auditory Perception Dataset (Dai et al 2022; Vasudevan, Dai, and Van Gool 2020) to the proposed task, the results suggest that our method outperforms state-of-the-art audio-visual semantic segmentation methods (Zhou et al 2022, 2023) and maintains consistent performance across different FoV ranges and in monaural audio environments.

Summary:

An Omni2Ego teacher-student distillation model generates semantic segmentation for objects beyond the field of view.

The figure on the right (Fig. 1) describes the new task: using only the limited FoV and binaural audio, the model semantically segments both in-view and out-of-view vehicles across the panorama.

Our work makes the following contributions: (1) Presenting a simple yet effective framework, termed Segment Beyond View (SBV), that leverages the partially-available information in one modality and complements it with information from another modality to perform out-of-view semantic segmentation; (2) Introducing a novel out-of-view semantic segmentation task and its associated benchmark based on a public dataset; (3) Demonstrating the superior performance of SBV through comparison with state-of-the-art models and presenting ablation studies examining various degrees of partially missing modality and different model architectures. Additionally, our task has potential implications for robot navigation, autonomous vehicles and road safety in general.

Related Work

Pedestrian Safety

All of the related work relies on the camera being able to capture the approaching vehicle.

Multimodal Learning with Missing Modality

Multimodal learning with missing modalities has gained much attention recently. Some methods aim to make predictions even when some modalities are unavailable during training or testing. Some approaches, such as those by Recasens et al (2023) and Ma et al (2022), apply masks or optimise multi-task strategies to handle missing modalities. Other methods handle missing modalities by predicting weights (Miech, Laptev, and Sivic 2018) or using a combinatorial loss (Shvetsova et al 2022). A recent work (Li, Liu, and Tang 2022) proposes an audio-visual tracker that can localise speaker targets in the absence of the visual modality.

However, those methods require modality-complete training data. SMIL (Ma et al 2021) and ShaSpec (Wang et al 2023) are developed specifically for handling multimodal learning with missing modalities during both training and testing. Still, the above methods all assume that one or more modalities are missing entirely, rather than certain modalities being partially missing. In contrast, our partially missing modality task assumes that all modalities exist, but each of them is partially missing.

Summary:

Compared with existing missing-modality tasks, SBV does not assume an entire modality is missing; instead, every modality is partially missing.

Audio-Visual Segmentation

Unseen sound-making objects also matter, but existing audio-visual models focus on localising visible sound-making objects.

Method

Problem Definition

We are interested in audio-visual semantic segmentation with partially missing modality. We are given an audio-visual dataset containing omnidirectional auditory and visual information: panoramas and binaural audio in four directions (front, back, left, and right). All panoramas and audio are accessible during training, but only a partial view of the panorama and the binaural audio of one direction are available at test time. Also, the training data contains no manual annotations. Since our scenario is pedestrian road safety, we define the partially visible view as the first-person view of the pedestrian. Formally, given a dataset D = {Da, Dv}, where Da denotes the auditory modality part and Dv the visual modality part, Da contains binaural audio from 4 directions and Dv contains panoramas. For each panorama in the dataset, a randomly generated head rotation is assigned and the corresponding binaural audio is selected. We follow previous works (Gao and Grauman 2019; Dai et al 2022; Gao et al 2020) to transfer the selected binaural audio to spectrograms, and denote x^isp and y^dsp as the spectrograms of the input audio signals (isp) and the spectrograms of their difference (dsp) with other directions. We denote x^osp as the spectrograms of the audio in the other directions. We denote the first-person view (FPV) generated based on the FoV and head rotation as x^fpv for the corresponding panoramic image x^v. Our task is to use such a dataset to train a model that can semantically segment the vehicles in the surrounding environment when only the FPV and binaural audio are available.

Summary:

Dataset: panoramas and binaural audio in four directions (front, back, left, right); the selected binaural audio is converted into spectrograms.

Training regime: all panoramas and audio are accessible during training, but at test time only a partial view of the panorama and the binaural audio of one direction are available.

Built-in data augmentation: each panorama is assigned a randomly generated head rotation, and the corresponding binaural audio is selected.
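The exact STFT settings are not given in these excerpts; the sketch below is a minimal illustration of how the input spectrograms x^isp and the difference spectrograms y^dsp could be produced with librosa, assuming a 512-point FFT (which matches the 257 frequency bins mentioned in the implementation details), an arbitrary hop length, and one plausible reading of the "difference with other directions".

```python
# Minimal sketch, not the authors' code: STFT parameters and the exact definition of the
# difference spectrograms are assumptions.
import numpy as np
import librosa

def magnitude_spectrogram(y, n_fft=512, hop_length=160):
    return np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

def input_spectrograms(left, right):
    """x^isp: spectrograms of the selected binaural pair (left/right ears)."""
    return np.stack([magnitude_spectrogram(left), magnitude_spectrogram(right)])

def difference_spectrograms(left, right, other_left, other_right):
    """y^dsp: spectrograms of the difference between the selected pair and the
    corresponding ears of another direction (one plausible reading of 'dsp')."""
    return np.stack([magnitude_spectrogram(left - other_left),
                     magnitude_spectrogram(right - other_right)])
```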

Partially Missing Settings

The “partially missing” setting has visual and auditory parts. The visible part of the panorama is the first-person view generated by the head rotation; for the visual modality, the “partially missing” part is the out-of-view scene. The binaural audio is selected according to the head rotation; for the auditory modality, the “partially missing” part is the binaural audio from the other directions. Head rotation consists of three parameters: horizontal, vertical, and in-plane rotation. Formally, u ∈ (−180°, 180°), v ∈ (−90°, 90°), rot ∈ (−180°, 180°), where “u”, “v” and “rot” represent the horizontal, vertical and in-plane rotation viewing angles, respectively. We visualise the binaural sound selection process in Fig. 2. Formally,

where {1, ..., 8} are the id numbers of the microphones and F(·) denotes the mapping function for the id numbers in Fig. 2. Regarding the first-person view, we opt for a binocular overlap area equivalent to that of human eyes, measuring 135° vertically and 120° horizontally (Wandell 1995), as this is crucial for comprehensive environmental perception.
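The actual mapping F(·) and the perspective projection follow the paper's Fig. 2 and equations, which are not reproduced in these notes. The sketch below only illustrates the shape of the pipeline: a hypothetical mapping from the horizontal angle u to one of the four recorded microphone pairs, and a crude equirectangular crop standing in for a proper perspective (gnomonic) projection.

```python
# Illustrative sketch only; F(.) and the FPV projection are placeholders, not the paper's definitions.
import random
import numpy as np

def sample_head_rotation():
    """Sample (u, v, rot): horizontal, vertical and in-plane viewing angles in degrees."""
    return (random.uniform(-180, 180), random.uniform(-90, 90), random.uniform(-180, 180))

def select_binaural_pair(u):
    """Hypothetical stand-in for F(.): pick the recorded pair whose direction is closest to u."""
    directions = ["front", "right", "back", "left"]   # the four 3Dio pairs
    return directions[int(((u + 45) % 360) // 90)]

def first_person_view(pano, u, v, fov_w=120, fov_h=135):
    """Crude crop of the equirectangular panorama around (u, v); a faithful version would
    use a gnomonic projection and also apply the in-plane rotation rot."""
    H, W = pano.shape[:2]
    cx, cy = int((u + 180) / 360 * W), int((v + 90) / 180 * H)
    half_w, half_h = int(fov_w / 360 * W) // 2, int(fov_h / 180 * H) // 2
    xs = [(cx + dx) % W for dx in range(-half_w, half_w)]                # wrap horizontally
    ys = [min(max(cy + dy, 0), H - 1) for dy in range(-half_h, half_h)]  # clamp vertically
    return pano[np.ix_(ys, xs)]
```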

Sound-Making Objects Extraction

We use the following steps to generate foreground masks (M^fg) for the sound-making vehicles, in line with the previous work (Dai et al 2022). We use the GSoC algorithm (Samsonov 2017) in OpenCV (Bradski 2000) to extract video backgrounds instead of the simple one mentioned in the previous work. Given a panoramic image (x^v) and a background (x^bg), we first use a semantic segmentation model pre-trained on the Cityscapes dataset (Cordts et al 2016) to get their semantic segmentation results y^seg and y^bg. M^fg is generated using the following formula:

where (h, w) denotes the pixel coordinates, 1 keeps the pixels that belong to sound-making vehicles and 0 means otherwise; c1, c2, c3 denote car, tram and motorcycle respectively. We achieve results (mIoU: 65.39%) on the AuditoryTestManual dataset (see Section “Dataset”) similar to the previous work (mIoU: 65.35%) using the same DeepLabv3+ (Chen et al 2018) framework. Therefore, we consider that our method can successfully generate the foreground masks, as shown by some of the results in Fig. 5.
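The mask formula itself is not reproduced above, so the sketch below implements one plausible reading of the described procedure: GSoC background extraction via opencv-contrib, a Cityscapes-pretrained model (here an abstract `segment` callable) applied to both the panorama and the background, and a mask that keeps sound-making-vehicle pixels that do not also appear in the static background. The class ids are hypothetical.

```python
# Sketch of the described pipeline, not the authors' code.
import numpy as np
import cv2  # requires opencv-contrib-python for cv2.bgsegm

SOUND_MAKING = {13, 16, 17}  # hypothetical Cityscapes train ids for car, tram (train), motorcycle

def extract_background(frames):
    """Estimate a static background for a clip with the GSoC background subtractor."""
    gsoc = cv2.bgsegm.createBackgroundSubtractorGSOC()
    for frame in frames:
        gsoc.apply(frame)
    return gsoc.getBackgroundImage()

def foreground_mask(pano, background, segment):
    """M^fg(h, w) = 1 where the panorama pixel belongs to a sound-making vehicle class
    and differs from the background segmentation at the same location; 0 otherwise."""
    y_seg = segment(pano)        # (H, W) class-id map of the panorama x^v
    y_bg = segment(background)   # (H, W) class-id map of the background x^bg
    keep = np.isin(y_seg, list(SOUND_MAKING)) & (y_seg != y_bg)
    return keep.astype(np.uint8)
```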

Model Architecture

To the best of our knowledge, we are the first to address audio-visual semantic segmentation with partially missing modality, and we also propose a new strong baseline. We adopt a teacher-student distillation framework to train an encoder-decoder based model: Segment Beyond View (SBV). The overall training architecture is shown in Fig. 3. For the visual and auditory data, we use two separate encoders (the subscript vs denotes the visual one and as the auditory one).
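A minimal sketch of how the student could be wired, assuming PyTorch; every submodule (encoders, fusion, decoder, heads) is a placeholder rather than the authors' exact definition.

```python
# Illustrative wiring of the SBV student; all submodules are placeholders.
import torch.nn as nn

class SBVStudent(nn.Module):
    def __init__(self, visual_encoder, audio_encoder, avffm, decoder,
                 seg_head, img_recon_head, audio_recon_head):
        super().__init__()
        self.visual_encoder = visual_encoder      # e.g. ResNet-50 + ASPP (see implementation details)
        self.audio_encoder = audio_encoder        # SoundNet-style encoder on binaural spectrograms
        self.avffm = avffm                        # audio-visual feature fusion module
        self.decoder = decoder
        self.seg_head = seg_head                  # panoramic semantic segmentation
        self.img_recon_head = img_recon_head      # auxiliary panorama reconstruction
        self.audio_recon_head = audio_recon_head  # auxiliary spectrogram-difference prediction

    def forward(self, fpv, binaural_spec):
        f_v = self.visual_encoder(fpv)            # features of the limited-FoV view
        f_a = self.audio_encoder(binaural_spec)   # features of the binaural audio
        feats = self.decoder(self.avffm(f_v, f_a))
        return (self.seg_head(feats),
                self.img_recon_head(feats),
                self.audio_recon_head(feats))
```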

Audio-Visual Feature Fusion Module (AVFFM)

AVFFM is an attention module that connects cross-modal features. Before the features are fed into AVFFM, the audio feature map is aligned with the visual feature map and the auditory features are updated accordingly; the two are then concatenated and a dot-product-measured attention is applied (Zhou et al 2022).

The visual feature map has shape H×W×C, the auditory feature map has shape H×W×C', and the concatenated feature map has shape H×W×(C+C').

In the attention formula, Q, K and V are all 1×1 convolution layers, and N = H×W is the normalisation factor.
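A sketch of a concatenation-plus-dot-product attention of this kind, assuming PyTorch; the exact way the auditory features are updated before fusion and the residual/output arrangement in the paper may differ.

```python
# Sketch of the fusion: align, concatenate, then dot-product attention normalised by N = H*W.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFFM(nn.Module):
    def __init__(self, c_v, c_a, c_out):
        super().__init__()
        c = c_v + c_a
        self.q = nn.Conv2d(c, c_out, kernel_size=1)  # Q, K, V are 1x1 convolution layers
        self.k = nn.Conv2d(c, c_out, kernel_size=1)
        self.v = nn.Conv2d(c, c_out, kernel_size=1)

    def forward(self, f_v, f_a):
        # Align the audio feature map to the visual map's HxW, then concatenate along channels.
        f_a = F.interpolate(f_a, size=f_v.shape[-2:], mode="bilinear", align_corners=False)
        x = torch.cat([f_v, f_a], dim=1)                        # B x (C + C') x H x W
        b, _, h, w = x.shape
        n = h * w                                               # normalisation factor N = H x W
        q = self.q(x).flatten(2).transpose(1, 2)                # B x N x C_out
        k = self.k(x).flatten(2)                                # B x C_out x N
        v = self.v(x).flatten(2).transpose(1, 2)                # B x N x C_out
        attn = torch.softmax(torch.bmm(q, k) / n, dim=-1)       # B x N x N
        return torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)
```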

Each decoder stage outputs features of shape H_g × W_g × C_g (H_img and W_img are the original height and width of the input image), with C_g = 512. The segmentation head restores the feature map to the original size and then classifies each pixel.

The decoder's final output corresponds to the stage output features at g = 0.

The segmentation head's output has shape H_img × W_img × K, where K is the number of classes; a softmax finally produces the result.

Audio and Image Reconstruction Head

To accomplish the task of capturing out-of-view information, image and audio reconstruction are introduced as auxiliary tasks.

g = 3

The reconstruction tasks are applied to the decoder output: the image is reconstructed to H_img × W_img × 3, while the audio reconstruction adopts the method of Gao and Grauman (2019) and predicts the differences between channels, e.g., the left-ear audio predicts the left-ear audio of the other directions. The predicted spectrogram differences are obtained through the audio reconstruction head and some post-processing.
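According to the implementation details later in the post, the image reconstruction head is three convolutions plus one upsampling layer and the audio reconstruction head is five convolutions; the sketch below assumes that structure, with hypothetical channel widths and output sizes.

```python
# Hypothetical reconstruction heads matching the stated layer counts; widths/shapes are assumptions.
import torch.nn as nn

class ImageReconHead(nn.Module):
    """Three convolutions + one upsampling layer; reconstructs the panorama (3 output channels)."""
    def __init__(self, in_ch=512, scale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 1),
            nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
        )

    def forward(self, feats):
        return self.net(feats)

class AudioReconHead(nn.Module):
    """Five convolutions; predicts the spectrogram differences w.r.t. the other directions."""
    def __init__(self, in_ch=512, out_ch=6):  # e.g. 2 ears x 3 other directions (assumed)
        super().__init__()
        chans = [in_ch, 256, 128, 64, 32, out_ch]
        layers = []
        for i in range(5):
            layers.append(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1))
            if i < 4:
                layers.append(nn.ReLU(inplace=True))
        self.net = nn.Sequential(*layers)

    def forward(self, feats):
        return self.net(feats)
```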

Omni2Ego Distillation

Knowledge Distillation (KD) attempts to preserve useful knowledge from the teacher in the student, as the teacher can acquire more information than the student during training. Since the student cannot access the full modality information at test time, KD can thus help to improve performance. We propose the “Omni2Ego” distillation method, which is extremely simple yet effective and distils omnidirectional information into the egocentric perspective, in both the visual and auditory aspects. Specifically, we distil panoramic visual information into the first-person view to complete the missing visual part at the feature level. We also distil 8-channel audio information into 2-channel binaural audio to complete the missing auditory information from the other directions. We choose the encoder-decoder based SegFormer (Xie et al 2021) and the 8-channel SoundNet (Dai et al 2022) as our visual and auditory teachers. Our method is divided into feature alignment and logits distillation parts.

Summary:

The Omni2Ego distillation method uses SegFormer and SoundNet as teacher models and performs two kinds of distillation: feature alignment and logits distillation.

Feature Alignment Distillation

During training, linear layers and interpolation operations are (presumably) used to align the features of the teacher and student networks; they are discarded at test time.
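A minimal sketch of feature-alignment distillation under these assumptions: a 1×1 convolution (a per-location linear projection) plus interpolation brings the student feature to the teacher feature's shape, the loss is the L2 distance (as stated under Feature Alignment Loss below), and the projection is only used during training.

```python
# Training-only feature alignment between student and (frozen) teacher features.
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    def __init__(self, student_ch, teacher_ch):
        super().__init__()
        self.proj = nn.Conv2d(student_ch, teacher_ch, kernel_size=1)  # linear projection per pixel

    def forward(self, student_feat, teacher_feat):
        aligned = self.proj(student_feat)
        aligned = F.interpolate(aligned, size=teacher_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        return F.mse_loss(aligned, teacher_feat.detach())  # L2 alignment loss; teacher not updated
```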

Logits Distillation

Segmentation is computed on the teacher model's decoder output, using the loss in Eqn. 2.

Training Objectives

The objective function for training is divided into three parts: Feature Alignment Loss, Logits Distillation Loss and Modality Reconstruction Loss, as shown in Fig. 3.

Feature Alignment Loss (FAL)

The alignment loss is computed with the L2 norm.

Logits Distillation Loss (LDL)

LDL is divided into visual and auditory parts. Since we only focus on three categories of moving sound-making objects, in order to make the student model pay more attention to the important features from the teacher model, we use L1 loss for logits distillation:

Modality Reconstruction Loss (MRL)

MRL has two parts: one reconstructs the panoramic image, and the other reconstructs the binaural audio of the other directions:

Overall Loss (OL)

λa and λv are trade-off factors for the audio and visual feature alignment terms; βa and βv are trade-off factors for the audio and visual logits terms; γa and γv are coefficients for the audio and image reconstruction terms.
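The loss equations themselves are not reproduced in these notes; the sketch below combines the three families as described above (L2 feature alignment, L1 logits distillation, reconstruction) with the six coefficients whose values appear in the implementation details. The use of L1 for the reconstruction terms and the exact tensors being compared are assumptions.

```python
# Sketch of the overall objective; term definitions are a plausible reading, not the paper's equations.
import torch.nn.functional as F

def overall_loss(out, teacher, target,
                 lam_a=0.05, lam_v=0.05,    # feature alignment trade-offs (audio / visual)
                 beta_a=0.1, beta_v=0.4,    # logits distillation trade-offs (audio / visual)
                 gam_a=0.02, gam_v=0.02):   # reconstruction coefficients (audio / image)
    # Feature Alignment Loss (L2) against the two teachers' features.
    fal = lam_v * F.mse_loss(out["feat_v"], teacher["feat_v"]) \
        + lam_a * F.mse_loss(out["feat_a"], teacher["feat_a"])
    # Logits Distillation Loss (L1) against the visual and auditory teachers' logits.
    ldl = beta_v * F.l1_loss(out["logits"], teacher["logits_v"]) \
        + beta_a * F.l1_loss(out["logits"], teacher["logits_a"])
    # Modality Reconstruction Loss: panorama and other-direction spectrogram differences.
    mrl = gam_v * F.l1_loss(out["pano_recon"], target["pano"]) \
        + gam_a * F.l1_loss(out["audio_recon"], target["audio_diff"])
    return fal + ldl + mrl
```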

Experiments

Dataset

Existing omnidirectional audio-visual semantic segmentation datasets for road safety are limited. The Omni Auditory Perception Dataset (Dai et al 2022; Vasudevan, Dai, and Van Gool 2020) contains 64,250 two-second video clips with 8-channel audio of city traffic in Zurich, recorded by a 360° GoPro Fusion camera and 4 pairs of 3Dio binaural microphones in four directions (front, back, left and right). In addition to the normal training set (51,400) and validation set (6,208), it contains two test sets: the AuditoryTestPseudo dataset (6,492) and the AuditoryTestManual dataset. The annotations for the AuditoryTestPseudo dataset are generated by a model pre-trained on the Cityscapes dataset (Cordts et al 2016). Most objects that need to be segmented are in the equatorial region of the panorama without obvious distortion (Dai et al 2022), so those pre-trained models can achieve satisfactory performance. The AuditoryTestManual dataset is a manually labelled dataset containing a total of 80 images, but it covers a variety of scenarios including rain, fog, night and daylight. The dataset contains three categories: car, tram, and motorcycle. Each sample has a panorama that is the middle frame of the video clip and eight 2-second audio clips (this setting follows previous works (Gan et al 2019; Dai et al 2022)).

Implementation Details

We train models using NVIDIA A100 GPUs. We use Adam (Kingma and Ba 2014) as the optimiser and set the learning rate to 1 × 10−5. We use the one-cycle policy (Smith and Topin 2019) as our learning-rate decay strategy. All images are resized to 480 × 480. The spectrogram size is set to 257 × 601. All student models are trained for 50 epochs to ensure that the loss converges. For Eqn. 7, we set βa = 0.1 and βv = 0.4 for logits distillation; for the feature distillation part, we set all λ = 0.05 and all γ = 0.02.

We choose SegFormer (Xie et al 2021) pretrained on the Cityscapes dataset and the 8-channel SoundNet (Dai et al 2022) as teacher models. For the student's visual encoder, we followed previous work (Zhou et al 2022) and chose ResNet50 (He et al 2016) with an Atrous Spatial Pyramid Pooling (ASPP) module (Chen et al 2018); the auditory encoder is the same as SoundNet's encoder. The segmentation head consists of three convolution layers and one interpolation operation. The image reconstruction head consists of three convolution layers and one upsampling layer. The audio reconstruction head has 5 convolution layers.
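A sketch of the stated training configuration, assuming PyTorch's Adam optimiser and OneCycleLR scheduler correspond to the Adam and one-cycle policy above; max_lr is a placeholder since only the base learning rate is given.

```python
# Training configuration sketch; the model and data loader are placeholders.
import torch

def make_optimizer(model, steps_per_epoch, epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    # One-cycle learning-rate policy (Smith and Topin 2019); max_lr is an assumption.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-4, epochs=epochs, steps_per_epoch=steps_per_epoch)
    return optimizer, scheduler

# Other stated settings: images resized to 480 x 480, spectrograms of size 257 x 601,
# beta_a = 0.1, beta_v = 0.4, all lambda = 0.05, all gamma = 0.02.
```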

Overall Performance

Following previous works (Vasudevan, Dai, and Van Gool 2020; Dai et al 2022; Zhou et al 2022), we present the Fβ-score (β = 0.3) and mean Intersection-over-Union (mIoU) of the following baseline methods and our models in Tab. 1, since our task is still an audio-visual semantic segmentation task. Our task divides the panorama into the first-person view and the out-of-view area; we apply the above two metrics to both areas, which is simple to realise: we just apply first-person-view and out-of-view masks to the segmentation results and the ground truth and then run the evaluation.
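A minimal sketch of the masked evaluation described above, assuming integer class maps and a boolean region mask (FPV or out-of-view); only mIoU is shown, and the Fβ-score would be computed analogously inside the same mask.

```python
# Region-masked mIoU: evaluate predictions only inside the chosen region mask.
import numpy as np

def masked_miou(pred, gt, region_mask, num_classes):
    """pred, gt: (H, W) integer class maps; region_mask: (H, W) boolean array."""
    ious = []
    for c in range(num_classes):
        p = (pred == c) & region_mask
        g = (gt == c) & region_mask
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from this region; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```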

Vision Models

We also choose SegFormer (Xie et al 2021) with only first-person view input to verify that it is impossible to achieve satisfactory performance with visual inputs alone. “SBV-V” is a variant of our model in which we disable the auditory encoder, AVFFM and image reconstruction and only apply visual feature alignment and logits distillation. SBV-V with only first-person view input achieves higher performance than SegFormer with only first-person view input by using our visual distillation method from panorama to first-person view, resulting in average increases of 3.5 / 2.1% mIoU for out-of-view / overall.

Summary:

The distillation framework brings a performance gain.

Auditory Models

We use the 2-channel SoundNet (Dai et al 2022) as the auditory-input-only method; it is used to show that panoramic semantic segmentation using only binaural audio is challenging. “SBV-A” is another variant of our SBV in which we disable the visual encoder, AVFFM and audio reconstruction and only apply auditory feature alignment and logits distillation. SBV-A with only binaural audio input also outperforms the 2-channel SoundNet, thanks to our auditory distillation method, by about 2.2 / 3.2% mIoU on the AuditoryTestPseudo and AuditoryTestManual datasets respectively.

Summary:

From 8 channels down to 2 channels; the distillation framework again brings a performance gain.

Audio-Visual Models

We choose TPAVI (Zhou et al 2022, 2023), the first and state-of-the-art audio-visual semantic segmentation model, for comparison with our model.

We train it using the first-person view and binaural audio as inputs. From Tab. 1, we find that our SBV shows strong advantages over TPAVI (Zhou et al 2022, 2023), not only in overall performance but also in out-of-view areas, where it performs particularly well.

Compared to TPAVI, our SBV improves mIoU by 7.7 / 11.4% in the out-of-view area, by 5.5 / 7.6% overall, and slightly, by 2.2 / 1.2%, in the first-person view, on the AuditoryTestPseudo and AuditoryTestManual datasets respectively. Fig. 5 shows some segmentation results of TPAVI and our model. We can clearly see that our model segments more objects outside the field of view, and those objects are more defined. This shows that our model can focus not only on in-view objects but also on out-of-view objects. In addition, due to the Omni2Ego distillation and the MRL, our SBV can better reconstruct the information of out-of-view objects at the feature level. In the fourth row of Fig. 5, our model has a better representation of the shape of the tram at the edge of the first-person view compared to TPAVI. Moreover, our model also outperforms the 8-channel auditory teacher. We found that leveraging the partially missing visual and auditory modalities is critical to achieving the desired performance, and we achieve satisfactory performance when utilising both modalities.

Summary:

Compared with the existing SOTA, performance improves across the board, and neither modality can be dropped.

Analyses

Impact of Field of View Size

We test our model with different sizes of FoV (binocular and monocular) as input to evaluate the robustness of SBV. We find that the performance is relatively stable; the results are shown in Fig. 4. On the whole, the performance in the first-person view (FPV) fluctuates slightly, but as the out-of-view area increases, the overall performance decreases. With a larger FoV (160° width, 175° height), our SBV achieves better results because it can better exploit the larger FPV.

SBV can still maintain good performance with a small FoV (80° width, 95° height), which should be due to the use of the Omni2Ego distillation and the MRL. Also, since our AVFFM makes full use of the auditory and visual information, SBV can focus not only on in-view objects but also on out-of-view objects. On the other hand, even when the FPV is small or no sound-making objects of interest fall within it, the FPV still provides directionality for SBV.

Mono vs. Binaural

Many real-world devices have at least a single microphone. We randomly drop an “ear” for testing. Results are shown in Tab. 2. We find that the performance of SBV does not drop significantly. The FPV mIoU is slightly affected, while the out-of-view mIoU drops by about 3.5 / 5.5% on the AuditoryTestPseudo and AuditoryTestManual datasets respectively. This means that the segmentation of out-of-view objects is more dependent on the auditory information than that of in-view objects, because, from a cognitive point of view, binaural audio provides more positional information than mono audio (Blauert 1997; Kendall 1995).
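The paper only says an “ear” is randomly dropped at test time; one plausible way to simulate this, assuming the model still expects a two-channel spectrogram input, is to duplicate the kept channel as sketched below (whether the authors zero or duplicate the dropped channel is not stated).

```python
# Hypothetical mono-input simulation for testing.
import random
import torch

def drop_random_ear(binaural_spec):
    """binaural_spec: tensor of shape (2, F, T). Keep one random channel and duplicate it,
    so the input keeps its shape but carries only monaural information."""
    keep = random.randint(0, 1)
    mono = binaural_spec[keep:keep + 1]        # (1, F, T)
    return torch.cat([mono, mono], dim=0)      # (2, F, T) with identical channels
```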

Summary:

The drop-one-“ear” experiment shows that SBV mainly uses hearing to supplement vision.

Ablation Studies

We conduct ablation studies on Omni2Ego, AVFFM and MRL; see Tab. 3 for the results. We first consider Omni2Ego and remove this module to verify its effectiveness; we denote this variant SBV-v3.

We find that the first-person view mIoU decreases slightly (around 1.3% on both test datasets), indicating that the model learns more shape information of the different categories. In addition, the out-of-view mIoU drops by about 3 / 2% on the AuditoryTestPseudo / AuditoryTestManual test datasets, showing that distillation can indeed help the model reconstruct the missing parts of the modality at the feature level. We then consider the AVFFM and verify that it can indeed help the model focus not only on objects in the first-person view but also on objects outside the view; we denote this variant SBV-v2 in Tab. 3. We found that the out-of-view mIoU of the model drops by about 3% on average on both test datasets after removing this module.

This shows that AVFFM can help our model get more information about out-of-view objects from auditory signals.

Finally, we consider the MRL, which is expected to help the model reconstruct the partially missing modalities. The results of SBV-v3 in Tab. 3 show that the MRL helps our model improve out-of-view and overall mIoU by about 2% on average on both test datasets.

Summary:

v3 removes Omni2Ego, dropping the two distillation objectives: the feature alignment loss and the segmentation (logits distillation) loss.

v2 removes AVFFM, dropping the global attention.

v1 removes MRL.

Conclusion

In this paper, we are the first to introduce and tackle a challenging and novel problem in the field of audio-visual semantic segmentation – the partially missing modality issue for multimodal learning. We propose a simple yet efficient framework named Segment Beyond View (SBV) to address this issue. The SBV model leverages Omni2Ego distillation, an attention mechanism, and a Modality Reconstruction Loss to handle this problem. In the experiments, the proposed model achieves promising segmentation accuracy under different evaluation metrics compared to other models. Through extensive analyses, robust performance is achieved across different sizes of field of view and with mono audio, and the effectiveness of each module is further verified. Despite the very exciting out-of-view semantic segmentation results in this paper, the trained model might fail in a completely different scene, e.g. a non-urban landscape with a different distribution. This limitation is shared by many similar audio-visual segmentation works (Dai et al 2022). Examining the generalisability of these models is a promising future direction with significant practical implications. Due to those limitations, we did not explore videos from AR and pedestrian perspectives. As the model matures, it will be beneficial to test the model in controlled street-crossing user studies.
