Look, Listen and Learn
(2017 ICCV)
Relja Arandjelović, Andrew Zisserman
Notes
Contributions
We introduce a novel Audio-Visual Correspondence (AVC) learning task that is used to train the two (visual and audio) networks from scratch. The AVC task is a simple binary classification task: given an example video frame and a short audio clip, decide whether they correspond to each other or not. Corresponding (positive) pairs are taken at the same time from the same video, while mismatched (negative) pairs are extracted from different videos.
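As a minimal illustration of the objective, here is a PyTorch sketch of the AVC loss as 2-way classification; the two-stream `model` is a placeholder for the networks described under Method below:

```python
import torch.nn.functional as F

def avc_loss(model, frames, spects, labels):
    """AVC as 2-way classification (correspond vs. mismatch).

    frames: (B, 3, 224, 224) colour images
    spects: (B, 1, 257, 199) log-spectrograms
    labels: (B,) int64, 1 = corresponding pair, 0 = mismatched pair
    """
    logits = model(frames, spects)   # (B, 2) logits from the fused network
    return F.cross_entropy(logits, labels)
```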
Method
Vision subnetwork. The input to the vision subnetwork is a 224 × 224 colour image, and the network follows the VGG-network style.
Audio subnetwork. The input to the audio subnetwork is a 1 second sound clip converted into a log-spectrogram, which is thereafter treated as a greyscale 257 × 199 image. The architecture of the audio subnetwork is similar to the vision one.
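A rough PyTorch sketch of the two-stream design described above. The VGG-style blocks, the 512-D per-stream embeddings, and the small fusion MLP producing a 2-way output follow the paper's layout, but the exact block configuration here (two convs per block, batch norm, global max-pooling) is a simplified assumption:

```python
import torch
import torch.nn as nn

def vgg_block(c_in, c_out):
    # Two 3x3 convs with batch norm and ReLU, then 2x2 max-pooling (VGG style).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Subnet(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(in_channels, 64),
            vgg_block(64, 128),
            vgg_block(128, 256),
            vgg_block(256, 512),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)   # global pooling to a 512-D vector

    def forward(self, x):
        return self.pool(self.features(x)).flatten(1)

class AVCNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = Subnet(3)   # 224 x 224 colour image
        self.audio = Subnet(1)    # 257 x 199 log-spectrogram (greyscale)
        # Fusion: concatenate the two 512-D embeddings, then two FC layers
        # to a 2-way correspond / mismatch output.
        self.fusion = nn.Sequential(
            nn.Linear(1024, 128), nn.ReLU(inplace=True), nn.Linear(128, 2)
        )

    def forward(self, frame, spect):
        return self.fusion(torch.cat([self.vision(frame), self.audio(spect)], dim=1))
```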
Training data sampling. A non-corresponding frame-audio pair is compiled by randomly sampling two different videos and picking a random frame from one and a random 1 second audio clip from the other. A corresponding frame-audio pair is created by sampling a random video, picking a random frame in that video, and then picking a random 1 second audio clip that overlaps in time with the sampled frame.
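The sampling procedure translates directly into code. In this sketch `videos`, `.frame_at()`, and `.audio_clip()` are hypothetical helpers, and videos are assumed to be longer than the 1 second clip:

```python
import random

def sample_pair(videos, clip_len=1.0):
    """Sample one (frame, audio, label) training example."""
    if random.random() < 0.5:
        # Positive: frame and 1 s audio clip from the same video,
        # with the audio window overlapping the frame's timestamp.
        v = random.choice(videos)
        t = random.uniform(0.0, v.duration)
        start = random.uniform(max(0.0, t - clip_len),
                               min(t, v.duration - clip_len))
        return v.frame_at(t), v.audio_clip(start, clip_len), 1
    # Negative: frame and audio clip drawn from two different videos.
    v1, v2 = random.sample(videos, 2)
    frame = v1.frame_at(random.uniform(0.0, v1.duration))
    audio = v2.audio_clip(random.uniform(0.0, v2.duration - clip_len), clip_len)
    return frame, audio, 0
```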
We use standard data augmentation techniques for images: each training image is uniformly scaled such that the smallest dimension equals 256, followed by random cropping to 224 × 224, random horizontal flipping, and brightness and saturation jittering. Audio is augmented only by randomly changing the volume by up to 10%, consistently across the sample.
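A sketch of both pipelines using torchvision; the jitter strengths (0.1) are assumptions, since the notes only name the augmentation types:

```python
import random
import torchvision.transforms as T

# Image pipeline as described: scale shortest side to 256, random 224x224
# crop, random horizontal flip, brightness/saturation jitter.
image_augment = T.Compose([
    T.Resize(256),                 # shortest side -> 256
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.1, saturation=0.1),  # jitter strengths assumed
    T.ToTensor(),
])

def augment_audio(waveform):
    # Single random gain of up to +-10%, applied uniformly to the whole clip.
    gain = 1.0 + random.uniform(-0.1, 0.1)
    return waveform * gain
```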
Log-spectrogram computation. The 1 second audio is resampled to 48 kHz, and a spectrogram is computed with a window length of 0.01 seconds and a half-window overlap; this produces 199 windows with 257 frequency bands. The spectrogram is passed through a logarithm before being fed into the audio subnetwork.
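These numbers are consistent: at 48 kHz a 0.01 s window is 480 samples, half-window overlap gives a 240-sample hop, so a 1 s clip yields 1 + (48000 − 480)/240 = 199 windows, and zero-padding the FFT to 512 points gives 512/2 + 1 = 257 frequency bands. A scipy sketch; the nfft = 512 zero-padding, Hann window, and log offset are assumptions inferred from the stated output size:

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(waveform, fs=48000, eps=1e-10):
    # 0.01 s window = 480 samples; half-window overlap = 240-sample hop.
    # nfft=512 (zero-padded FFT) yields 512/2 + 1 = 257 frequency bands.
    f, t, sxx = spectrogram(waveform, fs=fs, window="hann",
                            nperseg=480, noverlap=240, nfft=512)
    return np.log(sxx + eps)   # eps avoids log(0); exact offset is assumed

clip = np.random.randn(48000)   # stand-in for a 1 s audio clip
spec = log_spectrogram(clip)
assert spec.shape == (257, 199)
```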
Results