[Paper Reading Notes] (2018 ECCV) Look, Listen and Learn

Look, Listen and Learn

(2018 ECCV)

Relja Arandjelović        Andrew Zisserman

Notes

Contributions

We introduce a novel Audio-Visual Correspondence (AVC) learning task that is used to train the two (visual and audio) networks from scratch. The AVC task is a simple binary classification task: given an example video frame and a short audio clip – decide whether they correspond to each other or not. The corresponding (positive) pairs are the ones that are taken at the same time from the same video, while mismatched (negative) pairs are extracted from different videos.

 

Method

Vision subnetwork. The input to the vision subnetwork is a 224 × 224 colour image, and the network follows the VGG-network style.

Audio subnetwork. The input to the audio subnetwork is a 1 second sound clip converted into a log-spectrogram, which is thereafter treated as a greyscale 257×199 image. The architecture of the audio subnetwork is similar to the vision one.

Training data sampling. A non-corresponding frame-audio pair is compiled by randomly sampling two different videos and picking a random frame from one and a random 1 second audio clip from the other. A corresponding frame-audio pair is created by sampling a random video, picking a random frame in that video, and then picking a random 1 second audio clip that overlaps in time with the sampled frame.
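
The sampling rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the `videos` dictionary (video id mapped to duration in seconds), the function name `sample_pair`, and the returned tuple layout are all hypothetical, and every video is assumed to be at least 1 second long.

```python
import random

def sample_pair(videos, positive, rng, clip_len=1.0):
    """Sample one frame-audio pair.

    videos: dict mapping a video id to its duration in seconds
            (hypothetical layout, assumed for this sketch).
    Returns (frame_video, frame_time, audio_video, audio_start, label).
    """
    ids = sorted(videos)
    if positive:
        vid = rng.choice(ids)
        dur = videos[vid]
        t = rng.uniform(0.0, dur)  # time of the sampled frame
        # Pick an audio start so that [start, start + clip_len] lies
        # inside the video and overlaps the frame time t.
        lo = max(0.0, t - clip_len)
        hi = min(t, dur - clip_len)
        start = rng.uniform(lo, hi)
        return vid, t, vid, start, 1
    # Negative pair: frame and audio come from two different videos.
    v_frame, v_audio = rng.sample(ids, 2)
    t = rng.uniform(0.0, videos[v_frame])
    start = rng.uniform(0.0, videos[v_audio] - clip_len)
    return v_frame, t, v_audio, start, 0
```

For example, `sample_pair({"a": 10.0, "b": 5.0}, True, random.Random(0))` returns a frame time and an audio start from the same video with label 1, while passing `False` returns ids of two different videos with label 0.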

We use standard data augmentation techniques for images: each training image is uniformly scaled such that the smallest dimension is equal to 256, followed by random cropping into 224 × 224, random horizontal flipping, and brightness and saturation jittering. Audio is only augmented by changing the volume up to 10% randomly but consistently across the sample.
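
A rough NumPy sketch of part of this augmentation pipeline is shown below. It covers the rescaling, random crop, horizontal flip, and the audio volume change. The brightness and saturation jitter is left out because the notes give no magnitudes for it. The nearest-neighbour resize and the function names are assumptions made for this sketch, and the "up to 10%" volume change is read here as a single gain in [0.9, 1.1] applied to the whole clip.

```python
import numpy as np

def resize_shorter_side(img, target=256):
    # Nearest-neighbour resize so the shorter side equals `target`
    # (the interpolation method is an assumption of this sketch).
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h = max(target, round(h * scale))
    new_w = max(target, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def augment_image(img, rng, crop=224):
    img = resize_shorter_side(img)
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)    # random crop offset
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # random horizontal flip
        img = img[:, ::-1]
    return img

def augment_volume(audio, rng, max_change=0.1):
    # One gain per clip, so the change is consistent across the sample.
    gain = 1.0 + rng.uniform(-max_change, max_change)
    return audio * gain
```

With `rng = np.random.default_rng(0)`, a 300 × 400 colour image passed to `augment_image` comes out as a 224 × 224 crop, and `augment_volume` scales every sample of the clip by the same factor.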

Log-spectrogram computation. The 1 second audio is resampled to 48 kHz, and a spectrogram is computed with window length of 0.01 seconds and a half-window overlap; this produces 199 windows with 257 frequency bands. The response map is passed through a logarithm before feeding it into the audio subnetwork.
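
The numbers above can be checked with a small NumPy sketch. At 48 kHz, a 0.01 second window is 480 samples and the half-window hop is 240 samples, so a 1 second clip gives 1 + (48000 − 480) / 240 = 199 windows. The 257 frequency bands are consistent with a 512-point FFT (512 / 2 + 1 = 257), but the FFT size, the Hann window, and the small `eps` added before the logarithm are assumptions of this sketch, since the notes do not state them.

```python
import numpy as np

def log_spectrogram(audio, sample_rate=48000, win_sec=0.01,
                    n_fft=512, eps=1e-7):
    win = int(round(sample_rate * win_sec))   # 480 samples per window
    hop = win // 2                            # half-window overlap: 240
    n_frames = 1 + (len(audio) - win) // hop  # 199 for a 1 second clip
    # Index matrix of shape (n_frames, win): one row per window.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(win)     # Hann window (assumed)
    # Zero-padded FFT; rfft keeps n_fft // 2 + 1 = 257 frequency bins.
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # eps avoids log(0); transpose to (frequency, time).
    return np.log(spec + eps).T
```

Calling `log_spectrogram` on an array of 48000 samples returns an array of shape (257, 199), matching the greyscale input size of the audio subnetwork.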

 

Results