Look, Listen and Learn
(2017 ICCV)
Relja Arandjelović, Andrew Zisserman
Notes
Contributions
We introduce a novel Audio-Visual Correspondence (AVC) learning task that is used to train the two (visual and audio) networks from scratch. The AVC task is a simple binary classification task: given an example video frame and a short audio clip, decide whether they correspond to each other or not. Corresponding (positive) pairs are taken at the same time from the same video, while mismatched (negative) pairs are extracted from different videos.
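As a minimal illustration of the objective, here is a PyTorch sketch of the AVC loss as 2-way classification; the two-stream `model` is a placeholder for the networks described under Method below:

```python
import torch.nn.functional as F

def avc_loss(model, frames, spects, labels):
    """AVC as 2-way classification (correspond vs. mismatch).

    frames: (B, 3, 224, 224) colour images
    spects: (B, 1, 257, 199) log-spectrograms
    labels: (B,) int64, 1 = corresponding pair, 0 = mismatched pair
    """
    logits = model(frames, spects)   # (B, 2) logits from the fused network
    return F.cross_entropy(logits, labels)
```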
Method
Vision subnetwork. The input to the vision subnetwork is a 224 × 224 colour image, and the network follows the VGG-network style.
Audio subnetwork. The input to the audio subnetwork is a 1 second sound clip converted into a log-spectrogram, which is thereafter treated as a greyscale 257 × 199 image. The architecture of the audio subnetwork is similar to the vision one.
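A rough PyTorch sketch of the two-stream design described above. The VGG-style blocks, the 512-D per-stream embeddings, and the small fusion MLP producing a 2-way output follow the paper's layout, but the exact block configuration here (two convs per block, batch norm, global max-pooling) is a simplified assumption:

```python
import torch
import torch.nn as nn

def vgg_block(c_in, c_out):
    # Two 3x3 convs with batch norm and ReLU, then 2x2 max-pooling (VGG style).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Subnet(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(in_channels, 64),
            vgg_block(64, 128),
            vgg_block(128, 256),
            vgg_block(256, 512),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)   # global pooling to a 512-D vector

    def forward(self, x):
        return self.pool(self.features(x)).flatten(1)

class AVCNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = Subnet(3)   # 224 x 224 colour image
        self.audio = Subnet(1)    # 257 x 199 log-spectrogram (greyscale)
        # Fusion: concatenate the two 512-D embeddings, then two FC layers
        # to a 2-way correspond / mismatch output.
        self.fusion = nn.Sequential(
            nn.Linear(1024, 128), nn.ReLU(inplace=True), nn.Linear(128, 2)
        )

    def forward(self, frame, spect):
        return self.fusion(torch.cat([self.vision(frame), self.audio(spect)], dim=1))
```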
Training data sampling. A non-corresponding frame-audio pair is compiled by randomly sampling two different videos and picking a random frame from one and a random 1 second audio clip from the other. A corresponding frame-audio pair is created by sampling a random video, picking a random frame in that video, and then picking a random 1 second audio clip that overlaps in time with the sampled frame.
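The sampling procedure translates directly into code. In this sketch `videos`, `.frame_at()`, and `.audio_clip()` are hypothetical helpers, and videos are assumed to be longer than the 1 second clip:

```python
import random

def sample_pair(videos, clip_len=1.0):
    """Sample one (frame, audio, label) training example."""
    if random.random() < 0.5:
        # Positive: frame and 1 s audio clip from the same video,
        # with the audio window overlapping the frame's timestamp.
        v = random.choice(videos)
        t = random.uniform(0.0, v.duration)
        start = random.uniform(max(0.0, t - clip_len),
                               min(t, v.duration - clip_len))
        return v.frame_at(t), v.audio_clip(start, clip_len), 1
    # Negative: frame and audio clip drawn from two different videos.
    v1, v2 = random.sample(videos, 2)
    frame = v1.frame_at(random.uniform(0.0, v1.duration))
    audio = v2.audio_clip(random.uniform(0.0, v2.duration - clip_len), clip_len)
    return frame, audio, 0
```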
We use standard data augmentation techniques for images: each training image is uniformly scaled such that the smallest dimension equals 256, followed by random cropping to 224 × 224, random horizontal flipping, and brightness and saturation jittering. Audio is augmented only by randomly changing the volume by up to 10%, consistently across the sample.
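A sketch of both pipelines using torchvision; the jitter strengths (0.1) are assumptions, since the notes only name the augmentation types:

```python
import random
import torchvision.transforms as T

# Image pipeline as described: scale shortest side to 256, random 224x224
# crop, random horizontal flip, brightness/saturation jitter.
image_augment = T.Compose([
    T.Resize(256),                 # shortest side -> 256
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.1, saturation=0.1),  # jitter strengths assumed
    T.ToTensor(),
])

def augment_audio(waveform):
    # Single random gain of up to +-10%, applied uniformly to the whole clip.
    gain = 1.0 + random.uniform(-0.1, 0.1)
    return waveform * gain
```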
Log-spectrogram computation. The 1 second audio is resampled to 48 kHz, and a spectrogram is computed with a window length of 0.01 seconds and a half-window overlap; this produces 199 windows with 257 frequency bands. The spectrogram is passed through a logarithm before being fed into the audio subnetwork.
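These numbers are consistent: at 48 kHz a 0.01 s window is 480 samples, half-window overlap gives a 240-sample hop, so a 1 s clip yields 1 + (48000 − 480)/240 = 199 windows, and zero-padding the FFT to 512 points gives 512/2 + 1 = 257 frequency bands. A scipy sketch; the nfft = 512 zero-padding, Hann window, and log offset are assumptions inferred from the stated output size:

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(waveform, fs=48000, eps=1e-10):
    # 0.01 s window = 480 samples; half-window overlap = 240-sample hop.
    # nfft=512 (zero-padded FFT) yields 512/2 + 1 = 257 frequency bands.
    f, t, sxx = spectrogram(waveform, fs=fs, window="hann",
                            nperseg=480, noverlap=240, nfft=512)
    return np.log(sxx + eps)   # eps avoids log(0); exact offset is assumed

clip = np.random.randn(48000)   # stand-in for a 1 s audio clip
spec = log_spectrogram(clip)
assert spec.shape == (257, 199)
```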
Results