2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint
Introduction
bidirectional retrieval (video→music and music→video)
Challenges
- design a cross-modal model that requires no metadata
- matched video–music pairs are hard to obtain, and the matching criterion between video and music is more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)
Contributions
- Content-based, cross-modal embedding network
- introduce VM-NET, two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
- train the network via an inter-modal ranking loss
such that videos and music with similar semantics end up close together in the embedding space
However, if only the inter-modal ranking constraint for embedding is considered, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for image) may be lost.
- devise a novel soft intra-modal structure constraint
takes advantage of the relative distance relationship of samples within each modality
does not require ground-truth pair information within an individual modality.
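The inter-modal ranking idea above can be sketched as a generic bidirectional max-margin triplet loss over matched embedding pairs. This is a minimal numpy illustration, not the paper's exact formulation; the function name, margin value, and negative-sampling scheme (all non-matching pairs in the batch) are my assumptions:

```python
import numpy as np

def inter_modal_ranking_loss(v, m, margin=0.2):
    """Bidirectional max-margin ranking loss for matched pairs v[i] <-> m[i].

    v, m: L2-normalized video and music embeddings, shape (n, d).
    A generic triplet formulation; the paper's sampling/weighting may differ.
    """
    sim = v @ m.T            # cosine similarities (rows are unit-norm)
    pos = np.diag(sim)       # similarity of each matched pair
    n = len(v)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # push each non-matching pair at least `margin` below the match
            loss += max(0.0, margin - pos[i] + sim[i, j])  # video -> music
            loss += max(0.0, margin - pos[i] + sim[j, i])  # music -> video
    return loss / (2 * n * (n - 1))
```

With perfectly separated embeddings the loss is zero; shuffling the pairing makes it positive, which is what drives matched videos and music together in the shared space.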
Large-scale video–music pair dataset
- Hong–Im Music–Video 200K (HIMV-200K)
composed of 200,500 video–music pairs.
Evaluation
- Recall@K
- subjective user evaluation
Related work
A. Video–Music Related Tasks
conventional approaches can be divided into three categories according to the task:
- generation,
- classification
- matching
- most existing methods use metadata (e.g., keywords, mood tags, and associated descriptions)
B. Two-branch Neural Networks
- model relationships between different modalities
- associating images with text
- music-video emotion tags
Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)
Method
A. Music Feature Extraction
- decompose the audio signal into harmonic and percussive components
- apply log-amplitude scaling to each component to avoid numerical underflow
- slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame
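The three preprocessing steps above can be sketched in numpy/scipy. The paper does not specify its implementation; this uses median-filtering HPSS in the style of Fitzgerald's classic method, and all function names and parameters here are illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss(S, kernel=17):
    """Split a magnitude spectrogram S (freq x time) into harmonic and
    percussive parts: harmonic energy is smooth along time, percussive
    energy is smooth along frequency (median-filtering HPSS)."""
    H = median_filter(S, size=(1, kernel))   # smooth across time
    P = median_filter(S, size=(kernel, 1))   # smooth across frequency
    mask_h = H / (H + P + 1e-10)             # soft harmonic mask
    return S * mask_h, S * (1.0 - mask_h)

def log_amplitude(S, eps=1e-10):
    # log-amplitude scaling; eps keeps log() away from zero (underflow)
    return np.log(S + eps)

def frame_signal(x, frame_len=2048, hop=512):
    """Slice a 1-D signal into overlapping local frames (windowed excerpts)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])
```

Because the two soft masks sum to one, the harmonic and percussive parts add back up to the original spectrogram exactly.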
Frame-level features.
- Spectral features
The first type of audio features is derived from spectral analyses.
- first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
- from the magnitude spectral results, compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first- and second-order polynomial features of the spectrogram
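These summary statistics have standard textbook definitions, sketched below for a single windowed frame in plain numpy (the helper name, window choice, and 85% rolloff threshold are my assumptions, not from the paper):

```python
import numpy as np

def spectral_summary(frame, sr=22050, rolloff_pct=0.85):
    """Centroid, bandwidth, rolloff, and polynomial features of one frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = mag / (mag.sum() + 1e-10)        # normalized magnitude distribution
    centroid = (freqs * p).sum()         # "center of mass" of the spectrum
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * p).sum())
    cum = np.cumsum(mag)                 # rolloff: freq below which 85% of
    rolloff = freqs[np.searchsorted(cum, rolloff_pct * cum[-1])]  # energy lies
    # 2nd-degree polynomial fit of magnitude vs. frequency
    # -> its coefficients are the first/second-order polynomial features
    poly = np.polyfit(freqs, mag, 2)
    return centroid, bandwidth, rolloff, poly
```

For a pure 1 kHz tone, for instance, the centroid and rolloff both land near 1000 Hz, which is the intuition behind using them as compact descriptors of spectral shape.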
- Mel-scale features
- compute the Mel-scale spectrogram of each frame as well as the Mel-frequency cepstral coefficients (MFCC)
- to extract more meaningful features, use delta-MFCC features (the first- and second-order differences of the MFCC features over time), which capture variations of timbre over time
- Chroma features
- use the chroma short-time Fourier transform as well as chroma energy normalized statistics (CENS)
While Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes; chroma features compensate by folding spectral energy into the 12 pitch classes.
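The folding into pitch classes can be sketched for a single frame in numpy. This is a bare-bones STFT chroma; real implementations weight and smooth the bin-to-pitch mapping, and the function name, tuning reference, and 20 Hz cutoff are my assumptions:

```python
import numpy as np

def chroma_frame(frame, sr=22050, tuning_hz=440.0):
    """Fold the magnitude spectrum of one windowed frame into 12 pitch classes."""
    n = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    valid = freqs > 20.0                    # ignore DC / sub-audible bins
    # MIDI-style pitch number; modulo 12 collapses octaves into pitch classes
    pitch = 69 + 12 * np.log2(freqs[valid] / tuning_hz)
    chroma = np.zeros(12)
    np.add.at(chroma, np.round(pitch).astype(int) % 12, mag[valid])
    return chroma / (chroma.sum() + 1e-10)  # energy-normalized
```

A 440 Hz tone, for example, lands in pitch class 9 (A) regardless of octave, which is exactly the pitch-class resolution that Mel-scaled features lack.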