2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint
Introduction
bidirectional retrieval (video→music and music→video)
Challenges
- design a cross-modal model that requires no metadata
- matched video–music pairs are hard to obtain, and the matching criterion between video and music is more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)
Contributions
- Content-based, cross-modal embedding network
- introduce VM-NET, two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
- train the network via an inter-modal ranking loss
such that videos and music with similar semantics end up close together in the embedding space
However, if only the inter-modal ranking constraint for embedding is considered, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for image) may be lost.
- devise a novel soft intra-modal structure constraint
takes advantage of the relative distance relationship of samples within each modality
does not require ground-truth pair information within an individual modality.
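The inter-modal ranking idea above can be sketched as a generic bidirectional max-margin triplet loss over matched embedding pairs. This is a minimal numpy illustration, not the paper's exact formulation; the function name, margin value, and negative-sampling scheme (all non-matching pairs in the batch) are my assumptions:

```python
import numpy as np

def inter_modal_ranking_loss(v, m, margin=0.2):
    """Bidirectional max-margin ranking loss for matched pairs v[i] <-> m[i].

    v, m: L2-normalized video and music embeddings, shape (n, d).
    A generic triplet formulation; the paper's sampling/weighting may differ.
    """
    sim = v @ m.T            # cosine similarities (rows are unit-norm)
    pos = np.diag(sim)       # similarity of each matched pair
    n = len(v)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # push each non-matching pair at least `margin` below the match
            loss += max(0.0, margin - pos[i] + sim[i, j])  # video -> music
            loss += max(0.0, margin - pos[i] + sim[j, i])  # music -> video
    return loss / (2 * n * (n - 1))
```

With perfectly separated embeddings the loss is zero; shuffling the pairing makes it positive, which is what drives matched videos and music together in the shared space.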
Large-scale video–music pair dataset
- Hong–Im Music–Video 200K (HIMV-200K)
composed of 200,500 video–music pairs.
Evaluation
- Recall@K
- subjective user evaluation
Related work
A. Video–Music Related Tasks
conventional approaches can be divided into three categories according to the task:
- generation,
- classification
- matching
- most existing methods use metadata (e.g., keywords, mood tags, and associated descriptions)
B. Two-branch Neural Networks
- model relationships between different modalities
- associating images with text
- music-video emotion tags
Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)
Method
A. Music Feature Extraction
- decompose the audio signal into harmonic and percussive components
- apply log-amplitude scaling to each component to avoid numerical underflow
- slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame
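The three preprocessing steps above can be sketched in numpy/scipy. The paper does not specify its implementation; this uses median-filtering HPSS in the style of Fitzgerald's classic method, and all function names and parameters here are illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss(S, kernel=17):
    """Split a magnitude spectrogram S (freq x time) into harmonic and
    percussive parts: harmonic energy is smooth along time, percussive
    energy is smooth along frequency (median-filtering HPSS)."""
    H = median_filter(S, size=(1, kernel))   # smooth across time
    P = median_filter(S, size=(kernel, 1))   # smooth across frequency
    mask_h = H / (H + P + 1e-10)             # soft harmonic mask
    return S * mask_h, S * (1.0 - mask_h)

def log_amplitude(S, eps=1e-10):
    # log-amplitude scaling; eps keeps log() away from zero (underflow)
    return np.log(S + eps)

def frame_signal(x, frame_len=2048, hop=512):
    """Slice a 1-D signal into overlapping local frames (windowed excerpts)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])
```

Because the two soft masks sum to one, the harmonic and percussive parts add back up to the original spectrogram exactly.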
Frame-level features.
- Spectral features
The first type of audio features is derived from spectral analyses.
- first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
- from the magnitude spectral results, compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first- and second-order polynomial features of the spectrogram
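These summary statistics have standard textbook definitions, sketched below for a single windowed frame in plain numpy (the helper name, window choice, and 85% rolloff threshold are my assumptions, not from the paper):

```python
import numpy as np

def spectral_summary(frame, sr=22050, rolloff_pct=0.85):
    """Centroid, bandwidth, rolloff, and polynomial features of one frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = mag / (mag.sum() + 1e-10)        # normalized magnitude distribution
    centroid = (freqs * p).sum()         # "center of mass" of the spectrum
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * p).sum())
    cum = np.cumsum(mag)                 # rolloff: freq below which 85% of
    rolloff = freqs[np.searchsorted(cum, rolloff_pct * cum[-1])]  # energy lies
    # 2nd-degree polynomial fit of magnitude vs. frequency
    # -> its coefficients are the first/second-order polynomial features
    poly = np.polyfit(freqs, mag, 2)
    return centroid, bandwidth, rolloff, poly
```

For a pure 1 kHz tone, for instance, the centroid and rolloff both land near 1000 Hz, which is the intuition behind using them as compact descriptors of spectral shape.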
- Mel-scale features
- compute the Mel-scale spectrogram of each frame as well as the Mel-frequency cepstral coefficients (MFCC)
- to extract more meaningful features, use delta-MFCC features (the first- and second-order differences of the MFCC features over time), which capture variations of timbre over time
- Chroma features
- use the chroma short-time Fourier transform as well as chroma energy normalized statistics (CENS)
While Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes; chroma features compensate by folding spectral energy into the 12 pitch classes.
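The folding into pitch classes can be sketched for a single frame in numpy. This is a bare-bones STFT chroma; real implementations weight and smooth the bin-to-pitch mapping, and the function name, tuning reference, and 20 Hz cutoff are my assumptions:

```python
import numpy as np

def chroma_frame(frame, sr=22050, tuning_hz=440.0):
    """Fold the magnitude spectrum of one windowed frame into 12 pitch classes."""
    n = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    valid = freqs > 20.0                    # ignore DC / sub-audible bins
    # MIDI-style pitch number; modulo 12 collapses octaves into pitch classes
    pitch = 69 + 12 * np.log2(freqs[valid] / tuning_hz)
    chroma = np.zeros(12)
    np.add.at(chroma, np.round(pitch).astype(int) % 12, mag[valid])
    return chroma / (chroma.sum() + 1e-10)  # energy-normalized
```

A 440 Hz tone, for example, lands in pitch class 9 (A) regardless of octave, which is exactly the pitch-class resolution that Mel-scaled features lack.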