【Music】Video Soundtracking | Multimodal Retrieval: Notes on Content-Based Video–Music Retrieval (CBVMR) Using Soft Intra-Modal Structure Constraint

This work proposes a content-based video–music retrieval (CBVMR) method that uses a soft intra-modal structure constraint. A two-branch neural network embeds video and music content so that videos and music with similar semantics lie close together in the embedding space. In addition, the soft intra-modal structure constraint is introduced to preserve modality-specific characteristics, such as rhythm in music and color in video. Experiments on the large-scale HIMV-200K dataset verify the model's effectiveness.

2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint

Introduction

bidirectional retrieval

Challenges

  • designing a cross-modal model that requires no metadata
  • matched video–music pairs are hard to obtain, and the matching criterion between video and music is more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)

Contributions

  • Content-based, cross-modal embedding network
• introduce VM-NET, a two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
    • train the network via inter-modal ranking loss
      such that videos and music with similar semantics end up close together in the embedding space
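A minimal numpy sketch of such a bidirectional max-margin ranking loss (the margin value and the use of all non-matching items as negatives are assumptions for illustration, not details taken from the paper):

```python
import numpy as np

def inter_modal_ranking_loss(v, m, margin=0.2):
    """Bidirectional max-margin ranking loss (sketch).

    v, m: L2-normalized embeddings of N matched video/music pairs,
    shape (N, D); row i of v is paired with row i of m.
    `margin` is a hypothetical hyperparameter.
    """
    sim = v @ m.T                      # cosine similarities, (N, N)
    pos = np.diag(sim)                 # similarity of each matched pair
    # video -> music: every non-matching music track is a negative
    loss_v2m = np.maximum(0.0, margin + sim - pos[:, None])
    # music -> video: every non-matching video is a negative
    loss_m2v = np.maximum(0.0, margin + sim - pos[None, :])
    n = v.shape[0]
    off = ~np.eye(n, dtype=bool)       # exclude the matched pair itself
    return (loss_v2m[off].sum() + loss_m2v[off].sum()) / n
```

With perfectly aligned embeddings the hinge terms vanish and the loss is zero; mismatched pairs are pushed apart in both retrieval directions.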

However, if only the inter-modal ranking constraint is considered for embedding, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for video) may be lost.

  • devise a novel soft intra-modal structure constraint
    takes advantage of the relative distance relationships of samples within each modality
    does not require ground-truth pair information within each individual modality
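The idea can be sketched as follows, assuming Euclidean distances and a hypothetical margin (the paper's exact formulation may differ): within one modality, whenever sample j is closer to anchor i than sample k is in the original feature space, the embedding is softly encouraged to keep the same ordering.

```python
import numpy as np

def soft_intra_modal_loss(feat, emb, margin=0.1):
    """Soft intra-modal structure constraint (sketch, one modality).

    feat: original feature vectors, shape (N, D_f)
    emb:  learned embeddings,       shape (N, D_e)
    For each anchor i and pair (j, k), if j is closer to i than k is
    in feature space, a hinge penalizes the embedding for reversing
    that ordering. `margin` is a hypothetical hyperparameter.
    """
    def pdist(x):
        d = x[:, None, :] - x[None, :, :]
        return np.sqrt((d ** 2).sum(-1))

    df, de = pdist(feat), pdist(emb)
    n = feat.shape[0]
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) == 3 and df[i, j] < df[i, k]:
                    # embedding should keep j closer to i than k
                    loss += max(0.0, margin + de[i, j] - de[i, k])
                    count += 1
    return loss / max(count, 1)
```

No ground-truth pairs are needed: the supervision comes entirely from relative distances inside the modality.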

Large-scale video–music pair dataset

  • Hong–Im Music–Video 200K (HIMV-200K)
    composed of 200,500 video–music pairs.

Evaluation

  • Recall@K
  • subjective user evaluation
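Recall@K counts a query as correct if its ground-truth match appears among the top-K retrieved items; a minimal sketch:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for retrieval (sketch).

    sim[i, j]: similarity between query i and gallery item j; the
    ground-truth match of query i is gallery item i. Returns the
    fraction of queries whose match appears in the top-K results.
    """
    n = sim.shape[0]
    # indices of the K highest-scoring gallery items per query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return hits.mean()
```

For bidirectional retrieval the same function is applied twice, once with video queries against the music gallery and once with the similarity matrix transposed.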

Related work

A. Video–Music Related Tasks

Conventional approaches can be divided into three categories according to the task:

  • generation
  • classification
  • matching

Most existing methods rely on metadata (e.g., keywords, mood tags, and associated descriptions).

B. Two-Branch Neural Networks

model relationships between different modalities
e.g., associating images with text

music–video emotion tags
Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)

Method

A. Music Feature Extraction

  1. decompose an audio signal into harmonic and percussive components

  2. apply log-amplitude scaling to each component
    to avoid numerical underflow

  3. slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame.
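Steps 2 and 3 can be sketched roughly as follows (frame length and hop size are illustrative; the harmonic/percussive decomposition itself is typically done with median filtering on the spectrogram, e.g. `librosa.effects.hpss`, and is omitted here):

```python
import numpy as np

def log_scale(spec, eps=1e-10):
    """Log-amplitude scaling; eps guards against numerical underflow."""
    return np.log(spec + eps)

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping local frames (windowed excerpts)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
```

Each local frame is then passed to the per-frame feature extractors described below.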

Frame-level features.

  1. Spectral features
    The first type of audio features is derived from spectral analyses.
  • first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
  • from the magnitude spectra, compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first- and second-order polynomial features of a spectrogram
  2. Mel-scale features
  • compute the Mel-scale spectrogram of each frame as well as the Mel-frequency cepstral coefficients (MFCC) to extract more meaningful features
  • use delta-MFCC features (the first- and second-order differences of MFCC features over time) to capture variations of timbre over time

  3. Chroma features
  • use the chroma short-time Fourier transform as well as chroma energy normalized statistics (CENS)
    While Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes.
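A rough illustration of how some of these frame-level features are computed (simplified numpy sketches; the rolloff percentage, reference pitch, and rounding-based chroma folding are illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def spectral_summary(mag, freqs, rolloff_pct=0.85):
    """Summary features of one magnitude spectrum (sketch).

    mag: FFT magnitudes of one frame; freqs: bin center frequencies.
    rolloff_pct is a common but here illustrative choice.
    """
    p = mag / mag.sum()                       # normalize to a distribution
    centroid = (freqs * p).sum()              # spectral "center of mass"
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * p).sum())
    # frequency below which rolloff_pct of the energy lies
    rolloff = freqs[np.searchsorted(np.cumsum(p), rolloff_pct)]
    return centroid, bandwidth, rolloff

def delta(feat):
    """First-order temporal difference (delta) of a (T, D) feature sequence."""
    d = np.zeros_like(feat)
    d[1:] = feat[1:] - feat[:-1]
    return d

def chroma_fold(mag, freqs, fref=440.0):
    """Fold spectrum energy into 12 pitch classes (chroma, sketch)."""
    valid = freqs > 0
    pitch = 69 + 12 * np.log2(freqs[valid] / fref)   # MIDI pitch number
    cls = np.mod(np.round(pitch).astype(int), 12)    # pitch class 0..11
    chroma = np.zeros(12)
    np.add.at(chroma, cls, mag[valid])               # accumulate energy
    return chroma
```

Applying `delta` to an MFCC sequence once gives the first-order delta-MFCC features, and applying it twice gives the second-order ones.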