Contribution:
- The CMD dataset: 30k+ videos downloaded from MovieClips, each containing a "key scene". Every video comes with a description, which covers intent, relationship, emotion, attribute, and context.
- A baseline model for movie-text retrieval. It uses an "Expert" model, which concatenates several different kinds of features (The expert features are extracted using pre-trained models for speech, motion, faces, scenes and objects.), and additionally adds a CBM module (allows the model to learn to increase the relative weight of a past video feature as well.).
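
A minimal sketch of the two ideas in the baseline bullet, under assumptions that are mine and not taken from the paper: each expert feature is L2-normalized and then concatenated, the text query is assumed to be already embedded in the same fused space, and the past-clip boost is modeled as a single scalar `past_weight` added to the retrieval score. The names `fuse_experts`, `boosted_score`, and `past_weight` are hypothetical, and the actual CBM may be structured differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_experts(expert_features):
    """Concatenate L2-normalized expert features in sorted-name order.

    expert_features maps a (hypothetical) expert name such as "face" or
    "scene" to its feature vector.
    """
    fused = []
    for name in sorted(expert_features):
        fused.extend(l2_normalize(expert_features[name]))
    return fused


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def boosted_score(text_emb, clip_emb, past_emb, past_weight):
    """Score for the current clip plus a weighted score for a past clip.

    past_weight stands in for the learned weight on the past video
    feature; here it is just a number passed by the caller.
    """
    return dot(text_emb, clip_emb) + past_weight * dot(text_emb, past_emb)


# Toy example with two experts of dimension 2 each.
clip = fuse_experts({"face": [3.0, 4.0], "scene": [1.0, 0.0]})
past = fuse_experts({"face": [1.0, 0.0], "scene": [0.0, 0.0]})
query = [1.0, 0.0, 0.0, 0.0]  # assumed to live in the fused space
score = boosted_score(query, clip, past, past_weight=0.5)
```

With `past_weight=0.0` the score reduces to plain query-clip similarity, so in this sketch the weight only controls how much the earlier clip contributes to the ranking.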