快手文本视频多模态检索论文
论文:Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
链接:https://arxiv.org/abs/2401.00701
摘要
近些年,基于CLIP的text-to-video检索方法广为流行,但大多从视觉文本对齐方法上演进。按照原文:design a heavy fusion block for sentence (words)-video (frames) interaction,而忽视了复杂度和检索效率。
- 升级点
- 本文采用多粒度视觉特征学习,捕获从抽象到具体的视觉内容。Multi-granularity visual feature learning, ensuring the model’s comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase.
- 设计两阶段检索框架,优点在于 balances the coarse and fine granularity of retrieval content.
- 在训练阶段,设计一个parameter-free text-gated interaction block (TIB) 模块用于细粒度视觉表征学习并嵌入一个额外的 Pearson Constraint来优化跨模态表示学习。
- 在检索阶段,使用粗粒度视觉表征快速检索topk结果ÿ