Video Understanding with Large Models

Latest recommended article published on 2024-08-02 22:58:13

一只想飞的锦鲤


Article tags: audio and video, artificial intelligence, computer vision

Article link: https://blog.csdn.net/m0_37847767/article/details/140856065


Datasets

Learning Video Context as Interleaved Multimodal Sequences

Motivation:
The target is narrative videos (movie clips, TV series, etc.), because they are comparatively complex.
Most top-performing video perception models study atomic actions, people, or objects.
Understanding video contexts covers many tasks, and the models that solve these tasks are too task-specific and not general enough.
This leads to the question:
Can we develop a general solution that handles these diverse contexts and needs in videos?

Our work
Similar models exist, but when applied to narrative videos, which encompass informative contexts, these models with a pre-defined visual-textual template still exhibit limitations due to inflexibility. On this basis the paper makes the following contributions:

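It proposes a new multimodal model for this kind of video. Because such videos have a complex structure, the core idea is to embed the videos as interleaved multi-modal sequences.
It aims to unify multimodal contexts and tasks in a user-friendly way.
It collects an instruction-tuning dataset (using a package of solutions to convert existing datasets) in interleaved multimodal instruction-following form, and trains a decoder-only model on it.
In addition, the model lets users interact with videos in a more free-form way.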

Model
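Overall the model is not complicated: each frame is just a single token. The authors hope that encoding the interleaved multimodal information this way helps the model answer questions better.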
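To make the "one frame, one token" idea concrete, here is a minimal sketch of how text and frames might be flattened into a single interleaved sequence. It is an illustration only, not the paper's implementation: the `<frame>` placeholder, the `TextSegment` and `FrameSegment` classes, the function name, and the whitespace tokenization are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextSegment:
    text: str  # e.g. a subtitle, a plot summary, or the user's question


@dataclass
class FrameSegment:
    frame_index: int  # index of a sampled video frame


Segment = Union[TextSegment, FrameSegment]

# Hypothetical placeholder token; the real special token is not given in the notes.
FRAME_TOKEN = "<frame>"


def build_interleaved_sequence(
    segments: List[Segment],
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Flatten text and frames into one token list.

    Each frame occupies exactly one position. The second return value maps
    sequence positions to frame indices, so a visual encoder could later
    substitute one frame embedding at each of those positions.
    """
    tokens: List[str] = []
    frame_positions: List[Tuple[int, int]] = []
    for seg in segments:
        if isinstance(seg, FrameSegment):
            frame_positions.append((len(tokens), seg.frame_index))
            tokens.append(FRAME_TOKEN)
        else:
            # Toy whitespace tokenization, standing in for a real tokenizer.
            tokens.extend(seg.text.split())
    return tokens, frame_positions


segments = [
    TextSegment("Frame at 0s:"),
    FrameSegment(0),
    TextSegment("subtitle: hello there"),
    FrameSegment(1),
    TextSegment("Who is speaking?"),
]
tokens, frame_positions = build_interleaved_sequence(segments)
print(tokens)
print(frame_positions)
```

Because every frame takes a single slot, text and frames can appear in any order and quantity, which is what a fixed visual-textual template cannot offer.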
DATA
Several templates are built. The main focus is how to collect the corresponding tuning data for each type of interleaved prompt.
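As a rough illustration of what one such template could look like, the sketch below converts a record from an existing video QA dataset into an interleaved instruction-following sample. The field names (`type`, `id`, `text`, `prompt`, `response`), the function name, and the "subtitle follows its frame" layout are hypothetical; the notes do not describe the paper's actual templates.

```python
from typing import Dict, List


def to_interleaved_sample(
    frame_ids: List[int],
    subtitles: Dict[int, str],
    question: str,
    answer: str,
) -> dict:
    """Build one instruction-following sample with frames and text interleaved.

    Each frame is followed by its subtitle when one exists, and the question
    is appended as the final text segment of the prompt.
    """
    prompt: List[dict] = []
    for fid in frame_ids:
        prompt.append({"type": "frame", "id": fid})
        if fid in subtitles:
            prompt.append({"type": "text", "text": subtitles[fid]})
    prompt.append({"type": "text", "text": f"Question: {question}"})
    return {"prompt": prompt, "response": answer}


sample = to_interleaved_sample(
    frame_ids=[0, 1, 2],
    subtitles={1: "I'll be back."},
    question="Who is speaking?",
    answer="The man.",
)
for segment in sample["prompt"]:
    print(segment)
```

Other prompt types would presumably swap in different text sources or change the ordering, while the output structure stays the same.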
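Experiments
The experiments cover many tasks, all among the most popular in video understanding, and the model is essentially SOTA on them. The section opens with several meaningful questions and examines them in depth. In addition, easily confused settings are marked with small symbols, which makes the presentation clearer.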

Multi-task learning enhances individual capabilities. This highlights the language model’s ability to acquire commonsense across diverse objectives and contexts.
Different kinds of interleaved multimodal instruction.
