Vript: A High-quality Fine-grained Video-text Dataset | ~145 words per caption

Vript is a fine-grained video-text dataset with 12K annotated YouTube videos (~400K clips). The annotation of this dataset is inspired by video scripts: before shooting a video, one first writes a script that organizes how the scenes will be shot, deciding the content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene has a caption of ~145 words. Beyond the visual modality, we transcribe the voice-over into text and provide it along with the video title as background information for annotating the videos.
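
To make the script-style annotation format concrete, a single scene record might look like the sketch below; every field name and value here is an illustrative assumption rather than the dataset's actual schema:

```python
# Hypothetical sketch of a single Vript scene annotation. Field names and
# values are illustrative assumptions, not the dataset's actual schema.
scene_annotation = {
    "video_id": "yt_abc123",            # source YouTube video
    "scene_index": 17,                  # position of the scene in the video
    "scene_title": "Chef plates the finished dish",
    "shot_type": "close-up",            # e.g. medium shot, close-up, wide shot
    "camera_movement": "panning",       # e.g. panning, tilting, static
    "caption": "A chef carefully arranges roasted vegetables ...",  # ~145 words
    "voice_over": "Now we finish the plate with a drizzle of olive oil ...",
    "video_title": "One-Pan Roasted Vegetable Dinner",
}
```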

Some key takeaways from the Vript dataset:

  • Fine-grained: Each scene is annotated with a detailed caption of ~145 words covering the shot type, camera movement, content, and scene title.
  • Dense Annotation: All scenes in each video are annotated, with none discarded. Each video has ~40 scenes and lasts ~6 minutes on average (max 3 h, min 5 s). The total duration of the videos is ~1.3K hours (see the quick consistency check below).
  • High-quality: The captions are generated by GPT-4V/Claude 3 Sonnet. We find that GPT-4V generates the best detailed captions for videos, while Claude 3 Sonnet imposes looser constraints on video content and can therefore caption some scenes that GPT-4V cannot.
  • High-resolution & Diverse Aspect Ratios & Open Domain: The Vript dataset contains both long videos from YouTube and short videos from YouTube Shorts and TikTok. The raw videos range from 720p to 2K resolution.
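
As a rough sanity check, the headline statistics are mutually consistent; the sketch below is back-of-the-envelope arithmetic using only the approximate figures quoted above:

```python
# Back-of-the-envelope check of the reported Vript statistics.
# All inputs are the approximate figures quoted above, not exact counts.
num_videos = 12_000        # ~12K annotated YouTube videos
scenes_per_video = 40      # ~40 scenes per video on average
minutes_per_video = 6      # ~6 minutes per video on average

total_clips = num_videos * scenes_per_video        # ~480K, same order as ~400K
total_hours = num_videos * minutes_per_video / 60  # ~1200 h, close to ~1.3K h
print(f"~{total_clips:,} clips, ~{total_hours:,.0f} hours")
```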

In addition, we propose Vript-Bench, a new benchmark consisting of three challenging video understanding tasks, each carefully double-checked by humans:

  • Vript-CAP (Caption): A benchmark with detailed captions rather than short ones.
  • Vript-RR (Retrieve then Reason): A video reasoning benchmark that first gives a detailed description of a scene as a hint and then asks questions about details in that scene.
  • Vript-ERO (Event Re-ordering): A benchmark that tests temporal understanding by providing descriptions of scenes from two/four different points in the same video and asking the model to give the correct temporal order of the scenes (a sketch of a possible item format follows this list).
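
As an illustration, here is a hypothetical sketch of what a Vript-ERO item and its scoring might look like; the field names and the exact-match metric are assumptions for illustration, not the official benchmark format:

```python
# Hypothetical Vript-ERO item; the structure and field names are illustrative
# assumptions, not the official benchmark format.
ero_item = {
    "video_id": "yt_xyz789",
    "scene_descriptions": [          # shuffled descriptions of four scenes
        "A chef plates the finished dish.",
        "Raw vegetables are rinsed under running water.",
        "The dish is served to guests at a table.",
        "Vegetables sizzle in a hot pan.",
    ],
    "gold_order": [1, 3, 0, 2],      # correct temporal order as indices
}

def exact_match(pred_order: list[int], gold_order: list[int]) -> int:
    """Score 1 only if the predicted ordering matches the gold order exactly."""
    return int(pred_order == gold_order)

# Example: a model that predicts the right order scores 1.
assert exact_match([1, 3, 0, 2], ero_item["gold_order"]) == 1
```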

Vript is suitable for aligning the text and video modalities, and can be used to train video generation models and multimodal models.
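
For example, assuming the annotations are distributed as a JSONL file with one caption record per clip (the file name and field names below are assumptions, not the official release format), a minimal PyTorch-style dataset for text-video alignment training might look like this:

```python
import json
from pathlib import Path

from torch.utils.data import Dataset


class VriptClips(Dataset):
    """Minimal sketch pairing each clip's video path with its detailed caption.

    Assumes a JSONL annotation file in which each line holds at least a
    'clip_path' (relative path to the clip file) and a 'caption' field;
    these names are illustrative, not the official schema.
    """

    def __init__(self, root: str, annotation_file: str = "annotations.jsonl"):
        self.root = Path(root)
        with open(self.root / annotation_file, encoding="utf-8") as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int) -> tuple[Path, str]:
        record = self.records[idx]
        # Return the clip path and caption; a real pipeline would decode
        # frames here (e.g. with decord or torchvision) before training.
        return self.root / record["clip_path"], record["caption"]
```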
