MovieNet Dataset Explained

清欢守护者

Article link: https://blog.csdn.net/irving512/article/details/112503571

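0. Preface

Related resources
- Official resources: paper, website, official documentation (easier to follow than the paper), GitHub
- Published: ECCV 2020
- Institution: The Chinese University of Hong Kong
- Other resources: a reference blog post
One-sentence summary: a movie-based video understanding dataset with person bboxes/IDs, scene boundaries, shot types, and place and action tags for each scene.
Availability: download it directly from the official website; it is straightforward.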

1. Basics

How a movie is broken down
- Movie structure: frame -> shot -> thread -> scene -> movie
- My decidedly non-expert understanding:
  - frame: nothing much to say, a single image frame.
  - shot: as I understand it, a video segment filmed continuously by one camera.
    - Shot is a series of frames that runs for an uninterrupted period of time. It is also the minimal visual unit of a movie. A movie would usually contain hundreds of shots.
  - scene: as I understand it, a video segment filmed in one place and made up of several shots.
    - Scene is a sequence of continued shots that are semantically related. Usually a scene would tell about one event in the movie. A movie would contain tens of scenes.
  - thread: I honestly did not understand this one.
    - Thread shows the pattern of the shot arrangement in a scene. But note that not all scenes would contain threads.
    - Take a typical dialog scene as an example. Suppose there are two persons A and B in the dialog scene, they would be alternately shown, the pattern of which can be represented as "ABABAB…". So there are two threads in this dialog scene, namely A and B. To capture the hierarchical structure of a movie is important for movie understanding.
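To make the thread idea concrete, here is a minimal sketch that groups the shots of one scene by the person shown in each shot. The per-shot labels and the helper `group_threads` are hypothetical and only illustrate the "ABABAB" pattern from the quote; the dataset does not necessarily store threads in this form.

```python
from collections import defaultdict


def group_threads(shot_labels):
    """Map each label to the indices of the shots that carry it."""
    threads = defaultdict(list)
    for shot_idx, label in enumerate(shot_labels):
        threads[label].append(shot_idx)
    return dict(threads)


# Hypothetical dialog scene: six shots alternating between persons A and B.
pattern = "ABABAB"
threads = group_threads(pattern)
print(threads)  # {'A': [0, 2, 4], 'B': [1, 3, 5]}
```

Each key is one thread, so this dialog scene has two threads, A and B, each made of every other shot.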
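Annotation types provided
- Person annotations:
  - 1.3M person bboxes were manually annotated on 758k images from more than 300 movies.
  - Person identities were annotated for 573 movies. For movies without manual bboxes, a SOTA person detector was used to detect people. To reduce the workload, only the top 10 cast members listed on IMDb for each movie were considered. The result is 763k instances belonging to 3087 credited cast members, plus 364k other instances.
- Scene boundaries:
  - The temporal segmentation of each movie into scenes.
  - 42k scenes in total.
- Place/action tags
  - Actions and places were manually annotated for every scene.
  - Each scene can have multiple place tags.
  - For action tags, each scene is first split into sub-clips, and each sub-clip is then annotated with multiple action tags.
  - To make the tags more diverse and informative, annotators were encouraged to create new tags, while actions that contribute little to story understanding (such as standing or talking) were dropped. The final vocabulary has 80 action classes and 90 place classes.
  - The result is 19.6k place tags, 41.3k action clips, and 45k action tags.
- Description Alignment
  - I am honestly not sure what this means; my guess is that it relates to video summarization?
  - See the official documentation.
- Cinematic Style
  - Annotated along two dimensions:
  - view scale: long shot, full shot, medium shot, close-up shot and extreme close-up shot
  - camera movement: static shot, pans and tilts shot, zoom in and zoom out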
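Data provided
- id: the movie's ID on IMDb; the TMDb ID and Douban ID are also provided.
- Movie: the movies themselves. 1100 movies are provided at 720P with a 16:9 aspect ratio, possibly with black borders. For copyright reasons only keyframes are released; adjacent frames are very similar, so keyframes are sufficient. Also to avoid copyright issues, audio is released only as features computed at a 16 kHz sampling rate with a window length of 512.
- Trailer: trailers, i.e. promotional videos. There are 33k distinct trailers, which also come with keyframes and the corresponding audio features.
- Subtitle: subtitles, either the embedded English subtitles or ones downloaded from YIFY.
- Script: the screenplay.
- Synopsis: plot synopses written by viewers and collected from IMDb.
- Meta data: metadata.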

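2. Details

2.1. Annotation details

All annotations are JSON files, each named after the movie's IMDb ID.
Each annotation is a dictionary with the following keys:
- imdb_id: the IMDb movie ID
- cast: person annotations, including each bbox and its pid (person ID)
- scene: scene information, including each scene's start/end frames, start/end shots, place tags, and action tags
- story: story segments; each entry has an ID, start/end shots, start/end frames, a duration, a consistency value (I do not know what this is), a text description, and subtitles
- cinematic_style: shot classification, i.e. the scale and movement of each shot, plus trailer information.
An example annotation: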

{
  "imdb_id": "tt1210166",
  "cast": [
    {
      "id": "tt1210166_000001",
      "frame_idx": null,
      "resolution": [
        1280,
        694
      ],
      "shot_idx": 1,
      "img_idx": 0,
      "body": {
        "type": "detected",
        "bbox": [
          22,
          27,
          1148,
          675
        ]
      },
      "pid": "others",
      "possible_pids": [
        "others"
      ]
    },
    ...
  ],
  "scene": [
    {
      "id": "tt1210166_0000",
      "shot": [
        0,
        1
      ],
      "frame": [
        0,
        841
      ],
      "place_tag": null,
      "action_tag": null
    },
    ...
  ],
  "story": [
    {
      "id": "tt1210166_0000",
      "shot": [
        60,
        424
      ],
      "frame": [
        6257,
        44851
      ],
      "duration": [
        260.97997833333335,
        1870.6211273333333
      ],
      "consistency": 0.963081028938084,
      "description": "Oakland Athletics general manager Billy Beane is upset by his team's loss to the New York Yankees in the 2001 postseason ...",
      "subtitle": [
        {
          "shot": 60,
          "duration": [
            260.26,
            262.51225
          ],
          "sentences": [
            "You gotta give the Yankees--"
          ]
        },
        ...
      ]
    },
    ...
  ],
  "cinematic_style": {
    "movie": [
      {
        "shot": 1,
        "scale": "closeup",
        "movement": "static"
      },
      {
        "shot": 2,
        "scale": "full",
        "movement": "static"
      },
      {
        "shot": 3,
        "scale": "closeup",
        "movement": "moving"
      },
      ...
    ],
    "trailer": null
  }
}
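
As a rough illustration of how such a file could be read, the sketch below loads one annotation and collects a few statistics from the keys shown in the example. The helper names (`load_annotation`, `summarize`, `estimate_fps`) and the local file name are hypothetical. The code also assumes that each "frame" and "duration" pair is a [start, end] span, and that "others" marks people outside the credited cast; neither is confirmed here. The inline `sample` is a trimmed copy of the example with the elided entries left out.

```python
import json
from collections import Counter


def load_annotation(path):
    """Read one per-movie annotation file, e.g. a local 'tt1210166.json'."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(ann):
    """Collect simple statistics from one annotation dictionary."""
    # Fields can be null in the JSON (None in Python), so fall back to empties.
    cast = ann.get("cast") or []
    scenes = ann.get("scene") or []
    shots = (ann.get("cinematic_style") or {}).get("movie") or []
    return {
        "imdb_id": ann.get("imdb_id"),
        # Number of person instances per pid ("others" assumed to be non-cast).
        "pid_counts": Counter(c["pid"] for c in cast),
        # Scene length in frames, assuming "frame" holds [start, end].
        "scene_lengths": [s["frame"][1] - s["frame"][0] for s in scenes],
        "scale_counts": Counter(s["scale"] for s in shots),
        "movement_counts": Counter(s["movement"] for s in shots),
    }


def estimate_fps(story_entry):
    """Rough frame rate, assuming "frame" and "duration" cover the same span."""
    f0, f1 = story_entry["frame"]
    t0, t1 = story_entry["duration"]
    return (f1 - f0) / (t1 - t0)


# Trimmed copy of the example annotation (elided entries omitted).
sample = {
    "imdb_id": "tt1210166",
    "cast": [{"id": "tt1210166_000001", "shot_idx": 1, "pid": "others"}],
    "scene": [{"id": "tt1210166_0000", "shot": [0, 1], "frame": [0, 841],
               "place_tag": None, "action_tag": None}],
    "story": [{"id": "tt1210166_0000", "frame": [6257, 44851],
               "duration": [260.97997833333335, 1870.6211273333333]}],
    "cinematic_style": {
        "movie": [
            {"shot": 1, "scale": "closeup", "movement": "static"},
            {"shot": 2, "scale": "full", "movement": "static"},
            {"shot": 3, "scale": "closeup", "movement": "moving"},
        ],
        "trailer": None,
    },
}

summary = summarize(sample)
print(summary["pid_counts"])
print(summary["scene_lengths"])
print(summary["scale_counts"])
print(round(estimate_fps(sample["story"][0]), 2))
```

For a real file, `summarize(load_annotation(path))` would be called instead of using the inline sample. The `or []` fallbacks matter because fields such as "trailer", "place_tag", and "action_tag" are null in the example.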