The 2nd MSR Video to Language Challenge

Video has become ubiquitous on the Internet, in broadcasting channels, and on personal devices. This has encouraged the development of advanced techniques to analyze semantic video content for a wide variety of applications. Recognition of videos has been a fundamental challenge of computer vision for decades. Previous research has predominantly focused on recognizing videos with a predefined yet very limited set of individual words. In this grand challenge, we go one step further and target translating video content into a complete and natural sentence, which can be regarded as the ultimate goal of video understanding. This challenge brings together diverse topics in the areas of multimedia, computer vision, natural language processing, and machine learning, as well as multiple modalities (textual, visual, and aural) and multiple ways of understanding and analyzing video content.

To further motivate and challenge the academic and industrial research communities, Microsoft Research organized the first Video to Language Grand Challenge at ACM Multimedia 2016 (http://www.acmmm.org/2016/?page_id=353) and released the first version of “Microsoft Research - Video to Text” (MSR-VTT), a large-scale video benchmark for video understanding, to the public. The dataset contains 38.7 hours of video and 200K clip-sentence pairs in total, covering comprehensive categories and diverse visual content, and it is the largest dataset of its kind in terms of sentences and vocabulary. The dataset can be used to train and evaluate video to language systems, and in the future other tasks as well (e.g., video retrieval, event detection, video categorization). Below are some statistics of the first grand challenge at ACM Multimedia 2016.


Figure 1. Statistics of participants in the first grand challenge at ACM MM 2016. 77 teams registered for the challenge, and 22 teams submitted their final results.


Figure 2. Ranking list of the top 10 participants in terms of the M1 (objective evaluation) metric in the first grand challenge at ACM MM 2016.


Figure 3. Ranking list of the top 10 participants in terms of the M2 (human subjective evaluation) metric in the first grand challenge at ACM MM 2016.

This year we are organizing the second grand challenge at ACM Multimedia 2017 and will release a new test dataset for evaluation.

As in the first challenge, by participating this year you can:

  • Leverage MSR-VTT benchmark to boost research on an emerging task of video to language;
  • Try out your video to language system using real world data;
  • See how it compares to the rest of the community’s entries;
  • Become a contender for the ACM Multimedia 2017 Grand Challenge.
Task Description

This year we will focus on the video to language task. Given an input video clip, the goal is to automatically generate a complete and natural sentence describing the video content, ideally encapsulating its most informative dynamics.

The contestants are asked to develop video to language systems based on the MSR-VTT dataset provided by the challenge (as training data) and any other public or private data, recognizing a wide range of objects, scenes, events, etc., in the videos. For evaluation purposes, a contesting system is asked to produce at least one sentence for each of the test videos. The accuracy will be evaluated against human-generated reference sentence(s) during the evaluation stage.
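To illustrate what such a system might look like, below is a minimal sketch of an encoder-decoder captioning baseline in PyTorch: pre-extracted frame features are mean-pooled to initialize a GRU decoder that greedily emits one word per step. All dimensions, vocabulary ids, and the greedy decoding loop are illustrative assumptions, not a reference implementation of any challenge entry.

```python
# Minimal illustrative video-to-language baseline (assumed design, not the
# challenge's reference system): mean-pooled frame features initialize a GRU
# decoder that greedily generates one word id per step.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 bos_id=1, max_len=20):
        super().__init__()
        self.bos_id, self.max_len = bos_id, max_len
        self.encode = nn.Linear(feat_dim, hidden_dim)    # pooled video -> initial state
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats):
        """frame_feats: (batch, num_frames, feat_dim) pre-extracted CNN features."""
        pooled = frame_feats.mean(dim=1)                  # (batch, feat_dim)
        h = torch.tanh(self.encode(pooled)).unsqueeze(0)  # (1, batch, hidden_dim)
        token = torch.full((frame_feats.size(0), 1), self.bos_id, dtype=torch.long)
        words = []
        for _ in range(self.max_len):                     # greedy decoding
            emb = self.embed(token)                       # (batch, 1, hidden_dim)
            out, h = self.gru(emb, h)
            logits = self.out(out.squeeze(1))             # (batch, vocab_size)
            token = logits.argmax(dim=-1, keepdim=True)   # next word id
            words.append(token)
        return torch.cat(words, dim=1)                    # (batch, max_len) word ids


if __name__ == "__main__":
    model = VideoCaptioner()
    dummy = torch.randn(2, 30, 2048)                      # 2 clips, 30 frames each
    print(model(dummy).shape)                             # torch.Size([2, 20])
```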

For more information, please refer to “Microsoft Research Video to Language Grand Challenge” website (http://ms-multimedia-challenge.com/).

Dataset

The dataset is based on MSR-VTT, and we split the data into training, testing, and validation sets at a ratio of 60%:30%:10%, respectively. The table below shows the statistics of the MSR-VTT dataset.

Dataset | Context        | Sentence source | #Video | #Clip  | #Sentence | #Word     | Vocabulary | Duration (hr)
MSR-VTT | 20 categories  | AMT workers     | 5,942  | 10,000 | 200,000   | 1,535,917 | 28,528     | 38.7

*In the MSR-VTT dataset, we provide category information for each video clip, and each video clip contains audio information as well.
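For reference, the sketch below shows one way to group the provided captions and category labels by clip. The file name train_val_videodatainfo.json and the "videos"/"sentences" field names are assumptions based on the public MSR-VTT release and may differ from the files distributed with this challenge.

```python
# Assumed MSR-VTT annotation layout: a JSON file with a "videos" list
# (video_id, category, split, ...) and a "sentences" list (video_id, caption).
import json
from collections import defaultdict

with open("train_val_videodatainfo.json") as f:   # assumed file name
    data = json.load(f)

captions = defaultdict(list)                      # video_id -> list of sentences
for sent in data["sentences"]:
    captions[sent["video_id"]].append(sent["caption"])

categories = {v["video_id"]: v["category"] for v in data["videos"]}

vid = data["videos"][0]["video_id"]
print(vid, categories[vid], len(captions[vid]), captions[vid][0])
```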

Evaluation Metrics

The evaluation provided here can be used to obtain results on the testing set of MSR-VTT. It computes multiple common metrics, including BLEU@4, METEOR, ROUGE-L, and CIDEr.
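As an example, the same four metrics can be computed offline with the Microsoft COCO caption evaluation toolkit (the pycocoevalcap package). The sketch below uses placeholder video ids and sentences, and the official evaluation may tokenize or aggregate results slightly differently.

```python
# Sketch of offline scoring with pycocoevalcap (BLEU@4, METEOR, ROUGE-L, CIDEr).
# The ids and sentences are placeholders; METEOR and the tokenizer require Java.
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# References: each video id maps to its human-written sentences.
gts = {"video0": [{"caption": "a man is playing a guitar on stage"}]}
# Results: one generated sentence per video id.
res = {"video0": [{"caption": "a man plays guitar"}]}

tokenizer = PTBTokenizer()
gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)

scorers = [(Bleu(4), ["BLEU@1", "BLEU@2", "BLEU@3", "BLEU@4"]),
           (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"), (Cider(), "CIDEr")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(name, list):          # Bleu returns scores for n = 1..4
        score, name = score[-1], name[-1]
    print(f"{name}: {score:.4f}")
```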

In addition, we will carry out a human evaluation of the systems submitted to this challenge on a subset of the testing set. Human judges are asked to rank four generated sentences and a reference sentence from 1 to 5 (lower is better) with respect to the following criteria.

  • Grammar: judge the fluency and readability of the sentence (independent of its correctness with respect to the video clip).
  • Correctness: judge whether the content of the sentence is correct with respect to the video clip (regardless of whether it is complete, i.e., describes everything), independent of grammatical correctness.
  • Relevance: judge whether the sentence contains the most salient (i.e., relevant, important) events/objects of the video clip.
  • Helpfulness for the blind (additional criterion): judge how helpful the sentence would be for a blind person to understand what is happening in the video clip.
Important Dates:
  • April 18, 2017: Dataset available for download (training and validation set)
  • June 1, 2017: Test set available for download
  • June 15, 2017: Results and one-page notebook paper submission
  • June 16-28, 2017: Objective evaluation and human evaluation
  • July 3, 2017: Evaluation results announced
  • July 14, 2017: Paper submission deadline (please follow the instructions on the main conference website)
Contact

Ting Yao (), Associate Researcher, Microsoft Research Asia

Tao Mei (), Senior Researcher, Microsoft Research Asia
