Paper Reading: Video Action Transformer Network

Table of Contents

Objective (Task)

Motivation

Proposed Method

Trunk: I3D

Region Proposal Network(RPN): Faster R-CNN

Action Transformer Head

Dataset

Submission Format

Result & Analysis

Action classification with GT person boxes

Localization performance (action agnostic)

Overall performance

Comparison with previous state of the art and challenge submissions

Embedding and attention

 


 

Paper: Video Action Transformer Network

PDF: https://arxiv.org/pdf/1812.02707.pdf

Project page (no code released!): http://rohitgirdhar.github.io/ActionTransformer

 


 

Objective (Task)

recognizing and localizing human actions (my understanding: the model only localizes people and classifies their actions at the keyframe; it does not localize along the temporal dimension)

Examples of 3-second video segments (from Video Source) with their bounding box annotations in the middle frame of each segment. (For clarity, only one bounding box is shown for each example.)

Figure from: https://ai.googleblog.com/2017/10/announcing-ava-finely-labeled-video.html

 


 

Motivation

  • One reason human actions remain so difficult to recognize is that inferring a person’s actions often requires understanding the people and objects around them (‘pointing to an object’, ‘holding an object’, or ‘shaking hands’). Note that this context is not limited to a single point in time (‘watching a person’ rather than ‘staring into the distance’).
  • Thus we seek a model that can determine and utilize such contextual information (other people, other objects) when classifying the action of a person of interest. The Transformer architecture is a natural choice, since it explicitly builds contextual support for its representations using self-attention.

 


 

Proposed Method

  • Trunk (I3D): generates the base features for the video
  • RPN (Faster R-CNN): localizes people in the keyframe
  • Action Transformer Head: aggregates contextual information from other people and objects in the surrounding video.
     

 

Trunk: I3D

  • Trunk: the initial layers of an I3D network pre-trained on Kinetics-400 (the part of the backbone before the red line in the figure)
  • Input: a 64-frame clip (about 3 seconds of context around a given keyframe)
  • Output: the feature map from the Mixed_4f layer (size: T/4 × H/16 × W/16)

Figure from the I3D paper: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D code: https://github.com/deepmind/kinetics-i3d/blob/master/i3d.py
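For concreteness, here is a minimal sketch of extracting the trunk feature with the linked repo, assuming TensorFlow 1.x and dm-sonnet are installed and the kinetics-i3d code is on the path; the constructor/call signature may differ slightly between versions of that repo, and loading the released Kinetics-400 checkpoint is omitted.

```python
import tensorflow as tf
import i3d  # i3d.py from https://github.com/deepmind/kinetics-i3d

T, H, W = 64, 400, 400                               # 64-frame RGB clip; 400x400 is illustrative
clip = tf.placeholder(tf.float32, [1, T, H, W, 3])   # ~3 s of video around the keyframe

# Cut the I3D backbone at the Mixed_4f endpoint -- this truncated network is the "trunk".
trunk = i3d.InceptionI3d(num_classes=400, final_endpoint='Mixed_4f')
features, _ = trunk(clip, is_training=False)

# Mixed_4f output has shape [1, T/4, H/16, W/16, 832] = [1, 16, 25, 25, 832]
print(features.shape)
```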

 

 

Region Proposal Network(RPN): Faster R-CNN

  • slice out the temporally-central frame from the feature map and pass it through a region proposal network (RPN) 
  • The RPN generates multiple potential person bounding boxes along with objectness scores
  • select R boxes (we use R = 300) with the highest objectness scores.
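A minimal sketch of this proposal-selection step, assuming the trunk feature map has shape [T', H', W', C] and that `rpn` is any Faster R-CNN-style proposal network returning (boxes, objectness); both names are placeholders, not the authors' implementation.

```python
import numpy as np

def select_person_proposals(trunk_features, rpn, R=300):
    """Slice the temporally-central frame, run the RPN, keep the top-R boxes."""
    t_center = trunk_features.shape[0] // 2
    keyframe_feats = trunk_features[t_center]        # [H', W', C] central-frame feature map

    boxes, objectness = rpn(keyframe_feats)          # [N, 4] candidate person boxes, [N] scores

    top = np.argsort(-objectness)[:R]                # keep the R highest objectness scores
    return boxes[top], objectness[top]
```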

 

 

Action Transformer Head

  • Q (query): the person being classified (the person box from the RPN)
  • K (key) & V (value): the clip around the person

Overall flow inside the Transformer unit

  • The unit processes the query and the memory to output an updated query vector.
  • The intuition is that the self-attention will add context from other people and objects in the clip to the query vector, to aid with the subsequent classification.
  • This unit can be stacked in multiple heads and layers, similar to the original architecture [43], by concatenating the output from the multiple heads at a given layer and using the concatenated feature as the next query.
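A minimal PyTorch sketch of one such unit, assuming D = 128 and a trunk feature map flattened to N = T'·H'·W' cells; the normalization placement and module names are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TxUnit(nn.Module):
    """One Action Transformer unit: person query + clip memory -> updated query vector."""
    def __init__(self, d=128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # 2-layer MLP
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q, k, v):
        # q: [B, d] query for the person being classified
        # k, v: [B, N, d] key/value projections of the flattened clip feature map
        attn = torch.softmax((k @ q.unsqueeze(-1)).squeeze(-1) / q.shape[-1] ** 0.5, dim=-1)
        ctx = (attn.unsqueeze(-1) * v).sum(dim=1)   # context gathered from the rest of the clip
        q = self.norm1(q + ctx)                     # residual update of the query
        return self.norm2(q + self.ffn(q))          # FFN + residual -> updated query vector

# Stacking: the outputs of the heads at one layer are concatenated (and projected back to d)
# to form the query for the next layer, as described above.
```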
     

How exactly are the key, value, and query features computed?

  • The key and value features are simply computed as linear projections of the original feature map from the trunk, hence each is of shape T′ ×H′ ×W′ ×D.
  • We extract the RoIPool-ed feature for the person box from the feature map of the center frame, and pass it through a query preprocessor (QPr) and a linear layer to get the query feature of size 1×1×D (D = 128, the same dimensionality as the key and value features). There are two versions of QPr: HighRes and LowRes.
    • LowRes: directly average the RoIPool-ed feature across space; this loses all spatial layout of the person.
    • HighRes: first reduce the dimensionality with a 1 × 1 convolution, resulting in a 7 × 7 feature map, and then concatenate the cells of the resulting feature map into a vector.

Feed Forward Network (FFN): a 2-layer MLP
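A sketch of the two query preprocessors, assuming a C×7×7 RoIPool-ed person feature; the channel count after the 1×1 convolution in HighRes is illustrative (the paper only says the dimensionality is reduced).

```python
import torch
import torch.nn as nn

class QueryPreprocessor(nn.Module):
    """QPr + linear layer: RoIPool-ed person feature [B, C, 7, 7] -> query of size D."""
    def __init__(self, c=832, d=128, highres=True):
        super().__init__()
        self.highres = highres
        if highres:
            # HighRes: 1x1 conv to shrink channels (16 here is illustrative), keep the
            # 7x7 spatial layout, then concatenate the cells into one vector.
            self.reduce = nn.Conv2d(c, 16, kernel_size=1)
            self.to_query = nn.Linear(16 * 7 * 7, d)
        else:
            # LowRes: average over space, which discards the person's spatial layout.
            self.to_query = nn.Linear(c, d)

    def forward(self, roi_feat):                       # roi_feat: [B, C, 7, 7]
        if self.highres:
            x = self.reduce(roi_feat).flatten(1)       # [B, 16 * 7 * 7]
        else:
            x = roi_feat.mean(dim=(2, 3))              # [B, C]
        return self.to_query(x)                        # [B, D] query feature
```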
 

 


 

Dataset

AVA dataset website: http://research.google.com/ava/download.html

ActivityNet description of the dataset task: http://activity-net.org/challenges/2019/tasks/guest_ava.html

  • Task: Spatio-Temporal Action Localization
  • Each labeled video segment can contain multiple subjects, each performing potentially multiple actions.
  • Evaluation Metric: Frame-mAP at spatial IoU >= 0.5
  • The AVA dataset densely annotates 80 atomic visual actions in 430 (235 videos for training, 64 videos for validation, 131 videos for test) 15-minute movie clips.
  • Generally raters provided annotations at timestamps 902:1798 inclusive, in seconds, at 1-second intervals.
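For reference, the spatial IoU in the Frame-mAP metric is the standard box intersection-over-union; a minimal sketch with boxes given as (x1, y1, x2, y2), not the official evaluation code:

```python
def box_iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2); a detection matches a GT box when IoU >= 0.5."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```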
     

Submission Format

  • The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score
    • video_id: YouTube identifier
    • middle_frame_timestamp: in seconds from the start of the video.
    • person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
    • action_id: integer identifier of an action class, from ava_action_list_v2.2_for_activitynet_2019.pbtxt.
    • score: a float indicating the score for this labeled box.

 

  • An example taken from the validation set is: 1j20qq1JyX4, 0902, 0.002, 0.118, 0.714, 0.977, 12, 0.9
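A minimal sketch of writing one prediction row in this format (the CSV has no header; the values are the validation-set example above):

```python
import csv

# video_id, middle_frame_timestamp, person_box (x1, y1, x2, y2), action_id, score
row = ["1j20qq1JyX4", "0902", 0.002, 0.118, 0.714, 0.977, 12, 0.9]

with open("predictions.csv", "w", newline="") as f:
    csv.writer(f).writerow(row)   # -> 1j20qq1JyX4,0902,0.002,0.118,0.714,0.977,12,0.9
```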

 


 

Result & Analysis

Action classification with GT person boxes

1. Using ground-truth (GT) boxes works better than directly using the Faster R-CNN proposals (R = 64): the RPN is likely to be less perfect than ground truth. However, the model only gains a small improvement from GT boxes, indicating that it is already capable of learning a good representation for person detection.
2. It is also worth noting that the Action Transformer head implementation actually has 2.3M fewer parameters than the I3D head in the LowRes QPr case.
3. The Tx head outperforms the I3D head, which demonstrates the importance of context.

 

 

Localization performance (action agnostic)

  • isolate the localization performance by merging all classes into a single trivial one.
  • The transformer is less accurate for localization.

 

 

Overall performance

  • Action Transformer head is far superior to the I3D head (24.4 compared to 20.5).
  • An additional boost can be obtained (to 24.9) by using the I3D head for regression and the Action Transformer head for classification.

 

 

Comparison with previous state of the art and challenge submissions

  • '96f' means that the input clip has T = 96 frames.

 

 

Embedding and attention

  • For two frames, we show the ‘key’ embeddings as color-coded 3D PCA projections for two of the six heads in the 2-head, 3-layer Tx head.
  • It is interesting to note that one of these heads learns to track people semantically (Tx-A: all upper bodies are a similar color – green), while the other is instance-specific (Tx-B: each person is a different color – blue, pink and purple).
  • In the following columns we show the average softmax attention corresponding to the person in the red box, for all heads in the last Tx layer. The model learns to hone in on faces, hands and objects being interacted with, as these are most discriminative for recognizing actions.
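A minimal sketch of this kind of visualization, assuming the key embeddings for one frame are an array of shape [h*w, D]; the 3 PCA components are simply interpreted as RGB (scikit-learn is used for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def keys_to_rgb(keys, h, w):
    """keys: [h*w, D] key embeddings of one frame -> [h, w, 3] pseudo-color image."""
    proj = PCA(n_components=3).fit_transform(keys)     # 3D PCA projection of the keys
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)              # scale each component to [0, 1]
    return proj.reshape(h, w, 3)                       # treat the 3 components as RGB
```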
     

 

 
