Paper: Video Action Transformer Network
PDF: https://arxiv.org/pdf/1812.02707.pdf
Project page (note: no code released!): http://rohitgirdhar.github.io/ActionTransformer
Objective (Task)
recognizing and localizing human actions (my understanding: person localization and action classification are done only on the keyframe; there is no localization along the temporal dimension!)
Examples of 3-second video segments with their bounding box annotations in the middle frame of each segment. (For clarity, only one bounding box is shown for each example.)
Figure from: https://ai.googleblog.com/2017/10/announcing-ava-finely-labeled-video.html
Motivation
- One reason human actions remain so difficult to recognize is that inferring a person's actions often requires understanding the people and objects around them ('pointing to an object', 'holding an object', or 'shaking hands'). Note that this context is not limited to a single point in time ('watching a person' rather than 'staring into the distance').
- We therefore seek a model that can determine and utilize such contextual information (other people, other objects) when classifying the action of a person of interest. The Transformer architecture is a natural choice, since it explicitly builds contextual support for its representations using self-attention.
Proposed Method
- Trunk (I3D): produces the base feature map for the video clip
- RPN (Faster R-CNN): localizes people in the key frame
- Action Transformer Head: aggregates contextual information from other people and objects in the surrounding video (a pipeline sketch follows this list)
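To make the data flow concrete, here is a minimal sketch of the pipeline. The function names (`trunk`, `rpn`, `tx_head`) are hypothetical placeholders for the three components, not names from the paper or any released code:

```python
# Minimal sketch of the data flow through the three components above.
def action_transformer_forward(clip, trunk, rpn, tx_head):
    features = trunk(clip)          # I3D trunk features: [T', H', W', D]
    person_boxes = rpn(features)    # person boxes on the central key frame
    # one action prediction (and box refinement) per detected person
    return [tx_head(features, box) for box in person_boxes]
```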
Trunk: I3D
- Trunk: the initial layers of an I3D network pre-trained on Kinetics-400 (the backbone before the red line in the figure)
- Input: a 64-frame clip (about 3 seconds of context around a given keyframe)
- Output: the feature map from the Mixed_4f layer, of size T/4 × H/16 × W/16
Figure from the I3D paper: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I3D code:https://github.com/deepmind/kinetics-i3d/blob/master/i3d.py
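As a quick sanity check on these shapes, a tiny sketch; the 400 × 400 crop size and the 832-channel width of Mixed_4f are my assumptions based on the I3D architecture, not stated in this note:

```python
# Hedged helper: shape of the trunk output for a 64-frame clip.
# Mixed_4f downsamples time by 4 and space by 16; the 832-channel width
# and the 400 x 400 input are assumptions based on the I3D architecture.
def trunk_output_shape(t=64, h=400, w=400, c=832):
    return (t // 4, h // 16, w // 16, c)

print(trunk_output_shape())  # -> (16, 25, 25, 832)
```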
Region Proposal Network (RPN): Faster R-CNN
- slice out the temporally-central frame from the feature map and pass it through a region proposal network (RPN)
- The RPN generates multiple potential person bounding boxes along with objectness scores.
- select the R boxes with the highest objectness scores (the paper uses R = 300); a sketch of this step follows
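A minimal sketch of this selection step, treating the RPN itself as a black-box callable (names and signatures are illustrative, not from the paper's code):

```python
import numpy as np

def top_r_proposals(feature_map, rpn, r=300):
    """feature_map: [T', H', W', D] trunk output; rpn maps one frame's
    features to (boxes [N, 4], objectness scores [N])."""
    center = feature_map[feature_map.shape[0] // 2]  # temporally-central frame
    boxes, scores = rpn(center)
    keep = np.argsort(scores)[::-1][:r]              # highest objectness first
    return boxes[keep], scores[keep]
```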
Action Transformer Head
- Q (query): the person being classified (the person box from the RPN)
- K (key) & V (value): the clip around the person
The overall flow through the Transformer unit:
- processes the query and memory to output an updated query vector.
- The intuition is that the self-attention will add context from other people and objects in the clip to the query vector, to aid with the subsequent classification.
- This unit can be stacked in multiple heads and layers similar to the original architecture [43], by concatenating the output from the multiple heads at a given layer, and using the concatenated feature as the next query.
How exactly are the key, value, and query features computed?
- The key and value features are simply computed as linear projections of the original feature map from the trunk, so each has shape T′ × H′ × W′ × D.
- For the query, we extract the RoIPool-ed feature for the person box from the center clip and pass it through a query preprocessor (QPr) and a linear layer to get a query feature of size 1 × 1 × D (D = 128, the same as the key and value feature maps). QPr comes in two versions, HighRes and LowRes (sketched after this list):
- LowRes: directly average the RoIPool-ed feature across space; simple, but it loses all spatial layout of the person.
- HighRes: first reduce the dimensionality with a 1 × 1 convolution, resulting in a 7 × 7 feature map, and then concatenate the cells of that feature map into a single vector.
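A small numpy sketch of the two QPr variants; the shapes and weight names are illustrative assumptions:

```python
import numpy as np

def qpr_lowres(roi_feat):
    """roi_feat: [7, 7, C] RoIPool-ed person feature -> [C].
    Spatial averaging is simple but discards the person's layout."""
    return roi_feat.mean(axis=(0, 1))

def qpr_highres(roi_feat, w_1x1):
    """w_1x1: [C, C_small], the weights of the 1x1 conv (a per-cell linear
    map). The 7 x 7 cells are then concatenated to keep spatial layout."""
    reduced = roi_feat @ w_1x1
    return reduced.reshape(-1)  # vector of length 49 * C_small
```

In both cases a final linear layer then maps the result to the 1 × 1 × D query.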
Feed Forward Network (FFN): a 2-layer MLP (a sketch of the full unit follows)
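Putting the pieces together, here is a minimal numpy sketch of one Action Transformer unit; the exact placement of LayerNorm and dropout is my assumption of a standard Transformer update, not confirmed by the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def tx_unit(q, keys, values, w1, w2, d=128):
    """q: [D] person query; keys/values: [N, D], the flattened clip features
    (N = T' * H' * W'). Returns the updated query vector."""
    logits = keys @ q / np.sqrt(d)
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                      # softmax over all clip locations
    q = layer_norm(q + attn @ values)       # add attended context to the query
    hidden = np.maximum(0.0, q @ w1)        # FFN: 2-layer MLP with ReLU
    return layer_norm(q + hidden @ w2)      # residual + normalization
```

As noted above, multiple such heads can be stacked per layer, with the concatenated head outputs forming the next layer's query.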
Dataset
AVA dataset website: http://research.google.com/ava/download.html
ActivityNet description of the dataset task: http://activity-net.org/challenges/2019/tasks/guest_ava.html
- Task: Spatio-Temporal Action Localization
- Each labeled video segment can contain multiple subjects, each performing potentially multiple actions.
- Evaluation Metric: Frame-mAP at spatial IoU >= 0.5 (an IoU helper is sketched after this list)
- The AVA dataset densely annotates 80 atomic visual actions in 430 fifteen-minute movie clips (235 videos for training, 64 for validation, 131 for test).
- Generally, raters provided annotations at timestamps 902:1798 inclusive, in seconds from the start of the video, at 1-second intervals.
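For reference, a detection on the keyframe counts as a true positive only if its spatial IoU with a ground-truth box is at least 0.5. A minimal IoU helper, with boxes in the normalized (x1, y1, x2, y2) format used in the submission format below:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2), normalized coords."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

assert iou((0, 0, 1, 1), (0, 0, 1, 1)) == 1.0  # identical boxes match perfectly
```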
Submission Format
- The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score
- video_id: YouTube identifier
- middle_frame_timestamp: in seconds from the start of the video.
- person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
- action_id: integer identifier of an action class, from ava_action_list_v2.2_for_activitynet_2019.pbtxt.
- score: a float indicating the score for this labeled box.
- An example taken from the validation set is: 1j20qq1JyX4, 0902, 0.002, 0.118, 0.714, 0.977, 12, 0.9
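A minimal sketch that reproduces the example row above with Python's csv module; the file handling and the hard-coded values are the only assumptions here:

```python
import csv, sys

# One detection: video id, keyframe timestamp, normalized box, action id, score.
detection = ["1j20qq1JyX4", "0902", 0.002, 0.118, 0.714, 0.977, 12, 0.9]

writer = csv.writer(sys.stdout)
writer.writerow(detection)  # -> 1j20qq1JyX4,0902,0.002,0.118,0.714,0.977,12,0.9
```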
Results & Analysis
Action classification with GT person boxes
1. Using GT boxes works better than using Faster R-CNN proposals directly (R = 64): the RPN is inevitably less perfect than ground truth. However, the improvement from using ground-truth (GT) boxes is only small, indicating that the model is already capable of learning a good representation for person detection.
2. It is also worth noting that the Action Transformer head implementation actually has 2.3M fewer parameters than the I3D head in the LowRes QPr case.
3. The Tx head outperforms the I3D head, which demonstrates the importance of context.
Localization performance (action agnostic)
- isolate the localization performance by merging all classes into a single trivial one.
- The transformer is less accurate for localization.
Overall performance
- Action Transformer head is far superior to the I3D head (24.4 compared to 20.5).
- An additional boost can be obtained (to 24.9) by using the I3D head for regression and the Action Transformer head for classification.
Comparison with previous state of the art and challenge submissions
- "96f" means the input clip has T = 96 frames
Embedding and attention
- For two frames, we show their 'key' embeddings as color-coded 3D PCA projections for two of the six heads in the 2-head, 3-layer Tx head.
- It is interesting to note that one of these heads learns to track people semantically (Tx-A: all upper bodies are a similar color, green), while the other is instance specific (Tx-B: each person is a different color: blue, pink, and purple).
- The following columns show the average softmax attention corresponding to the person in the red box, for all heads in the last Tx layer. The model learns to hone in on faces, hands, and objects being interacted with, as these are the most discriminative cues for recognizing actions.