Paper: Video Action Transformer Network
PDF: https://arxiv.org/pdf/1812.02707.pdf
Project page (note: no code released!): http://rohitgirdhar.github.io/ActionTransformer
Objective (Task)
recognizing and localizing human actions (my understanding: person localization and action classification are done only on the keyframe; there is no localization along the temporal dimension!)
Examples of 3-second video segments with their bounding box annotations in the middle frame of each segment. (For clarity, only one bounding box is shown for each example.)
Figure from: https://ai.googleblog.com/2017/10/announcing-ava-finely-labeled-video.html
Motivation
- One reason human actions remain so difficult to recognize is that inferring a person's actions often requires understanding the people and objects around them ('pointing to an object', 'holding an object', or 'shaking hands'). Note that this context is not limited to a single point in time ('watching a person' rather than 'staring into the distance').
- We therefore seek a model that can determine and utilize such contextual information (other people, other objects) when classifying the action of a person of interest. The Transformer architecture is a natural choice, since it explicitly builds contextual support for its representations using self-attention.
Proposed Method
- Trunk (I3D): produces the base feature map for the video clip
- RPN (Faster R-CNN): localizes people in the key frame
- Action Transformer Head: aggregates contextual information from other people and objects in the surrounding video (a pipeline sketch follows this list)
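To make the data flow concrete, here is a minimal sketch of the pipeline. The function names (`trunk`, `rpn`, `tx_head`) are hypothetical placeholders for the three components, not names from the paper or any released code:

```python
# Minimal sketch of the data flow through the three components above.
def action_transformer_forward(clip, trunk, rpn, tx_head):
    features = trunk(clip)          # I3D trunk features: [T', H', W', D]
    person_boxes = rpn(features)    # person boxes on the central key frame
    # one action prediction (and box refinement) per detected person
    return [tx_head(features, box) for box in person_boxes]
```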
Trunk: I3D
- Trunk: the initial layers of an I3D network pre-trained on Kinetics-400 (the backbone before the red line in the figure)
- Input: a 64-frame clip (about 3 seconds of context around a given keyframe)
- Output: the feature map from the Mixed_4f layer, of size T/4 × H/16 × W/16
Figure from the I3D paper: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I3D code:https://github.com/deepmind/kinetics-i3d/blob/master/i3d.py
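As a quick sanity check on these shapes, a tiny sketch; the 400 × 400 crop size and the 832-channel width of Mixed_4f are my assumptions based on the I3D architecture, not stated in this note:

```python
# Hedged helper: shape of the trunk output for a 64-frame clip.
# Mixed_4f downsamples time by 4 and space by 16; the 832-channel width
# and the 400 x 400 input are assumptions based on the I3D architecture.
def trunk_output_shape(t=64, h=400, w=400, c=832):
    return (t // 4, h // 16, w // 16, c)

print(trunk_output_shape())  # -> (16, 25, 25, 832)
```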
Region Proposal Network (RPN): Faster R-CNN
- slice out the temporally-central frame from the feature map and pass it through a region proposal network (RPN)
- The RPN generates multiple potential person bounding boxes along with objectness scores.
- select the R boxes with the highest objectness scores (the paper uses R = 300); a sketch of this step follows
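A minimal sketch of this selection step, treating the RPN itself as a black-box callable (names and signatures are illustrative, not from the paper's code):

```python
import numpy as np

def top_r_proposals(feature_map, rpn, r=300):
    """feature_map: [T', H', W', D] trunk output; rpn maps one frame's
    features to (boxes [N, 4], objectness scores [N])."""
    center = feature_map[feature_map.shape[0] // 2]  # temporally-central frame
    boxes, scores = rpn(center)
    keep = np.argsort(scores)[::-1][:r]              # highest objectness first
    return boxes[keep], scores[keep]
```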
Action Transformer Head
- Q (query): the person being classified (the person box from the RPN)
- K (key) & V (value): the clip around the person
The overall flow through the Transformer unit:
- processes the query and memory to output an updated query vector.
- The intuition is that the self-attention will add context from other people and objects in the clip to the query vector, to aid with the subsequent classification.
- This unit can be stacked in multiple heads and layers similar to the original architecture [43], by concatenating the output from the multiple heads at a given layer, and using the concatenated feature as the next query.
How exactly are the key, value, and query features computed?
- The key and value features are simply computed as linear projections of the original feature map from the trunk, so each has shape T′ × H′ × W′ × D.
- For the query, we extract the RoIPool-ed feature for the person box from the center clip and pass it through a query preprocessor (QPr) and a linear layer to get a query feature of size 1 × 1 × D (D = 128, the same as the key and value feature maps). QPr comes in two versions, HighRes and LowRes (sketched after this list):
- LowRes: directly average the RoIPool-ed feature across space; simple, but it loses all spatial layout of the person.
- HighRes: first reduce the dimensionality with a 1 × 1 convolution, resulting in a 7 × 7 feature map, and then concatenate the cells of that feature map into a single vector.
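A small numpy sketch of the two QPr variants; the shapes and weight names are illustrative assumptions:

```python
import numpy as np

def qpr_lowres(roi_feat):
    """roi_feat: [7, 7, C] RoIPool-ed person feature -> [C].
    Spatial averaging is simple but discards the person's layout."""
    return roi_feat.mean(axis=(0, 1))

def qpr_highres(roi_feat, w_1x1):
    """w_1x1: [C, C_small], the weights of the 1x1 conv (a per-cell linear
    map). The 7 x 7 cells are then concatenated to keep spatial layout."""
    reduced = roi_feat @ w_1x1
    return reduced.reshape(-1)  # vector of length 49 * C_small
```

In both cases a final linear layer then maps the result to the 1 × 1 × D query.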
Feed Forward Network (FFN): a 2-layer MLP (a sketch of the full unit follows)
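Putting the pieces together, here is a minimal numpy sketch of one Action Transformer unit; the exact placement of LayerNorm and dropout is my assumption of a standard Transformer update, not confirmed by the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def tx_unit(q, keys, values, w1, w2, d=128):
    """q: [D] person query; keys/values: [N, D], the flattened clip features
    (N = T' * H' * W'). Returns the updated query vector."""
    logits = keys @ q / np.sqrt(d)
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                      # softmax over all clip locations
    q = layer_norm(q + attn @ values)       # add attended context to the query
    hidden = np.maximum(0.0, q @ w1)        # FFN: 2-layer MLP with ReLU
    return layer_norm(q + hidden @ w2)      # residual + normalization
```

As noted above, multiple such heads can be stacked per layer, with the concatenated head outputs forming the next layer's query.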
Dataset
AVA dataset website: http://research.google.com/ava/download.html
ActivityNet description of the dataset task: http://activity-net.org/challenges/2019/tasks/guest_ava.html
- Task: Spatio-Temporal Action Localization
- Each labeled video segment can contain multiple subjects, each performing potentially multiple actions.
- Evaluation Metric: Frame-mAP at spatial IoU >= 0.5 (an IoU helper is sketched after this list)
- The AVA dataset densely annotates 80 atomic visual actions in 430 fifteen-minute movie clips (235 videos for training, 64 for validation, 131 for test).
- Generally, raters provided annotations at timestamps 902:1798 inclusive, in seconds from the start of the video, at 1-second intervals.
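For reference, a detection on the keyframe counts as a true positive only if its spatial IoU with a ground-truth box is at least 0.5. A minimal IoU helper, with boxes in the normalized (x1, y1, x2, y2) format used in the submission format below:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2), normalized coords."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

assert iou((0, 0, 1, 1), (0, 0, 1, 1)) == 1.0  # identical boxes match perfectly
```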
Submission Format
- The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score
- video_id: YouTube identifier
- middle_frame_timestamp: in seconds from the start of the video.
- person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
- action_id: integer identifier of an action class, from ava_action_list_v2.2_for_activitynet_2019.pbtxt.
- score: a float indicating the score for this labeled box.
- An example taken from the validation set is: 1j20qq1JyX4, 0902, 0.002, 0.118, 0.714, 0.977, 12, 0.9
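A minimal sketch that reproduces the example row above with Python's csv module; the file handling and the hard-coded values are the only assumptions here:

```python
import csv, sys

# One detection: video id, keyframe timestamp, normalized box, action id, score.
detection = ["1j20qq1JyX4", "0902", 0.002, 0.118, 0.714, 0.977, 12, 0.9]

writer = csv.writer(sys.stdout)
writer.writerow(detection)  # -> 1j20qq1JyX4,0902,0.002,0.118,0.714,0.977,12,0.9
```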
Results & Analysis
Action classification with GT person boxes
1. Using GT boxes works better than using Faster R-CNN proposals directly (R = 64): the RPN is inevitably less perfect than ground truth. However, the improvement from using ground-truth (GT) boxes is only small, indicating that the model is already capable of learning a good representation for person detection.
2. It is also worth noting that the Action Transformer head implementation actually has 2.3M fewer parameters than the I3D head in the LowRes QPr case.
3. The Tx head outperforms the I3D head, which demonstrates the importance of context.
Localization performance (action agnostic)
- isolate the localization performance by merging all classes into a single trivial one.
- The transformer is less accurate for localization.
Overall performance
- Action Transformer head is far superior to the I3D head (24.4 compared to 20.5).
- An additional boost can be obtained (to 24.9) by using the I3D head for regression and the Action Transformer head for classification.
Comparison with previous state of the art and challenge submissions
- "96f" means the input clip has T = 96 frames
Embedding and attention
- For two frames, we show their 'key' embeddings as color-coded 3D PCA projections for two of the six heads in the 2-head, 3-layer Tx head.
- It is interesting to note that one of these heads learns to track people semantically (Tx-A: all upper bodies are a similar color, green), while the other is instance specific (Tx-B: each person is a different color: blue, pink, and purple).
- The following columns show the average softmax attention corresponding to the person in the red box, for all heads in the last Tx layer. The model learns to hone in on faces, hands, and objects being interacted with, as these are the most discriminative cues for recognizing actions.