1. Motivation
- Most previous methods do not explicitly model the relationships between different action labels.
Although these works achieve strong multi-label action localization performance, they do not explicitly model the relationships between the different action labels, which can be extremely useful for determining the presence or absence of classes within a video.
- Later works do measure the co-occurrence of actions, but only at the video level: they do not distinguish actions occurring within the same time-step from actions occurring across different time-steps, i.e., they miss the temporal dependencies this paper focuses on.
In addition, the later works measure the video-level co-occurrence of actions, which does not differentiate between actions that occur within the same time-step and across different time-steps.
- Most importantly, the paper points out that no existing work explicitly models both the co-occurrence and the temporal dependencies between action classes.
To the best of our knowledge, no existing works explicitly model both the co-occurrence and temporal dependencies between action classes.
2. Introduction
- Domain of this paper: multi-label temporal action detection (localization).
- Action localization: the task of predicting, at each time-step of a video sequence, the action or actions present at that moment.
The task of action localization in the untrimmed video involves predicting the action, or actions, present at each time-step of the video sequence
Videos contain two kinds of action dependencies: co-occurrence dependencies (actions happening at the same time) and temporal dependencies (one action preceding or following another).
Videos contain two types of action dependencies:
- co-occurrence dependencies, involving actions that occur at the same time (this is most analogous to object class co-occurrence within images),
- temporal dependencies, involving actions that precede or follow each other.
As shown in Figure 1, the first video illustrates co-occurrence dependencies (within the same time-step), while the second illustrates temporal dependencies (across time-steps).
3. Contribution
- The paper proposes a new network architecture that jointly models co-occurrence action dependencies and temporal action dependencies.
We present a novel network architecture that models both co-occurrence action dependencies and temporal action dependencies.
- It also proposes multi-label performance metrics that evaluate how well a model captures class co-occurrence both within a single time-step and across time-steps.
We propose multi-label performance metrics to measure a method’s ability to model class co-occurrence across time-steps as well as within a time-step.
- The approach outperforms the existing state of the art on two large-scale public multi-label action datasets.
We evaluate the proposed approach on two large scale publicly available multi-label action datasets, outperforming existing state-of-the-art methods.
4. Method
Figure 2 shows the network architecture, which has three components. First, a Feature Extractor layer expands the features produced by a pre-trained backbone into class-level features of shape $T \times H \times C$. Next comes the core MLAD layer, built from TB and CB blocks in a transformer-like structure; its output keeps the same dimensions. Finally, a simple classification layer predicts each action class at every time-step, yielding a $T \times C$ output.
4.1 Problem Formulation
- The problem of multi-label temporal action localization involves classifying all activities occurring throughout a video at each time-step.
- The input is a sequence of $T$ feature vectors $x_t \in \mathbb{R}^F$, i.e. a $T \times F$ matrix.
- Each time-step carries a ground-truth action label $y_{t,c} \in \{0, 1\}$ for each of the $C$ action classes.
4.2 Class-level Feature Extraction
In code, this is simply a per-class FC layer followed by ReLU:

$$f_{t,c} = \mathrm{ReLU}(W_c^T x_t + b_c)$$

```python
# One class-specific linear projection + ReLU per action class
self.feature_expansion = nn.ModuleList([
    nn.Sequential(nn.Linear(feature_dim, self.hidden_dim), nn.ReLU())
    for i in range(self.num_classes)
])
```
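A minimal sketch of how this per-class expansion could be applied (toy shapes; variable names are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

# Toy sizes: T time-steps, F-dim backbone features, H hidden dim, C classes
T, F_dim, H, C = 8, 16, 32, 5

# One class-specific FC + ReLU per action class, as in the snippet above
feature_expansion = nn.ModuleList([
    nn.Sequential(nn.Linear(F_dim, H), nn.ReLU()) for _ in range(C)
])

x = torch.randn(T, F_dim)  # backbone features, one vector per time-step
# Apply each class-specific projection and stack: (T, C, H) class-level features
f = torch.stack([proj(x) for proj in feature_expansion], dim=1)
print(f.shape)  # torch.Size([8, 5, 32])
```

Stacking along a new class dimension makes the class-level structure explicit, which is what the CB and TB attention branches below operate on.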
4.3 MLAD Layer
The paper proposes the Multi-Label Action Dependency (MLAD) layer, which factorizes the traditional fully connected graph-based or attention-based formulations into two kinds of connections: $C \times C$ and $T \times T$.
The MLAD layer contains two branches: the Co-occurrence Dependency Branch (CB) and the Temporal Dependency Branch (TB). The authors argue that these branches model the relevant action dependencies and refine the input class-level features.
4.3.1 Co-occurrence Dependency Branch (CB)
Similar to the Q/K/V mechanism in a transformer, $A^{(t)}$ is a $C \times C$ attention matrix that captures the relationships between different classes within the same time-step.
This attention matrix contains the relevance of each class for the classification of another class.
$$A^{(t)} = \mathrm{softmax}\!\left(\frac{Q_t K_t^T}{\sqrt{H}}\right)$$

$$f'_{t,c} = A^{(t)} V_t$$
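The CB attention can be sketched as a single-head attention over classes, applied independently at every time-step (projection weights and shapes here are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

T, C, H = 8, 5, 32
f = torch.randn(T, C, H)  # class-level features f_{t,c}

# Illustrative single-head projections (a real layer would learn these)
Wq, Wk, Wv = (torch.randn(H, H) for _ in range(3))
Q, K, V = f @ Wq, f @ Wk, f @ Wv  # each (T, C, H)

# One C x C attention matrix A^(t) per time-step, over the classes
A_t = F.softmax(Q @ K.transpose(-2, -1) / H**0.5, dim=-1)  # (T, C, C)
f_prime = A_t @ V  # refined class-level features, still (T, C, H)
print(A_t.shape, f_prime.shape)
```

Each row of $A^{(t)}$ sums to one, so every class's refined feature is a weighted mix of all classes' values at that same time-step.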
4.3.2 Temporal Dependency Branch (TB)
In the TB branch, $A^{(c)}$ is a $T \times T$ attention matrix that captures the relationships across different time-steps for a given class.
$$A^{(c)} = \mathrm{softmax}\!\left(\frac{Q_c K_c^T}{\sqrt{H}}\right)$$

$$f''_{t,c} = A^{(c)} V_c$$
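The TB branch is symmetric to the CB sketch above: the same attention pattern, but over time-steps for each class (again an illustrative sketch, not the authors' code):

```python
import torch
import torch.nn.functional as F

T, C, H = 8, 5, 32
f = torch.randn(T, C, H)  # class-level features f_{t,c}

Wq, Wk, Wv = (torch.randn(H, H) for _ in range(3))

fc = f.transpose(0, 1)  # (C, T, H): group the features by class
Q, K, V = fc @ Wq, fc @ Wk, fc @ Wv

# One T x T attention matrix A^(c) per class, over the time-steps
A_c = F.softmax(Q @ K.transpose(-2, -1) / H**0.5, dim=-1)  # (C, T, T)
f_dbl = (A_c @ V).transpose(0, 1)  # back to (T, C, H)
print(A_c.shape, f_dbl.shape)
```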
4.3.3 Merging Branches and Classification
The outputs of Eq. 3 and Eq. 5 (the CB and TB branches) are merged as a weighted combination:

$$g_{t,c} = \alpha f'_{t,c} + (1 - \alpha) f''_{t,c}$$
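Assuming $\alpha$ is a learnable weight squashed into $(0, 1)$ (a sketch; the exact parameterization of $\alpha$ may differ in the paper), the merge is just an element-wise convex combination of the two branch outputs:

```python
import torch

T, C, H = 8, 5, 32
f_cb = torch.randn(T, C, H)  # f'  from the CB branch
f_tb = torch.randn(T, C, H)  # f'' from the TB branch

# Sketch: a learnable scalar mixing weight, kept in (0, 1) via sigmoid
alpha = torch.sigmoid(torch.randn(1))
g = alpha * f_cb + (1 - alpha) * f_tb  # merged features, (T, C, H)
print(g.shape)  # torch.Size([8, 5, 32])
```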