Paper Reading: Skeleton-Based Action Recognition with Directed Graph Neural Networks

Skeleton-Based Action Recognition with Directed Graph Neural Networks

(2019 CVPR)

Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu

Notes

 

Contributions

(1) To the best of our knowledge, this is the first work to represent the skeleton data as a directed acyclic graph to model the dependencies between joints and bones. A novel directed graph neural network is specially designed to extract these dependencies for the final action recognition task.

(2) An adaptively learned graph structure, which is trained and updated jointly with model parameters in the training process, is used to better suit the action recognition task.

(3) The motion information between consecutive frames is extracted for temporal information modeling. Both the spatial and motion information are fed into a two-stream framework for the final recognition task.

 


 

Method

Graph Construction

We represent the skeleton data as a directed acyclic graph (DAG) with the joints as vertexes and bones as edges. The direction of each edge is determined by the distance between the vertex and the root vertex, where the vertex closer to the root vertex points to the vertex farther from the root vertex. Here, the root vertex is defined as the center of gravity of the skeleton.

Formally, for each vertex v_i, we define an edge heading to it as an incoming edge and an edge emitting from it as an outgoing edge. Similarly, for a directed edge e_j = (v_{s_j}, v_{t_j}), we define it as a vector pointing from its source vertex v_{s_j} to its target vertex v_{t_j}. We use E_i^- and E_i^+ to denote the set of incoming edges and the set of outgoing edges of vertex v_i, respectively. In this way, a skeleton-based frame can be formulated as a directed graph G = (V, E), where V = {v_1, ..., v_N} is the set of vertexes (joints) and E = {e_1, ..., e_M} is the set of directed edges (bones). A skeleton-based video is a sequence of frames that can be formulated as S = {G_1, G_2, ..., G_T}, where T denotes the length of the video.
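The construction above can be sketched as follows. The toy five-joint skeleton, joint indices, and helper names are illustrative placeholders, not the paper's actual dataset layout; a parent vertex stands in for "closer to the root", so each bone points from parent to child:

```python
import numpy as np

# A toy 5-joint skeleton (hypothetical indices):
# 0 = spine (taken as the root, standing in for the centre of gravity),
# 1 = neck, 2 = head, 3 = left shoulder, 4 = right shoulder.
parents = {1: 0, 2: 1, 3: 1, 4: 1}  # child -> parent (parent is closer to the root)

def build_directed_edges(parents):
    """Each bone is directed from the vertex closer to the root (the parent)
    to the vertex farther from the root (the child)."""
    return [(p, c) for c, p in sorted(parents.items())]

def edge_sets(edges, num_joints):
    """Incoming/outgoing edge index sets per vertex (E_i^- and E_i^+)."""
    incoming = {i: [] for i in range(num_joints)}
    outgoing = {i: [] for i in range(num_joints)}
    for j, (s, t) in enumerate(edges):
        outgoing[s].append(j)  # e_j emits from its source vertex
        incoming[t].append(j)  # e_j heads to its target vertex
    return incoming, outgoing

edges = build_directed_edges(parents)
incoming, outgoing = edge_sets(edges, 5)
```

With this layout the root vertex has no incoming edges, matching the DAG structure the paper relies on.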

 

 

Directed graph network block

The directed graph network (DGN) block contains two updating functions, h_v and h_e, and two aggregation functions, g^- and g^+. The updating functions update the attributes of vertexes and edges based on their connected edges and vertexes. The aggregation functions aggregate the attributes contained in the multiple incoming (outgoing) edges connected to one vertex. Formally, this process is formulated as follows (Eq. 1):

    ē_i^- = g^-(E_i^-)
    ē_i^+ = g^+(E_i^+)
    v_i' = h_v([v_i, ē_i^-, ē_i^+])
    e_j' = h_e([e_j, v_{s_j}', v_{t_j}'])

where [·, ·] denotes concatenation.

Specifically,

  1. For each vertex v_i, all of the edges that point to it are processed by the incoming aggregation function g^-, which returns the aggregated result ē_i^-.
  2. Similar to step 1, all of the edges that emit from v_i are processed by the outgoing aggregation function g^+, which returns the aggregated result ē_i^+.
  3. v_i, ē_i^- and ē_i^+ are concatenated and fed into the vertex-update function h_v, which returns v_i' as the updated version of v_i.
  4. For each edge e_j, its updated source vertex v_{s_j}', its updated target vertex v_{t_j}' and the edge itself are concatenated and processed by the edge-update function h_e. The function returns e_j', which is the updated version of edge e_j.

The process can also be summarized as a vertex-update process followed by an edge-update process. Based on extensive experiments, we chose average pooling as the aggregation function for both the incoming and outgoing edges, and a single fully-connected layer as the update functions in this work.
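The four steps can be sketched in NumPy as follows, assuming average pooling for g^-/g^+ and single fully-connected layers for h_v/h_e as the text describes; the toy graph, channel sizes, and random weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (1, 3)]         # toy (source, target) bones
N, M, C, C_out = 4, 3, 3, 4              # joints, bones, in/out channels (hypothetical)
f_v = rng.standard_normal((N, C))        # vertex (joint) attributes
f_e = rng.standard_normal((M, C))        # edge (bone) attributes
W_v = rng.standard_normal((3 * C, C_out))          # h_v: single FC layer
W_e = rng.standard_normal((C + 2 * C_out, C_out))  # h_e: single FC layer

def dgn_block(f_v, f_e, edges, W_v, W_e):
    N = f_v.shape[0]
    agg_in = np.zeros((N, f_e.shape[1]))   # g^-: average pooling of incoming edges
    agg_out = np.zeros((N, f_e.shape[1]))  # g^+: average pooling of outgoing edges
    for i in range(N):
        inc = [f_e[j] for j, (s, t) in enumerate(edges) if t == i]
        out = [f_e[j] for j, (s, t) in enumerate(edges) if s == i]
        if inc:
            agg_in[i] = np.mean(inc, axis=0)
        if out:
            agg_out[i] = np.mean(out, axis=0)
    # Steps 1-3, vertex update: h_v([v_i, agg_in_i, agg_out_i])
    f_v_new = np.concatenate([f_v, agg_in, agg_out], axis=1) @ W_v
    # Step 4, edge update: h_e([e_j, v'_{s_j}, v'_{t_j}])
    f_e_new = np.stack([np.concatenate([f_e[j], f_v_new[s], f_v_new[t]])
                        for j, (s, t) in enumerate(edges)]) @ W_e
    return f_v_new, f_e_new

f_v2, f_e2 = dgn_block(f_v, f_e, edges, W_v, W_e)
```

The loop form mirrors the per-vertex/per-edge description; the matrix form with incidence matrices below does the same computation without explicit loops.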

Given a directed graph G with N vertexes and M edges, the incidence matrix A of G is an N × M matrix whose element A_ij indicates the relationship between the corresponding vertex v_i and edge e_j. In detail, if v_i is the source vertex of e_j, then A_ij = -1. If v_i is the target vertex of e_j, then A_ij = 1. If there is no connection between v_i and e_j, then A_ij = 0. To separate the source vertexes and target vertexes, we use A^s to denote the incidence matrix of source vertexes, which contains only the absolute values of the elements of A that are smaller than 0. Similarly, we define A^t as the incidence matrix of target vertexes, which contains only the elements of A that are greater than 0. For example, Eq. 2 in the paper shows the incidence matrix and its corresponding A^s and A^t for an example graph.
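The incidence matrices can be built like this for a toy graph (the edge list is illustrative, not the paper's skeleton):

```python
import numpy as np

edges = [(0, 1), (1, 2), (1, 3)]  # (source, target) pairs of a toy graph
N, M = 4, len(edges)

A = np.zeros((N, M))
for j, (s, t) in enumerate(edges):
    A[s, j] = -1.0  # v_s is the source of e_j
    A[t, j] = 1.0   # v_t is the target of e_j

A_s = np.abs(np.minimum(A, 0))  # source incidence: |entries of A that are < 0|
A_t = np.maximum(A, 0)          # target incidence: entries of A that are > 0
```

Note that A = A^t - A^s, and each column of A^s (and of A^t) has exactly one nonzero entry, since every directed edge has one source and one target.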

Note that the aggregation function used in this work is the average pooling operation, so the incidence matrix needs to be normalized. In detail, we define Ã = Λ⁻¹A as the normalized version of A (A^s and A^t are normalized in the same way), where Λ is a diagonal matrix with Λ_ii = Σ_j A_ij + ε, and ε is a small number to avoid division by zero. With these modifications, Eq. 1 is transformed into

    f_v' = H_v([f_v, Ã^t f_e, Ã^s f_e])
    f_e' = H_e([f_e, (A^s)ᵀ f_v', (A^t)ᵀ f_v'])

where H denotes a single fully-connected layer, i.e., the updating function in Eq. 1, and f_v and f_e denote the vertex and edge features. Similar to a conventional convolutional layer, we add a BN layer and a ReLU layer after each DGN block.
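Putting the normalization and the matrix form together, a minimal sketch under the assumptions above (row-wise normalization so that multiplication averages over a vertex's edges, random single-layer weights for H_v/H_e, and a toy graph):

```python
import numpy as np

def normalize(A, eps=1e-6):
    # Lambda_ii = sum_j A_ij + eps; dividing each row makes (A @ f_e) an average.
    return A / (A.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (1, 3)]       # toy (source, target) bones
N, M, C = 4, 3, 3
A_s = np.zeros((N, M))
A_t = np.zeros((N, M))
for j, (s, t) in enumerate(edges):
    A_s[s, j] = 1.0
    A_t[t, j] = 1.0

f_v = rng.standard_normal((N, C))      # vertex features
f_e = rng.standard_normal((M, C))      # edge features
H_v = rng.standard_normal((3 * C, C))  # single FC layers (biases omitted)
H_e = rng.standard_normal((3 * C, C))

# Vertex update: incoming edges of v_i are exactly those whose target is v_i.
f_v2 = np.concatenate([f_v, normalize(A_t) @ f_e, normalize(A_s) @ f_e],
                      axis=1) @ H_v
# Edge update: each edge sees its (updated) source and target vertices.
f_e2 = np.concatenate([f_e, A_s.T @ f_v2, A_t.T @ f_v2], axis=1) @ H_e
```

No normalization is needed for the edge update, since each edge has exactly one source and one target vertex.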

 

 

Adaptive DGN block

Note that A here denotes the incidence matrices (A^s and A^t). We directly set them as learnable parameters of the model, but fix them for the first several training epochs.
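A rough sketch of this idea, not the paper's actual implementation (which registers the matrices as network parameters and lets the optimizer update them); the class, freeze schedule, and plain-SGD update are hypothetical:

```python
import numpy as np

class AdaptiveIncidence:
    """Sketch of the adaptive graph: the incidence matrices A_s and A_t are
    treated as model parameters, kept fixed for the first `freeze_epochs`
    epochs and afterwards updated jointly with the other parameters."""
    def __init__(self, A_s, A_t, freeze_epochs=10):
        self.A_s = A_s.astype(float).copy()
        self.A_t = A_t.astype(float).copy()
        self.freeze_epochs = freeze_epochs

    def step(self, grad_s, grad_t, epoch, lr=0.01):
        if epoch < self.freeze_epochs:
            return                   # graph stays the physical skeleton early on
        self.A_s -= lr * grad_s      # afterwards, plain SGD on the graph itself
        self.A_t -= lr * grad_t

adaptive = AdaptiveIncidence(np.ones((4, 3)), np.ones((4, 3)), freeze_epochs=2)
adaptive.step(np.ones((4, 3)), np.ones((4, 3)), epoch=0, lr=0.1)
frozen_value = adaptive.A_s[0, 0]    # unchanged: epoch < freeze_epochs
adaptive.step(np.ones((4, 3)), np.ones((4, 3)), epoch=3, lr=0.1)
updated_value = adaptive.A_s[0, 0]   # changed by the unfrozen update
```

Freezing the graph early keeps training stable while the block weights are still random; once they settle, the graph structure itself adapts to the task.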

 

 

Temporal information modeling

After updating the spatial information of joints and bones in each DGN block, we apply a 1D convolution along the temporal dimension to model the temporal information. Similar to the DGN block, each 1D convolutional layer is followed by a BN layer and a ReLU layer to form a temporal convolutional (TCN) block.
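A simplified temporal block, assuming one shared 1-D filter and a global batch-norm stand-in (the real block learns per-channel filters and uses standard batch normalization):

```python
import numpy as np

def temporal_conv_block(x, kernel, eps=1e-5):
    """x: (T, N, C) per-joint features over time; kernel: (k,) filter applied
    along the temporal axis, shared across joints and channels (a
    simplification of the learned per-channel filters)."""
    T = x.shape[0]
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))      # zero-pad in time
    y = np.stack([np.tensordot(kernel, xp[t:t + k], axes=(0, 0))
                  for t in range(T)])                  # 1-D temporal convolution
    y = (y - y.mean()) / np.sqrt(y.var() + eps)        # BN, simplified to global
    return np.maximum(y, 0.0)                          # ReLU

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 5, 3))                     # T=8 frames, 5 joints
y = temporal_conv_block(x, np.array([0.25, 0.5, 0.25]))
```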

 

 

Directed graph neural network

The overall architecture of the directed graph neural network (DGNN) has 9 units, each containing one DGN block and one TCN block. The output channels of the units are 64, 64, 64, 128, 128, 128, 256, 256 and 256. A global-average-pooling layer followed by a softmax layer is added at the end for class prediction.
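The classification head can be sketched as follows; the frame/joint counts and the class count (60) are placeholder values, not from the text:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(features, W):
    """features: (T, N, C) output of the last of the 9 units;
    W: (C, num_classes) classifier weights (class count hypothetical)."""
    pooled = features.mean(axis=(0, 1))  # global average pooling over time/joints
    return softmax(pooled @ W)

rng = np.random.default_rng(2)
channels = [64, 64, 64, 128, 128, 128, 256, 256, 256]  # unit widths from the text
probs = classify(rng.standard_normal((10, 25, channels[-1])),
                 rng.standard_normal((channels[-1], 60)))
```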

 

 

Two-Stream Framework

Formally, the movement of joint i at time t is calculated as m_{i,t}^v = v_{i,t+1} - v_{i,t}. The deformation of bone j is defined similarly as m_{j,t}^e = e_{j,t+1} - e_{j,t}. Then, the motion graphs are fed into another DGNN to predict the action label. The two networks are finally fused by adding the output scores of their softmax layers.
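A sketch of the motion-stream inputs and the score fusion; the shapes and the zero-padding of the last frame are assumptions made to keep the sequence length unchanged:

```python
import numpy as np

def temporal_diff(x):
    """Motion stream input: differences between consecutive frames along the
    first (temporal) axis; the last frame is zero-padded to keep length T."""
    m = np.zeros_like(x)
    m[:-1] = x[1:] - x[:-1]  # m_t = x_{t+1} - x_t
    return m

rng = np.random.default_rng(3)
joints = rng.standard_normal((8, 5, 3))  # (T, N, 3) joint coordinates
bones = rng.standard_normal((8, 4, 3))   # (T, M, 3) bone vectors
joint_motion = temporal_diff(joints)     # movement of joints
bone_motion = temporal_diff(bones)       # deformation of bones

def fuse(spatial_scores, motion_scores):
    """Late fusion: add the softmax scores of the two DGNN streams."""
    return spatial_scores + motion_scores
```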

 


 

Results

 
