Skeleton-Based Action Recognition with Directed Graph Neural Networks
(2019 CVPR)
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
Notes
Contributions
(1) To the best of our knowledge, this is the first work to represent the skeleton data as a directed acyclic graph to model the dependencies between joints and bones. A novel directed graph neural network is specially designed to extract these dependencies for the final action recognition task.
(2) An adaptively learned graph structure, which is trained and updated jointly with model parameters in the training process, is used to better suit the action recognition task.
(3) The motion information between consecutive frames is extracted for temporal information modeling. Both the spatial and motion information are fed into a two-stream framework for the final recognition task.
Method
Graph Construction
We represent the skeleton data as a directed acyclic graph (DAG) with the joints as vertexes and bones as edges. The direction of each edge is determined by the distance between the vertex and the root vertex, where the vertex closer to the root vertex points to the vertex farther from the root vertex. Here, the root vertex is defined as the center of gravity of the skeleton.
Formally, for each vertex $v$, we define an edge heading to it as an incoming edge $e^-$ and an edge emitting from it as an outgoing edge $e^+$. Similarly, a directed edge $e = (v_s, v_t)$ is a vector from its source vertex $v_s$ to its target vertex $v_t$. We use $E^-_v$ and $E^+_v$ to denote the set of incoming edges and the set of outgoing edges of vertex $v$, respectively. In this way, a skeleton-based frame can be formulated as a directed graph $G = (V, E)$, where $V$ is the set of vertexes (joints) and $E$ is the set of directed edges (bones). A skeleton-based video is a sequence of frames that can be formulated as $S = \{G_1, G_2, \dots, G_T\}$, where $T$ denotes the length of the video.
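The graph construction above can be sketched in a few lines of Python. The joint indices and the parent array below are hypothetical, not from the paper; the only fixed convention is that each bone points away from the root:

```python
# Sketch: build the skeleton DAG for a hypothetical 5-joint skeleton rooted at
# joint 0 (the paper uses the skeleton's center of gravity as the root).
# parent[j] is the joint closer to the root, so each bone points away from it.
parent = {1: 0, 2: 1, 3: 0, 4: 3}  # hypothetical kinematic tree

# Directed edges (bones): source = vertex closer to the root, target = farther.
edges = [(parent[j], j) for j in sorted(parent)]

def incoming(v, edges):
    """E_v^-: the set of edges pointing to vertex v."""
    return [e for e in edges if e[1] == v]

def outgoing(v, edges):
    """E_v^+: the set of edges emitted from vertex v."""
    return [e for e in edges if e[0] == v]
```

Note that the root vertex has no incoming edges, and leaf joints have no outgoing edges, which is why the aggregation functions later must handle empty edge sets.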
Directed graph network block
The directed graph network (DGN) block contains two updating functions, $h^v$ and $h^e$, and two aggregation functions, $g^{e^-}$ and $g^{e^+}$. The updating functions are used to update the attributes of vertexes and edges based on their connected edges and vertexes. The aggregation functions are used to aggregate the attributes contained in the multiple incoming (outgoing) edges connected to one vertex. Formally, this process is formulated as follows:

$$\bar{e}^-_{v_i} = g^{e^-}(E^-_{v_i}), \qquad \bar{e}^+_{v_i} = g^{e^+}(E^+_{v_i}),$$
$$v_i' = h^v([v_i, \bar{e}^-_{v_i}, \bar{e}^+_{v_i}]), \qquad e_j' = h^e([e_j, v'_{s_j}, v'_{t_j}]) \tag{1}$$

Specifically:
1. For each vertex $v_i$, all of the edges that point to it are processed by the incoming aggregation function $g^{e^-}$, which returns the aggregated result $\bar{e}^-_{v_i}$.
2. Similar to step 1, all of the edges that emit from $v_i$ are processed by the outgoing aggregation function $g^{e^+}$, which returns the aggregated result $\bar{e}^+_{v_i}$.
3. $v_i$, $\bar{e}^-_{v_i}$ and $\bar{e}^+_{v_i}$ are concatenated and fed into the vertex-update function $h^v$, which returns $v_i'$ as the updated version of $v_i$.
4. For each edge $e_j$, its source vertex, target vertex and itself are concatenated and processed by the edge-update function $h^e$, which returns $e_j'$, the updated version of $e_j$.
The process can also be summarized as a vertex-update step followed by an edge-update step. Based on extensive experiments, average pooling is chosen as the aggregation function for both the incoming and the outgoing edges, and a single fully connected layer is chosen as each update function in this work.
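A single-frame DGN block can be sketched with NumPy, using average pooling as the aggregation functions and one linear map per update function, as chosen in the paper. The function name, the equal channel counts, and the zero-vector convention for vertices with no incoming or outgoing edges are my assumptions:

```python
import numpy as np

def dgn_block(fv, fe, edges, Wv, We):
    """One DGN block on a single frame (sketch).
    fv: (Nv, C) vertex features; fe: (Ne, C) edge features;
    edges: list of (source, target) vertex indices;
    Wv, We: (3C, C) weights of the vertex/edge update layers
    (channel count kept equal for simplicity)."""
    Nv, C = fv.shape
    agg_in = np.zeros((Nv, C))   # mean over incoming edges (zeros if none)
    agg_out = np.zeros((Nv, C))  # mean over outgoing edges (zeros if none)
    for v in range(Nv):
        inc = [j for j, (s, t) in enumerate(edges) if t == v]
        out = [j for j, (s, t) in enumerate(edges) if s == v]
        if inc:
            agg_in[v] = fe[inc].mean(axis=0)
        if out:
            agg_out[v] = fe[out].mean(axis=0)
    # Vertex update h^v: concatenate [v, incoming agg, outgoing agg].
    fv_new = np.concatenate([fv, agg_in, agg_out], axis=1) @ Wv
    # Edge update h^e: concatenate [e, updated source, updated target].
    src = np.array([s for s, t in edges])
    tgt = np.array([t for s, t in edges])
    fe_new = np.concatenate([fe, fv_new[src], fv_new[tgt]], axis=1) @ We
    return fv_new, fe_new
```

Note the ordering: vertices are updated first, and the edge update then consumes the already-updated source and target vertices, matching the vertex-update-then-edge-update summary above.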
Given a directed graph $G$ with $N_v$ vertexes and $N_e$ edges, the incidence matrix of $G$ is an $N_v \times N_e$ matrix $A$ whose element $A_{ij}$ indicates the relationship between the corresponding vertex $v_i$ and edge $e_j$. In detail, if $v_i$ is the source vertex of $e_j$, then $A_{ij} = -1$. If $v_i$ is the target vertex of $e_j$, then $A_{ij} = 1$. If there is no connection between $v_i$ and $e_j$, then $A_{ij} = 0$. To separate the source vertexes and target vertexes, we use $A^s$ to denote the incidence matrix of source vertexes, which contains only the absolute values of the elements of $A$ that are smaller than 0. Similarly, we define $A^t$ as the incidence matrix of target vertexes, which contains only the elements of $A$ that are greater than 0. For example, Eq. 2 of the paper shows an incidence matrix and its corresponding $A^s$ and $A^t$ for an example graph.
Note that the aggregation function used in this work is the average pooling operation, so the incidence matrix needs to be normalized. In detail, we define $\tilde{A} = \Lambda^{-1} A$ as the normalized version of $A$, where $\Lambda$ is a diagonal matrix with $\Lambda_{ii} = \sum_j A_{ij} + \epsilon$, and $\epsilon$ is a small number to avoid division by zero (the same normalization is applied to $A^s$ and $A^t$). With these modifications, Eq. 1 is transformed into

$$f_v' = H^v\big([f_v,\; \tilde{A}^t f_e,\; \tilde{A}^s f_e]\big), \qquad f_e' = H^e\big([f_e,\; {A^s}^\top f_v',\; {A^t}^\top f_v']\big)$$

where $f_v \in \mathbb{R}^{N_v \times C}$ and $f_e \in \mathbb{R}^{N_e \times C}$ stack the vertex and edge attributes, and $H$ denotes the single-layer fully connected layer, i.e., the update function in Eq. 1. Similar to the conventional convolutional layer, we add a BN layer and a ReLU layer after each DGN block.
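The incidence-matrix bookkeeping can be checked on a toy three-vertex chain. The helper names and the row-wise normalization are my choices, consistent with the definitions above:

```python
import numpy as np

def incidence(num_v, edges):
    """A[i, j] = -1 if v_i is the source of e_j, +1 if the target, else 0."""
    A = np.zeros((num_v, len(edges)))
    for j, (s, t) in enumerate(edges):
        A[s, j] = -1.0
        A[t, j] = 1.0
    return A

def normalize(M, eps=1e-6):
    """Row-normalize so multiplying edge features averages them per vertex."""
    return M / (M.sum(axis=1, keepdims=True) + eps)

edges = [(0, 1), (1, 2)]          # chain: 0 -> 1 -> 2
A = incidence(3, edges)
As = np.abs(np.minimum(A, 0))     # A^s: |elements of A that are < 0|
At = np.maximum(A, 0)             # A^t: elements of A that are > 0

# Average pooling of edge features onto vertices via the normalized matrices:
fe = np.array([[2.0], [4.0]])     # one scalar feature per edge
agg_in = normalize(At) @ fe       # incoming aggregation per vertex
agg_out = normalize(As) @ fe      # outgoing aggregation per vertex
```

Rows of $A^t$ (or $A^s$) that are all zero correspond to vertices with no incoming (outgoing) edges; the $\epsilon$ term keeps their aggregated result at zero instead of dividing by zero.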
Adaptive DGN block
Note that A here denotes the incidence matrix. Rather than fixing the graph structure, we directly set A as a parameter of the model, learned jointly with the other parameters; it is kept fixed for the first several training epochs so that training remains stable.
Temporal information modeling
After updating the spatial information of joints and bones in each DGN block, we apply a 1D convolution along the temporal dimension to model the temporal information. Similar to the DGN block, each 1D convolutional layer is followed by a BN layer and a ReLU layer to form a temporal convolutional block (TCN).
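The temporal filtering a TCN block performs on one joint's feature sequence can be sketched as follows. Real blocks use learned multi-channel kernels plus BN and ReLU; the single shared 1D kernel, 'same' padding, and function name here are my assumptions:

```python
import numpy as np

def temporal_conv(x, kernel):
    """x: (T, C) features of one joint over time; kernel: (K,) shared weights.
    Zero 'same' padding along the temporal axis, applied channel-wise."""
    K = len(kernel)
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T, C = x.shape
    out = np.zeros((T, C))
    for t in range(T):
        # Weighted sum over the K-frame window centred at frame t.
        out[t] = kernel @ xp[t:t + K]
    return out
```

Because the DGN block mixes information spatially and the TCN block mixes it temporally, alternating the two lets features propagate across both joints and frames as the network deepens.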
Directed graph neural network
The overall architecture of the directed graph neural network (DGNN) has 9 units, each containing one DGN block and one TCN block. The output channels of the units are 64, 64, 64, 128, 128, 128, 256, 256 and 256. A global-average-pooling layer followed by a softmax layer is added at the end for class prediction.
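The classification head at the end of the stack can be sketched as global average pooling over time and vertices, then a linear layer and softmax. The function name, feature layout, and weight shape are my assumptions:

```python
import numpy as np

def classify(features, W):
    """features: (T, Nv, C) per-frame vertex features from the last unit;
    W: (C, num_classes) weights of the final linear layer."""
    pooled = features.mean(axis=(0, 1))   # global average pool -> (C,)
    logits = pooled @ W                   # (num_classes,)
    z = np.exp(logits - logits.max())     # numerically stable softmax
    return z / z.sum()
```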
Two-Stream Framework
Formally, the movement of joint $v$ at time $t$ is calculated as the coordinate difference of the same joint in two consecutive frames, $m^t_v = v^{t+1} - v^t$. The deformation of bones is defined similarly as $m^t_e = e^{t+1} - e^t$.
Then, the motion graphs are fed into another DGNN to predict the action label. The two networks are finally fused by adding the output scores of their softmax layers.
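The motion stream's input and the late fusion can be sketched as below; the zero padding of the last frame (so the motion sequence keeps length $T$) and the function names are my assumptions:

```python
import numpy as np

def motion(x):
    """Temporal difference m_t = x_{t+1} - x_t for joint or bone features
    x of shape (T, C); the last frame is zero-padded to keep length T."""
    m = np.zeros_like(x)
    m[:-1] = x[1:] - x[:-1]
    return m

def fuse(scores_spatial, scores_motion):
    """Late fusion of the two streams: add their softmax scores."""
    return scores_spatial + scores_motion
```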
Results