Skeleton-Based Action Recognition with Directed Graph Neural Networks
(2019 CVPR)
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
Notes
Contributions
(1) To the best of our knowledge, this is the first work to represent the skeleton data as a directed acyclic graph to model the dependencies between joints and bones. A novel directed graph neural network is specially designed to extract these dependencies for the final action recognition task.
(2) An adaptively learned graph structure, which is trained and updated jointly with model parameters in the training process, is used to better suit the action recognition task.
(3) The motion information between consecutive frames is extracted for temporal information modeling. Both the spatial and motion information are fed into a two-stream framework for the final recognition task.
Method
Graph Construction
We represent the skeleton data as a directed acyclic graph (DAG) with the joints as vertexes and bones as edges. The direction of each edge is determined by the distance between the vertex and the root vertex, where the vertex closer to the root vertex points to the vertex farther from the root vertex. Here, the root vertex is defined as the center of gravity of the skeleton.
Formally, for each vertex $v$, we define an edge heading to it as an incoming edge $e^-$ and an edge emitting from it as an outgoing edge $e^+$. Similarly, a directed edge $e = (v_s, v_t)$ is a vector from its source vertex $v_s$ to its target vertex $v_t$. We use $E^-_v$ and $E^+_v$ to denote the set of incoming edges and the set of outgoing edges of vertex $v$, respectively. In this way, a skeleton-based frame can be formulated as a directed graph $G = (V, E)$, where $V$ is the set of vertexes (joints) and $E$ is the set of directed edges (bones). A skeleton-based video is a sequence of frames that can be formulated as $S = \{G_1, G_2, \dots, G_T\}$, where $T$ denotes the length of the video.
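The graph construction above can be sketched in a few lines of Python. The joint indices and the parent array below are hypothetical, not from the paper; the only fixed convention is that each bone points away from the root:

```python
# Sketch: build the skeleton DAG for a hypothetical 5-joint skeleton rooted at
# joint 0 (the paper uses the skeleton's center of gravity as the root).
# parent[j] is the joint closer to the root, so each bone points away from it.
parent = {1: 0, 2: 1, 3: 0, 4: 3}  # hypothetical kinematic tree

# Directed edges (bones): source = vertex closer to the root, target = farther.
edges = [(parent[j], j) for j in sorted(parent)]

def incoming(v, edges):
    """E_v^-: the set of edges pointing to vertex v."""
    return [e for e in edges if e[1] == v]

def outgoing(v, edges):
    """E_v^+: the set of edges emitted from vertex v."""
    return [e for e in edges if e[0] == v]
```

Note that the root vertex has no incoming edges, and leaf joints have no outgoing edges, which is why the aggregation functions later must handle empty edge sets.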
Directed graph network block
The directed graph network (DGN) block contains two updating functions, $h^v$ and $h^e$, and two aggregation functions, $g^{e^-}$ and $g^{e^+}$. The updating functions are used to update the attributes of vertexes and edges based on their connected edges and vertexes. The aggregation functions are used to aggregate the attributes contained in the multiple incoming (outgoing) edges connected to one vertex. Formally, this process is formulated as follows:

$$\bar{e}^-_{v_i} = g^{e^-}(E^-_{v_i}), \qquad \bar{e}^+_{v_i} = g^{e^+}(E^+_{v_i}),$$
$$v_i' = h^v([v_i, \bar{e}^-_{v_i}, \bar{e}^+_{v_i}]), \qquad e_j' = h^e([e_j, v'_{s_j}, v'_{t_j}]) \tag{1}$$

Specifically:
1. For each vertex $v_i$, all of the edges that point to it are processed by the incoming aggregation function $g^{e^-}$, which returns the aggregated result $\bar{e}^-_{v_i}$.
2. Similar to step 1, all of the edges that emit from $v_i$ are processed by the outgoing aggregation function $g^{e^+}$, which returns the aggregated result $\bar{e}^+_{v_i}$.
3. $v_i$, $\bar{e}^-_{v_i}$ and $\bar{e}^+_{v_i}$ are concatenated and fed into the vertex-update function $h^v$, which returns $v_i'$ as the updated version of $v_i$.
4. For each edge $e_j$, its source vertex, target vertex and itself are concatenated and processed by the edge-update function $h^e$, which returns $e_j'$, the updated version of $e_j$.
The process can also be summarized as a vertex-update step followed by an edge-update step. Based on extensive experiments, average pooling is chosen as the aggregation function for both the incoming and the outgoing edges, and a single fully connected layer is chosen as each update function in this work.
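A single-frame DGN block can be sketched with NumPy, using average pooling as the aggregation functions and one linear map per update function, as chosen in the paper. The function name, the equal channel counts, and the zero-vector convention for vertices with no incoming or outgoing edges are my assumptions:

```python
import numpy as np

def dgn_block(fv, fe, edges, Wv, We):
    """One DGN block on a single frame (sketch).
    fv: (Nv, C) vertex features; fe: (Ne, C) edge features;
    edges: list of (source, target) vertex indices;
    Wv, We: (3C, C) weights of the vertex/edge update layers
    (channel count kept equal for simplicity)."""
    Nv, C = fv.shape
    agg_in = np.zeros((Nv, C))   # mean over incoming edges (zeros if none)
    agg_out = np.zeros((Nv, C))  # mean over outgoing edges (zeros if none)
    for v in range(Nv):
        inc = [j for j, (s, t) in enumerate(edges) if t == v]
        out = [j for j, (s, t) in enumerate(edges) if s == v]
        if inc:
            agg_in[v] = fe[inc].mean(axis=0)
        if out:
            agg_out[v] = fe[out].mean(axis=0)
    # Vertex update h^v: concatenate [v, incoming agg, outgoing agg].
    fv_new = np.concatenate([fv, agg_in, agg_out], axis=1) @ Wv
    # Edge update h^e: concatenate [e, updated source, updated target].
    src = np.array([s for s, t in edges])
    tgt = np.array([t for s, t in edges])
    fe_new = np.concatenate([fe, fv_new[src], fv_new[tgt]], axis=1) @ We
    return fv_new, fe_new
```

Note the ordering: vertices are updated first, and the edge update then consumes the already-updated source and target vertices, matching the vertex-update-then-edge-update summary above.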
Given a directed graph $G$ with $N_v$ vertexes and $N_e$ edges, the incidence matrix of $G$ is an $N_v \times N_e$ matrix $A$ whose element $A_{ij}$ indicates the relationship between the corresponding vertex $v_i$ and edge $e_j$. In detail, if $v_i$ is the source vertex of $e_j$, then $A_{ij} = -1$. If $v_i$ is the target vertex of $e_j$, then $A_{ij} = 1$. If there is no connection between $v_i$ and $e_j$, then $A_{ij} = 0$. To separate the source vertexes and target vertexes, we use $A^s$ to denote the incidence matrix of source vertexes, which contains only the absolute values of the elements of $A$ that are smaller than 0. Similarly, we define $A^t$ as the incidence matrix of target vertexes, which contains only the elements of $A$ that are greater than 0. For example, Eq. 2 of the paper shows an incidence matrix and its corresponding $A^s$ and $A^t$ for an example graph.
Note that the aggregation function used in this work is the average pooling operation, so the incidence matrix needs to be normalized. In detail, we define $\tilde{A} = \Lambda^{-1} A$ as the normalized version of $A$, where $\Lambda$ is a diagonal matrix with $\Lambda_{ii} = \sum_j A_{ij} + \epsilon$, and $\epsilon$ is a small number to avoid division by zero (the same normalization is applied to $A^s$ and $A^t$). With these modifications, Eq. 1 is transformed into

$$f_v' = H^v\big([f_v,\; \tilde{A}^t f_e,\; \tilde{A}^s f_e]\big), \qquad f_e' = H^e\big([f_e,\; {A^s}^\top f_v',\; {A^t}^\top f_v']\big)$$

where $f_v \in \mathbb{R}^{N_v \times C}$ and $f_e \in \mathbb{R}^{N_e \times C}$ stack the vertex and edge attributes, and $H$ denotes the single-layer fully connected layer, i.e., the update function in Eq. 1. Similar to the conventional convolutional layer, we add a BN layer and a ReLU layer after each DGN block.
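The incidence-matrix bookkeeping can be checked on a toy three-vertex chain. The helper names and the row-wise normalization are my choices, consistent with the definitions above:

```python
import numpy as np

def incidence(num_v, edges):
    """A[i, j] = -1 if v_i is the source of e_j, +1 if the target, else 0."""
    A = np.zeros((num_v, len(edges)))
    for j, (s, t) in enumerate(edges):
        A[s, j] = -1.0
        A[t, j] = 1.0
    return A

def normalize(M, eps=1e-6):
    """Row-normalize so multiplying edge features averages them per vertex."""
    return M / (M.sum(axis=1, keepdims=True) + eps)

edges = [(0, 1), (1, 2)]          # chain: 0 -> 1 -> 2
A = incidence(3, edges)
As = np.abs(np.minimum(A, 0))     # A^s: |elements of A that are < 0|
At = np.maximum(A, 0)             # A^t: elements of A that are > 0

# Average pooling of edge features onto vertices via the normalized matrices:
fe = np.array([[2.0], [4.0]])     # one scalar feature per edge
agg_in = normalize(At) @ fe       # incoming aggregation per vertex
agg_out = normalize(As) @ fe      # outgoing aggregation per vertex
```

Rows of $A^t$ (or $A^s$) that are all zero correspond to vertices with no incoming (outgoing) edges; the $\epsilon$ term keeps their aggregated result at zero instead of dividing by zero.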
Adaptive DGN block
Note that A here denotes the incidence matrix. Rather than fixing the graph structure, we directly set A as a parameter of the model, learned jointly with the other parameters; it is kept fixed for the first several training epochs so that training remains stable.
Temporal information modeling
After updating the spatial information of joints and bones in each DGN block, we apply a 1D convolution along the temporal dimension to model the temporal information. Similar to the DGN block, each 1D convolutional layer is followed by a BN layer and a ReLU layer to form a temporal convolutional block (TCN).
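The temporal filtering a TCN block performs on one joint's feature sequence can be sketched as follows. Real blocks use learned multi-channel kernels plus BN and ReLU; the single shared 1D kernel, 'same' padding, and function name here are my assumptions:

```python
import numpy as np

def temporal_conv(x, kernel):
    """x: (T, C) features of one joint over time; kernel: (K,) shared weights.
    Zero 'same' padding along the temporal axis, applied channel-wise."""
    K = len(kernel)
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T, C = x.shape
    out = np.zeros((T, C))
    for t in range(T):
        # Weighted sum over the K-frame window centred at frame t.
        out[t] = kernel @ xp[t:t + K]
    return out
```

Because the DGN block mixes information spatially and the TCN block mixes it temporally, alternating the two lets features propagate across both joints and frames as the network deepens.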
Directed graph neural network
The overall architecture of the directed graph neural network (DGNN) has 9 units, each containing one DGN block and one TCN block. The output channels of the units are 64, 64, 64, 128, 128, 128, 256, 256 and 256. A global-average-pooling layer followed by a softmax layer is added at the end for class prediction.
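The classification head at the end of the stack can be sketched as global average pooling over time and vertices, then a linear layer and softmax. The function name, feature layout, and weight shape are my assumptions:

```python
import numpy as np

def classify(features, W):
    """features: (T, Nv, C) per-frame vertex features from the last unit;
    W: (C, num_classes) weights of the final linear layer."""
    pooled = features.mean(axis=(0, 1))   # global average pool -> (C,)
    logits = pooled @ W                   # (num_classes,)
    z = np.exp(logits - logits.max())     # numerically stable softmax
    return z / z.sum()
```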
Two-Stream Framework
Formally, the movement of joint $v$ at time $t$ is calculated as the coordinate difference of the same joint in two consecutive frames, $m^t_v = v^{t+1} - v^t$. The deformation of bones is defined similarly as $m^t_e = e^{t+1} - e^t$.
Then, the motion graphs are fed into another DGNN to predict the action label. The two networks are finally fused by adding the output scores of their softmax layers.
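The motion stream's input and the late fusion can be sketched as below; the zero padding of the last frame (so the motion sequence keeps length $T$) and the function names are my assumptions:

```python
import numpy as np

def motion(x):
    """Temporal difference m_t = x_{t+1} - x_t for joint or bone features
    x of shape (T, C); the last frame is zero-padded to keep length T."""
    m = np.zeros_like(x)
    m[:-1] = x[1:] - x[:-1]
    return m

def fuse(scores_spatial, scores_motion):
    """Late fusion of the two streams: add their softmax scores."""
    return scores_spatial + scores_motion
```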
Results