Preface
A while ago I published a code walkthrough of this paper: 解读 2s-AGCN 代码_小吴同学真棒的博客-CSDN博客_2s-agcn代码
I was surprised to find that quite a few people have read and bookmarked it, so while revisiting the paper today I am adding these notes on the method.
Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition
(2019 CVPR)
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
Notes
Contributions
(1) An adaptive graph convolutional network is proposed to adaptively learn the topology of the graph for different GCN layers and skeleton samples in an end-to-end manner, which can better suit the action recognition task and the hierarchical structure of the GCNs.
(2) The second-order information of the skeleton data (the lengths and directions of bones) is explicitly formulated and combined with the first-order information (2D or 3D coordinates of the joints) using a two-stream framework, which brings notable improvement for the recognition performance.
(3) On two large-scale datasets for skeleton-based action recognition, the proposed 2s-AGCN exceeds the state-of-the-art by a significant margin.
Method
1、Preliminaries
The implementation of the graph convolution in the spatial dimension is not straightforward. Concretely, the feature map of the network is actually a C × T × N tensor, where N denotes the number of vertexes, T denotes the temporal length and C denotes the number of channels. To implement the ST-GCN, the spatial graph convolution is computed as

fout = Σk^Kv Wk fin (Ak ⊙ Mk),   (Eq. 2)

in which Ak is the normalized adjacency matrix of the k-th partition subset, Mk is a learnable attention mask, and Wk is the weighting function,
where Kv denotes the kernel size of the spatial dimension. With the partition strategy designed above, Kv is set to 3.
The preliminaries are mainly ST-GCN material; for details, see the ST-GCN walkthrough blog.
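The ST-GCN spatial graph convolution above can be sketched in PyTorch. This is a minimal sketch, not the authors' released code; the (batch, C, T, N) tensor layout, the 1 × 1-conv form of Wk, and the element-wise mask Mk follow common ST-GCN implementations:

```python
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    """Sketch of the ST-GCN spatial graph convolution:
    f_out = sum_k W_k f_in (A_k * M_k), with Kv = 3 partition subsets."""
    def __init__(self, in_ch, out_ch, A):          # A: (Kv, N, N) normalized adjacency
        super().__init__()
        self.register_buffer('A', A)
        self.M = nn.Parameter(torch.ones_like(A))  # learnable edge-importance mask M_k
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 1) for _ in range(A.size(0)))  # W_k as 1x1 conv

    def forward(self, x):                          # x: (batch, C, T, N)
        out = 0
        for k in range(self.A.size(0)):
            # aggregate over the vertex dimension with the masked adjacency
            xa = torch.einsum('bctv,vw->bctw', x, self.A[k] * self.M[k])
            out = out + self.convs[k](xa)
        return out

Kv, N = 3, 25
layer = STGraphConv(3, 64, torch.rand(Kv, N, N))
y = layer(torch.randn(2, 3, 30, N))               # (batch=2, C=64, T=30, N=25)
```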
2、Adaptive graph convolutional layer
We propose an adaptive graph convolutional layer, which allows the topology of the graph to be optimized together with the other parameters of the network in an end-to-end manner. The graph is unique for different layers and samples, which greatly increases the flexibility of the model. Meanwhile, it is designed as a residual branch, which guarantees the stability of the original model. In detail, according to Eq. 2, the topology of the graph is decided by the adjacency matrix and the mask, i.e., Ak and Mk, respectively: Ak determines whether there are connections between two vertexes, and Mk determines the strength of the connections. To make the graph structure adaptive, we change Eq. 2 into the following form:

fout = Σk^Kv Wk fin (Ak + Bk + Ck)   (Eq. 3)
The main difference lies in the adjacency matrix of the graph, which is divided into three parts: Ak, Bk and Ck.
The first part (Ak) is the same as the original normalized N×N adjacency matrix in Eq. 2. It represents the physical structure of the human body.
The second part (Bk) is also an N×N adjacency matrix. In contrast to Ak, the elements of Bk are parameterized and optimized together with the other parameters in the training process. There are no constraints on the values of Bk, which means that the graph is completely learned from the training data. Note that an element of the matrix can take an arbitrary value, indicating not only the existence of a connection between two joints but also its strength. It can play the same role as the attention mechanism performed by Mk in Eq. 2. However, the original attention matrix Mk is dot-multiplied with Ak, which means that if an element of Ak is 0, the product will always be 0 irrespective of the value of Mk. Thus, it cannot generate new connections that do not exist in the original physical graph. From this perspective, Bk is more flexible than Mk. The values of Bk and the parameters of θ and φ are initialized to 0.
The third part (Ck) is a data-dependent graph which learns a unique graph for each sample. To determine whether there is a connection between two vertexes and how strong the connection is, we apply the normalized embedded Gaussian function to calculate the similarity of the two vertexes:

f(vi, vj) = exp(θ(vi)ᵀ φ(vj)) / Σj=1..N exp(θ(vi)ᵀ φ(vj))   (Eq. 4)
where N is the total number of the vertexes. We use the dot product to measure the similarity of the two vertexes in an embedding space. In detail, given the input feature map fin whose size is Cin×T×N, we first embed it into Ce×T×N with two embedding functions, i.e., θ and φ. Here, through extensive experiments, we choose a single 1 × 1 convolutional layer as the embedding function. The two embedded feature maps are rearranged and reshaped to an N×CeT matrix and a CeT×N matrix. They are then multiplied to obtain an N×N similarity matrix (Ck), whose element represents the similarity of vertex vi and vertex vj. The values of the matrix are normalized to the range 0–1 and used as the soft edge weights between the two vertexes. Since the normalized Gaussian is equipped with a softmax operation, we can calculate Ck based on Eq. 4 as follows:

Ck = softmax(finᵀ Wθkᵀ Wφk fin)   (Eq. 5)
where Wθ and Wφ are the parameters of the embedding functions θ and φ, respectively.
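The computation of Ck can be sketched as follows. This is a minimal sketch under the assumptions stated in the text (1 × 1 convolutions as θ and φ, softmax normalization over the second vertex index); the function name and embedding width are my own:

```python
import torch
import torch.nn as nn

def data_dependent_graph(x, theta, phi):
    """Sketch of Eq. 5: Ck = softmax(theta(f_in)^T phi(f_in)).
    x: (batch, C_in, T, N); theta/phi: 1x1 conv embeddings to Ce channels."""
    b, _, t, n = x.shape
    q = theta(x).permute(0, 3, 1, 2).reshape(b, n, -1)   # (b, N, Ce*T)
    v = phi(x).reshape(b, -1, n)                         # (b, Ce*T, N)
    sim = torch.bmm(q, v)                                # (b, N, N) dot-product similarity
    return torch.softmax(sim, dim=-1)                    # rows normalized to sum to 1

theta = nn.Conv2d(3, 16, 1)                              # Ce = 16 chosen for illustration
phi = nn.Conv2d(3, 16, 1)
C = data_dependent_graph(torch.randn(2, 3, 30, 25), theta, phi)
```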
The overall architecture of the adaptive graph convolution layer is shown in Fig. 2. In addition to the Ak, Bk and Ck introduced above, the kernel size of the convolution (Kv) is set the same as before, i.e., 3. wk is the weighting function introduced in Eq. 1, whose parameter is Wk in Eq. 3. A residual connection, similar to [10], is added for each layer. If the number of input channels differs from the number of output channels, a 1 × 1 convolution (orange box with dashed line in Fig. 2) is inserted in the residual path to transform the input to match the output in the channel dimension.
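Putting the three sub-graphs together, the adaptive layer can be sketched like this. Again a sketch, not the released implementation; per-subset θk/φk embeddings, the embedding width, and the residual handling are assumptions based on the description above:

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Sketch of the adaptive layer: f_out = sum_k W_k f_in (A_k + B_k + C_k).
    A_k: fixed physical graph; B_k: free learnable matrix (initialized to 0);
    C_k: data-dependent graph from embedded-Gaussian similarity."""
    def __init__(self, in_ch, out_ch, A, embed_ch=16):
        super().__init__()
        Kv = A.size(0)
        self.register_buffer('A', A)
        self.B = nn.Parameter(torch.zeros_like(A))          # B_k initialized to 0
        self.theta = nn.ModuleList(nn.Conv2d(in_ch, embed_ch, 1) for _ in range(Kv))
        self.phi = nn.ModuleList(nn.Conv2d(in_ch, embed_ch, 1) for _ in range(Kv))
        self.W = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 1) for _ in range(Kv))
        self.res = (nn.Identity() if in_ch == out_ch
                    else nn.Conv2d(in_ch, out_ch, 1))       # 1x1 conv on the residual path

    def forward(self, x):                                   # x: (batch, C, T, N)
        b, _, t, n = x.shape
        out = 0
        for k in range(self.A.size(0)):
            q = self.theta[k](x).permute(0, 3, 1, 2).reshape(b, n, -1)
            v = self.phi[k](x).reshape(b, -1, n)
            Ck = torch.softmax(torch.bmm(q, v), dim=-1)     # (b, N, N), per-sample graph
            G = self.A[k] + self.B[k] + Ck                  # adaptive adjacency
            out = out + self.W[k](torch.einsum('bctv,bvw->bctw', x, G))
        return out + self.res(x)

layer = AdaptiveGraphConv(3, 64, torch.rand(3, 25, 25))
y = layer(torch.randn(2, 3, 30, 25))
```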
3、Adaptive graph convolutional block
The convolution for the temporal dimension is the same as in ST-GCN, i.e., performing a Kt × 1 convolution on the C×T×N feature maps. Both the spatial GCN and the temporal GCN are followed by a batch normalization (BN) layer and a ReLU layer. As shown in Fig. 3, one basic block is the combination of one spatial GCN (Convs), one temporal GCN (Convt) and an additional dropout layer with the drop rate set to 0.5. To stabilize the training, a residual connection is added for each block.
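The basic block can be sketched as below. To keep the sketch self-contained, the adaptive spatial GCN is stood in by a plain 1 × 1 convolution; Kt = 9 is an assumed temporal kernel size, not stated in this excerpt:

```python
import torch
import torch.nn as nn

class AGCNBlock(nn.Module):
    """Sketch of one basic block: spatial GCN -> BN -> ReLU ->
    temporal Kt x 1 conv -> BN -> ReLU -> dropout, plus a block residual."""
    def __init__(self, in_ch, out_ch, Kt=9, drop=0.5):
        super().__init__()
        self.convs = nn.Sequential(                 # stand-in for the adaptive spatial GCN
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU())
        self.convt = nn.Sequential(                 # temporal conv over the T dimension
            nn.Conv2d(out_ch, out_ch, (Kt, 1), padding=(Kt // 2, 0)),
            nn.BatchNorm2d(out_ch), nn.ReLU())
        self.drop = nn.Dropout(drop)
        self.res = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):                           # x: (batch, C, T, N)
        return self.drop(self.convt(self.convs(x))) + self.res(x)

block = AGCNBlock(64, 128)
y = block(torch.randn(2, 64, 30, 25))
```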
4、Adaptive graph convolutional network
The adaptive graph convolutional network (AGCN) is the stack of these basic blocks, as shown in Fig. 4. There are a total of 9 blocks. The numbers of output channels for each block are 64, 64, 64, 128, 128, 128, 256, 256 and 256. A data BN layer is added at the beginning to normalize the input data. A global average pooling layer is performed at the end to pool feature maps of different samples to the same size. The final output is sent to a softmax classifier to obtain the prediction.
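The overall flow of one stream can be sketched as follows. The channel plan and the pooling/classifier head follow the description above; the blocks are stood in by 1 × 1 convolutions so the sketch stays self-contained, and the class count of 60 (NTU-RGBD cross-subject) is an assumption:

```python
import torch
import torch.nn as nn

# Channel plan of the 9 blocks (in_ch, out_ch); real blocks are AGCN blocks,
# stood in here by 1x1 convolutions.
channels = [(3, 64), (64, 64), (64, 64), (64, 128), (128, 128),
            (128, 128), (128, 256), (256, 256), (256, 256)]
blocks = nn.Sequential(*[nn.Conv2d(i, o, 1) for i, o in channels])

x = torch.randn(2, 3, 30, 25)          # (batch, C, T, N) after the data BN layer
feat = blocks(x)                       # (2, 256, 30, 25)
pooled = feat.mean(dim=(2, 3))         # global average pooling over T and N
logits = nn.Linear(256, 60)(pooled)    # classifier head (60 classes assumed)
probs = torch.softmax(logits, dim=-1)  # softmax prediction
```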
5、Two-stream networks
In this paper, we propose explicitly modeling the second-order information, namely, the bone information, with a two-stream framework to enhance the recognition. In particular, since each bone is bound with two joints, we define that the joint close to the center of gravity of the skeleton is the source joint and the joint far away from the center of gravity is the target joint. Each bone is represented as a vector pointing to its target joint from its source joint. For example, given a bone with its source joint v1 = (x1, y1, z1) and its target joint v2 = (x2, y2, z2), the vector of the bone is calculated as ev1,v2 = (x2 − x1, y2 − y1, z2 − z1).
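The bone-vector computation can be sketched with a parent (source-joint) list. The 5-joint chain below is purely illustrative, not the NTU skeleton; the central joint is its own parent, which yields the zero "empty bone" described next:

```python
import torch

# Hypothetical parent list: parents[j] is the source joint of the bone whose
# target is joint j. Joint 0 is the central joint (its own parent).
parents = torch.tensor([0, 0, 1, 2, 3])

def joints_to_bones(joints, parents):
    """joints: (..., N, 3) coordinates -> bone vectors target - source."""
    return joints - joints[..., parents, :]

joints = torch.randn(2, 30, 5, 3)      # (batch, T, N, xyz)
bones = joints_to_bones(joints, parents)
```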
Since the graph of the skeleton data has no cycles, each bone can be assigned a unique target joint. The number of joints is one more than the number of bones because the central joint is not assigned to any bone. To simplify the design of the network, we add an empty bone with value 0 at the central joint. In this way, both the graph and the network of bones can be designed the same as those of the joints, because each bone is bound to a unique joint. We use J-stream and B-stream to denote the networks of joints and bones, respectively. The overall architecture (2s-AGCN) is shown in Fig. 5. Given a sample, we first calculate the bone data from the joint data. Then, the joint data and bone data are fed into the J-stream and B-stream, respectively. Finally, the softmax scores of the two streams are added to obtain the fused score and predict the action label.
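The late fusion at the end is simply a sum of the two softmax scores. A minimal sketch; `j_logits` and `b_logits` are placeholder names for the J-stream and B-stream classifier outputs:

```python
import torch

# Stand-in classifier outputs of the two streams (batch=2, 60 classes).
j_logits = torch.randn(2, 60)
b_logits = torch.randn(2, 60)

# Fused score = sum of the per-stream softmax scores; argmax gives the label.
fused = torch.softmax(j_logits, -1) + torch.softmax(b_logits, -1)
pred = fused.argmax(dim=-1)
```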
Results
The left matrix is the original adjacency matrix for the second subset in the NTU-RGBD dataset. The right matrix is an example of the corresponding adaptive adjacency matrix learned by our model.
I originally wondered: if the left matrix is the original adjacency matrix, why does it contain gray entries? Is that the effect of the mask?
On reflection, the left matrix should be Ak ⊙ Mk, and the right one Ak + Bk + Ck.
Fig. 8 is a visualization of the skeleton graph for different layers of one sample (from left to right is the 3rd, 5th and 7th layers in Fig. 4, respectively).
The size of each circle represents the strength of the connection between that joint and the 25th joint in the learned adaptive graph of our model.
It shows that a traditional physical connection of the human body is not the best choice for the action recognition task, and different layers need graphs with different topology structures.
The skeleton graph in the 3rd layer pays more attention to the adjacent joints in the physical graph. This result is intuitive since the lower layers contain only low-level features, and global information cannot yet be observed.
For the 5th layer, more joints along the same arm are strongly connected.
For the 7th layer, the left hand and the right hand show a stronger connection, although they are far away from each other in the physical structure of the human body.
We argue that a higher layer contains higher-level information. Hence, the graph is more relevant to the final classification task.
The learned adjacency matrix is extracted from the second subset of the 5th layer in the model (Fig. 4). It shows that the graph structures learned by our model differ across samples, even for the same convolutional subset and the same layer. This verifies our view that different samples need different graph topologies, and that a data-driven graph structure is better than a fixed one.