CVPR 2019 Action Recognition Papers

HotCake△

Article link: https://blog.csdn.net/qq_36589469/article/details/91556370

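This post collects CVPR 2019 research on action recognition, including the Action Transformer Network for capturing human actions, Timeception's temporal modeling for complex action recognition, and the Bayesian Hierarchical Dynamic Model's quantification of uncertainty in human action recognition. It also covers collaborative spatiotemporal feature learning, motion-augmented RGB streams, the Pose-Action 3D Machine, Representation Flow, Long Short-Term Attention, single-timestamp supervision, and domain-adaptive LSTMs as applied to video action recognition.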

CVPR 2019 papers related to action recognition

Video Action Transformer Network
Abstract:
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action – all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
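To make the idea of a person-specific query attending over the surrounding context more concrete, here is a minimal sketch. It assumes standard scaled dot-product attention as used in Transformer-style models; the abstract does not spell out the exact formulation. The function names and the toy vectors are hypothetical and stand in for learned features.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query  : feature of the person being classified (hypothetical)
    keys   : one vector per spatiotemporal context location
    values : one vector per location, aggregated by attention weight
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return context, weights

# Toy data (assumed, not from the paper): three context locations.
person_query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

context, weights = attend(person_query, keys, values)
print(weights)  # the first location, most similar to the query, gets the largest weight
```

In this sketch the attention weights play the role the abstract describes: locations that match the person query contribute most to the aggregated feature.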
Timeception for Complex Action Recognition
Abstract:
This paper focuses on the temporal aspect for recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to “Complex Action”: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with fixed kernel size, too rigid to capture the varieties in temporal extents of complex actions, and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates temporal extents of complex actions.
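The contrast between a fixed kernel size and multi-scale temporal convolutions can be illustrated with a small sketch. It applies several temporal kernels of different lengths to the same per-channel signal. The kernel sizes (3, 5, 7) and the averaging weights are illustrative assumptions standing in for learned parameters, and the function names are hypothetical; this is not the Timeception layer itself.

```python
def temporal_conv(signal, kernel):
    """Zero-padded ("same") 1D convolution along time for one channel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(signal))
    ]

def multi_scale_temporal(signal, kernel_sizes=(3, 5, 7)):
    """Run one temporal kernel per scale; return outputs keyed by kernel size."""
    outputs = {}
    for size in kernel_sizes:
        kernel = [1.0 / size] * size  # placeholder for learned weights
        outputs[size] = temporal_conv(signal, kernel)
    return outputs

# A single event at timestep 3 of a 7-step sequence (toy data).
signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
branches = multi_scale_temporal(signal)

for size, out in branches.items():
    reach = sum(1 for v in out if v > 0.0)
    print(size, reach)  # longer kernels respond over a wider temporal extent
```

Each branch sees the same event over a different temporal extent, which is the property that lets a multi-scale layer cope with actions of varying duration.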
Bayesian Hierarchical Dynamic Model for Human Action Recognition
Abstract:
Human action recognition remains a challenging task, partially due to the presence of large variations in the execution of an action. To address this issue, we propose a probabilistic model called Hierarchical Dynamic Model (HDM). Leveraging a Bayesian framework, the model parameters are allowed to vary across different sequences of data, which increases the capacity of the model to adapt to intra-class variations on both spatial and temporal extent of actions. Meanwhile, the generative learning process allows the model to preserve the distinctive dynamic pattern for each action class. Through Bayesian inference, we are able to quantify the uncertainty of the classification, providing insight during the decision process. Compared to state-of-the-art methods, our method not only achieves competitive recognition performance within individual datasets but also shows better generalization capability across different datasets. Experiments conducted on data with missing values also show the robustness of the proposed method. (MSRA, UTD, G3D)
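As a rough illustration of what quantifying classification uncertainty through Bayesian inference can look like, the sketch below averages class probabilities over several parameter samples and measures the entropy of the result. This is one common generic recipe, assumed here for illustration; it is not claimed to be the HDM procedure from the paper. All names and numbers are hypothetical.

```python
import math

def predictive_distribution(sampled_probs):
    """Average class probabilities over parameter samples."""
    n = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    return [sum(p[c] for p in sampled_probs) / n for c in range(num_classes)]

def entropy(probs):
    """Shannon entropy in nats; higher means a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy class probabilities from three hypothetical parameter samples.
agreeing = [[0.9, 0.1], [0.95, 0.05], [0.85, 0.15]]
disagreeing = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

h_agree = entropy(predictive_distribution(agreeing))
h_disagree = entropy(predictive_distribution(disagreeing))
print(h_agree < h_disagree)  # samples that disagree yield higher uncertainty
```

When the sampled models agree, the averaged distribution stays peaked and the entropy is low; when they disagree, the entropy approaches its maximum, flagging the decision as unreliable.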
Collaborative Spatiotemporal Feature Learning for Video Action Recognition
Abstract:
Spatiotemporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D). In this paper, we propose a novel neural operation which encodes spatiotemporal features collaboratively by imposing a weight-sharing constraint on the learnable parameters. In particular, we perform 2D convolution along three orthogonal views of volumetric video data, which learns spatial appearance and temporal motion cues respectively. By sharing the convolution kernels of different views, spatial and temporal features are collaboratively learned and thus benefit from each other. The complementary features are subsequently fused by a weighted summation whose coefficients are learned end-to-end. Our approach achieves state-of-the-art performance on large-scale benchmarks and won the 1st place