Paper: 3D Convolutional Neural Networks for Human Action Recognition

SpengTAN

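The paper proposes a 3D convolutional neural network (3D CNN) model for automatically recognizing human actions in uncontrolled environments. By performing 3D convolutions, the 3D CNN extracts features along both the spatial and temporal dimensions and so captures the motion information in a video stream. The experiments show that the model achieves superior performance on action recognition in real-world environments without relying on handcrafted features.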

3D Convolutional Neural Networks for Human Action Recognition

1. Abstract

We consider the fully automated recognition of actions in uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep models that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handle 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environment, and it achieves superior performance without relying on handcrafted features.
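In short: rather than relying on handcrafted features built from domain knowledge under controlled conditions, the paper extends CNNs, which were previously limited to 2D inputs, to 3D. **The model extracts features from both the spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.** It generates multiple channels of information from the input frames and combines the information from all channels into the final feature representation.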

2. Approach

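1) The paper proposes 3D convolution kernels to extract spatial and temporal features from video data. Because these 3D feature extractors operate over both the spatial and temporal dimensions, they can capture the motion information in a video stream.
2) A 3D convolutional neural network is built on these 3D feature extractors. The architecture generates multiple channels of information from consecutive video frames, applies convolution and subsampling to each channel separately, and finally combines the information from all channels into the final feature representation.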
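To make the idea of a 3D feature extractor concrete, here is a minimal NumPy sketch of a naive "valid" 3D convolution (computed as cross-correlation, as is usual in CNNs). It is an illustration only, not the paper's implementation: the function name `conv3d_valid`, the toy clips, and the temporal-difference kernel are all assumptions chosen for this example, and the bias and nonlinearity of a real layer are omitted.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation.

    volume: array of shape (frames, rows, cols)
    kernel: array of shape (kt, kh, kw)
    """
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Each output value sums over a small cube that spans
                # several adjacent frames as well as a spatial patch.
                patch = volume[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(patch * kernel)
    return out

# Hypothetical kernel: difference between two adjacent frames.
temporal_diff = np.array([-1.0, 1.0]).reshape(2, 1, 1)

# Toy clips of 3 frames each (4x4 pixels): one static, one shifting sideways.
static_clip = np.tile(np.eye(4), (3, 1, 1))
moving_clip = np.stack([np.roll(np.eye(4), shift=t, axis=1) for t in range(3)])

static_response = conv3d_valid(static_clip, temporal_diff)
moving_response = conv3d_valid(moving_clip, temporal_diff)

print(static_response.shape)          # (2, 4, 4)
print(np.abs(static_response).max())  # 0.0
print(np.abs(moving_response).max())  # 1.0
```

The static clip yields a zero response while the moving clip does not, which is the sense in which a kernel that spans adjacent frames can respond to motion; a purely 2D kernel applied frame by frame would treat both clips the same way.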

Network architecture

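The network consists of one hardwired layer, three convolutional layers, two subsampling layers, and one fully connected layer. The cube that each 3D kernel convolves over spans 7 consecutive frames, each of size 60x40.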
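As a rough aid for following how the 7x60x40 input cube shrinks through the layers, the shape arithmetic can be sketched as below. The post does not state kernel or subsampling sizes, so the 3x7x7 kernel (3 frames deep, 7x7 spatially) and the 2x2 subsampling factor are assumptions used purely for illustration, as are the helper names.

```python
def valid_conv_shape(in_shape, kernel_shape):
    """Output shape of a 'valid' convolution (no padding, stride 1)."""
    return tuple(i - k + 1 for i, k in zip(in_shape, kernel_shape))

def spatial_subsample_shape(in_shape, factor):
    """Output shape after non-overlapping spatial subsampling.

    The temporal axis is left untouched; only rows and cols shrink.
    """
    frames, rows, cols = in_shape
    return (frames, rows // factor[0], cols // factor[1])

input_cube = (7, 60, 40)    # 7 consecutive frames of size 60x40 (from the post)
assumed_kernel = (3, 7, 7)  # assumption: not stated in the post
assumed_pool = (2, 2)       # assumption: not stated in the post

after_conv = valid_conv_shape(input_cube, assumed_kernel)
after_pool = spatial_subsample_shape(after_conv, assumed_pool)

print(after_conv)  # (5, 54, 34)
print(after_pool)  # (5, 27, 17)
```

Under these assumed sizes, one convolution followed by one subsampling step reduces each 60x40 map to 27x17 while shortening the temporal extent from 7 to 5.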
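The first layer applies a fixed set of hardwired kernels to the raw frames to produce multiple channels of information, and each channel is then processed separately. At the end, the information from all channels is combined into the final feature representation.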
This results in 33 feature maps in the second layer in 5 different channels known as gray, gradient-x, gradient-y, optflow-x, and optflow-y. The gray channel contains the gray pixel values of the 7 input frames. The feature maps in the gradient-x and gradient-y channels are obtained by computing gradients along the horizontal and vertical directions, respectively, on each of the 7 input frames, and the optflow-x and optflow-y channels contain the optical flow fields, along the horizontal and vertical directions, respectively, computed from adjacent input frames. This hardwired layer is used to encode our prior knowledge on features, and this scheme usually leads to better performance as compared to random initialization.
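That is, the hardwired layer produces 33 feature maps in 5 channels: gray holds the gray pixel values of the 7 input frames (7 maps); gradient-x and gradient-y hold the horizontal and vertical gradients computed on each of the 7 frames (7 maps each); optflow-x and optflow-y hold the horizontal and vertical optical flow fields computed from adjacent input frames (6 maps each, since 7 frames give 6 adjacent pairs). In total, 7 + 7 + 7 + 6 + 6 = 33. The hardwired layer encodes prior knowledge about features, and this scheme usually performs better than random initialization.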
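The channel bookkeeping of the hardwired layer can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the frame layout is assumed to be (frames, rows, cols), and the optical flow is replaced by a crude normal-flow estimate from the brightness-constancy equation, because the post does not say which optical flow method the paper uses.

```python
import numpy as np

def normal_flow(prev_frame, next_frame, eps=1e-6):
    """Crude stand-in for optical flow (assumption, not the paper's method).

    Returns the flow component along the image gradient, derived from
    the brightness-constancy equation Ix*u + Iy*v + It = 0.
    """
    grad_rows, grad_cols = np.gradient(next_frame)  # d/d(rows), d/d(cols)
    dt = next_frame - prev_frame
    mag2 = grad_cols ** 2 + grad_rows ** 2 + eps
    return -dt * grad_cols / mag2, -dt * grad_rows / mag2

def hardwired_layer(frames):
    """Map a (frames, rows, cols) grayscale cube to the five channels."""
    # Spatial gradients of every frame: axis 1 is vertical, axis 2 horizontal.
    grad_y, grad_x = np.gradient(frames, axis=(1, 2))
    # One flow field per pair of adjacent frames.
    flows = [normal_flow(frames[t], frames[t + 1])
             for t in range(len(frames) - 1)]
    return {
        "gray": frames,
        "gradient-x": grad_x,
        "gradient-y": grad_y,
        "optflow-x": np.stack([u for u, _ in flows]),
        "optflow-y": np.stack([v for _, v in flows]),
    }

# Random stand-in for 7 consecutive grayscale frames of size 60x40.
frames = np.random.default_rng(0).random((7, 60, 40))
channels = hardwired_layer(frames)
total_maps = sum(len(maps) for maps in channels.values())

for name, maps in channels.items():
    print(name, maps.shape)
print(total_maps)  # 33
```

The three per-frame channels contribute 7 maps each and the two flow channels contribute 6 each, because 7 frames form only 6 adjacent pairs, which gives the 33 feature maps quoted from the paper.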

3. Experiments

Four actions: down, up, left, right