Fully Convolutional Networks for Continuous Sign Language Recognition

Paper summary:

- Year: 2020
- Recognition type: continuous sentences
- Input data type: RGB
- Manual features: Shape (hand shape)
- Non-manual features: Head, Body joints
- Full-frame: RGB
- Datasets: Phoenix14, Phoenix14-T, CSL
- Recognition targets: DGS (German Sign Language), CSL (Chinese Sign Language)
- Keywords: Continuous sign language recognition; Fully convolutional network; Joint training; Online recognition
- Discussion: Building on the FCN, the proposed network only needs a continuous feed of (sufficiently many) frames; it can then combine all intermediate outputs into a final recognition result that matches the ground truth. The authors use this property to test in the Concat All scenario, where all test samples are concatenated into one large sample. Unfortunately, LSTM-based models cannot be tested in the Concat All scenario, because the large sample takes too much memory to serve as input. All results in Table 5 show that the proposed network has stronger generalization and better recognition flexibility, making it suitable for complex real-world recognition scenarios. The FCN structure also lets the network significantly reduce the memory required for recognition. Meanwhile, the experimental results in the Split Concat scenario show that the method can recognize not only isolated signs but also the meaning of sign phrases and paragraphs. From the results in the Split scenario we can further conclude that there is no need to wait for all signing glosses to arrive during recognition: as long as enough frames are available to the network, accurate intermediate (partial) recognition results can be given. With this property, the method can provide intermediate results gloss by gloss over a long stretch of time, which is very friendly to SLR users from a human-computer-interaction perspective. These properties give the proposed network promising applications in online recognition. A visual demonstration is shown in the supplementary demo video.
- Summary: 1. The first end-to-end trainable fully convolutional network for continuous sign language recognition that needs no pre-training. 2. A jointly trained GFE module is introduced to enhance the representativeness of features. Experiments show that (1) the network achieves the best performance among RGB-based methods on the benchmark datasets; (2) compared with LSTM-based networks in real-world recognition scenarios, it delivers consistent performance and shows many significant advantages; (3) these advantages make the proposed network robust and capable of online recognition.
- Outlook: 1. A possible future direction for continuous SLR is to exploit the fact that some glosses are letter combinations (fingerspelling) to strengthen supervision; however, this may require additional annotation preprocessing and sign-language expertise. 2. With its better gloss recognition accuracy, the network is a promising basis for sign language translation (SLT). 3. The authors hope the proposed network inspires future research on sequence recognition tasks to explore FCN structures as an alternative to LSTM-based models, especially for tasks with limited training data.

Abstract.

Continuous sign language recognition (SLR) is a challenging task that requires learning on both spatial and temporal dimensions of signing frame sequences. Most recent work accomplishes this by using CNN and RNN hybrid networks. However, training these networks is generally non-trivial, and most of them fail in learning unseen sequence patterns, causing an unsatisfactory performance for online recognition. In this paper, we propose a fully convolutional network (FCN) for online SLR to concurrently learn spatial and temporal features from weakly annotated video sequences with only sentence-level annotations given. A gloss feature enhancement (GFE) module is introduced in the proposed network to enforce better sequence alignment learning. The proposed network is end-to-end trainable without any pre-training. We conduct experiments on two large scale SLR datasets. Experiments show that our method for continuous SLR is effective and performs well in online recognition.

Keywords: Continuous sign language recognition; Fully convolutional network; Joint training; Online recognition


| Research object | Method | Dataset | Results |
| --- | --- | --- | --- |
| Continuous sentence SLR | End-to-end trained online fully convolutional network; gloss feature enhancement (GFE) module; no pre-training needed | - | - |

gloss: note that this is not "gloss" in the dictionary sense of surface shininess; in sign language linguistics, a gloss is the written word used to label a single sign, i.e., the lexical unit of sign language.

1  Introduction

  Sign language is a common communication method for people with impaired hearing. It comprises a wide range of gestures, actions, and even facial expressions. In linguistic terms, a gloss is regarded as the unit of the sign language [27]. To sign a gloss, one may have to complete one or a series of gestures and actions. However, many glosses have very similar gestures and movements because of the richness of the vocabulary in a sign language. Also, because different people have different signing speeds, the same signing gloss may have different lengths. Not to mention that, unlike spoken languages, a sign language such as ASL [22] usually does not have a standard structured grammar. These facts place additional difficulties in solving continuous SLR because it requires the model to be highly capable of learning spatial and temporal information in the signing sequences.

| Good | Critical takeaway | Reference |
| --- | --- | --- |
| "In linguistic terms, a gloss is regarded as the unit of the sign language [27]." | - | Ong, S., Ranganath, S.: Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 873–91 (2005) |
| "Because different people have different signing speeds, the same signing gloss may have different lengths." | That is, everyone has a personal signing style, just as hearing people have different speaking habits. | - |
| "Sign language usually does not have a standard structured grammar." | - | - |

  Early work on continuous SLR [6,18,34] utilizes hand-crafted features followed by Hidden Markov Models (HMMs) [43,48] or Dynamic Time Warping (DTW) [47] as common practices. More recent approaches achieve state-of-the-art results using CNN and RNN hybrid models [4,14,44]. However, we observe that these hybrid models tend to focus on the sequential order of seen signing sequences in the training data but not the glosses, due to the existence of RNN. So, it is sometimes hard for these trained networks to recognize unseen signing sequences with different sequential patterns. Also, training of these models is generally non-trivial, as most of them require pre-training and incorporate an iterative training strategy [4], which greatly lengthens the training process. Furthermore, the robustness of previous models is limited to sentence recognition only; most of the methods fail when the test cases are signing videos of a phrase (sentence fragment) or a paragraph (several sentences). Online recognition requires good recognition responses for partial sentences, but these models usually cannot give correct recognition until the signer finishes all the signing glosses in a sentence. Such limitation in robustness makes online recognition almost impossible for CNN and RNN hybrid models.

Survey of other methods: early and subsequent work is mentioned briefly; the emphasis is on the recent work that this paper mainly compares against.

| Period (early, then…) | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2015 | Evangelidis, G.D., Singh, G., Horaud, R.: Continuous gesture recognition from articulated poses. In: Proceedings of European Conference on Computer Vision. pp. 595–607 (2015) | Hand-crafted features | - | - |
| 2015 | Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141, 108–125 (2015) | Hand-crafted features | - | - |
| 2013 | Sun, C., Zhang, T., Bao, B.K., Xu, C., Mei, T.: Discriminative exemplar coding for sign language recognition with kinect. IEEE Transactions on Cybernetics 43, 1418–1428 (2013) | Hand-crafted features | - | - |
| 2016 | Yang, W., Tao, J., Ye, Z.: Continuous sign language recognition using level building based on fast hidden markov model. Pattern Recognition Letters 78, 28–35 (2016) | Hidden Markov Model (HMM) | - | - |
| 2016 | Zhang, J., Zhou, W., Xie, C., Pu, J., Li, H.: Chinese sign language recognition with adaptive hmm. In: Proceedings of IEEE International Conference on Multimedia and Expo. pp. 1–6 (2016) | Hidden Markov Model (HMM) | - | - |
| 2014 | Zhang, J., Zhou, W., Li, H.: A threshold-based hmm-dtw approach for continuous sign language recognition. In: Proceedings of International Conference on Internet Multimedia Computing and Service. pp. 237–240 (2014) | Hidden Markov Model (HMM) | - | - |

| Period (recent) | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2017 | Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1610–1618 (2017) | CNN+RNN hybrid model | Hard to recognize unseen signing sequences with different sequential patterns; long training process; poor robustness | - |
| 2018 | Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of AAAI Conference on Artificial Intelligence. pp. 2257–2264 (2018) | CNN+RNN hybrid model | Same as above | - |
| 2019 | Yang, Z., Shi, Z., Shen, X., Tai, Y.W.: Sf-net: Structured feature network for continuous sign language recognition. arXiv preprint arXiv:1908.01341 (2019) | CNN+RNN hybrid model | Same as above | - |

  In this paper, we propose a fully convolutional network [24] for continuous SLR to address these challenges. The proposed network can be trained end-to-end without any pre-training. On top of this, we introduce a GFE module to enhance the representativeness of features. The FCN design enables the proposed network to recognize new unseen signing sentences, or even unseen phrases and paragraphs. We conduct different sets of experiments on two public continuous SLR datasets. The major contribution of this work can be summarized as follows:

    1. We are the first to propose a fully convolutional end-to-end trainable network for continuous SLR. The proposed FCN method models the semantic structure of sign language as glosses instead of sentences. Results show that the proposed network achieves state-of-the-art accuracy on both datasets, compared with other RGB-based methods.
    1. The proposed GFE module enforces additional rectified supervision and is jointly trained along with the mainstream network. Compared with iterative training, joint training with the GFE module speeds up the training process because joint training does not require additional fine-tuning stages.
    1. The FCN architecture achieves better adaptability in more complex real-world recognition scenarios, where previous LSTM-based methods would almost fail. This attribute makes the proposed network able to do online recognition and very suitable for real-world deployment applications.


2 Related Work

  There are mainly two scenarios in SLR: isolated SLR and continuous SLR. Isolated SLR mainly focuses on the scenario where glosses have been well segmented temporally. Work in the field generally solves the task with methods such as Hidden Markov Models (HMMs) [10,12,13,29,35,42], Dynamic Time Warping (DTW) [36], and Conditional Random Field (CRF) [40,41]. As for continuous SLR, the task becomes more difficult as it aims to recognize glosses in the scenarios where no gloss segmentation is available but only sentence-level annotations as a whole. Learning separated individual glosses becomes more difficult in the weakly supervised setting. Many approaches propose to estimate the temporal boundary of different glosses first and then apply isolated SLR techniques and sequence to sequence methods [7,16] to construct the sentence.

Isolated SLR: mainly focuses on the scenario where glosses have already been well segmented temporally.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2017 | Guo, D., Zhou, W., Li, H., Wang, M.: Online early-late fusion based on adaptive hmm for sign language recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 14, 1–18 (2017) | Hidden Markov Model (HMM) | - | - |
| 2016 | Guo, D., Zhou, W., Wang, M., Li, H.: Sign language recognition based on adaptive hmms with data augmentation. In: Proceedings of IEEE International Conference on Image Processing. pp. 2876–2880 (2016) | Hidden Markov Model (HMM) | - | - |
| 2009 | Han, J., Awad, G., Sutherland, A.: Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pattern Recognition Letters 30, 623–633 (2009) | Hidden Markov Model (HMM) | - | - |
| 2011 | Pitsikalis, V., Theodorakis, S., Vogler, C., Maragos, P.: Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 1–6 (2011) | Hidden Markov Model (HMM) | - | - |
| 2009 | Theodorakis, S., Katsamanis, A., Maragos, P.: Product-hmms for automatic sign language recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 1601–1604 (2009) | Hidden Markov Model (HMM) | - | - |
| 2006 | Yang, R., Sarkar, S.: Gesture recognition using hidden markov models from fragmented observations. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 766–773 (2006) | Hidden Markov Model (HMM) | - | - |
| 2014 | Vela, A.H., Bautista, M., Perez-Sala, X., Ponce-Lpez, V., Escalera, S., Bar, X., Pujol, O., Angulo, C.: Probability-based dynamic time warping and bag-of-visual-and-depth-words for human gesture recognition in rgb-d. Pattern Recognition Letters 50, 112–121 (2014) | Dynamic Time Warping (DTW) | - | - |
| 2010 | Yang, H.D., Lee, S.W.: Robust sign language recognition with hierarchical conditional random fields. In: Proceedings of IEEE International Conference on Pattern Recognition. pp. 2202–2205 (2010) | Conditional Random Field (CRF) | - | - |
| 2006 | Yang, R., Sarkar, S.: Detecting coarticulation in sign language using conditional random fields. In: Proceedings of IEEE International Conference on Pattern Recognition. pp. 108–112 (2006) | Conditional Random Field (CRF) | - | - |
Continuous SLR: aims to recognize glosses when no gloss segmentation is available, only sentence-level annotations as a whole.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2002 | Fang, G., Gao, W.: A srn/hmm system for signer-independent continuous sign language recognition. In: Proceedings of IEEE International Conference on Automatic Face Gesture Recognition. pp. 312–317 (2002) | Applies isolated-SLR techniques and sequence-to-sequence methods to construct the sentence | - | - |
| 2009 | Kelly, D., McDonald, J., Markham, C.: Recognizing spatiotemporal gestures and movement epenthesis in sign language. In: Proceedings of IEEE International Conference on Image Processing and Machine Vision. pp. 145–150 (2009) | Applies isolated-SLR techniques and sequence-to-sequence methods to construct the sentence | - | - |

  Concerning temporal boundary estimation, Cooper and Bowden [3] develop a method to extract similar video regions for inferring alignments in videos by using data mining and head and hand tracking. Farhadi and Forsyth [8] also come up with a method that utilizes HMMs to build a discriminative model for estimating the start and end frames of the glosses in video streams with a voting method. Yin et al. [46] make further improvements by introducing a weakly supervised metric learning framework to address the inter-signer variation problem in real applications of SLR.
Note: "inferring alignments" here means inferring which regions of the video correspond to which glosses, i.e., aligning video segments with words.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2009 | Cooper, H., Bowden, R.: Learning signs from subtitles: A weakly supervised approach to sign language recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2568–2574 (2009) | Extracts similar video regions for inferring alignments via data mining plus head and hand tracking | - | - |
| 2006 | Farhadi, A., Forsyth, D.: Aligning asl for statistical translation using a discriminative word model. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1471–1476 (2006) | Builds a discriminative model with HMMs and uses voting to estimate the start and end frames of glosses in video streams | - | - |
| 2015 | Yin, F., Chai, X., Zhou, Y., Chen, X.: Weakly supervised metric learning towards signer adaptation for sign language recognition. In: Proceedings of British Machine Vision Conference. pp. 35.1–35.12 (2015) | Introduces a weakly supervised metric learning framework to address inter-signer variation | - | - |

  As for sequence to sequence methods, much work follows the framework used in the topic of speech recognition [25,33], handwriting recognition [23,32], and video captioning [39]. Specifically, an encoder module is responsible for extracting features in the input video frame sequences, and a CTC module acts as a cost function to learn the ground truth sequences. This framework also shows good performance on continuous SLR, and more recent work applies CNN and RNN hybrid models to infer gloss alignments implicitly [2,14,26,30]. However, RNNs are sometimes more sensitive to the sequential order than the spatial features. As a result, these models tend to learn much about the sequential signing patterns but little about the glosses (words), causing the failure of the recognition for unseen phrases and paragraphs.
Note: "unseen" here means phrases and paragraphs whose sequential patterns did not appear in the training data, not literally invisible.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2015 | Miao, Y., Gowayyed, M., Metze, F.: Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In: IEEE Conference on Automatic Speech Recognition and Understanding Workshops. pp. 167–174 (2015) | Borrows the speech recognition framework | - | - |
| 2015 | Sak, H., Senior, A., Rao, K., rsoy, O., Graves, A., Beaufays, F., Schalkwyk, J.: Learning acoustic frame labeling for speech recognition with recurrent neural networks. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4280–4284 (2015) | Borrows the speech recognition framework | - | - |
| 2007 | Liwicki, M., Graves, A., Bunke, H., Schmidhuber, J.: A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In: Proceedings of International Conference on Document Analysis and Recognition. pp. 367–371 (2007) | Borrows the handwriting recognition framework | - | - |
| 2017 | Puigcerver, J.: Are multidimensional recurrent layers really necessary for handwritten text recognition? In: Proceedings of International Conference on Document Analysis and Recognition. pp. 67–72 (2017) | Borrows the handwriting recognition framework | - | - |
| 2018 | Wang, B., Ma, L., Zhang, W., Liu, W.: Reconstruction network for video captioning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 7622–7631 (2018) | Borrows the video captioning framework | - | - |
| 2017 | Camgoz, N.C., Hadfield, S., Koller, O., Bowden, R.: Subunets: End-to-end hand shape and continuous sign language recognition. In: Proceedings of IEEE International Conference on Computer Vision. pp. 3075–3084 (2017) | CNN+RNN hybrid model to implicitly infer gloss alignments | RNNs are sometimes more sensitive to sequential order than to spatial features, so these models learn the sequential signing patterns well but the glosses (words) poorly, failing on unseen phrases and paragraphs | - |
| 2018 | Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of AAAI Conference on Artificial Intelligence. pp. 2257–2264 (2018) | CNN+RNN hybrid model to implicitly infer gloss alignments | Same as above | - |
| 2016 | Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 4207–4215 (2016) | CNN+RNN hybrid model to implicitly infer gloss alignments | Same as above | - |
| 2018 | Pu, J., Zhou, W., Li, H.: Dilated convolutional network with iterative optimization for continuous sign language recognition. In: Proceedings of International Joint Conference on Artificial Intelligence. pp. 885–891 (2018) | CNN+RNN hybrid model to implicitly infer gloss alignments | Same as above | - |

Borrowing the speech recognition framework for sign language recognition coincides with my own idea: I had likewise planned to use HMMs to predict continuous sign language sentences.
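The CTC module mentioned above learns ground-truth sequences by allowing repeated labels and a blank symbol per frame; at decoding time the frame-level best path is collapsed into a gloss sequence. A minimal sketch of that collapse rule (greedy decoding; the gloss IDs and blank index below are made up for illustration):

```python
def ctc_collapse(path, blank=0):
    # Collapse a frame-level best path into a gloss sequence:
    # 1) merge consecutive repeats, 2) drop blanks.
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# frame-level argmax over 10 frames: blank=0, glosses 3 and 7
print(ctc_collapse([0, 3, 3, 0, 0, 7, 7, 7, 0, 0]))  # [3, 7]
```

Note that a blank between two identical labels keeps them distinct, which is how CTC represents a repeated word.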

3 Method

Formally, the proposed network aims to learn a mapping $H: \mathcal{X} \mapsto \mathcal{Y}$ that can transform an input video frame sequence $\mathcal{X}$ to a target sequence $\mathcal{Y}$. The feature extraction contains two main steps: a frame feature encoder and a two-level gloss feature encoder. On top of them, a gloss feature enhancement (GFE) module is introduced to enhance the feature learning. An overview of the proposed network is shown in Figure 1.
Fig. 1. Overview of the proposed network. The network is fully convolutional and divides the feature encoding process into two main steps. A GFE module is introduced to enhance the feature learning.

3.1 Main stream design

  Frame feature encoder

Frame feature encoder. The proposed network first encodes spatial features of the input RGB frames. The frame feature encoder $S$ is composed of a convolutional backbone $S_{cnn}$ to extract features in the frames and a global average pooling layer $S_{gap}$ to compress the spatial features into feature vectors. Formally, each signing sequence is a tensor of shape $(t, c, h, w)$, where $t$ denotes the length of the sequence, $c$ denotes the number of channels, and $h$, $w$ denote the height and width of the frames. The process of encoder $S$ can be described as:

$$\{s\}^{t \times f_{s}} = S\left(\{x\}^{t \times c \times h \times w}\right) = S_{gap}\left(S_{cnn}\left(\{x\}^{t \times c \times h \times w}\right)\right). \quad (1)$$

The output is of shape $\{s\}^{t \times f_{s}}$. Note that the frame feature encoder treats each frame independently for the frame (spatial) feature learning.

From Eq. (1) and this paragraph: the frame feature encoder $S$ processes every frame of the video. The CNN $S_{cnn}$ first extracts features from each frame, and the global average pooling layer $S_{gap}$ then compresses the spatial features into a feature vector. Here $t$ denotes the sequence length, $c$ the number of channels, and $h$, $w$ the height and width of the frames.
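The per-frame shape flow of Eq. (1) can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the backbone $S_{cnn}$ is mocked with a random feature map, and only the global average pooling $S_{gap}$ is implemented for real, so the point is just the transformation $(t, c, h, w) \to (t, f_s)$:

```python
import numpy as np

def s_gap(feat):
    # S_gap: global average pooling over the spatial dims (h', w'),
    # compressing each frame's feature map into a single vector.
    return feat.mean(axis=(2, 3))                        # (t, f_s)

def frame_feature_encoder(x, f_s=512):
    # Toy stand-in for S = S_gap ∘ S_cnn. The real S_cnn is a 2D CNN
    # backbone; here its output shape is mocked with random values so
    # only the tensor shapes of Eq. (1) are demonstrated. Each frame
    # is treated independently.
    t, c, h, w = x.shape
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((t, f_s, h // 32, w // 32))  # "S_cnn(x)"
    return s_gap(feat)                                   # {s}^{t × f_s}

x = np.zeros((16, 3, 224, 224))   # t=16 RGB frames of 224×224
s = frame_feature_encoder(x)
print(s.shape)                    # (16, 512)
```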

  Two-level gloss feature encoder

Two-level gloss feature encoder. The two-level gloss feature encoder $G$ follows $S$ immediately and aims to encode gloss features. Instead of using LSTM layers, a common practice in temporal feature encoding, we achieve this by using 1D convolutional layers over the time dimension. Precisely, the first-level encoder $G_1$ consists of 1D-CNNs with a relatively larger filter size. Pooling layers can be used between convolutional layers to increase the window size when needed. Differently, the filter size is relatively smaller for the 1D-CNNs in the second-level encoder $G_2$, with no pooling layers used in $G_2$. So, $G_2$ does not change the temporal dimension but only reconsiders the contextual information between glosses by taking into account the neighboring glosses.

Following the frame feature encoder $S$ is the two-level gloss feature encoder $G$, which extracts gloss features. $G$ is composed of a first-level encoder $G_1$ and a second-level encoder $G_2$. $G_1$ consists of 1D-CNNs with a relatively large kernel, and when necessary the feature window can be enlarged by inserting pooling layers between the convolutional layers. $G_2$ uses 1D-CNNs with a smaller kernel and no pooling layers, so it keeps the temporal dimension unchanged and only refines the contextual information of each gloss from its neighboring glosses.
For the basics of convolutions, see: 深度学习笔记_基本概念_卷积网络中的通道channel、特征图feature map、过滤器filter和卷积核kernel
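The temporal behavior of $G_1$ and $G_2$ can be sketched with a bare-bones 1D convolution over the time axis. All kernel sizes and strides below are assumed for illustration (not the paper's exact configuration), and pooling is approximated here by a strided convolution:

```python
import numpy as np

def conv1d_time(x, kernel_size, stride=1):
    # Valid 1D convolution along the time axis of x with shape (t, f).
    # The weights are random: only the temporal geometry matters here.
    t, f = x.shape
    rng = np.random.default_rng(0)
    w = rng.standard_normal((kernel_size, f, f)) * 0.01
    out = [sum(x[i + j] @ w[j] for j in range(kernel_size))
           for i in range(0, t - kernel_size + 1, stride)]
    return np.stack(out)

s = np.random.default_rng(1).standard_normal((40, 64))  # {s}^{t × f_s}, t=40

# G1: larger kernels and a pooling-like stride shrink the time axis,
# turning windows of frame features into gloss features g: (t → k).
g = conv1d_time(conv1d_time(s, kernel_size=5, stride=1),
                kernel_size=5, stride=4)
# G2: a small kernel with stride 1 (length kept via padding) only mixes
# neighboring glosses; the temporal dimension k is preserved.
g2 = conv1d_time(np.pad(g, ((1, 1), (0, 0))), kernel_size=3, stride=1)

print(s.shape, g.shape, g2.shape)   # (40, 64) (8, 64) (8, 64)
```

Note how $G_1$ reduces 40 frame features to 8 gloss features, while $G_2$ leaves the count at 8, matching the "no pooling, temporal dimension unchanged" description.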

  The overall convolutional process of $G$ can be interpreted as a sliding window on the frame feature vector $\{s\}^{t \times f_{s}}$ along the time dimension. The sliding window size $l$ and the stride $\delta$ are determined by the accumulated receptive field size and the accumulated stride of the 1D-CNNs in $G_1$. Let $\{g\}^{k \times f_{g}}$ and $\{g'\}^{k \times f_{g'}}$ be the output tensors of the gloss feature encoders $G_1$ and $G_2$, respectively. The operation of the encoder $G$ can be formulated as:

$$\{g'\}^{k \times f_{g'}} = G\left(\{s\}^{t \times f_{s}}\right) = G_{2}\left(G_{1}\left(\{s\}^{t \times f_{s}}\right)\right) = G_{2}\left(\{g\}^{k \times f_{g}}\right). \quad (2)$$
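The window size $l$ and stride $\delta$ follow from the standard receptive-field accumulation over the stacked temporal layers of $G_1$. A short helper makes the bookkeeping concrete (the layer configuration in the example is an assumption for illustration, not taken from the paper):

```python
def accumulated_window(layers):
    # layers: list of (kernel_size, stride) for each 1D conv/pool layer
    # of G1, in order. Returns (l, delta): the window of input frames
    # that one output gloss feature sees, and the step between the
    # windows of adjacent outputs.
    l, delta = 1, 1
    for k, s in layers:
        l += (k - 1) * delta   # each layer widens the window
        delta *= s             # strides compound multiplicatively
    return l, delta

# assumed G1: conv(k=5), maxpool(k=2, s=2), conv(k=5), maxpool(k=2, s=2)
print(accumulated_window([(5, 1), (2, 2), (5, 1), (2, 2)]))  # (16, 4)
```

With this assumed stack, each gloss feature covers $l = 16$ input frames and consecutive gloss features are $\delta = 4$ frames apart, which is exactly the sliding-window interpretation above.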
