Fully Convolutional Networks for Continuous Sign Language Recognition

Paper summary:

- Year: 2020
- Recognition type: continuous sentences
- Input data type: RGB
- Manual features: Shape (hand shape)
- Non-manual features: Head, Body joints
- Full-frame: RGB
- Datasets: Phoenix14, Phoenix14-T, CSL
- Recognition targets: DGS (German Sign Language), CSL (Chinese Sign Language)
- Keywords: Continuous sign language recognition; Fully convolutional network; Joint training; Online recognition
- Discussion: Building on the FCN, the proposed network only needs a continuous feed of (sufficiently many) frames; it can then combine all intermediate outputs into a final recognition result that matches the ground truth. The authors use this property to test in the Concat All scenario, where all test samples are concatenated into one large sample. Unfortunately, LSTM-based models cannot be tested in the Concat All scenario, because the large sample takes too much memory to serve as input. All results in Table 5 show that the proposed network has stronger generalization and better recognition flexibility, making it suitable for complex real-world recognition scenarios. The FCN structure also lets the network significantly reduce the memory required for recognition. Meanwhile, the experimental results in the Split Concat scenario show that the method can recognize not only isolated signs but also the meaning of sign phrases and paragraphs. From the results in the Split scenario we can further conclude that there is no need to wait for all signing glosses to arrive during recognition: as long as enough frames are available to the network, accurate intermediate (partial) recognition results can be given. With this property, the method can provide intermediate results gloss by gloss over a long stretch of time, which is very friendly to SLR users from a human-computer-interaction perspective. These properties give the proposed network promising applications in online recognition. A visual demonstration is shown in the supplementary demo video.
- Summary: 1. The first end-to-end trainable fully convolutional network for continuous sign language recognition that needs no pre-training. 2. A jointly trained GFE module is introduced to enhance the representativeness of features. Experiments show that (1) the network achieves the best performance among RGB-based methods on the benchmark datasets; (2) compared with LSTM-based networks in real-world recognition scenarios, it delivers consistent performance and shows many significant advantages; (3) these advantages make the proposed network robust and capable of online recognition.
- Outlook: 1. A possible future direction for continuous SLR is to exploit the fact that some glosses are letter combinations (fingerspelling) to strengthen supervision; however, this may require additional annotation preprocessing and sign-language expertise. 2. With its better gloss recognition accuracy, the network is a promising basis for sign language translation (SLT). 3. The authors hope the proposed network inspires future research on sequence recognition tasks to explore FCN structures as an alternative to LSTM-based models, especially for tasks with limited training data.

Abstract.

Continuous sign language recognition (SLR) is a challenging task that requires learning on both spatial and temporal dimensions of signing frame sequences. Most recent work accomplishes this by using CNN and RNN hybrid networks. However, training these networks is generally non-trivial, and most of them fail in learning unseen sequence patterns, causing an unsatisfactory performance for online recognition. In this paper, we propose a fully convolutional network (FCN) for online SLR to concurrently learn spatial and temporal features from weakly annotated video sequences with only sentence-level annotations given. A gloss feature enhancement (GFE) module is introduced in the proposed network to enforce better sequence alignment learning. The proposed network is end-to-end trainable without any pre-training. We conduct experiments on two large scale SLR datasets. Experiments show that our method for continuous SLR is effective and performs well in online recognition.

Keywords: Continuous sign language recognition; Fully convolutional network; Joint training; Online recognition


| Research object | Method | Dataset | Results |
| --- | --- | --- | --- |
| Continuous sentence SLR | End-to-end trained online fully convolutional network; gloss feature enhancement (GFE) module; no pre-training needed | - | - |

gloss: note that this is not "gloss" in the dictionary sense of surface shininess; in sign language linguistics, a gloss is the written word used to label a single sign, i.e., the lexical unit of sign language.

1  Introduction

  Sign language is a common communication method for people with impaired hearing. It comprises a wide range of gestures, actions, and even facial expressions. In linguistic terms, a gloss is regarded as the unit of the sign language [27]. To sign a gloss, one may have to complete one or a series of gestures and actions. However, many glosses have very similar gestures and movements because of the richness of the vocabulary in a sign language. Also, because different people have different signing speeds, the same signing gloss may have different lengths. Not to mention that, unlike spoken languages, a sign language such as ASL [22] usually does not have a standard structured grammar. These facts place additional difficulties in solving continuous SLR because it requires the model to be highly capable of learning spatial and temporal information in the signing sequences.

| Good | Critical takeaway | Reference |
| --- | --- | --- |
| "In linguistic terms, a gloss is regarded as the unit of the sign language [27]." | - | Ong, S., Ranganath, S.: Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 873–91 (2005) |
| "Because different people have different signing speeds, the same signing gloss may have different lengths." | That is, everyone has a personal signing style, just as hearing people have different speaking habits. | - |
| "Sign language usually does not have a standard structured grammar." | - | - |

  Early work on continuous SLR [6,18,34] utilizes hand-crafted features followed by Hidden Markov Models (HMMs) [43,48] or Dynamic Time Warping (DTW) [47] as common practices. More recent approaches achieve state-of-the-art results using CNN and RNN hybrid models [4,14,44]. However, we observe that these hybrid models tend to focus on the sequential order of seen signing sequences in the training data but not the glosses, due to the existence of RNN. So, it is sometimes hard for these trained networks to recognize unseen signing sequences with different sequential patterns. Also, training of these models is generally non-trivial, as most of them require pre-training and incorporate an iterative training strategy [4], which greatly lengthens the training process. Furthermore, the robustness of previous models is limited to sentence recognition only; most of the methods fail when the test cases are signing videos of a phrase (sentence fragment) or a paragraph (several sentences). Online recognition requires good recognition responses for partial sentences, but these models usually cannot give correct recognition until the signer finishes all the signing glosses in a sentence. Such limitation in robustness makes online recognition almost impossible for CNN and RNN hybrid models.

Survey of other methods: early and subsequent work is mentioned briefly; the emphasis is on the recent work that this paper mainly compares against.

| Period (early, then…) | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2015 | Evangelidis, G.D., Singh, G., Horaud, R.: Continuous gesture recognition from articulated poses. In: Proceedings of European Conference on Computer Vision. pp. 595–607 (2015) | Hand-crafted features | - | - |
| 2015 | Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141, 108–125 (2015) | Hand-crafted features | - | - |
| 2013 | Sun, C., Zhang, T., Bao, B.K., Xu, C., Mei, T.: Discriminative exemplar coding for sign language recognition with kinect. IEEE Transactions on Cybernetics 43, 1418–1428 (2013) | Hand-crafted features | - | - |
| 2016 | Yang, W., Tao, J., Ye, Z.: Continuous sign language recognition using level building based on fast hidden markov model. Pattern Recognition Letters 78, 28–35 (2016) | Hidden Markov Model (HMM) | - | - |
| 2016 | Zhang, J., Zhou, W., Xie, C., Pu, J., Li, H.: Chinese sign language recognition with adaptive hmm. In: Proceedings of IEEE International Conference on Multimedia and Expo. pp. 1–6 (2016) | Hidden Markov Model (HMM) | - | - |
| 2014 | Zhang, J., Zhou, W., Li, H.: A threshold-based hmm-dtw approach for continuous sign language recognition. In: Proceedings of International Conference on Internet Multimedia Computing and Service. pp. 237–240 (2014) | Hidden Markov Model (HMM) | - | - |

| Period (recent) | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2017 | Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1610–1618 (2017) | CNN+RNN hybrid model | Hard to recognize unseen signing sequences with different sequential patterns; long training process; poor robustness | - |
| 2018 | Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of AAAI Conference on Artificial Intelligence. pp. 2257–2264 (2018) | CNN+RNN hybrid model | Same as above | - |
| 2019 | Yang, Z., Shi, Z., Shen, X., Tai, Y.W.: Sf-net: Structured feature network for continuous sign language recognition. arXiv preprint arXiv:1908.01341 (2019) | CNN+RNN hybrid model | Same as above | - |

  In this paper, we propose a fully convolutional network [24] for continuous SLR to address these challenges. The proposed network can be trained end-to-end without any pre-training. On top of this, we introduce a GFE module to enhance the representativeness of features. The FCN design enables the proposed network to recognize new unseen signing sentences, or even unseen phrases and paragraphs. We conduct different sets of experiments on two public continuous SLR datasets. The major contribution of this work can be summarized as follows:

    1. We are the first to propose a fully convolutional end-to-end trainable network for continuous SLR. The proposed FCN method models the semantic structure of sign language as glosses instead of sentences. Results show that the proposed network achieves state-of-the-art accuracy on both datasets, compared with other RGB-based methods.
    1. The proposed GFE module enforces additional rectified supervision and is jointly trained along with the mainstream network. Compared with iterative training, joint training with the GFE module speeds up the training process because joint training does not require additional fine-tuning stages.
    1. The FCN architecture achieves better adaptability in more complex real-world recognition scenarios, where previous LSTM-based methods would almost fail. This attribute makes the proposed network able to do online recognition and very suitable for real-world deployment applications.


2 Related Work

  There are mainly two scenarios in SLR: isolated SLR and continuous SLR. Isolated SLR mainly focuses on the scenario where glosses have been well segmented temporally. Work in the field generally solves the task with methods such as Hidden Markov Models (HMMs) [10,12,13,29,35,42], Dynamic Time Warping (DTW) [36], and Conditional Random Field (CRF) [40,41]. As for continuous SLR, the task becomes more difficult as it aims to recognize glosses in the scenarios where no gloss segmentation is available but only sentence-level annotations as a whole. Learning separated individual glosses becomes more difficult in the weakly supervised setting. Many approaches propose to estimate the temporal boundary of different glosses first and then apply isolated SLR techniques and sequence to sequence methods [7,16] to construct the sentence.

Isolated SLR: mainly focuses on the scenario where glosses have already been well segmented temporally.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2017 | Guo, D., Zhou, W., Li, H., Wang, M.: Online early-late fusion based on adaptive hmm for sign language recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 14, 1–18 (2017) | Hidden Markov Model (HMM) | - | - |
| 2016 | Guo, D., Zhou, W., Wang, M., Li, H.: Sign language recognition based on adaptive hmms with data augmentation. In: Proceedings of IEEE International Conference on Image Processing. pp. 2876–2880 (2016) | Hidden Markov Model (HMM) | - | - |
| 2009 | Han, J., Awad, G., Sutherland, A.: Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pattern Recognition Letters 30, 623–633 (2009) | Hidden Markov Model (HMM) | - | - |
| 2011 | Pitsikalis, V., Theodorakis, S., Vogler, C., Maragos, P.: Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 1–6 (2011) | Hidden Markov Model (HMM) | - | - |
| 2009 | Theodorakis, S., Katsamanis, A., Maragos, P.: Product-hmms for automatic sign language recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 1601–1604 (2009) | Hidden Markov Model (HMM) | - | - |
| 2006 | Yang, R., Sarkar, S.: Gesture recognition using hidden markov models from fragmented observations. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 766–773 (2006) | Hidden Markov Model (HMM) | - | - |
| 2014 | Vela, A.H., Bautista, M., Perez-Sala, X., Ponce-Lpez, V., Escalera, S., Bar, X., Pujol, O., Angulo, C.: Probability-based dynamic time warping and bag-of-visual-and-depth-words for human gesture recognition in rgb-d. Pattern Recognition Letters 50, 112–121 (2014) | Dynamic Time Warping (DTW) | - | - |
| 2010 | Yang, H.D., Lee, S.W.: Robust sign language recognition with hierarchical conditional random fields. In: Proceedings of IEEE International Conference on Pattern Recognition. pp. 2202–2205 (2010) | Conditional Random Field (CRF) | - | - |
| 2006 | Yang, R., Sarkar, S.: Detecting coarticulation in sign language using conditional random fields. In: Proceedings of IEEE International Conference on Pattern Recognition. pp. 108–112 (2006) | Conditional Random Field (CRF) | - | - |
Continuous SLR: aims to recognize glosses when no gloss segmentation is available, only sentence-level annotations as a whole.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2002 | Fang, G., Gao, W.: A srn/hmm system for signer-independent continuous sign language recognition. In: Proceedings of IEEE International Conference on Automatic Face Gesture Recognition. pp. 312–317 (2002) | Applies isolated-SLR techniques and sequence-to-sequence methods to construct the sentence | - | - |
| 2009 | Kelly, D., McDonald, J., Markham, C.: Recognizing spatiotemporal gestures and movement epenthesis in sign language. In: Proceedings of IEEE International Conference on Image Processing and Machine Vision. pp. 145–150 (2009) | Applies isolated-SLR techniques and sequence-to-sequence methods to construct the sentence | - | - |

  Concerning temporal boundary estimation, Cooper and Bowden [3] develop a method to extract similar video regions for inferring alignments in videos by using data mining and head and hand tracking. Farhadi and Forsyth [8] also come up with a method that utilizes HMMs to build a discriminative model for estimating the start and end frames of the glosses in video streams with a voting method. Yin et al. [46] make further improvements by introducing a weakly supervised metric learning framework to address the inter-signer variation problem in real applications of SLR.
Note: "inferring alignments" here means inferring which regions of the video correspond to which glosses, i.e., aligning video segments with words.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2009 | Cooper, H., Bowden, R.: Learning signs from subtitles: A weakly supervised approach to sign language recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2568–2574 (2009) | Extracts similar video regions for inferring alignments via data mining plus head and hand tracking | - | - |
| 2006 | Farhadi, A., Forsyth, D.: Aligning asl for statistical translation using a discriminative word model. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1471–1476 (2006) | Builds a discriminative model with HMMs and uses voting to estimate the start and end frames of glosses in video streams | - | - |
| 2015 | Yin, F., Chai, X., Zhou, Y., Chen, X.: Weakly supervised metric learning towards signer adaptation for sign language recognition. In: Proceedings of British Machine Vision Conference. pp. 35.1–35.12 (2015) | Introduces a weakly supervised metric learning framework to address inter-signer variation | - | - |

  As for sequence to sequence methods, much work follows the framework used in the topic of speech recognition [25,33], handwriting recognition [23,32], and video captioning [39]. Specifically, an encoder module is responsible for extracting features in the input video frame sequences, and a CTC module acts as a cost function to learn the ground truth sequences. This framework also shows good performance on continuous SLR, and more recent work applies CNN and RNN hybrid models to infer gloss alignments implicitly [2,14,26,30]. However, RNNs are sometimes more sensitive to the sequential order than the spatial features. As a result, these models tend to learn much about the sequential signing patterns but little about the glosses (words), causing the failure of the recognition for unseen phrases and paragraphs.
Note: "unseen" here means phrases and paragraphs whose sequential patterns did not appear in the training data, not literally invisible.

| Year | Paper | Method | Drawbacks | Strengths |
| --- | --- | --- | --- | --- |
| 2015 | Miao, Y., Gowayyed, M., Metze, F.: Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In: IEEE Conference on Automatic Speech Recognition and Understanding Workshops. pp. 167–174 (2015) | Borrows the speech recognition framework | - | - |
| 2015 | Sak, H., Senior, A., Rao, K., rsoy, O., Graves, A., Beaufays, F., Schalkwyk, J.: Learning acoustic frame labeling for speech recognition with recurrent neural networks. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4280–4284 (2015) | Borrows the speech recognition framework | - | - |
| 2007 | Liwicki, M., Graves, A., Bunke, H., Schmidhuber, J.: A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In: Proceedings of International Conference on Document Analysis and Recognition. pp. 367–371 (2007) | Borrows the handwriting recognition framework | - | - |
| 2017 | Puigcerver, J.: Are multidimensional recurrent layers really necessary for handwritten text recognition? In: Proceedings of International Conference on Document Analysis and Recognition. pp. 67–72 (2017) | Borrows the handwriting recognition framework | - | - |
| 2018 | Wang, B., Ma, L., Zhang, W., Liu, W.: Reconstruction network for video captioning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 7622–7631 (2018) | Borrows the video captioning framework | - | - |
| 2017 | Camgoz, N.C., Hadfield, S., Koller, O., Bowden, R.: Subunets: End-to-end hand shape and continuous sign language recognition. In: Proceedings of IEEE International Conference on Computer Vision. pp. 3075–3084 (2017) | CNN+RNN hybrid model to implicitly infer gloss alignments | RNNs are sometimes more sensitive to sequential order than to spatial features, so these models learn the sequential signing patterns well but the glosses (words) poorly, failing on unseen phrases and paragraphs | - |
| 2018 | Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of AAAI Conference on Artificial Intelligence. pp. 2257–2264 (2018) | CNN+RNN hybrid model to implicitly infer gloss alignments | Same as above | - |
| 2016 | Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 4207–4215 (2016) | CNN+RNN hybrid model to implicitly infer gloss alignments | Same as above | - |
| 2018 | Pu, J., Zhou, W., Li, H.: Dilated convolutional network with iterative optimization for continuous sign language recognition. In: Proceedings of International Joint Conference on Artificial Intelligence. pp. 885–891 (2018) | CNN+RNN hybrid model to implicitly infer gloss alignments | Same as above | - |

Borrowing the speech recognition framework for sign language recognition coincides with my own idea: I had likewise planned to use HMMs to predict continuous sign language sentences.
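The CTC module mentioned above learns ground-truth sequences by allowing repeated labels and a blank symbol per frame; at decoding time the frame-level best path is collapsed into a gloss sequence. A minimal sketch of that collapse rule (greedy decoding; the gloss IDs and blank index below are made up for illustration):

```python
def ctc_collapse(path, blank=0):
    # Collapse a frame-level best path into a gloss sequence:
    # 1) merge consecutive repeats, 2) drop blanks.
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# frame-level argmax over 10 frames: blank=0, glosses 3 and 7
print(ctc_collapse([0, 3, 3, 0, 0, 7, 7, 7, 0, 0]))  # [3, 7]
```

Note that a blank between two identical labels keeps them distinct, which is how CTC represents a repeated word.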

3 Method

Formally, the proposed network aims to learn a mapping $H: \mathcal{X} \mapsto \mathcal{Y}$ that can transform an input video frame sequence $\mathcal{X}$ to a target sequence $\mathcal{Y}$. The feature extraction contains two main steps: a frame feature encoder and a two-level gloss feature encoder. On top of them, a gloss feature enhancement (GFE) module is introduced to enhance the feature learning. An overview of the proposed network is shown in Figure 1.
Fig. 1. Overview of the proposed network. The network is fully convolutional and divides the feature encoding process into two main steps. A GFE module is introduced to enhance the feature learning.

3.1 Main stream design

  Frame feature encoder

Frame feature encoder. The proposed network first encodes spatial features of the input RGB frames. The frame feature encoder $S$ is composed of a convolutional backbone $S_{cnn}$ to extract features in the frames and a global average pooling layer $S_{gap}$ to compress the spatial features into feature vectors. Formally, each signing sequence is a tensor of shape $(t, c, h, w)$, where $t$ denotes the length of the sequence, $c$ denotes the number of channels, and $h$, $w$ denote the height and width of the frames. The process of encoder $S$ can be described as:

$$\{s\}^{t \times f_{s}} = S\left(\{x\}^{t \times c \times h \times w}\right) = S_{gap}\left(S_{cnn}\left(\{x\}^{t \times c \times h \times w}\right)\right). \quad (1)$$

The output is of shape $\{s\}^{t \times f_{s}}$. Note that the frame feature encoder treats each frame independently for the frame (spatial) feature learning.

From Eq. (1) and this paragraph: the frame feature encoder $S$ processes every frame of the video. The CNN $S_{cnn}$ first extracts features from each frame, and the global average pooling layer $S_{gap}$ then compresses the spatial features into a feature vector. Here $t$ denotes the sequence length, $c$ the number of channels, and $h$, $w$ the height and width of the frames.
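The per-frame shape flow of Eq. (1) can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the backbone $S_{cnn}$ is mocked with a random feature map, and only the global average pooling $S_{gap}$ is implemented for real, so the point is just the transformation $(t, c, h, w) \to (t, f_s)$:

```python
import numpy as np

def s_gap(feat):
    # S_gap: global average pooling over the spatial dims (h', w'),
    # compressing each frame's feature map into a single vector.
    return feat.mean(axis=(2, 3))                        # (t, f_s)

def frame_feature_encoder(x, f_s=512):
    # Toy stand-in for S = S_gap ∘ S_cnn. The real S_cnn is a 2D CNN
    # backbone; here its output shape is mocked with random values so
    # only the tensor shapes of Eq. (1) are demonstrated. Each frame
    # is treated independently.
    t, c, h, w = x.shape
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((t, f_s, h // 32, w // 32))  # "S_cnn(x)"
    return s_gap(feat)                                   # {s}^{t × f_s}

x = np.zeros((16, 3, 224, 224))   # t=16 RGB frames of 224×224
s = frame_feature_encoder(x)
print(s.shape)                    # (16, 512)
```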

  Two-level gloss feature encoder

Two-level gloss feature encoder. The two-level gloss feature encoder $G$ follows $S$ immediately and aims to encode gloss features. Instead of using LSTM layers, a common practice in temporal feature encoding, we achieve this by using 1D convolutional layers over the time dimension. Precisely, the first-level encoder $G_1$ consists of 1D-CNNs with a relatively larger filter size. Pooling layers can be used between convolutional layers to increase the window size when needed. Differently, the filter size is relatively smaller for the 1D-CNNs in the second-level encoder $G_2$, with no pooling layers used in $G_2$. So, $G_2$ does not change the temporal dimension but only reconsiders the contextual information between glosses by taking into account the neighboring glosses.

Following the frame feature encoder $S$ is the two-level gloss feature encoder $G$, which extracts gloss features. $G$ is composed of a first-level encoder $G_1$ and a second-level encoder $G_2$. $G_1$ consists of 1D-CNNs with a relatively large kernel, and when necessary the feature window can be enlarged by inserting pooling layers between the convolutional layers. $G_2$ uses 1D-CNNs with a smaller kernel and no pooling layers, so it keeps the temporal dimension unchanged and only refines the contextual information of each gloss from its neighboring glosses.
For the basics of convolutions, see: 深度学习笔记_基本概念_卷积网络中的通道channel、特征图feature map、过滤器filter和卷积核kernel
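The temporal behavior of $G_1$ and $G_2$ can be sketched with a bare-bones 1D convolution over the time axis. All kernel sizes and strides below are assumed for illustration (not the paper's exact configuration), and pooling is approximated here by a strided convolution:

```python
import numpy as np

def conv1d_time(x, kernel_size, stride=1):
    # Valid 1D convolution along the time axis of x with shape (t, f).
    # The weights are random: only the temporal geometry matters here.
    t, f = x.shape
    rng = np.random.default_rng(0)
    w = rng.standard_normal((kernel_size, f, f)) * 0.01
    out = [sum(x[i + j] @ w[j] for j in range(kernel_size))
           for i in range(0, t - kernel_size + 1, stride)]
    return np.stack(out)

s = np.random.default_rng(1).standard_normal((40, 64))  # {s}^{t × f_s}, t=40

# G1: larger kernels and a pooling-like stride shrink the time axis,
# turning windows of frame features into gloss features g: (t → k).
g = conv1d_time(conv1d_time(s, kernel_size=5, stride=1),
                kernel_size=5, stride=4)
# G2: a small kernel with stride 1 (length kept via padding) only mixes
# neighboring glosses; the temporal dimension k is preserved.
g2 = conv1d_time(np.pad(g, ((1, 1), (0, 0))), kernel_size=3, stride=1)

print(s.shape, g.shape, g2.shape)   # (40, 64) (8, 64) (8, 64)
```

Note how $G_1$ reduces 40 frame features to 8 gloss features, while $G_2$ leaves the count at 8, matching the "no pooling, temporal dimension unchanged" description.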

  The overall convolutional process of $G$ can be interpreted as a sliding window on the frame feature vector $\{s\}^{t \times f_{s}}$ along the time dimension. The sliding window size $l$ and the stride $\delta$ are determined by the accumulated receptive field size and the accumulated stride of the 1D-CNNs in $G_1$. Let $\{g\}^{k \times f_{g}}$ and $\{g'\}^{k \times f_{g'}}$ be the output tensors of the gloss feature encoders $G_1$ and $G_2$, respectively. The operation of the encoder $G$ can be formulated as:

$$\{g'\}^{k \times f_{g'}} = G\left(\{s\}^{t \times f_{s}}\right) = G_{2}\left(G_{1}\left(\{s\}^{t \times f_{s}}\right)\right) = G_{2}\left(\{g\}^{k \times f_{g}}\right). \quad (2)$$
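The window size $l$ and stride $\delta$ follow from the standard receptive-field accumulation over the stacked temporal layers of $G_1$. A short helper makes the bookkeeping concrete (the layer configuration in the example is an assumption for illustration, not taken from the paper):

```python
def accumulated_window(layers):
    # layers: list of (kernel_size, stride) for each 1D conv/pool layer
    # of G1, in order. Returns (l, delta): the window of input frames
    # that one output gloss feature sees, and the step between the
    # windows of adjacent outputs.
    l, delta = 1, 1
    for k, s in layers:
        l += (k - 1) * delta   # each layer widens the window
        delta *= s             # strides compound multiplicatively
    return l, delta

# assumed G1: conv(k=5), maxpool(k=2, s=2), conv(k=5), maxpool(k=2, s=2)
print(accumulated_window([(5, 1), (2, 2), (5, 1), (2, 2)]))  # (16, 4)
```

With this assumed stack, each gloss feature covers $l = 16$ input frames and consecutive gloss features are $\delta = 4$ frames apart, which is exactly the sliding-window interpretation above.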
