# Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

### Paper notes and personal interpretation (corrections welcome): Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

• Many thanks for the encouragement; I will keep updating this post with follow-up material on the paper.

• Many thanks to the first author of this paper for some very nice advice. The authors' lab is not allowed to release the code because of a partner's requirements, so unfortunately reproducing this paper is difficult. Although the author shared the model-construction code from the paper, he noted: "Judging from others' experience, reproducing my results from this alone is fairly difficult; the main challenge is how to obtain the iterative pseudo-labels." The first author also recommended three GitHub repositories that may help, and suggested reproducing the ECCV 2020 paper that uses a fully convolutional network for sign language recognition: implementing only its main body is already enough to reach current state-of-the-art performance. I will follow up on those suggestions later. Thanks again for pointing out a direction. 🙏🙏🙏 Links to the code and paper are at the end of the post.

• (2020, continuous, RGB, hand shape, head, joints / RGB / CSL / PHOENIX-2014 / PHOENIX-2014-T)

# Abstract

Despite the recent success of deep learning in continuous sign language recognition (CSLR), deep models typically focus on the most discriminative features, ignoring other potentially non-trivial and informative contents. Such characteristic heavily constrains their capability to learn implicit visual grammars behind the collaboration of different visual cues (i.e., hand shape, facial expression and body posture). By injecting multi-cue learning into neural network design, we propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem. Our STMC network consists of a spatial multi-cue (SMC) module and a temporal multi-cue (TMC) module. The SMC module is dedicated to spatial representation and explicitly decomposes visual features of different cues with the aid of a self-contained pose estimation branch. The TMC module models temporal correlations along two parallel paths, i.e., intra-cue and inter-cue, which aims to preserve the uniqueness and explore the collaboration of multiple cues. Finally, we design a joint optimization strategy to achieve the end-to-end sequence learning of the STMC network. To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks: PHOENIX-2014, CSL and PHOENIX-2014-T. Experimental results demonstrate that the proposed method achieves new state-of-the-art performance on all three benchmarks.

• STMC network (spatial-temporal multi-cue): injects multi-cue learning into the neural network design.
• SMC module (spatial multi-cue): dedicated to spatial representation; decomposes the visual features of different cues with the aid of a self-contained pose-estimation branch.
• TMC module (temporal multi-cue): models temporal correlations along two parallel paths, intra-cue and inter-cue.
• STMC is trained end-to-end with a joint optimization strategy.

Reported word error rates (WER, %; lower is better):

• PHOENIX-2014: 21.1 (Dev) / 20.7 (Test)
• PHOENIX-2014-T: 19.6 (Dev) / 21.0 (Test)
• CSL: 2.1 / 28.6 (the two reported evaluation settings)

• When training a model, the data is usually divided into three parts: the training set, the dev (validation) set and the test set.
• The dev set is used to track an evaluation metric, tune hyperparameters and choose between algorithms; the test set is used for the final overall evaluation of the model.
• The dev set is fed into the pipeline together with the training set but does not participate in training itself, so metrics on it can be monitored during training.
• Both the dev set and the test set evaluate model quality, but the dev set is typically used for a single quick metric, while the test set supports a fuller evaluation: confusion matrix, ROC, recall, F1 score, and so on.
• The dev set enables fast metric checks and timely hyperparameter adjustment, but is not comprehensive; the test set yields a complete evaluation report from multiple angles at the cost of more time, and is normally used only after hyperparameters have been roughly tuned on the dev set.
• The dev set and the test set must follow the same distribution.
• Before the big-data era, data was typically split 8:1:1; with large datasets (millions of samples), a 98:1:1 split is common.
Source: https://blog.csdn.net/weixin_43821376/article/details/103777454
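The 98:1:1 split described above can be sketched in a few lines (a minimal illustration of the idea; the function name, ratios and seed are my own choices, not from the paper or the quoted post):

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.98, 0.01, 0.01), seed=0):
    """Shuffle sample indices and split them into train/dev/test by ratio."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(n_samples * ratios[0])
    n_dev = int(n_samples * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]
```

Because dev and test are drawn from the same shuffled pool, they automatically follow the same distribution, which is exactly the requirement noted above.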

# 1 Introduction

Sign language is the primary language of the deaf community. To facilitate the daily communication between the deaf-mute and the hearing people, it is significant to develop sign language recognition (SLR) techniques. Recently, SLR has gained considerable attention for its abundant visual information and systematic grammar rules (Cui, Liu, and Zhang 2017; Huang et al. 2018; Koller et al. 2019; Pu, Zhou, and Li 2019; Li et al. 2019). In this paper, we concentrate on continuous SLR (CSLR), which aims to translate a series of signs to the corresponding sign gloss sentence.

• CSLR: continuous sign language recognition.

Sign language mainly relies on, but not limits to, hand gestures. To effectively and accurately express the desired idea, sign language simultaneously leverages both manual elements from hands and non-manual elements from the face and upper-body posture (Koller, Forster, and Ney 2015). To be specific, manual elements include the shape, position, orientation and movement of both hands, while non-manual elements include the eye gaze, mouth shape, facial expression and body pose. Human visual perception allows us to process and analyze these simultaneous yet complex information without much effort. However, with no expert knowledge, it is difficult for a deep neural network to discover the implicit collaboration of multiple visual cues automatically. Especially for CSLR, the transitions between sign glosses may come with temporal variations and switches of different cues.

• Sign gloss: a written-language word used to label an individual sign; a gloss sentence is the word-level transcription of a signed sentence.

To explore multi-cue information, some methods rely on external tools. For example, an off-the-shelf detector is utilized for hand detection, together with a tracker to cope with shape variation and occlusion (Cihan Camgoz et al. 2017; Huang et al. 2018). Some methods adopt multi-stream networks with inferred labels (i.e., mouth shape labels, hand shape labels) to guide each stream to focus on individual visual cue (Koller et al. 2019). Despite their improvement, they mostly suffer two limitations: First, external tools impede the end-to-end learning on the differentiable structure of neural networks. Second, off-the-shelf tools and multi-stream networks bring repetitive feature extraction of the same region, incurring expensive computational overhead for such a video-based translation task.

• 2017: Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. 2017. Subunets: end-to-end hand shape and continuous sign language recognition. In ICCV. Note: an off-the-shelf detector is used for hand detection; such external tools impede end-to-end learning on the differentiable structure of neural networks.
• 2018: Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; and Li, W. 2018. Video-based sign language recognition without temporal segmentation. In AAAI. Note: a tracker is used to cope with shape variation and occlusion; again, external tools impede end-to-end learning.
• 2019: Koller, O.; Camgoz, C.; Ney, H.; and Bowden, R. 2019. Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. TPAMI. Note: multi-stream networks with inferred labels (e.g., mouth-shape and hand-shape labels) guide each stream to one visual cue, but bring repetitive feature extraction of the same region and expensive computational overhead.

To temporally exploit multi-cue features, an intuitive idea is to concatenate features and feed them into a temporal fusion module. In action recognition, two-stream fusion shows significant performance improvement by fusing temporal features of RGB and optical flow (Simonyan and Zisserman 2014; Feichtenhofer, Pinz, and Zisserman 2016). Nevertheless, the aforementioned fusion approaches are based on two counterpart features in terms of the representation capability. But when it turns to multiple diverse cues with unequal feature importance, how to fully exploit the synergy between strong features and weak features still leaves a challenge. Moreover, for deep learning based methods, neural networks tend to merely focus on strong features for quick convergence, potentially omitting other informative cues, which limits the further performance improvement.

• Optical flow: the apparent motion of pixels between consecutive video frames, commonly used as a motion feature.
• The multi-cue synchronization problem: within one short time window, several body parts must be recognized jointly; e.g., the Chinese sign for 聚餐 ("dinner party") involves both hands, the head and the mouth at the same time.

• 2014: Simonyan, K., and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In NeurIPS. Note: fusing the temporal features of RGB and optical flow yields a significant performance gain; however, this fusion assumes two features of comparable representation capability, so exploiting the synergy between strong and weak features among multiple unequal cues remains a challenge.

To address the above difficulties, we propose a novel spatial-temporal multi-cue (STMC) framework. In the SMC module, we add two extra deconvolutional layers (Zeiler et al. 2010; Xiao, Wu, and Wei 2018) for pose estimation on the top layer of our backbone. A soft-argmax trick (Chapelle and Wu 2010) is utilized to regress the positions of keypoints and make it differentiable for subsequent operations in the temporal part. The spatial representations of other cues are acquired by the reuse of feature maps from the middle layer. Based on the learned spatial representations, we decompose the temporal modelling part into the intra-cue path and inter-cue path in the TMC module. The inter-cue path fuses the temporal correlations between different cues with temporal convolutional (TCOV) layers. The intra-cue path models the internal temporal dependency of each cue and feeds them to the inter-cue path at different time scales. To fully exploit the potential of STMC network, we design a joint optimization strategy with connectionist temporal classification (CTC) (Graves et al. 2006) and keypoint regression, making the whole structure end-to-end trainable.

Our main contributions are summarized as follows:

• We design an SMC module with a self-contained pose estimation branch. It provides multi-cue features in an end-to-end fashion and maintains efficiency at the same time.
• We propose a TMC module composed of stacked TMC blocks. Each block includes intra-cue and inter-cue paths to preserve the uniqueness and explore the synergy of different cues at the same time.
• A joint optimization strategy is proposed for the end-to-end sequence learning of our STMC network.
• Through extensive experiments, we demonstrate that our STMC network surpasses previous state-of-the-art models on three publicly available CSLR benchmarks.

# 2 Related Work

In this section, we briefly review the related work on sign language recognition and multi-cue fusion.

A CSLR system usually consists of two parts: video representation and sequence learning. Early works utilize handcrafted features (Cooper and Bowden 2009; Buehler, Zisserman, and Everingham 2009; Yin, Chai, and Chen 2016) for SLR. Recently, deep learning based methods have been applied to SLR for their strong representation capability. 2D convolutional neural networks (2D-CNN) and 3D convolutional neural networks (3D-CNN) (Ji et al. 2013; Qiu, Yao, and Mei 2017) are employed for modelling the appearance and motion in sign language videos. In (Cui, Liu, and Zhang 2017), Cui et al. propose to combine 2D-CNN with temporal convolutional layers for spatial-temporal representation. In (Molchanov et al. 2016; Pu, Zhou, and Li 2018; Zhou, Zhou, and Li 2019; Wei et al. 2019), 3D-CNN is adopted to learn motion features in sign language.

• 2009: Cooper, H., and Bowden, R. 2009. Learning signs from subtitles: a weakly supervised approach to sign language recognition. In CVPR. Note: hand-crafted features.
• 2009: Buehler, P.; Zisserman, A.; and Everingham, M. 2009. Learning sign language by watching tv (using weakly aligned subtitles). In CVPR. Note: hand-crafted features.
• 2016: Yin, F.; Chai, X.; and Chen, X. 2016. Iterative reference driven metric learning for signer independent isolated sign language recognition. In ECCV. Note: hand-crafted features.
• 2013: Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2013. 3D convolutional neural networks for human action recognition. TPAMI 35(1):221–231. Note: 2D-CNN/3D-CNN models of appearance and motion.
• 2017: Qiu, Z.; Yao, T.; and Mei, T. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV. Note: 2D-CNN/3D-CNN models of appearance and motion.
• 2017: Cui, R.; Liu, H.; and Zhang, C. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In CVPR. Note: combines 2D-CNN with temporal convolutional layers for spatial-temporal representation.
• 2016: Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; and Kautz, J. 2016. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR. Note: 3D-CNN for motion features in sign language.
• 2018: Pu, J.; Zhou, W.; and Li, H. 2018. Dilated convolutional network with iterative optimization for continuous sign language recognition. In IJCAI. Note: 3D-CNN for motion features in sign language.
• 2019: Zhou, H.; Zhou, W.; and Li, H. 2019. Dynamic pseudo label decoding for continuous sign language recognition. In ICME. Note: 3D-CNN for motion features in sign language.

Sequence learning in CSLR is to learn the correspondence between video sequence and sign gloss sequence. Koller et al. (Koller, Ney, and Bowden 2016; Koller, Zargaran, and Ney 2017; Koller et al. 2018) propose to integrate 2D-CNNs with hidden markov models (HMM) to model the state transitions. In (Cihan Camgoz et al. 2017; Wang et al. 2018;Cui, Liu, and Zhang 2017; Cui, Liu, and Zhang 2019), connectionist temporal classification (CTC) (Graves et al. 2006) algorithm is employed as a cost function for CSLR, which is able to process unsegmented input data. In (Huang et al. 2018; Guo et al. 2018), the attention-based encoder-decoder model (Bahdanau, Cho, and Bengio 2014) is adopted to deal with CSLR in the way of neural machine translation.

• 2016: Koller, O.; Zargaran, O.; Ney, H.; and Bowden, R. 2016. Deep sign: hybrid cnn-hmm for continuous sign language recognition. In BMVC. Note: integrates 2D-CNNs with hidden Markov models (HMM) to model state transitions.
• 2017: Koller, O.; Zargaran, S.; and Ney, H. 2017. Re-sign: re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In CVPR. Note: integrates 2D-CNNs with HMMs to model state transitions.
• 2018: Koller, O.; Zargaran, S.; Ney, H.; and Bowden, R. 2018. Deep sign: enabling robust statistical continuous sign language recognition via hybrid cnn-hmms. IJCV 126(12):1311–1325. Note: integrates 2D-CNNs with HMMs to model state transitions.
• 2017: Cui, R.; Liu, H.; and Zhang, C. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In CVPR. Note: uses CTC as the cost function for CSLR, which can handle unsegmented input data.
• 2017: Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. 2017. Subunets: end-to-end hand shape and continuous sign language recognition. In ICCV. Note: uses CTC as the cost function for CSLR.
• 2018: Wang, S.; Guo, D.; Zhou, W.-g.; Zha, Z.-J.; and Wang, M. 2018. Connectionist temporal fusion for sign language translation. In ACM MM. Note: uses CTC as the cost function for CSLR.
• 2019: Cui, R.; Liu, H.; and Zhang, C. 2019. A deep neural framework for continuous sign language recognition by iterative training. TMM 21(7):1880–1891. Note: uses CTC as the cost function for CSLR.
• 2006: Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML. Note: proposes connectionist temporal classification (CTC).
• 2014: Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR. Note: proposes the attention-based encoder-decoder model.
• 2018: Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; and Li, W. 2018. Video-based sign language recognition without temporal segmentation. In AAAI. Note: treats CSLR as neural machine translation with an attention-based encoder-decoder.
• 2018: Guo, D.; Zhou, W.; Li, H.; and Wang, M. 2018. Hierarchical lstm for sign language translation. In AAAI. Note: treats CSLR as neural machine translation with an attention-based encoder-decoder.

The multiple cues of sign language can be separated into categories of multi-modality and multi-semantic. Early works about multi-modality utilize physical sensors to collect the 3D space information, such as depth and infrared maps (Molchanov et al. 2016; Liu et al. 2017). With the development of flow estimation, Cui et al. (Cui, Liu, and Zhang 2019) explore the multi-modality fusion of RGB and optical flow and achieve state-of-the-art performance on PHOENIX-2014 database. In contrast, multi-semantic refers to human body parts with different semantics. Early works use hand-crafted features from segmented hands, tracked body-parts and trajectories for recognition (Buehler, Zisserman, and Everingham 2009; Pfister, Charles, and Zisserman 2013; Koller, Forster, and Ney 2015). In (Cihan Camgoz et al. 2017; Huang et al. 2018), feature sequence of hand patches captured by a tracker is fused with feature sequence of full-frames for further sequence prediction. In (Koller et al. 2019), Koller et al. propose to infer weak mouth labels from spoken German annotations and weak hand labels from SL dictionaries. These weak labels are used to establish the state synchronization in HMM of different cues, including full-frame, hand shape and mouth shape. Unlike previous methods, we propose an end-to-end differentiable network for multi-cue fusion with joint optimization, which achieves excellent performance.

• 2016: Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; and Kautz, J. 2016. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR. Note: physical sensors collect 3D spatial information such as depth and infrared maps.
• 2017: Liu, Z.; Chai, X.; Liu, Z.; and Chen, X. 2017. Continuous gesture recognition with hand-oriented spatiotemporal feature. In ICCV. Note: physical sensors collect 3D spatial information such as depth and infrared maps.
• 2019: Cui, R.; Liu, H.; and Zhang, C. 2019. A deep neural framework for continuous sign language recognition by iterative training. TMM 21(7):1880–1891. Note: explores multi-modality fusion of RGB and optical flow; state of the art on PHOENIX-2014.

• 2009: Buehler, P.; Zisserman, A.; and Everingham, M. 2009. Learning sign language by watching tv (using weakly aligned subtitles). In CVPR. Note: hand-crafted features from segmented hands, tracked body parts and trajectories.
• 2013: Pfister, T.; Charles, J.; and Zisserman, A. 2013. Large-scale learning of sign language by watching tv (using co-occurrences). In BMVC. Note: hand-crafted features from segmented hands, tracked body parts and trajectories.
• 2015: Koller, O.; Forster, J.; and Ney, H. 2015. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. CVIU 141:108–125. Note: hand-crafted features from segmented hands, tracked body parts and trajectories.
• 2017: Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. 2017. Subunets: end-to-end hand shape and continuous sign language recognition. In ICCV. Note: fuses the feature sequence of tracker-captured hand patches with the full-frame feature sequence for sequence prediction.
• 2018: Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; and Li, W. 2018. Videobased sign language recognition without temporal segmentation. In AAAI. Note: fuses the feature sequence of tracker-captured hand patches with the full-frame feature sequence for sequence prediction.
• 2019: Koller, O.; Camgoz, C.; Ney, H.; and Bowden, R. 2019. Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. TPAMI. Note: infers weak mouth labels from spoken German annotations and weak hand labels from sign language dictionaries; these weak labels establish state synchronization of different cues (full-frame, hand shape, mouth shape) in an HMM.

# 3 Proposed Approach

In this section, we first introduce the overall architecture of the proposed method. Then we elaborate the key components in our framework, including the spatial multi-cue (SMC) module and temporal multi-cue (TMC) module. Finally, we detail the sequence learning part and the joint loss optimization of our spatial-temporal multi-cue (STMC) framework.

## 3.1 Framework Overview

Given a video $\mathbf{x}=\left\{x_{t}\right\}_{t=1}^{T}$ with T frames, the target of CSLR task is to predict its corresponding sign gloss sequence $\ell=\left\{\ell_{i}\right\}_{i=1}^{L}$ with L words. As illustrated in Figure 1, our framework consists of three key modules, i.e., spatial representation, temporal modelling and sequence learning. First, each frame is processed by an SMC module to generate spatial features of multiple cues, including full-frame, hand, face and pose. Then, a TMC module is leveraged to capture the temporal correlations of intra-cue features and inter-cue features at different time steps and time scales. Finally, the whole STMC network equipped with bidirectional Long-Short Term Memory (BLSTM) (Hochreiter and Schmidhuber 1997) encoders utilizes connectionist temporal classification (CTC) for sequence learning and inference.
Figure 1: An overview of the proposed STMC framework. The SMC module is firstly utilized to decompose spatial features of visual cues for T frames in a video. Strips with different colors represent feature sequences of different cues. Then, the feature sequences of cues are fed into the TMC module with stacked TMC blocks and temporal pooling (TP) layers. The output of TMC module consists of feature sequence in the inter-cue path and feature sequences of N cues in the intra-cue path, which are processed by BLSTM encoders and CTC layers for sequence learning and inference. Here, N denotes the number of cues.

$T \times H \times W \times 3$: T frames × height H × width W × 3 channels (RGB).

SMC: Spatial Multi-Cue module.
$T \times\left(C_{1}+\cdots+C_{N}\right)$: T frames × (cue 1 + cue 2 + … + cue N) channels. A "cue" here can be understood as a body-part feature: cue 1 is the full frame, cue 2 the hand shape, cue 3 the face/head, and cue 4 the body joints (pose).

• TMC Block: Temporal Multi-Cue block.

• TP: temporal pooling.

$2 \times T$: 2 × T frames (not sure what this denotes in the figure).

$2 \times C$: 2 × C channels (not sure).

$4 \times C$: 4 × C channels (not sure).

BLSTM: bidirectional Long Short-Term Memory. For details, see the post by herosunly.

CTC: connectionist temporal classification, used when input and output sequences are hard to align one-to-one. For details, see the post by yudonglee.

• Inference: e.g., the gloss sequence "Friday is expected with sunshine" is decoded from the video.

• Optimization: joint loss.
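As a concrete illustration of how CTC resolves the many-to-one alignment at inference time, here is a minimal greedy (best-path) decoder: take the argmax label per time step, collapse consecutive repeats, then drop blanks. This is my own sketch, not the paper's code; a full system would typically use beam search instead.

```python
import numpy as np

def ctc_best_path(probs, blank=0):
    """Greedy CTC decoding: frame-wise argmax over the vocabulary,
    collapse consecutive repeats, then remove blank tokens."""
    path = probs.argmax(axis=1)          # best label per time step
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded

# Frame-wise one-hot "probabilities" whose best path is 0,1,1,0,2,2,2,0
probs = np.eye(3)[[0, 1, 1, 0, 2, 2, 2, 0]]
```

Here `ctc_best_path(probs)` collapses the 8-frame path to the 2-gloss sequence `[1, 2]`; note that a gloss repeated twice in a row survives collapsing only if the two occurrences are separated by a blank frame.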

## 3.2 Spatial Multi-Cue Representation

In spatial representation module, 2D-CNN is adopted to generate multi-cue features of full-frame, hands, face and pose. Here, we select VGG-11 model (Simonyan and Zisserman 2015) as the backbone network, considering its simple but effective neural architecture design. As depicted in Figure 2, the operations in SMC are composed of three steps: pose estimation, patch cropping and feature generation.

Figure 2: The SMC Module. The keypoints are estimated for patch cropping of face and hands. The output of SMC includes features from full-frame, hands, face and pose.

• crop: cut out a (rectangular) patch from an image or feature map.

Pose Estimation. Deconvolutional networks (Zeiler et al. 2010) are widely used in pixel-wise prediction. For pose estimation, two deconvolutional layers are added after the 7-th convolutional layer of VGG-11. The stride of each layer is 2. So, the feature maps are 4× upsampled from the resolution 14×14 to 56×56. The output is fed into a point-wise convolutional layer to generate K predicted heat maps. In each heat map, the position of its corresponding keypoint is expected to show the highest response value. Here, K is set to 7 for keypoints at the upper body, including the nose, both shoulders, both elbows and both wrists.

To make the keypoint prediction differentiable for subsequent sequence learning, a soft-argmax layer is applied on these K heat maps. Denoting the K heat maps as $\mathbf{h}=\left\{h_{k}\right\}_{k=1}^{K}$, each heat map $h_{k} \in \mathbb{R}^{H \times W}$ is passed through a spatial softmax function as follows,

$$p_{i, j, k}=\frac{e^{h_{i, j, k}}}{\sum_{i=1}^{H} \sum_{j=1}^{W} e^{h_{i, j, k}}},$$      (1)

where $h_{i, j, k}$ is the value of heat map $h_{k}$ at position $(i, j)$ and $p_{i, j, k}$ is the probability of keypoint k at position $(i, j)$. Afterwards, the expected values of coordinates along x-axis and y-axis over the whole probability map are calculated as follows,

$$(\hat{x}, \hat{y})_{k}=\left(\sum_{i=1}^{H} \sum_{j=1}^{W} \frac{i-1}{H-1} p_{i, j, k},\ \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{j-1}{W-1} p_{i, j, k}\right),$$      (2)

Here, $J_{k}=(\hat{x}, \hat{y})_{k} \in[0,1]^{2}$ is the normalized predicted position of keypoint k. The corresponding position of $(\hat{x}, \hat{y})$ in an $H \times W$ feature map is $(\hat{x}(H-1)+1,\ \hat{y}(W-1)+1)$.
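Equations (1) and (2) can be sketched in NumPy (my own minimal, 0-indexed reimplementation of the soft-argmax; the tensor shapes are assumptions):

```python
import numpy as np

def soft_argmax(heatmaps):
    """Spatial softmax over each heat map (Eq. 1), then the
    probability-weighted expectation of normalized coordinates (Eq. 2).
    heatmaps: (K, H, W) -> (K, 2) normalized (x_hat, y_hat) in [0, 1]."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    p = p.reshape(K, H, W)
    rows = np.arange(H) / (H - 1)                   # (i-1)/(H-1), 0-indexed
    cols = np.arange(W) / (W - 1)                   # (j-1)/(W-1), 0-indexed
    x_hat = (p.sum(axis=2) * rows).sum(axis=1)      # expectation over rows
    y_hat = (p.sum(axis=1) * cols).sum(axis=1)      # expectation over cols
    return np.stack([x_hat, y_hat], axis=1)
```

Because the output is an expectation rather than a hard index, gradients flow through it, which is what makes the pose branch trainable end-to-end.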

Patch Cropping. In CSLR, the perception of detailed visual cues is vital, including eye gaze, facial expression, mouth shape, hand shape and orientations of hands. Our model takes the predicted positions of the nose and both wrists as the center points of the face and both hands. The patches are cropped from the output (56 × 56 × C4) of the 4-th convolutional layer of VGG-11. The cropping sizes are fixed to 24 × 24 for both hands and 16 × 16 for the face, which is large enough to cover the body parts of a signer whose upper body is visible to the camera. The center point of each patch is clamped into a range to ensure that the patch does not cross the border of the original feature map.
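The clamped cropping step might look like this (a sketch under my own assumptions about tensor layout; the paper itself only specifies the crop sizes and the 56 × 56 source layer):

```python
import numpy as np

def crop_patch(fmap, center, size):
    """Crop a size x size patch from a (C, H, W) feature map around a
    normalized keypoint center (row, col) in [0, 1], clamping the center
    so that the patch never crosses the feature-map border."""
    C, H, W = fmap.shape
    half = size // 2
    r = int(round(center[0] * (H - 1)))
    c = int(round(center[1] * (W - 1)))
    r = min(max(r, half), H - size + half)   # clamp row center into range
    c = min(max(c, half), W - size + half)   # clamp col center into range
    return fmap[:, r - half:r - half + size, c - half:c - half + size]
```

With the clamping, even a wrist predicted at the very corner of the frame still yields a full 24 × 24 hand patch.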

Feature Generation.  After K keypoints are predicted, they are flattened to a 1D-vector with dimension 2K and passed through two fully-connected (FC) layers with ReLU to get the feature vector of pose cue. Then, feature maps of the face and both hands are cropped and processed by several convolutional layers, separately. Most sign gestures rely on the cooperation of both hands. So we use weight-sharing convolutional layers for both hands. The outputs of them are concatenated along the channel-dimension. Finally, we perform global average pooling over all the feature maps with spatial dimension to form feature vectors of different cues.
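A shape-level sketch of the feature-generation step (all weights and channel sizes below are my own placeholders, not the paper's; the real model uses learned FC and convolutional layers):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def pose_feature(keypoints, w1, b1, w2, b2):
    """Flatten K keypoints (K, 2) to a 2K vector, then two FC + ReLU layers."""
    x = keypoints.reshape(-1)
    return relu(relu(x @ w1 + b1) @ w2 + b2)

def global_avg_pool(fmap):
    """Collapse a (C, H, W) feature map to a C-dim cue vector."""
    return fmap.mean(axis=(1, 2))

# Pose cue: K = 7 upper-body keypoints -> a feature vector (sizes assumed)
J = rng.random((7, 2))
f_pose = pose_feature(J, rng.normal(size=(14, 64)), np.zeros(64),
                      rng.normal(size=(64, 128)), np.zeros(128))

# Hand cue: weight-sharing convs would process both hand patches; after
# pooling, the two vectors are concatenated along the channel dimension.
left, right = rng.random((256, 24, 24)), rng.random((256, 24, 24))
f_hand = np.concatenate([global_avg_pool(left), global_avg_pool(right)])
```

The point of the sketch is the interface: every cue, whatever its spatial extent, ends up as a fixed-length vector per frame, ready for channel-wise concatenation in the TMC module.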

All features are extracted by passing frames $\mathbf{x}=\left\{x_{t}\right\}_{t=1}^{T}$ through our spatial multi-cue (SMC) module as follows,

$$\left\{\left\{f_{t, n}\right\}_{n=1}^{N},\left\{J_{t, k}\right\}_{k=1}^{K}\right\}_{t=1}^{T}=\left\{\Omega_{\theta}\left(x_{t}\right)\right\}_{t=1}^{T},$$      (3)

where $\Omega_{\theta}(\cdot)$ denotes the SMC module and $\theta$ denotes its parameters. $J_{t, k} \in \mathbb{R}^{2}$ is the position of keypoint k at the t-th frame. $f_{t, n} \in \mathbb{R}^{C_{n}}$ is the feature vector of visual cue n at the t-th frame. In this paper, we set N = 4, representing the visual cues of full-frame, hand, face and pose, respectively.


Soft-argmax: uses the softmax function (an exponential normalization) to achieve the goal of argmax (finding the index of the maximum value) while keeping the whole process differentiable. See reference links 1 and 2 for details.

## 3.3 Temporal Multi-Cue Modelling

Instead of simple fusion, our proposed temporal multi-cue (TMC) module intends to integrate spatiotemporal information from two aspects, intra-cue and inter-cue. The intra-cue path captures the unique features of each visual cue. The inter-cue path learns the combination of fused features from different cues at different time scales. Then, we define a TMC block to model the operations between the two paths as follows,

$$\left(o_{l}, f_{l}\right)=\operatorname{Block}_{l}\left(o_{l-1}, f_{l-1}\right),$$      (4)

where $\left(o_{l-1}, f_{l-1}\right)$ and $\left(o_{l}, f_{l}\right)$ are the input pair and output pair of the $l$-th block. $o_{l} \in \mathbb{R}^{T \times C_{o}}$ denotes the feature matrix of the inter-cue path. $f_{l} \in \mathbb{R}^{T \times C_{f}}$ denotes the feature matrix of the intra-cue path, which is the concatenation of vectors from different cues along the channel dimension. As the first input pair, $o_{1}=f_{1}=\left[f_{1,1}, f_{1,2}, \cdots, f_{1, N}\right]$, where $[\cdot]$ is the concatenating operation and N is the number of cues.


The detailed operations inside each TMC block are shown in Figure 3 and can be decomposed into two paths as follows. (C is the number of output channels in each path.)

Figure 3: The TMC Module.

• Temporal Conv, kernel size: the kernel size of a temporal convolution. (See the linked reference on temporal convolutional networks.)
• Temporal Pool: temporal pooling.
• Concatenation: joining feature matrices along the channel dimension.
• ReLU: the ReLU activation function.
• $o_{l-1}$: the inter-cue input to the $l$-th TMC block (i.e., the inter-cue output of block $l-1$).
• $f_{l-1}$: the intra-cue input to the $l$-th TMC block (i.e., the intra-cue output of block $l-1$).
Intra-Cue Path. The first path is to provide unique features of different cues at different time scales. The temporal transformation inside each cue is performed as follows,

$$f_{l, n}=\operatorname{ReLU}\left(\mathcal{K}_{k}^{\frac{C}{N}}\left(f_{l-1, n}\right)\right),$$      (5)

$$f_{l}=\left[f_{l, 1}, f_{l, 2}, \cdots, f_{l, N}\right].$$      (6)

Here, $f_{l, n} \in \mathbb{R}^{T \times \frac{C}{N}}$ denotes the feature matrix of the n-th cue. $\mathcal{K}_{k}^{\frac{C}{N}}$ denotes the kernel of a temporal convolution, where k is the temporal kernel size and $\frac{C}{N}$ is the number of output channels.

Inter-Cue Path.  The second path is to perform the temporal
transformation on the inter-cue feature from the previous block and fuse information from the intra-cue path as follows,
$o_{l}=\operatorname{ReLU}\left(\left[\mathcal{K}_{k}^{\frac{C}{2}}\left(o_{l-1}\right), \mathcal{K}_{1}^{\frac{C}{2}}\left(f_{l}\right)\right]\right)$,      (7)
where $\mathcal{K}_{1}^{\frac{C}{2}}$ is a point-wise temporal convolution. It serves as a projection matrix between the two paths. Note that $f_{l}$ is the output of the intra-cue path in the present block.


After each block, a temporal max-pooling with stride 2 and kernel size 2 is performed. In this paper, we use two blocks in the TMC module. The kernel size k of all temporal convolutions is set to 5, except the point-wise one. The number of output channels C in each path is set to 1024.
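The block described above can be sketched in PyTorch as follows. This is a minimal reading of Eqs. 5-7 under the stated settings (N = 4 cues, C = 1024 channels, k = 5, max-pooling with stride 2 after each block); the class and variable names are my own, not the authors' unreleased code:

```python
import torch
import torch.nn as nn

class TMCBlock(nn.Module):
    """One TMC block: an intra-cue path (Eqs. 5-6) and an inter-cue path (Eq. 7)."""
    def __init__(self, num_cues=4, channels=1024, k=5):
        super().__init__()
        self.num_cues = num_cues
        c_cue = channels // num_cues
        # Eq. 5: one temporal convolution per cue, C/N output channels each
        self.intra = nn.ModuleList(
            [nn.Conv1d(c_cue, c_cue, kernel_size=k, padding=k // 2)
             for _ in range(num_cues)])
        # Eq. 7: temporal conv on o_{l-1} and point-wise conv on f_l, C/2 channels each
        self.inter = nn.Conv1d(channels, channels // 2, kernel_size=k, padding=k // 2)
        self.proj = nn.Conv1d(channels, channels // 2, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, o_prev, f_prev):
        # f_prev: (B, C, T), the concatenation of N cue features with C/N channels each
        cues = torch.chunk(f_prev, self.num_cues, dim=1)
        f = torch.cat([self.relu(conv(c)) for conv, c in zip(self.intra, cues)], dim=1)  # Eqs. 5-6
        o = self.relu(torch.cat([self.inter(o_prev), self.proj(f)], dim=1))              # Eq. 7
        return o, f

# After each block: temporal max-pooling with stride 2 and kernel size 2
pool = nn.MaxPool1d(kernel_size=2, stride=2)
```

Both paths preserve the temporal length inside the block; only the pooling between blocks halves it, so two blocks reduce $T$ to $T' = T/4$.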

## 3.4 Sequence Learning and Inference
With the proposed SMC and TMC modules, the network can generate an inter-cue feature sequence $\mathbf{o}=\left\{o_{t}\right\}_{t=1}^{T^{\prime}}$ and $N$ intra-cue feature sequences $\mathbf{f}_{n}=\left\{f_{t, n}\right\}_{t=1}^{T^{\prime}}$. Here, $T^{\prime}$ is the temporal length of the final output of the TMC module. The question then is how to utilize these two kinds of feature sequences to accomplish sequence learning and inference.

BLSTM Encoder.  Recurrent neural networks (RNNs) can use their internal state to model state transitions in a sequence of inputs. Here, we use an RNN to map the spatial-temporal feature sequence to its sign gloss sequence. The RNN takes the feature sequence as input and generates $T^{\prime}$ hidden states as follows,
$h_{t}=\operatorname{RNN}\left(h_{t-1}, o_{t}\right)$,      (8)
in which $h_{t}$ is the hidden state at time step $t$ and the initial state $h_{0}$ is a fixed all-zero vector. In our approach, we choose the bidirectional Long Short-Term Memory (BLSTM) (Sutskever, Vinyals, and Le 2014) unit as the recurrent unit for its ability to process long-term dependencies. BLSTM concatenates the forward and backward hidden states from bidirectional inputs. Afterward, the hidden state of each time step is passed through a fully-connected layer and a softmax layer,
$a_{t}=W \cdot h_{t}+b, \quad y_{t, j}=\frac{e^{a_{t, j}}}{\sum_{k} e^{a_{t, k}}}$,      (9)
where $y_{t, j}$ is the probability of label $j$ at time step $t$. In the CSLR task, label $j$ comes from a given vocabulary.


• BLSTM: short for Bi-directional Long Short-Term Memory, built by combining a forward LSTM with a backward LSTM. LSTM and BLSTM are both commonly used to model contextual information in natural language processing tasks. See the reference link.
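Eqs. 8-9 amount to a BLSTM followed by a linear layer and softmax. A minimal PyTorch sketch, assuming a 1024-d inter-cue feature and an illustrative hidden size and vocabulary size (not values from the paper, except the 1024-d input):

```python
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Maps a feature sequence to per-step gloss probabilities (Eqs. 8-9)."""
    def __init__(self, in_dim=1024, hidden=512, vocab_size=1296):
        super().__init__()
        # bidirectional=True: forward and backward hidden states are concatenated
        self.rnn = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)  # a_t = W h_t + b

    def forward(self, x):                             # x: (B, T', in_dim)
        h, _ = self.rnn(x)                            # h: (B, T', 2*hidden)
        return torch.softmax(self.fc(h), dim=-1)      # y_{t,j}, rows sum to 1
```

In practice the softmax would be folded into the CTC loss (which expects log-probabilities); it is kept explicit here to mirror Eq. 9.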

Connectionist Temporal Classification. Our model employs connectionist temporal classification (CTC) (Graves et al. 2006) to tackle the problem of mapping the video sequence $\mathbf{o}=\left\{o_{t}\right\}_{t=1}^{T^{\prime}}$ to the ordered sign gloss sequence $\ell=\left\{\ell_{i}\right\}_{i=1}^{L}$ $(L \leq T)$, where the explicit alignment between them is unknown. The objective of CTC is to maximize the sum of probabilities of all possible alignment paths between the input and target sequence.

CTC creates an extended vocabulary $\mathcal{V}$ with a blank label “$-$”, where $\mathcal{V}=\mathcal{V}_{\text{origin}} \cup\{-\}$. The blank label represents stillness and transitions which have no precise meaning. Denote the alignment path of the input sequence as $\pi=\left\{\pi_{t}\right\}_{t=1}^{T^{\prime}}$, where label $\pi_{t} \in \mathcal{V}$. The probability of alignment path $\pi$ given the input sequence is defined as follows,
$p(\pi \mid \mathbf{o})=\prod_{t=1}^{T^{\prime}} p\left(\pi_{t} \mid \mathbf{o}\right)=\prod_{t=1}^{T^{\prime}} y_{t, \pi_{t}}$.      (10)

Define a many-to-one mapping operation $\mathcal{B}$ which removes all blanks and repeated words in the alignment path (e.g., $\mathcal{B}(II{-}miss{-}{-}you)=I, miss, you$). In this way, we calculate the conditional probability of the sign gloss sequence $\ell$ as the sum of probabilities of all paths that can be mapped to $\ell$ via $\mathcal{B}$:
$p(\ell \mid \mathbf{o})=\sum_{\pi \in \mathcal{B}^{-1}(\ell)} p(\pi \mid \mathbf{o})$,      (11)
where $\mathcal{B}^{-1}(\ell)=\{\pi \mid \mathcal{B}(\pi)=\ell\}$ is the inverse operation of $\mathcal{B}$. Finally, the CTC losses of the inter-cue feature sequence $\mathbf{o}$ and the intra-cue feature sequences $\mathbf{f}_{n}$ are defined as follows,
$\mathcal{L}_{\mathrm{CTC}\text{-}\mathbf{o}}=-\ln p(\ell \mid \mathbf{o})$,      (12)
$\mathcal{L}_{\mathrm{CTC}\text{-}\mathbf{f}_{n}}=-\ln p\left(\ell \mid \mathbf{f}_{n}\right)$.      (13)

• CTC: connectionist temporal classification, used to handle the problem that the input and output sequences are hard to align one-to-one. See the earlier discussion.
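The mapping $\mathcal{B}$ and Eqs. 10-11 can be checked with a tiny brute-force sketch. This is my own illustration: it enumerates every path, which is exponential in the sequence length, and is not the dynamic-programming CTC forward algorithm used in practice:

```python
import itertools

BLANK = "-"

def ctc_collapse(path):
    """The mapping B: collapse adjacent repeats, then remove blanks."""
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return tuple(p for p in collapsed if p != BLANK)

def ctc_prob(target, probs, vocab):
    """Brute-force Eq. 11: sum p(pi | o) over all paths pi with B(pi) = target.
    probs[t][v] holds y_{t,v}; vocab is the extended vocabulary V."""
    total = 0.0
    for path in itertools.product(vocab, repeat=len(probs)):
        if ctc_collapse(path) == tuple(target):
            p = 1.0
            for t, v in enumerate(path):   # Eq. 10: product over time steps
                p *= probs[t][v]
            total += p
    return total
```

For example, with $T' = 2$ and vocabulary $\{-, I\}$, the paths $(I,-)$, $(-,I)$ and $(I,I)$ all collapse to $(I)$, so their probabilities are summed.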

Joint Loss Optimization. During the training process, we take the optimization of the inter-cue path as the primary objective. To provide the information of each individual cue for fusion, the intra-cue path plays an auxiliary role. Hence, the objective function of the entire STMC framework is given as follows,
$\mathcal{L}=\mathcal{L}_{\mathrm{CTC}\text{-}\mathbf{o}}+\alpha \sum_{n} \mathcal{L}_{\mathrm{CTC}\text{-}\mathbf{f}_{n}}+\mathcal{L}_{\mathrm{R}}^{\beta}$.      (14)
Here, $\alpha$ and $\beta$ are hyper-parameters: $\alpha$ balances the ratio of the auxiliary loss for the intra-cue path, and $\beta$ makes the regression loss $\mathcal{L}_{\mathrm{R}}$ of pose estimation have the same order of magnitude as the others. Given the estimated keypoints $J_{t, k} \in \mathbb{R}^{2}$ calculated in Eq. 3, with corresponding ground-truth $\hat{J}$, the smooth-L1 (Girshick 2015) loss of the pose estimation branch is calculated as follows,
$\mathcal{L}_{\mathrm{R}}^{\beta}=\frac{1}{2 T K} \sum_{t} \sum_{k} \sum_{i \in(x, y)} \operatorname{smooth}_{L_{1}}\left(\beta\left(J_{t, k, i}-\hat{J}_{t, k, i}\right)\right)$,      (15)
in which,
$\operatorname{smooth}_{L_{1}}(x)=\begin{cases} 0.5 x^{2} & \text{if } |x|<1 \\ |x|-0.5 & \text{otherwise} \end{cases}$.      (16)

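
Eqs. 15-16 in code. My reading of the garbled Eq. 15 is that $\beta$ scales the coordinate error before the smooth-L1 is applied, so that the regression term matches the CTC terms in magnitude; treat that placement as an assumption:

```python
def smooth_l1(x):
    """Eq. 16: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def pose_loss(pred, gt, beta=30.0):
    """Eq. 15: average smooth-L1 over T frames, K keypoints and (x, y) coords.
    pred, gt: nested lists with pred[t][k] = (x, y); beta scales the error."""
    T, K = len(pred), len(pred[0])
    total = sum(smooth_l1(beta * (pred[t][k][i] - gt[t][k][i]))
                for t in range(T) for k in range(K) for i in (0, 1))
    return total / (2 * T * K)
```
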
Inference. For inference, we pass video frames through the SMC and TMC modules. Only the inter-cue feature sequence and its BLSTM encoder are used to generate the posterior probability distribution of glosses at all time steps. We use the beam search decoder (Hannun et al. 2014) to search for the most probable sequence within an acceptable range.
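Best-path (greedy) decoding, i.e., beam search with beam width 1, is the simplest way to see how the per-step probabilities from Eq. 9 become a gloss sequence. A sketch (the choice of index 0 as the blank label is my assumption, and real decoders keep a wider beam):

```python
BLANK = 0  # assumed index of the CTC blank label

def greedy_decode(probs):
    """Best-path decoding: take the argmax label at each step, then apply B.
    probs: T' x |V| matrix of per-step probabilities y_t (Eq. 9)."""
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    # B: collapse adjacent repeats, then drop blanks
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(idx)
        prev = idx
    return out
```
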

# 4 Experiments

## 4.1 Dataset and Evaluation

Dataset. We evaluate our method on three datasets: PHOENIX-2014 (Koller, Forster, and Ney 2015), CSL (Huang et al. 2018; Guo et al. 2018) and PHOENIX-2014-T (Cihan Camgoz et al. 2018).

PHOENIX-2014 is a publicly available German Sign Language dataset, which is the most popular benchmark for CSLR. The corpus was recorded from broadcast news about the weather. It contains videos of 9 different signers with a vocabulary size of 1295. The split of videos for Train, Dev and Test is 5672, 540 and 629, respectively. Our method is evaluated on the multi-signer database.

CSL is a Chinese Sign Language dataset, which has 100 sign language sentences about daily life with 178 words. Each sentence is performed by 50 signers and there are 5000 videos in total. For pre-training, it also provides a matched isolated Chinese Sign Language database, which contains 500 words. Each word is performed 10 times by 50 signers.

PHOENIX-2014-T is an extended version of PHOENIX-2014 and has two types of annotations for its new videos. One is sign gloss annotations for the CSLR task. The other is German translation annotations for the sign language translation (SLT) task. The split of videos for Train, Dev and Test is 7096, 519 and 642, respectively. It has no overlap with the previous version between the Train, Dev and Test sets. The vocabulary size is 1115 for sign glosses and 3000 for German.

Pose Annotation. To obtain keypoint positions for training, we use the publicly available HRNet (Sun et al. 2019) toolbox to estimate the positions of 7 upper-body keypoints for all frames of the three databases. The toolbox gives 2D coordinates (x, y) in the pixel coordinate system. We thus represent each normalized keypoint as an (x, y) tuple and record it as an array of 7 tuples.

Evaluation. In CSLR, Word Error Rate (WER) is used as the metric for measuring the similarity between two sentences (Koller, Forster, and Ney 2015). It measures the least number of substitution (sub), deletion (del) and insertion (ins) operations needed to transform the hypothesis into the reference:
$\mathrm{WER}=\frac{\#\mathrm{sub}+\#\mathrm{del}+\#\mathrm{ins}}{\#\text{words in reference}}$.
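WER is a word-level edit distance divided by the reference length, so it can exceed 100%. A short sketch via the standard Levenshtein dynamic program:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (#sub + #del + #ins) / #words in reference."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

For example, turning the hypothesis "I you" into the reference "I miss you" requires one deletion, giving a WER of 1/3.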

## 4.2 Implementation Details

In our experiments, the input frames are resized to 224×224. For data augmentation within one video, we apply random cropping at the same location in all frames, random discarding of 20% of the frames, and random flipping of all frames. For the inter-cue features, the number of output channels after the TCOVs and BLSTM is set to 1024. There are 4 visual cues. For each intra-cue feature, the number of output channels after the TCOVs and BLSTM is set to 256.

Following the previous methods (Koller, Zargaran, and Ney 2017; Pu, Zhou, and Li 2019; Cui, Liu, and Zhang 2019), we utilize a staged optimization strategy. First, we train a VGG11-based network as in DNF (Cui, Liu, and Zhang 2019) and use it to decode pseudo labels for each clip. Then, we add a fully-connected layer after each output of the TMC module. The STMC network without BLSTM is trained with cross-entropy and smooth-L1 loss by an SGD optimizer. The batch size is 24 and the clip size is 16. Finally, with fine-tuned parameters from the previous stage, our full STMC network is trained end-to-end under joint loss optimization. We use the Adam optimizer with learning rate $5 \times 10^{-5}$ and set the batch size to 2. In all experiments, we set $\alpha$ to 0.6 and $\beta$ to 30. In fact, the experimental results are insensitive to slight changes of $\alpha$ (see Fig. 4), except $\alpha = 0$.

Figure 4: The effect of the weight parameter $\alpha$ in Eq. 14.


Our network architecture is implemented in PyTorch. For finetuning, we train the STMC network without BLSTM for 25 epochs. Afterward, the whole STMC network is trained end-to-end for 30 epochs. For inference, the beam width is set to 20. Experiments are run on 4 GTX 1080Ti GPUs.

## 4.3 Framework Effectiveness Study

For a fair comparison, experiments in this subsection are conducted on PHOENIX-2014, which is the most popular dataset in CSLR.

Module Analysis. We analyze the effectiveness of each module in our proposed approach. In Table 1, different combinations of spatial and temporal modules are evaluated. The baseline model is composed of VGG11 and a 1D-CNN with a BLSTM encoder. With the aid of multi-cue features, the SMC module provides about a 3% improvement over the baseline on the test set. However, with no extra guidance, the TMC module doesn't show the expected gain when replacing the 1D-CNN. With joint loss optimization, the intra-cue path is guided by the CTC loss to learn the temporal dependency of each cue and provides 1.6% and 1.7% extra gain on the dev and test sets, compared with the 1D-CNN. Compared with the baseline model, our STMC network reduces the WER on the test set by 4.8%.
Table 1: Evaluation of different module combinations (the lower the better).


Intra-Cue and Inter-Cue Paths. With further optimization, the BLSTM encoder of each cue in the intra-cue path can also serve as an individual sequence predictor. In Table 2, the WERs of different encoders in both paths are evaluated on the dev set. Among the four cues, the performance of pose is the worst. With only the positions and orientations of the upper-body joints, it is difficult to distinguish subtle variations in the appearance of sign gestures. The performance of hand is superior to that of face, while full-frame achieves relatively better performance. By leveraging the synergy of different cues, the inter-cue path shows the lowest WER.
Table 2: Evaluation of different paths in TMC on Dev set.

Inference Time. To clarify the effectiveness of the self-contained pose estimation branch, we evaluate the inference time in Table 3. The inference time depends on the video length; on average, a sign sentence takes around 8 seconds (25 FPS). For a fair comparison, we evaluate the inference time of 200 frames on a single GPU. Compared with introducing an external VGG11-based model for pose estimation, our self-contained branch saves around 44% of the inference time. Notably, our framework with the self-contained branch still shows slightly better performance than the one with an off-the-shelf model. We argue that the differentiable pose estimation branch plays the role of regularization and thus alleviates the overfitting of the neural network.
Table 3: Comparison of inference time. (PE: an external VGG11-based model for pose estimation)


• del/ins: deletion and insertion error rates
• WER: Word Error Rate, the standard metric
• Time: inference time
• FLOPs: the number of floating-point operations (computational cost)
• Cue: full-frame, hand, face, pose

Qualitative Analysis. Figure 5 shows an example generated with different cues. It is clear that the inter-cue path can effectively learn correlations among multiple cues and make a better prediction.

Figure 5: A qualitative result of different cues with estimated poses (zoom in) from Dev set (D: delete, I: insert, S: substitute).


## 4.4 State-of-the-art Comparison

Evaluation on PHOENIX-2014. In Table 4, we compare our approach with other methods on PHOENIX-2014. CMLLR and 1-Mio-Hands belong to traditional HMM-based models with hand-crafted features. In SubUNets and LS-HAN, full-frame features are fused with features of hand patches, which are captured by an external tracker. In CNN-LSTM-HMM, two-stream networks are trained with weak hand labels and sign gloss labels, respectively. Our STMC outperforms two recent multi-cue methods, LS-HAN and CNN-LSTM-HMM, by 17.6% and 5.3%. Moreover, compared with DNF, which explores the fusion of the RGB and optical-flow modalities, STMC still surpasses this best competitor by 2.2%. Based on the RGB modality alone, our novel STMC framework achieves 20.7% WER on the test set, a new state-of-the-art result on PHOENIX-2014.
Table 4: Comparison with methods on PHOENIX-2014 (the lower the better).


Evaluation on CSL. In Table 5, we evaluate our approach on CSL under two settings. The CSL dataset has a smaller vocabulary than PHOENIX-2014. Following the works of (Huang et al. 2018; Guo et al. 2018), the dataset is split by the two strategies in Table 5. Split I is a signer-independent test: the train and test sets share the same sentences with no overlap of signers. Split II is an unseen-sentence test: the train and test sets share the same signers and vocabulary with no overlap of sentences. Of the two settings, Split II is more challenging because recognizing unseen combinations of words is difficult in CSLR. In IAN, the alignment algorithm between the CTC decoder and the LSTM decoder shows notable improvement over previous methods. Benefiting from multi-cue learning, our STMC framework outperforms the best competitor on CSL by 4.1% WER.
Table 5: Comparison with methods on CSL.


Evaluation on PHOENIX-2014-T. In Table 6, we provide a result of our method on PHOENIX-2014-T. As a newly proposed dataset (Cihan Camgoz et al. 2018) for sign language translation, PHOENIX-2014-T provides an extended database with sign gloss annotation and spoken German annotation. CNN-LSTM-HMM utilizes spoken German annotation to infer the weak mouth shape labels for each video. It provides results of multi-cue sequential parallelism, including full-frame, hand and mouth. Our method surpasses all three combinations of CNN-LSTM-HMM.
Table 6: Comparison with methods on PHOENIX-2014-T.(f: full-frame, m: mouth, h: hand)


# 5 Conclusion

In this paper, we present a novel multi-cue framework for CSLR, which aims to learn spatial-temporal correlations of visual cues in an end-to-end fashion. In our framework, a spatial multi-cue module is designed with a self-contained pose estimation branch to decompose spatial multi-cue features. Moreover, we propose a temporal multi-cue module composed of the intra-cue and inter-cue paths, which aims to preserve the uniqueness of each cue and explore the synergy of different cues at the same time. A joint optimization strategy is proposed to accomplish multi-cue sequence learning. Extensive experiments on three large-scale CSLR datasets demonstrate the superiority of our STMC framework.

Acknowledgments. This work was supported in part to Dr. Wengang Zhou by NSFC under contract No. 61632019 & No. 61822208 and Youth Innovation Promotion Association CAS (No. 2018497), and in part to Dr. Houqiang Li by NSFC under contract No. 61836011.

# Follow-up Notes

The 3 GitHub repositories:
github 1
github 2
github 3 (the authors have not released this code yet, but their group's work is solid; the first author suggests keeping an eye on it)

Link to the ECCV 2020 paper on fully convolutional networks for sign language recognition (a reading and reproduction of this paper will follow).
