Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

  • Many thanks for everyone's encouragement; I will keep updating this post with follow-up material on the paper.

  • Many thanks to the first author of this paper, who gave me some very nice advice. The authors' lab is not allowed to release the code because of a partner's requirements, so unfortunately reproducing this paper is difficult. Although the author shared the model-building code, he said: "Judging from some people's past experience, reproducing my results from this alone is fairly difficult; the main difficulty is how to obtain the iteratively updated pseudo labels." The first author also recommended 3 GitHub repositories that might help, and further suggested trying to reproduce the ECCV 2020 paper that uses a fully convolutional network for sign language recognition; implementing only its main part can already reach the current SOTA performance. I will work on reproducing the parts he suggested. Thanks again for the advice and for pointing out a direction. 🙏🙏🙏 Links to the code and paper are at the end of this post.

  • (2020 - continuous - RGB - hand shape - head - body joints / RGB - CSL / PHOENIX2014 / PHOENIX2014T)

| Year | Recognition type | Input data | Manual features | Non-manual features | Full frame | Datasets | Sign languages |
|---|---|---|---|---|---|---|---|
| 2020 | Continuous sentences | RGB | Shape (hand shape) | Head, body joints | RGB | PHOENIX-2014, PHOENIX-2014-T, CSL | DGS (German), CSL (Chinese) |

Abstract

Despite the recent success of deep learning in continuous sign language recognition (CSLR), deep models typically focus on the most discriminative features, ignoring other potentially non-trivial and informative contents. Such characteristic heavily constrains their capability to learn implicit visual grammars behind the collaboration of different visual cues (i.e., hand shape, facial expression and body posture). By injecting multi-cue learning into neural network design, we propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem. Our STMC network consists of a spatial multi-cue (SMC) module and a temporal multi-cue (TMC) module. The SMC module is dedicated to spatial representation and explicitly decomposes visual features of different cues with the aid of a self-contained pose estimation branch. The TMC module models temporal correlations along two parallel paths, i.e., intra-cue and inter-cue, which aims to preserve the uniqueness and explore the collaboration of multiple cues. Finally, we design a joint optimization strategy to achieve the end-to-end sequence learning of the STMC network. To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks: PHOENIX-2014, CSL and PHOENIX-2014-T. Experimental results demonstrate that the proposed method achieves new state-of-the-art performance on all three benchmarks.

| Item | Summary |
|---|---|
| Subjects of study | Hand shape, facial expression, body posture |
| Method | Spatial-temporal multi-cue (STMC) network, composed of a spatial multi-cue (SMC) module and a temporal multi-cue (TMC) module |
| STMC network | Spatial-Temporal Multi-Cue: injects multi-cue learning into the neural network design |
| SMC module | Spatial Multi-Cue: dedicated to spatial representation; decomposes the visual features of different cues with the aid of a self-contained pose estimation branch |
| TMC module | Temporal Multi-Cue: models temporal correlations along two parallel paths |
| End-to-end STMC learning | Joint optimization strategy |
| Datasets | PHOENIX-2014, PHOENIX-2014-T, CSL |

| Dataset | Dev WER | Test WER (lower is better) |
|---|---|---|
| PHOENIX-2014 | 21.1 | 20.7 |
| PHOENIX-2014-T | 19.6 | 21.0 |

| Dataset | Split I | Split II |
|---|---|---|
| CSL | 2.1 | 28.6 |

The evaluation protocol used on CSL differs from that of the other two datasets. My guess: perhaps the WER results on CSL were not ideal, so a different metric was reported. This needs to be verified against the source code.

训练集、验证集(dev)和测试集

  • When training a model, the available data is usually split into three parts: a training set, a dev (validation) set and a test set.
  • The dev set is used to track the chosen evaluation metric, tune hyper-parameters and select between algorithms; the test set is used at the end for an overall evaluation of model performance.
  • The dev set is fed into the pipeline together with the training set but does not take part in training, so metrics on it can be monitored during training.
  • Both the dev set and the test set are used to judge model quality, but the dev set typically tracks a single metric, whereas the test set supports a richer evaluation (confusion matrix, ROC, recall, F1 score, etc.).
  • The dev set allows quick metric checks and timely parameter adjustments, but it is not comprehensive; the test set provides a complete, multi-angle evaluation report at the cost of more time, and is usually only used after the parameters have been roughly tuned on the dev set.
  • The dev set and the test set should follow the same distribution.
  • Before the big-data era, data was usually split 8:1:1; in the big-data era (millions of samples), a 98:1:1 split is common.
    Source: https://blog.csdn.net/weixin_43821376/article/details/103777454

1 Introduction

Sign language is the primary language of the deaf community. To facilitate the daily communication between the deaf-mute and the hearing people, it is significant to develop sign language recognition (SLR) techniques. Recently, SLR has gained considerable attention for its abundant visual information and systematic grammar rules (Cui, Liu, and Zhang 2017; Huang et al. 2018; Koller et al. 2019; Pu, Zhou, and Li 2019; Li et al. 2019). In this paper, we concentrate on continuous SLR (CSLR), which aims to translate a series of signs to the corresponding sign gloss sentence.

  • CSLR: continuous sign language recognition

   Sign language mainly relies on, but is not limited to, hand gestures. To effectively and accurately express the desired idea, sign language simultaneously leverages both manual elements from hands and non-manual elements from the face and upper-body posture (Koller, Forster, and Ney 2015). To be specific, manual elements include the shape, position, orientation and movement of both hands, while non-manual elements include the eye gaze, mouth shape, facial expression and body pose. Human visual perception allows us to process and analyze this simultaneous yet complex information without much effort. However, with no expert knowledge, it is difficult for a deep neural network to discover the implicit collaboration of multiple visual cues automatically. Especially for CSLR, the transitions between sign glosses may come with temporal variations and switches of different cues.

  • Sign gloss: a written word label used to transcribe a single sign (not "gloss" in the sense of shine or luster).

To illustrate the difficulty of CSLR, I could borrow the figure-based illustration used in "The Significance of Facial Features For Automatic Sign Language Recognition".

  To explore multi-cue information, some methods rely on external tools. For example, an off-the-shelf detector is utilized for hand detection, together with a tracker to cope with shape variation and occlusion (Cihan Camgoz et al. 2017; Huang et al. 2018). Some methods adopt multi-stream networks with inferred labels (i.e., mouth shape labels, hand shape labels) to guide each stream to focus on individual visual cue (Koller et al. 2019). Despite their improvement, they mostly suffer two limitations: First, external tools impede the end-to-end learning on the differentiable structure of neural networks. Second, off-the-shelf tools and multi-stream networks bring repetitive feature extraction of the same region, incurring expensive computational overhead for such a video-based translation task.
When giving examples of other methods, cite research from recent years; work that is too old is not convincing.

| Year | Paper | Method | Limitation |
|---|---|---|---|
| 2017 | Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. 2017. Subunets: end-to-end hand shape and continuous sign language recognition. In ICCV. | An off-the-shelf detector is used for hand detection | External tools impede end-to-end learning on the differentiable structure of neural networks |
| 2018 | Cihan Camgoz, N.; Hadfield, S.; Koller, O.; Ney, H.; and Bowden, R. 2018. Neural sign language translation. In CVPR. | A tracker is used to cope with shape variation and occlusion | External tools impede end-to-end learning on the differentiable structure of neural networks |
| 2019 | Koller, O.; Camgoz, C.; Ney, H.; and Bowden, R. 2019. Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. TPAMI. | Multi-stream networks with inferred labels (e.g., mouth-shape and hand-shape labels) guide each stream to focus on an individual visual cue | Multi-stream networks repeat feature extraction of the same region, increasing the computational overhead |

  To temporally exploit multi-cue features, an intuitive idea is to concatenate features and feed them into a temporal fusion module. In action recognition, two-stream fusion shows significant performance improvement by fusing temporal features of RGB and optical flow (Simonyan and Zisserman 2014; Feichtenhofer, Pinz, and Zisserman 2016). Nevertheless, the aforementioned fusion approaches are based on two counterpart features in terms of the representation capability. But when it turns to multiple diverse cues with unequal feature importance, how to fully exploit the synergy between strong features and weak features still leaves a challenge. Moreover, for deep learning based methods, neural networks tend to merely focus on strong features for quick convergence, potentially omitting other informative cues, which limits the further performance improvement.

  • Optical flow: 光流, the apparent motion field between consecutive frames.
    The problem of coordinating multiple body parts within the same time span: for example, the Chinese Sign Language sign for "聚餐" (having a meal together) requires recognizing both hands, the head and the mouth within a short time span.
| Problem | Year | Paper | Method | Advantage | Limitation |
|---|---|---|---|---|---|
| How multiple body parts cooperate within the same time span | 2008 | The Significance of Facial Features for Automatic Sign Language Recognition | Merges manual and facial features into a single feature vector $\boldsymbol{z}_{t}=\left[\boldsymbol{x}_{t}, \boldsymbol{y}_{t}\right]$ | Unknown; my guess: building features this way and then training with deep learning might work | My guess: the manual and facial features must be expressed analytically, which is hard to compute and implement |
| Same as above | 2014 | Simonyan, K., and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In NeurIPS. | Concatenates features and fuses the temporal features of RGB and optical flow | Significant performance improvement | The fusion assumes two features of comparable representation capability; with multiple cues of unequal importance, exploiting the synergy between strong and weak features remains a challenge |
| Same as above | 2016 | Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR. | Same as above | Same as above | Same as above |

Takeaway: for deep-learning-based methods, neural networks tend to focus only on strong features for quick convergence, potentially ignoring other informative cues, which limits further performance improvement.

  To address the above difficulties, we propose a novel spatial-temporal multi-cue (STMC) framework. In the SMC module, we add two extra deconvolutional layers (Zeiler et al. 2010; Xiao, Wu, and Wei 2018) for pose estimation on the top layer of our backbone. A soft-argmax trick (Chapelle and Wu 2010) is utilized to regress the positions of keypoints and make it differentiable for subsequent operations in the temporal part. The spatial representations of other cues are acquired by the reuse of feature maps from the middle layer. Based on the learned spatial representations, we decompose the temporal modelling part into the intra-cue path and inter-cue path in the TMC module. The inter-cue path fuses the temporal correlations between different cues with temporal convolutional (TCOV) layers. The intra-cue path models the internal temporal dependency of each cue and feeds them to the inter-cue path at different time scales. To fully exploit the potential of STMC network, we design a joint optimization strategy with connectionist temporal classification (CTC) (Graves et al. 2006) and keypoint regression, making the whole structure end-to-end trainable.

| Problem | Year | Paper | Method |
|---|---|---|---|
| Pose estimation on top of the backbone | 2010 | Zeiler, M. D.; Krishnan, D.; Taylor, G. W.; and Fergus, R. 2010. Deconvolutional networks. In CVPR. | Add two deconvolutional layers in the SMC module |
| Same as above | 2018 | Xiao, B.; Wu, H.; and Wei, Y. 2018. Simple baselines for human pose estimation and tracking. In ECCV. | Same as above |
| Regressing keypoint positions while keeping them differentiable for the subsequent temporal part | 2010 | Chapelle, O., and Wu, M. 2010. Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval. | Soft-argmax |
| Obtaining spatial representations of the other cues | — | — | Reuse of feature maps from the middle layer |
| Fusing temporal correlations between different cues | — | — | Temporal convolutional (TCOV) layers |
| End-to-end training | 2006 | Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML. | Joint optimization strategy combining connectionist temporal classification (CTC) and keypoint regression |

The paper uses earlier (pre-2012) methods, or combinations of them, to solve a new problem.

  Our main contributions are summarized as follows:

  • We design an SMC module with a self-contained pose estimation branch. It provides multi-cue features in an end-to-end fashion and maintains efficiency at the same time.
  • We propose a TMC module composed of stacked TMC blocks. Each block includes intra-cue and inter-cue paths to preserve the uniqueness and explore the synergy of different cues at the same time.
  • A joint optimization strategy is proposed for the end-to-end sequence learning of our STMC network.
  • Through extensive experiments, we demonstrate that our STMC network surpasses previous state-of-the-art models on three publicly available CSLR benchmarks.

2 Related Work

In this section, we briefly review the related work on sign language recognition and multi-cue fusion.

  A CSLR system usually consists of two parts: video representation and sequence learning. Early works utilize handcrafted features (Cooper and Bowden 2009; Buehler, Zisserman, and Everingham 2009; Yin, Chai, and Chen 2016) for SLR. Recently, deep learning based methods have been applied to SLR for their strong representation capability. 2D convolutional neural networks (2D-CNN) and 3D convolutional neural networks (3D-CNN) (Ji et al. 2013; Qiu, Yao, and Mei 2017) are employed for modelling the appearance and motion in sign language videos. In (Cui, Liu, and Zhang 2017), Cui et al. propose to combine 2D-CNN with temporal convolutional layers for spatial-temporal representation. In (Molchanov et al. 2016; Pu, Zhou, and Li 2018; Zhou, Zhou, and Li 2019; Wei et al. 2019), 3D-CNN is adopted to learn motion features in sign language.
The opening sentence of this paragraph ("A CSLR system usually consists of two parts: video representation and sequence learning.") follows on from the opening of this section ("In this section, we briefly review the related work on sign language recognition and multi-cue fusion."); the paragraph then reviews prior work on video representation in CSLR systems.

| Year | Paper | Method |
|---|---|---|
| 2009 | Cooper, H., and Bowden, R. 2009. Learning signs from subtitles: A weakly supervised approach to sign language recognition. In CVPR. | Hand-crafted features |
| 2009 | Buehler, P.; Zisserman, A.; and Everingham, M. 2009. Learning sign language by watching tv (using weakly aligned subtitles). In CVPR. | Hand-crafted features |
| 2016 | Yin, F.; Chai, X.; and Chen, X. 2016. Iterative reference driven metric learning for signer independent isolated sign language recognition. In ECCV. | Hand-crafted features |
| 2013 | Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2013. 3D convolutional neural networks for human action recognition. TPAMI 35(1):221–231. | 2D-CNN and 3D-CNN models of appearance and motion |
| 2017 | Qiu, Z.; Yao, T.; and Mei, T. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV. | 2D-CNN and 3D-CNN models of appearance and motion |
| 2017 | Cui, R.; Liu, H.; and Zhang, C. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In CVPR. | Combines 2D-CNN with temporal convolutional layers for spatial-temporal representation |
| 2016 | Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; and Kautz, J. 2016. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR. | 3D-CNN to learn motion features in sign language |
| 2018 | Pu, J.; Zhou, W.; and Li, H. 2018. Dilated convolutional network with iterative optimization for continuous sign language recognition. In IJCAI. | 3D-CNN to learn motion features in sign language |
| 2019 | Zhou, H.; Zhou, W.; and Li, H. 2019. Dynamic pseudo label decoding for continuous sign language recognition. In ICME. | 3D-CNN to learn motion features in sign language |

  Sequence learning in CSLR is to learn the correspondence between video sequence and sign gloss sequence. Koller et al. (Koller, Ney, and Bowden 2016; Koller, Zargaran, and Ney 2017; Koller et al. 2018) propose to integrate 2D-CNNs with hidden markov models (HMM) to model the state transitions. In (Cihan Camgoz et al. 2017; Wang et al. 2018;Cui, Liu, and Zhang 2017; Cui, Liu, and Zhang 2019), connectionist temporal classification (CTC) (Graves et al. 2006) algorithm is employed as a cost function for CSLR, which is able to process unsegmented input data. In (Huang et al. 2018; Guo et al. 2018), the attention-based encoder-decoder model (Bahdanau, Cho, and Bengio 2014) is adopted to deal with CSLR in the way of neural machine translation.
The opening sentence of this paragraph ("Sequence learning in CSLR is to learn the correspondence between video sequence and sign gloss sequence.") follows on from the opening sentence of the previous paragraph ("A CSLR system usually consists of two parts: video representation and sequence learning."); the paragraph then reviews prior work on sequence learning in CSLR.

| Year | Paper | Method |
|---|---|---|
| 2016 | Koller, O.; Zargaran, O.; Ney, H.; and Bowden, R. 2016. Deep sign: hybrid cnn-hmm for continuous sign language recognition. In BMVC. | Integrates 2D-CNNs with hidden Markov models (HMM) to model state transitions |
| 2017 | Koller, O.; Zargaran, S.; and Ney, H. 2017. Re-sign: re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In CVPR. | Integrates 2D-CNNs with hidden Markov models (HMM) to model state transitions |
| 2018 | Koller, O.; Zargaran, S.; Ney, H.; and Bowden, R. 2018. Deep sign: enabling robust statistical continuous sign language recognition via hybrid cnn-hmms. IJCV 126(12):1311–1325. | Integrates 2D-CNNs with hidden Markov models (HMM) to model state transitions |
| 2017 | Cui, R.; Liu, H.; and Zhang, C. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In CVPR. | Uses connectionist temporal classification (CTC) as the cost function for CSLR, which can process unsegmented input data |
| 2017 | Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. 2017. Subunets: end-to-end hand shape and continuous sign language recognition. In ICCV. | Uses CTC as the cost function for CSLR, which can process unsegmented input data |
| 2018 | Wang, S.; Guo, D.; Zhou, W.-g.; Zha, Z.-J.; and Wang, M. 2018. Connectionist temporal fusion for sign language translation. In ACM MM. | Uses CTC as the cost function for CSLR, which can process unsegmented input data |
| 2019 | Cui, R.; Liu, H.; and Zhang, C. 2019. A deep neural framework for continuous sign language recognition by iterative training. TMM 21(7):1880–1891. | Uses CTC as the cost function for CSLR, which can process unsegmented input data |
| 2006 | Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML. | Proposes connectionist temporal classification (CTC) |
| 2014 | Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR. | Proposes the attention-based encoder-decoder model |
| 2018 | Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; and Li, W. 2018. Video-based sign language recognition without temporal segmentation. In AAAI. | Treats CSLR as neural machine translation with an attention-based encoder-decoder model |
| 2018 | Guo, D.; Zhou, W.; Li, H.; and Wang, M. 2018. Hierarchical lstm for sign language translation. In AAAI. | Treats CSLR as neural machine translation with an attention-based encoder-decoder model |

  The multiple cues of sign language can be separated into categories of multi-modality and multi-semantic. Early works about multi-modality utilize physical sensors to collect the 3D space information, such as depth and infrared maps (Molchanov et al. 2016; Liu et al. 2017). With the development of flow estimation, Cui et al. (Cui, Liu, and Zhang 2019) explore the multi-modality fusion of RGB and optical flow and achieve state-of-the-art performance on PHOENIX-2014 database. In contrast, multi-semantic refers to human body parts with different semantics. Early works use hand-crafted features from segmented hands, tracked body-parts and trajectories for recognition (Buehler, Zisserman, and Everingham 2009; Pfister, Charles, and Zisserman 2013; Koller, Forster, and Ney 2015). In (Cihan Camgoz et al. 2017; Huang et al. 2018), feature sequence of hand patches captured by a tracker is fused with feature sequence of full-frames for further sequence prediction. In (Koller et al. 2019), Koller et al. propose to infer weak mouth labels from spoken German annotations and weak hand labels from SL dictionaries. These weak labels are used to establish the state synchronization in HMM of different cues, including full-frame, hand shape and mouth shape. Unlike previous methods, we propose an end-to-end differentiable network for multi-cue fusion with joint optimization, which achieves excellent performance.
The opening sentence of this paragraph ("The multiple cues of sign language can be separated into categories of multi-modality and multi-semantic.") follows on from the opening of this section ("In this section, we briefly review the related work on sign language recognition and multi-cue fusion."); the paragraph then reviews prior work on multi-cue fusion in sign language recognition.

| Year | Paper | Method (addressing multi-modality in multi-cue SLR) |
|---|---|---|
| 2016 | Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; and Kautz, J. 2016. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR. | Physical sensors collect 3D space information, e.g., depth and infrared maps |
| 2017 | Liu, Z.; Chai, X.; Liu, Z.; and Chen, X. 2017. Continuous gesture recognition with hand-oriented spatiotemporal feature. In ICCV. | Physical sensors collect 3D space information, e.g., depth and infrared maps |
| 2019 | Cui, R.; Liu, H.; and Zhang, C. 2019. A deep neural framework for continuous sign language recognition by iterative training. TMM 21(7):1880–1891. | Explores multi-modality fusion of RGB and optical flow; state-of-the-art performance on PHOENIX-2014 |

| Year | Paper | Method (addressing multi-semantic in multi-cue SLR) |
|---|---|---|
| 2009 | Buehler, P.; Zisserman, A.; and Everingham, M. 2009. Learning sign language by watching tv (using weakly aligned subtitles). In CVPR. | Hand-crafted features from segmented hands, tracked body parts and trajectories |
| 2013 | Pfister, T.; Charles, J.; and Zisserman, A. 2013. Large-scale learning of sign language by watching tv (using co-occurrences). In BMVC. | Hand-crafted features from segmented hands, tracked body parts and trajectories |
| 2015 | Koller, O.; Forster, J.; and Ney, H. 2015. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. CVIU 141:108–125. | Hand-crafted features from segmented hands, tracked body parts and trajectories |
| 2017 | Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. 2017. Subunets: end-to-end hand shape and continuous sign language recognition. In ICCV. | Fuses the feature sequence of tracker-captured hand patches with the full-frame feature sequence for sequence prediction |
| 2018 | Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; and Li, W. 2018. Video-based sign language recognition without temporal segmentation. In AAAI. | Fuses the feature sequence of tracker-captured hand patches with the full-frame feature sequence for sequence prediction |
| 2019 | Koller, O.; Camgoz, C.; Ney, H.; and Bowden, R. 2019. Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. TPAMI. | Infers weak mouth labels from spoken German annotations and weak hand labels from SL dictionaries; these weak labels synchronize HMM states of different cues (full frame, hand shape, mouth shape) |

3 Proposed Approach

In this section, we first introduce the overall architecture of the proposed method. Then we elaborate the key components in our framework, including the spatial multi-cue (SMC) module and temporal multi-cue (TMC) module. Finally, we detail the sequence learning part and the joint loss optimization of our spatial-temporal multi-cue (STMC) framework.

3.1 Framework Overview

Given a video $\mathbf{x}=\left\{x_{t}\right\}_{t=1}^{T}$ with $T$ frames, the target of the CSLR task is to predict its corresponding sign gloss sequence $\ell=\left\{\ell_{i}\right\}_{i=1}^{L}$ with $L$ words. As illustrated in Figure 1, our framework consists of three key modules, i.e., spatial representation, temporal modelling and sequence learning. First, each frame is processed by an SMC module to generate spatial features of multiple cues, including full-frame, hand, face and pose. Then, a TMC module is leveraged to capture the temporal correlations of intra-cue features and inter-cue features at different time steps and time scales. Finally, the whole STMC network equipped with bidirectional Long-Short Term Memory (BLSTM) (Hochreiter and Schmidhuber 1997) encoders utilizes connectionist temporal classification (CTC) for sequence learning and inference.
Figure 1: An overview of the proposed STMC framework. The SMC module is firstly utilized to decompose spatial features of visual cues for T frames in a video. Strips with different colors represent feature sequences of different cues. Then, the feature sequences of cues are fed into the TMC module with stacked TMC blocks and temporal pooling (TP) layers. The output of TMC module consists of feature sequence in the inter-cue path and feature sequences of N cues in the intra-cue path, which are processed by BLSTM encoders and CTC layers for sequence learning and inference. Here, N denotes the number of cues.
$T \times H \times W \times 3$: $T$ frames × height $H$ × width $W$ × 3 channels (RGB)

SMC: Spatial Multi-Cue module

$T \times (C_{1}+\cdots+C_{N})$: $T$ frames × (channels of cue 1 + ... + channels of cue N). In this paper a "cue" can be understood as a body part or feature region: cue 1 is the full frame, cue 2 the hand shape, cue 3 the face/head, and cue 4 the body joints (pose).

  • TMC Block: Temporal Multi-Cue block

  • TP: Temporal Pooling

$2 \times T$: 2 × $T$ frames ???

$2 \times C$: 2 × $C$ cues ???

$4 \times C$: 4 × $C$ cues ???

BLSTM: bidirectional Long Short-Term Memory. For details see the post by herosunly.

CTC: connectionist temporal classification, used to handle input and output sequences that cannot be aligned one-to-one. For details see the post by yudonglee.

  • Inference: Friday is expected with sunshine
    The recognized sentence: "Friday may be sunny."

  • Optimization: Joint Loss
    Optimization with a joint loss function.

3.2 Spatial Multi-Cue Representation

In spatial representation module, 2D-CNN is adopted to generate multi-cue features of full-frame, hands, face and pose. Here, we select VGG-11 model (Simonyan and Zisserman 2015) as the backbone network, considering its simple but effective neural architecture design. As depicted in Figure 2, the operations in SMC are composed of three steps: pose estimation, patch cropping and feature generation.
Figure 2: The SMC Module. The keypoints are estimated for patch cropping of face and hands. The output of SMC includes features from full-frame, hands, face and pose.
Model choice: VGG-11 (Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.) as the backbone network.

  • crop: to cut out a patch (cropping)

Pose Estimation. Deconvolutional networks (Zeiler et al. 2010) are widely used in pixel-wise prediction. For pose estimation, two deconvolutional layers are added after the 7-th convolutional layer of VGG-11. The stride of each layer is 2. So, the feature maps are 4× upsampled from the resolution 14×14 to 56×56. The output is fed into a point-wise convolutional layer to generate K predicted heat maps. In each heat map, the position of its corresponding keypoint is expected to show the highest response value. Here, K is set to 7 for keypoints at the upper body, including the nose, both shoulders, both elbows and both wrists.
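As a rough PyTorch sketch of this branch (the intermediate channel width of 256 and the 4×4 deconvolution kernels are my assumptions; the paper only specifies two stride-2 deconvolutional layers after the 7th conv layer and a point-wise convolution producing K = 7 heat maps):

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Sketch of the self-contained pose estimation branch: two stride-2
    deconvolutions upsample the 14x14 backbone feature map to 56x56, then a
    1x1 convolution predicts K heat maps (one per upper-body keypoint)."""
    def __init__(self, in_channels=512, mid_channels=256, num_keypoints=7):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_channels, mid_channels, kernel_size=4,
                               stride=2, padding=1),   # 14x14 -> 28x28
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_channels, mid_channels, kernel_size=4,
                               stride=2, padding=1),   # 28x28 -> 56x56
            nn.ReLU(inplace=True),
        )
        self.heatmap = nn.Conv2d(mid_channels, num_keypoints, kernel_size=1)

    def forward(self, x):                    # x: (B, 512, 14, 14)
        return self.heatmap(self.deconv(x))  # (B, K, 56, 56)
```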

  To make the keypoint prediction differentiable for subsequent sequence learning, a soft-argmax layer is applied on these $K$ heat maps. Denoting the $K$ heat maps as $\mathbf{h}=\left\{h_{k}\right\}_{k=1}^{K}$, each heat map $h_{k} \in \mathbb{R}^{H \times W}$ is passed through a spatial softmax function as follows,
                $p_{i, j, k}=\frac{e^{h_{i, j, k}}}{\sum_{i=1}^{H} \sum_{j=1}^{W} e^{h_{i, j, k}}}$,      (1)
where $h_{i, j, k}$ is the value of heat map $h_{k}$ at position $(i, j)$ and $p_{i, j, k}$ is the probability of keypoint $k$ at position $(i, j)$. Afterwards, the expected values of the coordinates along the x-axis and y-axis over the whole probability map are calculated as follows,
                $(\hat{x}, \hat{y})_{k}=\left(\sum_{i=1}^{H} \sum_{j=1}^{W} \frac{i-1}{H-1} p_{i, j, k}, \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{j-1}{W-1} p_{i, j, k}\right)$,      (2)
Here, $J_{k}=(\hat{x}, \hat{y})_{k} \in[0,1]$ is the normalized predicted position of keypoint $k$. The corresponding position of $(\hat{x}, \hat{y})$ in an $H \times W$ feature map is $(\hat{x}(H-1)+1, \hat{y}(W-1)+1)$.
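A minimal sketch of the soft-argmax of Eqs. (1)-(2) in PyTorch, assuming the heat maps are arranged as a (B, K, H, W) tensor and following the convention above that the first coordinate is the expected (normalized) row index:

```python
import torch

def soft_argmax(heatmaps):
    """Differentiable soft-argmax of Eqs. (1)-(2).
    heatmaps: (B, K, H, W) -> normalized keypoint coordinates (B, K, 2) in [0, 1],
    first coordinate = expected row index, second = expected column index."""
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.view(B, K, H * W), dim=-1).view(B, K, H, W)
    rows = torch.linspace(0, 1, H, device=heatmaps.device)  # (i-1)/(H-1), i = 1..H
    cols = torch.linspace(0, 1, W, device=heatmaps.device)  # (j-1)/(W-1), j = 1..W
    x_hat = (probs.sum(dim=3) * rows).sum(dim=2)  # expectation over rows
    y_hat = (probs.sum(dim=2) * cols).sum(dim=2)  # expectation over columns
    return torch.stack([x_hat, y_hat], dim=-1)    # (B, K, 2)
```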

Patch Cropping. In CSLR, the perception of detailed visual cues is vital, including eye gaze, facial expression, mouth shape, hand shape and orientations of hands. Our model takes predicted positions of the nose and both wrists as the center points of the face and both hands. The patches are cropped from the output (56 × 56 × C4) of 4-th convolutional layer of VGG-11. The cropping sizes are fixed to 24 × 24 for both hands and 16 × 16 for the face. It’s large enough to cover body parts of a signer whose upper body is visible to the camera. The center point of each patch is clamped into a range to ensure that the patch would not cross the border of the original feature map.
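A sketch of how such clamped cropping might be implemented for a single frame, assuming a (C, H, W) feature map and a normalized (row, col) center tensor from the soft-argmax; the actual implementation may batch this differently:

```python
import torch

def crop_patch(feature_map, center, size):
    """Crop a size x size patch from a (C, H, W) feature map around a predicted
    keypoint. `center` holds normalized (row, col) coordinates in [0, 1].
    The center is clamped so the patch never crosses the feature-map border."""
    C, H, W = feature_map.shape
    row = center[0] * (H - 1)   # back to feature-map indices
    col = center[1] * (W - 1)
    half = size // 2
    top = int(row.round().clamp(half, H - size + half)) - half
    left = int(col.round().clamp(half, W - size + half)) - half
    return feature_map[:, top:top + size, left:left + size]  # (C, size, size)
```

On the 56×56 maps this would be called with size 24 for each hand patch and size 16 for the face patch.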

Feature Generation.  After K keypoints are predicted, they are flattened to a 1D-vector with dimension 2K and passed through two fully-connected (FC) layers with ReLU to get the feature vector of pose cue. Then, feature maps of the face and both hands are cropped and processed by several convolutional layers, separately. Most sign gestures rely on the cooperation of both hands. So we use weight-sharing convolutional layers for both hands. The outputs of them are concatenated along the channel-dimension. Finally, we perform global average pooling over all the feature maps with spatial dimension to form feature vectors of different cues.

All features are extracted by passing frames $\mathbf{x}=\left\{x_{t}\right\}_{t=1}^{T}$ through our spatial multi-cue (SMC) module as follows,
                $\left\{\left\{f_{t, n}\right\}_{n=1}^{N},\left\{J_{t, k}\right\}_{k=1}^{K}\right\}_{t=1}^{T}=\left\{\Omega_{\theta}\left(x_{t}\right)\right\}_{t=1}^{T}$,      (3)
where $\Omega_{\theta}(\cdot)$ denotes the SMC module and $\theta$ denotes its parameters. $J_{t, k} \in \mathbb{R}^{2}$ is the position of keypoint $k$ at the $t$-th frame. $f_{t, n} \in \mathbb{R}^{C_{n}}$ is the feature vector of visual cue $n$ at the $t$-th frame. In this paper, we set $N = 4$, which represents visual cues of full-frame, hand, face and pose, respectively.

Walkthrough: the input is $T$ frames of $H \times W$ RGB images with 3 channels, e.g., $T$ frames of 3×224×224 images fed into the SMC module. Layers 1–4 of the VGG-11 backbone produce $T$ feature maps of size 256×56×56; the face and both hands are cropped from these maps based on the estimated pose, while layers 5–7 process the whole frame to give $T$ feature maps of size 512×14×14. Two stride-2 deconvolutional layers are added after layer 7; global average pooling over each full-frame feature map yields one value per channel, forming the final 512-dimensional full-frame feature vector. The face and hand patches cropped from the 256×56×56 maps of layer 4 are each passed through ? convolutional layers followed by global average pooling to form their feature vectors; the two hands share convolution weights, and their outputs are concatenated after the ? convolutional layers, giving a 256-dimensional face feature vector and a 512-dimensional hand feature vector. The 512×14×14 maps from layers 5–7 go through ? deconvolutional layers and the soft-argmax to obtain the 7 upper-body keypoints; these are flattened into a 2×7-dimensional vector and passed through two fully-connected (FC) layers with ReLU to obtain a 256-dimensional pose feature vector. The 7 predicted keypoints are also fed back to the 256×56×56 maps after layer 4 to perform the cropping.
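The feature-map sizes quoted above can be sanity-checked with torchvision's VGG-11; the slice indices below (my assumption about torchvision's layer ordering) correspond to "up to the 4th conv layer" and "5th to 7th conv layer":

```python
import torch
from torchvision.models import vgg11

features = vgg11(weights=None).features
stage1 = features[:10]    # up to the ReLU after the 4th conv layer
stage2 = features[10:18]  # pool + 5th-7th conv layers (up to the ReLU after conv 7)

x = torch.randn(1, 3, 224, 224)   # one 224x224 RGB frame
mid = stage1(x)                   # expected: (1, 256, 56, 56)
top = stage2(mid)                 # expected: (1, 512, 14, 14)
print(mid.shape, top.shape)
```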

Soft-argmax: uses the softmax function (an exponential normalization) to achieve the purpose of argmax (finding the index of the maximum value) while keeping the whole operation differentiable.

3.3 Temporal Multi-Cue Modelling

Instead of simple fusion, our proposed temporal multi-cue (TMC) module intends to integrate spatiotemporal information from two aspects, intra-cue and inter-cue. The intra-cue path captures the unique features of each visual cue. The inter-cue path learns the combination of fused features from different cues at different time scales. Then, we define a TMC block to model the operations between the two paths as follows,
                $\left(o_{l}, f_{l}\right)=\operatorname{Block}_{l}\left(o_{l-1}, f_{l-1}\right)$,      (4)
where $\left(o_{l-1}, f_{l-1}\right)$ and $\left(o_{l}, f_{l}\right)$ are the input pair and output pair of the $l$-th block. $o_{l} \in \mathbb{R}^{T \times C_{o}}$ denotes the feature matrix of the inter-cue path. $f_{l} \in \mathbb{R}^{T \times C_{f}}$ denotes the feature matrix of the intra-cue path, which is the concatenation of vectors from different cues along the channel dimension. As the first input pair, $o_{1}=f_{1}=\left[f_{1,1}, f_{1,2}, \cdots, f_{1, N}\right]$, where $[\cdot]$ is the concatenating operation and $N$ is the number of cues.

  The detailed operations inside each TMC block are shown in Figure 3 and can be decomposed into two paths as follows. ( C is the number of output channels in each path)
            Figure 3: The TMC Module.

  • Temporal Conv - kernel size: the kernel size of the temporal convolution
  • Temporal Pool: temporal pooling
    (Temporal Convolutional Networks: see the reference link)
  • Concatenation: concatenating features along the channel dimension
  • ReLU: the ReLU activation function
  • $o_{l-1}$: the inter-cue feature fed into the $l$-th TMC block (output of block $l-1$)
  • $f_{l-1}$: the intra-cue feature fed into the $l$-th TMC block (output of block $l-1$)
Intra-Cue Path. The first path is to provide unique features of different cues at different time scales. The temporal transformation inside each cue is performed as follows,
                $f_{l, n}=\operatorname{ReLU}\left(\mathcal{K}_{k}^{\frac{C}{N}}\left(f_{l-1, n}\right)\right)$,      (5)
                $f_{l}=\left[f_{l, 1}, f_{l, 2}, \cdots, f_{l, N}\right]$.      (6)
Here, $f_{l, n} \in \mathbb{R}^{T \times \frac{C}{N}}$ denotes the feature matrix of the $n$-th cue. $\mathcal{K}_{k}^{\frac{C}{N}}$ denotes the kernel of a temporal convolution, where $k$ is the temporal kernel size and $\frac{C}{N}$ is the number of output channels.

Inter-Cue Path.  The second path is to perform the temporal transformation on the inter-cue feature from the previous block and fuse information from the intra-cue path as follows,
                $o_{l}=\operatorname{ReLU}\left(\left[\mathcal{K}_{k}^{\frac{C}{2}}\left(o_{l-1}\right), \mathcal{K}_{1}^{\frac{C}{2}}\left(f_{l}\right)\right]\right)$,      (7)
where $\mathcal{K}_{1}^{\frac{C}{2}}$ is a point-wise temporal convolution. It serves as a projection matrix between the two paths. Note that $f_{l}$ is the output of the intra-cue path in the present block.

  After each block, a temporal max-pooling with stride 2 and kernel size 2 is performed. In this paper, we use two blocks in the TMC module. The kernel size k of all temporal convolutions is set to 5, except the point-wise one. The number of output channels C in each path is set to 1024.
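A sketch of one TMC block implementing Eqs. (5)-(7), simplified by assuming the cue features have already been projected to equal channel counts (in the paper the first block takes the raw concatenated cue features, whose channel counts differ):

```python
import torch
import torch.nn as nn

class TMCBlock(nn.Module):
    """Sketch of one TMC block (Eqs. 5-7): an intra-cue temporal convolution per
    cue (C/N output channels each) and an inter-cue path that concatenates a
    temporal convolution of o_{l-1} with a point-wise projection of f_l."""
    def __init__(self, channels=1024, num_cues=4, kernel_size=5):
        super().__init__()
        self.num_cues = num_cues
        pad = kernel_size // 2
        cue_ch = channels // num_cues
        self.intra = nn.ModuleList([
            nn.Conv1d(cue_ch, cue_ch, kernel_size, padding=pad)
            for _ in range(num_cues)])
        self.inter = nn.Conv1d(channels, channels // 2, kernel_size, padding=pad)
        self.proj = nn.Conv1d(channels, channels // 2, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, o, f):
        # o, f: (B, C, T) -- channels first for Conv1d
        cues = torch.chunk(f, self.num_cues, dim=1)
        f_out = torch.cat([self.relu(conv(c))                        # Eq. (5)
                           for conv, c in zip(self.intra, cues)], dim=1)  # Eq. (6)
        o_out = self.relu(torch.cat([self.inter(o),
                                     self.proj(f_out)], dim=1))      # Eq. (7)
        return o_out, f_out
```

In the full module, two such blocks would be stacked, each followed by a temporal max-pooling layer (`nn.MaxPool1d(kernel_size=2, stride=2)`) as the TP layer.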

3.4 Sequence Learning and Inference
With the proposed SMC and TMC modules, the network can generate the inter-cue feature sequence $\mathbf{o}=\left\{o_{t}\right\}_{t=1}^{T^{\prime}}$ and $N$ intra-cue feature sequences $\mathbf{f}_{n}=\left\{f_{t, n}\right\}_{t=1}^{T^{\prime}}$. Here, $T^{\prime}$ is the temporal length of the final output of the TMC module. The question then is how to utilize these two feature sequences to accomplish the sequence learning and inference.

BLSTM Encoder.  Recurrent neural networks (RNN) can use their internal state to model the state transitions in the sequence of inputs. Here, we use RNN to map the spatial-temporal feature sequence to its sign gloss sequence. RNN takes the feature sequence as input and generates $T^{\prime}$ hidden states as follows,
                $h_{t}=\operatorname{RNN}\left(h_{t-1}, o_{t}\right)$,      (8)
in which $h_{t}$ is the hidden state at time step $t$ and the initial state $h_{0}$ is a fixed all-zero vector. In our approach, we choose the bidirectional Long Short-Term Memory (BLSTM) (Sutskever, Vinyals, and Le 2014) unit as the recurrent unit for its ability in processing long-term dependencies. BLSTM concatenates forward and backward hidden states from bidirectional inputs. Afterward, the hidden state of each time step is passed through a fully-connected layer and a softmax layer,
                $a_{t}=W \cdot h_{t}+b, \quad y_{t, j}=\frac{e^{a_{t, j}}}{\sum_{k} e^{a_{t, k}}}$,      (9)
where $y_{t, j}$ is the probability of label $j$ at time step $t$. In the CSLR task, label $j$ comes from a given vocabulary.
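A sketch of the BLSTM encoder and classification layer of Eqs. (8)-(9); the hidden size of 512 and the vocabulary size (1295 glosses plus a blank, as on PHOENIX-2014) are my assumptions:

```python
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Sketch of Eqs. (8)-(9): a bidirectional LSTM over the feature sequence,
    then a fully-connected layer producing per-time-step gloss probabilities."""
    def __init__(self, in_dim=1024, hidden=512, vocab_size=1296):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)  # forward + backward states

    def forward(self, x):                  # x: (B, T', in_dim)
        h, _ = self.blstm(x)               # (B, T', 2 * hidden)
        logits = self.fc(h)                # (B, T', vocab_size)
        return logits.log_softmax(dim=-1)  # log-probs, ready for CTC
```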
BLSTM: short for Bi-directional Long Short-Term Memory, formed by combining a forward LSTM with a backward LSTM. Both LSTM and BLSTM are commonly used to model context in natural language processing tasks.

Connectionist Temporal Classification. Our model employs connectionist temporal classification (CTC) (Graves et al. 2006) to tackle the problem of mapping the video sequence $\mathbf{o}=\left\{o_{t}\right\}_{t=1}^{T^{\prime}}$ to the ordered sign gloss sequence $\ell=\left\{\ell_{i}\right\}_{i=1}^{L}$ $(L \leq T)$, where the explicit alignment between them is unknown. The objective of CTC is to maximize the sum of probabilities of all possible alignment paths between the input and target sequence.

  CTC creates an extended vocabulary $\mathcal{V}$ with a blank label "$-$", where $\mathcal{V}=\mathcal{V}_{\text{origin}} \cup\{-\}$. The blank label represents stillness and transitions which have no precise meaning. Denote the alignment path of the input sequence as $\pi=\left\{\pi_{t}\right\}_{t=1}^{T^{\prime}}$, where label $\pi_{t} \in \mathcal{V}$. The probability of alignment path $\pi$ given the input sequence is defined as follows,
                $p(\pi \mid \mathbf{o})=\prod_{t=1}^{T^{\prime}} p\left(\pi_{t} \mid \mathbf{o}\right)=\prod_{t=1}^{T^{\prime}} y_{t, \pi_{t}}$.      (10)
  Define a many-to-one mapping operation $\mathcal{B}$ which removes all blanks and repeated words in the alignment path (e.g., $\mathcal{B}(II{-}miss{-}{-}you) = (I, miss, you)$). In this way, we calculate the conditional probability of sign gloss sequence $\ell$ as the sum of probabilities of all paths that can be mapped to $\ell$ via $\mathcal{B}$:
                $p(\ell \mid \mathbf{o})=\sum_{\pi \in \mathcal{B}^{-1}(\ell)} p(\pi \mid \mathbf{o})$,      (11)
where $\mathcal{B}^{-1}(\ell)=\{\pi \mid \mathcal{B}(\pi)=\ell\}$ is the inverse operation of $\mathcal{B}$. Finally, the CTC losses of the inter-cue feature sequence $\mathbf{o}$ and the intra-cue feature sequence $\mathbf{f}_{n}$ are defined as follows,
                $\mathcal{L}_{\mathrm{CTC}-\mathbf{o}}=-\ln p(\ell \mid \mathbf{o})$,      (12)
                $\mathcal{L}_{\mathrm{CTC}-\mathbf{f}_{n}}=-\ln p\left(\ell \mid \mathbf{f}_{n}\right)$.      (13)
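A sketch of how the CTC losses of Eqs. (12)-(13) can be computed with PyTorch's built-in `nn.CTCLoss`, together with the many-to-one mapping B; the blank index of 0 and the (T', B, V) tensor layout are assumptions:

```python
import torch
import torch.nn as nn

# log_probs: (T', B, V) log-probabilities from the BLSTM encoder,
# with the blank label assumed to sit at index 0 of the vocabulary.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_loss(log_probs, targets, input_lengths, target_lengths):
    """Eq. (12)/(13): -ln p(l | o), where the probability sums over all
    alignment paths that collapse to the target gloss sequence."""
    return ctc(log_probs, targets, input_lengths, target_lengths)

def collapse(path, blank=0):
    """The many-to-one mapping B: merge repeated labels, then drop blanks,
    e.g. [I, I, -, miss, -, -, you] -> [I, miss, you]."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```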
CTC: connectionist temporal classification, used to handle input and output sequences that cannot be aligned one-to-one (see the earlier note).

Joint Loss Optimization. During the training process, we take the optimization of the inter-cue path as the primary objective. To provide the information of each individual cue for fusion, the intra-cue path plays an auxiliary role. Hence, the objective function of the entire STMC framework is given as follows,
                $\mathcal{L}=\mathcal{L}_{\mathrm{CTC}-\mathbf{o}}+\alpha \sum_{n} \mathcal{L}_{\mathrm{CTC}-\mathbf{f}_{n}}+\mathcal{L}_{\mathrm{R}}^{\beta}$.      (14)
Here, $\alpha$ and $\beta$ are hyper-parameters, where $\alpha$ is to balance the ratio of the auxiliary loss for the intra-cue path, and $\beta$ is to make the regression loss $\mathcal{L}_{\mathrm{R}}$ of pose estimation have the same order of magnitude as the others. Given the estimated keypoints $J_{t, k} \in \mathbb{R}^{2}$ calculated in Eq. 3, whose corresponding ground-truth is $\hat{J}$, the smooth-L1 (Girshick 2015) loss function of the pose estimation branch is calculated as follows,
                $\mathcal{L}_{\mathrm{R}}^{\beta}=\frac{1}{2 T K} \sum_{t} \sum_{k} \sum_{i \in(x, y)} \operatorname{smooth}_{L_{1}} \beta\left(J_{t, k, i}-\hat{J}_{t, k, i}\right)$,      (15)
in which,
                $\operatorname{smooth}_{L_{1}}(x)=\left\{\begin{array}{ll} 0.5 x^{2} & \text{if } |x|<1 \\ |x|-0.5 & \text{otherwise} \end{array}\right.$      (16)
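A sketch of the joint objective of Eq. (14). Here I read the β in Eq. (15) as scaling the coordinate error before the smooth-L1; whether β multiplies the error or the loss is not fully explicit in the text, so treat this as one possible interpretation:

```python
import torch
import torch.nn.functional as F

def joint_loss(ctc_inter, ctc_intra_list, pred_joints, gt_joints,
               alpha=0.6, beta=30.0):
    """Eq. (14): primary inter-cue CTC loss, alpha-weighted auxiliary
    intra-cue CTC losses, and the pose regression term of Eqs. (15)-(16).
    pred_joints / gt_joints: (T, K, 2) normalized keypoint coordinates."""
    # scale the error by beta before the smooth-L1 (one reading of Eq. 15);
    # 'mean' reduction over T*K*2 elements matches the 1/(2TK) factor
    reg = F.smooth_l1_loss(beta * pred_joints, beta * gt_joints, reduction='mean')
    return ctc_inter + alpha * sum(ctc_intra_list) + reg
```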
Inference. For inference, we pass video frames through the SMC and TMC modules. Only the inter-cue feature sequence and its BLSTM encoder are used to generate the posterior probability distribution of glosses at all time steps. We use the beam search decoder (Hannun et al. 2014) to search the most probable sequence within an acceptable range.

4 Experiments

4.1 Dataset and Evaluation

Dataset. We evaluate our method on three datasets, including PHOENIX-2014 (Koller, Forster, and Ney 2015), CSL (Huang et al. 2018; Guo et al. 2018) and PHOENIX- 2014-T (Cihan Camgoz et al. 2018).

  PHOENIX-2014 is a publicly available German Sign Language dataset, which is the most popular benchmark for CSLR. The corpus was recorded from broadcast news about the weather. It contains videos of 9 different signers with a vocabulary size of 1295. The split of videos for Train, Dev and Test is 5672, 540 and 629, respectively. Our method is evaluated on the multi-signer database.

  CSL is a Chinese Sign Language dataset, which has 100 sign language sentences about daily life with 178 words. Each sentence is performed by 50 signers and there are 5000 videos in total. For pre-training, it also provides a matched isolated Chinese Sign Language database, which contains 500 words. Each word is performed 10 times by 50 signers.

  PHOENIX-2014-T is an extended version of PHOENIX-2014 and has two-stage annotations for new videos. One is sign gloss annotations for CSLR task. Another is German translation annotations for sign language translation (SLT) task. The split of videos for Train, Dev and Test is 7096, 519 and 642, respectively. It has no overlap with the previous version between Train, Dev and Test set. The vocabulary size is 1115 for sign gloss and 3000 for German.

Pose Annotation. To obtain the keypoint positions for training, we use the publicly available HRNet (Sun et al. 2019) toolbox to estimate the positions of 7 keypoints in upper-body for all frames on three databases. The toolbox gives 2D coordinates (x, y) in the pixel coordinate system. We thus represent each normalized keypoint with a tuple of (x, y) and record it as an array of 7 tuples.

Evaluation. In CSLR, Word Error Rate (WER) is used as the metric of measuring the similarity between two sentences (Koller, Forster, and Ney 2015). It measures the least operations of substitution (sub), deletion (del) and insertion (ins) to transform the hypothesis to the reference:
                $\mathrm{WER}=\frac{\#\text{substitutions}+\#\text{deletions}+\#\text{insertions}}{\#\text{words in reference}}$
Evaluation metric: WER (lower is better).
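A minimal word-level edit-distance implementation of this metric (the standard WER definition, not taken from the paper's code):

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum number of substitutions, deletions and
    insertions needed to turn the hypothesis into the reference,
    divided by the reference length (edit distance over word lists)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("ich mag sonne", "ich sonne"))  # 1 deletion / 3 words ~= 0.33
```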

4.2 Implementation Details

In our experiments, the input frames are resized to 224×224. For data augmentation in one video, we add random crop at the same location of all frames, random discard of 20% frames and random flip of all frames. For inter-cue features, the number of output channels after TCOVs and BLSTM are all set to 1024. There are 4 visual cues. For each intra-cue feature, the number of output channels after TCOVs and BLSTM are all set to 256.
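A sketch of this augmentation pipeline, assuming the frames come in slightly larger than the crop size and that the flip is applied with probability 0.5 (the paper does not give these details):

```python
import random
import torch

def augment_video(frames, crop_size=224, discard_ratio=0.2):
    """Augmentations described above: one random crop applied at the same
    location in every frame, random discarding of 20% of the frames, and a
    random horizontal flip of all frames. `frames`: (T, 3, H, W) tensor."""
    T, _, H, W = frames.shape
    # random crop, same location for all frames
    top = random.randint(0, H - crop_size)
    left = random.randint(0, W - crop_size)
    frames = frames[:, :, top:top + crop_size, left:left + crop_size]
    # randomly discard 20% of the frames, keeping temporal order
    keep = sorted(random.sample(range(T), int(T * (1 - discard_ratio))))
    frames = frames[keep]
    # random horizontal flip of the whole clip
    if random.random() < 0.5:
        frames = torch.flip(frames, dims=[3])
    return frames
```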

Following the previous methods (Koller, Zargaran, and Ney 2017; Pu, Zhou, and Li 2019; Cui, Liu, and Zhang 2019), we utilize a staged optimization strategy. First, we train a VGG11-based network as DNF (Cui, Liu, and Zhang 2019) and use it to decode pseudo labels for each clip. Then, we add a fully-connected layer after each output of the TMC module. The STMC network without BLSTM is trained with cross-entropy and smooth-L1 loss by the SGD optimizer. The batch size is 24 and the clip size is 16. Finally, with fine-tuned parameters from the previous stage, our full STMC network is trained end-to-end under joint loss optimization. We use the Adam optimizer with learning rate $5 \times 10^{-5}$ and set the batch size to 2. In all experiments, we set $\alpha$ to 0.6 and $\beta$ to 30. In fact, the experiment results are insensitive to slight changes of $\alpha$ (see Fig. 4), except $\alpha = 0$.
      Figure 4: The effect of weight parameter $\alpha$ in Eq. 14.
参考已有的方法 (Koller, Zargaran, and Ney 2017; Pu, Zhou, and Li 2019; Cui, Liu, and Zhang 2019),我们采用分阶段的优化策略。首先,训练一个基于 VGG11 的网络作为 DNF(Cui, Liu, and Zhang 2019),并用它为每个 clip 解码伪标签。然后,我们在 TMC 模块的每个输出之后添加一个全连接层,利用 SGD 优化器以交叉熵和 smooth-L1 损失训练不含 BLSTM 的 STMC 网络,batch size 为24,clip size 为16。最后,利用前一阶段微调得到的参数,对完整的 STMC 网络进行端到端的联合损失优化训练,使用学习率为 $5 \times 10^{-5}$ 的 Adam 优化器,batch size 设置为2。在所有实验中,我们将 $\alpha$ 设置为0.6,$\beta$ 设置为30。事实上,除 $\alpha = 0$ 外,实验结果对 $\alpha$ 的轻微变化并不敏感(见图4)。
      图 4:Eq. 14 中权重参数 $\alpha$ 的影响
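本节提到的 $\alpha$、$\beta$ 是联合损失(Eq. 14)中的权重,但本文节选部分没有给出 Eq. 14 的具体形式。下面是一个纯粹假设性的组合方式,仅用来说明“inter-cue CTC 损失 + α 加权的各 intra-cue CTC 损失 + β 加权的姿态回归 smooth-L1 损失”这一思路,并非论文的真实公式:

```python
import torch
import torch.nn.functional as F

def joint_loss(inter_log_probs, intra_log_probs_list, pose_pred, pose_gt,
               targets, input_lens, target_lens, alpha=0.6, beta=30.0):
    """假设性的联合损失示意(具体形式以论文 Eq. 14 为准)。
    inter_log_probs / intra_log_probs_list 中的每个张量均为 (T, N, C) 的 log-softmax 输出。"""
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    loss = ctc(inter_log_probs, targets, input_lens, target_lens)        # inter-cue 路径的 CTC 损失
    for lp in intra_log_probs_list:                                      # 每条 intra-cue 路径的 CTC 损失
        loss = loss + alpha * ctc(lp, targets, input_lens, target_lens)
    loss = loss + beta * F.smooth_l1_loss(pose_pred, pose_gt)            # 姿态回归的 smooth-L1 损失
    return loss
```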

  Our network architecture is implemented in PyTorch. For finetuning, we train the STMC network without BLSTM for 25 epochs. Afterward, the whole STMC network is trained end-to-end for 30 epochs. For inference, the beam width is set to 20. Experiments are run on 4 GTX 1080Ti GPUs.
  我们的网络架构基于 PyTorch 实现。在微调阶段,先对不含 BLSTM 的 STMC 网络训练25个 epoch;之后,对完整的 STMC 网络进行30个 epoch 的端到端训练。推理时,beam width 设置为20。实验在 4 块 GTX 1080Ti GPU 上运行。
优化策略的评价指标:WER(Word Error Rate)
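推理阶段论文使用宽度为 20 的 beam search 对 CTC 输出解码。下面给出一个更简单的 CTC 贪心解码示意(合并相邻重复、去掉 blank),仅用来说明从逐帧输出到 gloss 序列的后处理流程,并不是论文使用的束搜索实现:

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """log_probs: (T, C) 的逐帧 log 概率,返回解码出的 gloss 索引序列。"""
    best = log_probs.argmax(dim=-1).tolist()   # 每个时间步取概率最大的类别
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:       # 去掉 blank,并合并相邻重复
            out.append(idx)
        prev = idx
    return out
```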

4.3 Framework Effectiveness Study

For a fair comparison, experiments in this subsection are conducted on PHOENIX-2014, which is the most popular dataset in CSLR.
为了公平比较,本小节的实验是在 CSLR 中最流行的数据集 PHOENIX-2014 上进行的。

Module Analysis We analyze the effectiveness of each module in our proposed approach. In Table 1, different combinations of spatial and temporal modules are evaluated. The baseline model is composed of VGG11 and 1D-CNN with a BLSTM encoder. With the aid of multi-cue features, the SMC module provides about 3% improvement compared with baseline on the test set. However, with no extra guidance, the TMC module doesn’t show expected gain by replacing the 1D-CNN. With joint loss optimization, the intra-cue path is guided by CTC loss to learn temporal dependency of each cue and provides 1.6% and 1.7% extra gain on the dev set and test set, compared with 1D-CNN. Compared with the baseline model, our STMC network reduces the WER on the test set by 4.8%.
      Table 1: Evaluation of different module combinations (the lower the better).
模块分析 我们分析了所提方法中每个模块的有效性。表 1 评估了空间和时间模块的不同组合。baseline 模型由 VGG11 和带 BLSTM 编码器的 1D-CNN 组成。借助 multi-cue 特征,SMC 模块在测试集上比 baseline 提高了约3%。但是,在没有额外引导的情况下,用 TMC 模块替换 1D-CNN 并没有带来预期的提升。通过联合损失优化,intra-cue path 在 CTC 损失的引导下学习每个线索的时间依赖性,与 1D-CNN 相比,在 dev 集和 test 集上分别带来 1.6% 和 1.7% 的额外提升。与 baseline 模型相比,我们的 STMC 网络将测试集上的 WER 降低了 4.8%。
      表 1:评估不同的模块组合(越低越好)

Intra-Cue and Inter-Cue Paths With further optimization, the BLSTM encoder of each cue in the intra-cue path can also serve as an individual sequence predictor. In Table 2, WERs of different encoders in both paths are evaluated on the dev set. Among the four cues, the performance of pose is worst. With only the position and orientation of joints in upper-body, it’s difficult to distinguish the subtle variations in the appearance of sign gestures. The performance of hand is superior to that of face, while full-frame achieves relatively better performance. By leveraging the synergy of different cues, the inter-cue path shows the lowest WER.
      Table 2: Evaluation of different paths in TMC on Dev set.
Intra-Cue and Inter-Cue Paths 经过进一步优化,intra-cue path 中每个线索的 BLSTM 编码器也可以单独作为序列预测器。表 2 在 dev 集上评估了两条路径中不同编码器的 WER。在四种线索中,姿势(pose)的表现最差:仅凭上半身关节的位置和朝向,很难区分手势外观上的细微变化。手(hand)的性能优于脸(face),而 full-frame 的性能相对更好。通过利用不同线索之间的协同作用,inter-cue path 取得了最低的 WER。
      表2:在 Dev 集上评估 TMC 中的不同 paths

Inference Time To clarify the effectiveness of the self-contained pose estimation branch, we evaluate the inference time in Table 3. The inference time depends on the video length. On average, it takes around 8 seconds (25 FPS) for a sign sentence. For a fair comparison, we evaluate the inference time of 200 frames on a single GPU. Compared with introducing an external VGG-11 based model for pose estimation, our self-contained branch saves around 44% inference time. It's notable that our framework with the self-contained branch still shows slightly better performance than an off-the-shelf model. We argue that the differentiable pose estimation branch plays the role of regularization and then alleviates the overfitting of neural networks.
      Table 3: Comparison of inference time. (PE: an external VGG11-based model for pose estimation)
推理时间 为了说明 self-contained 姿态估计 branch 的有效性,我们在表 3 中评估了推理时间。推理时间取决于视频长度,平均一个 sign sentence 大约需要8秒(25帧/秒)。为了公平比较,我们在单块 GPU 上评估了 200 帧的推理时间。与引入一个外部的基于 VGG-11 的姿态估计模型相比,我们的 self-contained branch 节省了约44%的推理时间。值得注意的是,带有 self-contained branch 的框架仍比使用现成模型的方案性能略好。我们认为可微的姿态估计 branch 起到了正则化的作用,从而缓解了神经网络的过拟合。
      表 3:推理时间比较。(PE:基于 VGG11 的外部姿态估计模型)
这里的评估标准要找有利于自己的指标,并且要看和谁比,在什么基础上有了多少提升。

  • del/ins:
  • WER:通用硬指标
  • Time:
  • FLOPs:
  • Cue:full、hand、face、pose

Qualitative Analysis Figure 5 shows an example generated by different cues. It’s clear to see that the result of the inter-cue path can effectively learn correlations of multiple cues and make a better prediction.
      Figure 5: A qualitative result of different cues with estimated poses (zoom in) from Dev set (D: delete, I: insert, S: substitute).
定性分析 图5展示了由不同 cues 生成的一个示例。可以清楚地看到,inter-cue path 的结果能够有效利用多个线索之间的相关性,并做出更好的预测。
      图5:来自 Dev 集的不同线索与估计姿势(放大查看)的定性结果(D:删除,I:插入,S:替换)

4.4 State-of-the-art Comparison

Evaluation on PHOENIX-2014. In Table 4, we compare our approach with methods on PHOENIX-2014. CMLLR and 1-Mio-Hands belong to traditional HMM-based models with hand-crafted features. In SubUNets and LS-HAN, full-frame features are fused with features of hand patches, which are captured by an external tracker. In CNN-LSTM-HMM, two-stream networks are trained with weak hand labels and sign gloss labels, respectively. Our STMC outperforms two recent multi-cue methods, i.e., LS-HAN and CNN-LSTM-HMM, by 17.6% and 5.3%. Moreover, compared with DNF which explores the fusion of RGB and optical flow modalities, STMC still surpasses this best competitor by 2.2%. Based on the RGB modality, we propose a novel STMC framework and achieve 20.7% WER on the test set, a new state-of-the-art result on PHOENIX-2014.
      Table 4: Comparison with methods on PHOENIX-2014 (the lower the better).
在 PHOENIX-2014 上的评估 在表4中,我们将所提方法与 PHOENIX-2014 上的已有方法进行了比较。CMLLR 和 1-Mio-Hands 属于使用手工设计(hand-crafted)特征的传统 HMM 模型。在 SubUNets 和 LS-HAN 中,full-frame 特征与由外部跟踪器捕获的 hand patches 特征进行融合。在 CNN-LSTM-HMM 中,双流网络分别用 weak hand labels 和 sign gloss labels 训练。我们的 STMC 比两种最新的 multi-cue 方法 LS-HAN 和 CNN-LSTM-HMM 的 WER 分别低 17.6% 和 5.3%。此外,与融合 RGB 和光流两种模态的 DNF 相比,STMC 仍比这一最强竞争者领先 2.2%。仅基于 RGB 模态,我们提出的 STMC 框架在测试集上取得了 20.7% 的 WER,刷新了 PHOENIX-2014 上的最优结果。
      表 4:与 PHOENIX-2014 上已有方法的比较(越低越好)

Evaluation on CSL. In Table 5, we evaluate our approach on CSL under two settings. CSL dataset contains a smaller vocabulary compared with PHOENIX-2014. Following the works of (Huang et al. 2018; Guo et al. 2018), the dataset is split by two strategies in Table 5. Split I is a signer independent test: the train and test sets share the same sentences with no overlap of signers. Split II is an unseen sentence test: the train and test sets share the same signers and vocabulary with no overlap of same sentences. Between the two settings, Split II is more challenging, in that recognizing unseen combinations of words is difficult in CSLR. In IAN, their alignment algorithm of CTC decoder and LSTM decoder shows notable improvement, compared with previous methods. Benefiting from multi-cue learning, our STMC framework outperforms the best competitor on CSL by 4.1% on WER.
      Table 5: Comparison with methods on CSL.
在 CSL 上的评估 在表5中,我们在两种设置下评估了所提方法在 CSL 上的表现。与 PHOENIX-2014 相比,CSL 数据集的词汇量更小。按照 (Huang et al. 2018; Guo et al. 2018) 的做法,数据集按表5中的两种策略划分。Split I 是 signer-independent 测试:训练集和测试集共享相同的句子,但 signers 没有重叠。Split II 是 unseen sentence 测试:训练集和测试集共享相同的 signers 和词汇,但句子没有重叠。在两种设置中,Split II 更具挑战性,因为在 CSLR 中识别未见过的词语组合是困难的。在 IAN 中,其 CTC 解码器与 LSTM 解码器的 alignment 算法相比之前的方法有显著提升。得益于多线索学习,我们的 STMC 框架在 CSL 上的 WER 比最佳竞争者低 4.1%。
      表5:与在 CSL 上的方法比较
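为便于理解 Split I(signer-independent)与 Split II(unseen sentence)的区别,下面给出一个构造这两种划分的示意。样本字段 signer、sentence 以及函数名均为假设,与 CSL 官方划分脚本无关:

```python
def make_csl_splits(samples, test_signers, test_sentences):
    """samples: 形如 {"signer": ..., "sentence": ..., "video": ...} 的样本列表。"""
    # Split I:按 signer 划分,训练/测试共享句子,signer 不重叠
    split1_train = [s for s in samples if s["signer"] not in test_signers]
    split1_test  = [s for s in samples if s["signer"] in test_signers]
    # Split II:按句子划分,训练/测试共享 signer,句子不重叠
    split2_train = [s for s in samples if s["sentence"] not in test_sentences]
    split2_test  = [s for s in samples if s["sentence"] in test_sentences]
    return (split1_train, split1_test), (split2_train, split2_test)
```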

Evaluation on PHOENIX-2014-T. In Table 6, we provide a result of our method on PHOENIX-2014-T. As a newly proposed dataset (Cihan Camgoz et al. 2018) for sign language translation, PHOENIX-2014-T provides an extended database with sign gloss annotation and spoken German annotation. CNN-LSTM-HMM utilizes spoken German annotation to infer the weak mouth shape labels for each video. It provides results of multi-cue sequential parallelism, including full-frame, hand and mouth. Our method surpasses all three combinations of CNN-LSTM-HMM.
      Table 6: Comparison with methods on PHOENIX-2014-T.(f: full-frame, m: mouth, h: hand)
在 PHOENIX-2014-T 上的评估 表6给出了我们的方法在 PHOENIX-2014-T 上的结果。作为一个新提出的手语翻译数据集(Cihan Camgoz et al. 2018),PHOENIX-2014-T 提供了一个带有 sign gloss 注释和德语口语注释的扩展数据库。CNN-LSTM-HMM 利用德语口语注释来推断每个视频的弱嘴型标签,并给出了多线索顺序并行(full-frame、hand、mouth)的结果。我们的方法超过了 CNN-LSTM-HMM 的所有三种组合。
      表 6:与在 PHOENIX-2014-T 上方法的比较。(f: full-frame, m: mouth, h: hand)

5 Conclusion

In this paper, we present a novel multi-cue framework for CSLR, which aims to learn spatial-temporal correlations of visual cues in an end-to-end fashion. In our framework, a spatial multi-cue module is designed with a self-contained pose estimation branch to decompose spatial multi-cue features. Moreover, we propose a temporal multi-cue module composed of the intra-cue and inter-cue paths, which aims to preserve the uniqueness of each cue and explore the synergy of different cues at the same time. A joint optimization strategy is proposed to accomplish multi-cue sequence learning. Extensive experiments on three large-scale CSLR datasets demonstrate the superiority of our STMC framework.
在本文中,我们提出了一个新颖的多线索 CSLR 框架,旨在以端到端的方式学习视觉线索的时空相关性。在该框架中,我们设计了一个带有自包含(self-contained)姿态估计分支的空间多线索模块,用于分解空间多线索特征。此外,我们提出了一个由 intra-cue 和 inter-cue 两条路径组成的时序多线索模块,旨在保持每条线索的独特性,同时探索不同线索之间的协同作用。我们还提出了一种联合优化策略来完成多线索序列学习。在三个大规模 CSLR 数据集上的大量实验证明了 STMC 框架的优越性。

  Acknowledgments. This work was supported in part to Dr. Wengang Zhou by NSFC under contract No. 61632019 & No. 61822208 and Youth Innovation Promotion Association CAS (No. 2018497), and in part to Dr. Houqiang Li by NSFC under contract No. 61836011.
  致谢 本工作部分由国家自然科学基金(No. 61632019、No. 61822208)和中国科学院青年创新促进会(No. 2018497)资助 Wengang Zhou 博士,部分由国家自然科学基金(No. 61836011)资助 Houqiang Li 博士。

Paper原文链接点击下载

Paper源码(难复现,建议尝试其他)

后续补充

3 个 github 上的代码
github 1
github 2
github 3(这套代码作者还没有放出,但他们组的工作很靠谱,一作大佬建议可以关注一下)

ECCV2020 fully convolutional network 做手语识别的论文链接(后续会更此论文的研读与复现)
