Paper Reading [CVPR 2022]: A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation

智尊宝人工智能社区

Article link: https://blog.csdn.net/weixin_42155685/article/details/123912279

studyai.com

Search for the paper: A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation

Abstract

This paper proposes a simple transfer learning baseline for sign language translation. Existing sign language datasets (e.g. PHOENIX-2014T, CSL-Daily) contain only about 10K-20K pairs of sign videos, gloss annotations and texts, which are an order of magnitude smaller than typical parallel data for training spoken language translation models. Data is thus a bottleneck for training effective sign language translation models. To mitigate this problem, we propose to progressively pretrain the model from general-domain datasets that include a large amount of external supervision to within-domain datasets.

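In short, the paper proposes a simple transfer learning baseline for sign language translation. Existing sign language datasets (e.g. PHOENIX-2014T, CSL-Daily) contain only about 10K-20K samples pairing sign videos, gloss annotations and text, an order of magnitude fewer than the parallel data typically used to train spoken language translation models. Data is therefore the bottleneck for training effective sign language translation models. To ease this, the authors propose pretraining the model progressively, moving from general-domain datasets that carry a large amount of external supervision to within-domain datasets.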

Concretely, we pretrain the sign-to-gloss visual network on the general domain of human actions and the within-domain of a sign-to-gloss dataset, and pretrain the gloss-to-text translation network on the general domain of a multilingual corpus and the within-domain of a gloss-to-text corpus. The joint model is fine-tuned with an additional module named the visual-language mapper that connects the two networks. This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks, demonstrating the effectiveness of transfer learning. With its simplicity and strong performance, this approach can serve as a solid baseline for future research.

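Concretely, the sign-to-gloss visual network is pretrained on the general domain of human actions and then on a within-domain sign-to-gloss dataset, while the gloss-to-text translation network is pretrained on a general-domain multilingual corpus and then on a within-domain gloss-to-text corpus. An additional module, the visual-language mapper, connects the two networks, and the joint model is then fine-tuned. This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks, which demonstrates the effectiveness of transfer learning. Being both simple and strong, it can serve as a solid baseline for future research.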
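To make the training order in the abstract easier to follow, here is a minimal Python sketch. It is an illustration only, not the authors' code. The names `Stage`, `SCHEDULE`, `check_progressive` and `map_visual_to_text` are hypothetical. The data description for the joint stage is an assumption, as is modelling the visual-language mapper as a single linear projection: the abstract only says that the mapper connects the two networks.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Stage:
    network: str  # "visual", "translation" or "joint"
    domain: str   # "general" or "within"
    data: str     # kind of data used in this stage


# Progressive pretraining order as read from the abstract.
# The last entry's data description is an assumption for illustration.
SCHEDULE: List[Stage] = [
    Stage("visual", "general", "human action videos"),
    Stage("visual", "within", "sign-to-gloss dataset"),
    Stage("translation", "general", "multilingual corpus"),
    Stage("translation", "within", "gloss-to-text corpus"),
    Stage("joint", "within", "sign videos paired with text (assumed)"),
]


def check_progressive(schedule: Sequence[Stage]) -> bool:
    """Return True if every network sees general-domain data before
    within-domain data and joint fine-tuning is the final stage."""
    if not schedule or schedule[-1].network != "joint":
        return False
    seen_within = set()
    for stage in schedule[:-1]:
        if stage.network == "joint":
            return False  # joint fine-tuning must come last
        if stage.domain == "within":
            seen_within.add(stage.network)
        elif stage.network in seen_within:
            return False  # general-domain stage after a within-domain one
    return True


def map_visual_to_text(features, weight, bias):
    """Hypothetical mapper: project each visual feature vector into the
    text-embedding space with one linear layer (an assumption).

    features: list of frames, each a list of in_dim floats
    weight:   out_dim rows, each a list of in_dim floats
    bias:     list of out_dim floats
    """
    return [
        [sum(x * w for x, w in zip(frame, row)) + b
         for row, b in zip(weight, bias)]
        for frame in features
    ]


if __name__ == "__main__":
    for i, stage in enumerate(SCHEDULE, start=1):
        print(f"{i}. {stage.network:<11} {stage.domain:<7} {stage.data}")
    print("progressive order ok:", check_progressive(SCHEDULE))
```

The point of the sketch is the ordering: each network moves from broad, well-supervised data to the small sign language data, and only then are the two networks joined through the mapper and fine-tuned together.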