[Motion Generation] MoMask: Generative Masked Modeling of 3D Human Motions

Article link: https://blog.csdn.net/Arachis_X/article/details/136691407

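MoMask is a framework for text-guided 3D human motion generation that combines hierarchical quantization with bidirectional Transformers. It reports an FID of 0.045 on HumanML3D (vs. 0.141 for T2M-GPT) and 0.228 on KIT-ML (vs. 0.514). The model can also be applied to related tasks, such as text-guided temporal inpainting, without further fine-tuning.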


MoMask: Generative Masked Modeling of 3D Human Motions

2023.11 CVPR 2024

Paper link
Code link
Latest CVPR 2024 motion generation paper: MoMask: Generative Masked Modeling of 3D Human Motions


Abstract

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at training stage. During generation (i.e. inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills up the missing tokens; Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-art methods on the text-to-motion generation task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning, such as text-guided temporal inpainting.

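We present MoMask, a new masked modeling framework for text-driven 3D human motion generation.

MoMask uses a hierarchical quantization scheme that represents human motion as multi-layer discrete motion tokens with high-fidelity detail.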
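
To make the idea of multi-layer tokens concrete, here is a minimal sketch of generic residual vector quantization: the base layer quantizes the input, and each later layer quantizes what the previous layers left over. This is not the authors' implementation. The function names, the random codebooks, and the sizes (`T`, `D`, `K`, `num_layers`) are all assumptions made for illustration; in MoMask the codebooks would be learned and the input would come from a motion encoder.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for each row of x, the index of its nearest codebook entry."""
    # Squared Euclidean distance between every frame and every code: (T, K)
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def residual_quantize(x, codebooks):
    """Quantize x of shape (T, D) into one token sequence per layer.

    Layer 0 quantizes x itself; each later layer quantizes the residual
    left over by the layers before it.
    """
    residual = x.copy()
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        residual = residual - codebook[idx]  # pass the leftover to the next layer
    return np.stack(tokens)  # shape (num_layers, T)

def reconstruct(tokens, codebooks):
    """Sum the selected codes over all layers to approximate x."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

rng = np.random.default_rng(0)
T, D, K, num_layers = 16, 8, 32, 4  # illustrative sizes, not the paper's
x = rng.normal(size=(T, D))         # stand-in for encoder latents (hypothetical)
# Random stand-in codebooks; a real model would learn these.
codebooks = [rng.normal(size=(K, D)) * (0.5 ** layer) for layer in range(num_layers)]

tokens = residual_quantize(x, codebooks)
x_hat = reconstruct(tokens, codebooks)
```

Here `tokens[0]` plays the role of the base-layer tokens and `tokens[1:]` the residual tokens of increasing order. With learned codebooks, adding more layers would be expected to tighten the reconstruction; with the random codebooks used here, no such improvement is guaranteed.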