Recent arXiv papers on diffusion-based action recognition, motion generation, and related tasks

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. Firstly, they cannot directly generate coherent motions and require additional operations such as interpolation to process the generated actions. Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at https://github.com/yangzhao1230/PCMDM.

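The past-inpainting idea can be pictured as a standard DDPM sampling loop in which the frames of the already-known previous clip are overwritten, at every reverse step, with a noised copy of that clip, so the denoiser only has to fill in the future frames coherently. The snippet below is a minimal sketch of that idea, not the authors' implementation; `eps_model`, the linear noise schedule, and the tensor layout (batch, frames, pose dims) are all assumptions.

```python
import torch

def past_inpainting_sample(eps_model, past, n_future, text_emb, T=1000):
    """Minimal sketch of past-inpainting sampling with a placeholder
    text-conditioned noise predictor eps_model(x, t, text_emb)."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    B, n_past, D = past.shape
    x = torch.randn(B, n_past + n_future, D)             # start from pure noise
    for t in reversed(range(T)):
        # clamp the past segment to a sample of q(x_t | x_0 = past)
        x[:, :n_past] = alphas_bar[t].sqrt() * past + (1 - alphas_bar[t]).sqrt() * torch.randn_like(past)
        eps = eps_model(x, torch.full((B,), t), text_emb)
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x[:, n_past:]                                  # newly generated future frames
```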


NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data, we propose an automated synthetic data pipeline. For this, we seed a pre-trained motion model, which has priors for the basics of human movement, with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data, we synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.

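The guidance mechanism can be read as classifier-style guidance: the interaction field scores how far a pose estimate is from the valid interaction manifold, and the gradient of that distance is folded back into the noise prediction. The sketch below shows one guided step under that assumption; `eps_model`, `interaction_field`, and the guidance scale are placeholders, not NIFTY's actual interfaces.

```python
import torch

def guided_denoising_step(eps_model, interaction_field, x_t, t, alphas_bar, guide_scale=1.0):
    """One noise prediction, guided by the gradient of an object-anchored
    interaction field evaluated at the implied clean pose."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    # clean-pose estimate implied by the current noisy sample
    x0_hat = (x_t - (1 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
    dist = interaction_field(x0_hat).sum()        # distance to the interaction manifold
    grad = torch.autograd.grad(dist, x_t)[0]
    # shift the prediction so sampling drifts toward low-distance (plausible-contact) poses
    return eps + guide_scale * (1 - alphas_bar[t]).sqrt() * grad
```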


Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection

Anomalies are rare and anomaly detection is often therefore framed as One-Class Classification (OCC), i.e. trained solely on normalcy. Leading OCC techniques constrain the latent representations of normal motions to limited volumes and detect as abnormal anything outside, which accounts satisfactorily for the openset'ness of anomalies. But normalcy shares the same openset'ness property, since humans can perform the same action in several ways, which the leading techniques neglect. We propose a novel generative model for video anomaly detection (VAD), which assumes that both normality and abnormality are multimodal. We consider skeletal representations and leverage state-of-the-art diffusion probabilistic models to generate multimodal future human poses. We contribute a novel conditioning on the past motion of people, and exploit the improved mode coverage capabilities of diffusion processes to generate different-but-plausible future motions. Upon the statistical aggregation of future modes, anomaly is detected when the generated set of motions is not pertinent to the actual future. We validate our model on 4 established benchmarks: UBnormal, HR-UBnormal, HR-STC, and HR-Avenue, with extensive experiments surpassing state-of-the-art results.

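The detection principle can be summarised as: sample several plausible futures conditioned on the observed past, then check whether any of them explains the future that actually occurred. A minimal scoring sketch under that reading, with `generate_future` standing in for one reverse-diffusion pass and the minimum used as one possible statistical aggregation:

```python
import torch

def anomaly_score(generate_future, past_motion, actual_future, k=20):
    """Higher score = none of the k generated future modes is close to the
    observed future, i.e. the motion looks anomalous."""
    errors = []
    for _ in range(k):
        future = generate_future(past_motion)                    # one sampled mode
        errors.append((future - actual_future).pow(2).mean())    # per-mode reconstruction error
    errors = torch.stack(errors)
    # the minimum asks whether *any* mode matches what actually happened
    return errors.min()
```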


VDT: An Empirical Study on Video Diffusion with Transformers

This work introduces Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules, allowing separate optimization of each component and leveraging the rich spatial-temporal representation inherited from transformers. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies to produce temporally consistent video frames and even simulate the dynamics of 3D objects over time. 2) It enables flexible conditioning information through simple concatenation in the token space, effectively unifying video generation and prediction tasks. 3) Its modularized design facilitates a spatial-temporal decoupled training strategy, leading to improved efficiency. Extensive experiments on video generation, prediction, and dynamics modeling (i.e., physics-based QA) tasks have been conducted to demonstrate the effectiveness of VDT in various scenarios, including autonomous driving, human action, and physics-based simulation. We hope our study on the capabilities of transformer-based video diffusion in capturing accurate temporal dependencies, handling conditioning information, and achieving efficient training will benefit future research and advance the field. Codes and models are available at https://github.com/RERV/VDT.

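A minimal version of the modularized block could look like the following: spatial attention runs over the patches within each frame, temporal attention runs across frames for each patch position, and the two are kept as separate sub-modules. Layer sizes, normalisation placement, and the conditioning-by-token-concatenation are simplifying assumptions rather than the released VDT code.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Sketch of a block with decoupled spatial and temporal attention over
    video tokens shaped (batch, frames, patches, channels)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                  # x: (B, T, N, C)
        B, T, N, C = x.shape
        s = x.reshape(B * T, N, C)                         # attend over patches within each frame
        sn = self.norm1(s)
        s = s + self.spatial_attn(sn, sn, sn)[0]
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        tn = self.norm2(t)
        t = t + self.temporal_attn(tn, tn, tn)[0]          # attend across frames per patch
        out = t + self.mlp(self.norm3(t))
        return out.reshape(B, N, T, C).permute(0, 2, 1, 3) # back to (B, T, N, C)
```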

Diffusion Action Segmentation

Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. Our paper proposes an essentially different framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are progressively generated from random noise with input video features as conditions. To enhance the modeling of three striking characteristics of human actions, including the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action segmentation. Our codes will be made available.

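One way to picture the framework is a training step in which frame-wise action labels are the signal being diffused and the video features are the condition, with part of the condition randomly masked in the spirit of the unified masking strategy. The sketch below makes several illustrative assumptions (one-hot label encoding, an x0-prediction loss, a fixed mask rate) that are not taken from the paper:

```python
import torch
import torch.nn.functional as F

def segmentation_diffusion_train_step(denoiser, feats, labels, num_classes, T=1000):
    """Noise frame-wise labels, mask part of the video-feature condition, and
    train a placeholder denoiser(noisy_labels, t, cond) to recover the labels."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    x0 = F.one_hot(labels, num_classes).float()            # (B, frames, classes)
    t = torch.randint(0, T, (x0.shape[0],))
    a = alphas_bar[t].view(-1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)   # forward/noising process

    mask = (torch.rand(feats.shape[:2]) > 0.3).float().unsqueeze(-1)  # hide some condition frames
    pred = denoiser(x_t, t, feats * mask)                  # predict the clean labels
    return F.mse_loss(pred, x0)
```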


DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a generative modeling perspective, against previous discriminative learning manners. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g., DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance compared to previous art alternatives. The code will be made available at https://github.com/sauradip/DiffusionTAD.

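The forward/noising side of this formulation can be sketched as corrupting ground-truth (start, end) segments toward Gaussian noise, in the same spirit as DiffusionDet does for boxes, with the detector trained to reverse the corruption. Padding with random segments and the [-1, 1] scaling below are assumptions, not details confirmed by the abstract:

```python
import torch

def corrupt_proposals(gt_segments, num_proposals, t, alphas_bar):
    """Pad normalised (start, end) segments in [0, 1] to a fixed proposal count
    and diffuse them; returns the noisy proposals and the target noise."""
    n = gt_segments.shape[0]
    pad, _ = torch.sort(torch.rand(max(num_proposals - n, 0), 2), dim=1)  # random filler segments
    x0 = torch.cat([gt_segments, pad], dim=0)[:num_proposals]
    x0 = x0 * 2.0 - 1.0                                    # map to [-1, 1] as the diffusion signal
    noise = torch.randn_like(x0)
    x_t = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise
    return x_t, noise
```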


EgoViT: Pyramid Video Transformer for Egocentric Action Recognition

Capturing interaction of hands with objects is important to autonomously detect human actions from egocentric videos. In this work, we present a pyramid video transformer with a dynamic class token generator for egocentric action recognition. Different from previous video transformers, which use the same static embedding as the class token for diverse inputs, we propose a dynamic class token generator that produces a class token for each input video by analyzing the hand-object interaction and the related motion information. The dynamic class token can diffuse such information to the entire model by communicating with other informative tokens in the subsequent transformer layers. With the dynamic class token, dissimilarity between videos can be more prominent, which helps the model distinguish various inputs. In addition, traditional video transformers explore temporal features globally, which requires large amounts of computation. However, egocentric videos often have a large amount of background scene transition, which causes discontinuities across distant frames. In this case, blindly reducing the temporal sampling rate will risk losing crucial information. Hence, we also propose a pyramid architecture to hierarchically process the video from short-term high rate to long-term low rate. With the proposed architecture, we significantly reduce the computational cost as well as the memory requirement without sacrificing from the model performance. We perform comparisons with different baseline video transformers on the EPIC-KITCHENS-100 and EGTEA Gaze+ datasets. Both quantitative and qualitative results show that the proposed model can efficiently improve the performance for egocentric action recognition.

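The dynamic class token can be illustrated as pooling hand-object interaction features into a single token that replaces the usual static learned class token, so each video gets its own class token before the transformer layers. The sketch assumes the hand-object features already exist as a token sequence; the projection head and dimensions are placeholders:

```python
import torch
import torch.nn as nn

class DynamicClassToken(nn.Module):
    """Generate a per-video class token from hand-object interaction features
    and prepend it to the patch tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, patch_tokens, hand_object_feats):
        # patch_tokens: (B, N, C); hand_object_feats: (B, M, C) from a hand/object branch
        cls = self.proj(hand_object_feats.mean(dim=1, keepdim=True))   # (B, 1, C)
        return torch.cat([cls, patch_tokens], dim=1)                   # sequence with dynamic CLS token
```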

Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition

Recently, skeleton-based human action has become a hot research topic because the compact representation of human skeletons brings new blood to this research domain. As a result, researchers began to notice the importance of using RGB or other sensors to analyze human action by extracting skeleton information. Leveraging the rapid development of deep learning (DL), a significant number of skeleton-based human action approaches have been presented with fine-designed DL structures recently. However, a well-trained DL model always demands high-quality and sufficient data, which is hard to obtain without costing high expenses and human labor. In this paper, we introduce a novel data augmentation method for skeleton-based action recognition tasks, which can effectively generate high-quality and diverse sequential actions. In order to obtain natural and realistic action sequences, we propose denoising diffusion probabilistic models (DDPMs) that can generate a series of synthetic action sequences, and their generation process is precisely guided by a spatial-temporal transformer (ST-Trans). Experimental results show that our method outperforms the state-of-the-art (SOTA) motion generation approaches on different naturality and diversity metrics. It proves that its high-quality synthetic data can also be effectively deployed to existing action recognition models with significant performance improvement.

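Once such a generator is trained, the augmentation step itself is simple: sample extra labelled sequences per action class and mix them with the real training set before fitting the recognition model. A sketch under that assumption, with `sample_fn` standing in for the ST-Trans-guided DDPM sampler:

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset

def augment_with_synthetic_actions(real_dataset, sample_fn, classes, per_class=100):
    """Generate per_class synthetic skeleton sequences for each action class
    and concatenate them with the real dataset."""
    synth_x, synth_y = [], []
    for c in classes:
        seqs = sample_fn(c, per_class)                     # (per_class, frames, joints, 3)
        synth_x.append(seqs)
        synth_y.append(torch.full((per_class,), c, dtype=torch.long))
    synthetic = TensorDataset(torch.cat(synth_x), torch.cat(synth_y))
    return ConcatDataset([real_dataset, synthetic])
```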

Imitating Human Behaviour with Diffusion Models

Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.

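The observation-to-action formulation amounts to sampling an action by running the reverse diffusion chain conditioned on the current observation, which is what lets the policy stay stochastic and multimodal rather than regressing a single mean action. A minimal ancestral-sampling sketch, with `eps_model` as a placeholder observation-conditioned noise predictor and a short 50-step schedule chosen arbitrarily:

```python
import torch

def sample_action(eps_model, observation, action_dim, T=50):
    """Sample one action vector per observation by reverse diffusion."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    a = torch.randn(observation.shape[0], action_dim)      # start from Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(a, torch.full((a.shape[0],), t), observation)
        a = (a - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a                                               # one sampled (possibly multimodal) action
```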

Modiff: Action-Conditioned 3D Motion Generation with Denoising Diffusion Probabilistic Models

Diffusion-based generative models have recently emerged as powerful solutions for high-quality synthesis in multiple domains. Leveraging the bidirectional Markov chains, diffusion probabilistic models generate samples by inferring the reversed Markov chain based on the learned distribution mapping at the forward diffusion process. In this work, we propose Modiff, a conditional paradigm that benefits from the denoising diffusion probabilistic model (DDPM) to tackle the problem of realistic and diverse action-conditioned 3D skeleton-based motion generation. We are a pioneering attempt that uses DDPM to synthesize a variable number of motion sequences conditioned on a categorical action. We evaluate our approach on the large-scale NTU RGB+D dataset and show improvements over state-of-the-art motion generation methods.

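Action conditioning of this kind is commonly implemented by fusing a class-label embedding with the diffusion-timestep embedding before the denoising backbone. The sketch below shows that common pattern with a placeholder MLP backbone; it is not Modiff's actual architecture:

```python
import torch
import torch.nn as nn

class ActionConditionedDenoiser(nn.Module):
    """Predict noise for a motion sequence conditioned on a categorical action
    label fused with the diffusion timestep embedding."""
    def __init__(self, motion_dim, num_actions, hidden=256, max_steps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(max_steps, hidden)     # one embedding per diffusion step
        self.a_embed = nn.Embedding(num_actions, hidden)   # one embedding per action class
        self.backbone = nn.Sequential(
            nn.Linear(motion_dim + hidden, hidden), nn.SiLU(), nn.Linear(hidden, motion_dim))

    def forward(self, x_t, t, action):                     # x_t: (B, frames, motion_dim)
        cond = self.t_embed(t) + self.a_embed(action)              # (B, hidden)
        cond = cond.unsqueeze(1).expand(-1, x_t.shape[1], -1)      # broadcast over frames
        return self.backbone(torch.cat([x_t, cond], dim=-1))       # predicted noise per frame
```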

BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction

Stochastic human motion prediction (HMP) has generally been tackled with generative adversarial networks and variational autoencoders. Most prior works aim at predicting highly diverse movements in terms of the skeleton joints' dispersion. This has led to methods predicting fast and motion-divergent movements, which are often unrealistic and incoherent with past motion. Such methods also neglect contexts that need to anticipate diverse low-range behaviors, or actions, with subtle joint displacements. To address these issues, we present BeLFusion, a model that, for the first time, leverages latent diffusion models in HMP to sample from a latent space where behavior is disentangled from pose and motion. As a result, diversity is encouraged from a behavioral perspective. Thanks to our behavior coupler's ability to transfer sampled behavior to ongoing motion, BeLFusion's predictions display a variety of behaviors that are significantly more realistic than the state of the art. To support it, we introduce two metrics, the Area of the Cumulative Motion Distribution, and the Average Pairwise Distance Error, which are correlated to our definition of realism according to a qualitative study with 126 participants. Finally, we prove BeLFusion's generalization power in a new cross-dataset scenario for stochastic HMP.

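The two-stage structure can be sketched as: run diffusion in a low-dimensional latent space of behaviours, then let a decoder (the behaviour coupler) transfer the sampled behaviour onto the ongoing motion. Both networks below are placeholders and the short schedule is illustrative only:

```python
import torch

def predict_future(latent_eps_model, behavior_coupler, observed_motion, latent_dim, T=50):
    """Sample a latent behaviour code by reverse diffusion, then decode it into
    future poses conditioned on the observed (ongoing) motion."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    z = torch.randn(observed_motion.shape[0], latent_dim)          # latent behaviour code
    for t in reversed(range(T)):
        eps = latent_eps_model(z, torch.full((z.shape[0],), t), observed_motion)
        z = (z - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return behavior_coupler(z, observed_motion)                    # decoded future motion
```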

Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.

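The style control boils down to guidance arithmetic on noise predictions: classifier-free guidance scales how strongly a style is expressed, and mixing the guided predictions of two styles gives a rough product-of-experts style interpolation. The conditioning interface below is an assumption, not the paper's API:

```python
import torch

def styled_noise_prediction(eps_model, x_t, t, audio, style_a, style_b,
                            guidance=2.0, mix=0.5):
    """Combine classifier-free-guided predictions for two styles; `guidance`
    controls how pronounced the styles are, `mix` interpolates between them."""
    eps_uncond = eps_model(x_t, t, audio, style=None)              # style condition dropped
    eps_a = eps_model(x_t, t, audio, style=style_a)
    eps_b = eps_model(x_t, t, audio, style=style_b)
    guided_a = eps_uncond + guidance * (eps_a - eps_uncond)        # amplified style A
    guided_b = eps_uncond + guidance * (eps_b - eps_uncond)        # amplified style B
    return mix * guided_a + (1 - mix) * guided_b                   # crude style interpolation
```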

Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model

We propose a simple and novel method for generating 3D human motion from complex natural language sentences, which describe different velocity, direction and composition of all kinds of actions. Different from existing methods that use classical generative architecture, we apply the Denoising Diffusion Probabilistic Model to this task, synthesizing diverse motion results under the guidance of texts. The diffusion model converts white noise into structured 3D motion by a Markov process with a series of denoising steps and is efficiently trained by optimizing a variational lower bound. To achieve the goal of text-conditioned image synthesis, we use the classifier-free guidance strategy to add text embedding into the model during training. Our experiments demonstrate that our model achieves competitive results on HumanML3D test set quantitatively and can generate more visually natural and diverse examples. We also show with experiments that our model is capable of zero-shot generation of motions for unseen text guidance.
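
Classifier-free guidance of this kind is usually trained by randomly dropping the text condition, so a single network learns both the conditional and unconditional noise predictions and can be guided at sampling time. A hedged sketch of such a training step, with `eps_model` and the 10% drop rate as assumptions rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def cfg_training_step(eps_model, x0, text_emb, T=1000, drop_prob=0.1):
    """Noise a clean motion x0, randomly zero the text embedding for some
    samples, and train the placeholder eps_model to predict the added noise."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    B = x0.shape[0]
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise

    keep = (torch.rand(B, 1) > drop_prob).float()          # drop the condition ~10% of the time
    pred = eps_model(x_t, t, text_emb * keep)
    return F.mse_loss(pred, noise)
```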

Human Motion Diffusion Model



MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model

Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation. Homepage: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html

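The body-part-level control can be loosely illustrated by splicing together noise predictions conditioned on different prompts along the pose dimensions of different parts. The hard splice and the index sets below are simplifications for illustration, not MotionDiffuse's exact noise-combination scheme:

```python
import torch

def part_composed_noise(eps_model, x_t, t, prompt_upper, prompt_lower, upper_idx, lower_idx):
    """Combine two text-conditioned noise predictions so the upper-body pose
    dimensions follow one prompt and the lower-body dimensions follow another."""
    eps_upper = eps_model(x_t, t, prompt_upper)
    eps_lower = eps_model(x_t, t, prompt_lower)
    eps = torch.zeros_like(x_t)
    eps[..., upper_idx] = eps_upper[..., upper_idx]        # upper body follows prompt A
    eps[..., lower_idx] = eps_lower[..., lower_idx]        # lower body follows prompt B
    return eps
```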
