MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance (Paper Walkthrough)

Paper: https://arxiv.org/pdf/2406.19680
Code: https://github.com/Tencent/MimicMotion

Abstract

In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. [Contributions of this paper] Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion.

The core problems this paper addresses, and the corresponding methods:
1. Pose confidence is used to improve frame quality and temporal smoothness.
2. Regional loss amplification addresses image distortion.
3. A progressive latent fusion strategy addresses instability in long-video generation.

1 Introduction

With the rapid development of generative artificial intelligence, video generation is gaining attention in parallel with the growing maturity of image generation techniques. [Current challenges] However, video generation is more challenging due to its higher inherent complexities, including the need for high-quality imagery and seamless temporal smoothness. This sets higher standards for video generation technology. In addition to these challenges, controlling the generated content and extending generation to significant lengths without compromising quality are essential for real-world use. [Problem addressed in this paper] In this paper, we focus on pose-guided video generation conditioned on a reference image. Our goal is to generate a video that not only contains rich imagery details but also adheres to the reference image and the pose guidance.
[Related work] Currently, there are plenty of works focusing on image-conditioned pose-guided video generation, such as Follow Your Pose [1], DreamPose [2], DisCo [3], MagicDance [4], AnimateAnyone [5], MagicAnimate [6], DreaMoving [7], Champ [8], etc. Though various model architectures and training techniques have been studied for better generation performance, the generated results are unsatisfactory in several aspects. [Problems with existing work] Imagery distortion, especially in the regions of human hands, is still a common issue, particularly evident in videos containing large movements. Besides, to achieve good temporal smoothness, imagery details are sometimes sacrificed, resulting in videos with blurred frames. In the presence of diverse appearances and motions in videos, accurate pose estimation is inherently challenging. This inaccuracy not only creates a conflict between pose alignment and temporal smoothness but also hinders model scaling on the training schedule due to overfitting on noisy samples. In addition, due to computational limitations and model capabilities, there are still significant challenges in generating high-quality long videos containing a large number of frames. [This paper's solution] To solve these problems, we propose a series of approaches for generating long but still smooth videos based on pose guidance and image reference.
To alleviate the negative impact of inaccurate pose estimation, we propose an approach of confidence-aware pose guidance. By introducing the concept of confidence to the pose sequence representation, better temporal smoothness can be achieved and imagery distortion can also be eased. Confidence-based regional loss amplification can make the hand regions more accurate and clear. In addition, we propose a progressive latent fusion method for achieving long but still smooth video generation. Through generating video segments with overlapped frames and applying the proposed progressive latent fusion, our model can handle arbitrary-length pose sequence guidance. By merging the generated video segments, the final long video can have good cross-frame smoothness and imagery richness at the same time. For model training, to keep the cost within an acceptable range, our method is based on a generally pre-trained video generation model. The amount of training data is not large and no special manual annotation is required.
In summary, there are three key contributions of this work:
1.We improve the pose guidance by employing a confidence-aware strategy. In this way, the negative impact of inaccurate pose estimation can be alleviated. This approach not only reduces the influence of noisy samples during training but also corrects erroneous pose guidance during inference.
2.Based on the confidence-aware strategy, we propose hand region enhancement to alleviate hand distortion by strengthening the loss weight of the region of human hands with high pose confidence.
3.While cross-frame overlapped diffusion is a standard technique for generating long videos, we advance it with a position-aware progressive latent fusion approach that improves temporal smoothness at segment boundaries. Extensive experimental results show the effectiveness of the proposed approach.

Summary:
This section first lists related work and its problems, then presents the paper's methods for addressing them.

  1. To mitigate inaccurate pose estimation, the notion of confidence is introduced, improving both image distortion and temporal smoothness.
  2. Confidence-based regional loss amplification makes hand regions more accurate and clearer.
  3. Progressive latent fusion enables long-video generation of arbitrary length.
  4. For training, the model builds on a pre-trained video generation model, so a very large dataset is not needed and no manual annotation is required.

2 Related work

2.1 Diffusion models for image/video generation

Diffusion-based models have demonstrated promising results in the fields of image [9–13] and video generation [14–20], renowned for their capacity in generative tasks. [Existing problem] Diffusion models operating in the pixel domain encounter challenges in generating high-resolution images due to information redundancy and high computational costs. [Solution] Latent Diffusion Models (LDM) [11] address these issues by performing the diffusion process in low-dimensional latent spaces, significantly enhancing generation efficiency and quality while reducing computational demands. [Challenges for video generation] Compared to image generation, video generation demands a more precise understanding of spatial relationships and temporal motion patterns. [Solutions] Recent video generation models leverage diffusion models by adding temporal layers to pre-trained image generation models [14, 15, 21, 22], or utilize transformer structures [23–26] to enhance generative capabilities for videos. Stable Video Diffusion (SVD) [20] (note: worth trying out later) is one of the most popular open-source models built upon LDM. It offers a straightforward and effective method for image-based video generation and serves as a powerful pre-trained model for this task. Our approach extends SVD for pose-guided video generation, leveraging the pre-trained generative capabilities of SVD.

2.2 Pose-guided human motion transfer

Pose-to-appearance mapping aims to transfer motion from the source identity to the target identity. Methods based on paired keypoints from source and target images employ local affine transformations [27, 28] or Thin-Plate Spline transformations [29] to warp the source image to match the driving image. These techniques aim to minimize distortion by applying weighted affine transformations, thereby generating poses in the output image that closely resemble those in the driving image. Similarly, methods such as [30, 1, 5, 7] utilize pose stick figures obtained from off-the-shelf human pose detectors as motion indicators and directly generate video frames through generative models. Depth information [7] or 3D human parametric models, such as SMPL (Skinned Multi-Person Linear) [8], can also be used to represent human geometry and motion characteristics from the source video. Nevertheless, these overly dense guidance techniques can rely too much on signals from the source video, such as the outline of the body, leading to a degradation in the quality of the generated videos, especially when the target identity differs significantly from the source. [Our improvement] Our approach, leveraging off-the-shelf human pose detectors, is capable of capturing the motion of the human body in driving videos without introducing excessive extraneous information, thereby ensuring the overall quality of the generated video. Different from existing methods, we introduce confidence-aware pose guidance, which effectively mitigates the influence of inaccurate pose estimation in training and inference. In this way, we achieve superior portrait frame quality, especially in the hand regions.

2.3 Long video generation

Recent diffusion-based video generation algorithms are constrained to producing videos with durations of only a few seconds, significantly limiting their practical applications. As a result, substantial research efforts have been dedicated to extending the duration of generated videos, leading to the proposal of various approaches to overcome this limitation. Methods like [17, 31] autoregressively predict successive frames, enabling the generation of infinitely long videos. [Problem] However, these methods often face quality degradation due to error accumulation and the lack of long-term temporal coherence. [Solution] Hierarchical approaches [32, 22] are proposed for generating long videos in a coarse-to-fine manner: they first create a coarse storyline with keyframes using a global diffusion model, then iteratively refine the video with local diffusion models to produce detailed intermediate frames. MultiDiffusion [33] combines multiple processes that use pre-trained text-to-image diffusion models to create high-quality images with user-defined controls. It works by applying the model to different parts of an image and using an optimization method to ensure all parts blend seamlessly. This allows users to generate images that meet specific requirements, like certain aspect ratios or spatial layouts, without needing additional training or fine-tuning. Lumiere [34] extends MultiDiffusion to video generation by dividing the video into overlapping temporal segments. Each segment is independently denoised, and an optimization algorithm then combines these denoised segments. This approach ensures high coherence in the generated video, effectively maintaining temporal smoothness across segments. [Our method] However, our experiments reveal that abrupt transitions can still occur at segment boundaries.
Building upon the principle of MultiDiffusion, we introduce a position-aware progressive latent fusion strategy that enhances temporal smoothness near segment boundaries. We adaptively assign fusion weights based on the temporal position, ensuring a smooth transition at the segment boundaries that further reduces flickering.

3 Method

3.1 Preliminaries

A Diffusion Model (DM) learns a diffusion process that generates a probability distribution for a given dataset. In the case of visual content generation tasks, a neural network of a DM is trained to reverse the process of adding noise to real data, so that new data can be progressively generated starting from random noise. For a data sample $x \sim p_{data}$ from a specific data distribution $p_{data}$, the forward diffusion process is defined as a fixed Markov chain that gradually adds Gaussian noise to the data following:

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$  (1)

for $t = 1, \dots, T$, where $T$ is the number of perturbing steps and $x_t$ represents the noisy data after adding $t$ steps of noise to the real data $x_0$. This process is controlled by a noise schedule $\beta_t$ parameterized by the noising step $t$. Thanks to the closure properties of the normal distribution, $x_t$ can be computed directly from $x_0$ by rewriting the above diffusion process as follows:

$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big)$  (2)

where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $\alpha_t = 1 - \beta_t$. Following DDPM [9], a denoising function $\epsilon_\theta$ parameterized with $\theta$, commonly implemented with a neural network, is trained by minimizing the mean squared error loss:

$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, x_t,\, c,\, t}\big[\lVert \epsilon - \epsilon_\theta(x_t; c, t) \rVert_2^2\big]$  (3)

where $c$ is an optional condition and $x_t$ is a perturbed version of real data $x_0 \sim p_{data}$ obtained by adding $t$ steps of noise. In this way, $\epsilon_\theta$ can be trained until convergence by sampling $x_0$ from the real data distribution and a time step $t$, with an optional condition $c$.
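To make Eqs. (1)–(3) concrete, here is a minimal PyTorch sketch of the forward noising step and the training loss. The linear beta schedule, tensor shapes, and the `eps_model(x_t, cond, t)` interface are illustrative assumptions, not the actual settings of MimicMotion or SVD.

```python
import torch
import torch.nn.functional as F

# Illustrative linear beta schedule; the schedule used by SVD/MimicMotion differs.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)            # \bar{alpha}_t

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) following Eq. (2)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)          # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def diffusion_loss(eps_model, x0, cond):
    """Noise-prediction MSE loss following Eq. (3).

    `eps_model(x_t, cond, t)` is a hypothetical denoiser standing in for epsilon_theta.
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(eps_model(x_t, cond, t), noise)
```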

3.2 Data preparation

To train a pose-guided video diffusion model, we collect a video dataset containing various human motions. Leveraging the powerful capability of the generally pre-trained image-to-video model, the dataset need not be excessively large, as the pre-trained model already provides a good prior.
Given a video from our dataset, the training sample is constructed from three parts: a reference image (denoted as $I_{ref}$), a sequence of raw video frames, and the corresponding poses. Firstly, basic pre-processing operations like frame resizing and cropping are applied to the raw video to get a sequence of video frames with a fixed aspect ratio. For a given video, a fixed number of frames are randomly sampled at equal intervals as input video frames to the diffusion model. The input reference image is randomly sampled from the same video, at a location not limited to the sampled video frames. This reference image is pre-processed in the same way as the video frames. Another input of the model is the pose sequence, which is extracted from the video frames with DWPose [35] frame by frame.

The training data consist of three parts: a reference image, a sequence of raw video frames, and the pose sequence corresponding to those frames.

  1. Pre-processing: each frame of the source video is resized/cropped to a fixed aspect ratio.
  2. For each video, a fixed number of frames is randomly sampled at equal intervals and used as the input video frames of the diffusion model.
  3. The reference image is randomly sampled from the same video; its location is not limited to the sampled frames.
  4. The reference image is pre-processed in the same way as in step 1.
  5. The pose sequence is extracted frame by frame from the video frames using DWPose.

A minimal sketch of this sampling procedure is given below.
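The sketch below assembles one training sample following the description above. `extract_dwpose` is a hypothetical wrapper around the DWPose estimator and `num_frames` is an illustrative value; the actual pipeline in the MimicMotion repository may differ.

```python
import random

def build_training_sample(video_frames, extract_dwpose, num_frames=16):
    """Assemble (reference image, video clip, pose sequence) from one video.

    `video_frames` is a list of frames already resized/cropped to a fixed aspect
    ratio; `extract_dwpose` is a hypothetical per-frame DWPose wrapper returning
    keypoints together with their confidence scores.
    """
    assert len(video_frames) >= num_frames

    # Sample a fixed number of frames at equal intervals, with a random offset.
    stride = len(video_frames) // num_frames
    offset = random.randint(0, stride - 1) if stride > 1 else 0
    clip = [video_frames[offset + i * stride] for i in range(num_frames)]

    # The reference image comes from the same video, not restricted to the clip.
    ref_image = random.choice(video_frames)

    # The pose sequence is extracted frame by frame (no temporal interaction).
    poses = [extract_dwpose(frame) for frame in clip]
    return ref_image, clip, poses
```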
3.3 Pose-guided Video Diffusion Model


The goal of MimicMotion is to generate high-quality, pose-guided human videos from a single reference image and a sequence of poses to mimic. This task involves synthesizing realistic motion that adheres to the provided pose sequence while maintaining visual fidelity to the reference image. We exploit the ability of a specific pre-trained video diffusion model to reduce the data requirements and computational cost of training a video diffusion model from scratch. [Pre-trained model used] Stable Video Diffusion (SVD) [20] is an open-source image-to-video diffusion model trained on a large-scale video dataset. It shows good performance on both video quality and diversity compared with other contemporary models. The model structure of MimicMotion is designed to integrate a pre-trained SVD model to leverage its image-to-video generation capabilities.

Learning a diffusion process in pixel space is costly, and this is more severe when generating high-definition videos involving many frames. We follow the Latent Diffusion Model (LDM) [36] and encode pixels into latent space so that diffusion can be conducted in a low-dimensional latent space. LDM adopts a pair of autoencoders, consisting of an encoder $E$ and a decoder $D$. Given a data sample $x$, it is encoded into the latent space as $z = E(x)$. Conversely, the latent vector $z$ can be decoded back into pixel space via $x = D(z)$.

Figure 2 shows the structure of our model; the rest of this section walks through it. The core of our model is a latent video diffusion model with a U-Net for progressive denoising in latent space. The VAE encoder applied to the input video frames and the corresponding decoder used to obtain the denoised video frames are both adopted from SVD, and their parameters are frozen. The VAE encoder is applied independently to each frame of the input video as well as to the conditional reference image, operating on a per-frame basis without considering temporal or cross-frame interactions. In contrast, the VAE decoder processes latent features that have undergone spatiotemporal interaction in the U-Net. To ensure the generation of a smooth video, the VAE decoder incorporates temporal layers alongside the spatial layers, mirroring the architecture of the VAE encoder.

In addition to the input video frames, the reference image and the sequence of poses are two other inputs of the model. The reference image is fed into the diffusion model along two separate pathways. One pathway feeds the image into each block of the U-Net: through a visual encoder like CLIP [37], the image feature is extracted and fed into the cross-attention of every U-Net block to control the output results. The other pathway targets the input latent features. Similar to the raw video frames, the input reference image is encoded with the same frozen VAE encoder to get its representation in latent space. The latent feature of the single reference image is then duplicated along the temporal dimension to align with the features of the input video frames. The duplicated latent reference images are concatenated with the latent video frames along the channel dimension and fed into the U-Net for diffusion altogether.

To introduce the pose guidance, PoseNet, implemented with multiple convolution layers, is designed as a trainable module for extracting features from the input sequence of poses. The VAE encoder is not used here because the pixel value distribution of the pose sequence differs from that of the common images on which the VAE autoencoder is trained. With PoseNet, the pose features are extracted and then added element-wise to the output of the first convolution layer of the U-Net. In this way, the influence of the posture guidance takes effect from the very beginning of denoising. We do not add pose guidance to every U-Net block for the following reasons: a) the pose sequence is extracted frame by frame without any temporal interaction, so it may confuse the spatio-temporal layers within the U-Net if it acts on those layers directly; b) excessive involvement of the pose sequence may degrade the performance of the pre-trained image-to-video model.
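The two reference-image pathways and the PoseNet injection described above can be sketched as follows. Module names, channel counts, and shapes are simplified assumptions for illustration; they do not mirror the exact SVD/MimicMotion implementation.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Simplified stand-in for the convolutional pose encoder."""
    def __init__(self, in_ch=3, out_ch=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, out_ch, 3, stride=2, padding=1),
        )

    def forward(self, pose):                      # (B*F, 3, H, W)
        return self.net(pose)                     # (B*F, out_ch, H/8, W/8)

def prepare_unet_inputs(vae_encode, clip_encode, posenet, frames, ref_image, poses):
    """Assemble the U-Net conditioning described in Sec. 3.3 (shapes illustrative).

    frames: (B, F, 3, H, W) video frames, ref_image: (B, 3, H, W),
    poses: (B, F, 3, H, W) pose guidance frames.
    """
    B, F = frames.shape[:2]

    # Frozen VAE encoder applied per frame, with no cross-frame interaction.
    z_frames = vae_encode(frames.flatten(0, 1))               # (B*F, 4, h, w)

    # Pathway 1: reference latent, duplicated along time and concatenated
    # with the frame latents along the channel dimension.
    z_ref = vae_encode(ref_image)                             # (B, 4, h, w)
    z_ref = z_ref.unsqueeze(1).expand(-1, F, -1, -1, -1)      # (B, F, 4, h, w)
    unet_in = torch.cat([z_frames, z_ref.reshape(B * F, *z_ref.shape[2:])], dim=1)

    # Pathway 2: CLIP image feature, fed to cross-attention in every U-Net block.
    clip_feat = clip_encode(ref_image)                        # (B, D)

    # Pose features are added element-wise to the output of the U-Net's first conv.
    pose_feat = posenet(poses.flatten(0, 1))                  # (B*F, 320, h, w)
    return unet_in, clip_feat, pose_feat
```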

3.4 Confidence-aware pose guidance


Inaccurate pose estimation has a negative impact on the model's training and inference. Accurately estimating poses from dynamic videos is challenging, and estimating poses from 2D images is inherently difficult. The limited capability of the pose estimation model, like DWPose [35], is only one part of the reason; the more significant cause is the inherent uncertainty of pose arising from dynamic appearances and motions. Specifically, incorrect pose guidance signals can mislead the model, resulting in the generation of inaccurate or distorted outputs, as illustrated in Figure 8. Moreover, noisy pose guidance signals can lead to overfitting on samples with incorrect poses, potentially causing training instability. This in turn may hinder the model's ability to benefit from extended training schedules.
For this problem, we propose confidence-aware pose guidance, which leverages the confidence scores associated with each keypoint from the pose estimation model. These scores reflect the likelihood of accurate detection, with higher values indicating higher visibility and less occlusion and motion blur. Instead of applying a fixed confidence threshold to filter the keypoints, as commonly adopted in prior works [38, 4], we use brightness on the pose guidance frame to represent the confidence level of pose estimation. Specifically, we integrate the confidence scores of the pose keypoints into their respective drawing colors: we multiply the color assigned to each keypoint and limb by its confidence score. Consequently, keypoints and corresponding limbs with higher confidence scores appear more prominent on the pose guidance map. This method enables the model to prioritize more reliable pose information in its guidance, thereby enhancing the overall accuracy of pose-guided generation.
Figure 3 illustrates this concept, showing how confidence-aware pose frames reflect situations of occlusion and motion blur. In this way, the uncertainty of pose estimation can be conveyed through the pose guidance, making pose guidance more informative. Our ablation studies show the effectiveness of this technique in suppressing visual artifacts, as shown in Figure 8.
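A minimal rendering sketch of this idea is shown below, using OpenCV. The base color, point/line sizes, and the way a limb's two endpoint confidences are combined are assumptions; the DWPose-based renderer in the repository uses its own skeleton definition and palette.

```python
import cv2
import numpy as np

def draw_confidence_aware_pose(keypoints, scores, limbs, canvas_hw=(512, 512)):
    """Render a pose guidance frame whose brightness encodes keypoint confidence.

    keypoints: (K, 2) pixel coordinates, scores: (K,) confidences in [0, 1],
    limbs: list of (i, j) keypoint index pairs (skeleton definition assumed).
    """
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    base_color = np.array([0, 255, 255], dtype=np.float32)    # placeholder color

    for i, j in limbs:
        # Scale the limb color by its confidence (here: the weaker endpoint).
        conf = float(min(scores[i], scores[j]))
        color = tuple(int(v * conf) for v in base_color)
        p1 = (int(keypoints[i][0]), int(keypoints[i][1]))
        p2 = (int(keypoints[j][0]), int(keypoints[j][1]))
        cv2.line(canvas, p1, p2, color, thickness=4)

    for k in range(len(keypoints)):
        # Scale each keypoint color by its own confidence score.
        color = tuple(int(v * float(scores[k])) for v in base_color)
        center = (int(keypoints[k][0]), int(keypoints[k][1]))
        cv2.circle(canvas, center, 4, color, thickness=-1)

    return canvas
```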
Hand region enhancement Moreover, we employ pose estimation and the associated confidence scores to alleviate region-specific artifacts, such as hand distortion, which are prevalent in diffusion-based image and video generation models. Specifically, we identify reliable regions by thresholding keypoint confidence scores. By setting a threshold, we can distinguish between keypoints that are confidently detected and those that may be ambiguous or incorrect due to factors like occlusion or motion blur. Keypoints with confidence scores above the threshold are considered reliable. We implement a masking strategy that generates masks based on this confidence threshold: areas whose confidence scores surpass the predefined threshold are unmasked, thereby identifying reliable regions. When computing the loss of the video diffusion model, the loss values corresponding to the unmasked regions are amplified by a certain scale so they have more effect on model training than the masked regions.
Specifically, to mitigate hand distortion, we compute masks using a confidence threshold for keypoints in the hand region. Only hands with all keypoint confidence scores exceeding this threshold are considered reliable, as a higher score correlates with higher visual quality. We then construct a bounding box around the hand by padding the boundary of these keypoints, and the enclosed rectangle is designated as unmasked. This region is subsequently assigned a larger weight in the loss calculation during training of the video diffusion model. This selective unmasking and weighting biases the model's learning towards hands, especially hands with higher visual quality, effectively reducing distortion and improving the overall realism of the generated content. (In short, high-confidence regions are treated as reliable and given larger loss weights, so the model focuses its learning on them.)
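A minimal sketch of this masking and loss re-weighting is given below. The threshold, padding, and weight values are placeholders and not the paper's actual hyper-parameters.

```python
import torch

def hand_weight_map(hand_keypoints, hand_scores, latent_hw,
                    conf_thresh=0.8, pad=0.05, hand_weight=5.0):
    """Per-pixel loss weights, amplified inside reliable hand bounding boxes.

    hand_keypoints: list of (K, 2) tensors in normalized [0, 1] coordinates,
    one per detected hand; hand_scores: list of (K,) confidence tensors.
    conf_thresh / pad / hand_weight are illustrative values.
    """
    h, w = latent_hw
    weights = torch.ones(h, w)
    for kps, scores in zip(hand_keypoints, hand_scores):
        # Only hands whose keypoints all exceed the threshold are trusted.
        if scores.min().item() < conf_thresh:
            continue
        x0, y0 = kps.min(dim=0).values.tolist()
        x1, y1 = kps.max(dim=0).values.tolist()
        # Pad the bounding box around the hand keypoints.
        x0, y0 = max(0.0, x0 - pad), max(0.0, y0 - pad)
        x1, y1 = min(1.0, x1 + pad), min(1.0, y1 + pad)
        weights[int(y0 * h):int(y1 * h) + 1, int(x0 * w):int(x1 * w) + 1] = hand_weight
    return weights

def weighted_diffusion_loss(eps_pred, eps_true, weights):
    """Element-wise MSE, amplified inside the unmasked (reliable hand) regions."""
    return (((eps_pred - eps_true) ** 2) * weights).mean()
```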

3.5 Progressive latent fusion for long video generation

Limited by computational resources, generating long videos containing a large number of frames is challenging. For this problem, latent fusion during denoising with a DM has been validated by prior works like MultiDiffusion [33], which utilizes latent fusion on overlapped tiles to realize panoramic image generation. A similar idea can be applied to the video generation task. A straightforward approach is to directly apply MultiDiffusion in the time domain, as in Lumiere [34]. Compared with spatial discontinuity between image tiles, viewers are more sensitive to temporal discontinuity, because it can cause noticeable flickering or even abrupt changes in content. For this problem, we propose a progressive approach for generating long videos with high temporal continuity.

Progressive latent fusion is training-free and is integrated into the denoising process of the latent diffusion model at inference time. Figure 4 shows an overview of this process (the VAE is omitted for brevity). The denoising process is carried out in latent space in our method. In general, there are $T$ denoising steps in total, and our latent fusion is applied within each step. For a long given pose sequence, we use a pre-defined strategy to split the whole sequence into segments, each consisting of a fixed number of frames (denoted as $N$), with a certain number ($C$) of overlapped frames between every two adjacent segments. For generation efficiency, it is common to assume $C \ll N$. During each denoising step, the video segments are first denoised separately with the trained model, conditioned on the same reference image and the corresponding sub-sequence of poses.
Algorithm 1 shows the specific details of progressive latent fusion. As inputs, the reference image is denoted as $I_{ref}$ (c.f. Sec. 3.2), and the pose frame corresponding to the $j$-th frame in the $i$-th video segment is denoted as $P_j^i$. We use $z_j^i$ to denote the latent feature of the $j$-th frame in the $i$-th video segment. The denoising process starts from the maximum time step $T$, and the latent features are initialized from a normal distribution $\mathcal{N}(0, I)$. Within each denoising step at time step $t$, the reversed diffusion process defined by the trained model (DM) is applied to the latent features of each video segment $i$ separately, with $z^i$, $I_{ref}$, $P^i$ and $t$ as inputs. During the latent fusion stage, the involved video frames of every two adjacent video segments are then fused. To avoid corrupting temporal smoothness near video segment boundaries after latent fusion, we propose progressive latent fusion: for a video frame involved in latent fusion, its fusion weight is determined by its relative position within the video segment it belongs to. Specifically, a frame closer to the interior of its own segment is assigned a heavier weight. For implementation, a fusion scale is pre-defined as $\lambda_{fusion} = 1/(C + 1)$ to control the level of latent fusion.
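As a concrete reading of this procedure, the sketch below shows one denoising loop followed by progressive fusion of the $C$ overlapped frames between adjacent segments, with weights derived from $\lambda_{fusion} = 1/(C + 1)$. The exact weighting of the paper's Algorithm 1 may differ in detail, and `denoise_step` is a hypothetical wrapper around one reverse-diffusion step of the trained model.

```python
import torch

def progressive_latent_fusion(segments, overlap):
    """Fuse the overlapped latent frames of adjacent segments in place.

    segments: list of latent tensors of shape (N, C, h, w); the last `overlap`
    frames of segment i correspond to the first `overlap` frames of segment i+1.
    A frame's fusion weight grows with its distance from the segment boundary,
    using the fusion scale lambda_fusion = 1 / (overlap + 1).
    """
    lam = 1.0 / (overlap + 1)
    for i in range(len(segments) - 1):
        prev, nxt = segments[i], segments[i + 1]
        for k in range(overlap):
            w_next = (k + 1) * lam            # weight of the following segment
            fused = (1.0 - w_next) * prev[-overlap + k] + w_next * nxt[k]
            prev[-overlap + k] = fused
            nxt[k] = fused
    return segments

def denoise_long_sequence(denoise_step, segments, ref_image, pose_segments, timesteps, overlap):
    """Denoise all segments with a shared reference image, fusing after each step.

    `denoise_step(z, ref_image, poses, t)` is a hypothetical wrapper around one
    reverse-diffusion step of the trained video diffusion model.
    """
    for t in timesteps:                                   # from T down to 1
        segments = [denoise_step(z, ref_image, p, t)
                    for z, p in zip(segments, pose_segments)]
        segments = progressive_latent_fusion(segments, overlap)
    return segments
```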
After iteratively applying the $T$ denoising steps, a merging strategy denoted as Merge is used to obtain the final long sequence from the denoised overlapped video segments in latent space. The Merge function concatenates the multi-segment latents, as described in Listing 1.
Listing 1: Merge merges the 2D list z, which represents overlapped video segments, into one long list zp.
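Since Listing 1 appears only as an image in the original post, here is one possible minimal reading of Merge: after fusion, the overlapped latents of adjacent segments are (near-)identical, so each overlapped position is kept only once. The original listing may instead average the overlapped latents.

```python
def merge(segments, overlap):
    """Concatenate overlapped, already-fused latent segments into one long sequence.

    segments: list of per-segment latents (each indexable by frame); the first
    `overlap` frames of every segment after the first duplicate the tail of the
    previous segment and are therefore skipped.
    """
    merged = list(segments[0])
    for seg in segments[1:]:
        merged.extend(seg[overlap:])
    return merged
```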

4 Experiments

4.1 Implementation details

We train our model on 4,436 human dancing videos collected from the internet. The average length of the training videos is 20.1 s. We adopt the pre-trained weights of the public Stable Video Diffusion 1.1 image-to-video model; PoseNet is trained from scratch. We train our model on 8 NVIDIA A100 GPUs (40 GB) for 20 epochs, with a per-device batch size of 1. The learning rate is $10^{-5}$ with a linear warmup for the first 500 iterations. We tune all the parameters of the U-Net and PoseNet.
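The stated schedule (learning rate $10^{-5}$ with a 500-iteration linear warmup) can be expressed as below. The optimizer choice (AdamW) and the dummy model/loss are assumptions for illustration only; the paper does not specify the optimizer.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 4, 3, padding=1)        # stand-in for the tuned U-Net + PoseNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
warmup_steps = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(1000):                     # placeholder training loop
    x = torch.randn(1, 4, 32, 32)
    loss = ((model(x) - x) ** 2).mean()      # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```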

4.2 Comparison to state-of-the-art methods

We compare our method with the latest state-of-the-art pose-guided human video generation methods, including MagicPose [4], Moore-AnimateAnyone [38], and MuseV [39]. Following previous works [3, 4], we adopt the TikTok [40] dataset and use sequences 335 to 340 for testing.
We provide both qualitative and quantitative comparisons, complemented by a user study. Each method has a different input aspect ratio. To ensure a fair comparison, we only consider the central square region of the videos. Specifically, to accommodate each method's unique input aspect ratio, we individually apply a center crop to the reference image and pose sequence. Then, we extract the center squares from the generated videos for a fair comparison across different methods. This protocol applies to all experiments comparing with state-of-the-art methods.
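For reference, the central-square evaluation crop amounts to something like the following minimal sketch; the exact resizing pipeline is not specified in the post.

```python
import numpy as np

def center_square_crop(frame: np.ndarray) -> np.ndarray:
    """Crop the central square region of an (H, W, C) frame."""
    h, w = frame.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return frame[top:top + s, left:left + s]
```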
Qualitative evaluation We conduct qualitative comparisons between the selected baselines and our method. In Figure 5, we showcase sample frames to highlight the superior quality of individual frames produced by our method. Additionally, in Figure 6, we illustrate the temporal differences, demonstrating the enhanced temporal smoothness of our approach compared to existing methods. Figure 5 presents a comparison of the generated frames, where each row represents a distinct example. The first row demonstrates the superior hand quality achieved by our approach, while the second row showcases the improved adherence to pose guidance. These improvements directly result from our confidence-aware pose guidance and hand region enhancement design. Importantly, our method shows superior temporal smoothness, characterized by smooth motion and minimal flickering. To illustrate this aspect, we present the pixel-wise differences between consecutive frames in Figure 6, which effectively illustrate the temporal stability of our method. From the figure, it is evident that MagicPose [4] exhibits abrupt transitions, Moore-AnimateAnyone [38] shows flickering in the texture of clothing wrinkles, and MuseV [39] struggles with generating consistent text on clothing. In contrast, our method maintains stable inter-frame differences without obvious artifacts, demonstrating better temporal smoothness in our generation results. Videos are included on the project page. This enhancement in temporal smoothness is likely due to the robustness provided by our confidence-aware pose guidance, which effectively mitigates the impact of inaccurate pose inputs and temporal noise. By intelligently weighting the influence of pose signals based on their confidence, our method ensures that the generated videos maintain a high level of temporal smoothness in the presence of noise.

Quantitative evaluation In Table 1, we present a quantitative comparison of our method against state-of-the-art approaches using the FID-VID [41] and FVD [42] metrics on the test sequences from the TikTok [40] dataset. The results indicate that our method achieves a notable performance advantage over all existing methods in terms of both metrics.
User study To supplement our quantitative and qualitative evaluations, we conduct a user study to assess the subjective preferences of participants regarding the generated videos on the TikTok dataset test split. The study involves showing two video clips (one generated by our method, the other by one of the baseline methods) to a diverse group of users. Participants are instructed to select the video that they perceive as having higher quality, considering factors such as image quality, flickering, and the temporal smoothness of characters and clothing. We collected data from 36 participants, with each participant evaluating 6 video pairs for our method against each baseline method. As shown in Figure 7, the results indicate a strong preference for MimicMotion over the baseline methods. In comparison to MagicPose and Moore, the participants favored almost all videos produced by our method. Although MuseV shows higher image quality than the other baselines, the preference for videos produced by our method still reached 75.5%. These findings align with our qualitative and quantitative evaluations, reinforcing the effectiveness of our method in meeting user expectations for high-quality human video generation.

4.3 Ablation Study


Confidence-aware pose guiding Figure 8 shows the effectiveness of confidence-aware pose guidance. Each row corresponds to one example. On the left side, we show three images used to extract the pose. On the right side, we plot the guiding signals corresponding to the pose estimation, both with and without confidence-aware pose guiding. From the guiding signals, we can see that there are errors in the poses estimated by DWPose. Nevertheless, our confidence-aware design minimizes the impact of incorrect pose estimation in the guidance signals. Specifically, in the case of Pose 1, the estimation exhibits a duplicate-detection issue, which leads to the inclusion of duplicate keypoints. In the case of Pose 2, one hand is obscured and its keypoints are incorrectly estimated on the other hand. In the case of Pose 3, the right elbow is obscured but is still detected with confidence above the threshold, so it falsely remains in the guidance signal. These problems lead to confusing hand guidance signals and ultimately to distortions such as deformed hands or wrong spatial relationships in the generated frames. In contrast, by integrating confidence scores into the pose representation, our method effectively mitigates these issues. The confidence scores provide a measure of reliability for each keypoint, allowing the system to weigh the guidance signals accordingly. Specifically, keypoints with lower confidence, which typically correspond to inaccurate keypoints caused by occlusion or motion blur, are given less significance in the guidance. This approach leads to clearer and richer pose guidance, as the influence of potentially erroneous keypoints is reduced. The corresponding generation results demonstrate how our method enhances the robustness of generation against false guiding signals (Pose 1 and Pose 2) and offers visibility hints to resolve the front-back ambiguity of 2D pose estimation (Pose 3).
Hand region enhancement In conjunction with confidence-aware pose guiding, we further improve the quality of hand generation by assigning a higher weight to the hand region in the training loss.

Figure 9 compares the generation results with and without hand region enhancement, using the same reference image and pose guidance. All experiments incorporate confidence-aware pose guidance. The hands in the first row are cropped from the generated video frames of a model trained without hand region enhancement, and they exhibit noticeable distortions, such as irregular and misplaced fingers. In contrast, the results of the model trained with hand region enhancement (second row) show consistent improvements in hand generation quality and a reduction in hand distortion. These results demonstrate the effectiveness of the proposed hand region enhancement design, which substantially mitigates hand distortion, a prevalent challenge in diffusion-based models. Moreover, hand region enhancement improves the visual appeal of the generated content. The hand region is often an area that human observers tend to focus on; by emphasizing the hand regions, we align the regional preferences of the training process with human preferences, thereby enhancing the visual appeal of the generation results.

Progressive latent fusion To achieve seamless transitions between video segments, we introduce progressive latent fusion, a technique that gradually blends frames in the overlapped regions of consecutive video segments. The original MultiDiffusion approach employs a simple averaging of frames within the overlap region. As illustrated in Figure 10a, this method assigns equal weight to all frames in the overlap region, irrespective of their temporal position (whether they are closer to the preceding or the subsequent segment). The lack of a gradual transition in influence from one segment to another can cause abrupt transitions and noticeable flickers in the video. This is evident in the y-t slice shown on the left, where the transition at segment boundaries is abrupt. The right side of the figure shows four frames at the segment boundary: note that the background in the top-left corner (enlarged) is initially clear in segment 1, suddenly becomes blurry in the overlapped region, and then suddenly reverts to being clearer in the main part of segment 2. This artifact is not observed when progressive latent fusion is applied. The proposed progressive latent fusion approach (see Figure 10b) effectively mitigates these issues. The y-t slice on the left demonstrates that this method enables a smooth transition across segment boundaries, eliminating the abrupt changes seen in the original approach, and the right side of the figure shows that the sudden blurring no longer appears. This strategy significantly mitigates flickering artifacts, thus improving the overall visual temporal coherence for long video generation.

5 Conclusion

In this study, we introduce MimicMotion, a pose-guided human video generation model that leverages confidence-aware pose guidance and progressive latent fusion to produce high-quality, long videos with human motion guided by pose. Through extensive experiments and ablation studies, we show that our model adapts better to noisy pose estimation, enhances hand quality, and ensures temporal smoothness. The integration of confidence scores into pose guidance, the enhanced hand-region loss, and the implementation of progressive latent fusion are crucial in achieving these improvements, resulting in more visually compelling and realistic human video generation.
