A Detailed Walkthrough of the Paper "MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling"

Many people have recently been discussing the MIMO results published by Alibaba. Perceptually, the results look strong both in quality and in practical applicability, so I decided to read the paper and spend some time writing down my understanding of it.

Reference: https://menyifang.github.io/projects/MIMO/index.html

Results Overview

Original:
Figure 1. Given a single reference character, MIMO can synthesize animated avatars in driving 3D poses retrieved from motion datasets (left) or extracted from in-the-wild videos (right). Real-world scenes from driving videos can also be integrated into the synthesis with natural human-object interactions. MIMO simultaneously achieves advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework.
Translation:
As the figure below shows, given a single reference character image, MIMO drives it with 3D poses retrieved from motion datasets (left) or extracted from in-the-wild videos (right). Real-world scenes from the driving videos are then blended with the animated character, producing natural human-object interactions. MIMO simultaneously achieves scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework.
[Figure 1]

Paper Translation and Per-Section Notes

Abstract

Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability of modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle for pose generality and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video to compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip to three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded to canonical identity code, structured motion code and full scene code, which are utilized as control signals of the synthesis process. The design of spatial decomposed modeling enables flexible user control, complex motion expression, as well as 3D-aware synthesis for scene interactions. Experimental results demonstrate the proposed method's effectiveness and robustness.

Translation:
Abstract
Character video synthesis aims to produce realistic videos in which characters can be animated within lifelike scenes. As a fundamental problem for the computer vision and graphics communities, 3D approaches typically require multi-view captures and per-case training, which severely limits their ability to model arbitrary characters in a short time. Recent 2D methods break this limitation with pre-trained diffusion models, but they still struggle with pose generality and scene interaction. To this end, the authors propose MIMO, a novel framework that can synthesize character videos whose attributes (character, motion and scene) are controlled by simple user inputs, while simultaneously achieving scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in one unified framework. The core idea is to encode the 2D video into compact spatial codes that respect the inherent 3D nature of what a video records. Concretely, the 2D frame pixels are lifted into 3D using a monocular depth estimator, and the video clip is decomposed into three spatial components (main human, underlying scene, and floating occlusion) arranged in hierarchical layers according to 3D depth. These components are then encoded into a canonical identity code, a structured motion code and a full scene code, which serve as the control signals of the synthesis process. This spatially decomposed modeling enables flexible user control, expressive complex motion, and 3D-aware synthesis for scene interactions. Experiments demonstrate the effectiveness and robustness of the proposed method.

1 Introduction

Character video synthesis, an essential topic in areas of Computer Vision and Computer Graphics, has huge potential applications for movie production, virtual reality, and animation. While recent video generative models [2, 5, 6, 11, 28, 33] have achieved great progress with text or image guidance, none of them fully captures the underlying attributes (e.g., appearance and motion of instance and scene) in a video and provides flexible user controls. Meanwhile, they still struggle for reasonable character synthesis in challenging scenarios, such as extreme 3D motions and complex object interactions accompanied by occlusions.
Summary:
Reviews prior work on text- or image-guided video generation and points out its remaining problems.
The aim of this paper is to propose a brand-new and boosting method for controllable video synthesis, which can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by very simple user inputs, but also achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework (see Figure 1). In other words, the proposed method is capable of mimicking anyone anywhere with complex motions and object interactions, thus named MIMO. As more concretely illustrated in Figure 2, users are allowed to feed multiple inputs (e.g., a single image for character, a pose sequence for motion, and a single video/image for scene) to provide desired attributes respectively, or a direct driving video as input. The proposed model can embed target attributes into the latent space to construct target codes or encode the driving video with spatial-aware decomposition as spatial codes, thus enabling intuitive attribute control of the synthesis by freely integrating latent codes in a specific order.
Summary:
States the goal of the paper and introduces how the method works with the help of Figure 2.

[Figure 2]
[Advantages] Our task setting significantly decreases the cost of video creation and enables wide applications for not only character animation, but also video attribute editing (e.g., character replacement, motion transfer and scene insertion). However, it is extremely challenging due to the simplicity of user inputs, the complexity of real-world scenarios and the absence of annotation for 2D videos.
[Related 3D work] With the great progress of 3D neural representations (e.g., NeRF [22] and 3D Gaussian splatting [12]), a series of works [8, 15, 17, 23, 27] tend to represent the dynamic human as a pose-conditioned NeRF or Gaussian to learn animatable avatars in high-fidelity rendering quality. [Their limitations] However, they typically require fitting a neural field to multi-view captures or a monocular video of dynamic performers, which severely limits their applicability due to inefficient training and expensive data acquisition.
[Faster 3D alternatives] Other 3D works explored faster and cheaper solutions by directly inferring 3D models from single human images, followed by rigged animation and physical rendering [9, 10, 16, 21]. [Their limitations] Unfortunately, the realism of the renderings is marginally compromised due to cumulative errors in the sequential processes.
[2D diffusion-based methods] Recently, several efforts [7, 26, 30, 37] have investigated the potential of 2D diffusion models for image-guided character video synthesis. They show that high-fidelity character synthesis can be achieved by inserting image features via a reference-net [7, 37] or control-net [30, 32] into a pretrained diffusion model. [Remaining problems] However, they only focus on character animation in simple 2D motions (e.g., frontal dancing) and are less effective for articulated human motion in 3D space, with limited pose generality. Moreover, they fail to produce lifelike video with complicated scenes accompanied by human-object interactions.
[Analysis of the cause] We argue that these difficulties stem from video attribute parsing that is considered only in 2D feature space, thereby disregarding the inherent 3D nature of video occurrence.
[This paper's idea] To tackle these challenges, we propose a novel framework for controllable character video synthesis via spatial decomposed modeling. The core idea is to decompose and encode the 2D video in a 3D-aware manner and employ more adequate expressions (e.g., 3D representations) for articulated properties. In contrast to previous works [7, 34] that directly learn the whole 2D feature at each video frame, we lift the 2D frame pixels into 3D and construct decomposed spatial representations in 3D space, which carry richer contextual information and can be used as control signals of the synthesis process. Specifically, we decompose the video clip into three spatial components (scene, human and occlusion) in hierarchical layers based on 3D depth. In particular, human represents the main object in the video, scene represents the underlying background, and occlusion traces floating foreground objects. For the human component, we further disentangle the identity property via canonical appearance transfer and encode the 3D motion representation via structured body codes. The scene and occlusion components are embedded with a shared VAE encoder and re-organized as a full scene code. The decomposed latent codes are inserted as conditions of a diffusion-based decoder to reconstruct the video clip.
In this way, the network learns not only controllable synthesis of various attributes, but also 3D-aware composition of main object, foreground and background.
Thereby, it enables flexible user controls as well as challenging cases of complicated 3D motions and natural object interactions. In summary, [core contributions] our contributions are threefold:
• We present a new approach capable of synthesizing realistic character videos with controllable attributes by directly providing simple user inputs, and simultaneously achieving advanced scalability, generality and applicability in a unified framework.
(Generates character videos according to controllable attributes given by simple user inputs, while achieving advanced scalability, generality and applicability in one unified framework.)
• We propose the Spatial Decomposed Diffusion, a novel video generative model with automatic separation of spatial attributes, which enables not only flexible user control, but also 3D-aware synthesis for scene interactions.
(Proposes Spatial Decomposed Diffusion, a novel video generative model that automatically separates spatial attributes, enabling not only flexible user control but also 3D-aware synthesis for scene interactions.)
• We tackle the challenge of inadequate pose representation for articulated humans by introducing structured body codes to express complex motions in spatial space, making an advanced generality to novel 3D motions.
(Addresses the inadequate pose representation of articulated humans by introducing structured body codes that express complex motions in 3D space, yielding strong generality to novel 3D motions.)
Summary:
By reviewing earlier work and its shortcomings, the introduction leads into the core idea and main contributions of this paper.

2 Method Description

Our goal is to synthesize high-quality character videos with user-controlled visual attributes, such as character, motion and scene. The desired attributes can be automatically extracted from an in-the-wild character video or simply provided by a single image, a pose sequence, and a single video/image, respectively. [Prior methods and their drawbacks] Different from previous methods using only weak control signals (e.g., text prompts) [19, 35] or insufficient 2D expressions [7, 34], [what this method achieves] our model achieves automatic and unsupervised separation of spatial components and encodes them into compact latent codes, considering their inherent 3D nature, to control the synthesis. Thus, our dataset need only contain 2D character videos $\{v \in \mathbb{R}^{N \times H \times W}\}$ without any annotations.
[Figure 3]

The overview of the proposed framework is illustrated in Figure 3. [Walkthrough of Figure 3] Given a video clip v, MIMO learns a reconstruction process with automatic attribute encoding and composed condition decoding. Considering the 3D nature of video occurrence, we extract three spatial components in hierarchical layers based on 3D depth (Section 2.1). The first component, human, is encoded with disentangled properties of identity and motion (Section 2.2). The last two components, scene and occlusion, are embedded with a shared encoder and re-organized as a latent code (Section 2.3). These latent codes C are inserted into a diffusion decoder D as composed conditions (Section 2.4). C and D are jointly learned by minimizing the difference between the synthesized frames and the input frames at the noise level (Section 2.5).
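To make the flow of Figure 3 concrete, here is a minimal Python sketch of one training step. Every MIMO component is passed in as a hypothetical callable; the names below are my own placeholders and not the authors' code.

```python
# Minimal sketch of the training flow in Figure 3, under my own reading of
# the paper. All component callables are hypothetical placeholders.
def mimo_training_step(video_clip,
                       decompose_layers,        # Section 2.1
                       encode_identity,         # Section 2.2 (canonical identity)
                       encode_motion,           # Section 2.2 (structured motion)
                       encode_scene_occlusion,  # Section 2.3
                       diffusion_loss):         # Sections 2.4-2.5
    # 1) Lift frames to 3D via monocular depth and split the clip into
    #    human / occlusion / scene layers.
    v_h, v_o, v_s = decompose_layers(video_clip)

    # 2) Encode the human layer into disentangled identity and motion codes.
    c_id = encode_identity(v_h)
    c_mo = encode_motion(v_h)

    # 3) Encode scene and occlusion with a shared VAE and merge them into
    #    the full scene code.
    c_so = encode_scene_occlusion(v_s, v_o)

    # 4) Condition the diffusion decoder on (c_id, c_mo, c_so) and compute
    #    the noise-prediction loss against the input clip.
    return diffusion_loss(video_clip, c_id, c_so, c_mo)
```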

2.1. Hierarchically spatial layer decomposition

Considering the inherent 3D elements of video composition, we split a video $v = \{I_t \mid t = 1, \dots, N\}$ into three main components: human as the core performer, scene as the underlying background, and occluded objects as the floating foreground. To automatically decompose them, we lift 2D pixels into 3D and track detected objects in hierarchical layers based on their corresponding depth values.
[Extracting the human layer] To start with, for each frame $I_t \in v$, we obtain its monocular depth map using a pretrained monocular depth estimator [31]. The human layer is first extracted with human detection [29] and propagated to the video volume via a video tracking method [24], yielding $M^h \in \mathbb{R}^{N \times H \times W}$, a binary mask sequence along the time axis (i.e., a masklet). [Extracting the occlusion layer] Subsequently, we extract the occlusion layer from objects whose mean depth values are smaller than that of the human layer, and generate masklet predictions $M^o$ via a video tracker. [Extracting the scene layer] The scene layer is obtained by removing the human and occlusion objects, defined by the scene masklet $M^s$. [Putting the three layers into one formula] With the predicted masklets, we can compute the decomposed video of component $i$ by multiplying the original source video with the component masklet $M^i$:
$$v^i = v \odot M^i, \quad i \in \{h, o, s\}, \tag{1}$$
where $\odot$ denotes the element-wise product. Each $v^i$ is then fed into the corresponding branch for human, scene and occlusion encoding, respectively.
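The depth-based split and Eq. (1) can be illustrated with a short NumPy sketch. The masklets and depth maps are assumed to come from the detection, tracking and depth models cited above; this is only my reading of the procedure, not the authors' implementation.

```python
import numpy as np

def decompose_layers(video, human_masklet, object_masklets, depth_maps):
    """Illustrative (not the authors') implementation of the layer split.

    video:           (N, H, W, 3) float array, the source clip v
    human_masklet:   (N, H, W) binary masklet M^h from detection + tracking
    object_masklets: list of (N, H, W) binary masklets for other tracked objects
    depth_maps:      (N, H, W) monocular depth (smaller value = closer)
    """
    # Mean depth of the tracked human over the whole clip.
    human_depth = depth_maps[human_masklet > 0].mean()

    # Objects whose mean depth is smaller than the human's are treated as
    # floating foreground, i.e. the occlusion layer M^o.
    occ_masklet = np.zeros_like(human_masklet)
    for m in object_masklets:
        if depth_maps[m > 0].mean() < human_depth:
            occ_masklet = np.maximum(occ_masklet, m)

    # Whatever is neither human nor occlusion belongs to the scene layer M^s.
    scene_masklet = 1 - np.maximum(human_masklet, occ_masklet)

    # Eq. (1): v^i = v ⊙ M^i for i in {h, o, s}.
    v_h = video * human_masklet[..., None]
    v_o = video * occ_masklet[..., None]
    v_s = video * scene_masklet[..., None]
    return v_h, v_o, v_s
```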

2.2. Disentangled human encoding

This branch aims to encode the human component $v^h$ into the latent space as disentangled codes $C_{id}$ and $C_{mo}$ for identity and motion. [Prior approaches] Previous works [7, 30, 34] typically select one random frame from the video clip as the appearance representation, and employ an extracted 2D skeleton with keypoints as the pose representation. [Their shortcomings] Essentially, this design has two core issues which may limit the network's performance: 1) It is hard for a 2D pose to adequately express motions that take place in 3D space, especially articulated ones accompanied by exaggerated deformations and frequent self-occlusions. 2) The postures of frames across a video are highly similar, and there inevitably exists entanglement between the appearance frame and the target frame, both retrieved from the same posed video. [This paper's approach] Thereby, we introduce new 3D representations of motion and identity for adequate expression and full disentanglement.
Structured motion. We define a set of latent codes $Z = \{z_1, z_2, \dots, z_{n_v}\}$ and anchor them to the corresponding vertices of a deformable human body model (SMPL) [18], where $n_v$ is the number of vertices. For frame $t$, SMPL parameters $S_t$ and camera parameters $C_t$ are estimated from the monocular video frame $v^h_t$ using [3]. The spatial locations of the latent codes are then transformed based on the human pose $S_t$ and projected to the 2D plane based on the camera setting $C_t$. Using a differentiable rasterizer [14] with vertex interpolation, the 2D feature map $F_t$ in continuous values can be obtained. $\{F_t, t = 1, \dots, N\}$ are stacked along the time axis and embedded into the latent space as the motion code $C_{mo}$ by a pose encoder. [Intended effect] In this way, we establish a correspondence that maps the same set of latent codes on the underlying 3D body surface to the posed 2D renderings at different frames of arbitrary videos. These structured body codes enable a denser pose representation with 3D occlusions.
Summary of this paragraph: the SMPL model provides body vertices and camera parameters; using the camera parameters, the per-vertex 3D latent codes are projected into 2D feature maps. (Personally, I think a smoothing model could further improve the continuity between frames.)
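Below is a hedged sketch of the structured-motion branch, where `smpl_model`, `project`, `rasterize` and `pose_encoder` are placeholder callables for the SMPL model [18], the camera projection, the differentiable rasterizer [14] and the pose encoder; none of these names come from the paper's code.

```python
import torch

def structured_motion_code(smpl_params, cam_params, latent_codes,
                           smpl_model, project, rasterize, pose_encoder):
    """Sketch of the structured motion code, under stated assumptions.

    latent_codes: (n_v, d) learnable codes, one per SMPL vertex.
    """
    feature_maps = []
    for S_t, C_t in zip(smpl_params, cam_params):
        # Pose the body for frame t; each vertex keeps its latent code z_i.
        verts_3d = smpl_model(S_t)                 # (n_v, 3)
        # Project the posed vertices into the image plane of camera C_t.
        verts_2d = project(verts_3d, C_t)          # (n_v, 2)
        # Render the per-vertex codes into a continuous 2D feature map F_t.
        F_t = rasterize(verts_2d, latent_codes)    # (d, H, W)
        feature_maps.append(F_t)

    # Stack F_1..F_N along time and embed them as the motion code C_mo.
    return pose_encoder(torch.stack(feature_maps, dim=0))
```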

Canonical identity. To fully disentangle the appearance from posed video frames, an ideal solution is to learn a dynamic human representation from the monocular video and transform it from the posed space to the canonical space. For efficiency, we employ a simplified method that directly transforms the posed human image into a canonical result in a standard A-pose using a pretrained human repose model. The synthesized canonical appearance image is fed to ID encoders to obtain the identity code $C_{id}$. This simple design enables full disentanglement of identity and motion attributes. Following [7], the ID encoders include a CLIP image encoder and a reference-net architecture to embed the global and local features, respectively, which together compose $C_{id}$.
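The identity branch can be summarized in a few lines. `repose_model`, `clip_encoder` and `reference_net` below are placeholders for the pretrained modules named in the text, and the sketch only reflects how I understand the data flow.

```python
def canonical_identity_code(posed_human_frame,
                            repose_model, clip_encoder, reference_net):
    """Sketch of the identity branch; all three modules are placeholders."""
    # Transform the posed character into a standard A-pose appearance image,
    # removing the pose information from the appearance.
    canonical_image = repose_model(posed_human_frame)

    # Global semantic feature (CLIP) + local appearance features
    # (reference-net) together form the identity code C_id.
    global_feat = clip_encoder(canonical_image)
    local_feats = reference_net(canonical_image)
    return global_feat, local_feats
```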

2.3. Scene and occlusion encoding

In the scene and occlusion branches, we use a shared and fixed VAE encoder [13] to embed $v^s$ and $v^o$ into the latent space as the scene code $C_s$ and the occlusion code $C_o$, respectively. Before feeding $v^s$ in, we pre-recover it with a video inpainting method [36] to obtain $R(v^s)$ (i.e., the occluded parts of the background are filled in), avoiding the confusion brought by mask contours. The scene code $C_s$ and the occlusion code $C_o$ are then concatenated to obtain the full scene code $C_{so}$ for composed synthesis. The independent encoding of the spatial components (i.e., middle human, underlying scene, and floating occlusion) enables the network to learn automatic layer composition, thus achieving natural character insertion into complicated scenes even with occluded object interactions.
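A minimal sketch of this branch, assuming placeholder callables for the video inpainting model [36] and the shared VAE encoder [13]; the concatenation axis is my assumption.

```python
import torch

def full_scene_code(v_s, v_o, vae_encoder, inpainter):
    """Sketch of Section 2.3 under stated assumptions."""
    # Recover the background behind the removed human/occlusion regions
    # first, so the encoder never sees hard mask contours.
    v_s_recovered = inpainter(v_s)

    c_s = vae_encoder(v_s_recovered)   # scene code C_s
    c_o = vae_encoder(v_o)             # occlusion code C_o

    # Concatenate the two codes into the full scene code C_so.
    return torch.cat([c_s, c_o], dim=1)
```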

2.4. Composed decoding

[Figure 4]
[Walkthrough of Figure 4] Given the latent codes of the decomposed attributes, we recompose them as conditions of the diffusion-based decoder for video reconstruction. As shown in Figure 4, we adapt a denoising U-Net backbone built upon Stable Diffusion (SD) [25] with temporal layers from [4]. The full scene code $C_{so}$ is concatenated with the latent noise and fed into a 3D convolution layer for fusion and alignment. The motion code $C_{mo}$ is added to the fused feature and input to the denoising U-Net. For the identity code $C_{id}$, its local feature and global feature are inserted into the U-Net via self-attention layers and cross-attention layers, respectively. Finally, the denoised result is converted into the video clip $\hat{v}$ via a pretrained VAE decoder [13].
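The conditioning path of Figure 4 might look roughly as follows; `fuse_3d_conv` and `unet` are placeholders, and the keyword arguments for the attention conditions are invented for illustration.

```python
import torch

def composed_denoise(latent_noise, c_so, c_mo, c_id, fuse_3d_conv, unet):
    """Sketch of how the codes condition the denoising U-Net (assumptions)."""
    # The full scene code is concatenated with the latent noise and passed
    # through a 3D convolution for fusion and alignment.
    x = fuse_3d_conv(torch.cat([latent_noise, c_so], dim=1))

    # The motion code is added to the fused feature.
    x = x + c_mo

    # The identity code enters the U-Net through self-attention (local
    # reference-net features) and cross-attention (global CLIP feature).
    global_feat, local_feats = c_id
    return unet(x, self_attn_cond=local_feats, cross_attn_cond=global_feat)
```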

2.5. Training

We initialize the model of denoising U-Net and referencenet based on the pretrained weights from SD 1.5 [25], whereas the motion module is initialized with the weights of AnimateDiff [4]. During training, the weights of VAE encoder and decoder, as well as the CLIP image encoder are frozen. We optimize the denoising U-Net, pose encoder and reference-net with the diffusion noise-prediction loss:
$$L = \mathbb{E}_{x_0, c_{id}, c_{so}, c_{mo}, t,\, \epsilon \sim \mathcal{N}(0,1)}\big[\,\|\epsilon - \epsilon_\theta(x_t, c_{id}, c_{so}, c_{mo}, t)\|_2^2\,\big],$$
where $x_0$ is the augmented input sample, $t$ denotes the diffusion timestep, $x_t$ is the noised sample at $t$, and $\epsilon_\theta$ represents the function of the denoising U-Net. We conduct training on 8 NVIDIA A100 GPUs. It takes around 50k iterations with 24 video frames and a batch size of 4 to converge.
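As a sanity check of the objective, here is a hedged sketch of the noise-prediction loss with a DDPM-style scheduler standing in for the SD 1.5 noising process; the `scheduler` and `denoiser` interfaces are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def noise_prediction_loss(x0, c_id, c_so, c_mo, scheduler, denoiser):
    """Sketch of the training objective above, under stated assumptions."""
    # Sample a diffusion timestep t and Gaussian noise epsilon.
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)

    # Produce the noised sample x_t from the clean latents x_0.
    x_t = scheduler.add_noise(x0, eps, t)

    # Predict the noise from x_t and the composed condition codes,
    # then regress it with an L2 loss.
    eps_pred = denoiser(x_t, c_id, c_so, c_mo, t)
    return F.mse_loss(eps_pred, eps)
```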

3. Experimental Results

Dataset. We create a human video dataset called HUD-7K to train our model. This dataset consists of 5K real character videos and 2K synthetic character animations. The former does not require any annotations and can be automatically decomposed to various spatial attributes via our scheme. To enlarge the range of the real dataset, we also synthesize 2K videos by rendering character animations in complex motions under multiple camera views, utilizing En3D [21]. These synthetic videos are equipped with accurate annotations due to completely controlled production.
Summary:
Data: 5K real character videos plus 2K synthetic character animations (the synthetic videos were rendered with En3D to enlarge the dataset and come with accurate annotations).

3.1. Controllable character video synthesis

Given the target attributes of character, motion and scene, our method can generate realistic video results with their latent codes combined for guided synthesis. The target attributes can be provided by simple user inputs (e.g., single images/videos for character/scene, pose sequences from large databases [1, 20] for motion) or flexibly extracted from real-world videos, involving complicated scenes with occluded object interactions and extreme articulated motions. In the following, MIMO demonstrates that it can simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to in-the-wild scenes in a unified framework.

3.1.1 Arbitrary character control

As shown in Figure 5, our method can animate arbitrary characters, including realistic humans, cartoon characters and personified ones. Various body shapes of characters can be faithfully preserved due to the decoupled pose and shape parameters in our structured motion representation.
Validates the results on real humans, cartoon characters and anthropomorphic characters.
[Figure 5]

3.1.2 Novel 3D motion control

To verify the generality to novel 3D motions, we test MIMO using challenging out-of-distribution pose sequences from the AMASS [20] and Mixamo [1] database, including dancing, playing and climbing (Figure 6 (a)). We also try complex spatial motions in 3D space by extracting them from in-the-wild human videos (Figure 6 (b)). Our method exhibits high robustness for these novel 3D motions under different viewpoints.
Validates more complex motions: (a) shows dancing, playing and climbing from motion databases; (b) shows complex spatial motions extracted from in-the-wild videos.
[Figure 6]

3.1.3 Interactive scene control

We validate the applicability of our model to complicated real scenes by extracting scene and motion attributes from in-the-wild videos for character animation (i.e., the task of video character replacement). As shown in Figure 7, the character can be seamlessly inserted into the real scenes with natural object interactions, owing to our spatial-aware synthesis for hierarchical layers.
Validates inserting the animated character into real-world scenes.
[Figure 7]

4. Conclusions

In this paper, we presented MIMO, a novel framework for controllable character video synthesis, which allows for flexible user control with simple attribute inputs. [Method recap] Our method introduces a new generative architecture which decomposes the video clip into various spatial components and embeds their latent codes as the conditions of a decoder to reconstruct the video clip.
[Experimental evidence] Experimental results demonstrated that our method enables not only flexible character, motion and scene control, but also advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive scenes. We also believe that our solution, which considers the inherent 3D nature of video and automatically encodes 2D video into hierarchical spatial components, could inspire future research on 3D-aware video synthesis. [Outlook] Furthermore, our framework is not only well suited to generating character videos but can also potentially be adapted to other controllable video synthesis tasks.
