AIGC - Latent Video Diffusion Models for High-Fidelity Long Video Generation (arXiv 2022)

homepage:
https://yingqinghe.github.io/LVDM/
paper:
https://arxiv.org/abs/2211.13221
Reference blog:
https://blog.csdn.net/Are_you_ready/article/details/136615853

LVDM: a latent-space video generation model

MOTIVATION

  • Diffusion models have shown remarkable results recently but require significant computational resources
  • GANs suffer from mode collapse and training instability problems, which makes GAN-based approaches hard to scale up to handle complex and diverse video distributions
  • TATS proposes an autoregressive approach that leverages the VQGAN and transformers to synthesize long videos. However, the generation fidelity and resolution (128×128) still have much room for improvement

CONTRIBUTION

  • We introduce LVDM, an efficient diffusion-based baseline approach for video generation by firstly compressing videos into tight latents.
  • We propose a hierarchical framework that operates in the video latent space, enabling our models to further generate longer videos beyond the training length.
  • We propose conditional latent perturbation and unconditional guidance for mitigating the performance degradation issue during long video generation.
  • Our model achieves state-of-the-art results on three benchmarks in both short and long video generation settings. We also provide appealing results for open-domain text-to-video generation, demonstrating the effectiveness and generalization of our models.

RELATED WORKS

METHODS

Overview


  • First, we compress video samples into a lower-dimensional latent space with a video autoencoder. The temporal dimension is compressed as well: the autoencoder uses separate spatial and temporal downsampling factors.

  • Then we design a unified video diffusion model in the latent space, which can perform both unconditional and conditional video generation in one network. This enables our model to self-extend the generated video to an arbitrary length autoregressively.

  • To further improve the coherence of generated long videos and alleviate the quality degradation problem:

    • we propose hierarchical latent video diffusion models to first generate video latents sparsely and then interpolate the intermediate latents.
    • We also propose conditional latent perturbation and unconditional guidance for tackling the performance degradation problem in long video generation.

Video Autoencoder

  • Encoder $\mathcal{E}$: given a video sample $\mathbf{x}_0 \sim p_{data}(\mathbf{x}_0)$, where $\mathbf{x}_0 \in \mathbb{R}^{H \times W \times L \times 3}$

    • the encoder $\mathcal{E}$ encodes it to its latent representation $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0)$
    • $\mathbf{x}_0$: the video
    • $\mathbf{z}_0 \in \mathbb{R}^{h \times w \times l \times c}$
    • $h = H/f_s$, $w = W/f_s$, and $l = L/f_t$
    • $f_s$ and $f_t$ are the spatial and temporal downsampling factors.
  • Decoder $\mathcal{D}$:

    • The decoder $\mathcal{D}$ decodes $\mathbf{z}_0$ to the reconstructed sample $\tilde{\mathbf{x}}_0 = \mathcal{D}(\mathbf{z}_0)$.
    • Both the encoder and decoder consist of several layers of 3D convolutions.
  • To ensure that the autoencoder is temporally shift-equivariant, we use repeat padding in all 3D convolutions (a minimal sketch of this design follows the training objective below).

  • training objective:
    $$\mathcal{L}_{AE}=\min_{\mathcal{E},\mathcal{D}}\max_{\psi}\big(\mathcal{L}_{rec}(\mathbf{x}_{0},\mathcal{D}(\mathcal{E}(\mathbf{x}_{0})))+\mathcal{L}_{adv}(\psi(\mathcal{D}(\mathcal{E}(\mathbf{x}_{0}))))\big)$$

    • $\mathcal{L}_{rec}$: reconstruction loss, comprised of a pixel-level mean-squared error (MSE) loss and a perceptual-level LPIPS loss.
    • $\mathcal{L}_{adv}$: adversarial loss, used to eliminate the blur in reconstructions usually caused by the pixel-level reconstruction loss and to further improve the realism of the reconstruction.
    • $\psi$: the discriminator used in adversarial training.
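The shape bookkeeping above can be made concrete with a small sketch. Below is a minimal PyTorch illustration (assumed layer counts, channel widths, and downsampling factors; not the authors' released implementation) of a 3D-convolutional encoder/decoder pair with replicate ("repeat") padding and separate spatial/temporal strides:

```python
# Minimal sketch of a 3D-conv video autoencoder with f_s = 4 and f_t = 2 (illustrative values).
import torch
import torch.nn as nn


class Encoder3D(nn.Module):
    def __init__(self, in_ch=3, latent_ch=4, base_ch=64):
        super().__init__()
        # Strided 3D convolutions downsample (L, H, W); replicate padding keeps the
        # autoencoder temporally shift-equivariant ("repeat padding").
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, base_ch, 3, stride=(2, 2, 2), padding=1, padding_mode="replicate"),
            nn.SiLU(),
            nn.Conv3d(base_ch, 2 * base_ch, 3, stride=(1, 2, 2), padding=1, padding_mode="replicate"),
            nn.SiLU(),
            nn.Conv3d(2 * base_ch, latent_ch, 3, stride=1, padding=1, padding_mode="replicate"),
        )

    def forward(self, x):          # x: (B, 3, L, H, W)
        return self.net(x)         # z: (B, c, L/f_t, H/f_s, W/f_s)


class Decoder3D(nn.Module):
    def __init__(self, latent_ch=4, out_ch=3, base_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 2 * base_ch, (3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(2 * base_ch, base_ch, 4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base_ch, out_ch, 3, stride=1, padding=1, padding_mode="replicate"),
        )

    def forward(self, z):
        return self.net(z)


if __name__ == "__main__":
    x0 = torch.randn(1, 3, 16, 64, 64)             # (B, 3, L, H, W)
    z0 = Encoder3D()(x0)                           # (1, 4, 8, 16, 16): l = L/2, h = H/4, w = W/4
    x_rec = Decoder3D()(z0)                        # (1, 3, 16, 64, 64)
    print(z0.shape, x_rec.shape)
    # Training would pair an MSE + LPIPS reconstruction loss with an adversarial loss (omitted here).
```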

Base LVDM for Short Video Generation

Revisiting Diffusion Models

  • forward process:
    $$q(\mathbf{z}_{1:T}|\mathbf{z}_0):=\prod_{t=1}^{T}q(\mathbf{z}_t|\mathbf{z}_{t-1}),\qquad q(\mathbf{z}_t|\mathbf{z}_{t-1}):=\mathcal{N}(\mathbf{z}_t;\sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\beta_t\mathbf{I})$$
  • backward process:
    $$p_{\theta}(\mathbf{z}_{0:T}):=p(\mathbf{z}_{T})\prod_{t=1}^{T}p_{\theta}(\mathbf{z}_{t-1}|\mathbf{z}_{t}),\qquad p_{\theta}(\mathbf{z}_{t-1}|\mathbf{z}_{t}):=\mathcal{N}(\mathbf{z}_{t-1};\mu_{\theta}(\mathbf{z}_{t},t),\Sigma_{\theta}(\mathbf{z}_{t},t)),$$
    $$\mu_\theta(\mathbf{z}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{z}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{z}_t,t)\right)$$
  • training objective (a minimal training-step sketch follows):
    $$\mathcal{L}_{\mathrm{simple}}(\theta):=\left\|\epsilon_{\theta}(\mathbf{z}_{t},t)-\epsilon\right\|_{2}^{2}$$
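The simple objective amounts to sampling a timestep, noising the clean latent with the forward process, and regressing the injected noise. A minimal sketch under assumed names (`denoiser` is any ε-prediction network; the linear β schedule is illustrative):

```python
# Sketch of one DDPM training step on clip latents z0 with the simple epsilon-prediction loss.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t


def diffusion_loss(denoiser, z0):
    """z0: (B, c, l, h, w) clean clip latents from the video autoencoder."""
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)              # random timestep per sample
    eps = torch.randn_like(z0)                                   # target noise
    a_bar = alphas_bar.to(z0.device)[t].view(b, 1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps         # q(z_t | z_0)
    return F.mse_loss(denoiser(z_t, t), eps)                     # L_simple
```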

Video Generation Backbone

Model video samples in the 3D latent space (following VDM):

  • exploit a spatial-temporal factorized 3D U-Net architecture to estimate $\epsilon$:
    • use space-only 3D convolutions with kernel shape $1 \times 3 \times 3$
    • and add temporal attention in some of the layers.
  • use factorized spatial-temporal attention as the default setting in our experiments
  • use adaptive group normalization to inject the timestep embedding into the normalization modules, controlling the channel-wise scale and bias parameters (a minimal sketch of one such block follows)
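Below is a minimal PyTorch sketch (assumed channel widths and head counts; not the released architecture) of one factorized block combining a space-only 1×3×3 convolution, spatial then temporal self-attention, and an adaptive group norm driven by the timestep embedding:

```python
# Sketch of a factorized spatio-temporal block with adaptive group normalization.
import torch
import torch.nn as nn


class AdaGroupNorm(nn.Module):
    """GroupNorm whose channel-wise scale and shift are predicted from the timestep embedding."""
    def __init__(self, channels, temb_dim, groups=8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(temb_dim, 2 * channels)

    def forward(self, x, temb):                      # x: (B, C, L, H, W), temb: (B, temb_dim)
        scale, shift = self.proj(temb).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[:, :, None, None, None]) + shift[:, :, None, None, None]


class FactorizedSTBlock(nn.Module):
    def __init__(self, channels=64, temb_dim=256, heads=4):
        super().__init__()
        self.ada_norm = AdaGroupNorm(channels, temb_dim)
        self.space_conv = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))  # space-only conv
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x, temb):
        b, c, l, h, w = x.shape
        x = x + self.space_conv(self.ada_norm(x, temb))
        # Spatial attention: tokens are the h*w positions within each frame.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * l, h * w, c)
        s = self.spatial_attn(s, s, s)[0].reshape(b, l, h, w, c).permute(0, 4, 1, 2, 3)
        x = x + s
        # Temporal attention: tokens are the l frames at each spatial position.
        tkn = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, l, c)
        tkn = self.temporal_attn(tkn, tkn, tkn)[0].reshape(b, h, w, l, c).permute(0, 4, 3, 1, 2)
        return x + tkn


block = FactorizedSTBlock()
out = block(torch.randn(1, 64, 8, 16, 16), torch.randn(1, 256))
print(out.shape)   # torch.Size([1, 64, 8, 16, 16])
```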

Hierarchical LVDM for Long Video Generation

  • problem: the aforementioned framework can only generate short videos, whose lengths are determined by the input frame number during training.
  • solution: we therefore propose a conditional latent diffusion model, which can produce future latent codes conditioned on the previous ones in an autoregressive manner, to facilitate long video generation.

Autoregressive Latent Prediction.

Consider a short clip latent $\mathbf{z}_t = \{\mathbf{z}_t^i\}_{i=1}^{l}$

  • $\mathbf{z}^i_t \in \mathbb{R}^{h \times w \times c}$
  • $l$ is the number of latent codes within the clip


Adding an additional binary mask:

  • For each video frame in a clip latent, we add an additional binary mask along the channel dimension to indicate whether it is a conditional frame or a frame to predict:
    $$\tilde{\mathbf{z}}_{t}=\{\tilde{\mathbf{z}}_t^i=[\mathbf{z}_t^i,\mathbf{m}^i]\}_{i=1}^l,\quad \tilde{\mathbf{z}}_t^i\in\mathbb{R}^{h\times w\times(c+1)}$$
    $$\tilde{\mathbf{z}}_{0}=\{\tilde{\mathbf{z}}_0^i=[\mathbf{z}_0^i,\mathbf{m}^i]\}_{i=1}^l,\quad \tilde{\mathbf{z}}_0^i\in\mathbb{R}^{h\times w\times(c+1)}$$
    $$\tilde{\mathbf{z}}_{t}\leftarrow\tilde{\mathbf{z}}_t\odot(1-\mathbf{m})+\tilde{\mathbf{z}}_0\odot\mathbf{m}$$
    • $\mathbf{m} = \{\mathbf{m}^i\}_{i=1}^l$, where $\mathbf{m}^i \in \mathbb{R}^{h\times w\times 1}$ is the binary mask.
    • $\mathbf{m}^i = 0$: a frame to predict
    • $\mathbf{m}^i = 1$: a conditional frame; replace $\mathbf{z}^i_t$ with the clean $\mathbf{z}^i_0$
    • $\tilde{\mathbf{z}}_{0}$: the conditional latent, i.e. frames without noise
    • $c+1$: the $c$ latent channels plus 1 binary-mask channel
  • By randomly setting the binary masks to ones or zeros, we can train our diffusion model to perform both unconditional video generation and conditional video prediction jointly (see the sketch after this list).
    • Concretely, we set all masks in the binary clip $\mathbf{m}$ to zeros for unconditional diffusion model training.
    • During the inference stage
      • set the first $k$ binary masks $\{\mathbf{m}^i\}_{i=1}^k$ to ones
      • and the remaining $\{\mathbf{m}^i\}_{i=k+1}^l$ to zeros. (Interpretation: e.g. the first $k$ latent frames were generated in the previous step and serve as conditions, while the remaining frames start from Gaussian noise and are predicted.)
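A minimal sketch (variable names assumed; not the authors' code) of how the mask channel is built and how the conditional frames are kept clean inside the noisy clip, matching the equations above:

```python
# Sketch: build the binary mask, keep the first k conditional frames clean,
# and concatenate the mask as an extra channel.
import torch


def masked_clip(z_t, z_0, k):
    """z_t, z_0: (B, c, l, h, w) noisy / clean clip latents; k: number of conditional frames."""
    b, c, l, h, w = z_t.shape
    m = torch.zeros(b, 1, l, h, w, device=z_t.device)
    m[:, :, :k] = 1.0                                # m = 1: conditional frame, m = 0: frame to predict
    z_t = z_t * (1 - m) + z_0 * m                    # replace conditional frames with clean latents
    return torch.cat([z_t, m], dim=1)                # (B, c + 1, l, h, w): latent channels + mask channel


# k = 0 recovers unconditional training/generation; k > 0 gives autoregressive prediction
# conditioned on the k previously generated latent frames.
z_t = torch.randn(1, 4, 16, 16, 16)
z_0 = torch.randn(1, 4, 16, 16, 16)
print(masked_clip(z_t, z_0, k=4).shape)              # torch.Size([1, 5, 16, 16, 16])
```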

Hierarchical Latent Generation

  • First we train an autoregressive video generation model on sparse frames to form the basic storyline of the video.
  • Then we train another interpolation model to infill the missing frames.
    • The training of the interpolation model is similar to that of the autoregressive model,
    • but we set the binary masks of the middle frames between every two sparse frames to zeros (unconditional). (Interpretation: the two surrounding sparse frames were produced by the autoregressive model; Gaussian-noise frames are inserted between them, the masks of the surrounding frames are set to 1, and the masks of the noise frames to 0. A sketch of this mask pattern follows.)
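To make the interpolation mask pattern concrete (an interpretation with illustrative values, not the released code): every n-th latent frame is a keyframe from the sparse autoregressive model and is marked conditional (m = 1), while the frames in between are to be infilled (m = 0).

```python
# Sketch: 1D mask pattern along the temporal axis for the interpolation model.
import torch

l, n = 13, 4                 # clip length and keyframe spacing (illustrative values)
m = torch.zeros(l)
m[::n] = 1.0                 # sparse keyframes generated by the autoregressive model
print(m.int().tolist())
# [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1] -> the ones condition the infilling of the zeros
```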

Conditional Latent Perturbation

  • problem: although the abovementioned hierarchical generation manner can reduce the number of autoregressive steps to alleviate the degradation issue, more prediction steps are still indispensable to produce long-enough video samples.
  • solution: we propose conditional latent perturbation to mitigate the conditional error induced by the previous generation step.
    • Rather than directly conditioning on $\mathbf{z}_0$, we use the noisy latent code $\mathbf{z}_s$ at an arbitrary time $s$ as the condition. (That is, the conditional frames are not the clean $\mathbf{z}_0^i$ but the slightly noised $\mathbf{z}_s^i$, substituted according to the mask.)

      • $\mathbf{z}_s$ is obtained from $\mathbf{z}_0$ via the forward diffusion process, $q(\mathbf{z}_s|\mathbf{z}_0)=\mathcal{N}(\mathbf{z}_s;\sqrt{\bar{\alpha}_s}\,\mathbf{z}_0,(1-\bar{\alpha}_s)\mathbf{I})$, and used as the condition during training
      • $\{\mathbf{z}_t^i\}_{i=1}^k \leftarrow \{\mathbf{z}_s^i\}_{i=1}^k$
    • To keep the conditional information preserved, a maximum threshold $s_{max}$ is used to clamp the timesteps to a minor noise level.

    • During sampling, a fixed noise level is used to consistently add noise to the conditions during autoregressive prediction (a minimal sketch follows).
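A minimal sketch (assumed schedule variables; not the authors' code) of conditional latent perturbation: the conditional frames are diffused to a small noise level s ≤ s_max via the forward process before being placed into the clip as conditions.

```python
# Sketch: perturb the conditional latent frames to a small noise level s before conditioning.
import torch

T, s_max = 1000, 200                                   # s_max clamps the condition to a minor noise level
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)


def perturb_condition(z0_cond, s=None):
    """z0_cond: (B, c, k, h, w) clean latents of the k conditional frames."""
    if s is None:                                      # training: random s in [0, s_max)
        s = torch.randint(0, s_max, (1,)).item()
    a_bar = alphas_bar[s]
    noise = torch.randn_like(z0_cond)
    return a_bar.sqrt() * z0_cond + (1.0 - a_bar).sqrt() * noise   # z_s ~ q(z_s | z_0)


# During sampling, a fixed small s is reused at every autoregressive step, e.g.:
z_s = perturb_condition(torch.randn(1, 4, 4, 16, 16), s=100)
```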


Unconditional Guidance: another complementary technique to alleviate the quality degradation of autoregressive generation

Since the accumulated error during autoregressive generation does not affect the unconditional score, introducing this score into long video generation can improve the diversity and fidelity of the sampled videos.

  • use one network to estimate both the unconditional score $\epsilon_u$ and the conditional score $\epsilon_c$.
  • By zeroing all binary maps $\{\mathbf{m}^i\}_{i=1}^l$ in $\tilde{\mathbf{z}}$, we obtain $\epsilon_u$.
  • By setting the first $k$ binary maps $\{\mathbf{m}^i\}_{i=1}^k$ to one and the remaining ones $\{\mathbf{m}^i\}_{i=k+1}^l$ to zero in $\tilde{\mathbf{z}}$, we get $\epsilon_c$.

problem: the conditional score may fall outside the distribution the model learned, due to error accumulation during autoregressive prediction.

  • solution: we propose to leverage the unconditional score to guide the prediction sampling process via
    $$\tilde{\epsilon}_\theta=(1+w)\,\epsilon_c-w\,\epsilon_u$$
  • $w$: the guidance strength
  • This formulation was originally introduced as classifier-free guidance, to avoid training a separate classifier for class-conditional diffusion models (a minimal sketch follows).
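A minimal sketch of the guided noise estimate (`denoiser` stands for any ε-prediction network that takes the mask-augmented clip; names and shapes are assumptions): the two forward passes differ only in the mask channel, which is all zeros for the unconditional score.

```python
# Sketch: unconditional guidance for autoregressive prediction (classifier-free guidance form).
import torch


def guided_eps(denoiser, z_t, z_0, t, k, w=1.0):
    """Combine conditional and unconditional scores: eps = (1 + w) * eps_c - w * eps_u."""
    b, c, l, h, w_spatial = z_t.shape
    m = torch.zeros(b, 1, l, h, w_spatial, device=z_t.device)
    m[:, :, :k] = 1.0
    cond_in = torch.cat([z_t * (1 - m) + z_0 * m, m], dim=1)      # first k frames are conditional
    uncond_in = torch.cat([z_t, torch.zeros_like(m)], dim=1)      # all binary maps zeroed
    eps_c = denoiser(cond_in, t)                                  # conditional score
    eps_u = denoiser(uncond_in, t)                                # unconditional score
    return (1.0 + w) * eps_c - w * eps_u
```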

EXPERIMENTS

Section 4, "Experiments", describes the experimental evaluation of the proposed method. A detailed summary of this part follows:

Experimental Setup

  • Datasets: the authors evaluate their method on the UCF-101, Sky Time-lapse, and Taichi datasets. All models are trained on these datasets for unconditional video generation, and training clips are selected with specific frame strides.
  • Resolution: all models are trained at a resolution of 256².
  • Training details: for the Taichi dataset, clips with a frame stride of 4 are selected to increase the dynamics of human motion. Since UCF-101 and Taichi contain a limited number of videos, the whole datasets are used for training. For Sky Time-lapse, only its training split is used.

Efficiency Comparison

  • The authors compare the performance and efficiency of their method with pixel-space video diffusion models (VDM and MCVD) on the UCF-101 dataset.
  • The efficiency comparison covers the number of model parameters, the per-step time cost, and the FVD16 and KVD16 metrics.

Short Video Generation

  • Quantitative results: the authors provide quantitative comparisons with prior methods on the different datasets; their method surpasses the previous state of the art on multiple metrics.
  • Qualitative results: visual comparisons show that, compared with methods such as DIGAN and TATS, LVDM generates video samples with high fidelity and diversity.

Long Video Generation

  • Comparison with the state-of-the-art approach: LVDM is compared with TATS on the UCF-101 and Sky Time-lapse datasets, generating long videos of 1024 frames.
  • Autoregressive and hierarchical models: the purely autoregressive prediction model is compared with the hierarchical variant that combines the prediction model with the interpolation model.
  • Quality degradation over time: LVDM degrades more slowly in long video generation, especially on the UCF-101 dataset.

Ablation Experiments

  • Conditional latent perturbation: shown to slow down the trend of performance degradation, especially when the number of frames exceeds 512.
  • Unconditional guidance: explored as an effective way to mitigate the quality degradation of autoregressive generation.

Extension to Text-to-Video Generation

  • Beyond unconditional video generation, the authors extend their method to the more controllable text-to-video generation task.
  • They scale the model to hundreds of millions of parameters and train it on a 2M-sample subset of the WebVid dataset.
  • A pretrained text-to-image model is reused for its spatial content-generation ability, and motion dynamics are then learned on the video dataset.
  • The text-to-video generation results demonstrate the effectiveness and generalization ability of the model.

Code

https://github.com/YingqingHe/LVDM
