AIGC - Latent Video Diffusion Models for High-Fidelity Long Video Generation (arXiv 2022)

homepage:
https://yingqinghe.github.io/LVDM/
paper:
https://arxiv.org/abs/2211.13221
Reference blog:
https://blog.csdn.net/Are_you_ready/article/details/136615853

LVDM: a latent-space video generation model

MOTIVATION

  • Diffusion models have shown remarkable results recently but require significant computational resources
  • GANs suffer from mode collapse and training instability problems, which makes GAN-based approaches hard to scale up to handle complex and diverse video distributions
  • TATS proposes an autoregressive approach that leverages the VQGAN and transformers to synthesize long videos. However, the generation fidelity and resolution (128×128) still have much room for improvement

CONTRIBUTION

  • We introduce LVDM, an efficient diffusion-based baseline approach for video generation by firstly compressing videos into tight latents.
  • We propose a hierarchical framework that operates in the video latent space, enabling our models to further generate longer videos beyond the training length.
  • We propose conditional latent perturbation and unconditional guidance for mitigating the performance degradation issue during long video generation.
  • Our model achieves state-of-the-art results on three benchmarks in both short and long video generation settings. We also provide appealing results for open-domain text-to-video generation, demonstrating the effectiveness and generalization of our models.

RELATED WORKS

METHODS

Overview


  • First, we compress video samples into a lower-dimensional latent space with a video autoencoder. The temporal dimension is compressed as well: the autoencoder uses separate spatial and temporal downsampling factors.

  • Then we design a unified video diffusion model in the latent space, which can perform both unconditional and conditional video generation in one network. This enables our model to self-extend the generated video to an arbitrary length autoregressively.

  • To further improve the coherence of generated long videos and alleviate the quality degradation problem:

    • we propose hierarchical latent video diffusion models to first generate video latents sparsely and then interpolate the intermediate latents.
    • We also propose conditional latent perturbation and unconditional guidance for tackling the performance degradation problem in long video generation.

Video Autoencoder

  • Encoder $\mathcal{E}$: given a video sample $\mathbf{x}_0 \sim p_{data}(\mathbf{x}_0)$, where $\mathbf{x}_0 \in \mathbb{R}^{H \times W \times L \times 3}$

    • the encoder $\mathcal{E}$ encodes it to its latent representation $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0)$
    • $\mathbf{x}_0$: the video
    • $\mathbf{z}_0 \in \mathbb{R}^{h \times w \times l \times c}$
    • $h = H/f_s$, $w = W/f_s$, and $l = L/f_t$
    • $f_s$ and $f_t$ are the spatial and temporal downsampling factors.
  • Decoder $\mathcal{D}$:

    • The decoder $\mathcal{D}$ decodes $\mathbf{z}_0$ to the reconstructed sample $\tilde{\mathbf{x}}_0 = \mathcal{D}(\mathbf{z}_0)$.
    • Both the encoder and decoder consist of several layers of 3D convolutions.
  • To ensure that the autoencoder is temporally shift-equivariant, we use repeat padding in all 3D convolutions (a minimal sketch of this design follows the training objective below).

  • training objective:
    $$\mathcal{L}_{AE}=\min_{\mathcal{E},\mathcal{D}}\max_{\psi}\big(\mathcal{L}_{rec}(\mathbf{x}_{0},\mathcal{D}(\mathcal{E}(\mathbf{x}_{0})))+\mathcal{L}_{adv}(\psi(\mathcal{D}(\mathcal{E}(\mathbf{x}_{0}))))\big)$$

    • $\mathcal{L}_{rec}$: reconstruction loss, comprised of a pixel-level mean-squared error (MSE) loss and a perceptual-level LPIPS loss.
    • $\mathcal{L}_{adv}$: adversarial loss, used to eliminate the blur in reconstructions usually caused by the pixel-level reconstruction loss and to further improve the realism of the reconstruction.
    • $\psi$: the discriminator used in adversarial training.
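The shape bookkeeping above can be made concrete with a small sketch. Below is a minimal PyTorch illustration (assumed layer counts, channel widths, and downsampling factors; not the authors' released implementation) of a 3D-convolutional encoder/decoder pair with replicate ("repeat") padding and separate spatial/temporal strides:

```python
# Minimal sketch of a 3D-conv video autoencoder with f_s = 4 and f_t = 2 (illustrative values).
import torch
import torch.nn as nn


class Encoder3D(nn.Module):
    def __init__(self, in_ch=3, latent_ch=4, base_ch=64):
        super().__init__()
        # Strided 3D convolutions downsample (L, H, W); replicate padding keeps the
        # autoencoder temporally shift-equivariant ("repeat padding").
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, base_ch, 3, stride=(2, 2, 2), padding=1, padding_mode="replicate"),
            nn.SiLU(),
            nn.Conv3d(base_ch, 2 * base_ch, 3, stride=(1, 2, 2), padding=1, padding_mode="replicate"),
            nn.SiLU(),
            nn.Conv3d(2 * base_ch, latent_ch, 3, stride=1, padding=1, padding_mode="replicate"),
        )

    def forward(self, x):          # x: (B, 3, L, H, W)
        return self.net(x)         # z: (B, c, L/f_t, H/f_s, W/f_s)


class Decoder3D(nn.Module):
    def __init__(self, latent_ch=4, out_ch=3, base_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 2 * base_ch, (3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(2 * base_ch, base_ch, 4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base_ch, out_ch, 3, stride=1, padding=1, padding_mode="replicate"),
        )

    def forward(self, z):
        return self.net(z)


if __name__ == "__main__":
    x0 = torch.randn(1, 3, 16, 64, 64)             # (B, 3, L, H, W)
    z0 = Encoder3D()(x0)                           # (1, 4, 8, 16, 16): l = L/2, h = H/4, w = W/4
    x_rec = Decoder3D()(z0)                        # (1, 3, 16, 64, 64)
    print(z0.shape, x_rec.shape)
    # Training would pair an MSE + LPIPS reconstruction loss with an adversarial loss (omitted here).
```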

Base LVDM for Short Video Generation

Revisiting Diffusion Models

  • forward process:
    $$q(\mathbf{z}_{1:T}|\mathbf{z}_0):=\prod_{t=1}^{T}q(\mathbf{z}_t|\mathbf{z}_{t-1}),\qquad q(\mathbf{z}_t|\mathbf{z}_{t-1}):=\mathcal{N}(\mathbf{z}_t;\sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\beta_t\mathbf{I})$$
  • backward process:
    $$p_{\theta}(\mathbf{z}_{0:T}):=p(\mathbf{z}_{T})\prod_{t=1}^{T}p_{\theta}(\mathbf{z}_{t-1}|\mathbf{z}_{t}),\qquad p_{\theta}(\mathbf{z}_{t-1}|\mathbf{z}_{t}):=\mathcal{N}(\mathbf{z}_{t-1};\mu_{\theta}(\mathbf{z}_{t},t),\Sigma_{\theta}(\mathbf{z}_{t},t)),$$
    $$\mu_\theta(\mathbf{z}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{z}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{z}_t,t)\right)$$
  • training objective (a minimal training-step sketch follows):
    $$\mathcal{L}_{\mathrm{simple}}(\theta):=\left\|\epsilon_{\theta}(\mathbf{z}_{t},t)-\epsilon\right\|_{2}^{2}$$
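The simple objective amounts to sampling a timestep, noising the clean latent with the forward process, and regressing the injected noise. A minimal sketch under assumed names (`denoiser` is any ε-prediction network; the linear β schedule is illustrative):

```python
# Sketch of one DDPM training step on clip latents z0 with the simple epsilon-prediction loss.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t


def diffusion_loss(denoiser, z0):
    """z0: (B, c, l, h, w) clean clip latents from the video autoencoder."""
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)              # random timestep per sample
    eps = torch.randn_like(z0)                                   # target noise
    a_bar = alphas_bar.to(z0.device)[t].view(b, 1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps         # q(z_t | z_0)
    return F.mse_loss(denoiser(z_t, t), eps)                     # L_simple
```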

Video Generation Backbone

Model video samples in the 3D latent space (following VDM):

  • exploit a spatial-temporal factorized 3D U-Net architecture to estimate $\epsilon$:
    • use space-only 3D convolutions with kernel shape $1 \times 3 \times 3$
    • and add temporal attention in some of the layers.
  • use factorized spatial-temporal attention as the default setting in our experiments
  • use adaptive group normalization to inject the timestep embedding into the normalization modules, controlling the channel-wise scale and bias parameters (a minimal sketch of one such block follows)
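Below is a minimal PyTorch sketch (assumed channel widths and head counts; not the released architecture) of one factorized block combining a space-only 1×3×3 convolution, spatial then temporal self-attention, and an adaptive group norm driven by the timestep embedding:

```python
# Sketch of a factorized spatio-temporal block with adaptive group normalization.
import torch
import torch.nn as nn


class AdaGroupNorm(nn.Module):
    """GroupNorm whose channel-wise scale and shift are predicted from the timestep embedding."""
    def __init__(self, channels, temb_dim, groups=8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(temb_dim, 2 * channels)

    def forward(self, x, temb):                      # x: (B, C, L, H, W), temb: (B, temb_dim)
        scale, shift = self.proj(temb).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[:, :, None, None, None]) + shift[:, :, None, None, None]


class FactorizedSTBlock(nn.Module):
    def __init__(self, channels=64, temb_dim=256, heads=4):
        super().__init__()
        self.ada_norm = AdaGroupNorm(channels, temb_dim)
        self.space_conv = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))  # space-only conv
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x, temb):
        b, c, l, h, w = x.shape
        x = x + self.space_conv(self.ada_norm(x, temb))
        # Spatial attention: tokens are the h*w positions within each frame.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * l, h * w, c)
        s = self.spatial_attn(s, s, s)[0].reshape(b, l, h, w, c).permute(0, 4, 1, 2, 3)
        x = x + s
        # Temporal attention: tokens are the l frames at each spatial position.
        tkn = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, l, c)
        tkn = self.temporal_attn(tkn, tkn, tkn)[0].reshape(b, h, w, l, c).permute(0, 4, 3, 1, 2)
        return x + tkn


block = FactorizedSTBlock()
out = block(torch.randn(1, 64, 8, 16, 16), torch.randn(1, 256))
print(out.shape)   # torch.Size([1, 64, 8, 16, 16])
```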

Hierarchical LVDM for Long Video Generation

  • problem: the aforementioned framework can only generate short videos, whose lengths are determined by the input frame number during training.
  • solution: we therefore propose a conditional latent diffusion model, which can produce future latent codes conditioned on the previous ones in an autoregressive manner, to facilitate long video generation.

Autoregressive Latent Prediction.

Consider a short clip latent $\mathbf{z}_t = \{\mathbf{z}_t^i\}_{i=1}^{l}$

  • $\mathbf{z}^i_t \in \mathbb{R}^{h \times w \times c}$
  • $l$ is the number of latent codes within the clip


Adding an additional binary mask:

  • For each video frame in a clip latent, we add an additional binary mask along the channel dimension to indicate whether it is a conditional frame or a frame to predict:
    $$\tilde{\mathbf{z}}_{t}=\{\tilde{\mathbf{z}}_t^i=[\mathbf{z}_t^i,\mathbf{m}^i]\}_{i=1}^l,\quad \tilde{\mathbf{z}}_t^i\in\mathbb{R}^{h\times w\times(c+1)}$$
    $$\tilde{\mathbf{z}}_{0}=\{\tilde{\mathbf{z}}_0^i=[\mathbf{z}_0^i,\mathbf{m}^i]\}_{i=1}^l,\quad \tilde{\mathbf{z}}_0^i\in\mathbb{R}^{h\times w\times(c+1)}$$
    $$\tilde{\mathbf{z}}_{t}\leftarrow\tilde{\mathbf{z}}_t\odot(1-\mathbf{m})+\tilde{\mathbf{z}}_0\odot\mathbf{m}$$
    • $\mathbf{m} = \{\mathbf{m}^i\}_{i=1}^l$, where $\mathbf{m}^i \in \mathbb{R}^{h\times w\times 1}$ is the binary mask.
    • $\mathbf{m}^i = 0$: a frame to predict
    • $\mathbf{m}^i = 1$: a conditional frame; replace $\mathbf{z}^i_t$ with the clean $\mathbf{z}^i_0$
    • $\tilde{\mathbf{z}}_{0}$: the conditional latent, i.e. frames without noise
    • $c+1$: the $c$ latent channels plus 1 binary-mask channel
  • By randomly setting the binary masks to ones or zeros, we can train our diffusion model to perform both unconditional video generation and conditional video prediction jointly (see the sketch after this list).
    • Concretely, we set all masks in the binary clip $\mathbf{m}$ to zeros for unconditional diffusion model training.
    • During the inference stage
      • set the first $k$ binary masks $\{\mathbf{m}^i\}_{i=1}^k$ to ones
      • and the remaining $\{\mathbf{m}^i\}_{i=k+1}^l$ to zeros. (Interpretation: e.g. the first $k$ latent frames were generated in the previous step and serve as conditions, while the remaining frames start from Gaussian noise and are predicted.)
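A minimal sketch (variable names assumed; not the authors' code) of how the mask channel is built and how the conditional frames are kept clean inside the noisy clip, matching the equations above:

```python
# Sketch: build the binary mask, keep the first k conditional frames clean,
# and concatenate the mask as an extra channel.
import torch


def masked_clip(z_t, z_0, k):
    """z_t, z_0: (B, c, l, h, w) noisy / clean clip latents; k: number of conditional frames."""
    b, c, l, h, w = z_t.shape
    m = torch.zeros(b, 1, l, h, w, device=z_t.device)
    m[:, :, :k] = 1.0                                # m = 1: conditional frame, m = 0: frame to predict
    z_t = z_t * (1 - m) + z_0 * m                    # replace conditional frames with clean latents
    return torch.cat([z_t, m], dim=1)                # (B, c + 1, l, h, w): latent channels + mask channel


# k = 0 recovers unconditional training/generation; k > 0 gives autoregressive prediction
# conditioned on the k previously generated latent frames.
z_t = torch.randn(1, 4, 16, 16, 16)
z_0 = torch.randn(1, 4, 16, 16, 16)
print(masked_clip(z_t, z_0, k=4).shape)              # torch.Size([1, 5, 16, 16, 16])
```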

Hierarchical Latent Generation

  • First we train an autoregressive video generation model on sparse frames to form the basic storyline of the video.
  • Then we train another interpolation model to infill the missing frames.
    • The training of the interpolation model is similar to that of the autoregressive model,
    • but we set the binary masks of the middle frames between every two sparse frames to zeros (unconditional). (Interpretation: the two surrounding sparse frames were produced by the autoregressive model; Gaussian-noise frames are inserted between them, the masks of the surrounding frames are set to 1, and the masks of the noise frames to 0. A sketch of this mask pattern follows.)
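To make the interpolation mask pattern concrete (an interpretation with illustrative values, not the released code): every n-th latent frame is a keyframe from the sparse autoregressive model and is marked conditional (m = 1), while the frames in between are to be infilled (m = 0).

```python
# Sketch: 1D mask pattern along the temporal axis for the interpolation model.
import torch

l, n = 13, 4                 # clip length and keyframe spacing (illustrative values)
m = torch.zeros(l)
m[::n] = 1.0                 # sparse keyframes generated by the autoregressive model
print(m.int().tolist())
# [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1] -> the ones condition the infilling of the zeros
```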

Conditional Latent Perturbation

  • problem: although the abovementioned hierarchical generation manner can reduce the number of autoregressive steps to alleviate the degradation issue, more prediction steps are still indispensable to produce long-enough video samples.
  • solution: we propose conditional latent perturbation to mitigate the conditional error induced by the previous generation step.
    • Rather than directly conditioning on $\mathbf{z}_0$, we use the noisy latent code $\mathbf{z}_s$ at an arbitrary time $s$ as the condition. (That is, the conditional frames are not the clean $\mathbf{z}_0^i$ but the slightly noised $\mathbf{z}_s^i$, substituted according to the mask.)

      • $\mathbf{z}_s$ is obtained from $\mathbf{z}_0$ via the forward diffusion process, $q(\mathbf{z}_s|\mathbf{z}_0)=\mathcal{N}(\mathbf{z}_s;\sqrt{\bar{\alpha}_s}\,\mathbf{z}_0,(1-\bar{\alpha}_s)\mathbf{I})$, and used as the condition during training
      • $\{\mathbf{z}_t^i\}_{i=1}^k \leftarrow \{\mathbf{z}_s^i\}_{i=1}^k$
    • To keep the conditional information preserved, a maximum threshold $s_{max}$ is used to clamp the timesteps to a minor noise level.

    • During sampling, a fixed noise level is used to consistently add noise to the conditions during autoregressive prediction (a minimal sketch follows).
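A minimal sketch (assumed schedule variables; not the authors' code) of conditional latent perturbation: the conditional frames are diffused to a small noise level s ≤ s_max via the forward process before being placed into the clip as conditions.

```python
# Sketch: perturb the conditional latent frames to a small noise level s before conditioning.
import torch

T, s_max = 1000, 200                                   # s_max clamps the condition to a minor noise level
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)


def perturb_condition(z0_cond, s=None):
    """z0_cond: (B, c, k, h, w) clean latents of the k conditional frames."""
    if s is None:                                      # training: random s in [0, s_max)
        s = torch.randint(0, s_max, (1,)).item()
    a_bar = alphas_bar[s]
    noise = torch.randn_like(z0_cond)
    return a_bar.sqrt() * z0_cond + (1.0 - a_bar).sqrt() * noise   # z_s ~ q(z_s | z_0)


# During sampling, a fixed small s is reused at every autoregressive step, e.g.:
z_s = perturb_condition(torch.randn(1, 4, 4, 16, 16), s=100)
```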


Unconditional Guidance: another complementary technique to alleviate the quality degradation of autoregressive generation

Since the accumulated error during autoregressive generation does not affect the unconditional score, introducing this score into long video generation can improve the diversity and fidelity of the sampled videos.

  • use one network to estimate both the unconditional score $\epsilon_u$ and the conditional score $\epsilon_c$.
  • By zeroing all binary maps $\{\mathbf{m}^i\}_{i=1}^l$ in $\tilde{\mathbf{z}}$, we obtain $\epsilon_u$.
  • By setting the first $k$ binary maps $\{\mathbf{m}^i\}_{i=1}^k$ to one and the remaining ones $\{\mathbf{m}^i\}_{i=k+1}^l$ to zero in $\tilde{\mathbf{z}}$, we get $\epsilon_c$.

problem: the conditional score may fall outside the distribution the model learned, due to error accumulation during autoregressive prediction.

  • solution: we propose to leverage the unconditional score to guide the prediction sampling process via
    $$\tilde{\epsilon}_\theta=(1+w)\,\epsilon_c-w\,\epsilon_u$$
  • $w$: the guidance strength
  • This formulation was originally introduced as classifier-free guidance, to avoid training a separate classifier for class-conditional diffusion models (a minimal sketch follows).
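A minimal sketch of the guided noise estimate (`denoiser` stands for any ε-prediction network that takes the mask-augmented clip; names and shapes are assumptions): the two forward passes differ only in the mask channel, which is all zeros for the unconditional score.

```python
# Sketch: unconditional guidance for autoregressive prediction (classifier-free guidance form).
import torch


def guided_eps(denoiser, z_t, z_0, t, k, w=1.0):
    """Combine conditional and unconditional scores: eps = (1 + w) * eps_c - w * eps_u."""
    b, c, l, h, w_spatial = z_t.shape
    m = torch.zeros(b, 1, l, h, w_spatial, device=z_t.device)
    m[:, :, :k] = 1.0
    cond_in = torch.cat([z_t * (1 - m) + z_0 * m, m], dim=1)      # first k frames are conditional
    uncond_in = torch.cat([z_t, torch.zeros_like(m)], dim=1)      # all binary maps zeroed
    eps_c = denoiser(cond_in, t)                                  # conditional score
    eps_u = denoiser(uncond_in, t)                                # unconditional score
    return (1.0 + w) * eps_c - w * eps_u
```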

EXPERIMENTS

Section 4, "Experiments", describes the experimental evaluation of the proposed method. A detailed summary of this part follows:

Experimental Setup

  • Datasets: the authors evaluate their method on the UCF-101, Sky Time-lapse, and Taichi datasets. All models are trained on these datasets for unconditional video generation, and training clips are selected with specific frame strides.
  • Resolution: all models are trained at a resolution of 256².
  • Training details: for the Taichi dataset, clips with a frame stride of 4 are selected to increase the dynamics of human motion. Since UCF-101 and Taichi contain a limited number of videos, the whole datasets are used for training. For Sky Time-lapse, only its training split is used.

Efficiency Comparison

  • The authors compare the performance and efficiency of their method with pixel-space video diffusion models (VDM and MCVD) on the UCF-101 dataset.
  • The efficiency comparison covers the number of model parameters, the per-step time cost, and the FVD16 and KVD16 metrics.

Short Video Generation

  • Quantitative results: the authors provide quantitative comparisons with prior methods on the different datasets; their method surpasses the previous state of the art on multiple metrics.
  • Qualitative results: visual comparisons show that, compared with methods such as DIGAN and TATS, LVDM generates video samples with high fidelity and diversity.

Long Video Generation

  • Comparison with the state-of-the-art approach: LVDM is compared with TATS on the UCF-101 and Sky Time-lapse datasets, generating long videos of 1024 frames.
  • Autoregressive and hierarchical models: the purely autoregressive prediction model is compared with the hierarchical variant that combines the prediction model with the interpolation model.
  • Quality degradation over time: LVDM degrades more slowly in long video generation, especially on the UCF-101 dataset.

Ablation Experiments

  • Conditional latent perturbation: shown to slow down the trend of performance degradation, especially when the number of frames exceeds 512.
  • Unconditional guidance: explored as an effective way to mitigate the quality degradation of autoregressive generation.

Extension to Text-to-Video Generation

  • Beyond unconditional video generation, the authors extend their method to the more controllable text-to-video generation task.
  • They scale the model to hundreds of millions of parameters and train it on a 2M-sample subset of the WebVid dataset.
  • A pretrained text-to-image model is reused for its spatial content-generation ability, and motion dynamics are then learned on the video dataset.
  • The text-to-video generation results demonstrate the effectiveness and generalization ability of the model.

Code

https://github.com/YingqingHe/LVDM
