AIGC Video Generation - VideoCrafter1: Open Diffusion Models for High-Quality Video Generation (Paper Walkthrough)

Homepage: https://ailab-cvc.github.io/videocrafter
Paper: https://arxiv.org/abs/2310.19512


MOTIVATION

  • Cascaded models: Some text-to-video models, such as Make-A-Video and Imagen Video, adopt a cascaded approach: each video is refined progressively through a series of stages, which may include an initial generation stage followed by detail-enhancement stages.
  • Open-source vs. closed models: Startups such as Gen-2, Pika Labs, and Moonvalley can generate high-quality videos, but their models are not open to researchers. Meanwhile, open-source T2V models such as ModelScope, Hotshot-XL, AnimateDiff, and Zeroscope V2 XL do exist, but they are limited in resolution and quality.
  • Demand for I2V models: The open-source community currently lacks a high-quality I2V model that can animate an image while preserving its content and structure. Although Pika Labs and Gen-2 have released I2V models, their animation is limited and exhibits visible artifacts.

CONTRIBUTION

  • We introduce a text-to-video model capable of generating high-quality videos with a resolution of 1024 × 576 and cinematic quality. The model is trained on 20 million videos and 600 million images.

    • The T2V model builds upon SD 2.1 by incorporating temporal attention layers into the SD UNet to capture temporal consistency.
    • We employ a joint image and video training strategy to prevent concept forgetting.
    • The training dataset comprises LAION COCO 600M, WebVid-10M, and a collected dataset of 10M high-resolution videos.
  • We present an image-to-video model, the first open-source generic I2V model that can strictly preserve the content and structure of the input reference image while animating it into a video. This model allows for both image and text inputs.

    • The I2V model is based on a T2V model and accepts both text and image inputs.
    • The image embedding is extracted using CLIP and injected into the SD UNet through cross-attention, similar to the injection of text embeddings.
    • The I2V model is trained on LAION COCO 600M and WebVid-10M.

RELATED WORKS

VDM

  • The first VDM:

    • It uses a space-time factorized U-Net to model low-resolution videos in pixel space.
    • The model is trained jointly on image and video data to better capture the relationship between images and videos.
  • Imagen Video's cascaded paradigm: Imagen Video introduces an effective cascaded paradigm that uses diffusion models (DMs) to generate high-definition videos.

    • This approach improves video generation through the v-prediction parameterization, which may involve prediction and refinement of video frames.
    • Later works focus on improving computational efficiency.
  • Computational cost of pixel-based VDMs and text-video alignment of latent-based VDMs:

    • Zhang et al. [59] point out that pixel-based VDMs are computationally expensive, while latent-based VDMs struggle with text-video alignment.
    • To address these issues, they propose a hybrid pixel-latent VDM framework that combines pixel-space and latent-space approaches for video generation.

Conditional Controls in T2V DMs

  • Integration of structure control: Models such as Gen-1 and Make-Your-Video integrate structural control into video diffusion models for video editing by concatenating frame-wise depth maps with the input noise sequence.
  • Challenges of temporal consistency and realistic motion:
    • Models such as Seer and VideoComposer either focus on specific domains (e.g., indoor objects) or fail to generate temporally coherent frames and realistic motion due to insufficient semantic understanding of the input image.
    • DragNUWA further introduces trajectory control into image-to-video generation, but this only partially alleviates the problem of unrealistic motion.

Methodology

VideoCrafter1: Text-to-Video Model (T2V)

Structure Overview

The VideoCrafter T2V model is a Latent Video Diffusion Model (LVDM)

  • two key components: a video VAE and a video latent diffusion model
  • The video VAE is responsible for reducing the sample dimension, allowing the subsequent diffusion model to be more compact and efficient
  • VAE:
    • First, the video data $x_0$ is fed into the VAE encoder $\mathcal{E}$ to project it into the video latent $z_0$, which has a lower data dimension and serves as a compressed video representation.
    • Then, the video latent can be projected back into the reconstructed video $x_0'$ via the VAE decoder $\mathcal{D}$.
    • We adopt the pretrained VAE from the Stable Diffusion model to serve as the video VAE and project each frame individually, without extracting temporal information.
  • Diffusion forward process:
    • After obtaining the video latent $z_0$, the forward diffusion process is performed on $z_0$:

      $$q(\mathbf{z}_{1:T}\mid\mathbf{z}_{0}):=\prod_{t=1}^{T}q(\mathbf{z}_{t}\mid\mathbf{z}_{t-1}),\qquad q(\mathbf{z}_{t}\mid\mathbf{z}_{t-1}):=\mathcal{N}\big(\mathbf{z}_{t};\sqrt{1-\beta_{t}}\,\mathbf{z}_{t-1},\beta_{t}\mathbf{I}\big)$$
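Below is a minimal PyTorch sketch of this forward (noising) process applied to a video latent. It is illustrative only, not the authors' training code: the linear beta schedule, the number of steps, and the latent shape (B, C, T, H, W) are assumptions, and `q_sample` uses the standard closed-form expression for $q(z_t\mid z_0)$ implied by the transition above.

```python
# Minimal sketch of the closed-form forward (noising) process q(z_t | z_0).
# Assumptions for illustration: a linear beta schedule, 1000 steps, and a
# latent of shape (B, C, T, H, W) as would come from encoding each frame
# with the SD VAE.
import torch

T_STEPS = 1000
betas = torch.linspace(1e-4, 2e-2, T_STEPS)        # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(z0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    abar = alphas_bar[t].view(-1, 1, 1, 1, 1)      # broadcast over (C, T, H, W)
    eps = torch.randn_like(z0)
    return abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps

# Example: a batch of 2 video latents, 4 channels, 16 frames, 40x64 latent resolution.
z0 = torch.randn(2, 4, 16, 40, 64)
t = torch.randint(0, T_STEPS, (2,))
zt = q_sample(z0, t)
print(zt.shape)  # torch.Size([2, 4, 16, 40, 64])
```

The denoising 3D U-Net described next is trained to reverse this process.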

Denoising 3D U-Net

  • the denoising U-Net is a 3D U-Net architecture

    • U-Net consists of a stack of basic spatial-temporal blocks with skip connections.
    • Each block comprises convolutional layers, spatial transformers (ST), and temporal transformers (TT)

    $$\mathrm{ST}=\mathrm{Proj}_{\mathrm{in}}\circ(\mathrm{Attn}_{\mathrm{self}}\circ\mathrm{Attn}_{\mathrm{cross}}\circ\mathrm{MLP})\circ\mathrm{Proj}_{\mathrm{out}},\qquad\mathrm{TT}=\mathrm{Proj}_{\mathrm{in}}\circ(\mathrm{Attn}_{\mathrm{temp}}\circ\mathrm{Attn}_{\mathrm{temp}}\circ\mathrm{MLP})\circ\mathrm{Proj}_{\mathrm{out}}.$$

  • The controlling signals of the denoiser include semantic control, such as the text prompt, and motion speed control, such as the video fps.

    • We inject the semantic control via the cross-attention:
      $$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\cdot\mathbf{V},\quad\text{where}\quad\mathbf{Q}=\mathbf{W}_{Q}^{(i)}\cdot\varphi_i(z_t),\ \mathbf{K}=\mathbf{W}_{K}^{(i)}\cdot\phi(y),\ \mathbf{V}=\mathbf{W}_{V}^{(i)}\cdot\phi(y).$$

      • $\varphi_i(z_t)\in\mathbb{R}^{N\times d_\epsilon^i}$: the spatially flattened tokens of the video latent
      • $\phi(y)$: the text embedding produced by the CLIP text encoder
      • $y$: the input text prompt
    • Motion speed control with fps is incorporated through an FPS embedder, which shares the same structure as the timestep embedder.

      • The FPS or timestep is first projected into an embedding vector using a sinusoidal embedding.
      • This vector is then fed into a two-layer MLP that maps the sinusoidal embedding to a learned embedding.
      • Subsequently, the timestep embedding and the FPS embedding are fused via element-wise addition, and the fused embedding is added to the convolutional features to modulate the intermediate features (see the sketch after this list).
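To make the block structure and its conditioning concrete, here is a minimal PyTorch sketch of one spatial-temporal block, roughly following the ST/TT decomposition and the timestep/FPS embedding fusion described above. All channel sizes, the module layout, and the helper names (`SpatialTemporalBlock`, `sinusoidal_embedding`) are assumptions for illustration, not the authors' implementation; the Proj_in/Proj_out layers are omitted and one MLP is shared between ST and TT for brevity.

```python
# Minimal sketch of one spatial-temporal block (assumed layout, not the authors' code).
import math
import torch
import torch.nn as nn
from einops import rearrange

def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Project a scalar condition (timestep or fps) into a sinusoidal embedding."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class SpatialTemporalBlock(nn.Module):
    def __init__(self, channels=320, ctx_dim=1024, emb_dim=1280, heads=8):
        super().__init__()
        self.emb_dim = emb_dim
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Two-layer MLPs that map sinusoidal embeddings to learned embeddings.
        self.t_mlp = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.fps_mlp = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.emb_proj = nn.Linear(emb_dim, channels)   # fused embedding -> feature channels
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.temp_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(), nn.Linear(4 * channels, channels))

    def forward(self, z, text_ctx, t, fps):
        # z: (B, C, F, H, W) video latent; text_ctx: (B, L, ctx_dim) CLIP text embedding.
        b, c, f, h, w = z.shape
        # Fuse timestep and FPS embeddings by element-wise addition, then modulate conv features.
        emb = self.t_mlp(sinusoidal_embedding(t, self.emb_dim)) + self.fps_mlp(sinusoidal_embedding(fps, self.emb_dim))
        z = self.conv(z) + self.emb_proj(emb)[:, :, None, None, None]
        # Spatial transformer (ST): self- and cross-attention over H*W tokens within each frame.
        s = rearrange(z, "b c f h w -> (b f) (h w) c")
        s = s + self.self_attn(s, s, s, need_weights=False)[0]
        ctx = text_ctx.repeat_interleave(f, dim=0)     # replicate the text condition per frame
        s = s + self.cross_attn(s, ctx, ctx, need_weights=False)[0]
        s = s + self.mlp(s)
        # Temporal transformer (TT): attention over the frame axis at each spatial location.
        v = rearrange(s, "(b f) n c -> (b n) f c", b=b, f=f)
        v = v + self.temp_attn(v, v, v, need_weights=False)[0]
        v = v + self.mlp(v)
        return rearrange(v, "(b n) f c -> b c f n", b=b).reshape(b, c, f, h, w)

# Example: one sample, 320 channels, 8 latent frames at 16x16, 77 text tokens of width 1024.
block = SpatialTemporalBlock()
out = block(torch.randn(1, 320, 8, 16, 16), torch.randn(1, 77, 1024),
            t=torch.tensor([500]), fps=torch.tensor([8]))
print(out.shape)  # torch.Size([1, 320, 8, 16, 16])
```

The key point is the token reshaping: spatial attention runs over the H×W tokens within each frame, while temporal attention runs over the frame axis at each spatial location, which is how the 3D U-Net models temporal consistency at modest cost.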

VideoCrafter1: Image-to-Video Model (I2V)

Structure Overview

Text prompts

  • offer highly flexible control for content generation, but they primarily focus on semantic-level specifications rather than detailed appearance.
  • To condition the video model on a reference image, it is therefore essential to project the image into a text-aligned embedding space


Text-Aligned Rich Image Embedding

  • Employ the image encoder counterpart of the CLIP text encoder to extract image features from the input image.
    • Although the global semantic token $f_{cls}$ from the CLIP image encoder is well aligned with image captions, it primarily represents visual content at a semantic level and is less capable of capturing details.
    • We therefore utilize the full patch visual tokens $F_{vis}=\{f_i\}_{i=0}^{K}$ from the last layer of the CLIP image ViT, which are believed to encompass much richer information about the image.
  • Promote alignment with the text embedding:
    • We utilize a learnable projection network $P$ to transform $F_{vis}$ into the target image embedding $F_{img}=P(F_{vis})$, enabling the video model backbone to process the image features efficiently.
    • The text embedding $F_{text}$ and the image embedding $F_{img}$ are then applied to the U-Net intermediate features $F_{in}$ via dual cross-attention layers:

      $$\mathbf{F}_{out}=\mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}_{text}^{\top}}{\sqrt{d}}\right)\mathbf{V}_{text}+\mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}_{img}^{\top}}{\sqrt{d}}\right)\mathbf{V}_{img},$$

      • $\mathbf{Q}=\mathbf{F}_{in}\mathbf{W}_{q}$
      • $\mathbf{K}_{text}=\mathbf{F}_{text}\mathbf{W}_{k},\ \mathbf{V}_{text}=\mathbf{F}_{text}\mathbf{W}_{v}$
      • $\mathbf{K}_{img}=\mathbf{F}_{img}\mathbf{W}_{k}^{\prime},\ \mathbf{V}_{img}=\mathbf{F}_{img}\mathbf{W}_{v}^{\prime}$
      • We use the same query for image cross-attention as for text cross-attention, so only two parameter matrices $\mathbf{W}_{k}^{\prime}$ and $\mathbf{W}_{v}^{\prime}$ are newly added for each cross-attention layer (see the sketch after this list).
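As a concrete reference for the dual cross-attention above, here is a minimal single-head PyTorch sketch. The dimensions, the small MLP standing in for the projection network $P$, and all module names are assumptions for illustration; the paper only specifies that $P$ is a learnable projection and that $W_k'$, $W_v'$ are the only newly added matrices.

```python
# Minimal single-head sketch of the dual cross-attention: text and image contexts
# are attended with a shared query and the two results are summed (assumed dims).
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, query_dim=320, ctx_dim=1024, inner_dim=320):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)      # W_q (shared by both branches)
        self.to_k_text = nn.Linear(ctx_dim, inner_dim, bias=False)   # W_k
        self.to_v_text = nn.Linear(ctx_dim, inner_dim, bias=False)   # W_v
        self.to_k_img = nn.Linear(ctx_dim, inner_dim, bias=False)    # W_k' (newly added)
        self.to_v_img = nn.Linear(ctx_dim, inner_dim, bias=False)    # W_v' (newly added)

    def forward(self, f_in, f_text, f_img):
        # f_in: (B, N, query_dim) U-Net tokens; f_text: (B, L, ctx_dim); f_img: (B, K, ctx_dim).
        q = self.to_q(f_in)
        k_t, v_t = self.to_k_text(f_text), self.to_v_text(f_text)
        k_i, v_i = self.to_k_img(f_img), self.to_v_img(f_img)
        attn_text = torch.softmax(q @ k_t.transpose(-1, -2) * self.scale, dim=-1) @ v_t
        attn_img = torch.softmax(q @ k_i.transpose(-1, -2) * self.scale, dim=-1) @ v_i
        return attn_text + attn_img                                  # F_out

# Hypothetical projection network P mapping CLIP patch tokens F_vis to F_img.
P = nn.Sequential(nn.Linear(1280, 1024), nn.GELU(), nn.Linear(1024, 1024))
f_vis = torch.randn(1, 257, 1280)        # CLIP ViT patch tokens (assumed shape)
f_img = P(f_vis)
f_out = DualCrossAttention()(torch.randn(1, 1024, 320), torch.randn(1, 77, 1024), f_img)
print(f_out.shape)  # torch.Size([1, 1024, 320])
```

Reusing the query and adding only the image-side key/value projections keeps the number of new parameters small and lets the pretrained text cross-attention remain largely intact.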

Experiments

4.1. Implementation Details

  • Datasets: The authors adopt a joint image and video training strategy, using LAION COCO, WebVid-10M, and a large dataset of more than 10 million high-resolution videos.

  • Training scheme: The T2V model is trained with a low-to-high resolution strategy: it is first trained at 256×256 resolution, and training then continues at progressively higher resolutions.

  • Evaluation metrics: The authors use EvalCrafter, a benchmark for evaluating video generation models, to comprehensively assess video quality and text-video alignment. EvalCrafter compares models through quantitative metrics and user studies.

  • Relation to Floor33: The authors deploy the two open-source models on a Discord channel named Floor33, allowing users to explore the models online by entering prompts. An optional prompt-extension feature is also provided to enrich the information in user prompts.

Performance Evaluation

  • Text-to-Video results: The authors compare their T2V model with commercial models such as Gen-2 and Pika Labs, as well as open-source models such as I2VGen-XL. The results show that their model outperforms other open-source T2V models in visual quality and text alignment.
  • Image-to-Video results: The authors evaluate their method against existing image-to-video approaches, including VideoComposer, I2VGen-XL, Pika, and Gen-2. Their I2V model preserves the content and structure of the input image while exhibiting good temporal consistency and motion magnitude.