AIGC Video Generation: A Close Reading of VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

hflexx

Published 2024-07-23 12:44:43


Article link: https://blog.csdn.net/hwjokcq/article/details/140628512


Homepage: https://ailab-cvc.github.io/videocrafter
Paper: https://arxiv.org/abs/2310.19512


MOTIVATION

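Cascaded models: some text-to-video models, such as Make-A-Video and Imagen Video, take a cascaded approach. They refine the video through a sequence of stages, such as initial generation followed by detail enhancement.
Open-source versus closed models: commercial offerings from startups, such as Gen-2, Pika Labs, and Moonvalley, can generate high-quality videos, but their models are not available to researchers. Meanwhile, open-source T2V models such as ModelScope, Hotshot-XL, AnimateDiff, and Zeroscope V2 XL exist, but they are limited in resolution and quality.
The need for I2V models: the open-source community currently lacks a high-quality I2V model that animates an image while preserving its content and structure. Pika Labs and Gen-2 have released I2V models, but their animation is limited in motion and shows visible artifacts.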

CONTRIBUTION

We introduce a text-to-video model capable of generating high-quality videos with a resolution of 1024 × 576 and cinematic quality. The model is trained on 20 million videos and 600 million images.
- The T2V model builds upon SD 2.1 by incorporating temporal attention layers into the SD UNet to capture temporal consistency.
- We employ a joint image and video training strategy to prevent concept forgetting.
- The training dataset comprises LAION COCO 600M, Webvid10M, and a 10M high-resolution collected video dataset.
We present an image-to-video model, the first open-source generic I2V model that can strictly preserve the content and structure of the input reference image while animating it into a video. This model allows for both image and text inputs.
- The I2V model is based on the T2V model and accepts both text and image inputs.
- The image embedding is extracted using CLIP and injected into the SD UNet through cross-attention, similar to the injection of text embeddings.
- The I2V model is trained on LAION COCO 600M and Webvid10M.

RELATED WORKS

VDM

The first VDM:
- It uses a space-time factorized U-Net to model low-resolution videos in pixel space.
- It is trained jointly on image and video data, so that it can exploit both kinds of data.
Imagen Video's cascaded paradigm: Imagen Video introduces an effective cascaded paradigm that uses diffusion models (DMs) to generate high-definition videos.
- It improves video generation with the v-prediction parameterization.
Improving computational efficiency, and the trade-off between pixel-based and latent-based VDMs:
- Zhang et al. [59] point out that pixel-based VDMs are computationally expensive, while latent-based VDMs perform poorly on text-video alignment.
- To address both issues, they propose a hybrid pixel-latent VDM framework that combines pixel-space and latent-space generation.

conditional controls in T2V DMs

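Integrating structure control: models such as Gen-1 and Make-Your-Video integrate structure control into video diffusion models for video editing by concatenating frame-wise depth maps with the input noise sequence.
Challenges of temporal consistency and realistic motion:
- Models such as Seer and VideoComposer either focus on specific domains (for example, indoor objects) or understand the semantics of the input image too poorly, and so fail to generate temporally coherent frames and realistic motion.
- DragNUWA further introduces trajectory control into image-to-video generation, but this mitigates unrealistic motion only to a limited extent.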

Methodology

VideoCrafter1: Text-to-Video Model(T2V)

Structure Overview

The VideoCrafter T2V model is a Latent Video Diffusion Model (LVDM).

It has two key components: a video VAE and a video latent diffusion model.
The video VAE reduces the sample dimension, allowing the subsequent diffusion model to be more compact and efficient.
VAE:
- First, the video data $x_0$ is fed into the VAE encoder $E$ to project it into the video latent $z_0$ , which exhibits a lower data dimension with a compressed video representation.
- Then, the video latent can be projected back into the reconstructed video $x_0'$ via the VAE decoder $D$ .
- We adopt the pretrained VAE from the Stable Diffusion model to serve as the video VAE and project each frame individually without extracting temporal information.
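The per-frame encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_encoder` is a hypothetical stand-in for the Stable Diffusion VAE encoder, and the 8× spatial downsampling and 4 latent channels are assumptions chosen to resemble the SD VAE. The point is only that time is folded into the batch axis, so no temporal information is used.

```python
import numpy as np

def encode_video_per_frame(video, encode_image):
    """Apply an image encoder to every frame of a video independently.

    video: array of shape (B, T, C, H, W)
    encode_image: callable mapping (N, C, H, W) -> (N, c, h, w)
    returns: latents of shape (B, T, c, h, w)
    """
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)   # fold time into the batch axis
    latents = encode_image(frames)           # each frame is encoded on its own
    return latents.reshape(b, t, *latents.shape[1:])

def toy_encoder(frames, factor=8, latent_channels=4):
    """Hypothetical stand-in for a pretrained image VAE encoder (average pooling)."""
    n, c, h, w = frames.shape
    pooled = frames.reshape(n, c, h // factor, factor, w // factor, factor).mean(axis=(3, 5))
    reps = -(-latent_channels // c)          # ceil division
    return np.tile(pooled, (1, reps, 1, 1))[:, :latent_channels]

video = np.random.default_rng(0).normal(size=(2, 16, 3, 64, 64))  # x_0
z0 = encode_video_per_frame(video, toy_encoder)                   # z_0
print(z0.shape)
```

Decoding would mirror this: fold time into the batch axis, decode each latent frame, and unfold again.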
Diffusion forward process:
- After obtaining the video latent $z_0$, the diffusion process is performed on $z_0$:
  $\begin{aligned}&q(\mathbf{z}_{1:T}|\mathbf{z}_{0}):=\prod_{t=1}^{T}q(\mathbf{z}_{t}|\mathbf{z}_{t-1}),\\&q(\mathbf{z}_{t}|\mathbf{z}_{t-1}):=\mathcal N(\mathbf{z}_{t};\sqrt{1-\beta_{t}}\mathbf{z}_{t-1},\beta_{t}\mathbf{I})\end{aligned}$
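A small numerical sketch of this forward process may help. It samples each transition $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ directly from its definition. The linear $\beta_t$ schedule, the number of steps, and the latent shape are assumptions for illustration; the source does not specify them here.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Assumed noise schedule (DDPM-style linear betas); not taken from the source."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(z_prev, beta_t, noise):
    """Sample z_t ~ N(sqrt(1 - beta_t) * z_prev, beta_t * I) from unit Gaussian noise."""
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * noise

def forward_chain(z0, betas, rng):
    """Run the full Markov chain z_0 -> z_1 -> ... -> z_T."""
    z = z0
    for beta_t in betas:
        z = forward_step(z, beta_t, rng.standard_normal(z.shape))
    return z

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
z0 = 3.0 + rng.standard_normal((2, 16, 4, 8, 8))  # toy video latent (B, T, C, H, W)
zT = forward_chain(z0, betas, rng)

# Scale of the original signal that survives all T steps: sqrt(prod(1 - beta_t)).
signal_scale = np.sqrt(np.prod(1.0 - betas))
print(signal_scale)          # close to zero, so z_T is nearly pure Gaussian noise
print(zT.mean(), zT.std())
```

Under this schedule almost none of $z_0$ survives at step $T$, which is why sampling can start from pure Gaussian noise.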

Denoising 3D U-Net

The denoising U-Net is a 3D U-Net architecture.
- The U-Net consists of a stack of basic spatial-temporal blocks with skip connections.
- Each block comprises convolutional layers, spatial transformers (ST), and temporal transformers (TT):
$\begin{aligned}\mathrm{ST}&=\mathrm{Proj}_{\mathrm{in}}\circ(\mathrm{Attn}_{\mathrm{self}}\circ\mathrm{Attn}_{\mathrm{cross}}\circ\mathrm{MLP})\circ\mathrm{Proj}_{\mathrm{out}},\\\mathrm{TT}&=\mathrm{Proj}_{\mathrm{in}}\circ(\mathrm{Attn}_{\mathrm{temp}}\circ\mathrm{Attn}_{\mathrm{temp}}\circ\mathrm{MLP})\circ\mathrm{Proj}_{\mathrm{out}}.\end{aligned}$
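To make the difference between spatial and temporal attention concrete, here is a sketch of the token layout each one might use. The tensor layout is an assumption based on the common factorized design (the source does not spell it out), and the attention is parameter-free, with no projections, cross-attention, or MLP, so that only the reshaping is on display.

```python
import numpy as np

def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention. tokens: (N, L, C)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def spatial_attention(x):
    """x: (B, T, H, W, C). Tokens are the H*W positions within each frame."""
    b, t, h, w, c = x.shape
    out = self_attention(x.reshape(b * t, h * w, c))
    return out.reshape(b, t, h, w, c)

def temporal_attention(x):
    """x: (B, T, H, W, C). Tokens are the T frames at each spatial position."""
    b, t, h, w, c = x.shape
    tokens = x.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
    out = self_attention(tokens)
    return out.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)

x = np.random.default_rng(0).normal(size=(1, 3, 2, 2, 4))
ys = spatial_attention(x)    # frames do not exchange information
yt = temporal_attention(x)   # each pixel location attends across frames
print(ys.shape, yt.shape)
```

In this layout, only the temporal path lets one frame influence another, which is where temporal consistency would have to come from.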
The controlling signals of the denoiser include semantic control, such as the text prompt, and motion speed control, such as the video fps.
- We inject the semantic control via the cross-attention:
  $\begin{aligned}&\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\cdot\mathbf{V},\quad\mathrm{where}\\&\mathbf{Q}=\mathbf{W}_Q^{(i)}\cdot\varphi_i(z_t),\ \mathbf{K}=\mathbf{W}_K^{(i)}\cdot\phi(y),\ \mathbf{V}=\mathbf{W}_V^{(i)}\cdot\phi(y).\end{aligned}$
  - $\varphi_i(z_t)\in\mathbb{R}^{N\times d_\epsilon^i}$: the spatially flattened tokens of the video latent
  - $\phi$: the CLIP text encoder, so $\phi(y)$ is the embedding of the text prompt
  - $y$: the input text prompt
- Motion speed control with fps is incorporated through an FPS embedder, which shares the same structure as the timestep embedder.
  - The FPS or timestep is projected into an embedding vector using sinusoidal embedding.
  - This vector is then fed into a two-layer MLP to map the sinusoidal embedding to a learned embedding.
  - Subsequently, the timestep embedding and FPS embedding are fused via elementwise addition. The fused embedding is finally added to the convolutional features to modulate the intermediate features.
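The FPS conditioning path can be sketched in a few lines. This is a rough illustration under stated assumptions: the embedding sizes, the SiLU activation, the random (untrained) weights, and the class name `EmbedderMLP` are all hypothetical, and the embedding width is set equal to the feature channel count so it can be added directly.

```python
import numpy as np

def sinusoidal_embedding(values, dim, max_period=10000.0):
    """Map scalars (timesteps or fps) to fixed sinusoidal vectors of size `dim`."""
    values = np.asarray(values, dtype=np.float64)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = values[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

class EmbedderMLP:
    """Two-layer MLP mapping a sinusoidal embedding to a learned embedding."""
    def __init__(self, in_dim, out_dim, rng):
        self.w1 = rng.normal(0.0, in_dim ** -0.5, (in_dim, out_dim))
        self.b1 = np.zeros(out_dim)
        self.w2 = rng.normal(0.0, out_dim ** -0.5, (out_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = x @ self.w1 + self.b1
        h = h / (1.0 + np.exp(-h))   # SiLU; the activation is an assumption
        return h @ self.w2 + self.b2

rng = np.random.default_rng(0)
base_dim, channels = 32, 64
time_mlp = EmbedderMLP(base_dim, channels, rng)
fps_mlp = EmbedderMLP(base_dim, channels, rng)   # same structure, separate weights

timesteps = np.array([10, 500])
fps = np.array([8, 24])
emb = (time_mlp(sinusoidal_embedding(timesteps, base_dim))
       + fps_mlp(sinusoidal_embedding(fps, base_dim)))   # elementwise fusion

features = rng.normal(size=(2, channels, 16, 8, 8))      # (B, C, T, H, W)
modulated = features + emb[:, :, None, None, None]       # broadcast over T, H, W
print(emb.shape, modulated.shape)
```

Because the fused embedding is broadcast over time and space, every position in a sample receives the same timestep and fps offset.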

VideoCrafter1: Image-to-Video Model(I2V)

Structure Overview.

Text prompts offer highly flexible control for content generation, but they primarily focus on semantic-level specifications rather than detailed appearance. To condition the video model on an image in a compatible way, it is essential to project the image into a text-aligned embedding space.


Text-Aligned Rich Image Embedding

The text embedding comes from the CLIP text encoder, so we employ its image encoder counterpart to extract image features from the input image.
- Though the global semantic token $f_{cls}$ from the CLIP image encoder is well-aligned with image captions, it primarily represents visual contents at a semantic level, while being less capable of capturing details.
- We utilize the full patch visual tokens $F_{vis} = \{f_i\}^K_{i=0}$ from the last layer of the CLIP image ViT, which are believed to encompass much richer information about the image.
To promote alignment with the text embedding:
- We utilize a learnable projection network $P$ to transform $F_{vis}$ into the target image embedding $F_{img} = P(F_{vis})$, enabling the video model backbone to process the image feature efficiently.
- The text embedding $F_{text}$ and image embedding $F_{img}$ then interact with the U-Net intermediate features $\mathbf{F}_{in}$ via dual cross-attention layers:
  $\mathbf{F}_{out}=\mathrm{Softmax}(\frac{\mathbf{Q}\mathbf{K}_{text}^\top}{\sqrt{d}})\mathbf{V}_{text}+\mathrm{Softmax}(\frac{\mathbf{Q}\mathbf{K}_{img}^\top}{\sqrt{d}})\mathbf{V}_{img},$
  - $\mathbf{Q} = \mathbf{F}_{in}\mathbf{W}_{q}$
  - $\mathbf{K}_{text} = \mathbf{F}_{text}\mathbf{W}_{k}, \mathbf{V}_{text} =\mathbf{F}_{text}\mathbf{W}_{v}$
  - $\mathbf{K}_{img}=\mathbf{F}_{img}\mathbf{W}_{k}^{\prime}, \mathbf{V}_{img}=\mathbf{F}_{img}\mathbf{W}_{v}^{\prime}$
  - We use the same query for image cross-attention as for text cross-attention. Thus, only two parameter matrices, $\mathbf{W}_k'$ and $\mathbf{W}_v'$, are newly added for each cross-attention layer.
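The dual cross-attention formula above can be sketched as follows. It is a single-head illustration with random, untrained weights. The class and function names are hypothetical, the projection network $P$ is reduced to one linear map as an assumption (the source only calls it learnable), and the token counts and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d)) @ v

def init(rng, in_dim, out_dim):
    return rng.normal(0.0, in_dim ** -0.5, (in_dim, out_dim))

class DualCrossAttention:
    """One shared query; separate key/value projections for text and image."""
    def __init__(self, feat_dim, ctx_dim, d, rng):
        self.w_q = init(rng, feat_dim, d)
        self.w_k, self.w_v = init(rng, ctx_dim, d), init(rng, ctx_dim, d)
        # The only newly added parameters per layer: W_k' and W_v'.
        self.w_k_img, self.w_v_img = init(rng, ctx_dim, d), init(rng, ctx_dim, d)

    def __call__(self, f_in, f_text, f_img):
        q = f_in @ self.w_q
        out_text = attend(q, f_text @ self.w_k, f_text @ self.w_v)
        out_img = attend(q, f_img @ self.w_k_img, f_img @ self.w_v_img)
        return out_text + out_img

rng = np.random.default_rng(0)
feat_dim, ctx_dim, vis_dim, d = 32, 48, 40, 16
f_in = rng.normal(size=(2, 64, feat_dim))     # flattened U-Net tokens
f_text = rng.normal(size=(2, 77, ctx_dim))    # text embedding F_text
f_vis = rng.normal(size=(2, 257, vis_dim))    # full patch tokens F_vis
w_p = init(rng, vis_dim, ctx_dim)             # linear stand-in for P
f_img = f_vis @ w_p                           # F_img = P(F_vis)

layer = DualCrossAttention(feat_dim, ctx_dim, d, rng)
f_out = layer(f_in, f_text, f_img)
print(f_out.shape)
```

Since the image branch is purely additive, zeroing `w_v_img` recovers the text-only cross-attention output, so the image branch can be added on top of an existing T2V layer without touching its text path.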

Experiments

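Implementation Details

Datasets: The authors adopt a joint image and video training strategy, using LAION COCO, Webvid10M, and a large dataset of more than 10 million high-resolution videos.
Training Scheme: The T2V model is trained from low to high resolution. Training starts at 256×256 and then continues at progressively higher resolutions.
Evaluation Metrics: The authors use EvalCrafter, a benchmark for video generation models, to evaluate video quality and text-video alignment comprehensively. EvalCrafter compares models through quantitative metrics and user studies.
Relations to Floor33: The authors deployed the two open-source models in a Discord channel named Floor33, where users can explore the models online by entering prompts. An optional prompt-extension feature enriches the information in the user's prompt.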

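Performance Evaluation

Text-to-Video Results: The authors compare their T2V model with commercial models such as Gen-2 and Pika Labs and with open-source models such as I2VGen-XL. The results show that it outperforms the other open-source T2V models in visual quality and text alignment.
Image-to-Video Results: The authors compare their method with existing image-to-video methods, including VideoComposer, I2VGen-XL, Pika, and Gen-2. Their I2V model preserves the content and structure of the input image while showing good temporal consistency and motion magnitude.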
