Homepage: https://ailab-cvc.github.io/videocrafter
Paper: https://arxiv.org/abs/2310.19512
MOTIVATION
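- Cascaded models: Some text-to-video models, such as Make-A-Video and Imagen Video, adopt a cascaded approach. They refine each frame of the video progressively through a series of stages, which may include initial generation and detail enhancement.
- Open-source vs. closed models: Although startups such as Gen-2, Pika Labs, and Moonvalley can generate high-quality videos, their models are not accessible to researchers. Meanwhile, open-source T2V models such as ModelScope, Hotshot-XL, AnimateDiff, and Zeroscope V2 XL exist, but they are limited in resolution and quality.
- The need for I2V models: The open-source community currently lacks a high-quality I2V model that can animate an image while preserving its content and structure. Pika Labs and Gen-2 have released I2V models, but their animation is limited and shows visible artifacts.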
CONTRIBUTION
- We introduce a text-to-video model capable of generating high-quality videos with a resolution of 1024 × 576 and cinematic quality. The model is trained on 20 million videos and 600 million images.
- The T2V model builds upon SD 2.1 by incorporating temporal attention layers into the SD UNet to capture temporal consistency.
- We employ a joint image and video training strategy to prevent concept forgetting.
- The training dataset comprises LAION COCO 600M, Webvid10M, and a 10M high-resolution collected video dataset.
- We present an image-to-video model, the first open-source generic I2V model that can strictly preserve the content and structure of the input reference image while animating it into a video. This model allows for both image and text inputs.
- The I2V model is based on the T2V model and accepts both text and image inputs.
- The image embedding is extracted using CLIP and injected into the SD UNet through cross-attention, similar to the injection of text embeddings.
- The I2V model is trained on LAION COCO 600M and Webvid10M.
RELATED WORKS
VDM
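- The first VDM:
- It uses a space-time factorized U-Net to model low-resolution videos in pixel space.
- It is trained jointly on image and video data so that it can better capture the relationship between images and videos.
- Imagen Video's cascaded paradigm: Imagen Video introduces an effective cascaded paradigm that uses diffusion models (DMs) to generate high-definition videos.
- It adopts the v-prediction parameterization, in which the denoiser predicts a mix of the noise and the clean sample rather than the noise alone.
- Improving computational efficiency:
- Computational cost of pixel-based VDMs and text-video alignment of latent-based VDMs:
- Zhang et al. [59] point out that pixel-based VDMs are computationally expensive, while latent-based VDMs perform poorly at text-video alignment.
- To address both issues, they propose a hybrid pixel-latent VDM framework that combines pixel-space and latent-space approaches to generate videos.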
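Conditional controls in T2V DMs
- Integrating structure control: Models such as Gen-1 and Make-Your-Video integrate structure control into video diffusion models for video editing by concatenating frame-wise depth maps with the input noise sequence.
- Challenges in temporal consistency and realistic motion:
- Models such as Seer and VideoComposer may focus on specific domains (such as indoor objects), or they fail to generate temporally coherent frames and realistic motion because they lack a semantic understanding of the input image.
- DragNUWA further introduces trajectory control into image-to-video generation, but this only mitigates unrealistic motion to a certain extent.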
Methodology
VideoCrafter1: Text-to-Video Model(T2V)
Structure Overview
The VideoCrafter T2V model is a Latent Video Diffusion Model (LVDM).
- It has two key components: a video VAE and a video latent diffusion model.
- The video VAE is responsible for reducing the sample dimension, allowing the subsequent diffusion model to be more compact and efficient.
- VAE:
- First, the video data $x_0$ is fed into the VAE encoder $E$ to project it into the video latent $z_0$, which exhibits a lower data dimension with a compressed video representation.
- Then, the video latent can be projected back into the reconstructed video $x_0'$ via the VAE decoder $D$.
- We adopt the pretrained VAE from the Stable Diffusion model to serve as the video VAE and project each frame individually without extracting temporal information.
- diffusion forward process:
- After obtaining the video latent $z_0$, the diffusion process is performed on $z_0$:
$$\begin{aligned}&q(\mathbf{z}_{1:T}|\mathbf{z}_{0}):=\prod_{t=1}^{T}q(\mathbf{z}_{t}|\mathbf{z}_{t-1}),\\&q(\mathbf{z}_{t}|\mathbf{z}_{t-1}):=\mathcal N(\mathbf{z}_{t};\sqrt{1-\beta_{t}}\mathbf{z}_{t-1},\beta_{t}\mathbf{I})\end{aligned}$$
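The forward process above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the linear schedule, its endpoints, the number of steps, the latent shape, and the function names (`linear_beta_schedule`, `forward_step`, `forward_process`) are all assumptions made for the example, since the notes do not specify them.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule; the notes do not specify one."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(z_prev, beta_t, rng):
    """Sample z_t ~ N(sqrt(1 - beta_t) * z_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * noise

def forward_process(z0, betas, rng):
    """Run the Markov chain q(z_{1:T} | z_0) and return the final latent z_T."""
    z = z0
    for beta_t in betas:
        z = forward_step(z, beta_t, rng)
    return z

rng = np.random.default_rng(0)
# Toy video latent with shape (frames, channels, height, width); illustrative only.
z0 = rng.standard_normal((16, 4, 8, 8))
betas = linear_beta_schedule(1000)
zT = forward_process(z0, betas, rng)

# Standard DDPM identity (not stated in the notes): after T steps the original
# latent is scaled by sqrt(prod(1 - beta_t)), which is close to zero here.
signal_scale = np.sqrt(np.prod(1.0 - betas))
```

With this schedule almost none of the original latent survives in `zT`, which is why sampling can start from pure Gaussian noise.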
Denoising 3D U-Net
- The denoising U-Net is a 3D U-Net architecture.
- The U-Net consists of a stack of basic spatial-temporal blocks with skip connections.
- Each block comprises convolutional layers, spatial transformers (ST), and temporal transformers (TT):
$$\begin{aligned}\mathrm{ST}&=\mathrm{Proj}_{\mathrm{in}}\circ(\mathrm{Attn}_{\mathrm{self}}\circ\mathrm{Attn}_{\mathrm{cross}}\circ\mathrm{MLP})\circ\mathrm{Proj}_{\mathrm{out}},\\\mathrm{TT}&=\mathrm{Proj}_{\mathrm{in}}\circ(\mathrm{Attn}_{\mathrm{temp}}\circ\mathrm{Attn}_{\mathrm{temp}}\circ\mathrm{MLP})\circ\mathrm{Proj}_{\mathrm{out}}.\end{aligned}$$
- The controlling signals of the denoiser include semantic control, such as the text prompt, and motion speed control, such as the video fps.
- We inject the semantic control via cross-attention:
$$\begin{aligned}&\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\cdot\mathbf{V},\quad\mathrm{where}\\&\mathbf{Q}=\mathbf{W}_Q^{(i)}\cdot\varphi_i(z_t),\quad\mathbf{K}=\mathbf{W}_K^{(i)}\cdot\phi(y),\quad\mathbf{V}=\mathbf{W}_V^{(i)}\cdot\phi(y).\end{aligned}$$
- $\varphi_i(z_t)\in\mathbb{R}^{N\times d_\epsilon^i}$: the spatially flattened tokens of the video latent
- $\phi(y)$: the output of the CLIP text encoder $\phi$, which converts the text prompt into embedding vectors
- $y$: the input text prompt
- Motion speed control with fps is incorporated through an FPS embedder, which shares the same structure as the timestep embedder.
- The FPS or timestep is projected into an embedding vector using sinusoidal embedding.
- This vector is then fed into a two-layer MLP to map the sinusoidal embedding to a learned embedding.
- Subsequently, the timestep embedding and FPS embedding are fused via elementwise addition. The fused embedding is finally added to the convolutional features to modulate the intermediate features.
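The FPS conditioning path described above can be sketched as follows. This is a hedged illustration only: the class name `ScalarEmbedder`, the SiLU activation, the embedding sizes, the random weights, and the choice of separate weights for the two embedders are assumptions, and the embedding size is set equal to the channel count so that the addition needs no extra projection.

```python
import numpy as np

def sinusoidal_embedding(value, dim, max_period=10000.0):
    """Map a scalar (timestep or FPS) to a vector of cosines and sines."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = value * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def silu(x):
    # Assumed activation; the notes only say "two-layer MLP".
    return x / (1.0 + np.exp(-x))

class ScalarEmbedder:
    """Sinusoidal embedding followed by a two-layer MLP (hypothetical class)."""

    def __init__(self, sin_dim, emb_dim, rng):
        self.sin_dim = sin_dim
        self.w1 = rng.standard_normal((sin_dim, emb_dim)) * 0.02
        self.b1 = np.zeros(emb_dim)
        self.w2 = rng.standard_normal((emb_dim, emb_dim)) * 0.02
        self.b2 = np.zeros(emb_dim)

    def __call__(self, value):
        hidden = silu(sinusoidal_embedding(value, self.sin_dim) @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2

rng = np.random.default_rng(0)
channels = 64  # illustrative; embedding size equals the channel count here
timestep_embedder = ScalarEmbedder(sin_dim=32, emb_dim=channels, rng=rng)
fps_embedder = ScalarEmbedder(sin_dim=32, emb_dim=channels, rng=rng)  # same structure

# Fuse the two embeddings by elementwise addition.
fused = timestep_embedder(500) + fps_embedder(24)

# Add the fused embedding to toy convolutional features of shape
# (channels, frames, height, width), broadcasting over frames and space.
features = rng.standard_normal((channels, 4, 8, 8))
modulated = features + fused[:, None, None, None]
```

Because the two embedders share one structure, adding FPS control costs only one extra small MLP, and the U-Net blocks receive a single fused vector.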
VideoCrafter1: Image-to-Video Model(I2V)
Structure Overview.
Text prompts
- They offer highly flexible control for content generation, but they primarily focus on semantic-level specifications rather than detailed appearance.
- To supply the video model with image information in a compatible way, it is essential to project the image into a text-aligned embedding space.
Text-Aligned Rich Image Embedding
- Use the CLIP image encoder, the counterpart of the CLIP text encoder, to extract image features from the input image.
- Though the global semantic token $f_{cls}$ from the CLIP image encoder is well-aligned with image captions, it primarily represents visual content at a semantic level and is less capable of capturing details.
- We therefore utilize the full patch visual tokens $F_{vis} = \{f_i\}^K_{i=0}$ from the last layer of the CLIP image ViT, which are believed to encompass much richer information about the image.
- To promote alignment with the text embedding, a learnable projection network $P$ transforms $F_{vis}$ into the target image embedding $F_{img} = P(F_{vis})$, enabling the video model backbone to process the image feature efficiently.
- The text embedding $\mathbf{F}_{text}$ and the image embedding $\mathbf{F}_{img}$ then interact with the U-Net intermediate features $\mathbf{F}_{in}$ via dual cross-attention layers:
$$\mathbf{F}_{out}=\mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}_{text}^\top}{\sqrt{d}}\right)\mathbf{V}_{text}+\mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}_{img}^\top}{\sqrt{d}}\right)\mathbf{V}_{img},$$
- $\mathbf{Q} = \mathbf{F}_{in}\mathbf{W}_{q}$
- $\mathbf{K}_{text} = \mathbf{F}_{text}\mathbf{W}_{k}, \mathbf{V}_{text} =\mathbf{F}_{text}\mathbf{W}_{v}$
- $\mathbf{K}_{img}=\mathbf{F}_{img}\mathbf{W}_{k}^{\prime}, \mathbf{V}_{img}=\mathbf{F}_{img}\mathbf{W}_{v}^{\prime}$
- We use the same query for image cross-attention as for text cross-attention. Thus, only two parameter matrices, $\mathbf{W}_k'$ and $\mathbf{W}_v'$, are newly added for each cross-attention layer.
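A minimal single-head NumPy sketch of the dual cross-attention above may make the shared query concrete. The function names (`softmax`, `attend`, `dual_cross_attention`), the token counts, the feature sizes, and the random weights are assumptions for illustration; the image embedding is assumed to have the same width as the text embedding after the projection network.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention for a single head."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def dual_cross_attention(f_in, f_text, f_img, w_q, w_k, w_v, w_k_img, w_v_img):
    """F_out = Attn(Q, K_text, V_text) + Attn(Q, K_img, V_img) with a shared Q."""
    q = f_in @ w_q  # one query projection serves both branches
    text_out = attend(q, f_text @ w_k, f_text @ w_v)
    img_out = attend(q, f_img @ w_k_img, f_img @ w_v_img)
    return text_out + img_out

rng = np.random.default_rng(0)
n_tokens, n_text, n_img = 64, 77, 257  # illustrative token counts
c_in, c_ctx, d = 32, 48, 16            # illustrative feature sizes

f_in = rng.standard_normal((n_tokens, c_in))
f_text = rng.standard_normal((n_text, c_ctx))
f_img = rng.standard_normal((n_img, c_ctx))

w_q = rng.standard_normal((c_in, d))
w_k = rng.standard_normal((c_ctx, d))
w_v = rng.standard_normal((c_ctx, d))
# The only new parameters of the image branch: W_k' and W_v'.
w_k_img = rng.standard_normal((c_ctx, d))
w_v_img = rng.standard_normal((c_ctx, d))

f_out = dual_cross_attention(f_in, f_text, f_img, w_q, w_k, w_v, w_k_img, w_v_img)
```

If `w_v_img` is set to zeros, the image branch contributes nothing and the layer reduces to ordinary text cross-attention, so the image branch acts as an additive term on top of the text-conditioned layer.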
Experiments
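Implementation Details
- Datasets: The authors adopt a joint image and video training strategy, using LAION COCO, Webvid10M, and a large dataset of more than 10 million high-resolution videos.
- Training Scheme: The T2V model is trained with a low-to-high resolution strategy. It is first trained at a resolution of 256×256, and training then continues at progressively higher resolutions.
- Evaluation Metrics: The authors use EvalCrafter, a benchmark for evaluating video generation models, to comprehensively assess video quality and text-video alignment. EvalCrafter compares models through quantitative metrics and user studies.
- Relations to Floor33: The authors deployed the two open-source models on a Discord channel named Floor33, where users can explore the models online by entering prompts. An optional prompt-extension feature was also added to enrich the information in user prompts.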
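Performance Evaluation
- Text-to-Video Results: The authors compare their T2V model with commercial models such as Gen-2 and Pika Labs and with open-source models such as I2VGen-XL. The results show that their model outperforms other open-source T2V models in visual quality and text alignment.
- Image-to-Video Results: The authors compare their method with existing image-to-video methods, including VideoComposer, I2VGen-XL, Pika, and Gen-2. Their I2V model preserves the content and structure of the input image while showing good temporal consistency and motion magnitude.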