AIGC - VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models (CVPR 2024)

Homepage: https://ailab-cvc.github.io/videocrafter
Paper: https://arxiv.org/abs/2401.09047

Training high-quality video models without using high-quality videos.

ABSTRACT


MOTIVATION

  • Several commercial video models can already generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community.
  • Many existing research works train models on the low-quality WebVid-10M dataset and struggle to generate high-quality videos because the models are optimized to fit WebVid-10M: the picture quality is unsatisfactory, and most videos have a resolution of only about 320p.
  • The video data suffers from many issues, e.g., poor picture quality and captions, multiple clips in one video, and static frames or slides.
  • AnimateDiff finds that combining the temporal modules of a video model trained on WebVid-10M with a LORA SD model can improve the picture quality of the generated videos. However, this is not a generic model and does not always work:
    • the temporal modules can only be combined with a few selected LORA models, which makes it not a generic solution;
    • since each LORA model is a personalized model, the composed video model may suffer from degraded concept composition if the LORA model was trained with limited data;
    • the motion quality degenerates when the temporal modules do not match the LORA model well.

CONTRIBUTIONS

  • We propose a method to overcome the data limitations for training high-quality video models by disentangling motion from appearance at the data level.
  • We investigate the connection between the spatial and temporal modules and the distribution shift, and identify the keys to obtaining a high-quality video model.
  • We design an effective pipeline based on these observations, i.e., first obtaining a fully trained video model and then tuning its spatial modules with synthesized high-quality images.

METHODS

Spatial-temporal Connection Analyses

Base T2V model

  • To leverage the prior of SD trained on a large-scale image dataset, most T2V diffusion models inflate the SD model into a video model by adding temporal modules; they follow VDM in using a particular type of 3D-UNet that is factorized over space and time. These models can be categorized into two groups according to their training strategies:
    • One is to use videos to learn both the spatial and temporal modules, with the SD weights as initialization, called full training (Align Your Latents, AnimateDiff).
    • The other is to train the temporal modules with the spatial modules fixed, called partial training.
  • Architecture:
    • We follow the architecture of the open-source VideoCrafter1 with an FPS (frames per second) condition.
    • We also incorporate the temporal convolution from ModelScopeT2V to improve temporal consistency. A minimal structural sketch of such an inflated block is given below.
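
Below is a minimal PyTorch sketch of such an inflated block: a pretrained 2D spatial block followed by a zero-initialized temporal convolution over the frame axis, so the video model starts out behaving like the image model. The class names (`TemporalConv`, `InflatedBlock`) and the exact layout are illustrative assumptions rather than the actual VideoCrafter code.

```python
import torch
import torch.nn as nn


class TemporalConv(nn.Module):
    """1D convolution over the frame axis; zero-initialized so the block starts as identity."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        y = self.conv(y).reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x + y  # residual: pretrained spatial behavior is preserved at init


class InflatedBlock(nn.Module):
    """A pretrained 2D spatial block (from SD) followed by a newly added temporal module."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block            # initialized from SD weights (theta_S)
        self.temporal = TemporalConv(channels)  # newly added temporal module (theta_T)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        return self.temporal(y)


# usage: wrap a 2D conv block and run a 16-frame latent clip through it
block = InflatedBlock(nn.Conv2d(64, 64, kernel_size=3, padding=1), channels=64)
out = block(torch.randn(1, 64, 16, 40, 64))  # shape stays (1, 64, 16, 40, 64)
```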

Parameter Perturbation for Full and Partial Training.

Settings
  • Experimental setup:
    • Full training and partial training: both training strategies are applied to the same architecture and trained on the same dataset, with the models initialized from pretrained SD (Stable Diffusion) weights.
      • For simplicity, the fully trained video model is denoted $M_F(\theta_T, \theta_S)$ and the partially trained video model $M_P(\theta_T, \theta_S^0)$.
      • $\theta_T$ and $\theta_S$ are the learned parameters of the temporal and spatial modules, respectively; $\theta_S^0$ are the original spatial parameters of SD.
    • Dataset: WebVid-10M is used as the training video data; to avoid concept forgetting, it is combined with the LAION-COCO 600M dataset for joint video-image training (a minimal sketch of this joint batching appears after this Settings list).
    • Resolution: the training resolution is set to 512×320.
  • To evaluate the connection strength between the spatial and temporal modules:
    • We perturb the parameters of the specified modules by finetuning on another high-quality image dataset $\mathcal{D}_I$.
    • The image data is JDB, which consists of images synthesized by Midjourney.
    • Since JDB has 4 million images, LORA is used for the finetuning.
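
As a rough illustration of the joint video-image training mentioned above, the sketch below alternates WebVid-10M video batches with LAION-COCO image batches, treating each image as a single-frame clip so the same 3D-UNet forward pass can consume both. The loaders, the `image_every` ratio, and the batch format are assumptions for illustration, not the authors' actual schedule.

```python
import itertools


def joint_batches(video_loader, image_loader, image_every: int = 4):
    """Yield a mixed stream of video and image batches for joint training.

    Every `image_every`-th step draws an image batch (B, C, H, W) and expands it
    to a single-frame clip (B, C, 1, H, W); all other steps draw video batches
    (B, C, T, H, W). `image_every` is a hypothetical mixing ratio.
    """
    video_iter = itertools.cycle(video_loader)
    image_iter = itertools.cycle(image_loader)
    for step in itertools.count():
        if step % image_every == 0:
            frames, captions = next(image_iter)
            frames = frames.unsqueeze(2)  # (B, C, H, W) -> (B, C, 1, H, W)
        else:
            frames, captions = next(video_iter)
        yield frames, captions
```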
Spatial Perturbation
  • When perturbing the spatial parameters of the two video models with the image dataset, the temporal parameters are frozen, and vice versa.

  • The perturbation processes of the fully trained base model $M_F$ and the partially trained base model $M_P$ can be denoted as (a code-level sketch of this LORA perturbation is given at the end of this subsection):

    $$M_F'(\theta_T,\theta_S+\Delta_{\theta_S}) \leftarrow \mathrm{PERTB}_{\theta_S}^{\mathrm{LORA}}(M_F(\theta_T,\theta_S),\mathcal{D}_I),$$
    $$M_P'(\theta_T,\theta_S^0+\Delta_{\theta_S}) \leftarrow \mathrm{PERTB}_{\theta_S}^{\mathrm{LORA}}(M_P(\theta_T,\theta_S^0),\mathcal{D}_I),$$

    • $\mathrm{PERTB}_{\theta_S}^{\mathrm{LORA}}$: finetuning the model with respect to $\theta_S$ on the image dataset $\mathcal{D}_I$ using LORA.
    • $M_F$ denotes the fully trained video model and $M_P$ the partially trained video model.
    • $\Delta_{\theta_S}$: the parameters of the LORA branch added to the spatial modules.
    • $\theta_T$ and $\theta_S$ are the learned parameters of the temporal and spatial modules, respectively; $\theta_S^0$ are the original spatial parameters of SD.
    • F-Spa-LORA denotes $M_F'$ and P-Spa-LORA denotes $M_P'$ (Spa: finetune spatial modules; Temp: finetune temporal modules).
  • Observations:

    • The motion quality of F-Spa-LORA is more stable than that of P-Spa-LORA.
      • The motion of P-Spa-LORA degrades quickly during finetuning: the more finetuning steps, the more static the videos become, with local flicker.
      • The motion of F-Spa-LORA only slightly degenerates compared to the fully trained base model.
    • Visual quality evaluation of the perturbed T2V models:
      • P-Spa-LORA achieves much better visual quality than F-Spa-LORA.
      • The picture quality and aesthetic score of P-Spa-LORA are greatly improved compared with the partially trained base model; surprisingly, the watermark is also removed.
      • F-Spa-LORA obtains only a slight improvement in picture quality and aesthetic score, and the generated videos are still noisy.
  • Conclusions:
    The coupling strength between the spatial and temporal modules of the fully trained model is stronger than that of the partially trained model.

    • The spatial-temporal coupling of the partially trained model is easily broken, leading to quick motion degeneration and a shift in picture quality; a stronger connection tolerates parameter perturbation better than a weak one.
    • This observation explains the quality improvement and motion degeneration of AnimateDiff:
      • AnimateDiff is not a generic model and only works with selected personalized SD models, because its motion modules are obtained with the partial training strategy and cannot tolerate large parameter perturbations.
      • When the personalized model does not match the temporal modules, both picture and motion quality degenerate.
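
The sketch below shows one plausible implementation of the $\mathrm{PERTB}_{\theta_S}^{\mathrm{LORA}}$ step in plain PyTorch: every parameter is frozen, and low-rank branches ($\Delta_{\theta_S}$) are attached only to linear layers inside the spatial modules, so only the LORA weights are updated by the image finetuning. The `spatial` name-matching convention, rank, and scaling are assumptions; the released code may organize this differently (e.g., via a LORA library).

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer with a trainable low-rank branch (zero-initialized)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)   # LoRA branch starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


def attach_spatial_lora(model: nn.Module, spatial_key: str = "spatial"):
    """Freeze theta_T and theta_S, then add LoRA branches to spatial Linear layers.

    `spatial_key` is a hypothetical naming convention for the spatial modules;
    adapt it to the module names of the actual UNet implementation.
    """
    for p in model.parameters():
        p.requires_grad = False  # temporal and spatial base weights stay fixed

    # collect targets first so the module tree is not modified while iterating
    targets = [
        (module, child_name, child)
        for name, module in model.named_modules() if spatial_key in name
        for child_name, child in module.named_children() if isinstance(child, nn.Linear)
    ]
    for module, child_name, child in targets:
        setattr(module, child_name, LoRALinear(child))

    # only the newly created LoRA parameters require gradients
    return [p for p in model.parameters() if p.requires_grad]
```

The temporal perturbation described next is symmetric: the same wrapper would target the temporal modules by name while everything else stays frozen.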
Temporal Perturbation.
  • Temporal modules: the partially trained model has only its temporal modules updated, yet its picture quality is shifted toward that of WebVid-10M. Hence, the temporal modules are responsible not only for motion but also for picture quality.
  • We perturb the temporal modules while fixing the spatial modules, again using the image dataset. The perturbation processes can be denoted as:

    $$M_F''(\theta_T+\Delta_{\theta_T},\theta_S) \leftarrow \mathrm{PERTB}_{\theta_T}^{\mathrm{LORA}}(M_F(\theta_T,\theta_S),\mathcal{D}_I),$$
    $$M_P''(\theta_T+\Delta_{\theta_T},\theta_S^0) \leftarrow \mathrm{PERTB}_{\theta_T}^{\mathrm{LORA}}(M_P(\theta_T,\theta_S^0),\mathcal{D}_I).$$
  • Observations:
    • The picture quality of P-Temp-LORA ($M_P''$) is better than that of F-Temp-LORA ($M_F''$).
    • However, the foreground and background of the videos are shakier in P-Temp-LORA, i.e., the temporal consistency becomes worse.
    • The picture quality of F-Temp-LORA is improved, but the watermark remains. Its motion is close to the base model and much better than that of P-Temp-LORA.

Data-level Disentanglement of Appearance and Motion

We propose to disentangle motion from appearance at the data level, i.e., learning motion from low-quality videos while learning picture quality and aesthetics from high-quality images.

  • According to the study of the connection between spatial and temporal modules, a fully trained model is more suitable for subsequent finetuning with high-quality images.

  • With both spatial and temporal perturbation, the picture quality can be improved, but not significantly. To obtain a greater quality improvement, we evaluate two strategies:

    • One is to involve more parameters, i.e., finetuning both spatial and temporal modules with images.
    • The other is to change the finetuning method, i.e., using direct finetuning without LORA.
  • We evaluate the following four cases:

    $$M_F^A(\theta_T+\Delta_{\theta_T},\theta_S+\Delta_{\theta_S}) \leftarrow \mathrm{PERTB}_{\theta_T,\theta_S}^{\mathrm{LORA}}(M_F(\theta_T,\theta_S),\mathcal{D}_I),$$
    $$M_F^B(\theta_T,\theta_S+\Delta_{\theta_S}) \leftarrow \mathrm{PERTB}_{\theta_S}(M_F(\theta_T,\theta_S),\mathcal{D}_I),$$
    $$M_F^C(\theta_T+\Delta_{\theta_T},\theta_S) \leftarrow \mathrm{PERTB}_{\theta_T}(M_F(\theta_T,\theta_S),\mathcal{D}_I),$$
    $$M_F^D(\theta_T+\Delta_{\theta_T},\theta_S+\Delta_{\theta_S}) \leftarrow \mathrm{PERTB}_{\theta_T,\theta_S}(M_F(\theta_T,\theta_S),\mathcal{D}_I).$$

    • $M_F^A$ (F-Spa&Temp-LORA): finetuning both spatial and temporal modules with LORA on images.
    • $M_F^B$ (F-Spa-DIR): directly finetuning the spatial modules.
    • $M_F^C$ (F-Temp-DIR): directly finetuning the temporal modules.
    • $M_F^D$ (F-Spa&Temp-DIR): directly finetuning all modules.
  • Observations:

    • F-Spa&Temp-LORA further improves the picture quality over F-Spa-LORA, but the quality is still unsatisfying: the watermark remains in most generated videos, and the noise is obvious.
    • F-Temp-DIR achieves better picture quality than F-Temp-LORA and also better than F-Spa&Temp-LORA; the watermark is removed or lightened in half of the videos.
    • F-Spa-DIR and F-Spa&Temp-DIR achieve the best picture quality among the finetuned models, but the motion of F-Spa-DIR is better.
    • The foreground and background flicker in videos generated by F-Spa&Temp-DIR ($M_F^D$), especially local textures.
  • Conclusion: directly finetuning the spatial modules with high-quality images is the best way to improve picture quality while only marginally affecting motion quality. The resulting pipeline (sketched in code below) is:

    • first fully train a video model with low-quality videos,
    • then directly finetune only the spatial modules with high-quality images.
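
A minimal sketch of this second stage, assuming a plain PyTorch training loop and a placeholder `diffusion_loss` (the usual noise-prediction objective, not defined here): temporal modules are frozen by a simple name filter, and only the spatial parameters are updated on high-quality image batches. The name-matching convention and hyperparameters are illustrative, not the released training code.

```python
import itertools
import torch
import torch.nn as nn


def finetune_spatial_only(video_model: nn.Module, image_loader, diffusion_loss,
                          steps: int = 10_000, lr: float = 5e-5):
    """Directly finetune the spatial modules of a fully trained video model on images.

    `diffusion_loss(model, frames, captions)` is a placeholder for the usual
    noise-prediction objective; the "temporal" name filter is a hypothetical convention.
    """
    for name, p in video_model.named_parameters():
        p.requires_grad = "temporal" not in name  # freeze theta_T, train theta_S

    trainable = [p for p in video_model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    data = itertools.cycle(image_loader)
    for _ in range(steps):
        images, captions = next(data)   # high-quality images, e.g. from JDB
        frames = images.unsqueeze(2)    # treat each image as a single-frame clip
        loss = diffusion_loss(video_model, frames, captions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return video_model
```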

Promotion of Concept Composition

  • To improve the concept composition ability of the video model, rather than using the training images of strong image models, we propose transferring their concept composition ability to the video model by synthesizing a set of images with complex concepts.
  • To validate the effectiveness of synthesized images, we use JDB and LAION-aesthetics V2 as the image data for the second finetuning stage:
    • LAION-aesthetics V2 consists of web-collected images, while JDB contains images synthesized by Midjourney.
    • We observe that the model trained with JDB has much better concept composition ability.

EXPERIMENTS

Data

  • WebVid-10M is used as the low-quality video data source; it contains about 10 million text-video pairs, most videos have a resolution of around 336×596, and each video is a single shot.
  • JDB is used as the high-quality image data source; it contains about 4 million high-resolution images synthesized by Midjourney, each with a corresponding text prompt.
  • To prevent concept forgetting during base T2V model training, LAION-COCO, a dataset of 600 million publicly available web images with high-quality captions, is also used.

Metrics

  • EvalCrafter is used for quantitative evaluation; it is a benchmark for text-to-video generation models with about 18 objective metrics covering visual quality, content quality, motion quality, and text-video alignment.
  • In addition to the objective metrics, a user study is conducted to capture human preference, since there is currently no comprehensive objective metric for motion quality.

Training Details

  • The base model architecture follows the open-source VideoCrafter1 and incorporates the temporal convolution from ModelScopeT2V to improve temporal consistency.
  • The spatial modules are initialized with the SD 2.1 weights, and the outputs of the temporal modules are initialized to zero.
  • The training resolution is set to 512×320; training runs for 270K iterations on 32 NVIDIA A100 GPUs with a batch size of 128.
  • The learning rate is set to 5×10^-5; the LORA finetuning uses only the JDB dataset. (These hyperparameters are collected into a config sketch below.)
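
For reference, the stated hyperparameters can be collected into a simple config sketch; the field names below are illustrative and not the repository's actual configuration keys.

```python
from dataclasses import dataclass


@dataclass
class BaseTrainingConfig:
    """Stage 1: fully train the video model on low-quality videos plus images."""
    video_data: str = "WebVid-10M"
    image_data: str = "LAION-COCO"       # joint training to avoid concept forgetting
    resolution: tuple = (512, 320)       # width x height
    batch_size: int = 128
    iterations: int = 270_000
    learning_rate: float = 5e-5
    num_gpus: int = 32                   # NVIDIA A100


@dataclass
class SpatialFinetuneConfig:
    """Stage 2: finetune only the spatial modules on high-quality images."""
    image_data: str = "JDB"              # ~4M Midjourney-synthesized images
    trainable_modules: str = "spatial"
    learning_rate: float = 5e-5          # assumption: the stated LR is reused here
```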

Comparison with State-of-the-Art T2V Models

  • The method is compared with several existing T2V models, including the commercial models Gen-2 and Pika Labs and the open-source models Show-1, VideoCrafter1, and AnimateDiff.
  • For the comparison with AnimateDiff, its temporal modules (version 2) based on SD v1.5 are used, with Realistic Vision V2.0 [10] as the corresponding LORA model.
  • The compared aspects include visual quality, text-video alignment, and motion quality.

Quantitative Evaluation

  • The quantitative results obtained with EvalCrafter show that the method is comparable in visual quality to VideoCrafter1 and Pika Labs, which are trained with high-quality videos.
  • It ranks second in text-video alignment; in motion quality, it outperforms Show-1 but is inferior to models that learn motion from much larger amounts of video data.

Strategy Evaluation

  • Different training strategies are evaluated, including the spatial-temporal connection analysis and the choice of which modules to finetune, to determine the most effective finetuning approach.
  • Evaluation of visual quality and motion quality confirms that directly finetuning the spatial modules is the key to improving picture quality.