High-Resolution Image Generation Based on Latent Diffusion Models

This post covers a new method that trains diffusion models in the latent space of a powerful pretrained autoencoder, significantly reducing computational requirements while preserving the quality and flexibility of image synthesis. The authors further use cross-attention layers to handle conditioning inputs, enabling efficient text-to-image synthesis and other high-resolution generation tasks.

1 Title

        High-Resolution Image Synthesis with Latent Diffusion Models (Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer)

2 Conclusion

        Since DMs typically operate directly in pixel space, optimizing a powerful DM often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, this paper applies them in the latent space of powerful pretrained autoencoders. The resulting latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

3 Good Sentences

        1、In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.(The advantages of this method in contrast to previous works)
        2、To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhance their accessibility. (The main problem of DMs that this work sets out to address)
        3、While approaches to jointly learn an encoding/decoding model together with a score-based prior exist, they still require a difficult weighting between reconstruction and generative capabilities and are outperformed by our approach. (The advantages of this method in contrast to previous work)


Introduction

        The diffusion model is a likelihood-based model that can achieve better generation quality than GANs. However, sampling proceeds through many sequential denoising iterations, so both training and inference are expensive. This paper moves the diffusion process into a latent space, which greatly reduces the computational cost while still delivering very good generation quality. In addition, a cross-attention mechanism is introduced for conditional, multimodal training, so that class-conditional, text-to-image, and layout-to-image generation all become possible.
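A minimal sketch of inference in this setup is given below: a DDPM-style denoising loop runs entirely in latent space, and only the final clean latent is decoded to pixels. The names unet, decoder, betas and latent_shape are placeholders for this illustration, not the paper's actual interfaces.

```python
import torch

@torch.no_grad()
def sample_ldm(unet, decoder, betas, latent_shape, device="cpu"):
    """Ancestral (DDPM-style) sampling in latent space, then a single decode to pixel space."""
    betas = betas.to(device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape, device=device)             # start from pure noise in latent space
    for t in reversed(range(len(betas))):
        t_batch = torch.full((latent_shape[0],), t, device=device, dtype=torch.long)
        eps = unet(z, t_batch)                                # time-conditional UNet predicts the noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])          # posterior mean of z_{t-1}
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)  # inject noise except at the last step
    return decoder(z)                                         # decoder maps the clean latent to an image
```

Because every denoising step operates on the small latent rather than on full-resolution pixels, the sequential sampling loop is much cheaper than in a pixel-space DM.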

Method

The overall framework is shown in the figure: first train an autoencoder (consisting of an encoder and a decoder). Diffusion is then performed on the data compressed by the encoder, and the decoder maps the result back to an image.
        The diffusion process itself is unchanged; the only difference is that the objects being diffused and reconstructed are now vectors in the latent space. The diffusion model is implemented as a time-conditional UNet.


        In the latent space, imperceptible high-frequency details are abstracted away, so it is an efficient, lower-dimensional space that is better suited to likelihood-based generative modeling.
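A minimal training step for this idea is sketched below, assuming a frozen pretrained encoder and an epsilon-predicting time-conditional unet; it is the standard denoising objective applied to latents, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(encoder, unet, x, betas, optimizer):
    """One step: encode x, diffuse the latent, and train the UNet to predict the injected noise."""
    with torch.no_grad():
        z0 = encoder(x)                                        # compress the image into latent space

    alpha_bars = torch.cumprod(1.0 - betas, dim=0).to(z0.device)
    t = torch.randint(0, len(betas), (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise     # forward diffusion q(z_t | z_0)

    loss = F.mse_loss(unet(zt, t), noise)                      # epsilon-prediction objective on latents
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the autoencoder frozen reflects the paper's two-stage setup: the perceptual compression model is trained once and then reused for every diffusion model trained on top of it.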

Perceptual Image Compression

        Given an image x in RGB space, the encoder E maps x to a latent representation, and the decoder D reconstructs the image from that latent. The encoder downsamples the image by a factor f. To avoid high-variance latent spaces, two kinds of regularization are used: KL-reg and VQ-reg.
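The toy module below sketches the KL-regularized variant under illustrative assumptions (downsampling factor f = 4, small channel counts); the architecture is not the paper's, but it shows how the encoder predicts a mean and log-variance and how a slight KL penalty keeps the latent space low-variance.

```python
import torch
import torch.nn as nn

class KLRegAutoencoder(nn.Module):
    """Toy convolutional autoencoder with f = 4 downsampling and a KL penalty on the latent."""

    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(                          # two stride-2 convs -> f = 4
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 2 * latent_channels, 3, stride=2, padding=1),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)      # reparameterization trick
        kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).mean()    # slight pull toward N(0, I)
        return self.decoder(z), kl
```

The VQ-reg variant instead uses a vector-quantization layer absorbed into the decoder, trading the continuous KL penalty for a discrete codebook.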

Conditioning Mechanisms

        To turn DMs into more flexible conditional image generators, a cross-attention mechanism is introduced. To preprocess conditioning inputs y from various modalities (such as language prompts), a domain-specific encoder τ_θ is added; it projects y to an intermediate representation τ_θ(y), which is then mapped into the intermediate layers of the UNet through cross-attention layers, where Attention(Q, K, V) = softmax(QK^T / sqrt(d)) · V, with Q = W_Q · φ_i(z_t), K = W_K · τ_θ(y), and V = W_V · τ_θ(y).
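The layer below sketches this conditioning path under the usual conventions: queries come from the flattened UNet feature map φ_i(z_t), while keys and values come from the conditioning tokens τ_θ(y). Dimensions and module names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention that injects conditioning tokens tau_theta(y) into UNet features phi_i(z_t)."""

    def __init__(self, query_dim, context_dim, d=64):
        super().__init__()
        self.scale = d ** -0.5
        self.to_q = nn.Linear(query_dim, d, bias=False)    # Q = W_Q . phi_i(z_t)
        self.to_k = nn.Linear(context_dim, d, bias=False)  # K = W_K . tau_theta(y)
        self.to_v = nn.Linear(context_dim, d, bias=False)  # V = W_V . tau_theta(y)
        self.to_out = nn.Linear(d, query_dim)

    def forward(self, x, context):
        # x:       (batch, H*W, query_dim)   flattened UNet feature map
        # context: (batch, seq, context_dim) tokens from the domain-specific encoder
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # softmax(Q K^T / sqrt(d))
        return self.to_out(attn @ v)
```

Because this mechanism only assumes that y can be encoded into a token sequence, the same layer covers text prompts, class labels, and layouts alike.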
