High-Resolution Image Generation Based on Latent Diffusion Models

This post covers a new method that trains diffusion models in the latent space of a powerful pretrained autoencoder, significantly reducing computational requirements while preserving the quality and flexibility of image synthesis. The authors further use cross-attention layers to handle conditioning inputs, enabling efficient text-to-image synthesis and other high-resolution generation tasks.

1 Title

        High-Resolution Image Synthesis with Latent Diffusion Models (Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer)

2 Conclusion

        Since DMs typically operate directly in pixel space, optimizing a powerful DM often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, this paper applies them in the latent space of powerful pretrained autoencoders. The resulting latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

3 Good Sentences

        1、In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.(The advantages of this method in contrast to previous works)
        2、To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhance their accessibility. (The main problem of DMs that this work sets out to address)
        3、While approaches to jointly learn an encoding/decoding model together with a score-based prior exist, they still require a difficult weighting between reconstruction and generative capabilities and are outperformed by our approach. (The advantages of this method in contrast to previous work)


Introduction

        The diffusion model is a likelihood-based model that can achieve better generation quality than GANs. However, sampling proceeds through many sequential denoising iterations, so both training and inference are expensive. This paper moves the diffusion process into a latent space, which greatly reduces the computational cost while still delivering very good generation quality. In addition, a cross-attention mechanism is introduced for conditional, multimodal training, so that class-conditional, text-to-image, and layout-to-image generation all become possible.
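A minimal sketch of inference in this setup is given below: a DDPM-style denoising loop runs entirely in latent space, and only the final clean latent is decoded to pixels. The names unet, decoder, betas and latent_shape are placeholders for this illustration, not the paper's actual interfaces.

```python
import torch

@torch.no_grad()
def sample_ldm(unet, decoder, betas, latent_shape, device="cpu"):
    """Ancestral (DDPM-style) sampling in latent space, then a single decode to pixel space."""
    betas = betas.to(device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape, device=device)             # start from pure noise in latent space
    for t in reversed(range(len(betas))):
        t_batch = torch.full((latent_shape[0],), t, device=device, dtype=torch.long)
        eps = unet(z, t_batch)                                # time-conditional UNet predicts the noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])          # posterior mean of z_{t-1}
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)  # inject noise except at the last step
    return decoder(z)                                         # decoder maps the clean latent to an image
```

Because every denoising step operates on the small latent rather than on full-resolution pixels, the sequential sampling loop is much cheaper than in a pixel-space DM.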

Method

The overall framework is shown in the figure: first train an autoencoder (consisting of an encoder and a decoder). Diffusion is then performed on the data compressed by the encoder, and the decoder maps the result back to an image.
        The diffusion process itself is unchanged; the only difference is that the objects being diffused and reconstructed are now vectors in the latent space. The diffusion model is implemented as a time-conditional UNet.


        In the latent space, imperceptible high-frequency details are abstracted away, so it is an efficient, lower-dimensional space that is better suited to likelihood-based generative modeling.
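A minimal training step for this idea is sketched below, assuming a frozen pretrained encoder and an epsilon-predicting time-conditional unet; it is the standard denoising objective applied to latents, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(encoder, unet, x, betas, optimizer):
    """One step: encode x, diffuse the latent, and train the UNet to predict the injected noise."""
    with torch.no_grad():
        z0 = encoder(x)                                        # compress the image into latent space

    alpha_bars = torch.cumprod(1.0 - betas, dim=0).to(z0.device)
    t = torch.randint(0, len(betas), (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise     # forward diffusion q(z_t | z_0)

    loss = F.mse_loss(unet(zt, t), noise)                      # epsilon-prediction objective on latents
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the autoencoder frozen reflects the paper's two-stage setup: the perceptual compression model is trained once and then reused for every diffusion model trained on top of it.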

Perceptual Image Compression

        Given an image x in RGB space, the encoder E maps x to a latent representation, and the decoder D reconstructs the image from that latent. The encoder downsamples the image by a factor f. To avoid high-variance latent spaces, two kinds of regularization are used: KL-reg and VQ-reg.
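The toy module below sketches the KL-regularized variant under illustrative assumptions (downsampling factor f = 4, small channel counts); the architecture is not the paper's, but it shows how the encoder predicts a mean and log-variance and how a slight KL penalty keeps the latent space low-variance.

```python
import torch
import torch.nn as nn

class KLRegAutoencoder(nn.Module):
    """Toy convolutional autoencoder with f = 4 downsampling and a KL penalty on the latent."""

    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(                          # two stride-2 convs -> f = 4
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 2 * latent_channels, 3, stride=2, padding=1),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)      # reparameterization trick
        kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).mean()    # slight pull toward N(0, I)
        return self.decoder(z), kl
```

The VQ-reg variant instead uses a vector-quantization layer absorbed into the decoder, trading the continuous KL penalty for a discrete codebook.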

Conditioning Mechanisms

        To turn DMs into more flexible conditional image generators, a cross-attention mechanism is introduced. To preprocess conditioning inputs y from various modalities (such as language prompts), a domain-specific encoder τ_θ is added; it projects y to an intermediate representation τ_θ(y), which is then mapped into the intermediate layers of the UNet through cross-attention layers, where Attention(Q, K, V) = softmax(QK^T / sqrt(d)) · V, with Q = W_Q · φ_i(z_t), K = W_K · τ_θ(y), and V = W_V · τ_θ(y).
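The layer below sketches this conditioning path under the usual conventions: queries come from the flattened UNet feature map φ_i(z_t), while keys and values come from the conditioning tokens τ_θ(y). Dimensions and module names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention that injects conditioning tokens tau_theta(y) into UNet features phi_i(z_t)."""

    def __init__(self, query_dim, context_dim, d=64):
        super().__init__()
        self.scale = d ** -0.5
        self.to_q = nn.Linear(query_dim, d, bias=False)    # Q = W_Q . phi_i(z_t)
        self.to_k = nn.Linear(context_dim, d, bias=False)  # K = W_K . tau_theta(y)
        self.to_v = nn.Linear(context_dim, d, bias=False)  # V = W_V . tau_theta(y)
        self.to_out = nn.Linear(d, query_dim)

    def forward(self, x, context):
        # x:       (batch, H*W, query_dim)   flattened UNet feature map
        # context: (batch, seq, context_dim) tokens from the domain-specific encoder
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # softmax(Q K^T / sqrt(d))
        return self.to_out(attn @ v)
```

Because this mechanism only assumes that y can be encoded into a token sequence, the same layer covers text prompts, class labels, and layouts alike.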
