StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
2024.3.14
Abstract
The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models either by reducing inference time or by increasing user interactivity through new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating both branches of work is nontrivial, limiting the potential of diffusion models. To resolve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve 10× faster panorama generation than existing solutions and a generation speed of 1.57 FPS for region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation, named semantic palette, in which high-quality images are generated in real time from multiple hand-drawn regions, each encoding a prescribed semantic meaning (e.g., eagle, girl). Our code and demo application are available at https://github.com/ironjr/StreamMultiDiffusion.
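The region-based control described above builds on the MultiDiffusion idea of fusing several prompt-conditioned denoising predictions with spatial masks. The sketch below is a minimal, hedged illustration of that aggregation step only; the array shapes, the `denoise_fns` stand-ins for prompt-conditioned U-Net calls, and the function name `masked_aggregate` are illustrative assumptions, not the project's actual API.

```python
import numpy as np

def masked_aggregate(latent, masks, denoise_fns):
    """One region-based denoising step, MultiDiffusion-style.

    latent:      (C, H, W) array, the current noisy latent.
    masks:       list of (H, W) arrays in [0, 1], one per region prompt.
    denoise_fns: one callable per region; each stands in for a U-Net
                 call conditioned on that region's text prompt.
    Returns the mask-weighted average of the per-region predictions.
    """
    num = np.zeros_like(latent)
    den = np.zeros((1,) + latent.shape[1:])
    for mask, fn in zip(masks, denoise_fns):
        pred = fn(latent)          # region-conditioned prediction
        num += mask[None] * pred   # accumulate, weighted by the mask
        den += mask[None]
    return num / np.maximum(den, 1e-8)  # average where regions overlap


# Toy usage: two prompts painted on the left and right halves.
latent = np.zeros((2, 4, 4))
left = np.zeros((4, 4)); left[:, :2] = 1.0
right = 1.0 - left
fused = masked_aggregate(
    latent, [left, right],
    [lambda z: z + 1.0, lambda z: z + 2.0],
)
```

In the toy run, the fused latent equals the "left prompt" prediction on the left half and the "right prompt" prediction on the right half; where masks overlap, predictions are averaged.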
Results
Streaming Generation Process
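The streaming behavior comes from the stream batch idea: latents at *different* denoising timesteps are batched and advanced together on every tick, so once the pipeline is full, one finished image leaves per tick instead of per full denoising run. The sketch below is a schematic illustration under assumed names (`T`, `fake_denoise_step`, `stream_batch`); it models the scheduling only, not the actual batched U-Net of the paper.

```python
from collections import deque

import numpy as np

T = 4  # assumed number of denoising steps (few-step inference)

def fake_denoise_step(latent, t):
    # Stand-in for one U-Net denoising step at timestep index t.
    return latent * 0.5 + t

def stream_batch(new_latents):
    """Each tick: admit one new request, advance every queued latent
    by one denoising step (conceptually a single batched forward pass),
    and emit the oldest latent once it has completed all T steps."""
    pipeline = deque()  # entries are [latent, steps_done]
    outputs = []
    for fresh in new_latents:
        pipeline.appendleft([fresh, 0])
        for item in pipeline:                 # one batched pass
            item[0] = fake_denoise_step(item[0], item[1])
            item[1] += 1
        if pipeline[-1][1] == T:              # oldest request finished
            outputs.append(pipeline.pop()[0])
    return outputs


# After a T-tick warm-up, every subsequent tick yields one image:
results = stream_batch([np.zeros(1) for _ in range(6)])
```

With 6 requests and T = 4 steps, the first image appears on tick 4 and the pipeline then emits one per tick, giving 3 outputs; this throughput, rather than per-image latency, is what enables the real-time FPS figures reported above.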
Region-Based Multi-Text-to-Image Generation
Larger Region-Based Multi-Text-to-Image Generation