High-Resolution Image Synthesis; Controllable Motion Synthesis; Virtual Try-On; Running Binary Transformers Efficiently on FPGA

Latest recommended article published on 2024-05-02 15:26:08

JiauZhang

Tags: transformer, deep learning, artificial intelligence

Article link: https://blog.csdn.net/q_z_r_s/article/details/135789741

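This article covers the Hourglass Diffusion Transformer, which synthesizes high-resolution images (e.g. 1024x1024 pixels) directly in pixel space and trains without the usual high-resolution training techniques. It also covers a new training-free approach to text-to-image generation (Recaption, Plan and Generate), weakly-supervised motion generation, and BETA, an energy-efficient binary Transformer accelerator for edge devices that is evaluated on an FPGA.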

Summary generated automatically by CSDN.

This article was first published on the WeChat official account 机器感知 (Machine Perception).

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.
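The abstract claims linear scaling with pixel count but does not say how it is achieved. The sketch below only illustrates why that matters: it compares the number of query-key pairs for global attention, which grows quadratically with token count, against attention over a fixed-size local window, which grows linearly. The local-window mechanism, the `patch` size, and the `window` size are all assumptions made for illustration and are not taken from the abstract.

```python
def attention_pairs(side, patch=4, window=7):
    """Count query-key pairs for a side x side image.

    patch and window are illustrative values, not taken from the paper.
    """
    tokens_per_side = side // patch
    n_tokens = tokens_per_side ** 2
    global_pairs = n_tokens ** 2           # every token attends to every token
    local_pairs = n_tokens * window ** 2   # every token attends to a fixed window
    return n_tokens, global_pairs, local_pairs


for side in (256, 512, 1024):
    n, g, l = attention_pairs(side)
    print(f"{side}x{side}: tokens={n}, global={g}, local={l}")
```

Doubling the image side multiplies the pixel count by 4. Under these assumptions the global count grows by 16x while the local count grows by 4x, which is the kind of behavior "linear scaling with pixel count" describes.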

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment.
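The planning step can be pictured as turning one complex prompt into a list of (sub-prompt, subregion) pairs that are then generated separately. The sketch below is a rough illustration only: `SubRegion`, `plan_regions`, and `generate` are hypothetical names, the planner is a hard-coded stub that assigns one vertical strip per sub-prompt instead of calling a multimodal LLM, and the per-region generator is a placeholder rather than a diffusion model.

```python
from dataclasses import dataclass


@dataclass
class SubRegion:
    prompt: str  # sub-prompt describing what belongs in this region
    box: tuple   # (left, top, right, bottom) in pixels


def plan_regions(sub_prompts, width=1024, height=1024):
    """Stand-in for the MLLM planner: one vertical strip per sub-prompt."""
    n = len(sub_prompts)
    edges = [round(i * width / n) for i in range(n + 1)]
    return [
        SubRegion(p, (edges[i], 0, edges[i + 1], height))
        for i, p in enumerate(sub_prompts)
    ]


def generate(plan, generate_region):
    """Run a per-region generator and collect the results keyed by box."""
    return {r.box: generate_region(r.prompt, r.box) for r in plan}


plan = plan_regions(["a red fox on the left", "a snowy pine tree on the right"])
result = generate(plan, lambda prompt, box: f"<image for '{prompt}'>")
```

In a real system the planner output would come from the MLLM's reasoning and the regional results would be combined inside the diffusion process; the abstract does not specify either step, so both are left as stubs here.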

MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation

This paper proposes MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Extensive experiments on several benchmarks demonstrate that MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks.

Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles

We propose a novel diffusion-based product-level virtual try-on pipeline, i.e., PLTON, which can preserve the fine details of logos and embroideries while producing realistic clothes shading and wrinkles. To enhance retention, a Two-stage Blended Denoising method is proposed to guide the diffusion process for correct spatial layout and color. PLTON is finetuned only with our collected small-size try-on dataset.

BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge

Existing binary Transformers are promising in edge deployment due to their compact model size, low computational complexity, and considerable inference accuracy. However, deploying binary Transformers faces challenges on prior processors due to inefficient execution of quantized matrix multiplication (QMM) and the energy consumption overhead caused by multi-precision activations. To tackle the challenges above, we first develop a computation flow abstraction method for binary Transformers to improve QMM execution efficiency by optimizing the computation order. Furthermore, a binarized energy-efficient Transformer accelerator, namely BETA, is proposed to boost the efficient deployment at the edge. Experimental results evaluated on ZCU102 FPGA show BETA achieves an average energy efficiency of 174 GOPS/W, which is 1.76x to 21.92x higher than prior FPGA-based accelerators, showing BETA's good potential for edge Transformer acceleration.
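To see why binarized matrix multiplication is cheap in hardware, here is a minimal sketch of the widely used XOR/popcount trick for vectors whose entries are all +1 or -1. It is not BETA's datapath or its computation flow abstraction: it covers only the fully binary case (the abstract also mentions multi-precision activations), and the helper names `pack_bits`, `binary_dot`, and `binary_matmul` are hypothetical.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int: bit i is 1 when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two +1/-1 vectors of length n, from their packed bits."""
    mismatches = bin(a_word ^ b_word).count("1")  # positions where the signs differ
    return n - 2 * mismatches  # (n - mismatches) matches, each worth +1


def binary_matmul(a_rows, b_cols):
    """Multiply A (given as rows of +1/-1) by B (given as columns of +1/-1)."""
    n = len(a_rows[0])
    a_packed = [pack_bits(r) for r in a_rows]
    b_packed = [pack_bits(c) for c in b_cols]
    return [[binary_dot(a, b, n) for b in b_packed] for a in a_packed]


A = [[1, -1, 1, 1], [-1, -1, 1, -1]]
B_cols = [[1, 1, -1, 1], [-1, 1, 1, 1]]
print(binary_matmul(A, B_cols))
```

Each multiply-accumulate collapses into one bitwise operation plus a bit count, which maps onto simple logic in an FPGA instead of full multipliers.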