Music-to-Dance Synthesis; Upscaling Pre-trained Diffusion Models; Diffusion-Based Video Super-Resolution; an LLM Inference Acceleration Framework; 3D-Controlled Synthesis of Moving People

This post introduces DanceMeld, a method that uses hierarchical latent codes for music-to-dance synthesis, and resolution chromatography, a concept applied to diffusion models that explains how different resolutions emerge over time during generation and enables time-dependent control. It also covers text-to-video super-resolution, LLM inference acceleration, and 3D-controlled human synthesis, with methods such as Medusa and ActAnywhere aimed at improving efficiency and realism.

This article was first published on the WeChat official account 机器感知 (Machine Perception).

DanceMeld: Unraveling Dance Phrases with Hierarchical Latent Codes for Music-to-Dance Synthesis

In the realm of 3D digital human applications, music-to-dance synthesis is a challenging task. Dance poses are composed of a series of basic, meaningful body postures, while dance movements reflect dynamic changes such as the rhythm, melody, and style of the dance. Taking inspiration from these concepts, we introduce an innovative dance generation pipeline called DanceMeld, which comprises two stages: a dance decoupling stage and a dance generation stage. Our approach has undergone qualitative and quantitative experiments on the AIST++ dataset, demonstrating its superiority over other methods.
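
The abstract only names the two stages, so here is a minimal, hypothetical sketch of what a decouple-then-generate pipeline could look like. All module names, dimensions, and the GRU-based decoupling mechanism are our own assumptions for illustration, not the authors' architecture.

```python
# Hypothetical sketch of a DanceMeld-style two-stage pipeline.
import torch
import torch.nn as nn

class DanceDecoupler(nn.Module):
    """Stage 1: encode motion into hierarchical latents (pose + movement)."""
    def __init__(self, motion_dim=72, pose_dim=64, move_dim=64):
        super().__init__()
        self.pose_enc = nn.GRU(motion_dim, pose_dim, batch_first=True)  # low-level body postures
        self.move_enc = nn.GRU(motion_dim, move_dim, batch_first=True)  # high-level dynamics (rhythm/style)
        self.decoder = nn.GRU(pose_dim + move_dim, motion_dim, batch_first=True)

    def forward(self, motion):                    # motion: (B, T, motion_dim)
        pose_code, _ = self.pose_enc(motion)      # per-frame pose latents
        move_code, _ = self.move_enc(motion)      # movement latents
        recon, _ = self.decoder(torch.cat([pose_code, move_code], dim=-1))
        return recon, pose_code, move_code

class MusicToLatent(nn.Module):
    """Stage 2: map music features to the decoupled latent codes."""
    def __init__(self, music_dim=35, pose_dim=64, move_dim=64):
        super().__init__()
        self.net = nn.GRU(music_dim, pose_dim + move_dim, batch_first=True)

    def forward(self, music):                     # music: (B, T, music_dim)
        latents, _ = self.net(music)
        return latents.split([64, 64], dim=-1)    # predicted pose and movement codes

# Usage: train stage 1 as an autoencoder on motion data, freeze it, then
# train stage 2 to regress its latents from music; decode latents to dance.
decoupler, generator = DanceDecoupler(), MusicToLatent()
music = torch.randn(2, 120, 35)                   # e.g. 120 frames of audio features
pose_code, move_code = generator(music)
dance, _ = decoupler.decoder(torch.cat([pose_code, move_code], dim=-1))
print(dance.shape)                                # (2, 120, 72)
```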

Resolution Chromatography of Diffusion Models

In this paper, we introduce "resolution chromatography", which indicates the signal generation rate at each resolution. The concept helps mathematically explain the coarse-to-fine behavior of the generation process, understand the role of the noise schedule, and design time-dependent modulation. Using resolution chromatography, we determine which resolution level becomes dominant at a specific time step, and we experimentally verify the theory with text-to-image diffusion models. We also propose direct applications of the concept: upscaling pre-trained models to higher resolutions and time-dependent prompt composing.
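
To make the coarse-to-fine claim concrete, the sketch below empirically measures per-resolution signal-to-noise ratios of noisy diffusion samples under a cosine noise schedule. The radial band split and the SNR proxy are our own illustrative assumptions, not the paper's exact definition of resolution chromatography.

```python
# Illustrative per-band SNR of x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps.
import numpy as np

def alpha_bar(t, T=1000):
    # cosine noise schedule (Nichol & Dhariwal style)
    s = 0.008
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def band_power(img, n_bands=4):
    """Average spectral power of an image in n_bands radial frequency bands."""
    F = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    r = np.sqrt(xx ** 2 + yy ** 2) / (min(h, w) / 2)  # radius in [0, ~1.4]
    return np.array([F[(r >= b / n_bands) & (r < (b + 1) / n_bands)].mean()
                     for b in range(n_bands)])

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
img = np.cumsum(np.cumsum(img, 0), 1)             # integrate twice: 1/f^2-like spectrum
img /= img.std()                                  # stand-in for a natural image

sig = band_power(img)
noise = band_power(rng.standard_normal((64, 64)))
for t in (100, 400, 700, 950):
    ab = alpha_bar(t)
    snr = ab * sig / ((1 - ab) * noise)           # per-band SNR of the noisy sample
    print(f"t={t:4d}  band SNR low->high: {np.round(snr, 3)}")
# Low-frequency bands reach SNR >= 1 at higher noise levels, i.e. earlier in
# the reverse process, matching the coarse-to-fine behavior the paper formalizes.
```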

Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of a pixel-level image diffusion model to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset demonstrates that our approach is able to perform text-to-video SR generation with good visual quality and temporal consistency.
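
Weight inflation is a standard trick (familiar from I3D-style 2D-to-3D initialization), so the following is a minimal sketch of how 2D SR conv weights could be inflated into a 3D conv, plus a zero-initialized temporal-adapter stub. Both are assumptions about the paper's design, not its actual code: placing the 2D kernel at the center temporal slice makes the 3D conv initially equivalent to applying the 2D conv frame by frame.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, t_kernel: int = 3) -> nn.Conv3d:
    """Initialize a 3D conv from a pretrained 2D conv (center-slice inflation)."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(t_kernel, *conv2d.kernel_size),
                       padding=(t_kernel // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, t_kernel // 2] = conv2d.weight  # center temporal slice
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

class TemporalAdapter(nn.Module):
    """Lightweight per-channel temporal mixing, zero-initialized so training
    starts from the inflated (frame-independent) behavior."""
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.mix = nn.Conv3d(channels, channels, (t_kernel, 1, 1),
                             padding=(t_kernel // 2, 0, 0), groups=channels)
        nn.init.zeros_(self.mix.weight)
        nn.init.zeros_(self.mix.bias)

    def forward(self, x):           # x: (B, C, T, H, W)
        return x + self.mix(x)      # residual: exact identity at init

conv2d = nn.Conv2d(64, 64, 3, padding=1)      # pretend this is a pretrained SR layer
conv3d = inflate_conv2d_to_3d(conv2d)
video = torch.randn(1, 64, 8, 32, 32)         # (B, C, T, H, W)
out = TemporalAdapter(64)(conv3d(video))
print(out.shape)                              # torch.Size([1, 64, 8, 32, 32])
```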

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel.
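
A simplified, hypothetical sketch of Medusa-style decoding follows: K extra heads predict tokens t+1 through t+K from the last hidden state in one forward pass, and the base model then verifies the drafted tokens in a single batched pass, keeping the longest accepted prefix. Real Medusa uses tree-structured candidates with a custom attention mask; this linear-draft version with a toy base model is only for intuition.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Stand-in base model: returns hidden states and next-token logits."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, ids):                        # ids: (B, T)
        h, _ = self.rnn(self.emb(ids))
        return h, self.lm_head(h)

class MedusaHeads(nn.Module):
    """K small heads; head k drafts the token k+1 steps ahead."""
    def __init__(self, dim=32, vocab=100, k=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, vocab))
            for _ in range(k))

    def forward(self, h_last):                     # h_last: (B, dim)
        return [head(h_last) for head in self.heads]

@torch.no_grad()
def medusa_step(base, heads, ids):
    h, logits = base(ids)
    nxt = logits[:, -1].argmax(-1, keepdim=True)   # base model's next token
    draft = torch.cat([l.argmax(-1, keepdim=True) for l in heads(h[:, -1])], dim=1)
    cand = torch.cat([ids, nxt, draft], dim=1)     # context + next + K drafted tokens
    _, ver = base(cand)                            # one verification pass
    accepted = [nxt]
    for k in range(draft.size(1)):                 # accept drafts while base agrees
        pos = ids.size(1) + k                      # logits predicting draft[:, k]
        if ver[:, pos].argmax(-1, keepdim=True).equal(draft[:, k:k+1]):
            accepted.append(draft[:, k:k+1])
        else:
            break
    return torch.cat([ids] + accepted, dim=1)      # >=1 token per step, often more

base, heads = TinyLM(), MedusaHeads()
ids = torch.randint(0, 100, (1, 5))
print(medusa_step(base, heads, ids).shape)         # at least (1, 6)
```

Because verification happens in one forward pass over all candidates, each step costs roughly one base-model call but can emit several tokens, which is where the speedup comes from.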

ActAnywhere: Subject-Aware Video Background Generation

Generating video backgrounds that tailor to foreground subject motion is an important problem for the movie industry and the visual effects community. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image that describes the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions while adhering to the condition frame.
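
As a hedged sketch of how such conditioning could be assembled, the snippet below stacks the masked foreground frames and their masks channel-wise with the noisy video latents, with the condition frame injected separately. The channel layout, shapes, and injection mechanism are our own assumptions; the paper builds on a large pretrained video diffusion model.

```python
import torch

def build_denoiser_input(noisy_video, fg_frames, fg_masks):
    """noisy_video: (B, 4, T, h, w) latents; fg_frames: (B, 3, T, h, w)
    foreground RGB with background zeroed; fg_masks: (B, 1, T, h, w)."""
    return torch.cat([noisy_video, fg_frames * fg_masks, fg_masks], dim=1)

B, T, h, w = 1, 16, 32, 32
noisy = torch.randn(B, 4, T, h, w)
frames = torch.rand(B, 3, T, h, w)
masks = (torch.rand(B, 1, T, h, w) > 0.5).float()
cond_frame = torch.rand(B, 3, 256, 256)        # image describing the desired scene

x = build_denoiser_input(noisy, frames, masks)
print(x.shape)                                 # (1, 8, 16, 32, 32): 4 + 3 + 1 channels
# A video diffusion UNet would consume x, with cond_frame fed through a frozen
# image encoder (e.g. CLIP) into cross-attention layers.
```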

Synthesizing Moving People with 3D Control

In this paper, we present a diffusion-model-based framework for animating people from a single image, given a target 3D motion sequence. Our approach has two core components: a) learning priors about the invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. This disentangled approach allows our method to generate a sequence of images that is faithful to the target motion in 3D pose and to the input image in terms of visual similarity. In addition, the 3D control allows the person to be rendered along various synthetic camera trajectories.
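
A schematic, hypothetical sketch of this two-component design is given below: (a) a texture-completion model that imputes unseen body and clothing texture from a single image, and (b) a pose-conditioned renderer that produces one frame per target 3D pose. Both modules here are plain-conv stubs for illustration; the actual paper uses diffusion models for both stages, and all names and shapes are our own assumptions.

```python
import torch
import torch.nn as nn

class TextureCompleter(nn.Module):
    """Stage (a): predict a full UV texture map from a partial, visible one."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, partial_uv, visibility):   # (B, 3, H, W), (B, 1, H, W)
        return self.net(torch.cat([partial_uv, visibility], dim=1))

class PoseConditionedRenderer(nn.Module):
    """Stage (b): render a frame from the completed texture and a rasterized
    3D pose (e.g. a SMPL normal or segmentation render)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, texture, pose_render):     # both (B, 3, H, W)
        return self.net(torch.cat([texture, pose_render], dim=1))

completer, renderer = TextureCompleter(), PoseConditionedRenderer()
uv, vis = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
full_uv = completer(uv, vis)                     # fill in invisible texture once
frames = [renderer(full_uv, torch.rand(1, 3, 64, 64)) for _ in range(4)]
print(frames[0].shape)                           # one image per target 3D pose
```

The disentanglement is the point: appearance is recovered once from the input image, while motion (and the camera) is controlled entirely through the per-frame 3D pose renders.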
