Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Abstract
Generating text-editable and pose-controllable character videos is in urgent demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset with paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that utilizes easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation; we learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while preserving the editing and concept-composition ability of the pre-trained T2I model. The code and models will be made publicly available.
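To make the two-stage design above concrete, the following is a minimal PyTorch-style sketch, assuming a latent-diffusion U-Net backbone. The module names (PoseEncoder, CrossFrameSelfAttention, TemporalSelfAttention) and layer sizes are illustrative assumptions, not the authors' released code; only the stated ideas (a zero-initialized convolutional pose encoder in stage one, and learnable temporal and cross-frame self-attention added in stage two) come from the abstract.

```python
import torch
import torch.nn as nn


class PoseEncoder(nn.Module):
    """Stage 1 (sketch): encode keypoint maps with a zero-initialized
    convolutional branch, so the pre-trained T2I U-Net is unchanged at
    the start of training and pose control is learned gradually."""
    def __init__(self, in_ch=3, out_ch=320):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )
        # Zero-init the last conv so the pose branch initially adds nothing.
        nn.init.zeros_(self.conv[-1].weight)
        nn.init.zeros_(self.conv[-1].bias)

    def forward(self, pose_map):
        # Output is added as a residual to the U-Net feature maps.
        return self.conv(pose_map)


class CrossFrameSelfAttention(nn.Module):
    """Stage 2 (sketch): reformed self-attention in which every frame
    attends to a shared anchor frame (frame 0 here) to keep the
    character's appearance consistent across frames."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        anchor = x[:, :1].expand(-1, f, -1, -1)   # keys/values from frame 0
        q = x.reshape(b * f, n, d)
        kv = anchor.reshape(b * f, n, d)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, f, n, d)


class TemporalSelfAttention(nn.Module):
    """Stage 2 (sketch): learnable attention along the frame axis,
    trained on pose-free videos while the rest of the network is frozen."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        t = x.permute(0, 2, 1, 3).reshape(b * n, f, d)  # attend over frames
        out, _ = self.attn(t, t, t)
        return out.reshape(b, n, f, d).permute(0, 2, 1, 3)
```

In this reading, only the pose encoder is trained in the first stage, and only the temporal and cross-frame attention blocks are trained in the second stage, which is why the editing and concept-composition ability of the frozen pre-trained T2I model is preserved.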