A Quick Guide to Deploying the LatentSync Digital Human Model with FastAPI

Knoka705

Published 2025-05-01 14:50:00



Article link: https://blog.csdn.net/qq_61897309/article/details/147652376


Server Preparation

First, prepare a server. This walkthrough uses an RTX 4090 server.

Connect to the server you have created. MobaXterm can be used for the SSH connection.

ssh funhpc@<IP address>

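Environment Setup

Next, clone the official repository, then create a virtual environment and install its dependencies. If the clone is too slow, use the proxied URL in the commented-out line below instead.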

git clone https://github.com/bytedance/LatentSync.git
# git clone https://gitproxy.click/https://github.com/bytedance/LatentSync.git

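Then, from inside the cloned LatentSync directory, run the one-click environment setup script that the official repository provides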

source setup_env.sh

Note that the downloads in this script may be slow. In my copy I switched the model download source from HuggingFace to ModelScope:

#!/bin/bash

# Create a new conda environment
conda create -y -n latentsync python=3.10.13
conda activate latentsync

# Install ffmpeg
conda install -y -c conda-forge ffmpeg 

# Python dependencies
pip install -r requirements.txt -i  https://repo.huaweicloud.com/repository/pypi/simple
pip install modelscope -i  https://repo.huaweicloud.com/repository/pypi/simple

# OpenCV dependencies
apt -y install libgl1

# Download the checkpoints required for inference from ModelScope
modelscope download --model ByteDance/LatentSync-1.5 whisper/tiny.pt --local_dir checkpoints
modelscope download --model ByteDance/LatentSync-1.5 latentsync_unet.pt --local_dir checkpoints
modelscope download --model zhuzhukeji/sd-vae-ft-mse --local_dir stabilityai/sd-vae-ft-mse

If the downloads succeed, the checkpoints folder should look like this:

./checkpoints/
|-- latentsync_unet.pt
|-- whisper
|   `-- tiny.pt
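
Before moving on, it can help to confirm that layout programmatically. The following is a minimal sketch, assuming only the two files shown in the tree above; `EXPECTED_CHECKPOINTS` and `missing_checkpoints` are hypothetical names introduced for this illustration and are not part of LatentSync.

```python
from pathlib import Path

# Files listed in the tree above, relative to the checkpoints directory.
# (Hypothetical helper; not part of the LatentSync repository.)
EXPECTED_CHECKPOINTS = ("latentsync_unet.pt", "whisper/tiny.pt")


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint files that are absent under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_CHECKPOINTS if not (root_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root. An empty result means both files from the tree were found.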

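The official inference script is ./inference.sh, which you can run directly to try the model.

FastAPI Deployment

To deploy with FastAPI, however, we need to look inside that script. It turns out that it simply runs a Python script: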

python -m scripts.inference \
    --unet_config_path "configs/unet/stage2.yaml" \
    --inference_ckpt_path "checkpoints/latentsync_unet.pt" \
    --inference_steps 20 \
    --guidance_scale 2.0 \
    --video_path "assets/demo1_video.mp4" \
    --audio_path "assets/demo1_audio.wav" \
    --video_out_path "video_out.mp4"
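
One simple way a service could reuse this command is to assemble the same arguments in Python and launch them as a subprocess. The sketch below illustrates that idea and is not necessarily the approach this article goes on to take. `build_inference_command` and `run_inference` are hypothetical helper names, and the config and checkpoint paths are copied from the command above.

```python
import subprocess
import sys


def build_inference_command(video_path, audio_path, video_out_path,
                            inference_steps=20, guidance_scale=2.0):
    """Build the argument list equivalent to the shell command shown above."""
    return [
        sys.executable, "-m", "scripts.inference",
        "--unet_config_path", "configs/unet/stage2.yaml",
        "--inference_ckpt_path", "checkpoints/latentsync_unet.pt",
        "--inference_steps", str(inference_steps),
        "--guidance_scale", str(guidance_scale),
        "--video_path", video_path,
        "--audio_path", audio_path,
        "--video_out_path", video_out_path,
    ]


def run_inference(video_path, audio_path, video_out_path, repo_dir="."):
    """Run the script from the repository root; raises CalledProcessError on failure."""
    cmd = build_inference_command(video_path, audio_path, video_out_path)
    subprocess.run(cmd, cwd=repo_dir, check=True)
    return video_out_path
```

Passing the arguments as a list avoids shell quoting problems with file names. The trade-off is that every call starts a new process and reloads the models, which is one reason to look at the Python code behind the script instead.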

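So next, look at the inference script in the scripts folder. This is the complete official inference code, and it proceeds as follows:

Input check: verify that the input video and audio file paths exist.
Device and precision setup: choose float16 or float32 depending on what the GPU supports, to keep computation efficient.
Model loading:
- load the DDIM Scheduler, which handles denoising scheduling during inference;
- load the Whisper model selected by the configuration to extract audio features;
- load the VAE model, which encodes and decodes between image space and latent space;
- load the pretrained 3D UNet conditional model, which serves as the denoising network.
Pipeline initialization: assemble the components above into a LipsyncPipeline, which is dedicated to audio-driven video generation.
Inference: run inference with the supplied video path, audio path, and other parameters, generate a video synchronized with the audio, and save the result.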
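
The precision rule in the second step can be isolated as a small pure function. This is only an illustrative sketch: `choose_precision` is a hypothetical name, and it returns strings rather than torch dtypes so that it runs without a GPU. It mirrors the condition used in the script, which compares the major CUDA compute capability against 7.

```python
def choose_precision(cuda_available, compute_capability_major):
    """Return "float16" only when CUDA is available and the major capability exceeds 7."""
    if cuda_available and compute_capability_major > 7:
        return "float16"
    return "float32"


if __name__ == "__main__":
    # An RTX 4090 reports compute capability 8.9, so its major version is 8.
    print(choose_precision(True, 8))
    # Without CUDA the rule falls back to float32 regardless of capability.
    print(choose_precision(False, 8))
```

Under this rule, a GPU whose major compute capability is 7 (for example 7.5) still gets float32, because the comparison is strictly greater than 7.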

import argparse
import os
from omegaconf import OmegaConf
import torch
from diffusers import AutoencoderKL, DDIMScheduler
from latentsync.models.unet import UNet3DConditionModel
from latentsync.pipelines.lipsync_pipeline import LipsyncPipeline
from accelerate.utils import set_seed
from latentsync.whisper.audio2feature import Audio2Feature


def main(config, args):
    if not os.path.exists(args.video_path):
        raise RuntimeError(f"Video path '{args.video_path}' not found")
    if not os.path.exists(args.audio_path):
        raise RuntimeError(f"Audio path '{args.audio_path}' not found")

    # Check if the GPU supports float16
    is_fp16_supported = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] > 7
    dtype = torch.float16 if is_fp16_supported else torch.float32

    print(f"Input video path: {args.video_path}")
    print(f"Input audio path: {args.audio_path}")
    print(f"Loaded checkpoint path: {args.inference_ckpt_path}")

    scheduler = DDIMScheduler.from_pretrained("configs")

    if config.model.cross_attention_dim == 768:
        whisper_model_path = "checkpoints/whisper/small.pt"
    elif config.model.cross_attention_dim =&