Hallo: Making a Single Image Speak

烧技湾

Last modified: 2024-07-26 11:22:19


Category: AI & Computer Vision. Tags: computer vision, artificial intelligence, human-computer interaction

First published: 2024-07-26 09:58:47

Article link: https://blog.csdn.net/wqthaha/article/details/140696292


Input photo:
[image]

Generated video:

[video: output]

[image: binmayong]

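Preface

Hallo is a recently open-sourced generative model: given one portrait image and one clip of speech audio, it generates a matching talking-head video. The main technical difficulties are: 1. keeping the audio in sync with the parts of the face involved in speech, such as the lips and facial expressions; 2. producing video of high visual quality.

The open-source project can be combined with other tools, for example a Windows build and a web UI, which makes it easy to use.
Its task diagram is shown below:
[image]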

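1. What is Hallo?

[image]
The model is a typical multimodal model. The multimodality shows in two ways: 1. it takes feature information from the input image; 2. it embeds information from a second modality, speech, at a unified scale. For a detailed reading of the paper, see my earlier write-up on it.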

2. Usage

2.1 Hardware and software requirements

System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 12.1
Tested GPUs: A100

2.2 Environment setup

The commands are as follows (example):

  conda create -n hallo python=3.10
  conda activate hallo

  git clone https://github.com/fudan-generative-vision/hallo.git
  cd hallo  # the following commands run from the repository root

  pip install -r requirements.txt
  pip install .

  apt-get install ffmpeg

2.3 Usage

Steps:

Download all required pretrained models.
Prepare source image and driving audio pairs.
Run inference.

2.3.1 Downloading the pretrained models

The project involves many modules, so each one is introduced in turn below. The file layout is as follows:

./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
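
Before running inference, it can help to confirm that the downloads landed in the right places. The following is a minimal sketch, not part of the Hallo repository: the helper name find_missing and the list EXPECTED_FILES are made up for illustration, and the paths are copied from the tree above plus hallo/net.pth, which the inference log in this article loads from ./pretrained_models/hallo/.

```python
from pathlib import Path

# Relative paths copied from the directory tree above, plus hallo/net.pth
# (seen in the inference log). This list is an illustration, not an official manifest.
EXPECTED_FILES = [
    "audio_separator/Kim_Vocal_2.onnx",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "face_analysis/models/1k3d68.onnx",
    "face_analysis/models/2d106det.onnx",
    "face_analysis/models/genderage.onnx",
    "face_analysis/models/glintr100.onnx",
    "face_analysis/models/scrfd_10g_bnkps.onnx",
    "hallo/net.pth",
    "motion_module/mm_sd_v15_v2.ckpt",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
    "stable-diffusion-v1-5/unet/config.json",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
    "wav2vec/wav2vec2-base-960h/config.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]


def find_missing(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = find_missing("./pretrained_models")
    if missing:
        print("Missing or empty files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed files are present.")
```

The check only looks at presence and non-zero size; it would not catch a file that downloaded as the wrong content.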

hallo checkpoints

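The Hallo checkpoints cover the denoising UNet, the face locator, and the image and audio projection modules (in the README's words: "Our checkpoints consist of denoising UNet, face locator, image & audio proj").

Download link: https://huggingface.co/fudan-generative-ai/hallo/tree/main/hallo
The net.pth file there is 4.85 GB.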

audio_separator:

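Separates the vocal track from the driving audio (the inference log in this article shows Kim_Vocal_2.onnx saving a "Vocals" stem), so that lip sync is driven by clean speech.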

insightface: 2D and 3D Face Analysis

Performs 2D and 3D face analysis, mainly face detection and landmark (keypoint) detection.
The models are placed into pretrained_models/face_analysis/models/. (Thanks to deepinsight)

Download link:
https://drive.usercontent.google.com/download?id=18wEUfMNohBJ4K3Ly5wpTejPfDzp-8fI8&export=download&authuser=0

face landmarker:

Face detection and face mesh model from mediapipe, placed into pretrained_models/face_analysis/models.

Download link:
https://github.com/nlml/deconstruct-mediapipe/blob/main/face_landmarker_v2_with_blendshapes.task

motion module:

Turns the text-to-image model into one that generates animation: the motion module from AnimateDiff. (Thanks to guoyww)
Download link:
https://drive.google.com/drive/folders/1EqLC65eR1-W-sGD0Im7fkED6c8GkiNFI

sd-vae-ft-mse:

Weights are intended to be used with the diffusers library. (Thanks to stabilityai)
Download link:
https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main

StableDiffusion V1.5:

Initialized and fine-tuned from Stable-Diffusion-v1-2. (Thanks to runwayml)

wav2vec:

Converts WAV audio into vectors; the model is from Facebook.
Download link: https://huggingface.co/facebook/wav2vec2-base-960h/tree/main

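Inference

Prepare Inference Data

The inputs must meet a few requirements.

For the source image:
1. The image must be square.
2. The face must take up 50-70% of the image.
3. The face must be facing forward, rotated by no more than 30 degrees.

For the driving audio:
1. It must be in WAV format.
2. It must be in English, because the model was trained only on English data.
3. The voice must be clear and not drowned out by background music.

So, as released, the model can only speak English.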
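
The WAV requirement can be checked before a run. Below is a minimal sketch using only Python's standard wave module; the function names describe_wav and is_pcm_wav are made up for illustration and are not part of Hallo. Note that the wave module reads PCM WAV files only, so a WAV in another encoding would be reported as unreadable here even though other tools might accept it.

```python
import wave


def describe_wav(path):
    """Read basic properties of a PCM WAV file using only the standard library."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate if rate else 0.0,
        }


def is_pcm_wav(path):
    """Return True if the file opens as a PCM WAV, False otherwise."""
    try:
        describe_wav(path)
        return True
    except (wave.Error, EOFError, OSError):
        return False
```

A file that was merely renamed to .wav (for example an MP3) fails this check, which is a common cause of confusing errors later in the pipeline.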

Run the inference script

python scripts/inference.py --source_image examples/reference_images/1.jpg --driving_audio examples/driving_audios/1.wav
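
To process several image and audio pairs, the command above can be wrapped in a small driver. This is a sketch under stated assumptions: it uses only the two flags shown above, it must be started from the repository root, and it assumes the result is written to .cache/output.mp4, as in the log in this article. The helper names and the results directory are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Assumption: taken from the log in this article, where the result is written
# to .cache/output.mp4; the location may differ with other settings.
ASSUMED_OUTPUT = Path(".cache/output.mp4")


def build_inference_command(source_image, driving_audio, python=sys.executable):
    """Build the argument list for one run, using only the two flags shown above."""
    return [
        python,
        "scripts/inference.py",
        "--source_image", str(source_image),
        "--driving_audio", str(driving_audio),
    ]


def pair_by_stem(image_dir, audio_dir):
    """Pair images and WAV files that share a file stem, e.g. 1.jpg with 1.wav."""
    audios = {p.stem: p for p in Path(audio_dir).glob("*.wav")}
    pairs = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() in {".jpg", ".jpeg", ".png"} and image.stem in audios:
            pairs.append((image, audios[image.stem]))
    return pairs


def run_all(image_dir, audio_dir, results_dir="results"):
    """Run inference once per pair and keep a copy of each result."""
    results = Path(results_dir)
    results.mkdir(parents=True, exist_ok=True)
    for image, audio in pair_by_stem(image_dir, audio_dir):
        # check=True stops the loop at the first failing run.
        subprocess.run(build_inference_command(image, audio), check=True)
        if ASSUMED_OUTPUT.is_file():
            shutil.copy(ASSUMED_OUTPUT, results / f"{image.stem}.mp4")
```

For the bundled examples this would be called as run_all("examples/reference_images", "examples/driving_audios").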

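Fixing bugs:

bug 1: no module named "hallo"
This happens because the hallo folder sits at a different directory level from the script and cannot be found. Moving inference.py up one directory fixes it.

bug 2: face_landmarker_v2_with_blendshapes.task cannot be found
The correct file has to be located; following the link in the original instructions may download the wrong file.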
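
For bug 1, an alternative to moving the file is to make the repository root importable. When a script is started as python scripts/inference.py, Python puts the scripts/ directory, not the repository root, at the front of sys.path, so a top-level hallo package next to scripts/ is not found unless it has been installed. The sketch below shows the idea; the helper name is made up, and installing the package with pip install . as in section 2.2 would normally have the same effect.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file):
    """Put the parent of the script's directory (the repo root) at the front of sys.path."""
    repo_root = str(Path(script_file).resolve().parent.parent)
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root


# At the top of scripts/inference.py this would be called as
#     add_repo_root_to_path(__file__)
# before the first line that imports from hallo.
```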

Output of an inference run:

# Step 1: process the background image first, then the face
Processed and saved: ./.cache/1_sep_background.png
Processed and saved: ./.cache/1_sep_face.png

# Step 2: convert the audio file into vectors
Some weights of Wav2VecModel were not initialized from the model checkpoint at ./pretrained_models/wav2vec/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:audio_separator.separator.separator:Separator version 0.17.2 instantiating with output_dir: ./.cache/audio_preprocess, output_format: WAV
INFO:audio_separator.separator.separator:Operating System: Linux #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2
INFO:audio_separator.separator.separator:System: Linux Node: ubuntu22-E500-G9-WS760T Release: 6.5.0-44-generic Machine: x86_64 Proc: x86_64
INFO:audio_separator.separator.separator:Python Version: 3.10.14
INFO:audio_separator.separator.separator:PyTorch Version: 2.2.2+cu121
INFO:audio_separator.separator.separator:FFmpeg installed: ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
INFO:audio_separator.separator.separator:ONNX Runtime GPU package installed with version: 1.18.0
INFO:audio_separator.separator.separator:CUDA is available in Torch, setting Torch device to CUDA
INFO:audio_separator.separator.separator:ONNXruntime has CUDAExecutionProvider available, enabling acceleration
INFO:audio_separator.separator.separator:Loading model Kim_Vocal_2.onnx...
INFO:audio_separator.separator.separator:Load model duration: 00:00:00
INFO:audio_separator.separator.separator:Starting separation process for audio_file_path: examples/driving_audios/1.wav
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.24s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 17.42it/s]
INFO:audio_separator.separator.separator:Saving Vocals stem to 1_(Vocals)_Kim_Vocal_2.wav...
INFO:audio_separator.separator.separator:Clearing input audio file paths, sources and stems...
INFO:audio_separator.separator.separator:Separation duration: 00:00:10
## Takes about 2 minutes to run

# Step 3: feed the multimodal features into the diffusion model's UNet
The config attributes {'center_input_sample': False, 'out_channels': 4} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: 
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight']

# Run the Stable Diffusion UNet (its pretrained weights are loaded here)
INFO:hallo.models.unet_3d:loaded temporal unet's pretrained weights from pretrained_models/stable-diffusion-v1-5/unet ...
The config attributes {'center_input_sample': False} were passed to UNet3DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.

# Run the motion module to produce the animation
Load motion module params from pretrained_models/motion_module/mm_sd_v15_v2.ckpt
INFO:hallo.models.unet_3d:Loaded 453.20928M-parameter motion module

# Run the Hallo framework (its weights are loaded here)
loaded weight from  ./pretrained_models/hallo/net.pth

# Generation runs in 12 iterations
[1/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.45it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 35.88it/s]
[2/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.45it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 36.18it/s]
[3/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.44it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 36.85it/s]
[4/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.44it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 36.77it/s]
[5/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:28<00:00,  1.43it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 36.43it/s]
[6/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.44it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 36.93it/s]
[7/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.43it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 37.34it/s]
[8/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.45it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 37.51it/s]
[9/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.44it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 35.63it/s]
[10/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.45it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 37.64it/s]
[11/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.44it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 37.71it/s]
[12/12]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00,  1.45it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 37.55it/s]
Moviepy - Building video .cache/output.mp4.
MoviePy - Writing audio in outputTEMP_MPY_wvf_snd.mp4
MoviePy - Done.                                                                                                                                                       
Moviepy - Writing video .cache/output.mp4

# Write the output mp4 file
Moviepy - Done !                                                                                                                                                      
Moviepy - video ready .cache/output.mp4
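
The log gives enough numbers for a rough estimate of the generation time on this machine. The arithmetic below uses only values read off the log: 12 iterations, a first progress bar of 40 steps per iteration (presumably the diffusion denoising steps), and a rate of about 1.45 it/s.

```python
# Numbers read off the log above.
num_iterations = 12       # "[1/12]" ... "[12/12]"
steps_per_iteration = 40  # first progress bar of each iteration: 40/40
steps_per_second = 1.45   # reported rate, roughly constant across iterations

denoise_seconds = num_iterations * steps_per_iteration / steps_per_second
print(f"estimated time: {denoise_seconds:.0f} s (~{denoise_seconds / 60:.1f} min)")
# estimated time: 331 s (~5.5 min)
```

This agrees with the roughly 27 to 28 seconds per iteration shown in the progress bars, so the 40-step loop dominates the run time; the second, 16-step bar is negligible by comparison.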

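Training Hallo

The project also provides training code, so we can train our own Hallo model. The steps are as follows.

Training data requirements

Requirements for the video files:
1. They must be cropped to a square.
2. The face must take up 50-70% of the frame.
3. The face must not be rotated by more than 30 degrees.

The video files are organized as follows: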

dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   |-- 0003.mp4
|   `-- 0004.mp4
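
A quick check of this layout before preprocessing might look like the sketch below. The helper name list_training_videos is made up for illustration; the only things taken from the layout above are the videos/ subdirectory and the .mp4 extension.

```python
from pathlib import Path


def list_training_videos(dataset_root):
    """Return the .mp4 files under <dataset_root>/videos, sorted by name."""
    videos_dir = Path(dataset_root) / "videos"
    if not videos_dir.is_dir():
        raise FileNotFoundError(f"expected a 'videos' directory at {videos_dir}")
    return sorted(p for p in videos_dir.iterdir() if p.suffix.lower() == ".mp4")
```

Calling list_training_videos("dataset_name") on the example above would return the four files 0001.mp4 to 0004.mp4 in order.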

Video preprocessing

python -m scripts.data_preprocess --input_dir dataset_name/videos --step 1
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2

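Run steps 1 and 2 in order, since they perform different tasks. Step 1 converts the videos into frames, extracts the audio from each video, and generates the necessary masks. Step 2 generates face embeddings with InsightFace and audio embeddings with Wav2Vec, and requires a GPU. For parallel processing, use the -p and -r arguments: -p specifies the total number of instances to launch, splitting the data into p parts, and -r specifies which part the current process should handle. You need to launch the instances manually, each with a different value of -r.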
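
The idea behind -p and -r can be illustrated with a small sharding function. This is only a sketch of the general technique: the round-robin split and the 0-based part index are assumptions made for the example, and the way scripts.data_preprocess actually divides the data may differ.

```python
def shard(items, num_parts, part_index):
    """Return the items assigned to one of num_parts workers (round-robin, 0-based index)."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    if not 0 <= part_index < num_parts:
        raise ValueError("part_index must be in [0, num_parts)")
    return items[part_index::num_parts]


videos = ["0001.mp4", "0002.mp4", "0003.mp4", "0004.mp4"]
for r in range(2):
    print(r, shard(videos, 2, r))
# 0 ['0001.mp4', '0003.mp4']
# 1 ['0002.mp4', '0004.mp4']
```

Every video lands in exactly one part, so the instances can run independently without coordinating with each other.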

python scripts/extract_meta_info_stage1.py -r path/to/dataset -n dataset_name
python scripts/extract_meta_info_stage2.py -r path/to/dataset -n dataset_name

Replace path/to/dataset with the path to the parent directory of videos, such as dataset_name in the example above. This generates dataset_name_stage1.json and dataset_name_stage2.json in the ./data directory.

Start training

