I'm currently working on an AnimateAnyone-related project built on Moore-AnimateAnyone. Training on a conventional GPU such as the A100 had always worked fine, but on H100 GPUs the same setup kept erroring out. After countless pitfalls I finally got a working Moore environment running on the H100, and this post documents the setup.
Reference: the Moore-AnimateAnyone Git repository
1. Pulling and starting the Docker image
Configuring the environment by hand is tedious; it is easier to pull a ready-made image, and the official PyTorch images are a good fit here.
The H100 appears to require a CUDA 12+ build. Reports differ: some sources say CUDA 11.8 or later is enough, but in my case the GPU insisted on 12 or above.
# Pull the Docker image
docker pull pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
# Start the container (5dba57 is the image ID)
docker run -itd -v /root/scratch/Moore-AnimateAnyone:/workspace 5dba57
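Once the container is up, a quick sanity check (a minimal sketch, assuming torch is importable inside the container) confirms that the CUDA build actually sees the H100s, which report compute capability 9.0:

```python
# Sanity check: confirm the container sees the GPUs and that the CUDA
# build is new enough for the H100 (compute capability 9.0).
def cuda_summary():
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available - was the container started with --gpus all?"
    major, minor = torch.cuda.get_device_capability(0)
    return (f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
            f"{torch.cuda.device_count()} GPU(s), compute capability {major}.{minor}")

print(cuda_summary())
```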
2. Environment setup
pip install
In principle you just install everything from the original requirements.txt without version pins, but my approach was to run the code directly and install whatever it complained was missing.
I hit the following problems.
Pitfall 1
Traceback (most recent call last):
File "/workspace/tools/extract_dwpose_from_vid.py", line 10, in <module>
from src.dwpose import DWposeDetector
File "/workspace/src/dwpose/__init__.py", line 12, in <module>
import cv2
File "/opt/conda/lib/python3.11/site-packages/cv2/__init__.py", line 181, in <module>
bootstrap()
File "/opt/conda/lib/python3.11/site-packages/cv2/__init__.py", line 153, in bootstrap
native_module = importlib.import_module("cv2")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
This is a well-known issue. The fix:
apt-get update
apt-get install libgl1
apt-get install libglib2.0-0
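To verify the fix took effect before re-importing cv2, a small stdlib-only probe (a hypothetical helper, not part of the repo) can check that the shared library is now loadable:

```python
# Probe whether libGL.so.1 is loadable, i.e. whether the apt-get commands
# above resolved the ImportError, before re-importing cv2.
import ctypes

def has_libgl():
    try:
        ctypes.CDLL("libGL.so.1")
        return True
    except OSError:
        return False

print(has_libgl())  # True once libgl1 is installed
```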
Pitfall 2
ModuleNotFoundError: No module named 'diffusers'
Fix:
Take this one seriously. It looks trivial, but it must be exactly this version: otherwise it triggers a whole cascade of problems, each with its own workaround, until things become unfixable. Pinning this single version made everything go smoothly.
pip install diffusers==0.24.0
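Because other diffusers versions cascade into hard-to-trace errors in this codebase, a small guard (a hypothetical helper, standard library only) can confirm the pin before launching training:

```python
# Guard against version drift: check the diffusers pin before training.
from importlib.metadata import version, PackageNotFoundError

def check_diffusers(expected="0.24.0"):
    try:
        installed = version("diffusers")
    except PackageNotFoundError:
        return "diffusers not installed: run pip install diffusers==0.24.0"
    if installed != expected:
        return f"diffusers {installed} found, expected {expected}"
    return f"diffusers {expected} OK"

print(check_diffusers())
```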
Pitfall 3: no GPU visible inside the container
Start the container with docker run --gpus all:
docker run --gpus all -itd -v /root/scratch/Moore-AnimateAnyone:/workspace
Pitfall 4: training error
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0
Fix 1 (this solved it for me):
In stage1.py, change
reference_unet = ori_net.reference_unet
denoising_unet = ori_net.denoising_unet
to
reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)
The downside, according to the issue thread, is that this approach uses more memory: you may encounter OOM if VRAM is not enough.
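The difference between the two assignments can be sketched with plain Python objects (a stand-in `Net` class; the real attributes are UNets):

```python
import copy

# Stand-in for the wrapper object in stage1.py; the real attributes are UNets.
class Net:
    def __init__(self):
        self.reference_unet = {"weight": [1.0]}
        self.denoising_unet = {"weight": [2.0]}

ori_net = Net()

# Plain assignment only aliases: both names point at the same object,
# which is what the fix works around.
aliased = ori_net.reference_unet
assert aliased is ori_net.reference_unet

# Fix 1: deep copies are fully independent objects, at the cost of keeping
# a second full copy of each UNet in memory (hence the OOM caveat above).
reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)
assert reference_unet is not ori_net.reference_unet
assert reference_unet == ori_net.reference_unet  # equal values, separate storage
```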
Fix 2 (I tried it without success and didn't pursue it further):
Someone in the thread attributed the problem to the batch size: the author did not set drop_last=True when creating the dataloader, yet the code assumes every batch is full, so a partial final batch triggers the shape mismatch.
train_dataloader = torch.utils.data.DataLoader(
train_dataset, batch_size=cfg.data.train_bs, shuffle=True, num_workers=4, drop_last=True
)
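The arithmetic behind this explanation, with hypothetical numbers for the dataset size and `cfg.data.train_bs`:

```python
# With N samples and batch size B, the last batch holds N % B samples
# whenever N % B != 0, while the training code assumes exactly B.
# drop_last=True simply discards that partial batch.
N, B = 1002, 4  # hypothetical dataset size and cfg.data.train_bs
full_batches, remainder = divmod(N, B)
print(full_batches, remainder)        # 250 full batches plus a final batch of 2
batches_seen_with_drop_last = N // B  # 250: the partial batch is discarded
```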
Pitfall 5: training error from insufficient shared memory
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
# Check the host's maximum available shared memory
df -h /dev/shm
Filesystem Size Used Avail Use% Mounted on
tmpfs 237G 0 237G 0% /dev/shm
# Set the Docker shared-memory size, e.g.:
# docker run -it --shm-size=256m ubuntu /bin/bash
docker run -itd --gpus all --shm-size 100g -v /root/scratch/Moore-AnimateAnyone:/workspace
# Inside the container, df -h now shows:
Filesystem Size Used Avail Use% Mounted on
overlay 113G 87G 27G 77% /
tmpfs 64M 0 64M 0% /dev
shm 100G 0 100G 0% /dev/shm
/dev/sdb 5.5T 108G 5.4T 2% /workspace
/dev/sda1 113G 87G 27G 77% /etc/hosts
tmpfs 237G 12K 237G 1% /proc/driver/nvidia
tmpfs 48G 4.0M 48G 1% /run/nvidia-persistenced/socket
tmpfs 237G 0 237G 0% /proc/acpi
tmpfs 237G 0 237G 0% /proc/scsi
tmpfs 237G 0 237G 0% /sys/firmware
tmpfs 237G 0 237G 0% /sys/devices/virtual/powercap
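The same total that `df -h /dev/shm` reports can also be read programmatically from inside the container (a stdlib-only sketch; returns None where no shm mount exists):

```python
# Read the shared-memory mount size: Docker's default of 64 MiB is far
# too small for multi-worker dataloaders.
import shutil

def shm_size_gib(path="/dev/shm"):
    try:
        return shutil.disk_usage(path).total / 2**30
    except FileNotFoundError:
        return None  # no shm mount on this host

print(shm_size_gib())
```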
Pitfall 6: train_stage_2 error
ValueError: Unexpected keyword arguments: encoder_hidden_states,timestep,attention_mask,video_length,self_attention_additional_feats,mode
Fix
See the explanation in the official GitHub issues:
Issue can be fixed by playing with torch version or by enabling gradient_checkpointing = False in stage2.yaml.
One thing I tried was using a different version of torch which seemed to fix the issue once pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
Otherwise your best bet is to turn off gradient checkpointing, unsure why exactly, we were running into the same issue if we turned on gradient checkpointing in stage1.yaml too
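Concretely, the config-side workaround is a one-line change in stage2.yaml; only the key name comes from the issue quoted above, and the rest of the file is omitted here:

```yaml
# stage2.yaml: disable gradient checkpointing to avoid the
# "Unexpected keyword arguments" error (trades extra memory for stability)
gradient_checkpointing: false
```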
Summary of package versions
Here is the final pip list. Note that these versions pair with the pytorch:2.4.0-cuda12.1-cudnn9 image; for more detail, just pull that image and inspect it.
accelerate 0.33.0
anaconda-anon-usage 0.4.4
antlr4-python3-runtime 4.9.3
archspec 0.2.3
asttokens 2.0.5
astunparse 1.6.3
attrs 23.1.0
av 12.3.0
beautifulsoup4 4.12.3
boltons 23.0.0
Brotli 1.0.9
certifi 2024.7.4
cffi 1.16.0
chardet 4.0.0
charset-normalizer 2.0.4
click 8.1.7
coloredlogs 15.0.1
conda 24.5.0
conda-build 24.5.1
conda-content-trust 0.2.0
conda_index 0.5.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
contourpy 1.2.1
controlnet-aux 0.0.9
cryptography 42.0.5
cycler 0.12.1
decorator 5.1.1
diffusers 0.24.0
distro 1.9.0
dnspython 2.6.1
einops 0.8.0
executing 0.8.3
expecttest 0.2.1
filelock 3.13.1
flatbuffers 24.3.25
fonttools 4.53.1
frozendict 2.4.2
fsspec 2024.6.1
gmpy2 2.1.2
huggingface-hub 0.24.5
humanfriendly 10.0
hypothesis 6.108.4
idna 3.7
imageio 2.35.0
importlib_metadata 8.2.0
ipython 8.25.0
jedi 0.19.1
Jinja2 3.1.4
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
kiwisolver 1.4.5
lazy_loader 0.4
libarchive-c 2.9
libmambapy 1.5.8
lintrunner 0.12.5
MarkupSafe 2.1.3
matplotlib 3.9.2
matplotlib-inline 0.1.6
menuinst 2.1.1
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 10.1.0
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
omegaconf 2.3.0
onnxruntime 1.19.0
opencv-python 4.10.0.84
opencv-python-headless 4.10.0.84
optree 0.12.1
packaging 24.1
parso 0.8.3
pexpect 4.8.0
pillow 10.4.0
pip 24.0
pkginfo 1.10.0
platformdirs 3.10.0
pluggy 1.0.0
prompt-toolkit 3.0.43
protobuf 5.27.3
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
pycosat 0.6.6
pycparser 2.21
Pygments 2.15.1
pyparsing 3.1.2
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-etcd 0.4.5
pytz 2024.1
PyYAML 6.0.1
referencing 0.30.2
regex 2024.7.24
requests 2.32.3
rpds-py 0.10.6
ruamel.yaml 0.17.21
safetensors 0.4.4
scikit-image 0.24.0
scipy 1.14.0
setuptools 69.5.1
six 1.16.0
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.2.0
sympy 1.12
tifffile 2024.8.10
timm 0.6.7
tokenizers 0.19.1
torch 2.4.0
torchaudio 2.4.0
torchelastic 0.2.2
torchvision 0.19.0
tqdm 4.66.4
traitlets 5.14.3
transformers 4.44.0
triton 3.0.0
truststore 0.8.0
types-dataclasses 0.6.6
typing_extensions 4.11.0
urllib3 2.2.2
wcwidth 0.2.5
wheel 0.43.0
zipp 3.20.0
zstandard 0.22.0
3. Results
The steps above look simple, but they are the final recipe distilled from many failures. Highly recommended.
python -m scripts.pose2vid --config ./configs/prompts/animation.yaml -W 512 -H 784 -L 64
python tools/extract_dwpose_from_vid.py --video_root /workspace/data/Batch_4/19489910/BV1mX4y1z75Q
root@46e99c62b9aa:/workspace# nvidia-smi
Sat Aug 17 02:52:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:01:00.0 Off | 0 |
| N/A 39C P0 52W / 350W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:02:00.0 Off | 0 |
| N/A 41C P0 51W / 350W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+