I'm currently working on an AnimateAnyone-related project built on Moore-AnimateAnyone. Training on a conventional GPU such as the A100 had always worked fine, but on H100 GPUs the same setup kept erroring out. After countless pitfalls I finally got a working Moore environment running on the H100, and this post documents the setup.
Reference: the Moore-AnimateAnyone Git repository
1. Pulling and starting the Docker image
Configuring the environment by hand is tedious; it is easier to pull a ready-made image, and the official PyTorch images are a good fit here.
The H100 appears to require a CUDA 12+ build. Reports differ: some sources say CUDA 11.8 or later is enough, but in my case the GPU insisted on 12 or above.
# Pull the Docker image
docker pull pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
# Start the container (5dba57 is the image ID)
docker run -itd -v /root/scratch/Moore-AnimateAnyone:/workspace 5dba57
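Once the container is up, a quick sanity check (a minimal sketch, assuming torch is importable inside the container) confirms that the CUDA build actually sees the H100s, which report compute capability 9.0:

```python
# Sanity check: confirm the container sees the GPUs and that the CUDA
# build is new enough for the H100 (compute capability 9.0).
def cuda_summary():
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available - was the container started with --gpus all?"
    major, minor = torch.cuda.get_device_capability(0)
    return (f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
            f"{torch.cuda.device_count()} GPU(s), compute capability {major}.{minor}")

print(cuda_summary())
```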
2. Environment setup
pip install
In principle you just install everything from the original requirements.txt without version pins, but my approach was to run the code directly and install whatever it complained was missing.
I hit the following problems.
Pitfall 1
Traceback (most recent call last):
File "/workspace/tools/extract_dwpose_from_vid.py", line 10, in <module>
from src.dwpose import DWposeDetector
File "/workspace/src/dwpose/__init__.py", line 12, in <module>
import cv2
File "/opt/conda/lib/python3.11/site-packages/cv2/__init__.py", line 181, in <module>
bootstrap()
File "/opt/conda/lib/python3.11/site-packages/cv2/__init__.py", line 153, in bootstrap
native_module = importlib.import_module("cv2")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
This is a well-known issue. The fix:
apt-get update
apt-get install libgl1
apt-get install libglib2.0-0
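To verify the fix took effect before re-importing cv2, a small stdlib-only probe (a hypothetical helper, not part of the repo) can check that the shared library is now loadable:

```python
# Probe whether libGL.so.1 is loadable, i.e. whether the apt-get commands
# above resolved the ImportError, before re-importing cv2.
import ctypes

def has_libgl():
    try:
        ctypes.CDLL("libGL.so.1")
        return True
    except OSError:
        return False

print(has_libgl())  # True once libgl1 is installed
```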
Pitfall 2
ModuleNotFoundError: No module named 'diffusers'
Fix:
Take this one seriously. It looks trivial, but it must be exactly this version: otherwise it triggers a whole cascade of problems, each with its own workaround, until things become unfixable. Pinning this single version made everything go smoothly.
pip install diffusers==0.24.0
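Because other diffusers versions cascade into hard-to-trace errors in this codebase, a small guard (a hypothetical helper, standard library only) can confirm the pin before launching training:

```python
# Guard against version drift: check the diffusers pin before training.
from importlib.metadata import version, PackageNotFoundError

def check_diffusers(expected="0.24.0"):
    try:
        installed = version("diffusers")
    except PackageNotFoundError:
        return "diffusers not installed: run pip install diffusers==0.24.0"
    if installed != expected:
        return f"diffusers {installed} found, expected {expected}"
    return f"diffusers {expected} OK"

print(check_diffusers())
```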
Pitfall 3: no GPU visible inside the container
Start the container with docker run --gpus all:
docker run --gpus all -itd -v /root/scratch/Moore-AnimateAnyone:/workspace
Pitfall 4: training error
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0
Fix 1 (this solved it for me):
In stage1.py, change
reference_unet = ori_net.reference_unet
denoising_unet = ori_net.denoising_unet
to
reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)
The downside, according to the issue thread, is that this approach uses more memory: you may encounter OOM if VRAM is not enough.
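The difference between the two assignments can be sketched with plain Python objects (a stand-in `Net` class; the real attributes are UNets):

```python
import copy

# Stand-in for the wrapper object in stage1.py; the real attributes are UNets.
class Net:
    def __init__(self):
        self.reference_unet = {"weight": [1.0]}
        self.denoising_unet = {"weight": [2.0]}

ori_net = Net()

# Plain assignment only aliases: both names point at the same object,
# which is what the fix works around.
aliased = ori_net.reference_unet
assert aliased is ori_net.reference_unet

# Fix 1: deep copies are fully independent objects, at the cost of keeping
# a second full copy of each UNet in memory (hence the OOM caveat above).
reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)
assert reference_unet is not ori_net.reference_unet
assert reference_unet == ori_net.reference_unet  # equal values, separate storage
```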
Fix 2 (I tried it without success and didn't pursue it further):
Someone in the thread attributed the problem to the batch size: the author did not set drop_last=True when creating the dataloader, yet the code assumes every batch is full, so a partial final batch triggers the shape mismatch.
train_dataloader = torch.utils.data.DataLoader(
train_dataset, batch_size=cfg.data.train_bs, shuffle=True, num_workers=4, drop_last=True
)
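The arithmetic behind this explanation, with hypothetical numbers for the dataset size and `cfg.data.train_bs`:

```python
# With N samples and batch size B, the last batch holds N % B samples
# whenever N % B != 0, while the training code assumes exactly B.
# drop_last=True simply discards that partial batch.
N, B = 1002, 4  # hypothetical dataset size and cfg.data.train_bs
full_batches, remainder = divmod(N, B)
print(full_batches, remainder)        # 250 full batches plus a final batch of 2
batches_seen_with_drop_last = N // B  # 250: the partial batch is discarded
```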
Pitfall 5: training error from insufficient shared memory
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
# Check the host's maximum available shared memory
df -h /dev/shm
Filesystem Size Used Avail Use% Mounted on
tmpfs 237G 0 237G 0% /dev/shm
# Set the Docker shared-memory size, e.g.:
# docker run -it --shm-size=256m ubuntu /bin/bash
docker run -itd --gpus all --shm-size 100g -v /root/scratch/Moore-AnimateAnyone:/workspace
# Inside the container, df -h now shows:
Filesystem Size Used Avail Use% Mounted on
overlay 113G 87G 27G 77% /
tmpfs 64M 0 64M 0% /dev
shm 100G 0 100G 0% /dev/shm
/dev/sdb 5.5T 108G 5.4T 2% /workspace
/dev/sda1 113G 87G 27G 77% /etc/hosts
tmpfs 237G 12K 237G 1% /proc/driver/nvidia
tmpfs 48G 4.0M 48G 1% /run/nvidia-persistenced/socket
tmpfs 237G 0 237G 0% /proc/acpi
tmpfs 237G 0 237G 0% /proc/scsi
tmpfs 237G 0 237G 0% /sys/firmware
tmpfs 237G 0 237G 0% /sys/devices/virtual/powercap
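The same total that `df -h /dev/shm` reports can also be read programmatically from inside the container (a stdlib-only sketch; returns None where no shm mount exists):

```python
# Read the shared-memory mount size: Docker's default of 64 MiB is far
# too small for multi-worker dataloaders.
import shutil

def shm_size_gib(path="/dev/shm"):
    try:
        return shutil.disk_usage(path).total / 2**30
    except FileNotFoundError:
        return None  # no shm mount on this host

print(shm_size_gib())
```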
Pitfall 6: train_stage_2 error
ValueError: Unexpected keyword arguments: encoder_hidden_states,timestep,attention_mask,video_length,self_attention_additional_feats,mode
Fix
See the explanation in the official GitHub issues:
Issue can be fixed by playing with torch version or by enabling gradient_checkpointing = False in stage2.yaml.
One thing I tried was using a different version of torch which seemed to fix the issue once pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
Otherwise your best bet is to turn off gradient checkpointing, unsure why exactly, we were running into the same issue if we turned on gradient checkpointing in stage1.yaml too
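Concretely, the config-side workaround is a one-line change in stage2.yaml; only the key name comes from the issue quoted above, and the rest of the file is omitted here:

```yaml
# stage2.yaml: disable gradient checkpointing to avoid the
# "Unexpected keyword arguments" error (trades extra memory for stability)
gradient_checkpointing: false
```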
Summary of package versions
Here is the final pip list. Note that these versions pair with the pytorch:2.4.0-cuda12.1-cudnn9 image; for more detail, just pull that image and inspect it.
accelerate 0.33.0
anaconda-anon-usage 0.4.4
antlr4-python3-runtime 4.9.3
archspec 0.2.3
asttokens 2.0.5
astunparse 1.6.3
attrs 23.1.0
av 12.3.0
beautifulsoup4 4.12.3
boltons 23.0.0
Brotli 1.0.9
certifi 2024.7.4
cffi 1.16.0
chardet 4.0.0
charset-normalizer 2.0.4
click 8.1.7
coloredlogs 15.0.1
conda 24.5.0
conda-build 24.5.1
conda-content-trust 0.2.0
conda_index 0.5.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
contourpy 1.2.1
controlnet-aux 0.0.9
cryptography 42.0.5
cycler 0.12.1
decorator 5.1.1
diffusers 0.24.0
distro 1.9.0
dnspython 2.6.1
einops 0.8.0
executing 0.8.3
expecttest 0.2.1
filelock 3.13.1
flatbuffers 24.3.25
fonttools 4.53.1
frozendict 2.4.2
fsspec 2024.6.1
gmpy2 2.1.2
huggingface-hub 0.24.5
humanfriendly 10.0
hypothesis 6.108.4
idna 3.7
imageio 2.35.0
importlib_metadata 8.2.0
ipython 8.25.0
jedi 0.19.1
Jinja2 3.1.4
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
kiwisolver 1.4.5
lazy_loader 0.4
libarchive-c 2.9
libmambapy 1.5.8
lintrunner 0.12.5
MarkupSafe 2.1.3
matplotlib 3.9.2
matplotlib-inline 0.1.6
menuinst 2.1.1
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 10.1.0
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
omegaconf 2.3.0
onnxruntime 1.19.0
opencv-python 4.10.0.84
opencv-python-headless 4.10.0.84
optree 0.12.1
packaging 24.1
parso 0.8.3
pexpect 4.8.0
pillow 10.4.0
pip 24.0
pkginfo 1.10.0
platformdirs 3.10.0
pluggy 1.0.0
prompt-toolkit 3.0.43
protobuf 5.27.3
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
pycosat 0.6.6
pycparser 2.21
Pygments 2.15.1
pyparsing 3.1.2
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-etcd 0.4.5
pytz 2024.1
PyYAML 6.0.1
referencing 0.30.2
regex 2024.7.24
requests 2.32.3
rpds-py 0.10.6
ruamel.yaml 0.17.21
safetensors 0.4.4
scikit-image 0.24.0
scipy 1.14.0
setuptools 69.5.1
six 1.16.0
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.2.0
sympy 1.12
tifffile 2024.8.10
timm 0.6.7
tokenizers 0.19.1
torch 2.4.0
torchaudio 2.4.0
torchelastic 0.2.2
torchvision 0.19.0
tqdm 4.66.4
traitlets 5.14.3
transformers 4.44.0
triton 3.0.0
truststore 0.8.0
types-dataclasses 0.6.6
typing_extensions 4.11.0
urllib3 2.2.2
wcwidth 0.2.5
wheel 0.43.0
zipp 3.20.0
zstandard 0.22.0
3. Results
The steps above look simple, but they are the final recipe distilled from many failures. Highly recommended.
python -m scripts.pose2vid --config ./configs/prompts/animation.yaml -W 512 -H 784 -L 64
python tools/extract_dwpose_from_vid.py --video_root /workspace/data/Batch_4/19489910/BV1mX4y1z75Q
root@46e99c62b9aa:/workspace# nvidia-smi
Sat Aug 17 02:52:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:01:00.0 Off | 0 |
| N/A 39C P0 52W / 350W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:02:00.0 Off | 0 |
| N/A 41C P0 51W / 350W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+