Heaven knows whose advice I'm actually supposed to follow in "cuDNN, cuFFT, and cuBLAS Errors · Issue #62075 · tensorflow/tensorflow · GitHub".
Step 1: Debug oneDNN
import os
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
import tensorflow as tf
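Setting 'TF_ENABLE_ONEDNN_OPTS' to 0 before importing tensorflow gets rid of the first warning (but who is going to set this every single time before running a program??? Not realistic. Which also suggests it isn't a big problem, so I'm ignoring it.)
Anyway, note to self: oneDNN = oneAPI Deep Neural Network Library.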
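One way to avoid retyping the setting in every script is to put it in a small helper and call that before the tensorflow import. This is only a sketch: the helper name configure_tf_env is made up for illustration, and TF_CPP_MIN_LOG_LEVEL is an extra variable I'm adding here (it is not used anywhere else in these notes) that filters TensorFlow's C++ log output by severity.

```python
import os


def configure_tf_env(disable_onednn=True, min_log_level=None):
    """Set TensorFlow-related environment variables.

    Hypothetical helper: it must be called BEFORE `import tensorflow`,
    because the variables are read when the library is loaded.
    """
    if disable_onednn:
        # Same setting as in the snippet above: turn off oneDNN custom ops.
        os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
    if min_log_level is not None:
        # 0 = show everything, 1 = hide INFO, 2 = also hide WARNING,
        # 3 = also hide ERROR.
        os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(min_log_level)
    keys = ("TF_ENABLE_ONEDNN_OPTS", "TF_CPP_MIN_LOG_LEVEL")
    return {k: os.environ[k] for k in keys if k in os.environ}


settings = configure_tf_env(min_log_level=1)
print(settings)
# import tensorflow as tf  # import only after the variables are set
```

The same variables could also be exported from the shell or from a conda activate.d script, so that no Python code has to change at all.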
Also, I really need to memorize the command for clearing the pip cache. Why can't I remember it!
pip cache purge
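It's just three words!
Also see "WSL2 - TensorFlow Install Issue Unable to register cuDNN factory"
and "TensorFlow 🤝Conda🤝NVIDIA GPU on Ubuntu".
So for tensorflow 2.14.0 and above you don't install cudatoolkit and cudnn yourself (the and-cuda extra pulls the CUDA libraries in through pip), which means this isn't a version-incompatibility problem, because I don't even have to install them!!!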
pip install tensorflow[and-cuda]
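One command and you're done! Even though it throws errors and doesn't work at all!!!
But a blogger who actually got rid of the errors says you can't install tensorflow 2.14.0 or above and have to install 2.13.0 instead. So I'll create another env and follow her steps one by one to see whether that fixes it, that is, the blog post with the two 🤝 in its title.
Step 2: 🤝🤝
If I install with conda install -c conda-forge cudatoolkit=11.8 cudnn=8.8, pip stops working afterwards, and I don't know why.
It gives the following error, and then no matter how I run pip install tensorflow==2.13.0 it won't install.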
<frozen graalpy.pip_hook>:48: RuntimeWarning: You are using an untested version of pip. GraalPy provides patches and workarounds for a number of packages when used with compatible pip versions. We recommend to stick with the pip version that ships with this version of GraalPy.
So I'll install them myself.
See https://www.tensorflow.org/install/source#gpu
(vivit-env) dddcyy@dddcyy6100846:~$ conda search cudatoolkit
Loading channels: done
# Name Version Build Channel
cudatoolkit 9.0 h13b8566_0 pkgs/main
cudatoolkit 9.2 0 pkgs/main
cudatoolkit 10.0.130 0 pkgs/main
cudatoolkit 10.1.168 0 pkgs/main
cudatoolkit 10.1.243 h6bb024c_0 pkgs/main
cudatoolkit 10.2.89 hfd86e86_0 pkgs/main
cudatoolkit 10.2.89 hfd86e86_1 pkgs/main
cudatoolkit 11.0.221 h6bb024c_0 pkgs/main
cudatoolkit 11.3.1 h2bc3f7f_2 pkgs/main
cudatoolkit 11.8.0 h6a678d5_0 pkgs/main
(vivit-env) dddcyy@dddcyy6100846:~$ conda search cudnn
Loading channels: done
# Name Version Build Channel
cudnn 7.0.5 cuda8.0_0 pkgs/main
cudnn 7.1.2 cuda9.0_0 pkgs/main
cudnn 7.1.3 cuda8.0_0 pkgs/main
cudnn 7.2.1 cuda9.2_0 pkgs/main
cudnn 7.3.1 cuda10.0_0 pkgs/main
cudnn 7.3.1 cuda9.0_0 pkgs/main
cudnn 7.3.1 cuda9.2_0 pkgs/main
cudnn 7.6.0 cuda10.0_0 pkgs/main
cudnn 7.6.0 cuda10.1_0 pkgs/main
cudnn 7.6.0 cuda9.0_0 pkgs/main
cudnn 7.6.0 cuda9.2_0 pkgs/main
cudnn 7.6.4 cuda10.0_0 pkgs/main
cudnn 7.6.4 cuda10.1_0 pkgs/main
cudnn 7.6.4 cuda9.0_0 pkgs/main
cudnn 7.6.4 cuda9.2_0 pkgs/main
cudnn 7.6.5 cuda10.0_0 pkgs/main
cudnn 7.6.5 cuda10.1_0 pkgs/main
cudnn 7.6.5 cuda10.2_0 pkgs/main
cudnn 7.6.5 cuda9.0_0 pkgs/main
cudnn 7.6.5 cuda9.2_0 pkgs/main
cudnn 8.2.1 cuda11.3_0 pkgs/main
cudnn 8.9.2.26 cuda11_0 pkgs/main
cudnn 8.9.2.26 cuda12_0 pkgs/main
cudnn 9.1.1.17 cuda12_0 pkgs/main
So I'll go with cudatoolkit==11.8.0 and cudnn==8.9.2.26 (the cuda11_0 build).
conda create -n vivit-env python=3.10
conda activate vivit-env
conda install cudatoolkit==11.8.0
conda install cudnn==8.9.2.26
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
conda deactivate
echo $CONDA_PREFIX
:/home/dddcyy/miniconda3/envs/vivit-env
echo $LD_LIBRARY_PATH
:/home/dddcyy/miniconda3/envs/vivit-env/lib/
conda activate vivit-env
pip install tensorflow==2.13
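Before blaming tensorflow, it may be worth confirming that the activate.d script really put the env's lib directory on LD_LIBRARY_PATH. A minimal sketch in Python, assuming a colon-separated LD_LIBRARY_PATH as on Linux; the function name lib_dir_on_path is hypothetical.

```python
import os


def lib_dir_on_path(conda_prefix, ld_library_path):
    """Return True if <conda_prefix>/lib is one of the LD_LIBRARY_PATH entries."""
    wanted = os.path.normpath(os.path.join(conda_prefix, "lib"))
    # Drop empty entries (e.g. from a leading ':') before comparing.
    entries = [e for e in ld_library_path.split(":") if e]
    return any(os.path.normpath(e) == wanted for e in entries)


prefix = os.environ.get("CONDA_PREFIX", "")
ld_path = os.environ.get("LD_LIBRARY_PATH", "")
if prefix:
    print("env lib dir on LD_LIBRARY_PATH:", lib_dir_on_path(prefix, ld_path))
else:
    print("CONDA_PREFIX is not set; no conda env seems to be active")
```

Run it inside the activated env; if it prints False, the env_vars.sh script was probably not sourced (for example because the env was not re-activated after the script was written).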
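You think that's it? Nope! It complains that there's no CUDA driver. What???? I've never gotten this error before and I was stunned. How could my machine not have one? Then Stack Overflow told me to install TensorRT. Fine, I'll install it, and conveniently 🤝🤝 covers that too!!!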
pip install tensorrt==8.5.3.1
TENSORRT_PATH=$(dirname $(python -c "import tensorrt;print(tensorrt.__file__)"))
echo $TENSORRT_PATH
:/home/dddcyy/miniconda3/envs/vivit-env/lib/python3.10/site-packages/tensorrt
#linking tensorrt library files to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dddcyy/miniconda3/envs/vivit-env/lib/python3.10/site-packages/tensorrt
conda deactivate
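The export line above hardcodes the site-packages path. As a sketch of an alternative, the directory can be looked up from Python with importlib, which locates the package without importing it. The helper name package_dir is made up for illustration, and it returns None when the package is not installed.

```python
import importlib.util
import os


def package_dir(module_name):
    """Return the directory of an installed top-level module, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.origin:
        # Not installed, or a namespace package without a single location.
        return None
    return os.path.dirname(spec.origin)


trt_dir = package_dir("tensorrt")
if trt_dir is None:
    print("tensorrt is not installed in this environment")
else:
    # Print the line to paste into the shell or into env_vars.sh.
    print(f"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{trt_dir}")
```

Note that an export typed into the shell only lasts for that session; to make it stick, the printed line would have to go into the activate.d script as well.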
Only after installing tensorrt did pip pull in the nvidia-prefixed packages, so I guess it really is needed.
Installing collected packages: nvidia-cuda-runtime-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, tensorrt
Successfully installed nvidia-cublas-cu11-11.11.3.6 nvidia-cuda-runtime-cu11-11.8.89 nvidia-cudnn-cu11-9.6.0.74 tensorrt-8.5.3.1
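Since the pip log above reports nvidia-cudnn-cu11 at 9.6.0.74 while conda was asked for cudnn 8.9.2.26, the environment may now contain two cuDNN copies. A small sketch for listing what pip actually installed, using only the standard library; the function name and the prefix list are my own choices, not something from the blog posts.

```python
from importlib import metadata


def gpu_related_packages(prefixes=("nvidia-", "tensorrt", "tensorflow")):
    """Return {distribution name: version} for pip packages matching the prefixes."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith(prefixes):
            found[name] = dist.version
    return found


for name, version in sorted(gpu_related_packages().items()):
    print(f"{name}=={version}")
```

This only covers pip-installed packages; the conda-installed cudatoolkit and cudnn would still have to be checked with conda list.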
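So is that mission accomplished? I did install tensorflow first and tensorrt second, surely the order can't get in my way too??? Turns out that even after reinstalling tensorflow there are still messages; it just finds new ways to complain. The three "Unable to register ... factory" errors are gone, but what came next doesn't look negligible either, hahahaha.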
(vivit-env) dddcyy@dddcyy6100846:~$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2025-01-13 12:57:24.373915: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-13 12:57:24.546476: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-13 12:57:26.321954: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2025-01-13 12:57:26.343778: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2025-01-13 12:57:26.343840: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
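Ignoring oneDNN, what's left is three NUMA messages. I used to be glad I didn't have these, hahaha, and now NUMA has finally come for me too. What's coming will come. But 🤝🤝 covers it as well, phew, I nearly died of a heart attack.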
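One thing that helps with triage: in TensorFlow's C++ log lines, the single letter after the timestamp is the severity (I, W, E or F). A rough sketch that classifies pasted log lines; the regular expression is my own guess at the line layout based on the output above, so treat it as an assumption.

```python
import re
from collections import Counter

# Layout assumed from the log above: "<date> <time>: <letter> <file>:<line>] <message>"
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) \S+\] (.*)$")
SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def severity(line):
    """Return the severity name of a TensorFlow log line, or None if it doesn't match."""
    match = LOG_LINE.match(line)
    return SEVERITY[match.group(1)] if match else None


sample = [
    "2025-01-13 12:57:26.321954: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node",
    "Your kernel may have been built without NUMA support.",
]
print(Counter(severity(line) for line in sample))
```

In the output pasted above, every line carries the letter I, and the GPU is still listed at the end, which suggests these are informational messages and not hard failures.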
Non-Uniform Memory Access (NUMA)
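See the "Fixing NUMA problem" blog post mentioned in 🤝🤝; it should still be solvable.
Step 3: Fixing NUMA problems
I'm curious why the answers on GitHub don't help and what finally solves my problems is Medium???? Is Medium the new dawn now???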
lspci | grep -i nvidia
I can't run step one. Can I give up? I want to give up.
Step two: I only have three entries.
ls /sys/bus/pci/devices/
490b:00:00.0 5582:00:00.0 92ab:00:00.0
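And I really do have only three entries, so I can't do the next step either.
I simply don't have 0000:01:00.0/numa_node at all!!! I don't know how to fix this NUMA thing, but maybe I can just not fix it?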
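To see exactly which PCI entries expose a numa_node file, without relying on lspci, the directory can be walked from Python. A minimal sketch; the function name read_numa_nodes is made up, and the base path is the same /sys/bus/pci/devices directory listed above.

```python
import os


def read_numa_nodes(base="/sys/bus/pci/devices"):
    """Map each PCI device entry to the content of its numa_node file (None if missing)."""
    result = {}
    if not os.path.isdir(base):
        return result
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name, "numa_node")
        try:
            with open(path) as fh:
                result[name] = fh.read().strip()
        except OSError:
            # No numa_node file for this entry, or it is not readable.
            result[name] = None
    return result


for device, node in read_numa_nodes().items():
    shown = node if node is not None else "<no numa_node file>"
    print(device, "numa_node =", shown)
```

If none of the entries has a numa_node file, there is nothing to write a value into. Given that TensorFlow logs the NUMA lines at info level and still lists the GPU, leaving it alone is probably fine, though that is my reading of the log and not something the blog posts confirm.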