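Environment
Same as in the previous post.
Operating system: Ubuntu 22.04
GPU: A100
Installing tensorflow-gpu with Conda
As I read the TensorFlow website, tensorflow-gpu has to be built from source by the user. That is too much work, so I took the lazy route and installed it with the conda package manager.
Creating the virtual environment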
conda create -n tensorflow python=3.8
Activating the virtual environment
conda activate tensorflow
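Before installing anything, it can help to confirm that the interpreter really belongs to the new environment. The following is a minimal sketch, assuming conda exports the CONDA_DEFAULT_ENV variable on activation; the helper names active_conda_env and describe_interpreter are my own and are not part of conda.

```python
import os
import sys


def active_conda_env():
    """Return the active conda environment name, or None if none is active."""
    # Assumption: `conda activate` exports CONDA_DEFAULT_ENV.
    return os.environ.get("CONDA_DEFAULT_ENV")


def describe_interpreter():
    """Collect the environment name, Python version and interpreter path."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return {
        "conda_env": active_conda_env(),
        "python": version,
        "executable": sys.executable,
    }


print(describe_interpreter())
```

If conda_env is not tensorflow, or executable does not point into the environment's directory, the commands that follow would install into the wrong place.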
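Installing tensorflow-gpu with conda
Do not install with pip
Be careful here: do not run pip install --upgrade tensorflow-gpu==2.4.
At least on my machine, the resulting install was broken. The pip wheel looks for the CUDA 11.0 runtime (libcudart.so.11.0), which pip does not install, and the import reported that the library could not be found.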
>>> import tensorflow as tf
2023-05-22 21:52:56.234235: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-05-22 21:52:56.234290: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
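The warning above comes from the dynamic loader failing to open a shared library. A rough way to reproduce that check outside TensorFlow is to try loading the library with ctypes. This is only a sketch: can_load is a hypothetical helper name, and the lookup uses the process's normal library search path, so a False result for a library shipped inside a conda environment does not necessarily mean TensorFlow cannot find it.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library cannot be found or opened.
        return False
    return True


# The two CUDA runtime versions that appear in the logs of this post.
for name in ("libcudart.so.11.0", "libcudart.so.10.1"):
    print(name, "loadable" if can_load(name) else "NOT loadable")
```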
Installing with conda
conda install tensorflow-gpu
The installation completes, but verifying it produces an error, shown below.
python
Python 3.8.16 (default, Mar 2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-05-22 22:04:51.350939: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/dtypes.py:513: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
np.object,
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/__init__.py", line 41, in <module>
from tensorflow.python.tools import module_util as _module_util
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 46, in <module>
from tensorflow.python import data
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/__init__.py", line 25, in <module>
from tensorflow.python.data import experimental
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/experimental/__init__.py", line 96, in <module>
from tensorflow.python.data.experimental import service
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/experimental/service/__init__.py", line 140, in <module>
from tensorflow.python.data.experimental.ops.data_service_ops import distribute
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/experimental/ops/data_service_ops.py", line 25, in <module>
from tensorflow.python.data.experimental.ops import compression_ops
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/experimental/ops/compression_ops.py", line 20, in <module>
from tensorflow.python.data.util import structure
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/util/structure.py", line 26, in <module>
from tensorflow.python.data.util import nest
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/data/util/nest.py", line 41, in <module>
from tensorflow.python.framework import sparse_tensor as _sparse_tensor
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/sparse_tensor.py", line 29, in <module>
from tensorflow.python.framework import constant_op
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 29, in <module>
from tensorflow.python.eager import execute
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 27, in <module>
from tensorflow.python.framework import dtypes
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/dtypes.py", line 513, in <module>
np.object,
File "/home/yzhou/base/anaconda3/envs/tensorflow/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
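In other words, the installed numpy is too new: the np.object alias was removed in NumPy 1.24. The conda install log shows that numpy 1.24.3 was installed.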
numpy pkgs/main/linux-64::numpy-1.24.3-py38hf6e8229_1
numpy-base pkgs/main/linux-64::numpy-base-1.24.3-py38h060ed82_1
Downgrading numpy to 1.23.4 fixes this.
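The version reasoning above can be sketched as a small check. It assumes the cutoff is NumPy 1.24, the release in which np.object and similar aliases were removed. The helper names parse_major_minor and removes_np_object are mine, and only the major and minor components are compared.

```python
def parse_major_minor(version):
    """Turn a version string such as '1.24.3' into (1, 24)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def removes_np_object(numpy_version):
    """Return True if this NumPy version no longer provides np.object."""
    # Assumption: the alias was removed in NumPy 1.24.
    return parse_major_minor(numpy_version) >= (1, 24)


try:
    import numpy as np
    print(np.__version__, "-> np.object removed:", removes_np_object(np.__version__))
except ImportError:
    print("numpy is not installed in this environment")
```

For the two versions in this post, the check reports that 1.24.3 has dropped the alias and 1.23.4 still has it, which is why the downgrade makes the import work again.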
Be careful again: do not use pip for the downgrade; use conda.
With pip, you get the following errors.
$ pip install --upgrade numpy==1.23.4
Collecting numpy==1.23.4
Downloading numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.1/17.1 MB 3.1 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.24.3
Uninstalling numpy-1.24.3:
Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.4.1 requires typing-extensions~=3.7.4, which is not installed.
tensorflow 2.4.1 requires absl-py~=0.10, but you have absl-py 1.3.0 which is incompatible.
tensorflow 2.4.1 requires flatbuffers~=1.12.0, but you have flatbuffers 2.0 which is incompatible.
tensorflow 2.4.1 requires gast==0.3.3, but you have gast 0.4.0 which is incompatible.
tensorflow 2.4.1 requires grpcio~=1.32.0, but you have grpcio 1.48.2 which is incompatible.
tensorflow 2.4.1 requires numpy~=1.19.2, but you have numpy 1.23.4 which is incompatible.
tensorflow 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.4.1 requires tensorflow-estimator<2.5.0,>=2.4.0, but you have tensorflow-estimator 2.6.0 which is incompatible.
tensorflow 2.4.1 requires termcolor~=1.1.0, but you have termcolor 2.1.0 which is incompatible.
tensorflow 2.4.1 requires wrapt~=1.12.1, but you have wrapt 1.14.1 which is incompatible.
Successfully installed numpy-1.23.4
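Most of the conflicts pip lists above use the ~= (compatible release) operator, where ~=1.19.2 means "at least 1.19.2, and still within 1.19.*". The sketch below is a simplified reimplementation for plain numeric versions only, not pip's own resolver; the function names are hypothetical.

```python
def parse(version):
    """Turn '1.19.2' into (1, 19, 2); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(installed, spec):
    """Check an installed version against a `~=spec` requirement."""
    spec_parts = parse(spec)
    if len(spec_parts) < 2:
        raise ValueError("~= needs at least two version components")
    installed_parts = parse(installed)
    # Pad with zeros so that, for example, 1.19 compares equal to 1.19.0.
    width = max(len(installed_parts), len(spec_parts))
    padded_installed = installed_parts + (0,) * (width - len(installed_parts))
    padded_spec = spec_parts + (0,) * (width - len(spec_parts))
    # Every component except the last one in the spec must match exactly.
    prefix = spec_parts[:-1]
    return padded_installed >= padded_spec and padded_installed[:len(prefix)] == prefix


# Three of the requirements from the pip output above.
for package, spec, installed in [
    ("numpy", "1.19.2", "1.23.4"),
    ("six", "1.15.0", "1.16.0"),
    ("absl-py", "0.10", "1.3.0"),
]:
    ok = satisfies_compatible_release(installed, spec)
    print(f"{package}~={spec} with {installed} installed: {'ok' if ok else 'conflict'}")
```

All three pairs come out as conflicts, matching what pip reports. pip changed numpy without touching anything else in the environment, so the remaining mismatches are left in place.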
To repeat: do the downgrade with the conda package manager.
$ conda install numpy==1.23.4
Verifying the installation
$ python
Python 3.8.16 (default, Mar 2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
2023-05-22 22:24:42.577927: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-05-22 22:24:42.579631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:21:00.0 name: NVIDIA A100 80GB PCIe computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 79.19GiB deviceMemoryBandwidth: 1.76TiB/s
2023-05-22 22:24:42.579693: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-05-22 22:24:42.579757: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-05-22 22:24:42.579772: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2023-05-22 22:24:42.579784: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-05-22 22:24:42.579797: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-05-22 22:24:42.579810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-05-22 22:24:42.579824: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-05-22 22:24:42.579845: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2023-05-22 22:24:42.581861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
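The interactive check above can also be kept as a small script for re-running after later package changes. This is a minimal sketch: gpu_summary is a hypothetical helper name, and it catches AttributeError as well as ImportError because the numpy problem described earlier surfaces as an AttributeError during import.

```python
def gpu_summary():
    """Return the names of the GPUs TensorFlow sees, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except (ImportError, AttributeError) as error:
        # AttributeError covers the np.object failure shown earlier in this post.
        print("TensorFlow import failed:", error)
        return None
    gpus = tf.config.list_physical_devices("GPU")
    return [gpu.name for gpu in gpus]


print(gpu_summary())
```

An empty list means TensorFlow imported but found no GPU; None means the import itself failed.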
With that, all the tools for the daily grind are in place.