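After one of our group's servers was upgraded, many environments were lost. The drivers for its four RTX 3090 GPUs had already been installed, but there was no shared TensorFlow installation to use. Rather than wait indefinitely for the administrator to set up a shared environment, I worked out the TensorFlow installation myself.
The server runs Linux (CentOS) and the GPUs are NVIDIA RTX 3090s. The driver is already installed, so the relevant information can be displayed with: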
$ nvidia-smi
Tue May 28 20:54:09 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:25:00.0 Off |                  N/A |
| 31%   30C    P8             18W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
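Goal: install TensorFlow 2 and use it with GPU acceleration.
1. Choosing the software versions
To install TensorFlow with GPU support, three packages are needed: tensorflow 2 (in version 2 the GPU and CPU builds ship in a single package; version 1 requires installing them separately and is not covered here), cudnn, and cudatoolkit. In addition, ROOT, a package widely used in high-energy physics, also needs to be installed.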
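cudatoolkit: "The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library." The installed version must not be higher than the CUDA version shown in the GPU information (12.4 in the nvidia-smi output above), because that number is the highest CUDA version the installed driver supports.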
cudnn: choose a version compatible with cudatoolkit; see the NVIDIA website: https://developer.nvidia.com/rdp/cudnn-archive
python and tensorflow: choose the versions according to https://www.tensorflow.org/install/source?hl=zh-cn#gpu
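Here we choose cudatoolkit 11.2 and cudnn 8.1 (cudnn 8.1 supports cuda 11.x, and the cuda/cudnn GPU version table pairs cuda 11.2 with cudnn 8.1, which is also the pairing it lists for tensorflow 2.9, so the GPU setup should work), python 3.8 (a common version), tensorflow 2.9 (the version previously on the server), and root 6.26 (a common root version). The gcc version apparently only needs to be 7 or later (the package manager installed 10.3 for the environment above). All of these versions can be found with conda search in the conda channels, which is convenient.
The relevant environment variables also need to be configured; this is covered in the installation steps below.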
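The rule that the cudatoolkit version must not exceed the CUDA version reported by nvidia-smi can be checked mechanically. The following is only a minimal sketch: the function names `driver_cuda_version` and `toolkit_is_supported` are made up for illustration, and it assumes the header line contains a `CUDA Version: X.Y` field as in the output above.

```python
import re


def driver_cuda_version(smi_text):
    """Return the (major, minor) CUDA version from nvidia-smi's header, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_is_supported(toolkit, smi_text):
    """True if the toolkit version (e.g. "11.2") does not exceed the driver's CUDA version."""
    driver = driver_cuda_version(smi_text)
    if driver is None:
        raise ValueError("no 'CUDA Version' field found in the nvidia-smi output")
    wanted = tuple(int(part) for part in toolkit.split(".")[:2])
    return wanted <= driver


# Header line copied from the nvidia-smi output shown earlier.
header = "| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |"
print(toolkit_is_supported("11.2", header))  # True: 11.2 <= 12.4
print(toolkit_is_supported("12.6", header))  # False: 12.6 > 12.4
```

On the server itself, the text could be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of a pasted header line.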
2. Installation
The installation uses a conda virtual environment, which avoids dependency conflicts between different environments.
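- Create a new conda environment tf_env with python 3.8, then activate it so that the following steps install into it:
conda create -n tf_env python=3.8 -c conda-forge
conda activate tf_env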
- Search for and install cudatoolkit 11.2:
conda search cudatoolkit=11.2
conda install cudatoolkit=11.2
- Install cudnn 8.1:
conda install cudnn=8.1
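- Install python 3.8 (already done if the environment was created with python=3.8 as above; running it again is harmless):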
conda install python=3.8
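- Install tensorflow 2.9. This uses pip, the officially recommended method; usage is similar to conda. In my case I had to specify the index URL pypi.org/simple explicitly, otherwise name resolution failed. pip automatically installs many packages that tensorflow depends on; the packages installed in this environment can be listed with conda list.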
pip install tensorflow==2.9 -i https://pypi.org/simple
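- Configure the environment variable. TensorFlow is now installed. If the command below detects the GPUs and returns a tensor, the installation succeeded: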
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
# Output:
2024-05-28 22:08:21.114048: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-28 22:08:24.152377: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-28 22:08:28.675827: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2024-05-28 22:08:28.675894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22301 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:25:00.0, compute capability: 8.6
2024-05-28 22:08:28.689181: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2024-05-28 22:08:28.689211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22301 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:5b:00.0, compute capability: 8.6
2024-05-28 22:08:28.690197: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2024-05-28 22:08:28.690226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22301 MB memory: -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:9b:00.0, compute capability: 8.6
2024-05-28 22:08:28.690678: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2024-05-28 22:08:28.690713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 22301 MB memory: -> device: 3, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:c8:00.0, compute capability: 8.6
tf.Tensor(2062.545, shape=(), dtype=float32)
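- Install root 6.26. I recommend doing this step with the mamba package manager, a C++ reimplementation of conda whose dependency solver is much faster; with plain conda, solving the environment for root is extremely slow. This step also automatically installs gcc (10.3 by default here), which root depends on, along with many other libraries.
conda install mamba
mamba install root=6.26.6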
$ conda list | grep "gcc"
_libgcc_mutex             0.1       conda_forge    conda-forge
gcc                       10.3.0    he2824d0_10    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
gcc_impl_linux-64         10.3.0    hf2f2afa_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
gcc_linux-64              10.3.0    hc39de41_10    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-devel_linux-64     10.3.0    he6cfe16_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng                 13.2.0    h77fa898_7     conda-forge
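If TensorFlow fails to find the GPUs, a likely first thing to check is the LD_LIBRARY_PATH step from the list above. The sketch below is an illustration only: the helper name `check_conda_cuda_libs` is hypothetical, and it assumes the conda cudatoolkit and cudnn packages place `libcudart.so*` and `libcudnn.so*` under `$CONDA_PREFIX/lib`, which may differ between package builds.

```python
import os
from pathlib import Path


def check_conda_cuda_libs(env=None):
    """Report whether $CONDA_PREFIX/lib is on LD_LIBRARY_PATH and which CUDA libraries it holds."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        # No conda environment is active, so there is nothing to inspect.
        return {"lib_dir": None, "on_path": False, "cudart": [], "cudnn": []}
    lib_dir = Path(prefix) / "lib"
    # Path() normalizes a trailing slash, so ".../lib/" and ".../lib" compare equal.
    entries = [Path(p) for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    return {
        "lib_dir": str(lib_dir),
        "on_path": lib_dir in entries,
        "cudart": sorted(p.name for p in lib_dir.glob("libcudart.so*")),
        "cudnn": sorted(p.name for p in lib_dir.glob("libcudnn.so*")),
    }


print(check_conda_cuda_libs())
```

Run inside the activated tf_env environment, an `on_path` value of False or empty `cudart`/`cudnn` lists would suggest repeating the export or the conda install steps.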
ROOT is now installed. To test it, import ROOT in python:
>>> import ROOT as rt
>>> rt.__version__
'6.26/08'
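Since the point of the setup is to have TensorFlow and ROOT in the same environment, both checks can be combined into one script. This is a rough sketch rather than part of the original procedure: the function name `environment_report` is invented here, and a failed import is simply recorded as None instead of being diagnosed.

```python
def environment_report():
    """Collect the TensorFlow version, visible GPUs, and ROOT version (None if an import fails)."""
    report = {"tensorflow": None, "gpus": None, "root": None}
    try:
        import tensorflow as tf
        report["tensorflow"] = tf.__version__
        # Each entry is a PhysicalDevice; keep only its name, e.g. "/physical_device:GPU:0".
        report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except ImportError:
        pass
    try:
        import ROOT
        report["root"] = ROOT.__version__
    except ImportError:
        pass
    return report


print(environment_report())
```

On the server described here one would expect four entries under `gpus`; an empty list usually means TensorFlow imported but could not load the CUDA libraries.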