Setting Up a PyTorch GPU Environment on Linux

This article assumes the GPU driver and CUDA are already installed. If you need help installing them, see my previous article: https://blog.csdn.net/pdc31czy/article/details/136072017?spm=1001.2014.3001.5501

These notes were written on an HPC cluster. On an ordinary computer, skip step 1.

1. Log in to the GPU node

[zychen@sms ~]$ ssh gpunode1

2. Check the GPU information

[zychen@gpunode1 ~]$ nvidia-smi

3. Check the CUDA version

[zychen@gpunode1 ~]$ nvcc --version
bash: nvcc: command not found…

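4. The output above shows that nvcc was not found. First confirm that CUDA is actually installed (the commands below assume it lives under /usr/local/cuda), then add CUDA to PATH: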

[zychen@gpunode1 ~]$ export PATH=/usr/local/cuda/bin:$PATH
[zychen@gpunode1 ~]$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

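5. The exports above already apply to the current shell. To make them permanent, add the same two export lines to ~/.bashrc, then run source ~/.bashrc to reload it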

[zychen@gpunode1 ~]$ source ~/.bashrc

6. Check the CUDA compiler version again

[zychen@gpunode1 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

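The output above shows that the CUDA compiler is release 11.8, built on September 21, 2022. Since CUDA 11.8 is already available, the next step is to download and install Anaconda.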
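If you would rather read the CUDA release from a script than by eye, the nvcc --version output can be parsed. The following is a minimal Python sketch and not part of the original setup: the function names parse_cuda_release and detect_cuda_release are hypothetical, and the regular expression assumes the "release X.Y" wording shown in the output above.

```python
import re
import shutil
import subprocess


def parse_cuda_release(nvcc_output):
    """Return the CUDA release (e.g. '11.8') found in `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def detect_cuda_release(nvcc="nvcc"):
    """Run nvcc if it is on PATH and return its release string, or None."""
    path = shutil.which(nvcc)
    if path is None:
        # Same situation as "nvcc: command not found" in step 3.
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return parse_cuda_release(result.stdout)


if __name__ == "__main__":
    release = detect_cuda_release()
    if release is None:
        print("nvcc not found on PATH, or its output was not recognized")
    else:
        print(f"CUDA release: {release}")
```

A result of None usually means PATH still lacks the CUDA bin directory, which is the case steps 4 and 5 address.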

7. Download Anaconda with wget

https://www.anaconda.com/download/
On this page, skip registration, find the Linux installer, then right-click it and copy the download link.

[zychen@gpunode1 ~]$ cd ~
[zychen@gpunode1 ~]$ mkdir tmp
[zychen@gpunode1 ~]$ cd tmp
[zychen@gpunode1 tmp]$ wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh

ERROR: cannot verify repo.anaconda.com’s certificate, issued by ‘/C=US/O=Let’s Encrypt/CN=E1’:
Issued certificate has expired.
To connect to repo.anaconda.com insecurely, use `--no-check-certificate'.

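If you see the expired-certificate error above, retry the download with the flag that wget suggests. Keep in mind that --no-check-certificate disables certificate verification for this download.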

[zychen@gpunode1 tmp]$ wget --no-check-certificate https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
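Because the download above skipped certificate verification, it may be worth comparing the installer's SHA-256 digest with the value Anaconda publishes for that file. This is a minimal Python sketch under that assumption: sha256_of_file is a hypothetical helper, and the expected value below is a placeholder you would have to look up yourself, not a real hash.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large installer is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    installer = "Anaconda3-2024.02-1-Linux-x86_64.sh"
    # Placeholder: paste the digest published by Anaconda for this installer.
    expected = "<published SHA-256 goes here>"
    if os.path.exists(installer):
        actual = sha256_of_file(installer)
        print(actual)
        print("digest matches" if actual == expected else "digest does NOT match")
    else:
        print(f"{installer} not found in the current directory")
```

Run it from the tmp folder that holds the installer. A mismatch means the file should be downloaded again before it is executed.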

8. Install Anaconda with bash

[zychen@gpunode1 tmp]$ ls
Anaconda3-2024.02-1-Linux-x86_64.sh

[zychen@gpunode1 tmp]$ bash Anaconda3-2024.02-1-Linux-x86_64.sh
Keep pressing Enter to page through the license until you are asked yes or no, then type yes.

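When the installation finishes, the installer asks whether to update your shell profile so that conda is initialized automatically whenever a new shell session starts: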
Do you wish to update your shell profile to automatically initialize conda?
This will activate conda on startup and change the command prompt when activated.
If you’d prefer that conda’s base environment not be activated on startup,
run the following command when conda is activated:

conda config --set auto_activate_base false

You can undo this by running conda init --reverse $SHELL? [yes|no]

If you want conda to be activated automatically in new shell sessions (which also changes the command prompt), type yes.

9. Source the .bashrc file to add Anaconda to your PATH

[zychen@gpunode1 tmp]$ cd ~
[zychen@gpunode1 ~]$ source .bashrc

10. Create a virtual environment

(base) [zychen@gpunode1 ~]$ conda create -n torch39env python=3.9.18
(base) [zychen@gpunode1 ~]$ conda activate torch39env

(torch39env) [zychen@gpunode1 ~]$ python
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

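The output above confirms that an environment named torch39env with Python 3.9.18 has been created. All that remains is to pip install the PyTorch packages into it. Press Ctrl+D (or type exit()) to leave the Python interpreter and return to the torch39env shell. On Linux, Ctrl+Z only suspends the interpreter rather than exiting it.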

11. Install PyTorch:

(torch39env) [zychen@gpunode1 ~]$ pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1+cu118 -f https://download.pytorch.org/whl/cu118/torch_stable.html

12. Verify the installation

(torch39env) [zychen@gpunode1 ~]$ python -c "import torch; print(torch.cuda.is_available()); print(torch.version.cuda)"
True
11.8

The PyTorch environment is now configured. Other packages such as numpy can be installed inside this environment (torch39env) with pip.
For example:
(torch39env) [zychen@gpunode1 ~]$ pip install numpy
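The one-line check in step 12 can be extended into a slightly fuller report. The following is a minimal sketch, assuming it is run inside torch39env; gpu_report is a hypothetical helper name, and the small matrix product is only a smoke test that a tensor can actually be placed on the GPU.

```python
import importlib.util


def gpu_report():
    """Collect basic facts about the PyTorch install and its CUDA support."""
    if importlib.util.find_spec("torch") is None:
        # torch is not importable, e.g. the wrong conda environment is active.
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,  # CUDA version torch was built with
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Smoke test: a 2x2 matrix of ones times itself sums to 8.
        x = torch.ones(2, 2, device="cuda")
        report["matmul_ok"] = bool((x @ x).sum().item() == 8.0)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False while nvidia-smi works, a common cause is that a CPU-only wheel was installed instead of the +cu118 build from step 11.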


13. Extra: running code on the A100 by submitting a batch job

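The procedure is much the same as before: first run ssh gpunode1 to log in to the first GPU node, then run conda activate torch39env to activate the PyTorch virtual environment, and cd to the folder that contains the code you want to run (the pwd command prints the folder's path).
For example, to run "CH_11.27.5.py" in the folder /home/zychen/PINN/11.27.5_Disc_2DCH_A100,
I create a file named "gpu_11_27_5.sh" in /home/zychen/PINN/11.27.5_Disc_2DCH_A100 with the following format: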

#!/bin/bash
#SBATCH -J job_name
#SBATCH -p gpunode.q
##SBATCH --nodelist=gpunode
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -o job_name.out
#SBATCH --exclusive

./program

Change job_name to any name you like and replace ./program with the command that runs your own code. The modified script looks like this:

#!/bin/bash
#SBATCH -J 11_27_5
#SBATCH -p gpunode.q
##SBATCH --nodelist=gpunode
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -o 11_27_5.out
#SBATCH --exclusive

python /home/zychen/PINN/11.27.5_Disc_2DCH_A100/CH_11.27.5.py
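Since only the job name and the final command change between jobs, the script above can also be generated from a small template. This is a minimal Python sketch rather than part of the original workflow: render_sbatch is a hypothetical helper, it reproduces only the active #SBATCH directives shown above (the commented-out --nodelist line is left out), and the partition name gpunode.q is specific to this cluster.

```python
def render_sbatch(job_name, command, partition="gpunode.q"):
    """Return the text of a Slurm batch script with the directives used above."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -J {job_name}",      # job name shown by squeue
        f"#SBATCH -p {partition}",     # partition (queue) to submit to
        "#SBATCH -n 1",                # one task
        "#SBATCH -N 1",                # one node
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH -o {job_name}.out",  # file that receives the job's output
        "#SBATCH --exclusive",         # do not share the node with other jobs
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = render_sbatch(
        "11_27_5",
        "python /home/zychen/PINN/11.27.5_Disc_2DCH_A100/CH_11.27.5.py",
    )
    # Print instead of writing a file, so nothing is overwritten by accident.
    print(script, end="")
```

Redirecting the printed text into gpu_11_27_5.sh gives the same file as the hand-edited version, minus the commented-out line.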

Save "gpu_11_27_5.sh" in the folder /home/zychen/PINN/11.27.5_Disc_2DCH_A100.
Run sbatch gpu_11_27_5.sh to submit the job to the queue, and squeue to check the queue status.
For example:

[zychen@sms ~]$ ssh gpunode1
Last login: Sat Aug  3 17:30:55 2024 from sms
[zychen@gpunode1 ~]$ conda activate torch39env
(torch39env) [zychen@gpunode1 ~]$ cd PINN/
(torch39env) [zychen@gpunode1 PINN]$ cd 11.27.5_Disc_2DCH_A100/
(torch39env) [zychen@gpunode1 11.27.5_Disc_2DCH_A100]$ ls
boundary_value  current_col_pt               current_figures_q     data_loss_f     figures        figures_q_count  loss_count
CH_11.27.5.py   current_figures              current_loss_figures  debug           figures_count  gpu_11_27_5.sh   models
col_pts_count   current_figures_matlab_data  data_loss_b           debug_evaluate  figures_q      IRK_weights      Points_Data
(torch39env) [zychen@gpunode1 11.27.5_Disc_2DCH_A100]$ vim gpu_11_27_5.sh 
(torch39env) [zychen@gpunode1 11.27.5_Disc_2DCH_A100]$ sbatch gpu_11_27_5.sh 
Submitted batch job 14704
(torch39env) [zychen@gpunode1 11.27.5_Disc_2DCH_A100]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             14704 gpunode.q  11_27_5   zychen PD       0:00      1 (Resources)
             14679 cpunode.q p3D_lift  jpzhang  R    8:11:23      8 cpunode[9-12,29-32]
             14701 gpunode.q job_name   zychen  R    1:37:42      1 gpunode1
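In the squeue listing above, the ST column holds the job state: PD means pending and R means running, and (Resources) in the last column indicates the job is waiting for resources to become free. To pick out your own jobs from a script, the listing can be split into columns. A minimal Python sketch, assuming the default column layout shown above and job names without spaces; parse_squeue is a hypothetical helper, and the sample text is copied from the transcript.

```python
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             14704 gpunode.q  11_27_5   zychen PD       0:00      1 (Resources)
             14679 cpunode.q p3D_lift  jpzhang  R    8:11:23      8 cpunode[9-12,29-32]
             14701 gpunode.q job_name   zychen  R    1:37:42      1 gpunode1
"""


def parse_squeue(text):
    """Turn squeue's default table into a list of dicts keyed by column name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Split into at most len(header) fields; the last column keeps the rest.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


if __name__ == "__main__":
    for job in parse_squeue(SAMPLE):
        if job["USER"] == "zychen":
            print(job["JOBID"], job["NAME"], job["ST"], job["NODELIST(REASON)"])
```

On the cluster, the same function could be fed the text captured from running squeue instead of the sample.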