Notes on Setting Up a PyTorch + CUDA GPU Environment for LLM Development on Linux
2025-05-17
This setup is suitable for production use: training, developing, and running large language models.
1. Determine the OS and architecture
# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
# uname -m
x86_64
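The same check can be scripted. A minimal Python sketch using only the standard library (os_chk.py is a name chosen here, not part of the original procedure):

# os_chk.py - print OS release info and CPU architecture
import platform

print(platform.machine())   # expect: x86_64
with open("/etc/os-release") as f:
    print(f.read())         # NAME, VERSION, ...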
2. Check disk space
# df -h /llm
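If you would rather script the space check than eyeball the df output, a minimal sketch (the 100 GiB threshold is an assumption; size it to the models you plan to download):

# disk_chk.py - verify /llm has enough free space for model downloads
import shutil

free_gib = shutil.disk_usage("/llm").free / 2**30
print(f"/llm free: {free_gib:.1f} GiB")
if free_gib < 100:  # assumed threshold, adjust as needed
    raise SystemExit("not enough free space under /llm")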
The system already has Anaconda3 and Miniconda3 installs, but both have problems.
To leave them untouched, I install a fresh Anaconda3 alongside them; crucially, do not let the installer auto-configure the shell.
3. Install Anaconda3
# cd /llm/Downloads/
# wget -c https://mirrors.ustc.edu.cn/anaconda/archive/Anaconda3-2024.10-1-Linux-x86_64.sh --no-check-certificate
# bash Anaconda3-2024.10-1-Linux-x86_64.sh
Specify the install directory: /llm/huggingface/anaconda3
At the conda init prompt, keep the default [no] (do not configure anything): just press Enter.
# cd ~/Workspace
# ln -s /llm/huggingface huggingface
# ln -s /llm/Downloads Downloads
From now on, the following paths are equivalent:
~/Workspace/huggingface -> /llm/huggingface
~/Workspace/Downloads -> /llm/Downloads
Test that Anaconda3 is installed correctly:
# conda deactivate
# source ~/Workspace/huggingface/anaconda3/etc/profile.d/conda.sh
# export PATH="$HOME/Workspace/huggingface/anaconda3/bin:$PATH"
# conda activate base
# conda deactivate
# which conda
It must print the path of the new install: /llm/huggingface/anaconda3/bin/conda
Then create the virtual environment pytorch_env:
# conda create -n pytorch_env python=3.10
All subsequent work happens inside pytorch_env!
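To double-check later that a shell really picked up this environment rather than one of the pre-existing installs, a minimal sketch run inside the activated env (the asserted path is specific to this setup):

# env_chk.py - confirm the active interpreter lives in pytorch_env
import sys

print(sys.version)      # should report Python 3.10.x
print(sys.executable)   # should sit under /llm/huggingface/anaconda3/envs/pytorch_env
assert "pytorch_env" in sys.executable, "wrong environment is active!"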
4. Install PyTorch
Check the system CUDA version, then install the matching PyTorch build: 2.6.0.
(The CUDA + cuDNN installation itself is omitted here.)
# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Find the install command for CUDA 12.4 on the previous-versions page:
https://pytorch.org/get-started/previous-versions/
# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
Now install the CUDA build of PyTorch:
# conda deactivate
# source ~/Workspace/huggingface/anaconda3/etc/profile.d/conda.sh
# export PATH="$HOME/Workspace/huggingface/anaconda3/bin:$PATH"
# conda env list
base
pytorch_env
# conda activate pytorch_env
(pytorch_env) # python
Python 3.10.17 | ...
(pytorch_env) # pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
Test that PyTorch installed successfully:
(pytorch_env) # python
Python 3.10.17
>>> import torch
>>> print(torch.__version__)
2.6.0+cu124
>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.get_device_name(0))
NVIDIA GeForce RTX 3090
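The interactive checks above only prove the CUDA runtime is visible. A short sketch that actually launches a kernel on the GPU catches driver/toolkit mismatches earlier (the matrix size is arbitrary):

# gpu_chk.py - run a real computation on the GPU
import torch

assert torch.cuda.is_available(), "CUDA not available"
device = torch.device("cuda:0")
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b                   # launches a cuBLAS matmul kernel
torch.cuda.synchronize()    # wait for the kernel to finish
print("matmul ok:", c.shape, c.device)
print("GPU memory used: %.1f MiB" % (torch.cuda.memory_allocated() / 2**20))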
5. Install the Hugging Face stack
With PyTorch in place, install Hugging Face's transformers library (the core toolkit) and the related dependencies as follows.
Create an environment script: /llm/huggingface/pytorch_env.sh
#!/bin/bash
# 2025-05-19, zhangliang
# source pytorch_env.sh
#
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
export HF_ENDPOINT="https://hf-mirror.com"
export HF_HOME=/llm/huggingface
export HF_MODELS=/llm/huggingface/models
export ANACONDA3_HOME=/llm/huggingface/anaconda3
source "$ANACONDA3_HOME/etc/profile.d/conda.sh"
conda activate pytorch_env
Then run:
# source /llm/huggingface/pytorch_env.sh
(pytorch_env) # pip install transformers datasets tokenizers accelerate peft safetensors
(pytorch_env) # pip install soundfile librosa Pillow huggingface_hub python-dotenv bitsandbytes
(pytorch_env) # pip install emoji opencc-python-reimplemented evaluate scikit-learn
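Before writing the full check script below, the mirror and cache layout can be verified on their own. A minimal sketch, assuming pytorch_env.sh has been sourced; sshleifer/tiny-gpt2 is just a small public repo used as an example:

# download_chk.py - verify the HF_ENDPOINT mirror and HF_HOME cache layout
import os
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="sshleifer/tiny-gpt2")  # example repo
print("cached under:", path)    # should land inside $HF_HOME/hub
assert path.startswith(os.environ["HF_HOME"])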
Write a test file, huggingface_chk.py:
# huggingface_chk.py
# Core libraries
import os, sys, yaml, datetime
import evaluate          # metric computation
import pandas as pd      # data wrangling
import numpy as np       # numerical computing
import torch             # PyTorch deep-learning framework
import transformers      # Hugging Face model library
import re, emoji
from typing import Optional, Union
# PyTorch components
from torch import nn
from torch.amp import GradScaler, autocast  # mixed-precision training
from torch.utils.data import Dataset as TorchDataset, DataLoader  # data loading
from torch.optim import AdamW  # optimizer
# Dataset handling
from datasets import Dataset  # Hugging Face dataset format
# Simplified/Traditional Chinese conversion
from opencc import OpenCC
# Model quantization
from bitsandbytes.nn import Int8Params
# Parameter-efficient fine-tuning (PEFT)
from peft import get_peft_model, LoraConfig
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# Transformers components
from transformers import (
    AutoTokenizer,                      # automatic tokenizer
    AutoModelForSequenceClassification, # sequence-classification model
    AutoModelForSeq2SeqLM,
    TrainingArguments,                  # training configuration
    BitsAndBytesConfig,                 # quantization configuration
    Trainer,                            # trainer
    DataCollatorWithPadding,            # pad/collate batches
    EvalPrediction,                     # evaluation prediction object
    pipeline,                           # inference pipeline
)
#########################################################
print("transformers version:", transformers.__version__)
print("torch version:", torch.__version__)
# Check CUDA availability
if torch.cuda.is_available():
    print("cuda version=", torch.version.cuda)
    print("torch using GPU:", torch.cuda.get_device_name(0))
else:
    raise RuntimeError("CUDA is not available; check the installation!")
# Environment path configuration
print("HF_ENDPOINT=", os.environ["HF_ENDPOINT"])
# HF home directory; the download cache lands in $HF_HOME/hub automatically
print("HF_HOME=", os.environ["HF_HOME"])
# local directory where models are saved
print("HF_MODELS=", os.environ["HF_MODELS"])
print("ANACONDA3_HOME=", os.environ["ANACONDA3_HOME"])
print("HF_HUB_DISABLE_SYMLINKS_WARNING=", os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"])
print("Check huggingface success.")
Run the environment test (every Python program from now on is executed following this same pattern):
# source /llm/huggingface/pytorch_env.sh
(pytorch_env) # python ./huggingface_chk.py
transformers version: 4.51.3
torch version: 2.6.0+cu124
cuda version= 12.4
torch using GPU: NVIDIA GeForce RTX 3090
HF_ENDPOINT= https://hf-mirror.com
HF_HOME= /llm/huggingface
HF_MODELS= /llm/huggingface/models
ANACONDA3_HOME= /llm/huggingface/anaconda3
HF_HUB_DISABLE_SYMLINKS_WARNING= 1
Check huggingface success.
Done. Enjoy your LLM development journey!
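As a first end-to-end run on the fresh environment, a minimal inference sketch using the transformers pipeline API; distilbert-base-uncased-finetuned-sst-2-english is only an example model, swap in whatever you actually need:

# pipeline_chk.py - first end-to-end inference on the GPU
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example model
    device=0,  # first CUDA GPU
)
print(clf("PyTorch with CUDA is finally working on this machine!"))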