https://zhuanlan.zhihu.com/p/632912924
The CUDA Toolkit needs to be installed:
https://developer.nvidia.com/cuda-toolkit-archive
Configure the environment variables (for a local install):
export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
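A minimal sketch of persisting these variables and then verifying the toolkit (paths match the CUDA 12.4 install above; adjust to your version):
echo 'export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # should report the installed release, e.g. 12.4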
source <(kubectl completion bash)
export LANGUAGE="en_US.UTF-8"
export LANG=en_US.UTF-8
export LC_ALL=C
NVIDIA Dockerfile:
FROM nvcr.io/nvidia/pytorch:24.06-py3
RUN pip install vllm openai sse_starlette -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install peft transformers datasets accelerate deepspeed tensorboard \
fire packaging ninja openai gradio -i https://pypi.tuna.tsinghua.edu.cn/simple
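A hedged build-and-check example for this Dockerfile (the image tag vllm-server is only an illustration; --gpus all requires the NVIDIA container runtime on the host):
docker build -t vllm-server:latest .
docker run --rm --gpus all vllm-server:latest python -c 'import torch; print(torch.cuda.is_available())'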
To fix nvidia-smi responding very slowly, install fabric-manager:
version=535.54.03
yum -y install yum-utils nvidia-docker2
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
yum install -y nvidia-fabric-manager-${version}-1 nvidia-fabric-manager-devel-${version}-1
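After the packages are installed, the fabric manager service normally has to be enabled and started (unit name per NVIDIA's documentation; verify on your system):
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager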
Install CUDA
Install the NVIDIA driver and nvidia-docker2
CUDA: https://developer.nvidia.com/cuda-toolkit-archive
NVIDIA driver: https://download.nvidia.com/
cuda-rhel7.repo
cat cuda-rhel7.repo
[cuda-rhel7-x86_64]
name=cuda-rhel7-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/D42D0685.pub
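A hedged sketch of using this repo file; the package names below are assumptions, pick the ones matching your target driver/toolkit versions:
cp cuda-rhel7.repo /etc/yum.repos.d/
yum clean all && yum makecache
yum install -y nvidia-driver-latest-dkms cuda-toolkit-12-4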
Yizhuang:
FROM nvcr.io/nvidia/pytorch:23.10-py3
RUN pip install --upgrade pip && \
pip install --no-cache-dir vllm==0.4.3 openai sse_starlette spacy torch typer torch-tensorrt torchdata torchtext torchvision weasel --upgrade --upgrade-strategy=only-if-needed -i https://pypi.tuna.tsinghua.edu.cn/simple
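A hedged usage sketch: vLLM 0.4.x ships an OpenAI-compatible API server entrypoint, so this image can serve a model directly (the image name and model path below are assumptions):
docker run --rm --gpus all -p 8000:8000 -v /data/models:/models yizhuang-vllm:latest \
  python -m vllm.entrypoints.openai.api_server --model /models/your-model --port 8000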
nvidia-docker sometimes cannot be pulled directly, so download the rpms first:
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install --downloadonly nvidia-docker2 --downloaddir=/tmp/nvidia
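$distribution has to be set before the curl above; the usual definition (from the nvidia-docker documentation), plus installing the cached rpms later on an offline machine:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
yum localinstall -y /tmp/nvidia/*.rpm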
nvidia-fabric-manager speeds up calls to the NVIDIA GPUs (the slow nvidia-smi issue above):
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
yum install -y nvidia-fabric-manager-${version}-1
yum install -y nvidia-fabric-manager-devel-${version}-1
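The fabric manager version must match the installed driver version (535.54.03 above); a quick check:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
rpm -q nvidia-fabric-manager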
https://developer.aliyun.com/mirror/centos?spm=a2c6h.13651102.0.0.3e221b116j42Ya
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
yum -y install epel-release
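Refresh the yum metadata after switching to the Aliyun mirror:
yum clean all && yum makecache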
[base]
baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
[update]
baseurl=http://mirror.centos.org/centos/$releasever/updates/$basearch/
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
import torch
# Check whether CUDA is available
torch.cuda.is_available()
# Get the CUDA version PyTorch was built with
cuda_version = torch.version.cuda
print(f"CUDA version: {cuda_version}")
limits:
  cpu: "16"
  memory: 50Gi
  tencent.com/vcuda-core: "800"
  tencent.com/vcuda-memory: "32"
requests:
  cpu: "16"
  memory: 50Gi
  tencent.com/vcuda-core: "800"
  tencent.com/vcuda-memory: "32"
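Assuming the tencent.com/vcuda-* resources above are provided by Tencent's gpu-manager device plugin, a quick hedged check that the GPU node actually advertises them (the node name is a placeholder):
kubectl describe node <gpu-node-name> | grep vcuda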
Pull an image for a specific architecture:
docker pull --platform linux/arm64
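For example (the image name is only an illustration):
docker pull --platform linux/arm64 ubuntu:22.04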