How Kubernetes Uses an InfiniBand RDMA Network to Accelerate Distributed Training of Large Models
In the AI field, fine-tuning large models is an essential step in improving their performance. By combining the high-speed interconnect of an InfiniBand RDMA network, cloud-native containerized deployment on Kubernetes, and an intelligent compute-scheduling platform, you can build an efficient, flexible, and scalable distributed training environment. This article explores how these technologies work together to streamline large-model fine-tuning and raise overall compute efficiency.
Hardware Environment
Type | Model | Configuration | Quantity |
--- | --- | --- | --- |
Server | NVIDIA H100 SXM | CPU: 192c, RAM: 2T, NIC: 4× 400Gb IB + 2× 200Gb IB | 10 |
Switch | NVIDIA QM9700 | 64-port NDR (400Gb/s), throughput: 51.2Tb/s, port type: 32 OSFP | 1 |
Driver Installation
1. Operating System
OS Version | Kernel Version |
--- | --- |
Ubuntu 22.04.5 LTS | Linux 5.15.0-130-generic |
2. NVIDIA Driver
2.1 Disable the Default nouveau Driver
~# cat /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
# generated by nvidia-installer
blacklist nouveau
options nouveau modeset=0
2.2 Install the NVIDIA Driver
~# wget https://cn.download.nvidia.com/tesla/560.35.03/NVIDIA-Linux-x86_64-560.35.03.run
~# ./NVIDIA-Linux-x86_64-560.35.03.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 560.35.03 ......
2.3 Disable GSP
~# echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
~# cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
~# sudo update-initramfs -u
# Reboot the machine to verify
~# reboot
Check whether GSP is disabled: the EnableGpuFirmware value should be 0.
~# grep EnableGpuFirmware /proc/driver/nvidia/params
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
2.4 Enable Persistence Mode for the GPU Driver
Running nvidia-smi -pm 1 enables persistence mode temporarily, but the setting is lost after a reboot, so running the nvidia-persistenced daemon is recommended instead. Persistence-M (Persistence Mode) is a user-settable driver attribute: when it is enabled, the NVIDIA driver stays loaded even when no client is active, which minimizes the driver-load latency seen by dependent applications such as CUDA programs. Run nvidia-smi to check the current state; when the Persistence-M column shows On, persistence mode is enabled.
~# cat <<EOF > /lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
After=syslog.target
[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/*
TimeoutSec=300
[Install]
WantedBy=multi-user.target
EOF
~# systemctl start nvidia-persistenced && systemctl enable nvidia-persistenced
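After the daemon is enabled (and after the next reboot), it's worth confirming that every GPU actually reports persistence mode as on. A minimal sketch: the nvidia-smi query flags in the comment are standard, while the helper function and sample output below are illustrative.

```shell
# On the node, capture the per-GPU state with:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# and pass that output to this helper, which fails if any GPU is not persistent.
persistence_ok() {
    case "$1" in
        *Disabled*) return 1 ;;   # at least one GPU has persistence mode off
        *)          return 0 ;;
    esac
}

# Sample check against the expected output for two GPUs:
persistence_ok "0, Enabled
1, Enabled" && echo "all GPUs persistent"
```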
2.5 Install nvidia-fabricmanager
NVIDIA Fabric Manager is a management tool for multi-GPU interconnects. It manages and monitors the interconnect fabric and the data-transfer paths between GPUs, keeping inter-GPU communication in a multi-GPU system efficient, stable, and scalable. Its features include unified management of the multi-GPU topology, efficient data-transfer scheduling, monitoring and diagnostics, and automatic recovery and fault tolerance.
On servers with NVSwitch hardware this service is mandatory, because it is critical for NVLink communication. If you hit unexplained slowdowns or NVLink communication errors during training, check the nvidia-fabricmanager service first.
Download
https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/
Note: the nvidia-fabricmanager version must exactly match the NVIDIA driver version.
Installation
The NVIDIA driver installed above is 560.35.03, so download the matching package:
~# wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-560_560.35.03-1_amd64.deb
~# apt install ./nvidia-fabricmanager-560_560.35.03-1_amd64.deb
~# systemctl restart nvidia-fabricmanager.service
~# systemctl enable nvidia-fabricmanager.service
Check the service status
~# systemctl status nvidia-fabricmanager.service
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2025-03-11 16:56:16 CST; 4 days ago
Main PID: 3888 (nv-fabricmanage)
Tasks: 18 (limit: 9830)
Memory: 19.8M
CPU: 2min 51.930s
CGroup: /system.slice/nvidia-fabricmanager.service
└─3888 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Mar 11 16:56:12 h100-node-14-34 systemd[1]: Starting NVIDIA fabric manager service...
Mar 11 16:56:16 h100-node-14-34 nv-fabricmanager[3888]: Connected to 1 node.
Mar 11 16:56:16 h100-node-14-34 nv-fabricmanager[3888]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to>
Mar 11 16:56:16 h100-node-14-34 systemd[1]: Started NVIDIA fabric manager service.
Check the fabric state
~# nvidia-smi -q -i 0 |grep -A 2 Fabric
Fabric
State : Completed
Status : Success # the GPU has completed registration
2.6 Verify the NVIDIA Driver
3. IB Driver
3.1 Download the Driver
https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
3.2 Check for IB Controllers
~# lspci |grep Mell
18:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
4b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
5d:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
5d:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
9b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
cb:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
There are 4 InfiniBand controllers and 2 RoCE-capable Ethernet controllers, all ConnectX-7.
3.3 Install the IB Driver
# Install dependency packages
~# apt-get install pkg-config automake tk flex autotools-dev libnl-route-3-200 m4 quilt chrpath debhelper autoconf libltdl-dev libfuse2 libnl-route-3-dev gfortran bison libgfortran5 libnl-3-dev swig graphviz dkms --fix-missing
# Unpack
~# tar -zxvf MLNX_OFED_LINUX-24.10-0.7.0.0-ubuntu22.04-x86_64.tgz
# Run the installer (takes roughly 5-7 minutes)
./mlnxofedinstall --force
Logs dir: /tmp/MLNX_OFED_LINUX.4478.logs
General log file: /tmp/MLNX_OFED_LINUX.4478.logs/general.log
Below is the list of MLNX_OFED_LINUX packages that you have chosen
(some may have been added by the installer due to package dependencies):
ofed-scripts
mlnx-tools
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-dkms
iser-dkms
isert-dkms
srp-dkms
rdma-core
......
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
/etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading HCA driver and Access Layer: [ OK ]
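Before moving on, confirm the OFED stack is actually active. The two commands in the comments are standard MLNX_OFED tools; the tiny helper below (ours) just checks lsmod-style output, with a sample line standing in for the real thing.

```shell
# On the node:
#   ofed_info -s          # prints the installed MLNX_OFED version
#   lsmod | grep mlx5     # mlx5_core / mlx5_ib should be loaded
module_loaded() {
    # reads `lsmod` output on stdin; succeeds if the named module is present
    grep -q "^$1 "
}
printf 'mlx5_core 1970176 1 mlx5_ib\nmlx5_ib 397312 0\n' | module_loaded mlx5_ib && echo "mlx5_ib loaded"
```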
3.4 Check IB NIC Status
~# /etc/init.d/openibd restart
~# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:001a:43ea
base lid: 0x1
sm lid: 0x7
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:001a:6df2
base lid: 0x9
sm lid: 0x7
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand
Infiniband device 'mlx5_2' port 1 status:
default gid: fe80:0000:0000:0000:5e25:73ff:fe5b:9fd8
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_3' port 1 status:
default gid: fe80:0000:0000:0000:5e25:73ff:fe5b:9fd9
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_4' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:001a:825a
base lid: 0x10
sm lid: 0x7
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand
Infiniband device 'mlx5_5' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:001a:3022
base lid: 0x14
sm lid: 0x7
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand
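With six ports per node, eyeballing ibstatus output across ten machines gets tedious. A small counting sketch over saved ibstatus output; the grep pattern comes from the rate lines above, and the helper name is ours.

```shell
# Feed saved `ibstatus` output on stdin; counts the 400G NDR rate lines,
# which should equal the number of IB ports (4 per node in this setup).
ndr_ports() {
    grep -c '4X NDR'
}
printf 'rate: 400 Gb/sec (4X NDR)\nrate: 200 Gb/sec (4X HDR)\nrate: 400 Gb/sec (4X NDR)\n' | ndr_ports   # prints: 2
```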
3.5 View the GPU Topology (full NVLink mesh)
~# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX PIX SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE PIX 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS
NIC1 NODE NODE PIX NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS
NIC2 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE X PIX SYS SYS
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE PIX X SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE
NIC5 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
Kubernetes Environment Setup
1. Install the GPU Operator
Kubernetes Version | Containerd Version |
--- | --- |
v1.30.6 | 1.7.13 |
Github link:
https://github.com/NVIDIA/gpu-operator
~# helm install --wait --generate-name \
-n gpu-operator --timeout=10m --create-namespace \
nvidia/gpu-operator \
--version=v24.9.1 \
--set driver.enabled=false \
--set validator.repository=nvcr.mirrorify.net/nvidia/cloud-native \
--set operator.repository=nvcr.mirrorify.net/nvidia \
--set toolkit.repository=nvcr.mirrorify.net/nvidia/k8s \
--set devicePlugin.repository=nvcr.mirrorify.net/nvidia \
--set dcgmExporter.repository=nvcr.mirrorify.net/nvidia/k8s \
--set migManager.repository=nvcr.mirrorify.net/nvidia/cloud-native \
--set mig.strategy=mixed \
--set migManager.enabled=true
~# helm list -n gpu-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator-1733809653 gpu-operator 1 2024-12-10 05:47:34.959775898 +0000 UTC deployed gpu-operator-v24.9.1 v24.9.1
~# kubectl get pods -n gpu-operator
......
2. Install rdma-shared-dev-plugin
Github link:
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin
2.1 Find the RDMA NIC's Vendor ID and Device ID
~# lspci | grep Mellanox
18:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
4b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
5d:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
5d:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
9b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
cb:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
# The first column is the PCI address
~# lspci -n | grep 18:00.0
18:00.0 0207: 15b3:1021
~# lspci -n | grep 4b:00.0
4b:00.0 0207: 15b3:1021
# The vendor ID is 15b3 and the device ID is 1021
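These two hex fields are exactly what the device plugin's selectors match on below. A parsing sketch: the sed expression is ours, and the sample line is taken from the lspci -n output above.

```shell
# Pull "vendor device" out of one `lspci -n` line
# (format: "<slot> <class>: <vendor>:<device>").
pci_ids() {
    echo "$1" | sed -n 's/.* \([0-9a-f]\{4\}\):\([0-9a-f]\{4\}\).*/\1 \2/p'
}
pci_ids "18:00.0 0207: 15b3:1021"   # prints: 15b3 1021
```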
2.2 Create the ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  # ifNames must match the IB interface names on the host
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [{
        "resourceName": "ib",
        "resourcePrefix": "rdma",
        "rdmaHcaMax": 1024,
        "selectors": {
          "vendors": ["15b3"],
          "deviceIDs": ["1021"],
          "ifNames": ["ibs11", "ibs13", "ibs15", "ibs17"]
        }
      }]
    }
2.3 Create the DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      hostNetwork: true
      priorityClassName: system-node-critical
      containers:
      - image: ghcr.mirrorify.net/mellanox/k8s-rdma-shared-dev-plugin
        name: k8s-rdma-shared-dp-ds
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
          readOnly: false
        - name: plugins-registry
          mountPath: /var/lib/kubelet/plugins_registry
          readOnly: false
        - name: config
          mountPath: /k8s-rdma-shared-dev-plugin
        - name: devs
          mountPath: /dev/
      nodeSelector:
        nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: config
        configMap:
          name: rdma-devices
          items:
          - key: config.json
            path: config.json
      - name: devs
        hostPath:
          path: /dev/
2.4 Check the Device Registration
~# kubectl get pods -n kube-system|grep rdma
rdma-shared-dp-ds-2q5b4 1/1 Running 2 (3d23h ago) 81d
rdma-shared-dp-ds-45wxd 1/1 Running 6 (4d1h ago) 81d
......
~# kubectl describe node h100-node-14-34|grep -A 9 Allocatable
Allocatable:
cpu: 191600m
ephemeral-storage: 933505692Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2055425873742
nvidia.com/gpu: 8
pods: 110
rdma/ib: 128
2.5 Inspect the IB NICs from a Test Pod
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - image: hub.mirrorify.net/dtnaas/ofed:latest
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/ib: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 100000
~# kubectl logs -n sre-tools mofed-test-pod
/dev/infiniband:
total 0
crw------- 1 root root 231, 64 Mar 15 10:50 issm0
crw------- 1 root root 231, 65 Mar 15 10:50 issm1
crw------- 1 root root 231, 68 Mar 15 10:50 issm4
crw------- 1 root root 231, 69 Mar 15 10:50 issm5
crw-rw-rw- 1 root root 10, 121 Mar 15 10:50 rdma_cm
crw------- 1 root root 231, 0 Mar 15 10:50 umad0
crw------- 1 root root 231, 1 Mar 15 10:50 umad1
crw------- 1 root root 231, 4 Mar 15 10:50 umad4
crw------- 1 root root 231, 5 Mar 15 10:50 umad5
crw-rw-rw- 1 root root 231, 192 Mar 15 10:50 uverbs0
crw-rw-rw- 1 root root 231, 193 Mar 15 10:50 uverbs1
crw-rw-rw- 1 root root 231, 196 Mar 15 10:50 uverbs4
crw-rw-rw- 1 root root 231, 197 Mar 15 10:50 uverbs5
/sys/class/infiniband:
total 0
lrwxrwxrwx 1 root root 0 Mar 15 10:50 mlx5_0 -> ../../devices/pci0000:15/0000:15:01.0/0000:16:00.0/0000:17:00.0/0000:18:00.0/infiniband/mlx5_0
lrwxrwxrwx 1 root root 0 Mar 15 10:50 mlx5_1 -> ../../devices/pci0000:48/0000:48:01.0/0000:49:00.0/0000:4a:00.0/0000:4b:00.0/infiniband/mlx5_1
lrwxrwxrwx 1 root root 0 Mar 15 10:50 mlx5_2 -> ../../devices/pci0000:59/0000:59:01.0/0000:5a:00.0/0000:5b:01.0/0000:5d:00.0/infiniband/mlx5_2
lrwxrwxrwx 1 root root 0 Mar 15 10:50 mlx5_3 -> ../../devices/pci0000:59/0000:59:01.0/0000:5a:00.0/0000:5b:01.0/0000:5d:00.1/infiniband/mlx5_3
lrwxrwxrwx 1 root root 0 Mar 15 10:50 mlx5_4 -> ../../devices/pci0000:97/0000:97:01.0/0000:98:00.0/0000:99:01.0/0000:9b:00.0/infiniband/mlx5_4
lrwxrwxrwx 1 root root 0 Mar 15 10:50 mlx5_5 -> ../../devices/pci0000:c7/0000:c7:01.0/0000:c8:00.0/0000:c9:01.0/0000:cb:00.0/infiniband/mlx5_5
/sys/class/net:
total 0
-rw-r--r-- 1 root root 4096 Mar 15 10:50 bonding_masters
lrwxrwxrwx 1 root root 0 Mar 15 10:50 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx 1 root root 0 Mar 15 10:50 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx 1 root root 0 Mar 15 10:50 tunl0 -> ../../devices/virtual/net/tunl0
root@mofed-test-pod:/# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.43.1014
node_guid: a088:c203:001a:40e2
sys_image_guid: a088:c203:001a:40e2
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 29
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 28.43.1014
node_guid: a088:c203:001a:43e2
sys_image_guid: a088:c203:001a:43e2
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 30
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_4
transport: InfiniBand (0)
fw_ver: 28.43.1014
node_guid: a088:c203:001a:28d2
sys_image_guid: a088:c203:001a:28d2
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 24
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_5
transport: InfiniBand (0)
fw_ver: 28.43.1014
node_guid: a088:c203:001a:84fa
sys_image_guid: a088:c203:001a:84fa
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 40
port_lmc: 0x00
link_layer: InfiniBand
3. Install the Volcano Scheduling Engine
Github link:
https://github.com/volcano-sh/volcano
~# kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
~# kubectl get pods -n volcano-system
NAME READY STATUS RESTARTS AGE
volcano-admission-568d7dfff9-2nb8h 1/1 Running 0 103d
volcano-admission-init-7xhm2 0/1 Completed 0 103d
volcano-controllers-778cd4b694-pmxvx 1/1 Running 0 103d
volcano-dashboard-77949478d-xkh87 2/2 Running 0 50d
volcano-scheduler-55664c4fc5-mr6m7 1/1 Running 1 (103d ago) 103d
~# kubectl api-resources |grep -i volcano
jobs vcjob,vj batch.volcano.sh/v1alpha1 true Job
commands bus.volcano.sh/v1alpha1 true Command
jobflows jf flow.volcano.sh/v1alpha1 true JobFlow
jobtemplates jt flow.volcano.sh/v1alpha1 true JobTemplate
numatopologies numatopo nodeinfo.volcano.sh/v1alpha1 false Numatopology
podgroups pg,podgroup-v1beta1 scheduling.volcano.sh/v1beta1 true PodGroup
queues q,queue-v1beta1 scheduling.volcano.sh/v1beta1 false Queue
Create a test Job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
  - event: PodEvicted
    action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
  - name: "postproc"
    replicas: 1
    template:
      metadata:
        name: postproc
      spec:
        containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: postproc
          resources:
            requests:
              cpu: "1"
        restartPolicy: OnFailure
  - name: "agent"
    replicas: 1
    template:
      metadata:
        name: agent
      spec:
        containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: agent
          resources:
            requests:
              cpu: "1"
        restartPolicy: OnFailure
~# kubectl get pods -n sre-tools | grep job
test-job-agent-0 1/1 Running 0 80s
test-job-postproc-0 1/1 Running 0 80s
~# kubectl get jobs.batch.volcano.sh -n sre-tools
NAME STATUS MINAVAILABLE RUNNINGS AGE
test-job Running 1 2 2m31s
# The Events section shows the Pod was scheduled by volcano.
~# kubectl describe pod -n sre-tools test-job-postproc-0 | grep -A 7 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m10s volcano Successfully assigned sre-tools/test-job-postproc-0 to h100-node-14-41
Normal Pulling 4m9s kubelet Pulling image "nginx"
Normal Pulled 3m45s kubelet Successfully pulled image "nginx" in 24.579s (24.579s including waiting). Image size: 72195292 bytes.
Normal Created 3m45s kubelet Created container postproc
Normal Started 3m44s kubelet Started container postproc
4. IB Network Bandwidth Tests
First, a summary of the final results:
Test | Result | Command |
--- | --- | --- |
IB RDMA read bandwidth | BW peak [354.93 Gb/s] | ib_read_bw |
IB RDMA write bandwidth | BW peak [375.31 Gb/s] | ib_write_bw |
IB RDMA read latency | t_avg [196.02 usec] | ib_read_lat |
IB RDMA write latency | t_avg [184.13 usec] | ib_write_lat |
Test environment setup
# Pod 1
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - image: hub.mirrorify.net/dtnaas/ofed:latest
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      requests:
        rdma/ib: 4
        nvidia.com/gpu: 8
      limits:
        rdma/ib: 4
        nvidia.com/gpu: 8
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000
# Pod 5
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod5
spec:
  restartPolicy: OnFailure
  containers:
  - image: hub.mirrorify.net/dtnaas/ofed:latest
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      requests:
        rdma/ib: 4
        nvidia.com/gpu: 8
      limits:
        rdma/ib: 4
        nvidia.com/gpu: 8
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000
# Scheduled onto the two corresponding H100 hosts.
~# kubectl get pods -n sre-tools | grep mofed
mofed-test-pod 1/1 Running 0 8m40s
mofed-test-pod5 1/1 Running 0 7m39s
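Every perftest client run below needs the server pod's IP (10.233.109.18 in this environment). One way to fetch it is the jsonpath command in the comment (a standard kubectl feature); the awk helper below is an illustrative alternative that parses `kubectl get pods -o wide` output, where the IP is column 6.

```shell
# Direct lookup (run against the cluster):
#   kubectl get pod mofed-test-pod -n sre-tools -o jsonpath='{.status.podIP}'
# Or extract it from `kubectl get pods -o wide` output on stdin:
pod_ip() {
    awk -v pod="$1" '$1 == pod { print $6 }'
}
printf '%s\n' 'mofed-test-pod 1/1 Running 0 8m40s 10.233.109.18 h100-node-14-34 <none> <none>' \
  | pod_ip mofed-test-pod   # prints: 10.233.109.18
```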
4.1 Test RDMA read bandwidth: ib_read_bw (BW peak [Gb/sec])
Server side: run inside mofed-test-pod
root@mofed-test-pod:/# ib_read_bw -a -d mlx5_0
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x15 QPN 0x0046 PSN 0x98dac6 OUT 0x10 RKey 0x1fffbd VAddr 0x007f591dd4e000
remote address: LID 0x1c QPN 0x0046 PSN 0xbfe6f7 OUT 0x10 RKey 0x1fffbd VAddr 0x007f8d557d8000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
8388608 1000 354.93 354.93 0.005289
---------------------------------------------------------------------------------------
Client side: run inside mofed-test-pod5
root@mofed-test-pod5:/# ib_read_bw -d mlx5_0 --all --report_gbits 10.233.109.18
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x1c QPN 0x0046 PSN 0xbfe6f7 OUT 0x10 RKey 0x1fffbd VAddr 0x007f8d557d8000
remote address: LID 0x15 QPN 0x0046 PSN 0x98dac6 OUT 0x10 RKey 0x1fffbd VAddr 0x007f591dd4e000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2100.000000 != 792.248000. CPU Frequency is not max.
(this warning repeats before every row below; subsequent repeats omitted)
2 1000 0.046027 0.045525 2.845305
4 1000 0.12 0.12 3.657695
8 1000 0.25 0.25 3.937377
16 1000 0.51 0.51 3.950131
32 1000 1.01 1.01 3.939948
64 1000 2.02 2.02 3.936831
128 1000 3.94 3.93 3.841496
256 1000 7.79 7.78 3.799502
512 1000 16.35 16.34 3.988604
1024 1000 31.22 31.18 3.806334
2048 1000 56.31 56.26 3.433909
4096 1000 98.87 98.76 3.014045
8192 1000 169.28 169.22 2.582022
16384 1000 211.73 211.71 1.615208
32768 1000 258.33 258.31 0.985359
65536 1000 203.78 203.78 0.388671
131072 1000 319.87 319.87 0.305048
262144 1000 357.01 356.95 0.170206
524288 1000 357.16 357.15 0.085152
1048576 1000 355.23 355.22 0.042346
2097152 1000 354.87 354.87 0.021152
4194304 1000 354.91 354.90 0.010577
8388608 1000 354.93 354.93 0.005289
---------------------------------------------------------------------------------------
4.2 Test RDMA write bandwidth: ib_write_bw (BW peak [Gb/sec])
Server side: run inside mofed-test-pod
root@mofed-test-pod:/# ib_write_bw -a -d mlx5_0
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x15 QPN 0x0047 PSN 0x98e97e RKey 0x1fff00 VAddr 0x007fe0a1739000
remote address: LID 0x1c QPN 0x0047 PSN 0x27c6f6 RKey 0x1fff00 VAddr 0x007f035ce39000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
8388608 5000 375.31 375.30 0.005592
---------------------------------------------------------------------------------------
Client side: run inside mofed-test-pod5
root@mofed-test-pod5:/# ib_write_bw -d mlx5_0 --all --report_gbits 10.233.109.18
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x1c QPN 0x0047 PSN 0x27c6f6 RKey 0x1fff00 VAddr 0x007f035ce39000
remote address: LID 0x15 QPN 0x0047 PSN 0x98e97e RKey 0x1fff00 VAddr 0x007fe0a1739000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2100.000000 != 2015.956000. CPU Frequency is not max.
(this warning repeats before every row below; subsequent repeats omitted)
2 5000 0.061538 0.061258 3.828649
4 5000 0.13 0.13 4.035661
8 5000 0.25 0.25 3.926875
16 5000 0.57 0.57 4.447351
32 5000 1.01 1.00 3.921481
64 5000 2.29 2.29 4.463603
128 5000 4.56 4.55 4.442628
256 5000 9.07 9.06 4.422162
512 5000 18.22 18.18 4.439645
1024 5000 36.29 36.20 4.418731
2048 5000 72.59 72.34 4.415126
4096 5000 146.10 145.88 4.451952
8192 5000 219.50 219.47 3.348810
16384 5000 317.48 316.54 2.414980
32768 5000 350.42 350.27 1.336162
65536 5000 368.85 368.79 0.703409
131072 5000 373.22 373.20 0.355915
262144 5000 374.71 374.69 0.178665
524288 5000 375.67 375.66 0.089564
1048576 5000 375.71 375.71 0.044788
2097152 5000 375.42 375.42 0.022377
4194304 5000 375.21 375.21 0.011182
8388608 5000 375.31 375.30 0.005592
---------------------------------------------------------------------------------------
DeepSpeed Distributed Training in Practice
This walkthrough uses an open-source project for generating AI images and videos. Github:
https://github.com/aigc-apps/CogVideoX-Fun
Preparation:
1. Download the model
2. Prepare the dataset
3. Write the distributed-training manifest (YAML)
4. Write the job launch script
Distributed training manifest
# Prerequisite: Kubeflow must be installed; the manifest below relies on PyTorchJob
---
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pt-cyw-train
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          schedulerName: volcano
          containers:
          - args:
            - |-
              export NCCL_DEBUG=INFO
              export NCCL_IB_GID_INDEX=5
              export NCCL_IB_TC=106
              export NCCL_CROSS_NIC=0
              export NCCL_ALGO=TREE
              export NCCL_SOCKET_IFNAME=eth0
              export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
              cd /data/CogVideoX-Fun
              bash scripts/train-h.sh
            command:
            - bash
            - -c
            env:
            - name: NCCL_IB_TIMEOUT
              value: "22"
            - name: NCCL_IB_RETRY_CNT
              value: "13"
            - name: NCCL_IB_AR_THRESHOLD
              value: "0"
            image: harbor.xxx-sh.com/test/torch_cuda:cogvideox_fun
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources:
              limits:
                cpu: "128"
                memory: 512Gi
                nvidia.com/gpu: "8"
                rdma/ib: 1
              requests:
                cpu: "128"
                memory: 512Gi
                nvidia.com/gpu: "8"
                rdma/ib: 1
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
                - SYS_PTRACE
            volumeMounts:
            - mountPath: /data
              name: volume-0
            - mountPath: /dev/shm
              name: shm-data
          volumes:
          - hostPath:
              path: /file_CPU_01/cyw_data/cogvideox
              type: ""
            name: volume-0
          - emptyDir:
              medium: Memory
              sizeLimit: 512Gi
            name: shm-data
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          schedulerName: volcano
          containers:
          - args:
            - |-
              export NCCL_DEBUG=INFO
              export NCCL_IB_GID_INDEX=5
              export NCCL_IB_TC=106
              export NCCL_CROSS_NIC=0
              export NCCL_ALGO=TREE
              export NCCL_SOCKET_IFNAME=eth0
              export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
              cd /data/CogVideoX-Fun
              bash scripts/train-h.sh
            command:
            - bash
            - -c
            image: harbor.xxx-sh.com/test/torch_cuda:cogvideox_fun
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources:
              limits:
                cpu: "128"
                memory: 512Gi
                nvidia.com/gpu: "8"
                rdma/ib: 1
              requests:
                cpu: "128"
                memory: 512Gi
                nvidia.com/gpu: "8"
                rdma/ib: 1
            volumeMounts:
            - mountPath: /data
              name: volume-0
            - mountPath: /dev/shm
              name: shm-data
          volumes:
          - hostPath:
              path: /file_CPU_01/cyw_data/cogvideox
              type: ""
            name: volume-0
          - emptyDir:
              medium: Memory
              sizeLimit: 256Gi
            name: shm-data
# Check the PyTorchJob
~# kubectl get pytorchjobs.kubeflow.org -n sre-tools
NAME STATE AGE
pt-cyw-train Running 52s
~# kubectl get pods -n sre-tools | grep train
pt-cyw-train-master-0 1/1 Running 0 83s
pt-cyw-train-worker-0 1/1 Running 0 83s
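With NCCL_DEBUG=INFO set in the manifest, the master log should confirm that NCCL selected the IB HCAs rather than falling back to plain sockets. A sketch of that check: the kubectl command in the comment uses the pod and namespace names from this article, and the sample log line follows NCCL's usual NET/IB format (an assumption, not captured from this cluster).

```shell
# Against the cluster:
#   kubectl logs pt-cyw-train-master-0 -n sre-tools | grep 'NCCL INFO NET'
nccl_uses_ib() {
    grep -q 'NET/IB'
}
printf 'pt-cyw-train-master-0:23:23 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB\n' \
  | nccl_uses_ib && echo "NCCL is using InfiniBand"
```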
Job launch script
~# cat train-h.sh
export MODEL_NAME="models/Diffusion_Transformer/CogVideoX-Fun-5b-InP"
export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/json_of_internal_datasets.json"
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_DEBUG=INFO
accelerate launch --use_deepspeed --deepspeed_config_file config/zero_stage2_config.json --deepspeed_multinode_launcher standard scripts/train.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATASET_NAME \
--train_data_meta=$DATASET_META_NAME \
--image_sample_size=1024 \
--video_sample_size=256 \
--token_sample_size=512 \
--video_sample_stride=3 \
--video_sample_n_frames=49 \
--train_batch_size=4 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--dataloader_num_workers=8 \
--num_train_epochs=100 \
--checkpointing_steps=500 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--output_dir="output_dir" \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--vae_mini_batch=1 \
--max_grad_norm=0.05 \
--random_hw_adapt \
--training_with_video_token_length \
--random_frame_crop \
--enable_bucket \
--use_came \
--use_deepspeed \
--train_mode="inpaint" \
--resume_from_checkpoint="latest" \
--trainable_modules "."
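For sanity, the effective global batch size implied by these flags, assuming train_batch_size is per device (as is typical for accelerate-based scripts): per-GPU batch 4 × 16 GPUs (2 nodes × 8) × gradient_accumulation_steps 1.

```shell
# global batch = per-GPU batch * total GPUs * gradient accumulation steps
global_batch() {
    echo $(( $1 * $2 * $3 ))
}
global_batch 4 16 1   # prints: 64
```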
View the training job logs
View the job's resource-utilization monitoring
GPU memory usage and utilization
Appendix: Common IB Ops Commands
Command | Description |
--- | --- |
lspci \| grep Mell | Check whether the machine has IB cards (search for the vendor name Mellanox). |
ibstatus | Show IB card information: link state, port rate, port GUID, etc. |
ibstat | Similar in function to ibstatus. |
ofed_info -s | Show the installed driver version. |
ibv_devinfo | Show information about the IB devices on the current node. |
ibqueryerrors -C mlx4_0 -P 1 | Check per-port error statistics across the IB network. |
perfquery | Show IB port packet drops and symbol errors. |
ibv_devices | List the IB devices on the current node. |
ibdump | Capture IB-layer packets; provided by Mellanox. |
ethtool --set-priv-flags eth-s0 sniffer on | Enable sniffer mode with ethtool so packets can be captured with tcpdump. |
ib_atomic_bw | Measure RDMA atomic-transaction bandwidth between a pair of machines (one server, one client); bandwidth is computed from CPU-sampled message-completion times. Supports bidirectional tests and lets you vary MTU, TX depth, iteration count, message size, and more; see the "-a" option. |
ib_atomic_lat | Measure RDMA atomic-transaction latency for a given message size: the client issues RDMA atomic operations to the server and samples the CPU clock on completion to compute latency. |
ib_read_bw | Measure RDMA read bandwidth between a pair of machines. |
ib_read_lat | Measure RDMA read latency for a given message size. |
ib_send_bw -d mlx5_1 | Measure RDMA send bandwidth between a pair of machines. |
ib_send_lat | Measure RDMA send latency for a given message size. |
ib_write_bw | Measure RDMA write bandwidth between a pair of machines. |
ib_write_lat | Measure RDMA write latency for a given message size. |
raw_ethernet_bw | Measure send bandwidth between a pair of machines. |
raw_ethernet_lat | Measure send latency for a given message size. |
rping | Check whether an RDMA CM connection works. |