When running multi-node, multi-GPU training on K8s, especially with DeepSpeed, inter-container communication problems can usually be diagnosed and resolved as follows:
✅ 1. Verify the network configuration
- K8s CNI plugin: check which CNI plugin your K3s cluster uses (e.g. Flannel, Calico) and make sure it supports cross-node Pod communication:

  ```
  kubectl get pods -A -o wide
  kubectl get nodes -o wide
  ```
- Pod-to-Pod connectivity: from one Pod, use `ping` or `curl` to test another Pod's IP address:

  ```
  kubectl exec -it <pod-name> -- ping <target-pod-ip>
  ```
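Minimal container images often ship without `ping` or `curl`. As a hedged alternative, TCP reachability can be probed from Python's standard library; the function name and the example IP/port below are placeholders, not part of any cluster API:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder values): probe another Pod on the DeepSpeed master port.
# can_connect("10.42.1.23", 29500)
```

This checks the exact thing DeepSpeed needs (a TCP connection to the master port), whereas `ping` only verifies ICMP, which some CNIs or network policies block independently of TCP.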
✅ 2. Configure DeepSpeed communication
DeepSpeed communicates between nodes over TCP, so the required ports must be reachable:
- Default master port: `29500` (launchers may use nearby ports, e.g. `29500-29525`, depending on the number of ranks; NCCL additionally opens dynamic ports for its own traffic)
- Make sure `--master_addr` and `--master_port` are set correctly on the DeepSpeed launcher, and that `deepspeed.init_distributed()` points at the master node:

  ```python
  deepspeed.init_distributed(
      dist_backend='nccl',
      init_method='tcp://192.168.2.186:29500'
  )
  ```
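As a minimal sketch of how the init URL could be assembled, assuming the launcher exports the conventional `MASTER_ADDR`/`MASTER_PORT` environment variables (the helper name `tcp_init_method` is hypothetical, not a DeepSpeed API):

```python
import os

def tcp_init_method(default_addr: str = "192.168.2.186", default_port: int = 29500) -> str:
    """Build a tcp:// init_method URL from MASTER_ADDR/MASTER_PORT, with fallbacks."""
    addr = os.environ.get("MASTER_ADDR", default_addr)
    port = int(os.environ.get("MASTER_PORT", default_port))
    if not (0 < port < 65536):
        raise ValueError(f"invalid master port: {port}")
    return f"tcp://{addr}:{port}"

# Usage sketch: deepspeed.init_distributed(dist_backend='nccl', init_method=tcp_init_method())
```

Reading the address from the environment keeps the training script identical on every node; only the launcher (or the Pod spec) carries the master's address.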
An error occurs while installing the CNI plugin:
(base) root@ainode3:/home/xie/distribute_test/DistilBERT/k8s# docker images | grep flannel
ghcr.io/flannel-io/flannel v0.26.4 1421c6bce6d6 6 weeks ago 76.9MB
ghcr.io/flannel-io/flannel-cni-plugin v1.6.2-flannel1 55ce2385d9d8 6 weeks ago 10.7MB
(base) root@ainode3:/home/xie/distribute_test/DistilBERT/k8s#
(base) root@ainode3:/home/xie/distribute_test/DistilBERT/k8s# kubectl logs -n kube-flannel kube-flannel-ds-4k5dk
I0321 01:27:37.371710 1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0321 01:27:37.371806 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0321 01:27:37.382014 1 kube.go:139] Waiting 10m0s for node controller to sync
I0321 01:27:37.382058 1 kube.go:469] Starting kube subnet manager
I0321 01:27:37.389941 1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0321 01:27:37.390035 1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.0.0/24]
I0321 01:27:37.390056 1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.1.0/24]
I0321 01:27:38.382202 1 kube.go:146] Node controller sync successful
I0321 01:27:38.382239 1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - ainode3
I0321 01:27:38.382254 1 main.go:234] Installing signal handlers
I0321 01:27:38.382448 1 main.go:468] Found network config - Backend type: vxlan
I0321 01:27:38.391329 1 kube.go:669] List of node(ainode3) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"9a:8c:24:c7:05:a2\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"192.168.2.186", "hami.io/node-handshake":"Requesting_2025.03.21 01:27:11", "hami.io/node-handshake-dcu":"Deleted_2025.02.02 10:29:26", "hami.io/node-nvidia-register":"GPU-92d1d617-b093-9383-4070-53e5e6c5afc5,10,15360,100,NVIDIA-Tesla T4,0,true:GPU-ef72722c-8952-72a1-c2be-d4d5b077bbd9,10,15360,100,NVIDIA-Tesla T4,0,true:", "k3s.io/node-args":"[\"server\",\"--docker\"]", "k3s.io/node-config-hash":"6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====", "k3s.io/node-env":"{\"K3S_DATA_DIR\":\"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa\"}", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0321 01:27:38.391440 1 match.go:211] Determining IP address of default interface
I0321 01:27:38.393378 1 match.go:264] Using interface with name ens2f0 and address 192.168.2.186
I0321 01:27:38.393433 1 match.go:286] Defaulting external address to interface address (192.168.2.186)
I0321 01:27:38.393531 1 vxlan.go:141] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0321 01:27:38.398409 1 kube.go:636] List of node(ainode3) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"9a:8c:24:c7:05:a2\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"192.168.2.186", "hami.io/node-handshake":"Requesting_2025.03.21 01:27:11", "hami.io/node-handshake-dcu":"Deleted_2025.02.02 10:29:26", "hami.io/node-nvidia-register":"GPU-92d1d617-b093-9383-4070-53e5e6c5afc5,10,15360,100,NVIDIA-Tesla T4,0,true:GPU-ef72722c-8952-72a1-c2be-d4d5b077bbd9,10,15360,100,NVIDIA-Tesla T4,0,true:", "k3s.io/node-args":"[\"server\",\"--docker\"]", "k3s.io/node-config-hash":"6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====", "k3s.io/node-env":"{\"K3S_DATA_DIR\":\"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa\"}", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0321 01:27:38.398550 1 vxlan.go:155] Interface flannel.1 mac address set to: 9a:8c:24:c7:05:a2
E0321 01:27:38.399484 1 main.go:359] Error registering network: failed to acquire lease: subnet "10.244.0.0/16" specified in the flannel net config doesn't contain "10.42.0.0/24" PodCIDR of the "ainode3" node
I0321 01:27:38.399568 1 main.go:448] Stopping shutdownHandler...
The cause of your flannel failure is this line:

```
E0321 01:27:38.399484 1 main.go:359] Error registering network: failed to acquire lease: subnet "10.244.0.0/16" specified in the flannel net config doesn't contain "10.42.0.0/24" PodCIDR of the "ainode3" node
```

This error means flannel is configured with the network `10.244.0.0/16`, while K3s assigned this node the Pod CIDR `10.42.0.0/24` (from its default `10.42.0.0/16` cluster network), so flannel cannot acquire a subnet lease and cannot set up the Pod network.
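The containment check that fails here can be reproduced with Python's `ipaddress` module: flannel can only hand out a lease if the node's PodCIDR lies inside its configured `Network`.

```python
import ipaddress

flannel_network = ipaddress.ip_network("10.244.0.0/16")  # "Network" from net-conf.json
node_pod_cidr = ipaddress.ip_network("10.42.0.0/24")     # n.Spec.PodCIDRs from the log

# The mismatch that triggers "failed to acquire lease":
print(node_pod_cidr.subnet_of(flannel_network))  # False -> lease fails

# After changing "Network" to the K3s cluster CIDR:
print(node_pod_cidr.subnet_of(ipaddress.ip_network("10.42.0.0/16")))  # True
```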
Solution:

- Confirm the K3s Pod CIDR

  Check the K3s network with:

  ```
  kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
  ```

  If the output shows `10.42.x.0/24` subnets, your K3s is using its default `10.42.0.0/16` Pod network, while flannel is still using `10.244.0.0/16`.
- Update the flannel configuration

  flannel must use the K3s Pod CIDR instead of `10.244.0.0/16`. Edit flannel's ConfigMap:

  ```
  kubectl edit cm -n kube-flannel kube-flannel-cfg
  ```

  Find the `net-conf.json` section:

  ```json
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }
  ```

  Change `"Network": "10.244.0.0/16"` to `"10.42.0.0/16"` (matching the K3s Pod CIDR):

  ```json
  {
    "Network": "10.42.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }
  ```
- Restart flannel

  ```
  kubectl delete pod -n kube-flannel --all
  ```

  Wait for flannel to restart and check that it is running:

  ```
  kubectl get pods -n kube-flannel
  ```

With that, flannel should run correctly. Give it a try!
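As an optional sanity check after the restart: each node's flannel writes its lease to `/run/flannel/subnet.env` (the `subnetFile` path visible in the log above). A hedged sketch that parses that file's content and verifies the node subnet lies inside the cluster network (the helper name and sample values are illustrative):

```python
import ipaddress

def check_subnet_env(text: str) -> bool:
    """Parse flannel subnet.env content; verify FLANNEL_SUBNET is inside FLANNEL_NETWORK."""
    env = dict(line.split("=", 1) for line in text.strip().splitlines() if "=" in line)
    network = ipaddress.ip_network(env["FLANNEL_NETWORK"])
    # FLANNEL_SUBNET is written as a gateway address (host bits set), hence strict=False.
    subnet = ipaddress.ip_network(env["FLANNEL_SUBNET"], strict=False)
    return subnet.subnet_of(network)

sample = """FLANNEL_NETWORK=10.42.0.0/16
FLANNEL_SUBNET=10.42.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
"""
print(check_subnet_env(sample))  # True
```

If this returns False on any node, that node's lease still disagrees with the configured network and the ConfigMap fix has not taken effect there.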