Notes on fixing Pod-to-Pod communication problems in K3s

When running multi-node, multi-GPU training on K8s, especially with DeepSpeed, inter-container communication problems can usually be diagnosed and resolved as follows:

1. Confirm the network configuration

  • K8s CNI plugin: check which CNI plugin your K3s cluster uses (e.g. Flannel, Calico) and make sure it supports cross-node Pod communication.
    kubectl get pods -A -o wide
    kubectl get nodes -o wide
    
  • Pod-to-Pod connectivity: from one Pod, use ping or curl to test another Pod's IP address.
    kubectl exec -it <pod-name> -- ping <target-pod-ip>
    

2. Configure DeepSpeed communication

DeepSpeed uses TCP for inter-node communication, which usually requires opening specific ports:

  • Default port range: 29500-29525 (allocated dynamically based on the number of ranks)
  • When initializing DeepSpeed, make sure --master_addr and --master_port are set correctly:
    deepspeed.init_distributed(
        dist_backend='nccl',
        init_method='tcp://192.168.2.186:29500'
    )
    
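Before launching training, it can save time to verify that the rendezvous port is actually reachable from each worker. A minimal sketch using only the Python standard library (the address and port below are taken from the init_method example above; substitute your own):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try a TCP connection to host:port; True means the port is reachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: run from a worker pod before launching DeepSpeed:
#   print(port_open("192.168.2.186", 29500))   # address/port from init_method above
```

If this returns False from inside a worker Pod, the problem is the network path (CNI, firewall, or port not open), not DeepSpeed itself.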

Error when installing the CNI plugin:

(base) root@ainode3:/home/xie/distribute_test/DistilBERT/k8s# docker images | grep flannel
ghcr.io/flannel-io/flannel                                                              v0.26.4                                                 1421c6bce6d6   6 weeks ago     76.9MB
ghcr.io/flannel-io/flannel-cni-plugin                                                   v1.6.2-flannel1                                         55ce2385d9d8   6 weeks ago     10.7MB
(base) root@ainode3:/home/xie/distribute_test/DistilBERT/k8s# 
(base) root@ainode3:/home/xie/distribute_test/DistilBERT/k8s# kubectl logs -n kube-flannel kube-flannel-ds-4k5dk
I0321 01:27:37.371710       1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0321 01:27:37.371806       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0321 01:27:37.382014       1 kube.go:139] Waiting 10m0s for node controller to sync
I0321 01:27:37.382058       1 kube.go:469] Starting kube subnet manager
I0321 01:27:37.389941       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0321 01:27:37.390035       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.0.0/24]
I0321 01:27:37.390056       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.1.0/24]
I0321 01:27:38.382202       1 kube.go:146] Node controller sync successful
I0321 01:27:38.382239       1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - ainode3
I0321 01:27:38.382254       1 main.go:234] Installing signal handlers
I0321 01:27:38.382448       1 main.go:468] Found network config - Backend type: vxlan
I0321 01:27:38.391329       1 kube.go:669] List of node(ainode3) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"9a:8c:24:c7:05:a2\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"192.168.2.186", "hami.io/node-handshake":"Requesting_2025.03.21 01:27:11", "hami.io/node-handshake-dcu":"Deleted_2025.02.02 10:29:26", "hami.io/node-nvidia-register":"GPU-92d1d617-b093-9383-4070-53e5e6c5afc5,10,15360,100,NVIDIA-Tesla T4,0,true:GPU-ef72722c-8952-72a1-c2be-d4d5b077bbd9,10,15360,100,NVIDIA-Tesla T4,0,true:", "k3s.io/node-args":"[\"server\",\"--docker\"]", "k3s.io/node-config-hash":"6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====", "k3s.io/node-env":"{\"K3S_DATA_DIR\":\"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa\"}", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0321 01:27:38.391440       1 match.go:211] Determining IP address of default interface
I0321 01:27:38.393378       1 match.go:264] Using interface with name ens2f0 and address 192.168.2.186
I0321 01:27:38.393433       1 match.go:286] Defaulting external address to interface address (192.168.2.186)
I0321 01:27:38.393531       1 vxlan.go:141] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0321 01:27:38.398409       1 kube.go:636] List of node(ainode3) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"9a:8c:24:c7:05:a2\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"192.168.2.186", "hami.io/node-handshake":"Requesting_2025.03.21 01:27:11", "hami.io/node-handshake-dcu":"Deleted_2025.02.02 10:29:26", "hami.io/node-nvidia-register":"GPU-92d1d617-b093-9383-4070-53e5e6c5afc5,10,15360,100,NVIDIA-Tesla T4,0,true:GPU-ef72722c-8952-72a1-c2be-d4d5b077bbd9,10,15360,100,NVIDIA-Tesla T4,0,true:", "k3s.io/node-args":"[\"server\",\"--docker\"]", "k3s.io/node-config-hash":"6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====", "k3s.io/node-env":"{\"K3S_DATA_DIR\":\"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa\"}", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0321 01:27:38.398550       1 vxlan.go:155] Interface flannel.1 mac address set to: 9a:8c:24:c7:05:a2
E0321 01:27:38.399484       1 main.go:359] Error registering network: failed to acquire lease: subnet "10.244.0.0/16" specified in the flannel net config doesn't contain "10.42.0.0/24" PodCIDR of the "ainode3" node
I0321 01:27:38.399568       1 main.go:448] Stopping shutdownHandler...

The reason flannel is failing is here:

E0321 01:27:38.399484       1 main.go:359] Error registering network: failed to acquire lease: subnet "10.244.0.0/16" specified in the flannel net config doesn't contain "10.42.0.0/24" PodCIDR of the "ainode3" node

This error means flannel is configured with the CIDR 10.244.0.0/16, while your K3s is apparently using 10.42.0.0/24 as the node's Pod CIDR, so flannel cannot allocate the network correctly.
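The containment check flannel performs can be illustrated with Python's standard ipaddress module (just an illustration of the error condition, not flannel's actual code):

```python
import ipaddress

# Flannel's configured network vs. the node's PodCIDR, from the error message
flannel_net = ipaddress.ip_network("10.244.0.0/16")
node_cidr = ipaddress.ip_network("10.42.0.0/24")

# This containment test is what fails when flannel tries to acquire a lease:
print(node_cidr.subnet_of(flannel_net))                           # False
print(node_cidr.subnet_of(ipaddress.ip_network("10.42.0.0/16")))  # True
```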

Solution:

  1. Confirm the K3s Pod CIDR
    Run the following command to check your K3s network:

    kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
    

    If the output is 10.42.x.x/24, your K3s is using the default 10.42.0.0/16 as the Pod network, while flannel is still using 10.244.0.0/16.

  2. Modify the flannel configuration
    Make flannel use the K3s Pod CIDR instead of 10.244.0.0/16.

    Edit flannel's ConfigMap:

    kubectl edit cm -n kube-flannel kube-flannel-cfg
    

    Find the net-conf.json section:

    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
    

    "Network": "10.244.0.0/16" 改成 "10.42.0.0/16"(跟 K3s 的 Pod CIDR 一致):

    {
      "Network": "10.42.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
    
  3. Restart flannel

    kubectl delete pod -n kube-flannel --all
    

    Wait for flannel to restart, then check that it is running normally:

    kubectl get pods -n kube-flannel

    

With that, flannel should run correctly. Give it a try!
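Before applying the edited ConfigMap, you can sanity-check that the new "Network" value covers every node's PodCIDR. A sketch with the standard library (the PodCIDR list below is taken from the flannel log above; feed in your own from kubectl get nodes):

```python
import ipaddress
import json

def validate_flannel_net(net_conf_json: str, node_pod_cidrs: list) -> bool:
    """True if the flannel "Network" contains every node's PodCIDR."""
    network = ipaddress.ip_network(json.loads(net_conf_json)["Network"])
    return all(
        ipaddress.ip_network(cidr).subnet_of(network) for cidr in node_pod_cidrs
    )

net_conf = '{"Network": "10.42.0.0/16", "Backend": {"Type": "vxlan"}}'
pod_cidrs = ["10.42.0.0/24", "10.42.1.0/24", "10.42.2.0/24"]  # from the flannel log
print(validate_flannel_net(net_conf, pod_cidrs))  # True
```

If this prints False for your values, flannel will hit the same "failed to acquire lease" error after the restart.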
