k8s RoCE deployment: k8s-rdma-shared-dev-plugin + macvlan CNI


Preface

An introductory write-up for my own reference. I will keep extending it with more on the underlying principles.


1. Creating the k8s cluster

There are several ways to create a k8s cluster; you can follow the official documentation at https://kubernetes.io/docs/setup/production-environment/tools/
For a beginner (like me), I find it helps to understand the creation process in terms of the k8s architecture.
[Figure: Kubernetes cluster components (image source: https://www.redhat.com/en/topics/containers/kubernetes-architecture)]

You need to install the following (a short installation sketch follows the list):

  1. container runtime: the service that runs containers; must be installed and started on every node
  2. kubectl: the user CLI; the main tool for managing cluster resources, deploying containers, and debugging
  3. kubelet: the agent running on each node that makes sure pods and containers are started and running (requires swap to be disabled); must be installed and started on every node
  4. kubeadm: creates and manages the cluster
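
A minimal sketch of installing these on a RHEL/CentOS 7 node, assuming the containerd and Kubernetes yum repositories are already configured (package names, repos, and versions vary by distro and k8s release):

## kubelet refuses to start while swap is on: disable it now and across reboots
# swapoff -a
# sed -i '/ swap / s/^/#/' /etc/fstab
## container runtime (containerd in this sketch)
# yum install -y containerd.io
# systemctl enable --now containerd
## cluster tooling
# yum install -y kubeadm kubelet kubectl
# systemctl enable --now kubelet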

Initialize the control-plane (master) node:

# kubeadm init

This step usually produces plenty of errors; Google is the best place to look for fixes. On success, you will see output like this:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 10.7.157.30:6443 --token 2rg4l1.n0rhvdp0uvxdrxjv \
        --discovery-token-ca-cert-hash sha256:fd7d661ec35868d036761e844597807a3d076daf3c8b71de6e1b55ee01e66a32

You will now find that the following pods have been created. All are Running except the coredns pods, which stay Pending because no pod-network CNI has been deployed yet (section 2 below takes care of that):

# export KUBECONFIG=/etc/kubernetes/admin.conf
# kubectl get node -o wide
NAME          STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION           CONTAINER-RUNTIME
node1         Ready    control-plane   2m50s   v1.24.0   10.7.157.30   <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)   3.10.0-1062.el7.x86_64   containerd://1.6.4

# kubectl get pods --all-namespaces
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   coredns-6d4b75cb6d-752q4              0/1     Pending   0          35s
kube-system   coredns-6d4b75cb6d-7h2g5              0/1     Pending   0          35s
kube-system   etcd-node1                            1/1     Running   5          47s
kube-system   kube-apiserver-node1                  1/1     Running   4          48s
kube-system   kube-controller-manager-node1         1/1     Running   1          47s
kube-system   kube-proxy-px447                      1/1     Running   0          35s
kube-system   kube-scheduler-node1                  1/1     Running   4          48s

2. Enabling the primary network

First, the k8s network model.
Its core idea is that every pod gets its own unique IP. All containers in a pod share that IP, and pods can talk to each other directly.
The pod subnet is usually configured in kubeadm-config.yaml as a CIDR block, i.e. a range of addresses from which pod IPs are allocated:

#### in kubeadm-config.yaml ####
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.24.0
networking:
  podSubnet: 10.244.0.0/16
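
To make kubeadm pick up this pod subnet, pass the config file to init instead of running the bare `kubeadm init` shown earlier (the equivalent one-off flag is --pod-network-cidr):

# kubeadm init --config kubeadm-config.yaml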

Pod-to-pod communication is typically built from veth pairs plus a Linux Ethernet bridge:
[Figure: primary network — veth pairs attached to the cni0 bridge]

cni0 is essentially a Linux bridge; it sends ARP requests and parses ARP responses.
eno1 is the interface for node-to-node traffic; with IP forwarding enabled, it forwards received packets to cni0 according to the route table.
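
A few commands to see this plumbing on a node once pods are running (interface names as in the figure; output omitted):

# ip link show type bridge        ## cni0 appears after the first pod is scheduled
# ip route                        ## per-node pod subnets routed via cni0 / flannel
# sysctl net.ipv4.ip_forward      ## must be 1 so eno1 can forward to cni0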

To enable the k8s primary network, a primary-network CNI must be installed.

There are several options, e.g. flannel, Calico, and Weave Net. This example uses flannel, which needs to be told which network interface to use:

# yum install -y flannel
# vi /etc/sysconfig/flanneld  ## add additional options:
FLANNEL_OPTIONS="-iface=eno1"
# cp /usr/bin/flanneld /opt/bin
# kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created

At this point the coredns pods transition to Running.
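
A quick way to confirm (kubeadm-deployed coredns carries the k8s-app=kube-dns label):

# kubectl -n kube-system get pods -l k8s-app=kube-dns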

3. Enabling the secondary network

The primary network handles basic pod-to-pod communication. On top of it, pods are usually given a secondary network that serves as a high-performance network for applications:
[Figure: multi-networking — each pod with a primary and a secondary interface]
The following need to be deployed:

  1. k8s-rdma-shared-dev-plugin
  2. Multus CNI
  3. Secondary CNI
  4. Multi-Network CRD

Among these, Multus CNI can be seen as a meta plugin: it works together with other CNI plugins to give pods multiple network interfaces:
[Figure: Multus CNI as a meta plugin]

k8s-rdma-shared-dev-plugin

Create the ConfigMap

# cat k8s-rdma-shared-dev-plugin-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
        "periodicUpdateInterval": 300,
        "configList": [{
             "resourceName": "cx5_bond_shared_devices_a",
             "rdmaHcaMax": 1000,
             "selectors": {
               "vendors": ["15b3"],
               "deviceIDs": ["1017"]
             }
           },
           {
             "resourceName": "cx6dx_shared_devices_b",
             "rdmaHcaMax": 500,
             "selectors": {
               "vendors": ["15b3"],
               "deviceIDs": ["101d"]
             }
           }
        ]
    }


# kubectl create -f k8s-rdma-shared-dev-plugin-config-map.yaml
configmap/rdma-devices created
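
The vendors/deviceIDs selectors above are PCI IDs: 15b3 is the Mellanox vendor ID, 1017 is ConnectX-5, and 101d is ConnectX-6 Dx. They can be confirmed on the node (assuming pciutils is installed):

# lspci -nn | grep -i mellanox    ## the IDs appear in the form [15b3:1017]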

Create the k8s-rdma-shared-dev-plugin DaemonSet

# kubectl create -f https://raw.githubusercontent.com/Mellanox/k8s-rdma-shared-dev-plugin/master/images/k8s-rdma-shared-dev-plugin-ds.yaml
daemonset.apps/rdma-shared-dp-ds created

If the raw GitHub URL for k8s-rdma-shared-dev-plugin-ds.yaml above is unreachable, you can do the following instead:

# git clone https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git
# cd k8s-rdma-shared-dev-plugin/
# kubectl create -f deployment/k8s/base/daemonset.yaml
daemonset.apps/rdma-shared-dp-ds created
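
Once the plugin pod is running, each node should advertise the shared RDMA resources in its capacity/allocatable; a quick check against the node from earlier:

# kubectl describe node node1 | grep rdma/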

Multus CNI

# kubectl create -f https://raw.githubusercontent.com/intel/multus-cni/master/images/multus-daemonset.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
daemonset.apps/kube-multus-ds-ppc64le created

If the raw GitHub URL for multus-daemonset.yml above is unreachable, you can do the following instead:

# git clone https://github.com/k8snetworkplumbingwg/multus-cni.git
# cd multus-cni/
# kubectl create -f deployments/multus-daemonset.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds created
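
Multus generates its own CNI config that delegates to the existing primary-network config as the default network; a quick sanity check (the generated file is typically 00-multus.conf):

# ls /etc/cni/net.d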

Secondary CNI

# mkdir -p /opt/cni/bin
# wget https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-linux-amd64-v1.1.1.tgz
# tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.1.1.tgz

Looking at /opt/cni/bin, you can see that a number of CNI plugins are now in place:

# ls /opt/cni/bin
bandwidth  bridge  dhcp  firewall  host-device  host-local  ipvlan  loopback  macvlan  portmap  ptp  sbr  static  tuning  vlan  vrf

This example uses the macvlan CNI.

Multi-Network CRD

Create two network attachments for the macvlan CNI. Note that their IP ranges must not overlap with the primary network's range:

# cat macvlan_cx6dx.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-cx6dx-conf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "ens2f0",
    "ipam": {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [
        { "dst": "0.0.0.0/0" }
      ],
      "gateway": "10.56.217.1"
    }
  }'

# cat macvlan_cx5_bond.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-cx5-bond-conf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "bond0",
    "ipam": {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.71",
      "rangeEnd": "10.56.217.81",
      "routes": [
        { "dst": "0.0.0.0/0" }
      ],
      "gateway": "10.56.217.1"
    }
  }'

# kubectl create -f macvlan_cx6dx.yaml
networkattachmentdefinition.k8s.cni.cncf.io/macvlan-cx6dx-conf created

# kubectl create -f macvlan_cx5_bond.yaml
networkattachmentdefinition.k8s.cni.cncf.io/macvlan-cx5-bond-conf created
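
The attachments can be listed through the CRD that Multus installed:

# kubectl get network-attachment-definitions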

4. Launching pods

This example only uses macvlan-cx5-bond-conf. To use macvlan-cx6dx-conf instead, specify the corresponding annotation and resources in test-xxx-pod.yaml (a sketch of the differing lines follows the figure):
[Figure: pod with multiple network interfaces]
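
For the cx6dx variant, only the annotation and the resource name change; a sketch of just those lines, using the names defined in the earlier ConfigMap and NetworkAttachmentDefinition:

  annotations:
    k8s.v1.cni.cncf.io/networks: default/macvlan-cx6dx-conf
  ...
    resources:
      limits:
        rdma/cx6dx_shared_devices_b: 1
      requests:
        rdma/cx6dx_shared_devices_b: 1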

# cat test-cx5-bond-pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-cx5-bond-pod1
  annotations:
    k8s.v1.cni.cncf.io/networks: default/macvlan-cx5-bond-conf
spec:
  restartPolicy: OnFailure
  containers:
  - image: mellanox/rping-test
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/cx5_bond_shared_devices_a: 1
      requests:
        rdma/cx5_bond_shared_devices_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

# kubectl create -f test-cx5-bond-pod1.yaml
pod/mofed-test-cx5-bond-pod1 created

# cat test-cx5-bond-pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-cx5-bond-pod2
  annotations:
    k8s.v1.cni.cncf.io/networks: default/macvlan-cx5-bond-conf
spec:
  restartPolicy: OnFailure
  containers:
  - image: mellanox/rping-test
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/cx5_bond_shared_devices_a: 1
      requests:
        rdma/cx5_bond_shared_devices_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

# kubectl create -f test-cx5-bond-pod2.yaml
pod/mofed-test-cx5-bond-pod2 created

5. Running RoCE traffic in the pods

The pods can now drive RoCE traffic over the secondary network interface net1 (the interface Multus added, visible in the ifconfig output below):
[Figure: RoCE traffic between the pods via net1]

# kubectl get pods -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
default       mofed-test-cx5-bond-pod1              1/1     Running   0          3m41s
default       mofed-test-cx5-bond-pod2              1/1     Running   0          32s
default       mofed-test-macvlan-pod                1/1     Running   0          4d9h
kube-system   coredns-6d4b75cb6d-752q4              1/1     Running   0          5d3h
kube-system   coredns-6d4b75cb6d-7h2g5              1/1     Running   0          5d3h
kube-system   etcd-node1                            1/1     Running   5          5d3h
kube-system   kube-apiserver-node1                  1/1     Running   4          5d3h
kube-system   kube-controller-manager-node1         1/1     Running   1          5d3h
kube-system   kube-flannel-ds-xwlr2                 1/1     Running   0          5d3h
kube-system   kube-multus-ds-kqhqn                  1/1     Running   0          5d2h
kube-system   kube-proxy-px447                      1/1     Running   0          5d3h
kube-system   kube-scheduler-node1                  1/1     Running   4          5d3h
kube-system   rdma-shared-dp-ds-vps6x               1/1     Running   0          21m

mofed-test-cx5-bond-pod1

# kubectl exec -it mofed-test-cx5-bond-pod1 bash
[root@mofed-test-cx5-bond-pod1 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.0.211  netmask 255.255.255.0  broadcast 10.244.0.255
        inet6 fe80::e45d:c4ff:fe4c:f3b3  prefixlen 64  scopeid 0x20<link>
        ether e6:5d:c4:4c:f3:b3  txqueuelen 0  (Ethernet)
        RX packets 12  bytes 1016 (1016.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 612 (612.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 10.56.217.71  netmask 255.255.255.0  broadcast 10.56.217.255
        ether fa:a4:6e:24:3e:ba  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mofed-test-cx5-bond-pod1 /]# ib_write_bw -d mlx5_bond_0 -F --report_gbits
************************************
* Waiting for client to connect... *
************************************

mofed-test-cx5-bond-pod2

# kubectl exec -it mofed-test-cx5-bond-pod2 bash
[root@mofed-test-cx5-bond-pod2 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.0.212  netmask 255.255.255.0  broadcast 10.244.0.255
        inet6 fe80::20d6:7eff:fec0:4e39  prefixlen 64  scopeid 0x20<link>
        ether 22:d6:7e:c0:4e:39  txqueuelen 0  (Ethernet)
        RX packets 12  bytes 1016 (1016.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 612 (612.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 10.56.217.72  netmask 255.255.255.0  broadcast 10.56.217.255
        ether a6:46:b9:94:b0:31  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mofed-test-cx5-bond-pod2 /]# ib_write_bw -d mlx5_bond_0 -F --report_gbits 10.56.217.71
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 4
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x117c PSN 0xbfdcaf RKey 0x00511b VAddr 0x007fdf469fd000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:56:217:72
 remote address: LID 0000 QPN 0x117d PSN 0x75cbaa RKey 0x004407 VAddr 0x007f65e74dc000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:56:217:71
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             82.62              82.55              0.157445
---------------------------------------------------------------------------------------
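
The GID index printed above (4) identifies the RoCE GID table entry created for net1's IP. On hosts with MLNX_OFED the mapping can be listed with the show_gids script, or read from sysfs (device/port/index as in this run):

# cat /sys/class/infiniband/mlx5_bond_0/ports/1/gids/4
# cat /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/4   ## RoCE v1 or v2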

Summary

TBD
