CNI Network Traffic 4.6 Calico eBPF Dataplane

Calico eBPF Dataplane

A study of the use cases and how they work.

Use Case

According to the official documentation, there are several use cases: traffic control, network policy, and connect-time load balancing.

  • Traffic control
    Traffic normally has to go through a complex set of routing decisions to reach its destination. When the destination of the traffic is already known, eBPF lets you use that extra context to make those decisions in the kernel, so packets bypass the complex routing and go straight to their final destination.
  • Network policy
    When creating network policy, there are two places where eBPF can be used:
    • eXpress Data Path (XDP) – as a raw packet buffer enters the system, eBPF gives you an efficient way to inspect that buffer and quickly decide what to do with it. (The eBPF dataplane disables this method.)
    • Network policy – eBPF lets you inspect packets efficiently and apply network policy, for both pods and hosts.
  • Connect-time load balancing
    In Kubernetes, access to a Service starts at the VIP, which is then NATed to a real endpoint for the actual traffic, and replies have to be NATed back to the VIP. With eBPF, a program loaded into the kernel performs the address translation when the source opens the connection, so no NAT is needed on the packet-processing path and the NAT overhead of service connections is removed.

Features the Calico eBPF dataplane does not yet support

  • Mixed deployments, i.e. some nodes running the standard Linux dataplane and others running the eBPF dataplane
  • IPv6.
  • Floating IPs.
  • SCTP (either for policy or services). This is due to lack of kernel support for the SCTP checksum in BPF.
  • Log action in policy rules. This is because the Log action maps to the iptables LOG action and BPF programs cannot access that log.
  • VLAN-based traffic.

Deployment

  • Kubernetes deployment
    Install the cluster without deploying kube-proxy:
$ kubeadm init --config kubeadm.conf --skip-phases=addon/kube-proxy
  • Create the API ConfigMap, which is used to reach the API server when kube-proxy is absent:
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "172.18.22.111"
  KUBERNETES_SERVICE_PORT: "6443"
  • Install the operator
$ wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/tigera-operator.yaml
$ kubectl create -f tigera-operator.yaml
  • Create custom-resources.yaml (setting linuxDataplane: BPF) and apply it:
# This section includes base Calico installation configuration.
# For more information, see: https://projectcalico.docs.tigera.io/master/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: BPF
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 24
      cidr: 10.244.0.0/16
      encapsulation: None
      natOutgoing: Enabled
      nodeSelector: all()

---

# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/master/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer 
metadata: 
  name: default 
spec: {}

$ kubectl create -f custom-resources.yaml

Code Analysis

This continues the analysis from section 4.3 and focuses on the eBPF-specific branches of the code.

After deployment, calico-node runs with the environment variable FELIX_BPFENABLED=true, which Felix parses into the configuration value BPFEnabled: true.
When Felix starts:

Checking BPF support

if configParams.BPFEnabled {
   if err := dp.SupportsBPF(); err != nil {

The check is primarily a kernel version check.
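
As a rough illustration of that kind of check, here is a minimal sketch that parses the kernel release string and compares it against a minimum version. The version constant and the parsing are assumptions for illustration; the real check behind SupportsBPF also handles distribution-specific kernels and feature probing.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Assumed minimum; the Calico docs list v5.3+ for the eBPF dataplane
// (with distribution-specific exceptions that the real check also covers).
const minMajor, minMinor = 5, 3

func kernelSupportsBPF() error {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease") // e.g. "5.15.0-91-generic"
	if err != nil {
		return err
	}
	release := strings.TrimSpace(string(raw))
	fields := strings.SplitN(strings.SplitN(release, "-", 2)[0], ".", 3)
	major, _ := strconv.Atoi(fields[0])
	minor := 0
	if len(fields) > 1 {
		minor, _ = strconv.Atoi(fields[1])
	}
	if major < minMajor || (major == minMajor && minor < minMinor) {
		return fmt.Errorf("kernel %s too old for the BPF dataplane (need >= %d.%d)", release, minMajor, minMinor)
	}
	return nil
}

func main() {
	if err := kernelSupportsBPF(); err != nil {
		fmt.Println("BPF dataplane not supported:", err)
		return
	}
	fmt.Println("kernel version OK for the BPF dataplane")
}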

  • Check for IPVS support
kubeIPVSSupportEnabled := false
if ifacemonitor.IsInterfacePresent(intdataplane.KubeIPVSInterface) {
   if configParams.BPFEnabled {
      log.Info("kube-proxy IPVS device found but we're in BPF mode, ignoring.")
   } else {
      kubeIPVSSupportEnabled = true
      log.Info("Kube-proxy in ipvs mode, enabling felix kube-proxy ipvs support.")
   }
}

With BPFEnabled set, kubeIPVSSupportEnabled remains false.

Instantiating the DataplaneDriver

Computing the usable iptables mark bits

       allowedMarkBits := configParams.IptablesMarkMask
        if configParams.BPFEnabled {
            // In BPF mode, the BPF programs use mark bits that are not configurable.  Make sure that those
            // bits are covered by our allowed mask.
            if allowedMarkBits&tcdefs.MarksMask != tcdefs.MarksMask {
                log.WithFields(log.Fields{
                    "Name":            "felix-iptables",
                    "MarkMask":        allowedMarkBits,
                    "RequiredBPFBits": tcdefs.MarksMask,
                }).Panic("IptablesMarkMask doesn't cover bits that are used (unconditionally) by eBPF mode.")
            }
            allowedMarkBits ^= allowedMarkBits & tcdefs.MarksMask
            log.WithField("updatedBits", allowedMarkBits).Info(
                "Removed BPF program bits from available mark bits.")
        }

        markBitsManager := markbits.NewMarkBitsManager(allowedMarkBits, "felix-iptables")

In BPF mode, the allowed marks are not the default iptables mark mask (0xffff0000): after removing the bits used by the BPF programs (0x1ff00000), the usable iptables mark mask becomes 0xe00f0000. markBitsManager then carves this 32-bit iptables mark into the individual marks.
The marks it hands out are acceptMark=0x10000, endpointMark=0xe0000000, endpointMarkNonCali=0x0, passMark=0x20000, scratch0Mark=0x40000, scratch1Mark=0x80000.
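
The arithmetic is easy to verify. A small, self-contained example of the same bit manipulation, using the mask values quoted above (tcdefs.MarksMask taken as 0x1ff00000):

package main

import "fmt"

func main() {
	allowedMarkBits := uint32(0xffff0000) // default IptablesMarkMask
	bpfMarksMask := uint32(0x1ff00000)    // bits used unconditionally by the BPF programs

	// Remove the BPF-reserved bits from the bits available to iptables.
	allowedMarkBits ^= allowedMarkBits & bpfMarksMask
	fmt.Printf("available iptables mark mask: %#x\n", allowedMarkBits) // 0xe00f0000
}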

Building dpConfig

dpConfig := 
...
        if configParams.BPFExternalServiceMode == "dsr" {
            dpConfig.BPFNodePortDSREnabled = true
        }

When BPFExternalServiceMode is "dsr", BPFNodePortDSREnabled is set to true.

Initializing and starting the dataplane driver

       intDP := intdataplane.NewIntDataplaneDriver(dpConfig)
        intDP.Start()

This sets up the iptables chains; in BPF mode, iptables is still used for raw egress policy.

Initializing the BPF maps

The BPF maps are initialized; the comments below show each map's size and name:

   bpfMapContext := bpfmap.CreateBPFMapContext(
        config.BPFMapSizeIPSets,        // 1048576     // cali_v4_ip_sets
        config.BPFMapSizeNATFrontend,   // 65536       // cali_v4_nat_fe3
        config.BPFMapSizeNATBackend,    // 262144      // cali_v4_nat_be
        config.BPFMapSizeNATAffinity,   // 65536       // cali_v4_nat_aff
        config.BPFMapSizeRoute,         // 262144      // cali_v4_routes
        config.BPFMapSizeConntrack,     // 512000      // cali_v4_ct3
        config.BPFMapSizeIfState,       // 1000
        config.BPFMapRepin,             // false
    )

bpfMapContext

    bpfMapContext := &bpf.MapContext{
        RepinningEnabled: repinEnabled,
        MapSizes:         map[string]uint32{},
    }

In the MapSizes map, the key is the map name plus version information, and the value is the corresponding number of entries:

    RepinningEnabled: false
    cali_v4_ip_sets: 1048576     
    cali_v4_nat_fe3: 65536
    cali_v4_nat_be: 262144      
    cali_v4_nat_aff: 65536       
    cali_v4_routes: 262144      
    cali_v4_ct3: 512000      
    cali_state2: 1
    cali_v4_arp2: 10000
    cali_v4_fsafes2: 65536
    cali_v4_srmsg: 510000
    cali_v4_ct_nats: 10000
    cali_iface2: 1000
  
// Values are then set for each map in bpfMapContext; for example:
    m := &PinnedMap{
        context:       c,               // pointer to the bpfMapContext
        MapParameters: params,          // the map parameters
        perCPU:        strings.Contains(params.Type, "percpu"),      // true for per-CPU map types
    }

The maps are defined under felix/bpf/.

  • Once defined, a map is created with bpftool if it does not already exist on the host:
   cmd := exec.Command("bpftool", "map", "create", b.versionedFilename(),
        "type", b.Type,
        "key", fmt.Sprint(b.KeySize),
        "value", fmt.Sprint(b.ValueSize),
        "entries", fmt.Sprint(b.MaxEntries),
        "name", b.VersionedName(),
        "flags", fmt.Sprint(b.Flags),
    )
Registering the managers

Next, all the managers are registered with the dataplane:
ipSetsV4, built from IpsetsMap, is added to ipsetsManager
bpfRTMgr is built from RouteMap
bpfEndpointManager is built from IfStateMap and ArpMap
conntrackScanner is built from CtMap and starts scanning
kube-proxy is started using FrontendMap, BackendMap, AffinityMap and CtMap (a sketch of the frontend/backend map layout follows the StartKubeProxy snippet below)

           kp, err := bpfproxy.StartKubeProxy(
                config.KubeClientSet,
                config.Hostname,
                bpfMapContext,
                bpfproxyOpts...,
            )
            if err != nil {
                log.WithError(err).Panic("Failed to start kube-proxy.")
            }
            bpfRTMgr.setHostIPUpdatesCallBack(kp.OnHostIPsUpdate)
            bpfRTMgr.setRoutesCallBacks(kp.OnRouteUpdate, kp.OnRouteDelete)
            conntrackScanner.AddUnlocked(conntrack.NewStaleNATScanner(kp))
            conntrackScanner.Start()
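
To make the roles of FrontendMap and BackendMap more concrete, here is an illustrative sketch of the shape of the NAT tables the BPF kube-proxy maintains. The struct layouts and field names below are assumptions, not the real definitions in felix/bpf/nat: the frontend map is keyed by the service (VIP, port, protocol) and points at a backend group, while the backend map lists that group's members.

package main

import "fmt"

// Illustrative only -- the real key/value layouts live in felix/bpf/nat.
type frontendKey struct {
	VIP   [4]byte // service cluster IP
	Port  uint16
	Proto uint8
}

type frontendValue struct {
	BackendID uint32 // identifies a group of backends
	Count     uint32 // number of backends in the group
}

type backendKey struct {
	BackendID uint32
	Ordinal   uint32 // which backend within the group
}

type backendValue struct {
	IP   [4]byte // pod IP
	Port uint16
}

func main() {
	fe := map[frontendKey]frontendValue{
		{VIP: [4]byte{10, 96, 0, 10}, Port: 80, Proto: 6}: {BackendID: 1, Count: 2},
	}
	be := map[backendKey]backendValue{
		{BackendID: 1, Ordinal: 0}: {IP: [4]byte{10, 244, 213, 11}, Port: 80},
		{BackendID: 1, Ordinal: 1}: {IP: [4]byte{10, 244, 213, 12}, Port: 80},
	}
	// Pick a backend for the VIP and rewrite the destination.
	v := fe[frontendKey{VIP: [4]byte{10, 96, 0, 10}, Port: 80, Proto: 6}]
	b := be[backendKey{BackendID: v.BackendID, Ordinal: 0}] // ordinal normally chosen by hash/affinity
	fmt.Printf("VIP 10.96.0.10:80 -> %d.%d.%d.%d:%d\n", b.IP[0], b.IP[1], b.IP[2], b.IP[3], b.Port)
}

With connect-time load balancing, this lookup and destination rewrite happens once in the connect hook rather than on every packet.
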
The connect-time load balancer program is then installed; a feature gate controls UDP support:
err = nat.InstallConnectTimeLoadBalancer(
    config.BPFCgroupV2, config.BPFLogLevel, config.BPFConntrackTimeouts.UDPLastSeen, bpfMapContext, excludeUDP) 
  
// bpfMount:/sys/fs/bpf;udpNotSeen:60000000000       
err = installProgram("connect", "4", bpfMount, cgroupPath, logLevel, udpNotSeen, bpfMc.MapSizes, excludeUDP)
    if err != nil {
        return err
    }

err = installProgram("connect", "6", bpfMount, cgroupPath, logLevel, udpNotSeen, bpfMc.MapSizes, excludeUDP)
    if err != nil {
        return err
    }

It opens the object file /usr/lib/calico/bpf/connect_time_no_log_v4_co-re.o and loops over the maps inside, calling SetPinPath for each so they are pinned under /sys/fs/bpf/tc/globals/.

The dataplane then continues registering epManager, newFloatingIPManager, newMasqManager, and other managers.

dp.Start

Configure the kernel (a minimal sketch of setting these follows below):
/proc/sys/net/ipv4/ip_forward = 1
/proc/sys/net/ipv6/conf/all/forwarding = 1
/proc/sys/kernel/unprivileged_bpf_disabled = 1
Start all the worker loops.
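
A minimal sketch of how those kernel parameters could be set from Go; the real Felix code uses its own sysctl helpers, so this is only an approximation:

package main

import (
	"fmt"
	"os"
)

func main() {
	// The three settings listed above.
	sysctls := map[string]string{
		"/proc/sys/net/ipv4/ip_forward":              "1",
		"/proc/sys/net/ipv6/conf/all/forwarding":     "1",
		"/proc/sys/kernel/unprivileged_bpf_disabled": "1",
	}
	for path, value := range sysctls {
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			fmt.Fprintf(os.Stderr, "failed to set %s: %v\n", path, err)
		}
	}
}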

syncfelix

From here on it is the same syncfelix flow covered in section 4.3.

Code Analysis: Handling Resource Changes

syncer

When the syncer is instantiated it watches many resources. When it receives a resource-change message, it passes the message to the calculation graph through the syncerToValidator callback; syncerToValidator maintains a channel, so on receiving an update it can send it on to asyncCalcGraph. Finally, every manager is notified via OnUpdate and the result is pushed to the dataplane.

The update happens in two steps: first the managers' state is updated, then apply() makes it take effect in the dataplane.

The dataplane loop goroutine receives three kinds of input (a sketch of the loop follows below):

  • messages destined for the dataplane, received on toDataplane
  • interface changes
  • address changes

Every manager is notified via OnUpdate, and whatever needs it is then applied to the dataplane.
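
A minimal sketch of that loop. The channel and manager types here are hypothetical stand-ins; the real loop in felix/dataplane/linux (loopUpdatingDataplane) has considerably more batching and scheduling logic:

package main

// Hypothetical stand-ins for the real Felix types.
type Manager interface {
	OnUpdate(msg interface{})
	CompleteDeferredWork() error
}

type dataplane struct {
	toDataplane  chan interface{} // messages from the calculation graph
	ifaceUpdates chan interface{} // interface state changes
	addrUpdates  chan interface{} // interface address changes
	managers     []Manager
}

func (d *dataplane) loop() {
	for {
		dirty := false
		select {
		case msg := <-d.toDataplane:
			for _, m := range d.managers {
				m.OnUpdate(msg) // step 1: update the managers' state
			}
			dirty = true
		case upd := <-d.ifaceUpdates:
			for _, m := range d.managers {
				m.OnUpdate(upd)
			}
			dirty = true
		case upd := <-d.addrUpdates:
			for _, m := range d.managers {
				m.OnUpdate(upd)
			}
			dirty = true
		}
		if dirty {
			d.apply() // step 2: make the changes take effect
		}
	}
}

func (d *dataplane) apply() {
	for _, m := range d.managers {
		_ = m.CompleteDeferredWork()
	}
}

func main() {}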

Pod

A pod corresponds to a WorkloadEndpoint in Calico, and the syncer receives the WorkloadEndpoint update:

- apiVersion: projectcalico.org/v3
  kind: WorkloadEndpoint
  metadata:
    creationTimestamp: "2023-02-25T07:35:43Z"
    labels:
      projectcalico.org/namespace: default
      projectcalico.org/orchestrator: k8s
      projectcalico.org/serviceaccount: default
    name: node111-k8s-pod2-eth0
    namespace: default
    resourceVersion: "755704"
    uid: f2f486fe-3af9-4274-aaca-440828ca41f2
  spec:
    containerID: 9b37a494251d11e53d66be4e1d9d6a056c56af43cb00a311d417ec0f905a9534
    endpoint: eth0
    interfaceName: calibd2348b4f67
    ipNetworks:
    - 10.244.213.11/32
    node: node111
    orchestrator: k8s
    pod: pod2
    ports:
    - hostIP: ""
      hostPort: 0
      name: nginx-port
      port: 80
      protocol: TCP
    profiles:
    - kns.default
    - ksa.default.default
    serviceAccountName: default

bpfEndpointManager.OnUpdate
The bpfEndpointManager struct:
	allWEPs        map[proto.WorkloadEndpointID]*proto.WorkloadEndpoint
	policiesToWorkloads map[proto.PolicyID]set.Set[any]  /* FIXME proto.WorkloadEndpointID or string (for a HEP) */
	profilesToWorkloads map[proto.ProfileID]set.Set[any] /* FIXME proto.WorkloadEndpointID or string (for a HEP) */

allWEPs holds the information for every WorkloadEndpoint.

The WorkloadEndpoint is stored in bpfEndpointManager's allWEPs; while storing it, the endpoint's policies are re-added to policiesToWorkloads (a sketch of this handling follows the example below).

allWEPs map[proto.WorkloadEndpointID]*proto.WorkloadEndpoint
<orchestrator_id:"k8s" workload_id:"default/pod2" endpoint_id:"eth0" >

type WorkloadEndpointID struct {
	OrchestratorId   k8s
	WorkloadId      default/pod2
	EndpointId      eth0
}

endpoint:<state:"active" name:"calibd2348b4f67" profile_ids:"kns.default" profile_ids:"ksa.default.default" ipv4_nets:"10.244.213.11/32" > 
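
A simplified sketch of what storing that update looks like. The types below are cut-down stand-ins for the proto types, and the logic is abbreviated compared with the real bpfEndpointManager:

package main

// Cut-down stand-ins for the proto types.
type WorkloadEndpointID struct{ OrchestratorId, WorkloadId, EndpointId string }
type PolicyID struct{ Tier, Name string }
type TierInfo struct{ IngressPolicies, EgressPolicies []string }
type WorkloadEndpoint struct {
	Name  string // e.g. "calibd2348b4f67"
	Tiers []TierInfo
}

type bpfEpMgr struct {
	allWEPs             map[WorkloadEndpointID]*WorkloadEndpoint
	policiesToWorkloads map[PolicyID]map[WorkloadEndpointID]bool
	dirtyIfaces         map[string]bool
}

// onWorkloadEndpointUpdate stores the endpoint, re-indexes its policies and
// marks the interface dirty so CompleteDeferredWork will (re)attach programs.
func (m *bpfEpMgr) onWorkloadEndpointUpdate(id WorkloadEndpointID, wep *WorkloadEndpoint) {
	m.allWEPs[id] = wep
	for _, tier := range wep.Tiers {
		for _, pol := range tier.IngressPolicies {
			m.indexPolicy(PolicyID{Tier: "default", Name: pol}, id)
		}
		for _, pol := range tier.EgressPolicies {
			m.indexPolicy(PolicyID{Tier: "default", Name: pol}, id)
		}
	}
	m.dirtyIfaces[wep.Name] = true
}

func (m *bpfEpMgr) indexPolicy(p PolicyID, id WorkloadEndpointID) {
	if m.policiesToWorkloads[p] == nil {
		m.policiesToWorkloads[p] = map[WorkloadEndpointID]bool{}
	}
	m.policiesToWorkloads[p][id] = true
}

func main() {}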

After the manager is updated, if the change needs to take effect in the dataplane:

d.apply()

This runs every manager's CompleteDeferredWork():

  1. Set /proc/sys/net/ipv4/conf/calibd2348b4f67/accept_local = 1
  2. Store the WorkloadEndpoint and interface name in happyWEPs, and add the interface to the iptables allow list
  3. When the interface state changes to up, set rp_filter = 2 and accept_local = 1
  4. applyPolicy applies policy to the workload; in the BPF case this calls attachWorkloadProgram
  5. Find the workload's attach points and attach the programs if they are not already attached
Program attached to TC. attachPoint=&tc.AttachPoint{Type:"workload", ToOrFrom:"to", Hook:"egress", Iface:"calibd2348b4f67", LogLevel:"off", HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x64, 0x6f}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, TunnelMTU:0x5aa, VXLANPort:0x12b5, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, MapSizes:map[string]uint32{"cali_iface2":0x3e8, "cali_state2":0x1, "cali_v4_arp2":0x2710, "cali_v4_ct3":0x7d000, "cali_v4_ct_nats":0x2710, "cali_v4_fsafes2":0x10000, "cali_v4_ip_sets":0x100000, "cali_v4_nat_aff":0x10000, "cali_v4_nat_be":0x40000, "cali_v4_nat_fe3":0x10000, "cali_v4_routes":0x40000, "cali_v4_srmsg":0x7c830}, RPFEnforceOption:0x1, NATin:0x0, NATout:0x0}

Program attached to TC. attachPoint=&tc.AttachPoint{Type:"workload", ToOrFrom:"from", Hook:"ingress", Iface:"calibd2348b4f67", LogLevel:"off", HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x64, 0x6f}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, TunnelMTU:0x5aa, VXLANPort:0x12b5, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, MapSizes:map[string]uint32{"cali_iface2":0x3e8, "cali_state2":0x1, "cali_v4_arp2":0x2710, "cali_v4_ct3":0x7d000, "cali_v4_ct_nats":0x2710, "cali_v4_fsafes2":0x10000, "cali_v4_ip_sets":0x100000, "cali_v4_nat_aff":0x10000, "cali_v4_nat_be":0x40000, "cali_v4_nat_fe3":0x10000, "cali_v4_routes":0x40000, "cali_v4_srmsg":0x7c830}, RPFEnforceOption:0x1, NATin:0x0, NATout:0x0}
AttachProgram

binaryToLoad:to_wep_no_log_co-re.o
SectionName:calico_%s_%s_ep;fromOrTo(from ingress,to egress),workload
ifName:calibd2348b4f67
Hook:ingress,egress
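
For reference, the section name template above expands as shown below (a trivial illustration; the truncated names seen later in the tc output, such as calico_from_wor, come from the kernel's 15-character limit on BPF program names):

package main

import "fmt"

func main() {
	// "from" is used for the ingress hook on the workload interface,
	// "to" for the egress hook, matching the attach points logged above.
	for _, fromOrTo := range []string{"from", "to"} {
		fmt.Println(fmt.Sprintf("calico_%s_%s_ep", fromOrTo, "workload"))
	}
	// Output:
	// calico_from_workload_ep
	// calico_to_workload_ep
}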

The attach process

  • loadObject
    The object file is loaded with libbpf:
libbpf.OpenObject(file)

It iterates over the maps, setting each map's maximum size and its pin path.

# ls /sys/fs/bpf/tc/globals/
cali_iface2      cali_rule_ctrs2  cali_state2      cali_v4_arp2     cali_v4_ct3      cali_v4_ct_nats  cali_v4_fsafes2  cali_v4_ip_sets  cali_v4_nat_aff  cali_v4_nat_be   cali_v4_nat_fe3  cali_v4_routes   cali_v4_srmsg

// pin paths for this specific interface
# ls /sys/fs/bpf/tc/calibd2348b4f67_egr/
cali_counters1  cali_jump2
# ls /sys/fs/bpf/tc/calibd2348b4f67_igr/
cali_counters1  cali_jump2

// set the pin path; through the map fd, the user-space program exchanges data with the kernel.
C.bpf_map__set_pin_path(m.bpfMap, cPath)

// UpdateJumpMap
C.bpf_update_jump_map(o.obj, cMapName, cProgName, C.int(mapIndex))
mapName:cali_jump2

After attaching

# tc filter show dev calibd2348b4f67 ingress
filter protocol all pref 49152 bpf chain 0 
filter protocol all pref 49152 bpf chain 0 handle 0x1 calico_from_wor:[651] direct-action not_in_hw id 651 name calico_from_wor tag ed575173e2394260 jited 
# tc filter show dev calibd2348b4f67 egress
filter protocol all pref 49152 bpf chain 0 
filter protocol all pref 49152 bpf chain 0 handle 0x1 calico_to_workl:[649] direct-action not_in_hw id 649 name calico_to_workl tag 0cbb6af82c8023a6 jited

The program ID (651) is obtained, and bpftool shows the attached program:

# kubectl exec -it  calico-node-mq4rw      -n calico-system  -- bpftool prog show id 651 --json

{"id":651,"type":"sched_cls","name":"calico_from_wor","tag":"ed575173e2394260","gpl_compatible":true,"loaded_at":1677310544,"uid":0,"bytes_xlated":13104,"jited":true,"bytes_jited":8354,"bytes_memlock":16384,"map_ids":[44,226,227,225,45,51,50,47,49,48,53],"btf_id":271}

MapIDs [44,226,227,225,45,51,50,47,49,48,53]

The map fd is obtained from each map ID, and the map info from the map fd:

C.bpf_attr_setup_obj_get_id(bpfAttr, C.uint(mapID), 0)
mapInfo, err := bpf.GetMapInfo(mapFD)

The fds are stored in the manager:
			iface.dpState.jumpMapFDs[ap.JumpMapFDMapKey()] = fd

NetworkPolicy

apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
  kind: NetworkPolicy
  metadata:
    creationTimestamp: "2023-02-26T03:15:35Z"
    name: allow-tcp-80
    namespace: default
    resourceVersion: "890028"
    uid: 194f329c-111b-469a-bb27-17b79a819d63
  spec:
    ingress:
    - action: Allow
      destination:
        ports:
        - 80
      protocol: TCP
      source:
        selector: role == 'pod1'
    selector: app == 'nginx'
    types:
    - Ingress
kind: NetworkPolicyList
metadata:
  resourceVersion: "892050"

NetworkPolicy objects are handled by the ActiveRulesCalculator:

allPolicies     map[model.PolicyKey]*model.Policy

key: Policy(name=default/default.allow-tcp-80)
selectorsById[id] = sel    // sel is the policy's selector

The matching IPs are added to the IP set:
"s:SPeglQlTBmfidv00S2cBaDCQ11JHPfyxQSqeOw" members:"10.244.213.6/32"
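
A rough sketch of the idea behind that last step: the policy's source selector is evaluated against workload labels, and the IPs of matching workloads become the members of a selector-derived IP set. The "s:" ID above is a hash of the selector expression; the hashing and matching below are illustrative assumptions, not the actual libcalico-go implementation:

package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

type workload struct {
	Labels map[string]string
	IP     string
}

// ipSetID mimics deriving a stable ID from the selector text; the real
// encoding may differ (truncated here to match the length seen above).
func ipSetID(selector string) string {
	h := sha256.Sum256([]byte(selector))
	return "s:" + base64.RawURLEncoding.EncodeToString(h[:])[:38]
}

func main() {
	selector := "role == 'pod1'"
	workloads := []workload{
		{Labels: map[string]string{"role": "pod1"}, IP: "10.244.213.6/32"},
		{Labels: map[string]string{"role": "other"}, IP: "10.244.213.9/32"},
	}

	var members []string
	for _, w := range workloads {
		if w.Labels["role"] == "pod1" { // stands in for real selector evaluation
			members = append(members, w.IP)
		}
	}
	fmt.Printf("%s members: %v\n", ipSetID(selector), members)
}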
