Calico eBPF dataplane
A study of its use cases and functionality
Use Case
According to the official documentation, there are three use cases: traffic control, network policy, and connect-time load balancing.
- Traffic control
Normally a packet's path is determined by a series of complex routing decisions. When the destination of the traffic is already known, eBPF lets you use that extra context to act in the kernel so that packets bypass the complex routing and go straight to their final destination.
- Network policy
When applying network policy, there are two places eBPF can be used:
- eXpress Data Path (XDP): when a raw packet buffer enters the system, eBPF provides an efficient way to inspect the buffer and quickly decide what to do with it. (The eBPF dataplane disables this method.)
- Network policy: eBPF lets you inspect packets efficiently and apply network policy, for both pods and hosts.
- Connect-time load balancing
In Kubernetes, access to a Service starts at the VIP, which is NATed to a real endpoint for the actual traffic; replies then have to be NATed back to the VIP. With eBPF, a program loaded into the kernel performs the address translation when the source connects, so no NAT is needed on the packet-processing path and the NAT overhead of service connections disappears.
Not yet supported by the Calico eBPF dataplane
- Mixed deployments, i.e. some nodes on the Linux dataplane and others on the eBPF dataplane.
- IPv6.
- Floating IPs.
- SCTP (either for policy or services). This is due to lack of kernel support for the SCTP checksum in BPF.
- Log action in policy rules. This is because the Log action maps to the iptables LOG action and BPF programs cannot access that log.
- VLAN-based traffic.
Deployment
- Deploying Kubernetes
Deploy the cluster without kube-proxy:
$ kubeadm init --config kubeadm.conf --skip-phases=addon/kube-proxy
- Create an API ConfigMap, used to reach the API server when kube-proxy is absent:
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "172.18.22.111"
  KUBERNETES_SERVICE_PORT: "6443"
- Install the operator
$ wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/tigera-operator.yaml
$ kubectl create -f tigera-operator.yaml
- custom-resources.yaml
# This section includes base Calico installation configuration.
# For more information, see: https://projectcalico.docs.tigera.io/master/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: BPF
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 24
      cidr: 10.244.0.0/16
      encapsulation: None
      natOutgoing: Enabled
      nodeSelector: all()
---
# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/master/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
$ kubectl create -f custom-resources.yaml
Code Analysis
This continues the analysis from section 4.3, focusing on the eBPF branch.
After deployment, calico-node picks up the env var FELIX_BPFENABLED=true, which resolves to BPFEnabled: true in the config.
When felix starts
Check BPF support:
if configParams.BPFEnabled {
	if err := dp.SupportsBPF(); err != nil {
		// ... felix bails out here: BPF mode is enabled but not supported
	}
}
The check is based mainly on the kernel version.
- Check IPVS support
kubeIPVSSupportEnabled := false
if ifacemonitor.IsInterfacePresent(intdataplane.KubeIPVSInterface) {
	if configParams.BPFEnabled {
		log.Info("kube-proxy IPVS device found but we're in BPF mode, ignoring.")
	} else {
		kubeIPVSSupportEnabled = true
		log.Info("Kube-proxy in ipvs mode, enabling felix kube-proxy ipvs support.")
	}
}
With BPF enabled, kubeIPVSSupportEnabled stays false.
Instantiate the DataplaneDriver
Calculate the usable iptables mark bits:
allowedMarkBits := configParams.IptablesMarkMask
if configParams.BPFEnabled {
	// In BPF mode, the BPF programs use mark bits that are not configurable. Make sure that those
	// bits are covered by our allowed mask.
	if allowedMarkBits&tcdefs.MarksMask != tcdefs.MarksMask {
		log.WithFields(log.Fields{
			"Name":            "felix-iptables",
			"MarkMask":        allowedMarkBits,
			"RequiredBPFBits": tcdefs.MarksMask,
		}).Panic("IptablesMarkMask doesn't cover bits that are used (unconditionally) by eBPF mode.")
	}
	allowedMarkBits ^= allowedMarkBits & tcdefs.MarksMask
	log.WithField("updatedBits", allowedMarkBits).Info(
		"Removed BPF program bits from available mark bits.")
}
markBitsManager := markbits.NewMarkBitsManager(allowedMarkBits, "felix-iptables")
In BPF mode, the allowed marks are not the default iptables mark mask (0xffff0000): the bits used by the BPF programs (0x1ff00000) are subtracted, leaving 0xe00f0000 as the usable iptables mark mask. markBitsManager then carves individual marks out of those 32 bits.
The allocated marks are acceptMark=0x10000, endpointMark=0xe0000000, endpointMarkNonCali=0x0, passMark=0x20000, scratch0Mark=0x40000, scratch1Mark=0x80000.
Build dpConfig:
dpConfig := ...
if configParams.BPFExternalServiceMode == "dsr" {
	dpConfig.BPFNodePortDSREnabled = true
}
When BPFExternalServiceMode is "dsr", BPFNodePortDSREnabled is set to true.
Initialize and start the dataplane driver:
intDP := intdataplane.NewIntDataplaneDriver(dpConfig)
intDP.Start()
During initialization the iptables chains are set up; in BPF mode, iptables is still used for raw egress policy.
Initialize the BPF maps
The trailing comments below give each map's size and name:
bpfMapContext := bpfmap.CreateBPFMapContext(
	config.BPFMapSizeIPSets,      // 1048576 // cali_v4_ip_sets
	config.BPFMapSizeNATFrontend, // 65536   // cali_v4_nat_fe3
	config.BPFMapSizeNATBackend,  // 262144  // cali_v4_nat_be
	config.BPFMapSizeNATAffinity, // 65536   // cali_v4_nat_aff
	config.BPFMapSizeRoute,       // 262144  // cali_v4_routes
	config.BPFMapSizeConntrack,   // 512000  // cali_v4_ct3
	config.BPFMapSizeIfState,     // 1000
	config.BPFMapRepin,           // false
)
bpfMapContext
bpfMapContext := &bpf.MapContext{
	RepinningEnabled: repinEnabled,
	MapSizes:         map[string]uint32{},
}
In the MapSizes map, each key is the map name plus version info, and each value is the map's entry count:
RepinningEnabled: false
cali_v4_ip_sets: 1048576
cali_v4_nat_fe3: 65536
cali_v4_nat_be: 262144
cali_v4_nat_aff: 65536
cali_v4_routes: 262144
cali_v4_ct3: 512000
cali_state2: 1
cali_v4_arp2: 10000
cali_v4_fsafes2: 65536
cali_v4_srmsg: 510000
cali_v4_ct_nats: 10000
cali_iface2: 1000
// Values are then set for each of the maps in bpfMapContext; for example:
m := &PinnedMap{
	context:       c,      // pointer to the bpfMapContext
	MapParameters: params, // map parameters
	perCPU:        strings.Contains(params.Type, "percpu"), // whether this is a per-CPU map
}
The maps are defined under felix/bpf/.
- Once defined, any map that does not yet exist on Linux is created:
cmd := exec.Command("bpftool", "map", "create", b.versionedFilename(),
	"type", b.Type,
	"key", fmt.Sprint(b.KeySize),
	"value", fmt.Sprint(b.ValueSize),
	"entries", fmt.Sprint(b.MaxEntries),
	"name", b.VersionedName(),
	"flags", fmt.Sprint(b.Flags),
)
Register the managers
Next, all managers are registered with the dataplane:
ipSetsV4, built from IpsetsMap, is added to ipsetsManager
bpfRTMgr is built from RouteMap
bpfEndpointManager is built from IfStateMap and ArpMap
conntrackScanner is built from CtMap and starts scanning
kube-proxy is started with FrontendMap, BackendMap, AffinityMap, and CtMap:
kp, err := bpfproxy.StartKubeProxy(
	config.KubeClientSet,
	config.Hostname,
	bpfMapContext,
	bpfproxyOpts...,
)
if err != nil {
	log.WithError(err).Panic("Failed to start kube-proxy.")
}
bpfRTMgr.setHostIPUpdatesCallBack(kp.OnHostIPsUpdate)
bpfRTMgr.setRoutesCallBacks(kp.OnRouteUpdate, kp.OnRouteDelete)
conntrackScanner.AddUnlocked(conntrack.NewStaleNATScanner(kp))
conntrackScanner.Start()
Install the connect-time load balancer program; a feature gate controls UDP support:
err = nat.InstallConnectTimeLoadBalancer(
	config.BPFCgroupV2, config.BPFLogLevel, config.BPFConntrackTimeouts.UDPLastSeen, bpfMapContext, excludeUDP)
// bpfMount: /sys/fs/bpf; udpNotSeen: 60000000000
err = installProgram("connect", "4", bpfMount, cgroupPath, logLevel, udpNotSeen, bpfMc.MapSizes, excludeUDP)
if err != nil {
	return err
}
err = installProgram("connect", "6", bpfMount, cgroupPath, logLevel, udpNotSeen, bpfMc.MapSizes, excludeUDP)
if err != nil {
	return err
}
This opens the object file /usr/lib/calico/bpf/connect_time_no_log_v4_co-re.o and loops over its maps, calling SetPinPath to pin each one under /sys/fs/bpf/tc/globals/.
epManager, newFloatingIPManager, newMasqManager, etc. are then registered with the dataplane as well.
dp.Start
Configure kernel parameters:
/proc/sys/net/ipv4/ip_forward = 1
/proc/sys/net/ipv6/conf/all/forwarding = 1
/proc/sys/kernel/unprivileged_bpf_disabled = 1
Start all worker goroutines.
syncfelix
From here it is the same syncfelix flow described in section 4.3.
Code analysis of resource changes
syncer
When the syncer is instantiated it watches many resources; when a resource-change message arrives, it is passed toward the calc graph through the syncerToValidator callback.
syncerToValidator maintains a channel: on receiving an update message it sends it on to the asyncCalcGraph, which ultimately notifies every manager via OnUpdates and pushes the result to the dataplane.
The update happens in two steps:
the first updates the managers' state; the second is an apply that makes the changes take effect in the dataplane.
The dataplane loop goroutine receives three kinds of input:
- messages for the dataplane arriving on toDataplane
- interface changes
- address changes
All managers are notified via OnUpdate, and finally the pending changes are applied to the dataplane.
Pod
A pod corresponds to a WorkloadEndpoint in Calico; the syncer receives the WorkloadEndpoint update:
- apiVersion: projectcalico.org/v3
  kind: WorkloadEndpoint
  metadata:
    creationTimestamp: "2023-02-25T07:35:43Z"
    labels:
      projectcalico.org/namespace: default
      projectcalico.org/orchestrator: k8s
      projectcalico.org/serviceaccount: default
    name: node111-k8s-pod2-eth0
    namespace: default
    resourceVersion: "755704"
    uid: f2f486fe-3af9-4274-aaca-440828ca41f2
  spec:
    containerID: 9b37a494251d11e53d66be4e1d9d6a056c56af43cb00a311d417ec0f905a9534
    endpoint: eth0
    interfaceName: calibd2348b4f67
    ipNetworks:
    - 10.244.213.11/32
    node: node111
    orchestrator: k8s
    pod: pod2
    ports:
    - hostIP: ""
      hostPort: 0
      name: nginx-port
      port: 80
      protocol: TCP
    profiles:
    - kns.default
    - ksa.default.default
    serviceAccountName: default
bpfEndpointManager.OnUpdate
bpfEndpointManager struct:
allWEPs             map[proto.WorkloadEndpointID]*proto.WorkloadEndpoint
policiesToWorkloads map[proto.PolicyID]set.Set[any]  /* FIXME proto.WorkloadEndpointID or string (for a HEP) */
profilesToWorkloads map[proto.ProfileID]set.Set[any] /* FIXME proto.WorkloadEndpointID or string (for a HEP) */
allWEPs holds the information for every WorkloadEndpoint.
The WorkloadEndpoint is stored into bpfEndpointManager's allWEPs; while storing it, the endpoint's policies are re-added to policiesToWorkloads.
allWEPs map[proto.WorkloadEndpointID]*proto.WorkloadEndpoint
<orchestrator_id:"k8s" workload_id:"default/pod2" endpoint_id:"eth0" >
type WorkloadEndpointID struct {
	OrchestratorId // "k8s"
	WorkloadId     // "default/pod2"
	EndpointId     // "eth0"
}
endpoint:<state:"active" name:"calibd2348b4f67" profile_ids:"kns.default" profile_ids:"ksa.default.default" ipv4_nets:"10.244.213.11/32" >
After the manager is updated, changes that need to take effect in the dataplane go through
d.apply()
which runs every manager's CompleteDeferredWork():
- Set /proc/sys/net/ipv4/conf/calibd2348b4f67/accept_local = 1
- Store the WorkloadEndpoint and interface name in happyWEPs, and add the interface to the iptables allow list
- When the interface state changes to up, set rp_filter = 2 and accept_local = 1
- applyPolicy applies policy to the workload; in the BPF case this is attachWorkloadProgram
- Find the workload's attach points; if the program is not yet attached, attach it:
Program attached to TC. attachPoint=&tc.AttachPoint{Type:"workload", ToOrFrom:"to", Hook:"egress", Iface:"calibd2348b4f67", LogLevel:"off", HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x64, 0x6f}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, TunnelMTU:0x5aa, VXLANPort:0x12b5, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, MapSizes:map[string]uint32{"cali_iface2":0x3e8, "cali_state2":0x1, "cali_v4_arp2":0x2710, "cali_v4_ct3":0x7d000, "cali_v4_ct_nats":0x2710, "cali_v4_fsafes2":0x10000, "cali_v4_ip_sets":0x100000, "cali_v4_nat_aff":0x10000, "cali_v4_nat_be":0x40000, "cali_v4_nat_fe3":0x10000, "cali_v4_routes":0x40000, "cali_v4_srmsg":0x7c830}, RPFEnforceOption:0x1, NATin:0x0, NATout:0x0}
Program attached to TC. attachPoint=&tc.AttachPoint{Type:"workload", ToOrFrom:"from", Hook:"ingress", Iface:"calibd2348b4f67", LogLevel:"off", HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x64, 0x6f}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, TunnelMTU:0x5aa, VXLANPort:0x12b5, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, MapSizes:map[string]uint32{"cali_iface2":0x3e8, "cali_state2":0x1, "cali_v4_arp2":0x2710, "cali_v4_ct3":0x7d000, "cali_v4_ct_nats":0x2710, "cali_v4_fsafes2":0x10000, "cali_v4_ip_sets":0x100000, "cali_v4_nat_aff":0x10000, "cali_v4_nat_be":0x40000, "cali_v4_nat_fe3":0x10000, "cali_v4_routes":0x40000, "cali_v4_srmsg":0x7c830}, RPFEnforceOption:0x1, NATin:0x0, NATout:0x0}
AttachProgram
binaryToLoad: to_wep_no_log_co-re.o
SectionName: calico_%s_%s_ep; fromOrTo (from = ingress, to = egress), workload
ifName: calibd2348b4f67
Hook: ingress, egress
The attach process
- loadobject
Load the object with libbpf:
libbpf.OpenObject(file)
Iterate over the maps, set each map's maximum size, and pin it:
# ls /sys/fs/bpf/tc/globals/
cali_iface2 cali_rule_ctrs2 cali_state2 cali_v4_arp2 cali_v4_ct3 cali_v4_ct_nats cali_v4_fsafes2 cali_v4_ip_sets cali_v4_nat_aff cali_v4_nat_be cali_v4_nat_fe3 cali_v4_routes cali_v4_srmsg
// pin paths specific to the interface
# ls /sys/fs/bpf/tc/calibd2348b4f67_egr/
cali_counters1 cali_jump2
# ls /sys/fs/bpf/tc/calibd2348b4f67_igr/
cali_counters1 cali_jump2
// Set the pin path; the pinned map fd lets the user-space program exchange data with the kernel.
C.bpf_map__set_pin_path(m.bpfMap, cPath)
// UpdateJumpMap
C.bpf_update_jump_map(o.obj, cMapName, cProgName, C.int(mapIndex))
mapName:cali_jump2
After attaching:
# tc filter show dev calibd2348b4f67 ingress
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 calico_from_wor:[651] direct-action not_in_hw id 651 name calico_from_wor tag ed575173e2394260 jited
# tc filter show dev calibd2348b4f67 egress
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 calico_to_workl:[649] direct-action not_in_hw id 649 name calico_to_workl tag 0cbb6af82c8023a6 jited
Get the program ID: 651.
Inspect the attachment with bpftool:
# kubectl exec -it calico-node-mq4rw -n calico-system -- bpftool prog show id 651 --json
{"id":651,"type":"sched_cls","name":"calico_from_wor","tag":"ed575173e2394260","gpl_compatible":true,"loaded_at":1677310544,"uid":0,"bytes_xlated":13104,"jited":true,"bytes_jited":8354,"bytes_memlock":16384,"map_ids":[44,226,227,225,45,51,50,47,49,48,53],"btf_id":271}
MapIDs [44,226,227,225,45,51,50,47,49,48,53]
Use each map ID to get the map fd, then use the map fd to get the map info:
C.bpf_attr_setup_obj_get_id(bpfAttr, C.uint(mapID), 0)
mapInfo, err := bpf.GetMapInfo(mapFD)
The fd is stored in the manager:
iface.dpState.jumpMapFDs[ap.JumpMapFDMapKey()] = fd
networkpolicy
apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
  kind: NetworkPolicy
  metadata:
    creationTimestamp: "2023-02-26T03:15:35Z"
    name: allow-tcp-80
    namespace: default
    resourceVersion: "890028"
    uid: 194f329c-111b-469a-bb27-17b79a819d63
  spec:
    ingress:
    - action: Allow
      destination:
        ports:
        - 80
      protocol: TCP
      source:
        selector: role == 'pod1'
    selector: app == 'nginx'
    types:
    - Ingress
kind: NetworkPolicyList
metadata:
  resourceVersion: "892050"
NetworkPolicies are managed by the ActiveRulesCalculator:
allPolicies map[model.PolicyKey]*model.Policy
key: Policy(name=default/default.allow-tcp-80)
selectorsById[id] = sel
sel: selector
The matching IPs are added to the ipset:
"s:SPeglQlTBmfidv00S2cBaDCQ11JHPfyxQSqeOw" members: "10.244.213.6/32"