Kubernetes CNI flannel: basic configuration and traffic capture


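Preface

Source: https://github.com/coreos/flannel/blob/master/Documentation/backends.md

Flannel is an open-source network solution from CoreOS that provides a layer 3 IPv4 network between the nodes of a Kubernetes cluster. How containers are networked to their host is outside flannel's scope; flannel only controls how traffic is transported between hosts. Flannel provides a CNI plugin for Kubernetes, along with guidance on integrating with Docker.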

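Flannel runs a small agent called flanneld on every host in the cluster. flanneld allocates a subnet to each host out of a preconfigured address space, and it uses either the Kubernetes API or etcd to store the network configuration, the allocated subnets, and any auxiliary data (such as each host's public IP). Packets are forwarded by a backend mechanism such as VXLAN, UDP, or host-gw.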
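
To make the subnet allocation idea concrete, here is a minimal Python sketch that carves per-host /24 subnets out of one address space. The 192.168.0.0/16 range, the /24 size, and the host names are assumptions chosen to match the test environment used later in this article; this only illustrates the idea and is not flanneld's actual allocation code.

```python
import ipaddress

def allocate_subnets(network, hosts, prefixlen=24):
    """Give each host the next unused subnet of the given prefix length."""
    pool = ipaddress.ip_network(network).subnets(new_prefix=prefixlen)
    return {host: next(pool) for host in hosts}

# Assumed address space and host names (not taken from a real flannel config).
leases = allocate_subnets("192.168.0.0/16", ["master", "node01"])
for host, subnet in leases.items():
    print(host, subnet)
```

Each host ends up owning one subnet, and pods on that host get their addresses from it.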

This article verifies pod communication under flannel's commonly used modes.

Flannel may be paired with several different backends. Once set, the backend should not be changed at runtime.

VXLAN is the recommended choice. host-gw is recommended for more experienced users who want the performance improvement and whose infrastructure supports it (typically it can’t be used in cloud environments). UDP is suggested for debugging only or for very old kernels that don’t support VXLAN.

AWS, GCE, and AliVPC are experimental and unsupported. Proceed at your own risk.


Recommended backends

VXLAN

Use in-kernel VXLAN to encapsulate the packets.

Type and options:

  • Type (string): vxlan
  • VNI (number): VXLAN Identifier (VNI) to be used. On Linux, defaults to 1. On Windows should be greater than or equal to 4096.
  • Port (number): UDP port to use for sending encapsulated packets. On Linux, defaults to kernel default, currently 8472, but on Windows, must be 4789.
  • GBP (Boolean): Enable VXLAN Group Based Policy. Defaults to false. GBP is not supported on Windows.
  • DirectRouting (Boolean): Enable direct routes (like host-gw) when the hosts are on the same subnet. VXLAN will only be used to encapsulate packets to hosts on different subnets. Defaults to false. DirectRouting is not supported on Windows.
  • MacPrefix (String): Only use on Windows, set to the MAC prefix. Defaults to 0E-2A.
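
The options above go into the Backend section of flannel's net-conf.json. The Python sketch below builds such a document and applies the Windows constraints from the list. The function name vxlan_net_conf and its platform parameter are my own illustration. The top-level Network key comes from flannel's configuration documentation rather than from the list above, and the 192.168.0.0/16 value is an assumption.

```python
import json

def vxlan_net_conf(network, platform="linux", vni=None, port=None, direct_routing=False):
    """Build a net-conf.json style dict for the vxlan backend (illustrative helper)."""
    if platform == "windows":
        # Constraints stated in the option list above.
        vni = 4096 if vni is None else vni
        port = 4789 if port is None else port
        if vni < 4096:
            raise ValueError("on Windows the VNI must be >= 4096")
        if port != 4789:
            raise ValueError("on Windows the port must be 4789")
        if direct_routing:
            raise ValueError("DirectRouting is not supported on Windows")
    backend = {"Type": "vxlan"}
    # Options left unset fall back to flannel's defaults (VNI 1, kernel default port).
    if vni is not None:
        backend["VNI"] = vni
    if port is not None:
        backend["Port"] = port
    if direct_routing:
        backend["DirectRouting"] = True
    return {"Network": network, "Backend": backend}

conf = vxlan_net_conf("192.168.0.0/16", direct_routing=True)
print(json.dumps(conf, indent=2))
```

On Linux it is usually enough to set only Type and leave the remaining options at their defaults.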

host-gw

Use host-gw to create IP routes to subnets via remote machine IPs. Requires direct layer2 connectivity between hosts running flannel.

host-gw provides good performance, with few dependencies, and easy setup.

Type:

  • Type (string): host-gw
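
As a rough sketch of what host-gw amounts to, the Python below turns a list of peers into plain ip route commands and refuses peers that are not on the local segment. Being in the same IP subnet is used here as a stand-in for layer 2 adjacency, which is a simplification. The addresses are borrowed from the test environment later in this article, and the function name is hypothetical; flannel itself programs routes through netlink rather than shell commands.

```python
import ipaddress

def host_gw_routes(local_ip, local_prefix, peers):
    """Return 'ip route' commands for peers on the local segment.

    peers maps a node IP to the pod subnet allocated to that node.
    """
    segment = ipaddress.ip_interface(f"{local_ip}/{local_prefix}").network
    routes = []
    for node_ip, pod_subnet in peers.items():
        if ipaddress.ip_address(node_ip) not in segment:
            raise ValueError(f"{node_ip} is outside {segment}; host-gw needs layer 2 adjacency")
        routes.append(f"ip route add {pod_subnet} via {node_ip}")
    return routes

# node01 (172.16.0.49/20) learning the route to master's pod subnet.
routes = host_gw_routes("172.16.0.49", 20, {"172.16.0.5": "192.168.0.0/24"})
print("\n".join(routes))
```

No tunnel device is involved: the remote node's own IP is simply the next hop for its pod subnet.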

UDP

Use UDP only for debugging if your network and kernel prevent you from using VXLAN or host-gw. With this backend the encapsulation is performed by a user-space process.

Type and options:

  • Type (string): udp
  • Port (number): UDP port to use for sending encapsulated packets. Defaults to 8285.

Experimental backends

The following options are experimental and unsupported at this time.

AliVPC

Use AliVPC to create IP routes in an AliCloud VPC route table when running in an AliCloud VPC. This mitigates the need to create a separate flannel interface.

Requirements:

  • Running on an ECS instance that is in an AliCloud VPC.

  • Permissions required: an access key ID and an access key secret.

    • Type (string): ali-vpc
    • AccessKeyID (string): API access key ID. Can also be configured with environment ACCESS_KEY_ID.
    • AccessKeySecret (string): API access key secret. Can also be configured with environment ACCESS_KEY_SECRET.

Route Limits: AliCloud VPC limits the number of entries per route table to 50.

Alloc

Alloc performs subnet allocation with no forwarding of data packets.

Type:

  • Type (string): alloc

AWS VPC

Recommended when running within an Amazon VPC, AWS VPC creates IP routes in an Amazon VPC route table. Because AWS knows about the IP, it is possible to set up ELB to route directly to that container.

Requirements:

  • Running on an EC2 instance that is in an Amazon VPC.
  • Permissions required: CreateRoute, DeleteRoute, DescribeRouteTables, ModifyNetworkInterfaceAttribute, DescribeInstances (optional)

Type and options:

  • Type (string): aws-vpc

  • RouteTableID (string): [optional] The ID of the VPC route table to add routes to.

    • The route table must be in the same region as the EC2 instance that flannel is running on.
    • Flannel can automatically detect the ID of the route table if the optional DescribeInstances is granted to the EC2 instance.

Authentication is handled via either environment variables or the node’s IAM role. If the node has insufficient privileges to modify the VPC routing table specified, ensure that appropriate AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SECURITY_TOKEN environment variables are set when running the flanneld process.

Route Limits: AWS limits the number of entries per route table to 50.

GCE

Use the GCE backend when running on Google Compute Engine Network. Instead of using encapsulation, GCE manipulates IP routes to achieve maximum performance. Because of this, a separate flannel interface is not created.

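Requirements:

  • IP forwarding enabled for the instances.
  • An instance service account with read-write compute permissions.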

Type:

  • Type (string): gce

Command to create a compute instance with the correct permissions and IP forwarding enabled:

  $ gcloud compute instances create INSTANCE --can-ip-forward --scopes compute-rw

Route Limits: GCE limits the number of routes for every project to 100 by default.

IPIP

Use in-kernel IPIP to encapsulate the packets.

IPIP is the simplest kind of tunnel. It has the lowest overhead but can encapsulate only IPv4 unicast traffic, so you will not be able to set up OSPF, RIP, or any other multicast-based protocol.

Type:

  • Type (string): ipip
  • DirectRouting (Boolean): Enable direct routes (like host-gw) when the hosts are on the same subnet. IPIP will only be used to encapsulate packets to hosts on different subnets. Defaults to false.
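
The DirectRouting decision can be pictured as a per-peer choice. The Python sketch below is only an approximation of that rule: it treats "same subnet" as "the remote host IP falls inside the local interface's network". The function name and the second address (172.16.32.7) are hypothetical.

```python
import ipaddress

def pick_transport(local_iface, remote_ip):
    """Return 'direct' when the remote host shares the local subnet, else 'encap'."""
    local = ipaddress.ip_interface(local_iface)
    return "direct" if ipaddress.ip_address(remote_ip) in local.network else "encap"

print(pick_transport("172.16.0.49/20", "172.16.0.5"))   # peer on the same subnet
print(pick_transport("172.16.0.49/20", "172.16.32.7"))  # peer on another subnet
```

With DirectRouting left at false, every peer takes the encapsulated path regardless of where it sits.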

Note that two IPIP tunnel devices may exist, tunl0 and flannel.ipip; this is expected and is not a bug. tunl0 is created automatically in each network namespace by the ipip kernel module when the module is loaded (modprobe ipip). It is the namespace's default IPIP device, with attributes local=any and remote=any. When the kernel receives IPIP packets, it forwards them to tunl0 as a fallback if it cannot find a device whose local/remote attributes match the packets' source/destination IP addresses more precisely. flannel.ipip is created by flannel to implement a one-to-many IPIP network.

IPSec

Use in-kernel IPSec to encapsulate and encrypt the packets.

Strongswan is used as the IKEv2 daemon. A single pre-shared key is used for the initial key exchange between hosts and then Strongswan ensures that keys are rotated at regular intervals.

Type:

  • Type (string): ipsec
  • PSK (string): Required. The pre-shared key to use. It needs to be at least 96 characters long. One method for generating this key is to run dd if=/dev/urandom count=48 bs=1 status=none | xxd -p -c 48
  • UDPEncap (Boolean): Optional, defaults to false. Forces the use of UDP encapsulation of packets, which can help with some NAT gateways.
  • ESPProposal (string): Optional, defaults to aes128gcm16-sha256-prfsha256-ecp256. Change this string to choose another ESP Proposal.
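
The dd and xxd pipeline for the PSK reads 48 random bytes and prints them as hex, which gives 96 characters. An equivalent sketch in Python, assuming only the standard library (the function name is mine):

```python
import secrets

def generate_psk(num_bytes=48):
    """Return num_bytes of random data as lowercase hex (two characters per byte)."""
    return secrets.token_hex(num_bytes)

psk = generate_psk()
print(len(psk))  # 48 bytes -> 96 hex characters
```

The same key has to be configured on every host, since it is used for the initial key exchange between them.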

Hint: Add rules to your firewall: allow IP protocol 50 (ESP), UDP port 500 (IKE, used to manage encryption keys), and UDP port 4500 (IPsec NAT-Traversal mode).

Troubleshooting

Logging

  • When flannel is run from a container, the Strongswan tools are installed. swanctl can be used for interacting with the charon and it provides a logs command…
  • Charon logs are also written to the stdout of the flannel process.

Troubleshooting

  • ip xfrm state can be used to interact with the kernel’s security association database. This can be used to show the current security associations (SA) and whether a host is successfully establishing ipsec connections to other hosts.
  • ip xfrm policy can be used to show the installed policies. Flannel installs three policies for each host it connects to.

Flannel will not restore policies that are manually deleted (unless flannel is restarted). It will also not delete stale policies on startup. They can be removed by rebooting your host or by removing all ipsec state with ip xfrm state flush && ip xfrm policy flush and restarting flannel.

Hands-on verification

vxlan

Check the network mode of the Kubernetes cluster; the CNI plugin is flannel:
cat /etc/cni/net.d/10-flannel.conflist

The flannel backend type is vxlan; other options include udp and host-gw:
kubectl get configmap -n kube-system kube-flannel-cfg -o yaml | grep -i type

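Pod-to-pod communication on the same node

Communication between pods on the same node follows a fairly simple path.

Traffic is forwarded at layer 2 directly across the cni0 bridge. It still traverses the iptables rules along the way, so if a connection fails, check iptables.

client pod --> client veth --> cni0 --> server veth --> server pod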

Pod-to-pod communication across nodes

client pod : centos-test-764675d946-tx4r5 192.168.1.2 node01
server pod : nginx-nets-7848d4b86f-ghpc2 192.168.0.15 master

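Data path:
client pod --> veth device --> cni0 (bridge) --> flannel.1 --> node01 eth0/bond --> master eth0/bond --> flannel.1 --> cni0 --> veth --> server pod

  • veth --> cni0: the same-subnet entry in the ip route table matches the cni0 interface.
  • cni0 --> flannel.1: the cross-subnet (cross-node) entry in the ip route table matches the flannel.1 interface, which in turn leads to a lookup in the bridge fdb table.
  • flannel.1 --> eth0/bond: the packet is encapsulated according to the backend type.

Use the interface index shown by ip inside the pod to identify the directly connected veth device on the node.
In the example shown, the pod's interface is eth0@if6, so its peer on the node is the veth device with index 6.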
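
The index matching can be scripted. The Python sketch below pulls the peer index out of a name such as eth0@if6 and looks it up in ip -o link style output. The sample output is hypothetical and trimmed; in particular, which veth device actually has index 6 on node01 is an assumption.

```python
import re

def peer_ifindex(pod_iface):
    """Extract the peer interface index from a name such as 'eth0@if6'."""
    match = re.fullmatch(r"[^@]+@if(\d+)", pod_iface)
    if match is None:
        raise ValueError(f"not a veth-style name: {pod_iface!r}")
    return int(match.group(1))

def find_host_device(ip_link_output, ifindex):
    """Find the device with the given index in `ip -o link` style output."""
    for line in ip_link_output.splitlines():
        index, _, rest = line.partition(": ")
        if index.strip().isdigit() and int(index) == ifindex:
            # "veth83529d14@if3: <...>" -> "veth83529d14"
            return rest.split(":")[0].split("@")[0]
    return None

# Hypothetical node output, trimmed to the fields that matter here.
sample = """\
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
6: veth83529d14@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
7: vethef816c43@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450"""

print(find_host_device(sample, peer_ifindex("eth0@if6")))
```

The returned name is the interface to pass to tcpdump -i on the node.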

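Access from inside the pod:

The pod's packets can be captured directly on that veth interface.

After the veth device, the packet is forwarded by the cni0 interface: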
tcpdump -i cni0 -nne -vvv

tcpdump -i flannel.1 -nne -vv

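After leaving the flannel.1 interface, the packet is handled according to the flannel backend type: with udp it is passed to user space for forwarding, while with vxlan it is encapsulated and forwarded by the Linux kernel.

The current type is vxlan, so packets leaving the node are encapsulated as VXLAN packets.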
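
To show what that encapsulation adds, here is a small Python sketch that packs and parses the 8-byte VXLAN header defined in RFC 7348 and works out the per-packet overhead. The 1500-byte physical MTU is an assumption, as are IPv4 outer headers without options; the helper names are mine.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348

def vxlan_header(vni):
    """Pack the VXLAN header: 8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved bits."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def parse_vni(header):
    """Return the VNI from an 8-byte VXLAN header."""
    _, word = struct.unpack("!II", header[:8])
    return word >> 8

# Bytes placed in front of the inner IP packet:
# outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14).
OVERHEAD = 20 + 8 + 8 + 14

hdr = vxlan_header(1)
print(hdr.hex())        # 0800000000000100
print(parse_vni(hdr))   # 1
print(1500 - OVERHEAD)  # 1450
```

The last number lines up with the mtu 1450 reported for flannel.1 and cni0 in the ip output further down.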

# Filter on the UDP protocol directly, or on the inner IP address of the VXLAN packet:
# tcpdump -i bond1 -nne -vvv udp -w /tmp/client_10-26-0-5_bond1_udp.pcapng
# tcpdump -i bond1 -nne -vvv 'udp[42:4]=0xC0A813A9 or udp[46:4]=0xC0A813A9' -w /tmp/client_node_10-26-0-5_vxlan_192.168.19.169.pcapng

# In this environment the client's inner IP is 192.168.1.2, i.e. 0xC0A80102.
# This test environment has no bond and only one NIC, so capturing on eth0 shows the traffic leaving this node for the other node:
tcpdump -i eth0 -nne -vvv 'udp[42:4]=0xC0A80102 or udp[46:4]=0xC0A80102' -w /tmp/client_pod_node_vxlan_192.168.1.2.pcapng
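
The hex value and the offsets 42 and 46 in the filter do not need to be computed by hand. The Python sketch below derives both, assuming the inner frame carries no VLAN tag; the function names are mine.

```python
import ipaddress

def ip_to_hex(ip):
    """Render an IPv4 address as a 32-bit hex literal for a tcpdump filter."""
    return f"0x{int(ipaddress.IPv4Address(ip)):08X}"

def inner_ip_filter(ip):
    """Match VXLAN packets whose inner source or destination IP equals ip.

    Offsets count from the start of the outer UDP header:
    8 (UDP) + 8 (VXLAN) + 14 (inner Ethernet) = 30 bytes to the inner IP header,
    whose source address sits at +12 and destination address at +16.
    """
    inner_ip = 8 + 8 + 14
    value = ip_to_hex(ip)
    return f"udp[{inner_ip + 12}:4]={value} or udp[{inner_ip + 16}:4]={value}"

print(ip_to_hex("192.168.1.2"))        # 0xC0A80102
print(inner_ip_filter("192.168.1.2"))
```

The printed filter is the same expression used in the tcpdump command above, so it can be pasted in for any other pod IP.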

# Install the brctl command: yum install bridge-utils -y
[root@node01 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
cni0		8000.be881c739f43	no		veth83529d14
							vethef816c43
docker0		8000.02423f569f49	no

# The matching route decides whether the packet is forwarded to flannel.1 or cni0
[root@node01 ~]# ip route show
default via 172.16.0.1 dev eth0
169.254.0.0/16 dev eth0 scope link metric 1002
172.16.0.0/20 dev eth0 proto kernel scope link src 172.16.0.49
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.0.0/24 via 192.168.0.0 dev flannel.1 onlink
192.168.1.0/24 dev cni0 proto kernel scope link src 192.168.1.1

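# The MAC address match determines which node the outer IP packet is sent to
# The FDB is the forwarding table of a layer 2 switch; it stores MAC-to-port mappings that determine which interface a frame leaves from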
[root@node01 ~]# bridge fdb show dev flannel.1
c2:a7:54:14:48:c8 dst 172.16.0.5 self permanent
[root@node01 ~]#

# The MAC is the address of the flannel.1 interface on master
[root@master ~]# ip a s | grep c2:a7:54:14:48:c8 -C 4
    link/ether 02:42:20:d0:eb:e4 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether c2:a7:54:14:48:c8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.0/32 brd 192.168.0.0 scope global flannel.1
       valid_lft forever preferred_lft forever
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether ca:f6:d8:42:55:5d brd ff:ff:ff:ff:ff:ff
[root@master ~]#
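
Putting the three tables together, the lookup on node01 can be walked through in a few lines of Python. The route and FDB entries are copied from the output above. The ARP entry is an assumption, since no ip neigh output is shown here, but it is consistent with the flannel.1 address and MAC seen on master. The first-match loop stands in for the kernel's longest-prefix match, which is equivalent for these non-overlapping prefixes.

```python
import ipaddress

ROUTES = [
    ("192.168.0.0/24", "192.168.0.0", "flannel.1"),  # remote pods, via master's flannel.1 address
    ("192.168.1.0/24", None, "cni0"),                # local pods, directly connected
]
ARP = {"192.168.0.0": "c2:a7:54:14:48:c8"}  # assumed neighbor entry on flannel.1
FDB = {"c2:a7:54:14:48:c8": "172.16.0.5"}   # from `bridge fdb show dev flannel.1`

def resolve(dst_ip):
    """Return (device, outer destination IP or None) for a pod destination."""
    addr = ipaddress.ip_address(dst_ip)
    for prefix, gateway, dev in ROUTES:
        if addr in ipaddress.ip_network(prefix):
            if gateway is None:
                return dev, None
            # Gateway IP -> VTEP MAC (ARP) -> remote node IP (FDB).
            return dev, FDB[ARP[gateway]]
    raise LookupError(f"no route to {dst_ip}")

print(resolve("192.168.0.15"))  # ('flannel.1', '172.16.0.5')
print(resolve("192.168.1.2"))   # ('cni0', None)
```

The first result is the cross-node case from this test: the server pod's address resolves to an outer destination of 172.16.0.5, the master node.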


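Communication flow between 10.20.1.4 and 10.20.1.3 in the figure:

  1. When Guest0 first sends a packet to destination 10.20.1.3, forwarding happens at layer 2. Guest0 looks up its own ARP table and, finding no entry, broadcasts an ARP request: who is 10.20.1.3?
  2. The vxlan device has local ARP proxying (proxy), l2miss, and l3miss enabled. When the packet reaches the vtep0 logical device and the host ARP table has no entry, vxlan raises an l3miss event. The ARP table maps layer 3 IP addresses to layer 2 MAC addresses and stores IP-MAC-NIC records; layer 2 forwarding often has to look up the MAC address for an IP address so that the frame can be delivered over the data link to the destination interface.
  3. The flannel daemon catches the l3miss event, queries the routing database stored in etcd, returns the MAC address e6:4b:f9:ce:d7:7b for 10.20.1.3, and stores it in the host ARP table.
  4. With the ARP entry now present, vtep0 answers with an ARP reply.
  5. Guest0 receives the ARP reply, stores the entry in its own ARP table, and starts sending data addressed to e6:4b:f9:ce:d7:7b.
  6. When the packet crosses the bridge, the FDB (forwarding database) is consulted: where should e6:4b:f9:ce:d7:7b be sent? If no entry matches, an l2miss event is raised. The FDB is the forwarding table of a layer 2 switch; it stores MAC-to-port mappings that determine which interface a frame leaves from.
  7. The flannel daemon catches the l2miss event and adds an FDB entry that sends packets destined for e6:4b:f9:ce:d7:7b to the peer host 192.168.100.3.
  8. The packet for e6:4b:f9:ce:d7:7b now flows to the vtep0 interface. vtep0 performs the UDP encapsulation, sets the VNI to 1, and tunnels the packet to the peer 192.168.100.3. The peer receives the vxlan packet, decapsulates it, and dispatches it to vtep0 according to the VNI. The decapsulated frame is handed back to the bridge, which forwards it to the matching veth interface based on the destination MAC address. This completes the forwarding of the packet.
  9. The return path is similar.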

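In its latest vxlan implementation flannel has dropped the l2miss & l3miss approach entirely: the flannel daemon no longer listens for netlink notifications and therefore no longer depends on DOVE. Instead it proactively adds a route to each remote host for that host's subnet, and assigns a layer 3 IP address to both the vtep and the bridge, so that once a packet reaches the destination host it is forwarded at layer 3 internally. The benefit is that a host no longer needs to know the layer 2 MAC address of every guest: addressing moves from layer 2 to layer 3, and the number of routes grows linearly with the number of hosts. The project claims that within one VNI each remote host needs only 1 route, 1 ARP entry, and 1 FDB entry.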
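
The "1 route, 1 ARP entry, 1 FDB entry" claim can be spelled out as the three commands one would run by hand for a single remote host. This Python sketch only formats iproute2 command strings; flannel programs the same state through netlink, and the function name is hypothetical. The values are the ones seen for master in the output earlier in this article.

```python
def entries_for_remote(pod_subnet, vtep_ip, vtep_mac, node_ip, dev="flannel.1"):
    """Return the route, neighbor, and FDB commands that describe one remote host."""
    return [
        # 1 route: the remote pod subnet is reached via the remote VTEP address.
        f"ip route add {pod_subnet} via {vtep_ip} dev {dev} onlink",
        # 1 ARP entry: the remote VTEP address maps to the remote VTEP MAC.
        f"ip neigh add {vtep_ip} lladdr {vtep_mac} dev {dev} nud permanent",
        # 1 FDB entry: the remote VTEP MAC maps to the remote node IP (outer destination).
        f"bridge fdb add {vtep_mac} dev {dev} dst {node_ip} self permanent",
    ]

cmds = entries_for_remote("192.168.0.0/24", "192.168.0.0", "c2:a7:54:14:48:c8", "172.16.0.5")
print("\n".join(cmds))
```

Adding a node to the cluster therefore adds three entries on every other node, no matter how many pods the new node runs.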

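Pod access to a Service

The Service is the most central concept in Kubernetes. Creating a Service gives a group of containers that provide the same function a single entry address, and requests are load-balanced across the backend containers.

Services have three common modes, the most common being iptables mode, in which kube-proxy on each node installs the corresponding iptables rules that carry the traffic.

Services also have their own planned address range, which can be defined when the cluster is initialized; see the deployment notes on my blog.

Because Services are a large topic, a separate article will cover them in detail.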

References

https://www.sdnlab.com/21143.html
https://blog.csdn.net/LL845876425/article/details/109427804
https://blog.csdn.net/LL845876425/article/details/108697314
