Scenario:
One ingress-nginx is deployed in each subnet of the VPC. The ingress pod must reach the host api-server at startup, but the host network is isolated from the VPC subnets and cannot connect directly, so egress is provided by configuring floatingip and eip rules on a vpc-nat-gw gateway. There is one vpc-nat-gw per VPC (it lives in its own dedicated default subnet), and all other subnets in the VPC egress through that gateway.
This is a secondary development on top of open-source kube-ovn, extending it to support cross-cluster custom VPC networks.
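For context, in upstream kube-ovn of that era (before the EIP/FIP objects were split into separate CRDs) the gateway described above would be declared roughly as follows. This is a sketch only: the fork in question is customized, the subnet and gateway-address values are illustrative, and the IPs are taken from the transcripts in this note.

```yaml
apiVersion: kubeovn.io/v1
kind: VpcNatGateway
metadata:
  name: vpc-default-xuan-test-1-ngw
spec:
  vpc: vpc-default-xuan-test-1      # the custom VPC
  subnet: sn-default                # the gateway's own default subnet (illustrative name)
  lanIp: 192.168.0.254              # gateway address inside the VPC; matches eth0 below
  eips:
    - eipCIDR: 192.168.184.1/24
      gateway: 192.168.184.254      # external gateway, illustrative
  floatingIpRules:
    - eip: 192.168.184.1
      internalIp: 192.168.0.253     # internal pod IP, per the DNAT/SNAT rules below
```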
Symptoms: gateway connectivity is intermittent.
- Sometimes everything works, including cross-cluster access from the ingress pods (i.e. from their subnets).
- Sometimes connectivity is broken right after creation and the ingress pod cannot start; after restarting the vpc-nat-gw pod, connectivity comes back and everything is normal.
- Sometimes connectivity works at creation but breaks when rules are added for another subnet in the same VPC (i.e. when the vpc-nat-gw is updated); subnets that previously worked lose connectivity at the same time.
Configuration inside a vpc-nat-gw pod where floatingip has taken effect:
[root@host-master ~]# kubectl exec -it vpc-nat-gw-vpc-default-xuan-test-1-ngw-67fb65f6f9-5grsj -n kube-system sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/kube-ovn # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether f6:4a:e3:21:11:af brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.184.1/24 scope global net1
valid_lft forever preferred_lft forever
inet 192.168.184.2/24 scope global secondary net1
valid_lft forever preferred_lft forever
inet 192.168.184.3/24 scope global secondary net1
valid_lft forever preferred_lft forever
11138: eth0@if11139: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 00:00:00:c3:71:b4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.0.254/24 brd 192.168.0.255 scope global eth0
valid_lft forever preferred_lft forever
/kube-ovn # route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.1 0.0.0.0 UG 0 0 0 eth0
192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
192.168.184.0 0.0.0.0 255.255.255.0 U 0 0 0 net1
/kube-ovn # iptables -t nat -nvL
............
Chain EXCLUSIVE_DNAT (2 references)
pkts bytes target prot opt in out source destination
91 5484 DNAT all -- * * 0.0.0.0/0 192.168.184.1 to:192.168.0.253
32 1920 DNAT all -- * * 0.0.0.0/0 192.168.184.2 to:192.168.1.253
44 2640 DNAT all -- * * 0.0.0.0/0 192.168.184.3 to:192.168.2.253
Chain EXCLUSIVE_SNAT (2 references)
pkts bytes target prot opt in out source destination
2 120 SNAT all -- * * 192.168.0.253 0.0.0.0/0 to:192.168.184.1
1 60 SNAT all -- * * 192.168.1.253 0.0.0.0/0 to:192.168.184.2
1 60 SNAT all -- * * 192.168.2.253 0.0.0.0/0 to:192.168.184.3
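Each floatingip thus materializes as one DNAT/SNAT pair in the gateway pod. For reference, in iptables-save form the first mapping above would correspond roughly to the fragment below (a reconstruction from the counters output, not captured from the pod):

```
*nat
:EXCLUSIVE_DNAT - [0:0]
:EXCLUSIVE_SNAT - [0:0]
-A EXCLUSIVE_DNAT -d 192.168.184.1/32 -j DNAT --to-destination 192.168.0.253
-A EXCLUSIVE_SNAT -s 192.168.0.253/32 -j SNAT --to-source 192.168.184.1
COMMIT
```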
Configuration inside a vpc-nat-gw pod where floatingip has NOT taken effect:
Internal pod IPs can still be pinged, but the second part of the configuration is incomplete: net1 has no addresses, route -n is missing the corresponding routing rules, and iptables -t nat -nvL shows no SNAT/DNAT rules.
[root@host-master ~]# kubectl exec -it vpc-nat-gw-vpc-subnet-test-2-ngw-65ddfbcc67-wh5cw -n kube-system sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/kube-ovn # ping 192.168.1.253
PING 192.168.1.253 (192.168.1.253) 56(84) bytes of data.
64 bytes from 192.168.1.253: icmp_seq=1 ttl=63 time=1.68 ms
^C
--- 192.168.1.253 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.682/1.682/1.682/0.000 ms
/kube-ovn # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether b2:bd:6a:31:f0:28 brd ff:ff:ff:ff:ff:ff link-netnsid 0
11466: eth0@if11467: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 00:00:00:a2:df:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.255.254/30 brd 192.168.255.255 scope global eth0
valid_lft forever preferred_lft forever
/kube-ovn # route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.1 0.0.0.0 UG 0 0 0 eth0
192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
192.168.184.0 0.0.0.0 255.255.255.0 U 0 0 0 net1
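The broken state above can be detected mechanically instead of by eyeballing the rule dump. Below is a minimal sketch; check_fip_rules is a hypothetical helper of mine, not part of kube-ovn. Given the text of iptables -t nat -S captured from the gateway pod, it reports which expected floatingip mappings lack their DNAT or SNAT rule:

```shell
#!/bin/sh
# check_fip_rules RULES EIP:INTERNAL_IP ...
# RULES is the output of `iptables -t nat -S` from the vpc-nat-gw pod, e.g.
#   rules=$(kubectl -n kube-system exec <gw-pod> -- iptables -t nat -S)
# For each expected mapping, report whether its DNAT/SNAT rules are present.
check_fip_rules() {
    rules=$1; shift
    rc=0
    for pair in "$@"; do
        eip=${pair%%:*}     # external floating IP
        iip=${pair##*:}     # internal pod IP
        case "$rules" in
            *"-d $eip/32"*) ;;                        # DNAT for the EIP found
            *) echo "missing DNAT for $eip"; rc=1 ;;
        esac
        case "$rules" in
            *"-s $iip/32"*) ;;                        # SNAT for the pod IP found
            *) echo "missing SNAT for $iip"; rc=1 ;;
        esac
    done
    return $rc
}

# Example using the healthy rules from the first gateway pod:
rules='-A EXCLUSIVE_DNAT -d 192.168.184.1/32 -j DNAT --to-destination 192.168.0.253
-A EXCLUSIVE_SNAT -s 192.168.0.253/32 -j SNAT --to-source 192.168.184.1'
check_fip_rules "$rules" 192.168.184.1:192.168.0.253 && echo "fip rules complete"
```

This is only a presence check on the rule text (substring match, no parsing), but it is enough to distinguish the healthy gateway pod from the broken one shown above.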
Logs from a pod inside the VPC:
Timeouts: the pod cannot reach the external api-server because the gateway setup did not succeed; the pod itself was not restarted.
The log repeats many times because the pod keeps retrying the connection:
E0507 06:52:45.736777 7 reflector.go:153] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:181: Failed to list *v1beta1.Ingress: Get https://10.233.0.1:443/apis/networking.k8s.io/v1beta1/ingresses?limit=500&resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
The pod can still access local services.
Explanation: the pod started successfully at first; the gateway later changed and connectivity was lost, so the pod was never restarted. It just keeps reconnecting, which produces the log lines above.
[root@host-master ~]# kubectl exec -it elb-vuqkb -n ns-subnet3 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/etc/nginx $ curl 127.0.0.1
/etc/nginx $ curl 127.0.0.1
default backend - 404
/etc/nginx $ curl 192.168.1.253
Summary:
The symptoms above come down to the instability of the vpc-nat-gw: its connectivity is sometimes up and sometimes down.