1. Fault Description
Kubernetes version involved: 1.20
After a service is published in the Kubernetes cluster, its service IP cannot be resolved or reached normally, and the service IP cannot be tested with telnet.
Redis is deployed in the default namespace as nw-redis, exposed through a ClusterIP-type Service:
# kubectl get pods,svc | grep nw-redis
pod/nw-redis-7c7b79cbff-xvs9j 1/1 Running 0 3h5m
service/nw-redis ClusterIP 10.1.0.92 <none> 6379/TCP 3h5m
On a node, telnet to the Redis service port works:
# telnet 10.1.0.92 6379
Trying 10.1.0.92...
Connected to 10.1.0.92.
Escape character is '^]'.
^]
Pinging the service IP gets no reply:
# ping 10.1.0.92
PING 10.1.0.92 (10.1.0.92) 56(84) bytes of data.
Start a busybox container and verify telnet, ping, and DNS resolution from inside the pod.
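The exact command used to launch the test pod was not recorded here; a typical way to start an interactive busybox pod for such checks looks like this (a sketch; the pod name dns-test is arbitrary):
# kubectl run dns-test -it --rm --image=busybox -- sh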
/ # nslookup nw-redis
Server: 10.1.0.2
Address 1: 10.1.0.2 kube-dns.kube-system.svc.cluster.local
Name: nw-redis
Address 1: 10.1.0.92 nw-redis.default.svc.cluster.local
/ # telnet 10.1.0.92 6379
/ # ping 10.1.0.92
Result: DNS resolution succeeds, but both telnet and ping fail. (Note that with kube-proxy in iptables mode a ClusterIP is a virtual address that normally does not answer ICMP, so the failed ping on its own is expected; the failed telnet from inside the pod is the real symptom.)
Fault conclusion: from inside pods, i.e. inside the cluster, service IPs cannot be reached, and services cannot be accessed via their service IP or service name.
2. Fault Analysis
Symptoms like this are usually caused by restrictions on the host network (firewall, kernel parameters), by a kube-proxy problem, or by a Calico problem.
3. Troubleshooting
1) Host network checks
Check that the firewall and SELinux are disabled:
# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
[root@k8s-master1 logs]# getenforce
Disabled
Adjust the kernel parameters:
vi /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
Apply the settings and verify:
# sysctl --system |grep net.bridge.bridge
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
# sysctl --system |grep ip_forward
net.ipv4.ip_forward = 1
2) Check the etcd cluster status
ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://10.3.9.237:2379,https://10.3.9.238:2379,https://10.3.9.239:2379" endpoint health --write-out=table
3) Check the kube-apiserver, kube-controller-manager, and kube-scheduler services
# systemctl restart kube-apiserver
# systemctl status kube-apiserver
# systemctl restart kube-controller-manager
# systemctl status kube-controller-manager
# systemctl restart kube-scheduler
# systemctl status kube-scheduler
Check kube-apiserver health with a direct API request:
# curl https://10.3.9.237:6443/healthz -k
ok
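As a quick cross-check (an optional extra, not part of the original steps), the componentstatuses API, deprecated since 1.19 but still served in 1.20, summarizes controller-manager, scheduler, and etcd health in one view:
# kubectl get componentstatuses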
4) Check kubelet and kube-proxy status on the worker nodes
kubelet is mainly responsible for container lifecycle management; since containers can be deployed and started successfully, kubelet looks fine, so the problem is more likely in kube-proxy.
# systemctl status kubelet
# systemctl status kube-proxy
The following message is reported:
conntrack returned error: error looking for path of conntrack: exec: "conntrack": executable file not found in $PATH
Fix:
# yum -y install socat conntrack-tools
The kube-proxy log on the k8s-master1 node contains the following errors:
[root@k8s-master1 logs]# cat kube-proxy.ERROR
Log file created at: 2022/01/07 13:40:12
Running on machine: k8s-master1
Binary: Built with gc go1.15.15 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0107 13:40:12.070977 10073 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.EndpointSlice: unknown (get endpointslices.discovery.k8s.io)
E0107 13:40:12.072082 10073 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: unknown (get services)
kube-proxy on the worker node is also in an abnormal state, with the following errors in its log:
# systemctl status kube-proxy -l
● kube-proxy.service - Kubernetes Proxy
Loaded: loaded (/usr/lib/systemd/system/kube-proxy.service; enabled; vendor preset: disabled)
Active: active (running) since 五 2021-12-31 22:52:07 CST; 3 days ago
Main PID: 1441 (kube-proxy)
Tasks: 19
Memory: 58.4M
CGroup: /system.slice/kube-proxy.service
└─1441 /opt/kubernetes/bin/kube-proxy --logtostderr=false --v=2 --log-dir=/opt/kubernetes/logs --config=/opt/kubernetes/cfg/kube-proxy-config.yml
12月 31 22:52:07 k8s-node1 systemd[1]: Started Kubernetes Proxy.
12月 31 15:19:02 k8s-node1 kube-proxy[1441]: E1231 15:19:02.459044 1441 node.go:161] Failed to retrieve node info: Get "https://10.3.9.237:6443/api/v1/nodes/k8s-node1": net/http: TLS handshake timeout
12月 31 15:19:08 k8s-node1 kube-proxy[1441]: E1231 15:19:08.323500 1441 node.go:161] Failed to retrieve node info: nodes "k8s-node1" is forbidden: User "system:kube-proxy" cannot get resource "nodes" in API group "" at the cluster scope
According to the error log, kube-proxy, running as user system:kube-proxy, lacks permission to get node information.
List the cluster roles:
# kubectl get clusterrole
List the cluster role bindings:
# kubectl get clusterrolebinding
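A quick way to confirm the missing permission (an extra check, not part of the original steps) is to ask the API server while impersonating the kube-proxy user:
# kubectl auth can-i get nodes --as=system:kube-proxy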
Create a clusterrolebinding that binds the system:kube-proxy group to the system:node clusterrole:
kubectl create clusterrolebinding system-node-role-bound --clusterrole=system:node --group=system:kube-proxy
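Two caveats worth noting (observations added here, not from the original notes): the error message names the user system:kube-proxy, while the binding above grants a group of that name, which only takes effect if that group is actually present in kube-proxy's client certificate or token; Kubernetes also ships a dedicated system:node-proxier clusterrole intended for kube-proxy, so a binding along these lines (the binding name is arbitrary) is the more conventional fix:
# kubectl create clusterrolebinding kube-proxy-node-proxier --clusterrole=system:node-proxier --user=system:kube-proxy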
Still unresolved. Next, try flushing the iptables rules on node2, since the test pods all run on node2:
# systemctl stop kubelet
# systemctl stop docker
# iptables --flush           # flush all rules in every chain of the default filter table
# iptables -t nat --flush    # flush all rules in the nat table
# systemctl start docker
# systemctl start kubelet
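Because kube-proxy is the component that rebuilds the Service rules after a flush, restarting it as well forces an immediate full re-sync instead of waiting for the next periodic sync (an extra step suggested here; the original notes only restart docker and kubelet):
# systemctl restart kube-proxy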
Check the kube-proxy log again; no errors are found.
The INFO log shows that the iptables rules have been synced, so the kube-proxy component itself can be considered healthy:
I0107 10:38:35.521896 14501 service.go:390] Adding new service port "kube-system/kube-dns:dns" at 10.1.0.2:53/UDP
I0107 10:38:35.521914 14501 service.go:390] Adding new service port "kube-system/kube-dns:dns-tcp" at 10.1.0.2:53/TCP
I0107 10:38:35.522029 14501 service.go:390] Adding new service port "kube-system/kube-dns:metrics" at 10.1.0.2:9153/TCP
I0107 10:38:35.522086 14501 service.go:390] Adding new service port "kubernetes-dashboard/kubernetes-dashboard" at 10.1.0.32:443/TCP
I0107 10:38:35.522120 14501 service.go:390] Adding new service port "nsweather/harbor-harbor-notary-server" at 10.1.0.242:4443/TCP
I0107 10:38:35.522139 14501 service.go:390] Adding new service port "nsweather/nacos:nacosport-nacos" at 10.1.0.221:8848/TCP
I0107 10:38:35.523263 14501 proxier.go:858] Stale udp service kube-system/kube-dns:dns -> 10.1.0.2
I0107 10:38:35.523336 14501 proxier.go:874] Syncing iptables rules
I0107 10:38:35.592339 14501 proxier.go:1718] Opened local port "nodePort for kubernetes-dashboard/kubernetes-dashboard" (:30001/tcp)
I0107 10:38:35.592500 14501 proxier.go:1718] Opened local port "nodePort for nsweather/mysql-svc:http" (:30021/tcp)
I0107 10:38:35.592577 14501 proxier.go:1718] Opened local port "nodePort for nsweather/harbor:http" (:30002/tcp)
I0107 10:38:35.592658 14501 proxier.go:1718] Opened local port "nodePort for nsweather/harbor:https" (:30003/tcp)
I0107 10:38:35.592977 14501 proxier.go:1718] Opened local port "nodePort for nsweather/harbor:notary" (:30004/tcp)
I0107 10:38:35.593034 14501 proxier.go:1718] Opened local port "nodePort for nsweather/nacos-nodeport:nacosport" (:30048/tcp)
I0107 10:38:35.593260 14501 proxier.go:1718] Opened local port "nodePort for nsweather/gateway-nodeport:8868tcp01" (:30155/tcp)
I0107 10:38:35.593343 14501 proxier.go:1718] Opened local port "nodePort for default/nginx-nodeport:80tcp01" (:30117/tcp)
I0107 10:38:35.593484 14501 proxier.go:1718] Opened local port "nodePort for nsweather/yujing-process-nodeport:7000tcp320661" (:32066/tcp)
I0107 10:38:35.607428 14501 proxier.go:826] syncProxyRules took 87.250124ms
I0107 10:38:35.608632 14501 proxier.go:874] Syncing iptables rules
I0107 10:38:35.666422 14501 proxier.go:826] syncProxyRules took 58.932601ms
5) Check whether kube-proxy on the node is writing the iptables rules
kube-proxy is responsible for generating the load-balancing rules for Services, implemented with iptables by default; check whether these rules have actually been written.
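Which proxy mode is actually in effect can be confirmed from kube-proxy's config file or from its local metrics endpoint (port 10249 is the default metrics bind address; this check is an addition, and the /proxyMode path is an assumption, not taken from the original notes):
# grep mode /opt/kubernetes/cfg/kube-proxy-config.yml
# curl 127.0.0.1:10249/proxyMode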
Check the iptables rules for nw-nginx:
# iptables-save |grep nw-nginx
-A KUBE-SEP-KVFGLHQMURDRZLUD -s 10.244.169.149/32 -m comment --comment "default/nw-nginx:80tcp02" -j KUBE-MARK-MASQ
-A KUBE-SEP-KVFGLHQMURDRZLUD -p tcp -m comment --comment "default/nw-nginx:80tcp02" -m tcp -j DNAT --to-destination 10.244.169.149:80
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.1.0.146/32 -p tcp -m comment --comment "default/nw-nginx:80tcp02 cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.1.0.146/32 -p tcp -m comment --comment "default/nw-nginx:80tcp02 cluster IP" -m tcp --dport 80 -j KUBE-SVC-QPWUTNZADI6EJEK2
-A KUBE-SVC-QPWUTNZADI6EJEK2 -m comment --comment "default/nw-nginx:80tcp02" -j KUBE-SEP-KVFGLHQMURDRZLUD
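The rules above are for nw-nginx; the same check applies to the nw-redis service used in the original test, whose KUBE-SERVICES/KUBE-SEP entries should reference ClusterIP 10.1.0.92 and port 6379:
# iptables-save | grep nw-redis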
4. Check the Calico pods and their logs
Check the status of the Calico pods in the kube-system namespace:
# kubectl get pods -n kube-system |grep calico
calico-kube-controllers-97769f7c7-tjz7m 1/1 Running 7 58d
calico-node-hnkx7 1/1 Running 13 57d
calico-node-plb2q 1/1 Running 7 58d
calico-node-wlz9c 1/1 Running 5 58d
# kubectl logs calico-kube-controllers-97769f7c7-tjz7m -n kube-system
2021-12-31 07:19:42.749 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W1231 07:19:42.750459 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2021-12-31 07:19:42.751 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2021-12-31 07:19:42.782 [INFO][1] main.go 149: Getting initial config snapshot from datastore
2021-12-31 07:19:42.842 [INFO][1] main.go 152: Got initial config snapshot
2021-12-31 07:19:42.843 [INFO][1] watchersyncer.go 89: Start called
2021-12-31 07:19:42.843 [INFO][1] main.go 169: Starting status report routine
2021-12-31 07:19:42.843 [INFO][1] main.go 402: Starting controller ControllerType="Node"
2021-12-31 07:19:42.843 [INFO][1] node_controller.go 138: Starting Node controller
2021-12-31 07:19:42.843 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2021-12-31 07:19:42.843 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready
2021-12-31 07:19:42.843 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2021-12-31 07:19:42.844 [INFO][1] resources.go 343: Main client watcher loop
2021-12-31 07:19:42.855 [INFO][1] watchercache.go 297: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-12-31 07:19:42.855 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2021-12-31 07:19:42.856 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync
2021-12-31 07:19:42.856 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2021-12-31 07:19:42.856 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2021-12-31 07:19:42.856 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2021-12-31 07:19:42.856 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync
2021-12-31 07:19:42.867 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints
2021-12-31 07:19:42.943 [INFO][1] node_controller.go 151: Node controller is now running
2021-12-31 07:19:42.943 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-12-31 07:19:42.980 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-12-31 08:08:04.133 [INFO][1] resources.go 371: Terminating main client watcher loop
2021-12-31 08:08:04.170 [INFO][1] resources.go 343: Main client watcher loop
2021-12-31 09:01:18.706 [INFO][1] resources.go 371: Terminating main client watcher loop
2021-12-31 09:01:18.734 [INFO][1] resources.go 343: Main client watcher loop
2021-12-31 09:57:21.682 [INFO][1] resources.go 371: Terminating main client watcher loop
2021-12-31 09:57:21.717 [INFO][1] resources.go 343: Main client watcher loop
2021-12-31 10:48:24.474 [INFO][1] resources.go 371: Terminating main client watcher loop
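Beyond the controller log, BGP peering health between the nodes can also be verified (an additional check, assuming calicoctl is installed on the node):
# calicoctl node status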
5. Check CoreDNS resolution (checked; the service itself is fine)
Using the busybox:latest image, try to resolve the DNS name of the kubernetes service:
/ # nslookup kubernetes
Server: 10.1.0.2
Address: 10.1.0.2:53
** server can't find kubernetes.default.svc.cluster.local: NXDOMAIN
*** Can't find kubernetes.svc.cluster.local: No answer
*** Can't find kubernetes.cluster.local: No answer
*** Can't find kubernetes.default.svc.cluster.local: No answer
*** Can't find kubernetes.svc.cluster.local: No answer
*** Can't find kubernetes.cluster.local: No answer
The result suggests that CoreDNS cannot resolve the DNS name of the kubernetes service.
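Before blaming CoreDNS itself, it is worth confirming that the kube-dns Service has healthy endpoints (a quick sanity check added here, not in the original steps):
# kubectl get svc,endpoints kube-dns -n kube-system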
Check the CoreDNS pod logs:
# kubectl get pods -n kube-system |grep coredns
coredns-6cc56c94bd-cw9cc 1/1 Running 5 56d
# kubectl logs coredns-6cc56c94bd-cw9cc -n kube-system | more
.:53
[INFO] plugin/reload: Running configuration MD5 = a4809ab99f6713c362194263016e6fac
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] 127.0.0.1:48719 - 14101 "HINFO IN 7523500665419724991.1265714792579118978. udp 57 false 512" NOERROR - 0 6.00261688s
[ERROR] plugin/errors: 2 7523500665419724991.1265714792579118978. HINFO: read udp 10.244.169.163:53414->219.141.136.10:53: i/o timeout
[INFO] 127.0.0.1:43057 - 17033 "HINFO IN 7523500665419724991.1265714792579118978. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.005748601s
[INFO] 127.0.0.1:60947 - 5384 "HINFO IN 7523500665419724991.1265714792579118978. udp 57 false 512" NXDOMAIN qr,rd,ra 132 4.004629275s
I1231 07:19:37.004960 1 trace.go:116] Trace[2019727887]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2021-12-31 07:19:21.932402464 +0000 UTC m=+0.352370840) (total time: 15.072231539s):
Trace[2019727887]: [15.070832463s] [15.070832463s] Objects listed
I1231 07:19:37.005072 1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2021-12-31 07:19:21.932604979 +0000 UTC m=+0.352573647) (total time: 15.072371377s):
Trace[1427131847]: [15.072123344s] [15.072123344s] Objects listed
I1231 07:19:37.005095 1 trace.go:116] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2021-12-31 07:19:21.932457524 +0000 UTC m=+0.352426182) (total time: 15.072583578s):
Trace[939984059]: [15.072472317s] [15.072472317s] Objects listed
[INFO] 10.244.36.118:55222 - 18531 "A IN localhost.cluster.local. udp 41 false 512" NXDOMAIN qr,aa,rd 134 0.000645256s
[INFO] 10.244.36.118:60903 - 10057 "AAAA IN localhost.cluster.local. udp 41 false 512" NXDOMAIN qr,aa,rd 134 0.000452162s
[INFO] 10.244.36.118:49223 - 15849 "A IN localhost. udp 27 false 512" NXDOMAIN qr,rd,ra 102 0.004133887s
[INFO] 10.244.36.118:55573 - 33343 "AAAA IN localhost. udp 27 false 512" NXDOMAIN qr,rd,ra 102 0.004304771s
[INFO] 10.244.159.133:57738 - 29349 "A IN git.rancher.io.cattle-system.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000538668
The log contains the following error:
[ERROR] plugin/errors: 2 7523500665419724991.1265714792579118978. HINFO: read udp 10.244.169.163:53414->219.141.100.100:53: i/o timeout
The pod at 10.244.169.163 is identified as the CoreDNS pod:
[root@k8s-master1 kubernetes]# kubectl get pods -A -o wide |grep 10.244.169.163
kube-system coredns-6cc56c94bd-cw9cc 1/1 Running 5 57d 10.244.169.163 k8s-node2 <none> <none>
219.141.100.100 port 53 is the company's DNS server; a telnet test (telnet 219.141.100.100 53) shows the upstream DNS service is working.
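Which upstream CoreDNS forwards to is defined by the forward plugin in its Corefile, which can be inspected in the coredns ConfigMap (the ConfigMap name coredns in kube-system is the usual default; this check is an addition, not in the original notes):
# kubectl get configmap coredns -n kube-system -o yaml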
The logs do not reveal an actual resolution problem on the CoreDNS side. A web search (via Baidu) turns up reports that the busybox image itself is at fault and cannot perform DNS lookups correctly.
The tests above used busybox:latest; switch the image to busybox:1.28.3.
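As before, a test pod can be launched with the pinned image tag (a sketch; the pod name is arbitrary):
# kubectl run dns-test -it --rm --image=busybox:1.28.3 -- sh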
Re-run the DNS resolution test. Example 1:
/ # nslookup kubernetes
Server: 10.1.0.2
Address 1: 10.1.0.2 kube-dns.kube-system.svc.cluster.local
Name: kubernetes
Address 1: 10.1.0.1 kubernetes.default.svc.cluster.local
Example 2:
/ # nslookup gateway.nsweather
Server: 10.1.0.2
Address 1: 10.1.0.2 kube-dns.kube-system.svc.cluster.local
Name: gateway.nsweather
Address 1: 10.1.0.167 gateway.nsweather.svc.cluster.local
About the busybox image issue:
busybox:1.28.3    # mind the image version; 1.28.4 reportedly has a problem and cannot perform DNS lookups correctly
The latest-tagged busybox image also has this problem!