2022 Study Log 0427 [K8S cluster high IO takes down services: troubleshooting notes]

Time to celebrate: today I finally solved a problem that had been bugging me for over a month!
[Background]: A K8S cluster deployed with minikube on a 2-core / 2 GB Tencent Cloud S5 server, running two Deployments (nodejs and grafana) and one DaemonSet (filebeat).
[Symptoms]: (the commands used to observe these are sketched after this list)
1. At first, grafana was frequently unreachable and the master node went NotReady.
2. docker ps showed the apiserver and scheduler containers restarting frequently (roughly every 6 minutes, without affecting the services).
3. docker ps showed that the proxy container had also restarted (the root of the avalanche: once proxy crashed, the master went NotReady).
4. Disk IO was frequently saturated, sometimes to the point that I could not even log in to the server; a reboot bought about an hour of calm, and then the cycle repeated.
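
Roughly, the symptoms above can be observed with commands like these (iostat comes from the sysstat package; the exact invocations are my own sketch, not taken from the original notes):

kubectl get nodes                                   # master flapping between Ready and NotReady
docker ps -a | grep -E 'apiserver|scheduler|proxy'  # restart / exit history of control-plane containers
iostat -x 1 5                                       # %util pinned near 100% when IO is saturated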
[Summary of abnormal logs]:
The logs are long, so only representative, non-repeating error lines are excerpted here.
apiserver: at first I only looked at the pod level (kubectl logs against the apiserver pod) and forgot to check the container level (see the sketch after these log lines).

http: TLS handshake error from 172.17.0.36:37502: read tcp 172.17.0.36:8443-\u003e172.17.0.36:37502: read: connection reset by peer
{"log":"I0423 21:50:56.150640       1 log.go:172] http: TLS handshake error from 172.17.0.36:37756: EOF\n","stream":"stderr","time":"2022-04-23T21:50:56.150805501Z"}
{"log":"E0423 21:50:56.602528       1 status.go:71] apiserver received an error that is not an metav1.Status: \u0026errors.errorString{s:\"context canceled\"}\n","stream":"stderr","time":"2022-04-23T21:50:56.60266719Z"}

scheduler: same story as the apiserver. These two containers restarted unusually often, yet at first I somehow paid no attention to it, assuming they were simply flaky.

E0423 22:54:15.673221       1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: Get https://control-plane.minikube.internal:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s: context deadline exceeded
I0423 22:54:17.350345       1 leaderelection.go:277] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0423 22:54:17.491836       1 server.go:244] leaderelection lost
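
The fatal "leaderelection lost" line is the direct cause of the restarts: when the scheduler cannot renew its lease because the apiserver keeps timing out, it exits on purpose. One way to watch the renewal of the kube-scheduler lease seen in the log above (the command is my own sketch):

kubectl -n kube-system get lease kube-scheduler -o yaml
# Watch spec.renewTime; if it stops advancing, the scheduler is about to lose leadership and exit.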

controller:

W0423 22:52:23.100798       1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0423 22:52:41.395453       1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: Get https://control-plane.minikube.internal:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
I0423 22:52:43.318502       1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F0423 22:52:43.662661       1 controllermanager.go:279] leaderelection lost

etcd: this is where I got misled. I kept assuming the slow range requests meant etcd was writing its log and the disk could not keep up.
That led to another big detour: I kept trying to tune etcd, and even downloaded etcd-client for it.
ETCD tuning measures
(A single node does not actually need Raft, but I pushed on with it anyway for the sake of learning.)
The tuning covered three areas:
1. Raise etcd's disk IO priority
2. Increase snapshot frequency to reduce log accumulation and speed up compaction
3. Network tuning, which I will skip here
In the end it proved useless: the problem had nothing to do with etcd tuning. For the record, the kind of changes attempted are sketched below.
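
For reference, roughly what the tuning looked like (flag values are illustrative, and the manifest / cert paths assume minikube's default layout; none of this is from the original notes):

# Raise the IO priority of the running etcd process (best-effort class, highest priority)
ionice -c2 -n0 -p "$(pgrep -x etcd)"

# In the etcd static-pod manifest (assumed path: /etc/kubernetes/manifests/etcd.yaml),
# take snapshots more often so the write-ahead log stays short, e.g.:
#   - --snapshot-count=5000          # default is 100000

# Check etcd health and DB size with etcdctl (endpoint and cert paths assumed)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/minikube/certs/etcd/ca.crt \
  --cert=/var/lib/minikube/certs/etcd/server.crt \
  --key=/var/lib/minikube/certs/etcd/server.key \
  endpoint status --write-out=table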

{"log":"2022-04-23 21:47:38.959628 I | embed: rejected connection from \"127.0.0.1:44528\" (error \"read tcp 127.0.0.1:2379-\u003e127.0.0.1:44528: read: connection reset by peer\", ServerName \"\")\n","stream":"stderr","time":"2022-04-23T21:47:38.959858512Z"}
{"log":"2022-04-23 21:47:47.873513 W | etcdserver: read-only range request \"key:\\\"/registry/services/endpoints/kube-system/kube-scheduler\\\" \" with result \"range_response_count:1 size:590\" took too long (1.083816676s) to execute\n","stream":"stderr","time":"2022-04-23T21:47:47.873683358Z"}

Judging the error timeline from the exit times below, you can see some dependency between the failures:
the earlier an exited container's creation time, the earlier it had crashed in the previous round.
So, proxy aside, the order of failure was: controller > scheduler > apiserver.

[root@VM-0-36-centos ~]# docker ps -a | grep xit
1bf9080b4c48   "/dashboard --insecu…"   15 minutes ago   Exited (2) 15 minutes ago               k8s_kubernetes-dashboard_kubernetes-dashboard-696dbcc666-42fns_kubernetes-dashboard_2568a5cd-72d1-433b-b5b2-27885d2d943e_42
395c8890dd51   "kube-apiserver --ad…"   19 minutes ago   Exited (0) 15 minutes ago               k8s_kube-apiserver_kube-apiserver-vm-0-36-centos_kube-system_e83e2db116420e21a35b9d31a383202d_91
7fbee5fd1757   "kube-scheduler --au…"   20 minutes ago   Exited (255) 15 minutes ago             k8s_kube-scheduler_kube-scheduler-vm-0-36-centos_kube-system_c63a370803ea358d14eb11f27c64756f_240
0e7d81be25d9   "kube-controller-man…"   27 minutes ago   Exited (255) 15 minutes ago             k8s_kube-controller-manager_kube-controller-manager-vm-0-36-centos_kube-system_0d5c3746cb0a798a6fc95c8dab3bff0b_245
7aee6e991973   "/usr/bin/dumb-init …"   3 hours ago      Exited (137) 2 hours ago                k8s_nginx-ingress-controller_nginx-ingress-controller-6d746cd945-f67xn_ingress-nginx_0c01ad47-3f93-4d82-892e-e87b6a361db5_0
28bb20e1ce79   "/coredns -conf /etc…"   3 hours ago      Exited (0) 20 minutes ago               k8s_coredns_coredns-546565776c-4ct42_kube-system_c389cc98-3f18-4351-8e5e-346b374a47a2_54
1e61dd0aef3d   "/coredns -conf /etc…"   3 hours ago      Exited (0) 20 minutes ago               k8s_coredns_coredns-546565776c-2sfml_kube-system_c3870b1b-66cb-4179-808e-6956ecc92ebe_51
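
A slightly cleaner way to pull the same exit timeline, using docker's built-in format placeholders (my own variant, not from the original notes):

docker ps -a --filter status=exited \
  --format 'table {{.Names}}\t{{.RunningFor}}\t{{.Status}}'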

The most important log was from the proxy container. I did not save it at the time and cannot reproduce it now, but it looked roughly like this:

dial tcp 172.***** connect reset

In fact the apiserver logs had already hinted at it: the internal 172.x traffic was broken.
Subnet conflict
At this point an article saved me.
My server's internal-network subnet was conflicting with the container network's address range! At first I did not take it seriously; it seemed to have no direct relation to the IO problem, so what good would fixing it do?
But in a might-as-well-try, nothing-to-lose spirit, I switched the server's internal subnet. (After the switch, the ConfigMaps and other things I had configured were gone and had to be recreated.)
Starting minikube again afterwards also threw the error below; minikube delete took care of it, then I started over (rough sequence sketched after the error):

couldn't retrieve DNS addon deployments:
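
The reset sequence was roughly the following (the driver flag is an assumption; use whatever the original cluster was started with):

minikube delete               # wipe the broken cluster state
minikube start --driver=none  # driver assumed; match your original setup
kubectl get pods -A           # confirm the control-plane pods come up cleanly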

After that change, the services magically came up. They have now stayed healthy for two hours and the Docker environment is spotless. I will keep watching.
So the subnet conflict was the real culprit all along!
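
For anyone hitting the same thing, a quick way to check whether the host's subnet overlaps with the container network (the commands and the default bridge name are my own sketch, not from the original notes):

# Host-side addresses and routes
ip -4 addr show
ip route

# Docker's default bridge subnet (typically 172.17.0.0/16)
docker network inspect bridge --format '{{ (index .IPAM.Config 0).Subnet }}'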
