现场问题排查-k8s(docker)上某服务pod频繁自动重启
写在最前面
根因:应用内存占用不合理(个人认为)+现场配置内存不够导致频繁触发OOM引发该现象。
为啥要写这个文章?
- 之前没有k8s下pod频繁重启的问题处理经验,这次实战沉淀思路及过程,供后续自己处理相同问题提供参考资料
- 为其他遇到类似问题的人提供一些排查思路
问题现象
现场反馈1月15日升级完以后,某系统首页的某搜索功能就经常不好用,初步排查是某服务的POD总自动重启,研发已经处理过两次了,还是没有根除。
初步排查(健康检查配置不合理导致容器重启)
问现场之前排查结论是啥呀?运维反馈怀疑资源不足,后来也增加了对应服务的资源,但问题还是没有根除。
那先看看这个服务的事件吧(docker5页面可以直接看,也可以用kubectl get event -n 服务所在命名空间
命令查看–网上资料都是不需要用-n指定命名空间的,但我不指定就报错,网上说有多个命名空间的时候需要指定,所以最好还是指定吧),然后发现了以下问题。
[root@node-133446 ~]# kubectl get event -n myNamespace
LAST SEEN TYPE REASON KIND MESSAGE
...
9m52s Normal Created Pod Created container
9m51s Normal Started Pod Started container
9m50s Warning Unhealthy Pod Liveness probe failed: Get http://192.168.220.150:18201/actuator/health: dial tcp 192.168.220.150:18201: connect: connection refused
9m18s Warning Unhealthy Pod Readiness probe failed: Get http://192.168.220.150:18201/actuator/health: dial tcp 192.168.220.150:18201: connect: connection refused
9m18s Normal Killing Pod Killing container with id docker://mySer:Container failed liveness probe.. Container will be killed and recreated.
4m44s Warning BackOff Pod Back-off restarting failed container
...
上面报错明显是服务的/actuator/health接口不通呀,根据经验这个报错是因为k8s的就绪探针配置导致的,因为着急恢复业务,就先临时把这个配置给关掉了(后续问题彻底解决后又恢复了,但根据情况调整了一些参数。具体过程另开一个帖子讲吧,感觉可以作为一个单独的知识点分享,这样也便于后续针对场景搜索)。
再次定位(自动伸缩配置调整)
关掉健康检查并重启后,服务正常了,皆大欢喜。现场说还是成哥厉害,几分钟就搞定了,然而还没高兴两分钟,我看首页案件搜索又用不了了。
这???打脸来的太快,就像是龙卷风!刚刚说我厉害的人,悄悄回座位处理其他问题了,也不搭理我了…这可太疼了。
既然没有解决,继续看事件吧,看看有啥新思路没,然后事件列表基本都是以下内容
[root@node-133446 ~]# kubectl describe service/mySer -n myNamespace
# 这里其实是根据页面上的时间列表模拟的,上面的命令执行后,Events是none,网上说要配置权限,但我没配置出来
类型 事件原因 事件信息 最后更新
Normal ScalingReplicaSet Scaled down replica set tm-... to 3 3 minutes ago
Normal ScalingReplicaSet Scaled up replica set tm-... to 4 9 minutes ago
Normal ScalingReplicaSet Scaled up replica set tm-... to 3 10 minutes ago
Normal ScalingReplicaSet Scaled down replica set tm-... to 2 13 minutes ago
Normal ScalingReplicaSet Scaled up replica set tm-... to 1 21 minutes ago
Normal ScalingReplicaSet Scaled down replica set tm-... to 0 22 minutes ago
Normal ScalingReplicaSet Scaled up replica set tm-... to 4 30 minutes ago
这Pod的个数怎么在反复伸缩呀?咨询了一下arterydocker支持,按步骤检查了下
- 集群资源充足吗?->集群内每个机器都还有不少资源呢
- 被攻击或其他管理员手工修改了?->密码没几个人知道,都在一个屋,基本不可能。为了保险,还是修改了密码,但问题依旧,所以排除该选项
- 查看Pod的yaml最下面的事件有啥东西没?->没有这个节点(其实这里跟上面脚本的内容是一样的,估计是没权限)
- 查看Pod的yaml里面有没有
kind: HorizontalPodAutoscaler
关键字?->查看了没有(其实问题就在这里,只是Pod的deploy资源和HPA资源是分别管理的,入口错了) - 那就得再研究研究了,之前也没遇到过。
陷入瓶颈了呀,咋办?网上各种找资料,说的都是自动伸缩的事儿,各种瞎点,还真给点出来了(位置在docker5控制台->资源->HPA->里面搜索出问题的服务->编辑->最小副本、最大副本控制自动伸缩)。
[root@node-133446 ~]# kubectl get hpa -n myNamespace
# 查看指定命名空间内所有的hpa配置
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
...
hpa-mySer Deployment/mySer 32m/6144m 1 4 1 319d
...
[root@node-133446 ~]# kubectl get hpa -n myNamespace hpa-mySer
# 查看指定hpa配置
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
hpa-mySer Deployment/mySer 33m/6144m 1 4 1 319d
[root@node-133446 ~]# kubectl get hpa -n myNamespace hpa-mySer -o yaml
# 查看指定hpa完整yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
annotations:
autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2022-04-09T04:23:10Z","reason":"ReadyForNewScale","message":"recommended
size matches current size"},{"type":"ScalingActive","status":"True","lastTransitionTime":"2023-02-22T10:54:45Z","reason":"ValidMetricFound","message":"the
HPA was able to successfully calculate a replica count from cpu resource"},{"type":"ScalingLimited","status":"False","lastTransitionTime":"2023-02-20T14:02:11Z","reason":"DesiredWithinRange","message":"the
desired count is within the acceptable range"}]'
autoscaling.alpha.kubernetes.io/current-metrics: '[{"type":"Resource","resource":{"name":"cpu","currentAverageValue":"33m"}}]'
autoscaling.alpha.kubernetes.io/metrics: '[{"type":"Resource","resource":{"name":"cpu","targetAverageValue":"6144m"}}]'
field.cattle.io/displayName: ""
creationTimestamp: "2022-04-09T04:22:55Z"
labels:
io.cattle.field/appId: myNamespace
name: hpa-mySer
namespace: myNamespace
resourceVersion: "235189306"
selfLink: /apis/autoscaling/v1/namespaces/myNamespace/horizontalpodautoscalers/hpa-mySer
uid: ba923625-b7bc-11ec-a55a-3cd2e55b71c6
spec:
maxReplicas: 4 # 这个就是最大副本数,最多会扩容到这个数字
minReplicas: 1 # 这个是最小副本数,最小不会少于这个数字(手工调整pod个数,估计会自动增加回来)
scaleTargetRef:
apiVersion: apps/v1beta2
kind: Deployment
name: mySer
status:
currentReplicas: 1
desiredReplicas: 1
lastScaleTime: "2023-02-22T10:49:29Z"
把这里的最大和最小都限制成1,然后再观察,该pod就没有自动伸缩了。但服务依然不可用(实际这个估计没影响业务,还是当时业务本身有问题,问题根除后,这里也恢复了),那就得继续排查了。
最终定位(服务实际内存占用超过服务配置最大限制导致重启)
既然自己没发现什么明显问题,那继续谷歌吧。
按照网上的文章,各种实验,最终有效果的脚本如下:
[root@node-133446 ~]# kubectl describe pod -n myNamespace mySer-5f5c8666f5-ll9kj
# 查看指定pod资源的详细信息(重点关注【Last State: Terminated】部分的【Reason: OOMKilled】)
Name: mySer-5f5c8666f5-ll9kj
Namespace: myNamespace
Priority: 0
PriorityClassName: <none>
Node: node-133448/133.4.4.8
Start Time: Mon, 20 Feb 2023 21:57:58 +0800
Labels: app=mySer
appid=myNamespace
pod-template-hash=5f5c8666f5
Annotations: cattle.io/timestamp: 2023-02-20T13:53:34Z
field.cattle.io/ports:
[[{"containerPort":18201,"dnsName":"mySer","hostPort":0,"kind":"ClusterIP","name":"18201tcp02","protocol":"TCP","sourcePort":0},{"conta...
Status: Running
IP: 192.168.118.5
Controlled By: ReplicaSet/mySer-5f5c8666f5
Init Containers:
init:
Container ID: docker://f75bf1e90de373487464fbc84329a786c970d9265c3adcd5b5ef5fe7a382afe9
Image: ...
Image ID: ...
Port: <none>
Host Port: <none>
Command:
sh
-c
wget --spider ...
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 20 Feb 2023 21:57:59 +0800
Finished: Mon, 20 Feb 2023 21:57:59 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-488jb (ro)
Containers:
mySer:
Container ID: docker://facfd64a6d6a6e4d087d7a1434e768e76d0e143639189193d764f76b713d3aaf
Image: ...
Image ID: ...
Ports: 18201/TCP, 18202/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/sh
-c
sh /install/run mySer 2.24.0
State: Waiting
Reason: CrashLoopBackOff
# 重点关注这一段,可以看到状态是已停止,原因是OOMKilled,内存溢出被杀掉了
# 重点关注这一段,可以看到状态是已停止,原因是OOMKilled,内存溢出被杀掉了
# 重点关注这一段,可以看到状态是已停止,原因是OOMKilled,内存溢出被杀掉了
Last State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Mon, 20 Feb 2023 22:12:33 +0800
Finished: Mon, 20 Feb 2023 22:16:28 +0800
Ready: False
Restart Count: 3
Limits:
cpu: 4
memory: 8196Mi # 这里的8196Mi其实是研发已经调整过的了
Requests:
cpu: 1
memory: 4Gi
Environment:
...
MYHOSTIP: (v1:status.hostIP)
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned myNamespace/mySer-5f5c8666f5-ll9kj to node-133448
Normal Pulled 19m kubelet, node-133448 Container image "registry...com:5000/.../busybox-amd64:1.33" already present on machine
Normal Created 19m kubelet, node-133448 Created container
Normal Started 19m kubelet, node-133448 Started container
Normal Created 19m kubelet, node-133448 Created container
Normal Pulling 19m kubelet, node-133448 pulling image "registry...com:5000/.../...-monitor-amd64:2.2.0"
Normal Pulled 19m kubelet, node-133448 Successfully pulled image "registry...com:5000/.../...-monitor-amd64:2.2.0"
Normal Started 19m kubelet, node-133448 Started container
Normal Created 4m41s (x4 over 19m) kubelet, node-133448 Created container
Normal Started 4m41s (x4 over 19m) kubelet, node-133448 Started container
Normal Pulled 4m41s (x4 over 19m) kubelet, node-133448 Successfully pulled image "registry...com:5000/../...-7.6.1810"
Normal Pulling 4m41s (x4 over 19m) kubelet, node-133448 pulling image "registry...com:5000/...-7.6.1810"
Warning BackOff 34s (x5 over 10m) kubelet, node-133448 Back-off restarting failed container
从上面的状态信息及原因可以得知,是因为内存溢出导致容器被杀掉并自动重启了,然后查看pod崩溃前的日志,也看到了如下关键字(可以使用kubectl logs -f podName -n myNamespace --previous
命令查看,或者docker5在对应pod更多的查看日志
功能页面左下角有个以前的容器
均可查看崩溃前容器日志):
[com.thunisoft.tm.rocketmq.listener.MQConsumerListenerConcurrently:56]-[ERROR]
java.lang.OutOfMemoryError: Java heap space
以上信息足以证明,当前容器反复自动重启,是因为限制的内存不足导致。
但已经8G内存了还不够,都占用到哪儿去了?进入容器后使用top命令查看,内存基本都被mySer这个java进程占用了。内存调整到16G再观察,服务启动后,几分钟内mySer服务的内存占用又到了14G左右,但好歹重启频率降低了,基本不重启了。一个服务需要吃那么多内存,感觉不太正常,把排查过程给对应产品研发负责人反馈,找了个研发介入跟进。
后来研发给了个新包,换上以后内存占用不怎么明显了,启动后长期保持在2-3G的样子,业务高峰期内存占用也没超过6G,单节点配置8G也够用了。
持续观察两天,未再复现服务不可用情况,问题解决,健康检查及自动伸缩配置功能还原(参数有所微调),完事儿。
参考资料
Pod 水平自动扩缩:https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale/
HorizontalPodAutoscaler 演练:https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
kubectl describe pod Error from server (NotFound):pods not found: https://blog.csdn.net/u013288190/article/details/109125059
kubectl describe pod 里边没有看到events问题解决【详细步骤】: https://blog.csdn.net/marlinlm/article/details/122150345
K8S 查看 Pod 日志: https://blog.51cto.com/u_1472521/4909721
[云原生] Kubernetes(k8s)健康检查详解与实战演示(就绪性探针 和 存活性探针):https://www.cnblogs.com/liugp/p/16630873.html
健康检查配置:https://imroc.cc/kubernetes/best-practices/configure-healthcheck.html