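Setting up k8s cluster monitoring: Alertmanager troubleshooting
Pod startup error: CrashLoopBackOff
CrashLoopBackOff means the container started, then exited abnormally, and the kubelet keeps restarting it with a back-off delay.
Inspect the pod with kubectl describe (for example: kubectl describe pod alertmanager-main-0 -n monitoring)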
Events:
  Type     Reason     Age                    From                   Message
  ----     ------     ----                   ----                   -------
  Normal   Scheduled  <unknown>              default-scheduler      Successfully assigned monitoring/alertmanager-main-0 to 192.168.6.11
  Normal   Pulled     23m                    kubelet, 192.168.6.11  Container image "quay.mirrors.ustc.edu.cn/prometheus/alertmanager:v0.21.0" already present on machine
  Normal   Created    23m                    kubelet, 192.168.6.11  Created container alertmanager
  Normal   Started    23m                    kubelet, 192.168.6.11  Started container alertmanager
  Normal   Pulled     23m                    kubelet, 192.168.6.11  Container image "quay.mirrors.ustc.edu.cn/prometheus-operator/prometheus-config-reloader:v0.47.0" already present on machine
  Normal   Created    23m                    kubelet, 192.168.6.11  Created container config-reloader
  Normal   Started    23m                    kubelet, 192.168.6.11  Started container config-reloader
  Warning  Unhealthy  23m (x6 over 23m)      kubelet, 192.168.6.11  Liveness probe failed: Get http://172.17.25.5:9093/-/healthy: dial tcp 172.17.25.5:9093: connect: connection refused
  Warning  Unhealthy  8m53s (x148 over 23m)  kubelet, 192.168.6.11  Readiness probe failed: Get http://172.17.25.5:9093/-/ready: dial tcp 172.17.25.5:9093: connect: connection refused
  Warning  BackOff    3m51s (x34 over 12m)   kubelet, 192.168.6.11  Back-off restarting failed container
The pod's liveness probe fails: the connection to port 9093 is refused. The readiness probe fails the same way.
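The two probe failures above can be reproduced by hand. The following is a minimal sketch, assuming curl is available on a machine that can reach the pod network. probe_alertmanager is a hypothetical helper name, and its port argument defaults to the 9093 shown in the events.

```shell
# probe_alertmanager IP [PORT]
# Hypothetical helper: requests the same two paths that the kubelet probes use
# (/-/healthy and /-/ready) and prints one result line per path.
probe_alertmanager() {
  ip="$1"
  port="${2:-9093}"   # 9093 is the port seen in the probe errors
  for path in /-/healthy /-/ready; do
    if curl -fsS --max-time 3 "http://${ip}:${port}${path}" >/dev/null 2>&1; then
      echo "${path} ok"
    else
      echo "${path} failed"
    fi
  done
}

# Example (assumes kubectl access to the cluster):
#   ip=$(kubectl get pod alertmanager-main-0 -n monitoring -o jsonpath='{.status.podIP}')
#   probe_alertmanager "$ip"
```

If both paths report failed, that is consistent with the "connection refused" in the events: the problem is more likely in the alertmanager process itself than in the probe configuration.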
Check the logs
[root@k8s-node1 ~]# kubectl logs pod/alertmanager-main-0 alertmanager -n monitoring
level=info ts=2021-06-02T02:11:49.274Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-06-02T02:11:49.274Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
[root@k8s-node1 ~]# kubectl logs pod/alertmanager-main-0 config-reloader -n monitoring
level=info ts=2021-06-02T01:57:31.669430944Z caller=main.go:147 msg="Starting prometheus-config-reloader" version="(version=0.47.0, branch=refs/tags/pkg/client/v0.47.0, revision=539108b043e9ecc53c4e044083651e2ebfbd3492)"
level=info ts=2021-06-02T01:57:31.669531061Z caller=main.go:148 build_context="(go=go1.16.3, user=simonpasquier, date=20210413-15:46:43)"
level=info ts=2021-06-02T01:57:31.669664237Z caller=main.go:182 msg="Starting web server for metrics" listen=:8080
level=info ts=2021-06-02T01:57:31.67010267Z caller=reloader.go:214 msg="started watching config file and directories for changes" cfg= out= dirs=/etc/alertmanager/config
level=error ts=2021-06-02T01:57:32.81121586Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9093/-/reload\": dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-06-02T01:57:37.811710125Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9093/-/reload\": dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-06-02T01:57:42.811117367Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9093/-/reload\": dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-06-02T01:57:47.810889541Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9093/-/reload\": dial tcp 127.0.0.1:9093: connect: connection refused"
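The alertmanager log stops after its two startup lines, and config-reloader cannot reach localhost:9093, which suggests the main container never starts listening. For a container in CrashLoopBackOff, the log of the previous (terminated) instance and its recorded exit code are often worth checking as well. A sketch using standard kubectl options; the function names are made up for illustration.

```shell
# Hypothetical helpers; pod, container and namespace names are taken from
# the output above.

# Print the log of the previous (terminated) alertmanager container instance.
previous_log() {
  kubectl logs alertmanager-main-0 -c alertmanager -n monitoring --previous
}

# Print the exit code recorded for the last terminated alertmanager container.
last_exit_code() {
  kubectl get pod alertmanager-main-0 -n monitoring \
    -o jsonpath='{.status.containerStatuses[?(@.name=="alertmanager")].lastState.terminated.exitCode}'
}

# Example:
#   previous_log
#   last_exit_code
```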
Check the statefulset
alertmanager-main is not becoming ready.
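The readiness can be read directly from the statefulset. A minimal sketch: sts_ready is a hypothetical helper name, and the jsonpath expression reads the standard StatefulSet fields status.readyReplicas and spec.replicas.

```shell
# sts_ready NAME NAMESPACE
# Hypothetical helper: prints "<ready>/<desired>" replicas for a statefulset.
# readyReplicas may be absent from the status when no replica is ready,
# in which case the part before the slash is empty.
sts_ready() {
  kubectl get statefulset "$1" -n "$2" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
}

# Example:
#   sts_ready alertmanager-main monitoring
```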
Export the statefulset
kubectl -n monitoring get statefulset.apps/alertmanager-main -o yaml > dump.yaml
# Add hostNetwork: true under spec.template.spec in dump.yaml
# Delete the existing statefulset and recreate it
kubectl delete statefulsets.apps alertmanager-main -n monitoring
kubectl apply -f dump.yaml
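The same change could also be applied in place with kubectl patch, without exporting, deleting and recreating. This is an alternative to the steps above, not the procedure used here. It is a sketch that assumes a JSON merge patch is acceptable, and enable_host_network is a hypothetical helper name.

```shell
# enable_host_network NAME NAMESPACE
# Hypothetical helper: sets spec.template.spec.hostNetwork to true on a
# statefulset with a JSON merge patch (an alternative to editing dump.yaml).
enable_host_network() {
  kubectl patch statefulset "$1" -n "$2" --type merge \
    -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
}

# Example:
#   enable_host_network alertmanager-main monitoring
```

Either way, with hostNetwork: true the pod uses the node's network namespace instead of its own.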
Reference:
https://github.com/prometheus-operator/kube-prometheus/issues/653