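Deploying Prometheus Monitoring and Alerting on a Binary-Installed K8S Cluster
1. Background
① The K8S version is 1.16.6.
2. Reference documents
YAML files with the original image addresses: https://github.com/coreos/kube-prometheus/tree/master/manifests
YAML files with the image addresses changed to the Tencent Cloud registry: https://gitee.com/mylanvv/kube-prometheus.git
Prometheus official documentation: https://prometheus.io/docs/alerting/overview/
Prometheus official website: https://prometheus.io/
Reference deployment: https://blog.csdn.net/qq_40460909/article/details/105540145
3. Prometheus components
kube-prometheus is a complete monitoring stack: Prometheus collects the cluster metrics and Grafana displays them. It includes the following components: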
- The Prometheus Operator
- Highly available Prometheus
- Highly available Alertmanager
- Prometheus node-exporter
- Prometheus Adapter for Kubernetes Metrics APIs (k8s-prometheus-adapter)
- kube-state-metrics
- Grafana
4. Download Prometheus and modify the YAML files
① Download Prometheus
[root@k8s01 work]# git clone https://github.com/coreos/kube-prometheus.git
② Add NodePort ports for Grafana, Prometheus and Alertmanager
2.1 Grafana: edit grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  type: NodePort        # add this line if it is missing
  ports:
  - name: http
    port: 3000
    nodePort: 30299     # the NodePort to expose
    targetPort: http
  selector:
    app: grafana
2.2 Prometheus: edit prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort        # add this line if it is missing
  ports:
  - name: web
    port: 9090
    nodePort: 32152     # the NodePort to expose
    targetPort: web
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP
2.3 Alertmanager: edit alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    alertmanager: main
  name: alertmanager-main
  namespace: monitoring
spec:
  type: NodePort        # add this line if it is missing
  ports:
  - name: web
    port: 9093
    nodePort: 30026     # the NodePort to expose
    targetPort: web
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: ClientIP
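The three Services above pin fixed NodePorts. As a quick sanity check before applying them, the following sketch verifies that the chosen ports are unique and fall inside 30000-32767, the kube-apiserver default for `--service-node-port-range`. It assumes the cluster has not overridden that flag, and the function name `check_node_ports` is made up for this illustration.

```python
# Default NodePort range of kube-apiserver (--service-node-port-range).
# Assumption: the cluster has not overridden this flag.
NODE_PORT_MIN, NODE_PORT_MAX = 30000, 32767


def check_node_ports(ports):
    """Return a list of problems found in a {service_name: node_port} mapping."""
    problems = []
    seen = {}
    for name, port in ports.items():
        if not NODE_PORT_MIN <= port <= NODE_PORT_MAX:
            problems.append(f"{name}: {port} is outside {NODE_PORT_MIN}-{NODE_PORT_MAX}")
        if port in seen:
            problems.append(f"{name}: {port} is already used by {seen[port]}")
        else:
            seen[port] = name
    return problems


# NodePorts chosen in the three Service manifests.
chosen = {"grafana": 30299, "prometheus-k8s": 32152, "alertmanager-main": 30026}
print(check_node_ports(chosen) or "all NodePorts look fine")
```

An empty list means the ports can be used as written; each entry otherwise names the Service and the reason it was rejected.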
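2.4 Configure the alert mailbox for Alertmanager (WeCom, DingTalk, webhooks and so on can be used instead)
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 1m                      # resolve timeout
      smtp_smarthost: 'smtp.qq.com:465'        # SMTP server (host:port)
      smtp_from: '******@qq.com'               # sender address
      smtp_auth_username: '******@qq.com'      # SMTP account name
      smtp_auth_password: '*******'            # authorization code
      smtp_require_tls: false                  # do not require TLS (required by default)
    receivers:
    - name: email
      email_configs:                           # email settings
      - to: "490089459@qq.com"                 # address that receives the alerts
    route:
      group_interval: 1m                       # how long to wait before notifying about new alerts added to a group
      group_wait: 10s                          # how long to wait before sending the first notification for a new group
      receiver: email
      repeat_interval: 120m                    # how often a still-firing alert is re-sent; two hours here because this is a test and frequent mail is unwanted, 5 or 10 minutes is suggested for production
type: Opaque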
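Before loading the Secret, it can help to confirm that the SMTP account itself works, since a wrong authorization code otherwise only shows up in the Alertmanager logs. The sketch below uses Python's standard `smtplib`. It assumes that port 465 on the mail server expects an implicit-TLS connection (hence `SMTP_SSL`); the function names and the example addresses are placeholders, not part of the original setup.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a small plain-text message for checking the SMTP account."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Alertmanager SMTP test"
    msg.set_content("If this mail arrives, the SMTP settings are usable.")
    return msg


def send_test_message(host, port, username, password, msg):
    """Send msg over an implicit-TLS connection (assumed for port 465)."""
    with smtplib.SMTP_SSL(host, port, timeout=10) as server:
        server.login(username, password)
        server.send_message(msg)


# Placeholder addresses; nothing is sent until send_test_message is called.
message = build_test_message("sender@example.com", "receiver@example.com")
print(message["Subject"])
```

To run the real check, call `send_test_message("smtp.qq.com", 465, username, authorization_code, message)` with the same values that go into `smtp_auth_username` and `smtp_auth_password`.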
5. Deploy Prometheus
Note: if the images cannot be pulled, you have to solve that yourself.
Local path of the downloaded Prometheus YAML files
[root@k8s01 manifests]# pwd
/opt/k8s/work/kube-prometheus/manifests
① Deploy prometheus-operator
[root@k8s01 manifests]# kubectl apply -f setup/
Check that it has started
[root@k8s01 manifests]# kubectl get pods -n monitoring
② Deploy Prometheus, the metrics adapter, Grafana and Alertmanager
[root@k8s01 manifests]# kubectl apply -f .
③ Check that the pods have started
[root@k8s01 manifests]# kubectl get namespace
NAME                   STATUS   AGE
cephfs                 Active   232d
default                Active   242d
kube-node-lease        Active   242d
kube-public            Active   242d
kube-system            Active   242d
kubernetes-dashboard   Active   242d
monitoring             Active   242d
[root@k8s01 manifests]# kubectl get pod -o wide -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE    IP               NODE    NOMINATED NODE   READINESS GATES
alertmanager-main-0                   2/2     Running   0          144d   172.30.236.159   k8s02   <none>           <none>
alertmanager-main-1                   2/2     Running   2          144d   172.30.235.166   k8s03   <none>           <none>
alertmanager-main-2                   2/2     Running   2          170d   172.30.77.43     k8s04   <none>           <none>
grafana-c54896b96-xkqhh               1/1     Running   1          144d   172.30.235.168   k8s03   <none>           <none>
kube-state-metrics-dbb85dfd5-fcwmq    3/3     Running   4          173d   172.30.77.29     k8s04   <none>           <none>
node-exporter-2sqgx                   2/2     Running   2          187d   172.16.1.14      k8s04   <none>           <none>
node-exporter-fgp2t                   2/2     Running   0          15d    172.16.1.11      k8s01   <none>           <none>
node-exporter-fnccr                   2/2     Running   2          169d   172.16.1.13      k8s03   <none>           <none>
node-exporter-hswcg                   2/2     Running   4          242d   172.16.1.12      k8s02   <none>           <none>
node-exporter-njbkx                   2/2     Running   0          98d    172.16.1.15      k8s05   <none>           <none>
prometheus-adapter-b8d458474-kxcdz    1/1     Running   1          144d   172.30.235.167   k8s03   <none>           <none>
prometheus-k8s-0                      3/3     Running   4          144d   172.30.77.48     k8s04   <none>           <none>
prometheus-k8s-1                      3/3     Running   4          144d   172.30.235.157   k8s03   <none>           <none>
prometheus-operator-ccd7974dc-qjgw2   2/2     Running   2          144d   172.30.235.142   k8s03   <none>           <none>
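Reading the READY column by eye works for a handful of pods; for a scripted check, the same information is available from `kubectl get pods -n monitoring -o json`. The sketch below is one way to pick out pods whose containers are not all ready. The helper name `not_ready_pods` is hypothetical and the sample document is hand-written to mimic the shape of the kubectl output, not captured from the cluster above.

```python
import json


def not_ready_pods(pod_list):
    """Return names of pods whose containers are not all ready."""
    bad = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if not statuses or not all(s.get("ready") for s in statuses):
            bad.append(pod["metadata"]["name"])
    return bad


# Hand-written sample in the shape of `kubectl get pods -n monitoring -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "prometheus-k8s-0"},
   "status": {"containerStatuses": [{"ready": true}, {"ready": true}, {"ready": true}]}},
  {"metadata": {"name": "grafana-c54896b96-xkqhh"},
   "status": {"containerStatuses": [{"ready": false}]}}
]}
""")
print(not_ready_pods(sample))
```

In practice the input would come from piping the kubectl JSON output into `json.load(sys.stdin)`.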
④ Check the exposed ports
[root@k8s01 manifests]# kubectl get svc -o wide -n monitoring
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE    SELECTOR
alertmanager-main       NodePort    10.254.132.51    <none>        9093:30026/TCP               24m    alertmanager=main,app=alertmanager
alertmanager-operated   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   242d   app=alertmanager
grafana                 NodePort    10.254.206.245   <none>        3000:30299/TCP               31m    app=grafana
kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP            242d   app.kubernetes.io/name=kube-state-metrics
node-exporter           ClusterIP   None             <none>        9100/TCP                     242d   app.kubernetes.io/name=node-exporter
prometheus-adapter      ClusterIP   10.254.238.112   <none>        443/TCP                      242d   name=prometheus-adapter
prometheus-k8s          NodePort    10.254.165.21    <none>        9090:32152/TCP               28m    app=prometheus,prometheus=k8s
prometheus-operated     ClusterIP   None             <none>        9090/TCP                     242d   app=prometheus
prometheus-operator     ClusterIP   None             <none>        8443/TCP                     242d   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator
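With the NodePorts in place, each web endpoint can be probed from outside the cluster. The following sketch builds the health-check URLs from the ports in the Service listing and offers a small probe function. The paths `/-/healthy` (Prometheus and Alertmanager) and `/api/health` (Grafana) are the health endpoints those projects expose; the node IP `192.0.2.10` is a documentation placeholder and the function names are illustrative.

```python
import urllib.request

# NodePorts taken from the Service listing, paired with each health endpoint.
HEALTH_PATHS = {
    "prometheus": (32152, "/-/healthy"),
    "alertmanager": (30026, "/-/healthy"),
    "grafana": (30299, "/api/health"),
}


def health_urls(node_ip):
    """Map each component to the URL of its health endpoint on the given node."""
    return {
        name: f"http://{node_ip}:{port}{path}"
        for name, (port, path) in HEALTH_PATHS.items()
    }


def is_healthy(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


# Placeholder node IP; replace it with the address of any cluster node.
for name, url in health_urls("192.0.2.10").items():
    print(name, url)
```

Calling `is_healthy(url)` for each URL from a machine that can reach the nodes gives a quick yes or no per component.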
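6. Check whether the alert mailbox has received mail
kube-prometheus ships with a Watchdog alert that is always firing, so a notification should arrive even on a healthy cluster.
Because Alertmanager is configured with a 1m interval, the mails keep coming; you can open the Alertmanager web UI and put the alert into a silence.
Alertmanager address: IP:NodePort, http://x.x.x.x:30026
Before the change
After the change
Afterwards, check whether alert mails are still being received.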
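The silence is created through the web UI here; the same thing can also be done through the Alertmanager HTTP API, which is convenient when it has to be repeated. The sketch below builds the JSON body for `POST /api/v2/silences`, assuming the deployed Alertmanager version serves the v2 API. The function name, the `createdBy` value and the comment text are illustrative choices.

```python
import json
from datetime import datetime, timedelta, timezone


def watchdog_silence(hours, created_by, now=None):
    """Build the JSON body for POST /api/v2/silences that silences the Watchdog alert."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [{"name": "alertname", "value": "Watchdog", "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": "Watchdog is expected to fire all the time",
    }


# A fixed start time keeps the printed example reproducible.
fixed = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(json.dumps(watchdog_silence(24, "admin", now=fixed), indent=2))
```

The body would be sent to `http://x.x.x.x:30026/api/v2/silences` with a `Content-Type: application/json` header.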
7. Add a custom alert rule
Define an alert for pods whose state is not Running.
Edit prometheus-rules.yaml and append the following at the very end:
    - alert: pod-status
      annotations:
        message: vv test pod-status
      expr: |
        kube_pod_container_status_running != 1
      for: 1m
      labels:
        severity: warning
After the change
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace
          }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    #### The rule below is the newly added one
    - alert: pod-status
      annotations:
        message: pod is down pod-status !
      expr: |
        kube_pod_container_status_running != 1
      for: 1m
      labels:
        severity: warning
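The `for: 1m` clause means the alert does not fire the moment the expression matches: it stays pending until the condition has held for a full minute, and the timer restarts if the condition clears in between. The sketch below is a simplified model of that behaviour for a single series, not Prometheus code; `alert_state` and the sample evaluations are invented for the illustration.

```python
def alert_state(evaluations, for_seconds):
    """Classify an alert from (timestamp_seconds, value) rule evaluations, oldest first.

    The condition mirrors `kube_pod_container_status_running != 1`.
    """
    active_since = None
    last_ts = None
    for ts, value in evaluations:
        last_ts = ts
        if value != 1:
            if active_since is None:
                active_since = ts  # condition starts to hold
        else:
            active_since = None  # condition cleared, the timer restarts
    if active_since is None:
        return "inactive"
    return "firing" if last_ts - active_since >= for_seconds else "pending"


# Condition holds for the whole minute.
print(alert_state([(0, 0), (30, 0), (60, 0)], 60))
# Condition is interrupted at t=30, so the timer starts again at t=60.
print(alert_state([(0, 0), (30, 1), (60, 0)], 60))
```

This is why a pod that crashes and recovers within the minute does not produce a mail, while one stuck in a non-running state does.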
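Re-apply prometheus-rules.yaml. The rules are mounted into the pod, so the change is synced into the pod after the apply; this takes a little while.
kubectl apply -f prometheus-rules.yaml
Open Prometheus and check that the new pod-status rule has been added.
Address: IP:NodePort, http://x.x.x.x:32152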
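Besides looking at the web page, the loaded rules can be listed through the Prometheus HTTP API endpoint `/api/v1/rules`. The sketch below extracts the alerting rule names from such a response. The helper names are illustrative, and the sample response is a trimmed, hand-written example of the response shape rather than real output from this cluster.

```python
import json
import urllib.request


def rule_names(rules_response):
    """Collect alerting rule names from a /api/v1/rules response body."""
    names = []
    for group in rules_response["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting":
                names.append(rule["name"])
    return names


def fetch_rules(base_url, timeout=5):
    """Fetch and decode /api/v1/rules from a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=timeout) as resp:
        return json.load(resp)


# Trimmed, hand-written sample of the response shape.
sample = {"status": "success", "data": {"groups": [
    {"name": "prometheus-operator", "rules": [
        {"name": "PrometheusOperatorReconcileErrors", "type": "alerting"},
        {"name": "pod-status", "type": "alerting"},
    ]},
]}}
print("pod-status" in rule_names(sample))
```

Against the live server this would be `"pod-status" in rule_names(fetch_rules("http://x.x.x.x:32152"))`, using the NodePort set in prometheus-service.yaml.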
8. Test the custom Prometheus alert (pod-status)
Create a container that is not Running
Create a file named test.yaml and edit it
Contents of the YAML file
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vv
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: vv
  template:
    metadata:
      labels:
        app: vv
    spec:
      imagePullSecrets:
      - name: registry-pull-secret
      containers:
      - name: vv
        image: ccr.ccs.tencentyun.com/lanvv/test-jdk-1-8.0.181-bak:latest
        imagePullPolicy: IfNotPresent
Create the Pod
[root@k8s01 temp]# kubectl apply -f test.yaml
deployment.apps/vv created
[root@k8s01 temp]# kubectl get pod -o wide
NAME                  READY   STATUS             RESTARTS   AGE   IP              NODE    NOMINATED NODE   READINESS GATES
vv-6f8599bb48-mnjsh   0/1     CrashLoopBackOff   1          19s   172.30.115.86   k8s05   <none>           <none>
After a short wait, the alert mail arrives.
Check Prometheus again
The pod-status alert now shows one firing entry.
9. Create a custom chart in Grafana
① Query in Prometheus
First verify in Prometheus that the query returns what we expect
kube_pod_container_status_running != 1
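The same check can be run without the browser through the Prometheus instant-query endpoint `/api/v1/query`. The sketch below builds the request URL and reduces a response to the namespace, pod and container of each matching series. The helper names are illustrative, `192.0.2.10` is a placeholder node IP, and the sample response (including its timestamp) is hand-written to show the response shape.

```python
from urllib.parse import urlencode


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def failing_containers(query_response):
    """Return (namespace, pod, container) for each series in an instant-query result."""
    return [
        (r["metric"].get("namespace"), r["metric"].get("pod"), r["metric"].get("container"))
        for r in query_response["data"]["result"]
    ]


# Hand-written sample of an instant-query response with one matching series.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"__name__": "kube_pod_container_status_running",
                "namespace": "default", "pod": "vv-6f8599bb48-mnjsh", "container": "vv"},
     "value": [1590000000, "0"]},
]}}

print(query_url("http://192.0.2.10:32152", "kube_pod_container_status_running != 1"))
print(failing_containers(sample))
```

The URL encoding matters because the expression contains spaces, `!` and `=`, which cannot be pasted into a query string as they are.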
Next, go to Grafana
Add a chart. Grafana address: IP:30299, http://x.x.x.x:30299
② Add a data source
Data source name
Prometheus address: note that this is the in-cluster address!
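For reference, the data source can also be registered through Grafana's HTTP API (`POST /api/datasources`) instead of the form. The sketch below builds the request body. The in-cluster URL is derived from the prometheus-k8s Service in the monitoring namespace on port 9090, which is an assumption about which address was entered in the form; the data source name `prometheus-k8s` and the function name are likewise illustrative.

```python
import json


def prometheus_datasource(name, url):
    """Build the JSON body for Grafana's POST /api/datasources."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # Grafana's backend queries Prometheus, so a cluster-internal URL works
        "isDefault": False,
    }


# In-cluster address assumed from the prometheus-k8s Service in the monitoring namespace.
body = prometheus_datasource("prometheus-k8s", "http://prometheus-k8s.monitoring.svc:9090")
print(json.dumps(body, indent=2))
```

The `proxy` access mode is what makes the in-cluster address usable: the browser never contacts Prometheus directly, only the Grafana pod does.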
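Add a query
For Query, select the Prometheus data source that was just added
Enter the query expression in the metrics field
Click Save in the upper right corner
After saving, the Pod query you just saved is listed under Home
Click it to view the chart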
Reference article: https://blog.csdn.net/qq_40460909/article/details/105540145