Prometheus Monitoring and Alerting on a Binary-Deployed K8S Cluster

石头-豆豆

Category: k8s. Tags: kubernetes docker java

Article link: https://blog.csdn.net/xjjj064/article/details/114881807

I. Background

① Kubernetes version: 1.16.6

II. References

YAML manifests with the original image addresses: https://github.com/coreos/kube-prometheus/tree/master/manifests
YAML manifests with the image addresses changed to the Tencent Cloud registry: https://gitee.com/mylanvv/kube-prometheus.git
Prometheus official documentation: https://prometheus.io/docs/alerting/overview/
Prometheus official website: https://prometheus.io/
Reference deployment: https://blog.csdn.net/qq_40460909/article/details/105540145

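III. Prometheus components

kube-prometheus is a complete monitoring solution: it uses Prometheus to collect cluster metrics and Grafana to display them. It includes the following components: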

The Prometheus Operator
Highly available Prometheus
Highly available Alertmanager
Prometheus node-exporter
Prometheus Adapter for Kubernetes Metrics APIs (k8s-prometheus-adapter)
kube-state-metrics
Grafana

IV. Download kube-prometheus and edit the YAML files

① Download kube-prometheus

[root@k8s01 work]#  git clone https://github.com/coreos/kube-prometheus.git

② Add NodePort ports for Grafana, Prometheus, and Alertmanager

2.1 Grafana: edit grafana-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  type: NodePort   # add this line if it is missing
  ports:
  - name: http
    port: 3000
    nodePort: 30299  # the NodePort to expose
    targetPort: http
  selector:
    app: grafana

2.2 Prometheus: edit prometheus-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort    # add this line if it is missing
  ports:
  - name: web
    port: 9090
    nodePort: 32152  # the NodePort to expose
    targetPort: web
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP

2.3 Alertmanager: edit alertmanager-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    alertmanager: main
  name: alertmanager-main
  namespace: monitoring
spec:
  type: NodePort  # add this line if it is missing
  ports:
  - name: web
    port: 9093
    nodePort: 30026 # the NodePort to expose
    targetPort: web
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: ClientIP
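
The three nodePort values chosen above have to fall inside the cluster's NodePort range, which defaults to 30000-32767. The following is a minimal sketch that checks them before the manifests are applied. It assumes the default range has not been changed with the kube-apiserver `--service-node-port-range` flag, and the helper name `in_nodeport_range` is made up for illustration.

```shell
# Assumption: default NodePort range (adjust if --service-node-port-range was changed).
NODEPORT_MIN=30000
NODEPORT_MAX=32767

# in_nodeport_range PORT -> exit status 0 if PORT is a number inside the range
in_nodeport_range() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge "$NODEPORT_MIN" ] && [ "$1" -le "$NODEPORT_MAX" ]
}

# The ports used for Grafana, Prometheus, and Alertmanager in this section
for port in 30299 32152 30026; do
  if in_nodeport_range "$port"; then
    echo "$port ok"
  else
    echo "$port is outside ${NODEPORT_MIN}-${NODEPORT_MAX}"
  fi
done
```

A port outside the range is rejected by the API server when the Service is applied, so a quick check like this mainly saves a round trip.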

2.4 Configure the alert mailbox for Alertmanager (WeCom, DingTalk, a webhook, and so on can be used instead)

apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
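    global:
      resolve_timeout: 1m # an alert is declared resolved if it is not updated within this time
      smtp_smarthost: 'smtp.qq.com:465' # SMTP server (host:port)
      smtp_from: '******@qq.com' # sender address
      smtp_auth_username: '******@qq.com' # SMTP login name
      smtp_auth_password: '*******' # SMTP authorization password
      smtp_require_tls: false # do not require TLS (required by default)

    receivers:
    - name: email
      email_configs: # email settings
      - to: "490089459@qq.com" # address that receives the alerts

    route:
      group_interval: 1m # wait time before notifying about new alerts added to an existing group
      group_wait: 10s # wait time before the first notification for a new group of alerts
      receiver: email
      repeat_interval: 120m # how often a still-firing alert is re-sent; two hours here to keep test mail volume low, 5 or 10 minutes is suggested for production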
type: Opaque
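
After the Secret has been applied, it can be useful to confirm what the cluster actually stores, because the Secret's `stringData` is saved base64-encoded under `data`. This is a small sketch, not part of the original walkthrough: the function name `show_alertmanager_config` is hypothetical, while the Secret name, namespace, and key are the ones from the manifest above.

```shell
# Print the alertmanager.yaml currently stored in the alertmanager-main Secret.
# The dot in the key name has to be escaped inside the jsonpath expression.
show_alertmanager_config() {
  kubectl -n monitoring get secret alertmanager-main \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode
}
```

Calling `show_alertmanager_config` on a machine with cluster access should print the `global`, `receivers`, and `route` sections exactly as written in the manifest.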

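V. Deploy kube-prometheus

Note: if the images cannot be pulled, you will have to work around that yourself.
Local path of the downloaded kube-prometheus YAML files: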

[root@k8s01 manifests]# pwd
/opt/k8s/work/kube-prometheus/manifests

① Deploy prometheus-operator

[root@k8s01 manifests]# kubectl apply -f setup/

Check that it started

[root@k8s01 manifests]# kubectl get pods -n monitoring
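
The manifests applied in the next step contain ServiceMonitor objects, which can only be created once the CRDs from setup/ are registered with the API server. The sketch below polls until `kubectl get servicemonitors` succeeds, which is the same check the kube-prometheus README loops on. The function name `wait_for_servicemonitor_crd` and the default of 60 attempts are illustrative choices, not something from the original post.

```shell
# wait_for_servicemonitor_crd [MAX_TRIES]
# Returns 0 as soon as the ServiceMonitor resource type can be listed,
# or 1 after MAX_TRIES attempts (one second apart).
wait_for_servicemonitor_crd() {
  tries="${1:-60}"
  while [ "$tries" -gt 0 ]; do
    if kubectl get servicemonitors --all-namespaces >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

A typical use is `wait_for_servicemonitor_crd && kubectl apply -f .` from the manifests directory.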

② Deploy Prometheus, the metrics adapter, Grafana, and Alertmanager

[root@k8s01 manifests]# kubectl apply -f .

③ Check that the pods are running

[root@k8s01 manifests]# kubectl get namespace
NAME                   STATUS   AGE
cephfs                 Active   232d
default                Active   242d
kube-node-lease        Active   242d
kube-public            Active   242d
kube-system            Active   242d
kubernetes-dashboard   Active   242d
monitoring             Active   242d
[root@k8s01 manifests]# kubectl get pod -o wide -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE    IP               NODE    NOMINATED NODE   READINESS GATES
alertmanager-main-0                   2/2     Running   0          144d   172.30.236.159   k8s02   <none>           <none>
alertmanager-main-1                   2/2     Running   2          144d   172.30.235.166   k8s03   <none>           <none>
alertmanager-main-2                   2/2     Running   2          170d   172.30.77.43     k8s04   <none>           <none>
grafana-c54896b96-xkqhh               1/1     Running   1          144d   172.30.235.168   k8s03   <none>           <none>
kube-state-metrics-dbb85dfd5-fcwmq    3/3     Running   4          173d   172.30.77.29     k8s04   <none>           <none>
node-exporter-2sqgx                   2/2     Running   2          187d   172.16.1.14      k8s04   <none>           <none>
node-exporter-fgp2t                   2/2     Running   0          15d    172.16.1.11      k8s01   <none>           <none>
node-exporter-fnccr                   2/2     Running   2          169d   172.16.1.13      k8s03   <none>           <none>
node-exporter-hswcg                   2/2     Running   4          242d   172.16.1.12      k8s02   <none>           <none>
node-exporter-njbkx                   2/2     Running   0          98d    172.16.1.15      k8s05   <none>           <none>
prometheus-adapter-b8d458474-kxcdz    1/1     Running   1          144d   172.30.235.167   k8s03   <none>           <none>
prometheus-k8s-0                      3/3     Running   4          144d   172.30.77.48     k8s04   <none>           <none>
prometheus-k8s-1                      3/3     Running   4          144d   172.30.235.157   k8s03   <none>           <none>
prometheus-operator-ccd7974dc-qjgw2   2/2     Running   2          144d   172.30.235.142   k8s03   <none>           <none>

④ Check the exposed ports

[root@k8s01 manifests]# kubectl get svc -o wide -n monitoring
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE    SELECTOR
alertmanager-main       NodePort    10.254.132.51    <none>        9093:30026/TCP               24m    alertmanager=main,app=alertmanager
alertmanager-operated   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   242d   app=alertmanager
grafana                 NodePort    10.254.206.245   <none>        3000:30299/TCP               31m    app=grafana
kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP            242d   app.kubernetes.io/name=kube-state-metrics
node-exporter           ClusterIP   None             <none>        9100/TCP                     242d   app.kubernetes.io/name=node-exporter
prometheus-adapter      ClusterIP   10.254.238.112   <none>        443/TCP                      242d   name=prometheus-adapter
prometheus-k8s          NodePort    10.254.165.21    <none>        9090:32152/TCP               28m    app=prometheus,prometheus=k8s
prometheus-operated     ClusterIP   None             <none>        9090/TCP                     242d   app=prometheus
prometheus-operator     ClusterIP   None             <none>        8443/TCP                     242d   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator
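
The NodePort column above can also be read programmatically, which is handy for building the web UI addresses. This is a rough sketch: the function names `node_port` and `print_urls` are made up, and the node IP is passed in by the caller because the post does not give one.

```shell
# node_port SERVICE -> prints the first nodePort of a Service in the monitoring namespace
node_port() {
  kubectl -n monitoring get svc "$1" -o jsonpath='{.spec.ports[0].nodePort}'
}

# print_urls NODE_IP -> prints one "name url" line per exposed service
print_urls() {
  for svc in grafana prometheus-k8s alertmanager-main; do
    echo "$svc http://$1:$(node_port "$svc")"
  done
}
```

With the services shown above, `print_urls x.x.x.x` would list ports 30299, 32152, and 30026 for Grafana, Prometheus, and Alertmanager respectively.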

VI. Check whether the alert mailbox has received mail

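A mail should already have arrived, because kube-prometheus ships a Watchdog alert that is always in the firing state.

Because Alertmanager is configured with a 1m interval, the mails keep coming. To stop them, open the Alertmanager web UI and create a silence for the alert.
Alertmanager address (IP:NodePort): http://x.x.x.x:30026
[Screenshots: Alertmanager before and after adding the silence]

Afterwards, check whether the alert mails have stopped.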

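VII. Add a custom alert rule

Define an alert that fires when a pod's container is not in the running state.

Edit prometheus-rules.yaml and append the following at the end of the file: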

- alert: pod-status
  annotations:
    message: vv test pod-status
  expr: |
    kube_pod_container_status_running != 1
  for: 1m
  labels:
    severity: warning

After the change, the end of the file looks like this:

- alert: PrometheusOperatorReconcileErrors
  annotations:
    message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace
      }} Namespace.
  expr: |
    rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
- alert: PrometheusOperatorNodeLookupErrors
  annotations:
    message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
  expr: |
    rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
#### the rule below is the newly added one
- alert: pod-status
  annotations:
    message: pod is down pod-status !
  expr: |
    kube_pod_container_status_running != 1
  for: 1m
  labels:
    severity: warning

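Re-apply prometheus-rules.yaml. The rules are mounted into the Prometheus pod, so the change is synced into the pod after the apply; the sync takes a little while.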

kubectl apply -f prometheus-rules.yaml
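
Whether the new rule has reached Prometheus can also be checked from the command line through the Prometheus HTTP API endpoint `/api/v1/rules`. The following is a sketch under a few assumptions: the helper name `rule_loaded` is invented, the base URL is the Prometheus NodePort address (port 32152 in this setup), and the match relies on the API returning compact JSON with a `"name":"..."` field per rule.

```shell
# rule_loaded BASE_URL RULE_NAME
# Exit status 0 if a rule with that name appears in the /api/v1/rules response.
rule_loaded() {
  curl -s "$1/api/v1/rules" | grep -q "\"name\":\"$2\""
}
```

For example, `rule_loaded http://x.x.x.x:32152 pod-status && echo loaded` can be repeated until the sync has finished.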

Open Prometheus and check that the new pod-status rule has been added.
Address (IP:NodePort): http://x.x.x.x:32152

VIII. Test the custom alert (pod-status)

Create a container that is not running.
Create a file named test.yaml with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vv
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: vv
  template:
    metadata:
      labels:
        app: vv
    spec:
      imagePullSecrets:
      - name: registry-pull-secret
      containers:
      - name: vv
        image: ccr.ccs.tencentyun.com/lanvv/test-jdk-1-8.0.181-bak:latest
        imagePullPolicy: IfNotPresent

Create the pod

[root@k8s01 temp]# kubectl apply -f test.yaml 
deployment.apps/vv created

[root@k8s01 temp]# kubectl get pod -o wide
NAME                               READY   STATUS             RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
vv-6f8599bb48-mnjsh                0/1     CrashLoopBackOff   1          19s     172.30.115.86    k8s05   <none>           <none>

After a short wait, the alert mail arrived.
Checking Prometheus again shows one alert for pod-status.
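
The same check can be scripted against the Prometheus HTTP API endpoint `/api/v1/alerts`, which lists the currently pending and firing alerts. A minimal sketch, assuming the Prometheus NodePort address from earlier and a compact JSON response in which each alert carries an `"alertname":"..."` label; the helper name `alert_active` is hypothetical.

```shell
# alert_active BASE_URL ALERT_NAME
# Exit status 0 if an alert with that alertname is currently pending or firing.
alert_active() {
  curl -s "$1/api/v1/alerts" | grep -q "\"alertname\":\"$2\""
}
```

Once `alert_active http://x.x.x.x:32152 pod-status` succeeds and the mail has been received, the test Deployment can be removed again with `kubectl delete -f test.yaml`.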