Deploying Prometheus + Grafana with Helm

I. Deploy Prometheus and Grafana

1. Create the namespace:

kubectl create ns prometheus

2. Create a Secret for the TLS certificate:

kubectl create secret tls sllme-com-pem  --cert=./sllme.com.pem --key=./sllme.com.key -n prometheus
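Before moving on, it can be worth a quick sanity check that both objects exist:

kubectl get ns prometheus
kubectl get secret sllme-com-pem -n prometheus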

3. Deploy Prometheus

helm pull bitnami/kube-prometheus
Edit values.yaml:
prometheus:
  persistence:
    enabled: true
    storageClass: "azureblob-fuse-premium"
    size: 100Gi
helm upgrade --install prometheus bitnami/kube-prometheus -f ./kube-prometheus/values.yaml  -n prometheus
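If the bitnami repository has not been added to Helm yet, add it before pulling the chart, and after the upgrade confirm that the release and its pods came up. The label selector below assumes the standard Bitnami chart labels; adjust it if your chart version labels pods differently:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm status prometheus -n prometheus
kubectl get pods -n prometheus -l app.kubernetes.io/instance=prometheus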

4. Deploy Grafana

helm pull bitnami/grafana
Edit values.yaml:
persistence:
  enabled: true 
  storageClass: "azureblob-fuse-premium"
  annotations: {}
  existingClaim: ""
  accessMode: ReadWriteOnce
  accessModes: []
  size: 100Gi

helm upgrade --install grafana bitnami/grafana -f ./values.yaml  -n prometheus
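To verify that the PVC was bound and the Service exists (the Service name grafana and port 3000 are what the Ingress in the next step points at), a quick check looks like this:

kubectl get pvc -n prometheus
kubectl get svc grafana -n prometheus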

5. Create the Grafana Ingress:

grafana-dev.sllme.com.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
  name: grafana-dev.sllme.com
spec:
  rules:
  - host: grafana-dev.sllme.com
    http:
      paths:
      - backend:
          service:
            name: grafana
            port:
              number: 3000
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - grafana-dev.sllme.com
    secretName: sllme-com-pem
    
kubectl apply -f grafana-dev.sllme.com.yaml -n prometheus
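Once applied, check that the Application Gateway ingress controller has picked up the Ingress and assigned it an address:

kubectl get ingress grafana-dev.sllme.com -n prometheus
kubectl describe ingress grafana-dev.sllme.com -n prometheus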

6. Log in to Grafana:

Configure the data source as http://prometheus-kube-prometheus-prometheus:9090, then import a Grafana dashboard template. The template JSON is too long to include here; Alibaba Cloud's Kubernetes cluster monitoring dashboard can be used as a reference.
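If DNS for grafana-dev.sllme.com does not point at the Application Gateway yet, Grafana can also be reached through a port-forward; the admin password is generated by the chart and the command for retrieving it is printed in the release notes:

kubectl port-forward svc/grafana 3000:3000 -n prometheus
# then open http://127.0.0.1:3000 in a browser
helm get notes grafana -n prometheus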

II. Configure alerting

1. Starting from the Prometheus configuration above, modify the alertmanager section

# Note: the following is part of the kube-prometheus values.yaml; modify it as shown below.

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'lux'
      routes:
        - match:
            alertname: Watchdog
          receiver: 'lux'
    receivers:
      - name: 'lux'
        webhook_configs:
          - url: 'http://prometheus-alert-center:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/c0c7e8d8-0bb9-4a72-97b5-247e21bbc50ba'

Note: the purpose of this change is to forward all alert notifications to PrometheusAlert.

prometheus-alert-center:8080 is the Service name of the alerting component deployed below.

fsurl is the Feishu bot webhook address.
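After editing values.yaml, re-run the helm upgrade command from step 3 of part I and confirm that Alertmanager has loaded the new route and receiver. The Alertmanager Service name below is an assumption based on the same naming pattern as the Prometheus Service used earlier; check kubectl get svc -n prometheus if it differs:

helm upgrade --install prometheus bitnami/kube-prometheus -f ./kube-prometheus/values.yaml -n prometheus
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n prometheus
# then open http://127.0.0.1:9093/#/status and check the loaded configuration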

2. Alerting tool

We use the open-source project PrometheusAlert for alert notifications. GitHub: https://github.com/feiyu563/PrometheusAlert

3. Deploy the service

vim prometheus-alert.yaml

apiVersion: v1
data:
  app.conf: |
    appname = Lux_Alert
    login_user=lux
    login_password=xxxxxx
    httpaddr = "0.0.0.0"
    httpport = 8080
    runmode = dev
    #enable receiving JSON request bodies
    copyrequestbody = true
    #alert message title
    title=XXX生产
    #convert Prometheus/Graylog alert times to the CST time zone (do not enable if they are already in CST)
    prometheus_cst_time=1
    #database driver: supports sqlite3, mysql, postgres; if using mysql or postgres, uncomment db_host, db_port, db_user, db_password and db_name
    db_driver=sqlite3
    #db_host=127.0.0.1
    #db_port=3306
    #db_user=root
    #db_password=root
    #db_name=prometheusalert
    #enable alert records: 0 = off, 1 = on
    AlertRecord=0
    #enable scheduled deletion of alert records: 0 = off, 1 = on
    RecordLive=0
    RecordLiveDay=7
    # write alert records to Elasticsearch 7: 0 = off, 1 = on
    alert_to_es=0
    # ES addresses; the value is a []string
    # beego.AppConfig.Strings reads the value as []string, separated by ";" rather than ","
    # to_es_pwd=password
    # Feishu webhook alerting
    open-feishu=1
    fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/c0c7e8d8-0bb9-4a72-97b5-247e21bbc50ba
kind: ConfigMap
metadata:
  name: prometheus-alert-center-conf
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-alert-center
    alertname: prometheus-alert-center
  name: prometheus-alert-center
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-alert-center
      alertname: prometheus-alert-center
  template:
    metadata:
      labels:
        app: prometheus-alert-center
        alertname: prometheus-alert-center
    spec:
      containers:
      - image: feiyu563/prometheus-alert
        name: prometheus-alert-center
        env:
        - name: TZ
          value: "Asia/Shanghai"
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: prometheus-alert-center-conf-map
          mountPath: /app/conf/app.conf
          subPath: app.conf
        - name: prometheus-alert-center-conf-map
          mountPath: /app/user.csv
          subPath: user.csv
      volumes:
      - name: prometheus-alert-center-conf-map
        configMap:
          name: prometheus-alert-center-conf
          items:
          - key: app.conf
            path: app.conf
---
apiVersion: v1
kind: Service
metadata:
  labels:
    alertname: prometheus-alert-center
  name: prometheus-alert-center
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'  
spec:
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app: prometheus-alert-center
    
kubectl apply -f prometheus-alert.yaml -n prometheus
## Note: we use Feishu as the channel that receives the alert notifications.
fsurl is the webhook address of a bot created in a Feishu group.
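PrometheusAlert can be smoke-tested with a hand-crafted payload before Alertmanager starts sending real alerts. This is only a sketch: the body mimics the Alertmanager webhook format that the template in step 5 consumes, <your-feishu-webhook> stands in for the bot URL above, and the prometheus-fs template must already exist (it is created in step 5):

kubectl port-forward svc/prometheus-alert-center 8080:8080 -n prometheus
curl -s -X POST "http://127.0.0.1:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=<your-feishu-webhook>" \
  -H "Content-Type: application/json" \
  -d '{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","severity":"Warning","instance":"test-node"},"annotations":{"summary":"test alert","description":"manual smoke test"},"startsAt":"2024-01-01T00:00:00Z"}]}'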

4. Create alert rules

vim k8s-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus-name: kube-prometheus-prometheus
    managed-by: prometheus-operator          
  name: prometheus-k8s-rules
spec:
  groups:
  - name: node-exporter.rules
    rules:
    - alert: 节点内存过高
      expr: 100 - (node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes) / node_memory_MemTotal_bytes*100 > 80
      for: 6m
      labels:
        severity: "Error"
      annotations:
        title: "主机内存过高"
        btn: "点击查看详情 :玫瑰:"
        link: "https://www.baidu.com"
        summary: "告警主机:{{ $labels.instance }}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:节点{{ $labels.instance }}内存 > 80%; value = [{{ $value }}]"

    - alert: 节点CPU负载过高  
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[4m])) * 100) > 80
      for: 6m
      labels:
        severity: "Error"
      annotations:
        summary: "告警主机: {{ $labels.instance }}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:节点{{ $labels.instance }}CPU > 80%; value[{{ $value }}]"

    - alert: 节点硬盘资源不足
      expr:  (node_filesystem_size_bytes{device="rootfs",mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
      for: 6m
      labels:
        severity:  "Error"
      annotations:
        summary: "告警主机:{{$labels.job}}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:节点{{ $labels.instance }}硬盘(< 20% left); value[{{ $value }}] \n  partition = {{$labels.mountpoint}}"

    - alert: 节点硬盘资源不足
      expr: (node_filesystem_size_bytes{mountpoint="/home"} - node_filesystem_free_bytes{mountpoint="/home"}) / node_filesystem_size_bytes{mountpoint="/home"} * 100 > 80
      for: 6m
      labels:
        severity:  "Error"
      annotations:
        summary: "告警主机:{{$labels.job}}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:节点{{ $labels.instance }}/home硬盘 (< 20% left); value[{{ $value }}] \n  partition = {{$labels.mountpoint}}"

    - alert: node-exporter down  
      expr: up{job="node-exporter"} == 0
      for: 30s
      labels:
        team: node
        severity: "Error"
      annotations:
        summary: "告警主机:{{$labels.instance}}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:node-exporter down"

    - alert: 节点硬盘读取率异常
      expr: sum by (job,instance) (irate(node_disk_read_bytes_total[6m])) / 1024 / 1024 > 50
      for: 6m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警主机:{{$labels.job}}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:节点{{ $labels.instance }}磁盘读取率 (> 50 MB/s) VALUE = {{ $value }}"


    - alert: 节点网络吞吐量输入异常
      expr: sum by (job,instance) (irate(node_network_receive_bytes_total[6m])) / 1024 / 1024 > 50
      for: 6m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警主机:{{$labels.job}}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:Host network interfaces are probably receiving too much data (> 50 MB/s) VALUE = {{ $value }}"

    - alert: 节点网络吞吐量输出异常 
      expr: sum by (job,instance) (irate(node_network_transmit_bytes_total[6m])) / 1024 / 1024 > 50
      for: 6m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警主机:{{$labels.job}}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:Host network interfaces are probably sending too much data (> 50 MB/s) VALUE = {{ $value }}"

    - alert: TCP_ESTABLISHED过高  
      expr: node_netstat_Tcp_CurrEstab > 5000
      for: 6m
      labels:
        name: Tcp_Established
        severity: "Warning"
      annotations:
        summary: "告警主机: {{ $labels.instance }}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:TCP_ESTABLISHED大于5000,当前值[{{ $value }}]."

    - alert: TCP_TIME_WAIT大于10000  
      expr: node_sockstat_TCP_tw > 10000
      for: 5m
      labels:
        name: Tcp_Time_Wait
        severity: "Warning"
      annotations:
        summary: "告警主机: {{ $labels.instance }}"
        address: "告警地址:{{ $labels.instance }}"
        description: "告警问题:TCP_TIME_WAIT大于10000,当前值[{{ $value }}]."

      
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[4m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[4m])
        ) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        container_memory_working_set_bytes{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_working_set_bytes
    - expr: |
        container_memory_rss{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_rss
    - expr: |
        container_memory_cache{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_cache
    - expr: |
        container_memory_swap{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_swap
    - expr: |
        sum(container_memory_usage_bytes{job="kubelet", image!="", container!="POD"}) by (namespace)
      record: namespace:container_memory_usage_bytes:sum
    - expr: |
        sum by (namespace) (
            sum by (namespace, pod) (
                max by (namespace, pod, container) (
                    kube_pod_container_resource_requests_memory_bytes{job="prometheus-kube-state-metrics"}
                ) * on(namespace, pod) group_left() max by (namespace, pod) (
                    kube_pod_status_phase{phase=~"Pending|Running"} == 1
                )
            )
        )
      record: namespace:kube_pod_container_resource_requests_memory_bytes:sum
    - expr: |
        sum by (namespace) (
            sum by (namespace, pod) (
                max by (namespace, pod, container) (
                    kube_pod_container_resource_requests_cpu_cores{job="prometheus-kube-state-metrics"}
                ) * on(namespace, pod) group_left() max by (namespace, pod) (
                  kube_pod_status_phase{phase=~"Pending|Running"} == 1
                )
            )
        )
      record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
    - expr: |
        sum(
          label_replace(
            label_replace(
              kube_pod_owner{job="prometheus-kube-state-metrics", owner_kind="ReplicaSet"},
              "replicaset", "$1", "owner_name", "(.*)"
            ) * on(replicaset, namespace) group_left(owner_name) kube_replicaset_owner{job="prometheus-kube-state-metrics"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: deployment
      record: mixin_pod_workload
    - expr: |
        sum(
          label_replace(
            kube_pod_owner{job="prometheus-kube-state-metrics", owner_kind="DaemonSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: daemonset
      record: mixin_pod_workload
    - expr: |
        sum(
          label_replace(
            kube_pod_owner{job="prometheus-kube-state-metrics", owner_kind="StatefulSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: statefulset
      record: mixin_pod_workload

  - name: node.rules
    rules:
    - expr: sum(min(kube_pod_info) by (node))
      record: ':kube_pod_info_node_count:'
    - expr: |
        max(label_replace(kube_pod_info{job="prometheus-kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
      record: 'node_namespace_pod:kube_pod_info:'
    - expr: |
        count by (node) (sum by (node, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        ))
      record: node:node_num_cpu:sum
    - expr: |
        sum(
          node_memory_MemAvailable_bytes{job="node-exporter"} or
          (
            node_memory_Buffers_bytes{job="node-exporter"} +
            node_memory_Cached_bytes{job="node-exporter"} +
            node_memory_MemFree_bytes{job="node-exporter"} +
            node_memory_Slab_bytes{job="node-exporter"}
          )
        )
      record: :node_memory_MemAvailable_bytes:sum

  - name: kube-prometheus-node-recording.rules
    rules:
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY
        (instance)
      record: instance:node_cpu:rate:sum
    - expr: sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}))
        BY (instance)
      record: instance:node_filesystem_usage:sum
    - expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
      record: instance:node_network_receive_bytes:rate:sum
    - expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
      record: instance:node_network_transmit_bytes:rate:sum
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[4m])) WITHOUT
        (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total)
        BY (instance, cpu)) BY (instance)
      record: instance:node_cpu:ratio
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[4m]))
      record: cluster:node_cpu:sum_rate4m
    - expr: cluster:node_cpu:sum_rate4m / count(sum(node_cpu_seconds_total)
        BY (instance, cpu))
      record: cluster:node_cpu:ratio

  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping 
      expr: |
        rate(kube_pod_container_status_restarts_total{job="prometheus-kube-state-metrics",container!="fluentd",namespace!="efk"}[4m]) * 60 * 5 > 0
      for: 2m
      labels:
        severity: "Error"
      annotations:
        summary: "告警事件:Pod,{{ $labels.pod }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: '告警问题:Pod Crash Looping ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes'

    - alert: KubePodNotReady 
      expr: |
        sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="prometheus-kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0
      for: 2m
      labels:
        severity: "Error"
      annotations: 
        summary: "告警事件:Pod,{{ $labels.pod }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 2 minutes"

    - alert: KubeDeploymentGenerationMismatch 
      expr: |
        kube_deployment_status_observed_generation{job="prometheus-kube-state-metrics"}
          !=
        kube_deployment_metadata_generation{job="prometheus-kube-state-metrics"}
      for: 5m
      labels:
        severity: "Error"
      annotations:
        summary: "告警事件:deployment,{{ $labels.deployment }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment
          }} does not match, this indicates that the Deployment has failed but has not been rolled back"

    - alert: KubeDeploymentReplicasMismatch 
      expr: |
        kube_deployment_spec_replicas{job="prometheus-kube-state-metrics"}
          !=
        kube_deployment_status_replicas_available{job="prometheus-kube-state-metrics"}
      for: 30m
      labels:
        severity: "Error"
      annotations:
        summary: "告警事件:deployment,{{ $labels.deployment }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not
          matched the expected number of replicas for longer than 30 minutes."

    - alert: KubeStatefulSetReplicasMismatch 
      expr: |
        kube_statefulset_status_replicas_ready{job="prometheus-kube-state-metrics"}
          !=
        kube_statefulset_status_replicas{job="prometheus-kube-state-metrics"}
      for: 5m
      labels:
        severity: "Error"
      annotations:
        summary: "告警事件:StatefulSet;{{ $labels.statefulset }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has
          not matched the expected number of replicas for longer than 15 minutes"

    - alert: KubeStatefulSetGenerationMismatch
      expr: |
        kube_statefulset_status_observed_generation{job="prometheus-kube-state-metrics"}
          !=
        kube_statefulset_metadata_generation{job="prometheus-kube-state-metrics"}
      for: 5m
      labels:
        severity: "Error"
      annotations:
        summary: "告警事件:{{ $labels.statefulset }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset
          }} does not match, this indicates that the StatefulSet has failed but has
          not been rolled back"

    - alert: KubeStatefulSetUpdateNotRolledOut 
      expr: |
        max without (revision) (
          kube_statefulset_status_current_revision{job="prometheus-kube-state-metrics"}
            unless
          kube_statefulset_status_update_revision{job="prometheus-kube-state-metrics"}
        )
          *
        (
          kube_statefulset_replicas{job="prometheus-kube-state-metrics"}
            !=
          kube_statefulset_status_replicas_updated{job="prometheus-kube-state-metrics"}
        )
      for: 5m
      labels:
        severity: "Critical"
      annotations:
        summary: "告警事件:StatefulSet,{{ $labels.statefulset }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update
          has not been rolled out."

    - alert: KubeDaemonSetRolloutStuck 
      expr: |
        kube_daemonset_status_number_ready{job="prometheus-kube-state-metrics",daemonset!="fluentd"}
          /
        kube_daemonset_status_desired_number_scheduled{job="prometheus-kube-state-metrics"} < 1.00
      for: 15m
      labels:
        severity: "Critical"
      annotations:
        summary: "告警事件:daemonset,{{ $labels.daemonset }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet
          {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready."

    - alert:  KubeContainerWaiting 
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="prometheus-kube-state-metrics"}) > 0
      for: 1h
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:Pod,{{ $labels.pod }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}}
          has been in waiting state for longer than 1 hour."

    - alert: KubeDaemonSetNotScheduled 
      expr: |
        kube_daemonset_status_desired_number_scheduled{job="prometheus-kube-state-metrics"}
          -
        kube_daemonset_status_current_number_scheduled{job="prometheus-kube-state-metrics"} > 0
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:daemonset, {{ $labels.daemonset }}'"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:'{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
          }} are not scheduled.'"

    - alert: KubeDaemonSetMisScheduled 
      expr: |
        kube_daemonset_status_number_misscheduled{job="prometheus-kube-state-metrics"} > 0
      for: 5m
      labels:
        severity: "Warning"
      annotations: 
        summary: "告警事件:DaemonSet,{{ $labels.daemonset }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:'{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
          }} are running where they are not supposed to run.'"

    - alert: KubeCronJobRunning 
      expr: |
        time() - kube_cronjob_next_schedule_time{job="prometheus-kube-state-metrics"} > 3600
      for: 20m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:CronJob,{{ $labels.cronjob }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more
        than 20m to complete."

    - alert: KubeJobCompletion 
      expr: |
        kube_job_spec_completions{job="prometheus-kube-state-metrics"} - kube_job_status_succeeded{job="prometheus-kube-state-metrics"}  > 0
      for: 1h
      labels:
        severity: "Warning"
      annotations:
        title: "Job任务未完成"
        btn: "点击查看详情 :玫瑰:"
        link: "https://www.baidu.com"
        summary: "告警事件: Job,{{ $labels.job_name }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
          than one hour to complete."

    - alert: KubeJobFailed 
      expr: |
        kube_job_failed{job="prometheus-kube-state-metrics"}  > 0
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        title: "Job失败"
        btn: "点击查看详情 :玫瑰:"
        link: "https://www.baidu.com"
        summary: "告警事件:Job,{{ $labels.job_name }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete."

    - alert: KubeHpaReplicasMismatch 
      expr: |
        (kube_hpa_status_desired_replicas{job="prometheus-kube-state-metrics"}
          !=
        kube_hpa_status_current_replicas{job="prometheus-kube-state-metrics"})
          and
        changes(kube_hpa_status_current_replicas[4m]) == 0
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:HPA,{{ $labels.hpa }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the
          desired number of replicas for longer than 15 minutes."

    - alert: KubeHpaMaxedOut 
      expr: |
        kube_hpa_status_current_replicas{job="prometheus-kube-state-metrics"}
          ==
        kube_hpa_spec_max_replicas{job="prometheus-kube-state-metrics"}
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:HPA,{{ $labels.hpa }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at
          max replicas for longer than 15 minutes."

  - name: kubernetes-resources 
    rules:
    - alert: KubeCPUOvercommit
      expr: |
        sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
          /
        sum(kube_node_status_allocatable_cpu_cores)
          >
        (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Cluster has overcommitted CPU resource requests for Pods and cannot
          tolerate node failure."

    - alert: KubeMemOvercommit 
      expr: |
        sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum)
          /
        sum(kube_node_status_allocatable_memory_bytes)
          >
        (count(kube_node_status_allocatable_memory_bytes)-1)
          /
        count(kube_node_status_allocatable_memory_bytes)
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary:  "告警事件:pod"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Cluster has overcommitted memory resource requests for Pods and cannot
          tolerate node failure."

    - alert:  KubeCPUOvercommit 
      expr: |
        sum(kube_resourcequota{job="prometheus-kube-state-metrics", type="hard", resource="cpu"})
          /
        sum(kube_node_status_allocatable_cpu_cores)
          > 1.5
      for: 5m
      labels:
        severity:  "Warning"
      annotations:
        summary: "告警事件:namespaces"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题: Cluster has overcommitted CPU resource requests for Namespaces"

    - alert: KubeMemOvercommit 
      expr: |
        sum(kube_resourcequota{job="prometheus-kube-state-metrics", type="hard", resource="memory"})
          /
        sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"})
          > 1.5
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:namespaces"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Cluster has overcommitted memory resource requests for Namespaces"

    - alert: KubeQuotaExceeded 
      expr: |
        kube_resourcequota{job="prometheus-kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="prometheus-kube-state-metrics", type="hard"} > 0)
          > 0.90
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件: {{ $labels.namespace }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota"

    - alert: CPUThrottlingHigh 
      expr: |
        sum(increase(container_cpu_cfs_throttled_periods_total{container!="",namespace!="efk" }[4m])) by (container, pod, namespace)
          /
        sum(increase(container_cpu_cfs_periods_total{}[4m])) by (container, pod, namespace)
          > ( 80 / 100 )
      for: 5m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:{{ $labels.pod }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{$labels.pod }}"

  - name: kubernetes-storage
    rules:
    - alert: KubePersistentVolumeUsageCritical 
      expr: |
        kubelet_volume_stats_available_bytes{job="kubelet"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet"}
          < 0.03
      for: 2m
      labels:
        severity: "Warning"
      annotations:
        summary: "告警事件:持久卷"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
          }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
          }} free."
      
  - name: kubernetes-system
    rules:
    - alert: KubeVersionMismatch 
      annotations:
        summary: "告警事件:{{ $value }}"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:There are {{ $value }} different semantic versions of Kubernetes
          components running."
      expr: |
        count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
      for: 5m
      labels:
        severity: "Warning"

    - alert: KubeClientErrors 
      annotations:
        summary: "告警事件:Client error"
        address: "告警主机:{{ $labels.job }}"
        description: "告警问题:Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance
          }}' is experiencing {{ $value | humanizePercentage }} errors.'"
      expr: |
        (sum(rate(rest_client_requests_total{code=~"5.."}[4m])) by (instance, job)
          /
        sum(rate(rest_client_requests_total[4m])) by (instance, job))
        > 0.01
      for: 5m
      labels:
        severity: "Error"

  - name: kubernetes-system-kubelet 
    rules:
    - alert: KubeNodeNotReady
      annotations:
        summary: "告警事件:NodeNotReady"
        address: "告警节点:{{ $labels.node }}"
        description: "告警问题:'{{ $labels.node }} has been unready for more than 2 minutes.'"
      expr: |
        kube_node_status_condition{job="prometheus-kube-state-metrics",condition="Ready",status="true"} == 0
      for: 2m
      labels:
        severity: "Warning"
        
    - alert: KubeNodeUnreachable 
      annotations:
        summary: "告警事件:KubeNodeUnreachable"
        address: "告警主机:{{ $labels.node }}"
        description: "告警问题: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.'"
      expr: |
        kube_node_spec_taint{job="prometheus-kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
      for: 2m
      labels:
        severity: "Warning"

    - alert: KubeletTooManyPods 
      annotations:
        summary: "告警事件:TooManyPods"
        address: "告警主机:{{ $labels.node }}"
        description: "告警问题:Kubelet is running at {{ $value | humanizePercentage
          }} of its Pod capacity."
      expr: |
        max(max(kubelet_running_pod_count{job="kubelet"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) by(node) / max(kube_node_status_capacity_pods{job="prometheus-kube-state-metrics"}) by(node) > 0.95
      for: 5m
      labels:
        severity: "Warning"

    - alert: KubeletDown 
      annotations:
        summary: "告警事件:KubeletDown"
        address: "命名空间:{{ $labels.namespace }}"
        description: "告警问题:Kubelet has disappeared from Prometheus target discovery."
      expr: |
        absent(up{job="kubelet"} == 1)
      for: 5m
      labels:
        severity: "Warning"
        

kubectl apply -f k8s-rules.yaml -n prometheus
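After applying, verify that the rules were loaded. The Prometheus Operator only picks up PrometheusRule objects whose labels match the Prometheus CR's ruleSelector, which is what the labels in the metadata above are for; if the rules do not show up under /rules, compare those labels against the selector. The Prometheus Service name is the same one used as the Grafana data source earlier:

kubectl get prometheusrule -n prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n prometheus
# then open http://127.0.0.1:9090/rules and http://127.0.0.1:9090/alerts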

5. Configure the alert message format

Log in to PrometheusAlert:

kubectl port-forward svc/prometheus-alert-center 8080:8080 -n prometheus, then open http://127.0.0.1:8080

Configure the alert template.

The template content is as follows:

{{ range $k,$v:=.alerts }} 状态:**{{$v.status}}** {{if eq $v.status "resolved"}}(恢复)
告警名称:[{{$v.labels.alertname}}]
告警级别:{{$v.labels.severity}}
{{$v.annotations.address}}
{{$v.annotations.summary}}
**{{$v.annotations.description}}**
⏱ : {{GetCSTtime  $v.startsAt}}
⏲ : {{GetCSTtime  $v.endsAt}} 
{{else}}(报警)
告警名称:[{{$v.labels.alertname}}]
告警级别:{{$v.labels.severity}}
{{$v.annotations.address}}
{{$v.annotations.summary}}
**{{$v.annotations.description}}**
⏱ : {{GetCSTtime  $v.startsAt}}
{{end}}
{{end}}

Here is what the resulting alert notification looks like (screenshot omitted).

 


[Done]
