K8S Monitoring Deployment Guide (Prometheus + Grafana + Alertmanager + DingTalk Bot Alerts)

I won't go into the architecture in detail here; there are plenty of similar write-ups online. I did hit quite a few pitfalls while deploying, though, so this is a record of the whole process. Let's get started!

I. Deploy the K8S cluster resource data collection component: kube-state-metrics

There are five YAML files in total:

1. cluster-role-binding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.3.0
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system

2. cluster-role.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.3.0
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch

3. deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.3.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 2.3.0
    spec:
      automountServiceAccountToken: true
      imagePullSecrets:
        - name: image-pull-secret
      containers:
      - image: # <your-private-registry>/kube-state-metrics:v2.3.0; re-tag and push the image to your own registry first, see the note below
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsUser: 65534
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics

4. service-account.yaml

apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.3.0
  name: kube-state-metrics
  namespace: kube-system

5. service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.3.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics

6. Once the files above are ready, you can deploy them.

Note that the image k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0 cannot be pulled directly from mainland China. Pull it from a domestic mirror instead, then re-tag it and push it to your own private registry. Roughly (the mirror path is a placeholder; use whichever mirror actually hosts the image):

docker pull <domestic-mirror>/kube-state-metrics:v2.3.0
docker tag <domestic-mirror>/kube-state-metrics:v2.3.0 <your-private-registry>/kube-state-metrics:v2.3.0
docker push <your-private-registry>/kube-state-metrics:v2.3.0

kubectl apply -f cluster-role-binding.yaml
kubectl apply -f cluster-role.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service-account.yaml
kubectl apply -f service.yaml
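
A quick way to check that kube-state-metrics is up and serving metrics (the label and service name come from the manifests above):

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-state-metrics
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | head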

II. Deploy K8S cluster node monitoring

1. Download

Download the installation package from the prometheus/node_exporter Releases page on GitHub.

I am using an older version here: node_exporter-0.18.1.linux-amd64.tar.gz

2. Install

Upload the package to /tmp on each K8S cluster node, then create and run the following shell script to complete the installation.

install_node_exporter.sh

#!/bin/bash


mkdir -p /usr/local/software/node-exporter
# Step 1: Move the uploaded Node Exporter package into place
mv /tmp/node_exporter-0.18.1.linux-amd64.tar.gz /usr/local/software/node-exporter

# Step 2: Extract Node Exporter
cd /usr/local/software/node-exporter/
tar xvfz node_exporter-0.18.1.linux-amd64.tar.gz

# Step 3: Change to Node Exporter directory
cd node_exporter-0.18.1.linux-amd64

# Step 4: Copy Node Exporter binary to /usr/local/bin
sudo cp node_exporter /usr/local/bin/

# Step 5: Create systemd service unit file
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOT
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=default.target
EOT

# Step 6: Enable and start Node Exporter service
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Step 7: Check Node Exporter status
sudo systemctl status node_exporter
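
Once the service is running, the exporter should answer on port 9100 (the default listen address); a quick check from the node:

curl -s http://localhost:9100/metrics | head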

III. Deploy Prometheus

Seven YAML files, one configuration file, and one alerting rules file are needed.

1. clusterRole.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring

2. prometheus--acme-tls.yml

apiVersion: v1
kind: Secret
metadata:
  name: prometheus--acme-tls
  namespace: monitoring
type: kubernetes.io/tls
data:
  ca.crt: ""
  tls.crt: # base64 of the domain's SSL certificate: cat your-cert.crt | base64
  tls.key: # base64 of the domain's SSL private key: cat your-cert.key | base64
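
Instead of pasting base64 strings by hand (note the base64 output must end up on a single line in the YAML), an equivalent Secret can be created directly from the certificate files, e.g.:

kubectl -n monitoring create secret tls prometheus--acme-tls --cert=your-cert.crt --key=your-cert.key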

3. prometheus-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
  labels:
    app: prometheus-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - --storage.tsdb.retention=60d
            - --web.enable-lifecycle
            - --web.enable-admin-api
          ports:
            - containerPort: 9090
          volumeMounts:
          - mountPath: /etc/prometheus
            name: prometheus-storage-volume
            subPath: prometheus/conf
          - mountPath: /prometheus
            name: prometheus-storage-volume
            subPath: prometheus/data
      volumes:
        - name: prometheus-storage-volume
          persistentVolumeClaim:
            claimName: pvc-nas-prometheus
  

4. prometheus-ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/whitelist-source-range: "<IPs to whitelist; access is denied for addresses not on the list>"
spec:
  tls:
  - hosts:
    # custom domain name used to access Prometheus
    - prometheus.xxx.xxx.com
    secretName: prometheus--acme-tls
  rules:
  - host: prometheus.xxx.xxx.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-service
          servicePort: 8080
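
The manifest above uses the extensions/v1beta1 Ingress API, which was removed in Kubernetes 1.22. On newer clusters the equivalent resource would look roughly like this (same host, secret, and backend service as above):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - prometheus.xxx.xxx.com
    secretName: prometheus--acme-tls
  rules:
  - host: prometheus.xxx.xxx.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-service
            port:
              number: 8080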

5. prometheus-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9090'
  
spec:
  selector: 
    app: prometheus-server
  ports:
    - port: 8080
      targetPort: 9090 

6. 10-pv.yml (Alibaba Cloud NAS is used here; create the PV and PVC to match your own environment)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nas-prometheus
  labels:
    alicloud-pvname: pv-nas-prometheus
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: pv-nas-prometheus
    volumeAttributes:
      server: "xxxxxxxx.cn-shenzhen.nas.aliyuncs.com"
      path: "/prometheus"
  mountOptions:
  - nolock,tcp,noresvport
  - vers=3

7. 20-pvc.yml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-nas-prometheus
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      alicloud-pvname: pv-nas-prometheus

8. prometheus.rules (place this file inside the PV directory; in my case it lives at /prometheus/prometheus/conf/ on the Alibaba Cloud NAS)

## CPU alert rules
groups:
- name: CpuAlertRule
  rules:
  - alert: PodCPU告警
    expr: (sum(rate(container_cpu_usage_seconds_total{image!="",pod!=""}[1m])) by (namespace, pod)) / (sum(container_spec_cpu_quota{image!="", pod!=""}) by(namespace, pod) / 100000) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "CPU使用率大于80%"
      value: "{{$value}}%"
      #summary: 'CPU使用率大于80%,当前值为{{.Value}}%,CPU使用率: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'
  - alert: NodeCPU告警
    expr: round(100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))by(kubernetes_node)*100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "CPU使用率大于80%"
      value: "{{$value}}%"
      #summary: 'CPU使用率大于80%,当前值为{{.Value}}%,CPU使用率: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'

## Disk alert rules
- name: DiskAlertRule
  rules:
  - alert: Node磁盘告警
    expr: round((1- node_filesystem_avail_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"}/node_filesystem_size_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"})*100) > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "磁盘使用率大于85%"
      value: "{{$value}}%"

## Memory alert rules
- name: MemAlertRule
  rules:
  - alert: Pod内存告警
    expr: sum(container_memory_working_set_bytes{image!=""}) by(namespace, pod) / sum(container_spec_memory_limit_bytes{image!=""}) by(namespace, pod) * 100 != +inf > 85
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "内存使用率大于85%"
      value: "{{$value}}%"
  - alert: Node内存告警
    expr: round(100-((node_memory_MemAvailable_bytes*100)/node_memory_MemTotal_bytes)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "内存使用率大于85%"
      value: "{{$value}}%"

## Unexpected Pod restarts
- name: PodRestartAlertRule
  rules:
  - alert: Pod重启告警
    expr: delta(kube_pod_container_status_restarts_total[1m]) > 0
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "Pod发生意外重启事件"

## JvmCMSOldGC
- name: PodJvmOldGCAlertRule
  rules:
  - alert: PodJvmCMSOldGC
    expr: round((jvm_memory_pool_bytes_used{pool=~".+Old Gen"}/jvm_memory_pool_bytes_max{pool=~".+Old Gen"})*100) > 89
    for: 5s
    labels:
      severity: warning
    annotations:
      description: "Pod堆内存触发CMSOldGC"
      value: "{{$value}}%"

## Pod instance abnormal
- name: ContainerInstanceAlertRule
  rules:
  - alert: Pod实例异常
    expr: kube_pod_container_status_ready - kube_pod_container_status_running > 0
    for: 20s
    labels:
      severity: warning
    annotations:
      description: "Container实例异常"

## Pod instance OOM
- name: ContainerOOMAlertRule
  rules:
  - alert: Pod实例OOM
    expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "Container实例OOM"

## Pod instance evicted
- name: ContainerEvictionAlertRule
  rules:
  - alert: Pod实例驱逐
    expr: kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "Container实例驱逐"

## MQ memory alerts
- name: MQMemoryAlertRule
  rules:
  - alert: MQ内存水位线
    expr: rabbitmq_node_mem_alarm{job=~".*rabbitmq.*"} == 1
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "RabbitMQ内存高水位线告警"
      summary: RabbitMQ {{`{{ $labels.instance }}`}} High Memory Alarm is going off.  Which means the node hit highwater mark and has cut off network connectivity, see RabbitMQ WebUI
  - alert: MQ内存使用告警
    expr: round(avg(rabbitmq_node_mem_used{job=~".*rabbitmq.*"} / rabbitmq_node_mem_limit{job=~".*rabbitmq.*"})by(node,kubernetes_namespace)*100) > 90
    for: 10s
    labels:
      severity: warning
    annotations:
      description: "RabbitMQ使用告警"
      value: "{{$value}}%"
      summary: RabbitMQ {{`{{ $labels.instance }}`}} Memory Usage > 90%

## Pod Java process down
- name: PodJavaProcessAlertRule
  rules:
  - alert: PodJava进程异常
    expr: sum(up{job="kubernetes-pods-jvm"})by(kubernetes_container_name,kubernetes_pod_name) == 0
    for: 10s
    labels:
      severity: warning
    annotations:
      description: "PodJava进程异常"
      summary: "赶快看看吧,顶不住了"

9. prometheus.yml (place this file inside the PV directory; in my case /prometheus/prometheus/conf/ on the Alibaba Cloud NAS)

    global:
      scrape_interval: 5s
      evaluation_interval: 5s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets:
          - "alertmanager.monitoring.svc.cluster.local:9093"
    scrape_configs:
      - job_name: 'node-exporter'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_endpoints_name]
          regex: 'node-exporter'
          action: keep
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
      - job_name: 'kube-state-metrics'
        static_configs:
          - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
## Node monitoring targets; fill in according to your environment
      - job_name: 'k8s-pro'
        static_configs:
        - targets: ['<node-ip>:9100']
          labels:
            instance: devops.105213.pro
        - targets: ['<node-ip>:9100']
          labels:
            instance: devops.104245.pro
        - targets: ['<node-ip>:9100']
          labels:
            instance: devops.104249.pro
        - targets: ['<node-ip>:9100']
          labels:
            instance: devops.105007.pro
        - targets: ['<node-ip>:9100']
          labels:
            instance: devops.105008.pro
        - targets: ['<node-ip>:9100']
          labels:
            instance: devops.104250.pro

10. Once the YAML files above are ready, deploy them.

Create the namespace: kubectl create namespace monitoring

kubectl apply -f 10-pv.yml

kubectl apply -f 20-pvc.yml

Copy prometheus.rules into the configuration path on the PV described above

Copy prometheus.yml into the configuration path on the PV described above

kubectl apply -f clusterRole.yaml

kubectl apply -f prometheus-deployment.yaml

kubectl apply -f prometheus-service.yaml

kubectl apply -f prometheus--acme-tls.yml

kubectl apply -f prometheus-ingress.yaml

Verification: open https://prometheus.xxx.xxx.com
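
Because --web.enable-lifecycle is enabled in the Deployment, later changes to prometheus.yml or prometheus.rules can be reloaded without restarting the pod, for example:

curl -X POST https://prometheus.xxx.xxx.com/-/reload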

IV. Deploy Alertmanager + webhook-dingtalk

The deployment needs six YAML files, one configuration file, and one alert message template file.

1. 10-pv.yml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nas-alertmanager
  labels:
    alicloud-pvname: pv-nas-alertmanager
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: pv-nas-alertmanager
    volumeAttributes:
      # Alibaba Cloud NAS is used here; adjust to your environment
      server: "xxxx.cn-shenzhen.nas.aliyuncs.com"
      path: "/alertmanager"
  mountOptions:
  - nolock,tcp,noresvport
  - vers=3

2. 20-pvc.yml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-nas-alertmanager
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  selector:
    matchLabels:
      alicloud-pvname: pv-nas-alertmanager

3. alertmanager--acme-tls.yml

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager--acme-tls
  namespace: monitoring
type: kubernetes.io/tls
data:
  ca.crt: ""
  tls.crt: # cat your-cert.crt | base64
  tls.key: # cat your-cert.key | base64

4. alertmanager-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-deployment
  namespace: monitoring
  labels:
    app: alertmanager-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-server
  template:
    metadata:
      labels:
        app: alertmanager-server
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest
          args:
            - "--config.file=/etc/alertmanager/config.yml"
            - "--storage.path=/alertmanager/data"
            - --cluster.advertise-address=0.0.0.0:9093
          ports:
            - containerPort: 9093
              protocol: TCP
          volumeMounts:
          - mountPath: /etc/alertmanager
            name: alertmanager-storage-volume
            subPath: conf
          - mountPath: /alertmanager/data
            name: alertmanager-storage-volume
            subPath: data
      volumes:
        - name: alertmanager-storage-volume
          persistentVolumeClaim:
            claimName: pvc-nas-alertmanager

5. alertmanager-ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: alertmanager-service
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/whitelist-source-range: "<IPs to whitelist; access is denied for addresses not on the list>"
spec:
  tls:
  - hosts:
    - alert.xxx.com
    secretName: alertmanager--acme-tls
  rules:
  - host: alert.xxx.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager
          servicePort: 9093

6. alertmanager-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector: 
    app: alertmanager-server
  ports:
    - name: web
      port: 9093
      protocol: TCP
      targetPort: 9093

7. config.yml (this configures DingTalk bot alerts only; email is not set up here, add it if needed)

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity', 'namespace']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10s
  receiver: 'webhook'
  routes:
  - receiver: 'webhook'
    group_wait: 10s
    group_interval: 15s
    repeat_interval: 3h
templates:
- /etc/alertmanager/config/template.tmp1

receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://webhook-dingtalk'
    send_resolved: true
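
The Alertmanager configuration can be validated before it is mounted, e.g. with amtool (bundled with the Alertmanager release):

amtool check-config config.yml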

8. template.tmp1 (alert message template file). The .Add 28800e9 calls below shift the UTC timestamps forward by 28800 seconds (8 hours, expressed in nanoseconds) so the times render in Beijing time.

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
===异常告警===
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.description}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
===END===
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
===异常恢复===
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.description}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
===END===
{{- end }}
{{- end }}
{{- end }}
{{- end }}

The rendered DingTalk message looks roughly like this (screenshot omitted).

9. webhook-dingtalk.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
  labels:
    app: webhook-dingtalk
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webhook-dingtalk
  template:
    metadata:
      labels:
        app: webhook-dingtalk
    spec:
      containers:
        - name: webhook-dingtalk
          image: yangpeng2468/alertmanager-dingtalk-hook:v1
          env:
          - name: ROBOT_TOKEN
            valueFrom:
              secretKeyRef:
                name: dingtalk-secret
                key: token
          ports:
            - containerPort: 5000
              protocol: TCP
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 500Mi

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: webhook-dingtalk
  name: webhook-dingtalk
  namespace: monitoring
  # must be in the same namespace as Alertmanager
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 5000
  selector:
    app: webhook-dingtalk
  type: ClusterIP

10. Create a bot in your DingTalk group and obtain its token.

11. Once the files above are ready, start deploying.

Create the secret holding the DingTalk bot token, where xxxxxxxxxx is the bot's token:

kubectl create secret generic dingtalk-secret --from-literal=token=xxxxxxxxxx -n monitoring 
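
To confirm the secret holds the expected token:

kubectl -n monitoring get secret dingtalk-secret -o jsonpath='{.data.token}' | base64 -d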

kubectl apply -f webhook-dingtalk.yaml

kubectl apply -f 10-pv.yml

kubectl apply -f 20-pvc.yml

Place config.yml in the conf/ directory under the /alertmanager NAS PV path created above

Place template.tmp1 in the conf/config/ directory under the /alertmanager NAS PV path created above

kubectl apply -f alertmanager-deployment.yaml

kubectl apply -f alertmanager--acme-tls.yml

kubectl apply -f alertmanager-service.yaml

kubectl apply -f alertmanager-ingress.yaml

Verify by visiting https://alert.xxx.com in a browser.
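
To exercise the whole chain end to end, a test alert can be pushed straight into Alertmanager's v2 API (the alert name and labels below are throwaway values):

curl -X POST https://alert.xxx.com/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"description":"test alert sent with curl"}}]'

If the webhook is wired up correctly, the DingTalk bot should post a message in the group shortly afterwards.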

At this point metrics collection and alerting are both in place; next, deploy Grafana to visualize the data.

V. Deploy Grafana

Because of my environment and architecture, Grafana is not deployed inside the K8S cluster; it runs with Docker instead. If you need it inside K8S, the docker-compose file can be converted into K8S YAML manifests.

1. docker-compose.yml

version: "3"
services:
  grafana:
    image: grafana/grafana:8.1.5
    container_name: grafana
    restart: always
    network_mode: "host"
    # ports:
    #   - 3000:3000
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=<password>
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_SECURITY_ALLOW_EMBEDDING=true  
    volumes:
      - /etc/localtime:/etc/localtime:ro
      -  /data/volumes/monitor/grafana:/var/lib/grafana:z
      # The grafana.ini config file can also be mounted out; I have not touched this in this deployment
      # - /data/volumes/monitor/grafana-cfg/grafana.ini:/etc/grafana/grafana.ini:z 

Bring it up with docker-compose up -d, then visit http://<ip>:3000 and log in with the admin credentials.

2. Add the Prometheus data source URL
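
This can be done in the UI (Configuration > Data sources > Add data source > Prometheus, then enter the Prometheus address). If you prefer to provision it as code, a file like the following, mounted into the container under /etc/grafana/provisioning/datasources/, achieves the same thing (a sketch; the URL is the Prometheus ingress from earlier):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: https://prometheus.xxx.xxx.com
    isDefault: true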

3. Import a Grafana dashboard for K8S resource monitoring

I used dashboard ID 13105 here.

From there, adjust the PromQL queries in the dashboard to match your environment; I won't go through them one by one here.

The resulting dashboards look roughly as follows (screenshots omitted).

VI. Pitfalls encountered

1. Prometheus reported an error when sending alerts to Alertmanager

Error message:

http://alertmanager.monitoring.svc.cluster.local:9093/api/v2/alerts count=1 msg="error sending alert" err="bad response status 404 not found"

Investigation showed it was an Alertmanager version issue: I had been running v0.15.1, which does not serve the v2 alerts API that Prometheus was calling. Switching the image to prom/alertmanager:latest fixed it.

2. Alertmanager reported an error when calling webhook-dingtalk

Error message:

error in app: exception on /dingtalk/send/ 

I simply dropped the image I had been using (billy98/webhook-dingtalk:latest on port 8080); the fix was to switch to the deployment YAML shown above.
