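These are the study notes of Sunny 王苗苗: keep learning, keep sharing, keep improving, on the road to mastery~
The previous post deployed prometheus and grafana; this one puts alertmanager to work for email alerting.
Prometheus collects metrics and evaluates its alerting rules; when a rule fires, it sends the alert to alertmanager, which then sends notifications (SMS, email, etc.) according to its own notification rules.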
The main steps for deploying alertmanager are:
(1) Deploy alertmanager
(2) Configure communication between alertmanager and prometheus
(3) Configure the alerting rules
(1) Deploy alertmanager
1. Create alertmanager-cfg.yaml
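kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: prom
data:
  alertmanager.yml: |-
    global:
      smtp_smarthost: "smtp.qq.com:465"         # SMTP server
      smtp_from: "1306125687@qq.com"            # sender address
      smtp_auth_username: "1306125687@qq.com"   # username for SMTP authentication
      smtp_auth_password: "baxhmslffyaujbhc"    # authorization code
      smtp_require_tls: false
    route:
      group_by: []                  # user-defined: labels to group alerts by
      group_wait: 30s               # how long to wait before sending the first notification for a group
      group_interval: 1m            # how long to wait before notifying about new alerts in an existing group
      repeat_interval: 1m           # how long to wait before re-sending a notification
      receiver: default-receiver    # user-defined receiver name
    receivers:                      # alert receiver definitions
      - name: default-receiver      # must match the value of receiver above
        email_configs:
          - to: 'sunny.wang@tech-trans.com'   # recipient address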
# Apply the config file
kubectl apply -f alertmanager-cfg.yaml
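The routing block above sends everything to a single receiver and repeats every minute, which is handy for a first test but would be noisy in everyday use. Below is a minimal sketch of a less chatty route tree. It assumes the `severity` labels used by the alert rules later in this post; the interval values are illustrative assumptions rather than settings from these notes, and the `match` syntax is the one accepted by older Alertmanager releases such as v0.16.x.

```yaml
# Sketch only: an alternative "route" section for alertmanager.yml.
route:
  group_by: ['alertname', 'instance']   # one notification per alert name and instance
  group_wait: 30s
  group_interval: 5m                    # assumed value
  repeat_interval: 4h                   # assumed value; re-notify far less often than 1m
  receiver: default-receiver
  routes:
    - match:
        severity: error                 # label set by the InstanceDown rule
      receiver: default-receiver
      repeat_interval: 1h               # assumed value; repeat critical alerts sooner
```

Alerts that match no child route fall back to the top-level receiver, so the single-receiver behaviour of the original config is preserved.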
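Obtaining the authorization code (generated in the QQ Mail account settings after enabling the SMTP service)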
2. Create alertmanager-deploy.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: alertmanager-deployment
  name: alertmanager
  namespace: prom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - image: prom/alertmanager:v0.16.1
          name: alertmanager
          ports:
            - containerPort: 9093
              protocol: TCP
          volumeMounts:
            - mountPath: "/alertmanager"
              name: data
            - mountPath: "/etc/alertmanager"
              name: config-volume
          resources:
            requests:
              cpu: 50m
              memory: 50Mi
            limits:
              cpu: 200m
              memory: 200Mi
      volumes:
        - name: data
          emptyDir: {}
        - name: config-volume
          configMap:
            name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: alertmanager
  annotations:
    prometheus.io/scrape: 'true'
  name: alertmanager
  namespace: prom
spec:
  type: NodePort
  ports:
    - port: 9093
      targetPort: 9093
      nodePort: 31192
  selector:
    app: alertmanager
# Deploy alertmanager
kubectl apply -f alertmanager-deploy.yaml
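The Deployment above relies on the image's default start-up flags, which is why mounting the ConfigMap at /etc/alertmanager and the data volume at /alertmanager is enough. The fragment below is a sketch that spells those flags out and adds health probes. It assumes that `--config.file` and `--storage.path` match the defaults of the prom/alertmanager image, and that the `/-/ready` and `/-/healthy` endpoints are served on port 9093 in this version; the probe timings are made-up values. None of this comes from the original manifest.

```yaml
# Sketch only: container entry for alertmanager-deploy.yaml with explicit flags and probes.
containers:
  - image: prom/alertmanager:v0.16.1
    name: alertmanager
    args:
      - "--config.file=/etc/alertmanager/alertmanager.yml"   # the key mounted from the ConfigMap
      - "--storage.path=/alertmanager"                       # the emptyDir data volume
    readinessProbe:
      httpGet:
        path: /-/ready        # assumed endpoint
        port: 9093
      initialDelaySeconds: 5  # assumed timing
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /-/healthy      # assumed endpoint
        port: 9093
      initialDelaySeconds: 10 # assumed timing
      periodSeconds: 10
```

Writing the flags explicitly makes the link between the mount paths and the running process visible, at the cost of having to keep them in sync by hand.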
(2) Configure communication between alertmanager and prometheus
Add an alerting section to prometheus-cfg.yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.prom.svc:9093"]
(3) Configure the alerting rules
Create prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: prom
data:
  # General rules
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m   # the condition must hold for 1 minute
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  # Node resource monitoring
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: filesystem usage on {{ $labels.mountpoint }} is too high"
          description: "{{ $labels.instance }}: filesystem usage on {{ $labels.mountpoint }} is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: memory usage is too high"
          description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU usage is too high"
          description: "{{ $labels.instance }}: CPU usage is above 80% (current value: {{ $value }})"
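The memory rule above adds up free, cached and buffer memory by hand. As a possible alternative, here is a sketch of the same check based on the kernel's own estimate of available memory. It assumes the node exporter in use exposes `node_memory_MemAvailable_bytes` (this depends on the exporter version and on the kernel reporting MemAvailable); the alert name is hypothetical, and the entry would go under `rules:` in node.rules.

```yaml
# Sketch only: an alternative memory rule, to be placed under "rules:" in node.rules.
- alert: NodeMemoryUsageAvailable   # hypothetical alert name
  # Used memory in percent, derived from MemAvailable (assumed to be exported).
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Instance {{ $labels.instance }}: memory usage is too high"
    description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})"
```

Both expressions can be tried side by side in the prometheus expression browser before deciding which one to alert on.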
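For prometheus to read the rules, two things are needed:
1. Mount the rules ConfigMap into the prometheus container
2. Tell prometheus the path of the rule files
Modify the volume mount configuration in prometheus-deploy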
# under the prometheus container
volumeMounts:
  - mountPath: /etc/prometheus/prometheus.yml
    name: prometheus-config
    subPath: prometheus.yml
  - mountPath: /prometheus/
    name: prometheus-storage-volume
  - mountPath: /etc/prometheus/rules   # mount the rules at a path inside the container
    name: prometheus-rules
    subPath: ""
# under the pod spec
volumes:
  - name: prometheus-config
    configMap:
      name: prometheus-config
      items:
        - key: prometheus.yml
          path: prometheus.yml
          mode: 0644
  - name: prometheus-rules             # volume backed by the rules ConfigMap
    configMap:
      name: prometheus-rules
Modify the prometheus-config configuration to tell prometheus where to read the rules from
rule_files:
  - "/etc/prometheus/rules/*.rules"
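To show where the two additions from this post sit relative to each other, here is a sketch of the top level of prometheus.yml. The `global` values are assumptions for illustration, and the scrape configuration from the previous post is deliberately left out rather than guessed.

```yaml
# Sketch only: top-level layout of prometheus.yml after the changes in this post.
global:
  scrape_interval: 15s        # assumed value
  evaluation_interval: 15s    # assumed value; how often the alerting rules are evaluated
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.prom.svc:9093"]
rule_files:
  - "/etc/prometheus/rules/*.rules"   # matches the general.rules and node.rules keys
# scrape_configs: keep the existing section from the previous post unchanged
```

The glob only picks up files ending in .rules, so any new key added to the prometheus-rules ConfigMap needs that suffix to be loaded.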
After making these changes, apply prometheus-rules.yaml and re-apply prometheus-config.yaml and prometheus-deploy.yaml
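Verification:
The alert email received
At this point the email pipeline works end to end; the rule configuration itself still needs more careful exploration~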
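If none of the real rules happens to be firing, one way to exercise the whole chain is a rule that is always true. The sketch below uses hypothetical group and alert names; it could be added as an extra key such as `test.rules` in the prometheus-rules ConfigMap and removed once an email has arrived.

```yaml
# Sketch only: a rule file whose single alert always fires, for testing the email path.
groups:
- name: test.rules            # hypothetical group name
  rules:
  - alert: AlwaysFiring       # hypothetical alert name
    expr: vector(1)           # always returns a value, so the alert is always active
    labels:
      severity: warning
    annotations:
      summary: "Test alert to verify the email pipeline"
```

With the one-minute `repeat_interval` configured earlier, leaving this rule in place would produce a steady stream of emails, so it is meant for short tests only.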