告警平台(Alertmanager)高级配置
基于企业微信的报警媒介
- 实时告警通知:企业微信/钉钉等即时通信工具能够实现实时的告警通知,使得团队成员能够及时响应和解决问题
- 通知范围更广:基于企业微信/钉钉的告警通知可以通过群组和@某人的方式,将告警通知发送给更广泛的接收者,避免出现漏报的情况
- 告警信息更直观:企业微信/钉钉等通信工具提供了更丰富的告警信息呈现方式,例如文本消息、链接、图片、语音等,使得告警信息更加直观和易于理解
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitor
data:
alertmanager.yml: |-
global:
resolve_timeout: 1m
smtp_smarthost: 'smtp.exmail.qq.com:465' # 邮箱服务器的SMTP主机配置
smtp_from: 'zhdya@zhdya.cn' # 发送邮件主题
smtp_auth_username: '1440350254@qq.com' # 登录用户名
smtp_auth_password: '***' # 此处的auth password是邮箱的第三方登录授权密码,而非用户密码
smtp_require_tls: false # 有些邮箱需要开启此配置,这里使用的是企微邮箱,仅做测试,不需要开启此功能。
templates:
- '/etc/alertmanager/*.tmpl'
route:
group_by: ['env','instance','type','group','job','alertname','cluster']
group_wait: 10s
group_interval: 2m
repeat_interval: 10m
receiver: 'email'
routes:
- receiver: 'wechat'
match:
severity: critical
receivers:
- name: 'email'
email_configs:
- to: '1440350254@qq.com'
send_resolved: true
html: '{{ template "email.to.html" . }}'
- name: 'wechat'
wechat_configs:
- corp_id: 'xxx'
to_party: '413'
to_user: '@all'
agent_id: 1000035
api_secret: 'xxx'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
wechat.tmpl: |-
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========= 监控报警 =========
告警状态:{{ .Status }}
告警级别:{{ .Labels.severity }}
告警类型:{{ $alert.Labels.alertname }}
故障主机: {{ $alert.Labels.instance }}
告警主题: {{ $alert.Annotations.summary }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
触发阀值:{{ .Annotations.value }}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========= 告警恢复 =========
告警类型:{{ .Labels.alertname }}
告警状态:{{ .Status }}
告警主题: {{ $alert.Annotations.summary }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
email.tmpl: |-
{{ define "email.from" }}xxx.com{{ end }}
{{ define "email.to" }}xxx.com{{ end }}
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
========= 监控报警 =========<br>
告警程序: prometheus_alert <br>
告警级别: {{ .Labels.severity }} <br>
告警类型: {{ .Labels.alertname }} <br>
告警主机: {{ .Labels.instance }} <br>
告警主题: {{ .Annotations.summary }} <br>
告警详情: {{ .Annotations.description }} <br>
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
========= = end = =========<br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts }}
========= 告警恢复 =========<br>
告警程序: prometheus_alert <br>
告警级别: {{ .Labels.severity }} <br>
告警类型: {{ .Labels.alertname }} <br>
告警主机: {{ .Labels.instance }} <br>
告警主题: {{ .Annotations.summary }} <br>
告警详情: {{ .Annotations.description }} <br>
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
========= = end = =========<br>
{{ end }}{{ end -}}
{{- end }}
测试验证
# 匹配如上webhook标签:hostname:zhdya
$ curl -XPOST -H 'Content-Type: application/json' http://alertmanager.kubernets.cn/api/v1/alerts -d '[{"labels":{"severity":"critical"},"annotations":{"summary":"This is a test alert"}}]'
告警触发
基于钉钉的报警媒介
钉钉-机器人管理(复制生成的webhook)
自定义机器人安全设置 - 钉钉开放平台 (dingtalk.com)
dingtalk部署配置
vim webhook-dingtalk.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
run: dingtalk
name: webhook-dingtalk
namespace: monitor
spec:
replicas: 1
selector:
matchLabels:
run: dingtalk
template:
metadata:
labels:
run: dingtalk
spec:
containers:
- name: dingtalk
image: timonwong/prometheus-webhook-dingtalk:v1.4.0
imagePullPolicy: IfNotPresent
args:
- --ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=<替换成你的token>
ports:
- containerPort: 8060
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
labels:
run: dingtalk
name: webhook-dingtalk
namespace: monitor
spec:
ports:
- port: 8060
protocol: TCP
targetPort: 8060
selector:
run: dingtalk
sessionAffinity: None
配置alertmanager配置文件configmap
route:
group_by: ['env','instance','type','group','job','alertname','cluster']
group_wait: 10s
group_interval: 2m
repeat_interval: 10m
receiver: 'email'
routes:
- receiver: 'wechat'
match:
severity: critical
- receiver: 'webhook' ## 新增告警receiver通道
match:
hostname: zhdya
receivers:
- name: 'email'
email_configs:
- to: 'zhdyaa@163.com'
send_resolved: true
html: '{{ template "email.to.html" . }}'
- name: 'wechat'
wechat_configs:
- corp_id: 'xxx'
to_party: '413'
to_user: '@all'
agent_id: 1000035
api_secret: 'xxx'
send_resolved: true
- name: 'webhook' ## 配置接收告警的媒介
webhook_configs:
- url: 'http://webhook-dingtalk.monitor.svc.cluster.local:8060/dingtalk/webhook1/send'
send_resolved: true
测试验证
## 匹配如上webhook标签:hostname:zhdya
$ curl -XPOST -H 'Content-Type: application/json' http://alertmanager.kubernets.cn/api/v1/alerts -d '[{"labels":{"hostname":"zhdya"},"annotations":{"summary":"This is a test alert"}}]'
告警静默
什么是告警静默
静默 Silences
:指让通过设置让警报在指定时间暂时不会发送警报的一种方式。
- 用于解决严重生产故障问题时,因所花费的时间过长,通过静默设置避免接收到过多的无用通知;
- 在已知的例行维护中,为了防止对例行维护的机器发送不必要的警报;
如何设置临时静默
总结
- 实时告警通知:企业微信/钉钉等即时通信工具能够实现实时的告警通知,使得团队成员能够及时响应和解决问题。
- 告警抑制:对已知或排查问题的时候进行告警静默。