Alertmanager notification configuration and Pushgateway
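1. Tiered alert delivery
1.1 Route different alerts through different channels
- Send alerts to WeCom (企业微信) by default, i.e. every alert that matches no sub-route
- Send all alerts for pods to DingTalk
- Send alerts from 192.168.31.123:9100 by email (this first version of the route still names the receiver 'web.hook'; section 2.2 uses 'email')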
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'wechat'
  routes:
  - receiver: 'dingtalk'
    group_wait: 30s
    repeat_interval: 10m
    match_re:
      service: 'pods'
  - receiver: 'web.hook'
    group_wait: 20s
    repeat_interval: 20m
    match_re:
      instance: '192.168.31.123:9100'
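The routing tree above can be read as a first-match rule list: an alert takes the first sub-route whose matcher fits, and otherwise falls back to the root receiver. The sketch below only imitates that behaviour for this particular config; `pick_receiver` is a hypothetical helper, not part of Alertmanager, and plain string equality stands in for the anchored regexes that `match_re` really uses.

```shell
#!/usr/bin/env bash
# Illustration only: mimics the first-match behaviour of the routing tree above.
# pick_receiver is a hypothetical helper, not an Alertmanager command.
pick_receiver() {
  local service="$1" instance="$2"
  # Sub-routes are tried in order; without `continue: true` the first match wins.
  if [ "$service" = "pods" ]; then
    echo "dingtalk"
  elif [ "$instance" = "192.168.31.123:9100" ]; then
    echo "web.hook"
  else
    # No sub-route matched: the root route's receiver handles the alert.
    echo "wechat"
  fi
}

pick_receiver pods 192.168.31.50:9100      # dingtalk
pick_receiver node 192.168.31.123:9100     # web.hook
pick_receiver node 192.168.31.124:9100     # wechat

# Against the real file (needs amtool and the config on disk):
#   ./amtool config routes test --config.file=alertmanager.yml service=pods
```

Note that an alert carrying both `service=pods` and the matching instance goes to DingTalk only, because the DingTalk route is listed first.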
1.2 Validate the configuration and restart Alertmanager
root@prometheus-2:/apps/alertmanager# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 3 receivers
- 0 templates
root@prometheus-2:/apps/alertmanager# systemctl restart alertmanager.service
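The two steps above can be chained so that a broken config never reaches the running service. This is a minimal sketch: `CHECK_CMD` and `RESTART_CMD` are hypothetical override variables added here so the flow can be tried without touching a real service, and the absolute paths are assumed from the working directory shown in the prompt.

```shell
#!/usr/bin/env bash
# Restart only when the config check passes.
# CHECK_CMD and RESTART_CMD are hypothetical overrides; the defaults mirror
# the commands shown above (paths assumed from the /apps/alertmanager prompt).
CHECK_CMD="${CHECK_CMD:-/apps/alertmanager/amtool check-config /apps/alertmanager/alertmanager.yml}"
RESTART_CMD="${RESTART_CMD:-systemctl restart alertmanager.service}"

restart_if_valid() {
  # Word splitting is intentional: each variable holds a command plus arguments.
  if $CHECK_CMD; then
    $RESTART_CMD && echo "restarted"
  else
    echo "config check failed, not restarting" >&2
    return 1
  fi
}

# Usage: restart_if_valid
```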
2. Custom Alertmanager templates
2.1 Create the templates
# mkdir template
root@prometheus-2:/apps/alertmanager# cat /apps/alertmanager/template/wechat.tmpl
{{ define "wechat.default.message" }}
{{ range $i, $alert := .Alerts }}
===Alertmanager monitoring alert===
Status: {{ .Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Application: {{ $alert.Annotations.summary }}
Host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Trigger value: {{ $alert.Annotations.value }}
Details: {{ $alert.Annotations.description }}
Fired at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===========end============
{{ end }}
{{ end }}
root@prometheus-2:/apps/alertmanager# cat /apps/alertmanager/template/mail.tmpl
{{ define "email.default.message" }}
{{ range .Alerts }}
<pre>
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
</pre>
{{ end }}
{{ end }}
2.2 Edit alertmanager.yml
Add or change the following settings in alertmanager.yml:
templates:
  - '/apps/alertmanager/template/wechat.tmpl'
  - '/apps/alertmanager/template/mail.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'wechat'
  routes:
  - receiver: 'dingtalk'
    group_wait: 30s
    repeat_interval: 2m
    match_re:
      service: 'pods'
  - receiver: 'email'
    group_wait: 20s
    repeat_interval: 2m
    match_re:
      instance: '192.168.31.123:9100'
receivers:
- name: 'email'
  email_configs:
  - to: '13917099322@139.com'
    html: '{{ template "email.default.message" . }}'
    headers: { Subject: "{{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }
- name: 'dingtalk'
  webhook_configs:
  - url: 'http://192.168.31.201:8060/dingtalk/alertname/send'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - corp_id: 'ww11cfebc3eb8be3e9'
    #to_user: '@all'
    to_party: '2'
    agent_id: '1000003'
    api_secret: '9j4EAng2zKXabEMVkRQemGtBZQVA2728jHATBHgXD9w'
    send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
2.3 Restart to apply the configuration
After the restart, notifications are rendered with the new templates.
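Rather than waiting for a real alert, a hand-made one can be posted to Alertmanager to see how the templates render. The sketch below builds the JSON body for the v2 alerts API; `build_test_alert` is a hypothetical helper, the label values are made up for the test, and the Alertmanager address (default port 9093 on the local host) is an assumption.

```shell
#!/usr/bin/env bash
# Build a one-alert JSON payload for Alertmanager's v2 API.
# build_test_alert is a hypothetical helper; the values are test data only.
build_test_alert() {
  local alertname="$1" instance="$2" severity="$3"
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},' \
    "$alertname" "$instance" "$severity"
  printf '"annotations":{"summary":"template test","description":"sent by hand to check the template"}}]\n'
}

build_test_alert TemplateTest 192.168.31.123:9100 warning

# To fire it (assumes Alertmanager listens on its default port 9093 on this host):
#   build_test_alert TemplateTest 192.168.31.123:9100 warning |
#     curl -s -H 'Content-Type: application/json' --data-binary @- \
#       http://127.0.0.1:9093/api/v2/alerts
```

With the routes from section 2.2, this instance label should send the test alert to the email receiver; changing the labels exercises the other receivers.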
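3. Alert inhibition and silences
3.1 Inhibition
With the alerting rules below, once usage exceeds 80% the 60% alert is no longer sent: the alert fired by the 60% expression is inhibited.
- Usage above 60% but below 80%: the 60% alert is sent
- Usage above 80%: the 60% alert is inhibited and only the 80% alert is sent
- Usage below 60%: no alert is sent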
- name: alertmanager_node.rules
  rules:
  - alert: DiskUsage
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80 # disk usage above 80%
    for: 2s
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"
  - alert: DiskUsage
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 60 # disk usage above 60%
    for: 2s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.mountpoint}} disk partition usage is above 60% (current: {{$value}}%)"
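The rules above only produce two alerts with different severities; the muting itself is done by an `inhibit_rules` entry in alertmanager.yml, which the text does not show. The fragment printed below is an assumption about what such a rule could look like, not the author's actual rule. It uses the older `source_match`/`target_match` keys to stay consistent with the `match_re` style used earlier; newer Alertmanager releases prefer `source_matchers`/`target_matchers`. `print_inhibit_rule` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print an inhibit_rules fragment for alertmanager.yml.
# Assumption: this mirrors the intent described above; the rule actually
# used by the author is not shown in the text.
print_inhibit_rule() {
  cat <<'EOF'
inhibit_rules:
  - source_match:
      severity: 'critical'      # the >80% alert, while firing...
    target_match:
      severity: 'warning'       # ...mutes the >60% alert
    # only when both alerts agree on these labels (same disk on the same host)
    equal: ['alertname', 'instance', 'mountpoint']
EOF
}

print_inhibit_rule
```

The fragment would replace or be merged into any existing `inhibit_rules` section, followed by another `amtool check-config` run before restarting.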
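3.2 Silences
3.2.1 Create a silence manually
During a code release or server maintenance, the affected alerts can be silenced.
By default a silence expires after 2 hours.
3.2.2 Remove a silence
Simply expire the silence.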
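Silences can also be managed from the command line with amtool instead of the web UI. The sketch below only prints the commands so they can be reviewed first; `silence_add_cmd` and `silence_expire_cmd` are hypothetical helpers, and `AM_URL` (default port 9093 on the local host) is an assumption about where Alertmanager listens.

```shell
#!/usr/bin/env bash
# Compose amtool silence commands. AM_URL is an assumption (default port 9093).
AM_URL="${AM_URL:-http://127.0.0.1:9093}"

# silence_add_cmd MATCHER DURATION COMMENT -> prints the command to run
silence_add_cmd() {
  printf 'amtool silence add %s --duration=%s --comment="%s" --alertmanager.url=%s\n' \
    "$1" "$2" "$3" "$AM_URL"
}

# silence_expire_cmd SILENCE_ID -> prints the command that ends a silence early
silence_expire_cmd() {
  printf 'amtool silence expire %s --alertmanager.url=%s\n' "$1" "$AM_URL"
}

silence_add_cmd 'instance=192.168.31.123:9100' 2h 'planned maintenance'
silence_expire_cmd '<silence-id>'
```

`amtool silence add` prints the ID of the new silence, which is the value to pass to `amtool silence expire`; `amtool silence query` lists the active ones.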
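4. Alertmanager high availability
- Load-balancer-based high availability using LVS or HAProxy
- Clustering based on the Gossip mechanism
4.1 Gossip
Alertmanager uses a Gossip mechanism to pass information between multiple Alertmanager instances. It ensures that even when several instances each receive the same alert, only one notification is sent to the receiver.
4.2 Setting up a cluster
For Alertmanager nodes to communicate with each other, the corresponding flags must be set when Alertmanager starts. The main ones are:
--cluster.listen-address string: the address this instance listens on for cluster traffic
--cluster.peer value: the cluster address of another instance to join at startup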
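To make the two flags concrete, the sketch below prints the start command for one cluster member. The node IPs and the binary path under /apps/alertmanager are assumptions for illustration, `cluster_cmd` is a hypothetical helper, and 9094 is Alertmanager's default cluster port.

```shell
#!/usr/bin/env bash
# Print the start command for one member of an Alertmanager cluster.
# The node IPs used below are hypothetical; 9094 is the default cluster port.
cluster_cmd() {
  local self="$1"; shift
  local cmd="/apps/alertmanager/alertmanager --config.file=/apps/alertmanager/alertmanager.yml"
  cmd="$cmd --cluster.listen-address=${self}:9094"
  local peer
  for peer in "$@"; do
    # --cluster.peer may be repeated, once per other member
    cmd="$cmd --cluster.peer=${peer}:9094"
  done
  echo "$cmd"
}

# First argument is this node, the rest are its peers.
cluster_cmd 192.168.31.201 192.168.31.202 192.168.31.203
```

Each node runs the same command with its own address first and the other members as peers.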
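5. Pushgateway
Clients push the metrics they collect to Pushgateway, and Prometheus then scrapes them from Pushgateway.
It is mainly used to collect data from short-lived or ad hoc jobs.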
5.1 Install Pushgateway
root@zookeeper-1:/apps# wget https://github.com/prometheus/pushgateway/releases/download/v1.4.3/pushgateway-1.4.3.linux-amd64.tar.gz
root@zookeeper-1:/apps# tar xf pushgateway-1.4.3.linux-amd64.tar.gz
root@zookeeper-1:/apps# ln -sf /apps/pushgateway-1.4.3.linux-amd64 /apps/pushgateway
5.2 Write the systemd service file
/etc/systemd/system/pushgateway.service
[Unit]
Description=Pushgateway
After=network.target
[Service]
ExecStart=/apps/pushgateway/pushgateway
[Install]
WantedBy=multi-user.target
Start the service. Once running, it listens on port 9091.
root@zookeeper-1:/apps/pushgateway# systemctl enable --now pushgateway.service
Created symlink /etc/systemd/system/multi-user.target.wants/pushgateway.service → /etc/systemd/system/pushgateway.service.
root@zookeeper-1:/apps/pushgateway# ss -ntl|grep 9091
LISTEN 0 4096 *:9091 *:*
Test access at http://192.168.31.121:9091/metrics
For now it only exposes Pushgateway's own metrics.
5.3 Scrape Pushgateway from Prometheus
Edit prometheus.yaml:
  - job_name: 'pushgateway'
    scrape_interval: 5s
    static_configs:
    - targets: ['192.168.31.121:9091']
    # keep the labels pushed by clients (such as job and instance) instead of overwriting them
    honor_labels: true
Restart Prometheus:
# systemctl restart prometheus.service
Confirm that Prometheus scrapes the pushgateway target successfully.
5.4 Test pushing data from a client
5.4.1 Push a single metric
# echo "k8s_node1 101"|curl --data-binary @- http://192.168.31.121:9091/metrics/job/k8s-node1/instance/192.168.31.101
The pushed value now shows up at the /metrics endpoint.
The sample is also visible in the Pushgateway web UI, where it can be deleted.
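The path after `/metrics` is the grouping key (here `job` and `instance`), and the same URL serves both pushing and deleting. A small sketch, with `pgw_url` as a hypothetical helper and `PGW` defaulting to the address used in this section:

```shell
#!/usr/bin/env bash
# Build the Pushgateway URL for a job/instance grouping key.
# PGW is the address used in this section; pgw_url is a hypothetical helper.
PGW="${PGW:-http://192.168.31.121:9091}"

pgw_url() {
  local job="$1" instance="$2"
  echo "${PGW}/metrics/job/${job}/instance/${instance}"
}

pgw_url k8s-node1 192.168.31.101

# Push one sample (same as the command above):
#   echo "k8s_node1 101" | curl --data-binary @- "$(pgw_url k8s-node1 192.168.31.101)"
# Delete every metric in that group from the command line:
#   curl -X DELETE "$(pgw_url k8s-node1 192.168.31.101)"
```

Because `curl --data-binary` sends a POST, only metrics with the same name in that group are replaced; a PUT to the same URL would replace the whole group.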
5.4.2 Push multiple metrics
# cat << EOF | curl --data-binary @- http://192.168.31.121:9091/metrics/job/k8s-node2/instance/192.168.31.102
# TYPE node_memory_usage gauge
node_memory_usage 102400000
# TYPE cpu_total gauge
cpu_total 24
EOF
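Since Pushgateway targets short-lived jobs, a typical use is a cron job that reports how long it ran and when it last succeeded. The sketch below formats such a payload in the text exposition format; the metric names, the `job_payload` helper and the `run_backup` command in the usage comment are all hypothetical.

```shell
#!/usr/bin/env bash
# Format metrics for a finished batch job in the text exposition format.
# The metric names and the job_payload helper are hypothetical examples.
job_payload() {
  local duration="$1" finished_at="$2"
  cat <<EOF
# TYPE backup_duration_seconds gauge
backup_duration_seconds ${duration}
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds ${finished_at}
EOF
}

job_payload 42 1662529512

# In a cron job one might time the work and push the result
# (run_backup is a placeholder for the real job):
#   start=$(date +%s); run_backup; end=$(date +%s)
#   job_payload "$((end - start))" "$end" |
#     curl --data-binary @- http://192.168.31.121:9091/metrics/job/backup/instance/192.168.31.102
```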