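How it works

This article focuses on how metric alerts are delivered to a DingTalk group through a webhook. For SMS, phone, and email alerts, refer to the sns_config and email_config sections of the Alertmanager configuration.
Environment
Prerequisite: Prometheus is already installed.
About Alertmanager
Alertmanager handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, then routes them to the matching receiver integration, for example email, webhook, SMS, or phone. It can also silence and inhibit alerts.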
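To make deduplication and grouping concrete, here is a minimal Python sketch. It is not Alertmanager's implementation: the function names fingerprint and dedupe_and_group are made up for illustration, and an alert is assumed to be identified by its full label set.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identify an alert by its full label set, independent of key order."""
    return tuple(sorted(alert["labels"].items()))


def dedupe_and_group(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fp = fingerprint(alert)
        if fp in seen:
            continue  # an alert with the same labels was already received
        seen.add(fp)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "ns1_pod_cpu_use", "namespace": "ns1", "pod_name": "a"}},
    {"labels": {"alertname": "ns1_pod_cpu_use", "namespace": "ns1", "pod_name": "a"}},  # duplicate
    {"labels": {"alertname": "ns1_pod_cpu_use", "namespace": "ns1", "pod_name": "b"}},
    {"labels": {"alertname": "ns2_pod_cpu_use", "namespace": "ns2", "pod_name": "c"}},
]

groups = dedupe_and_group(alerts, ["namespace"])
for key, members in groups.items():
    print(key, len(members))
```

Grouping by namespace leaves two alerts in the ns1 group and one in the ns2 group, so each namespace would produce a single notification rather than one per pod.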
Installation:
- Download the binary for your platform from https://prometheus.io/download/
- Start the service, pointing it at the configuration file alertmanager.yml:
./alertmanager --config.file=alertmanager.yml
prometheus-webhook-dingtalk
Purpose: receives alerts from Alertmanager and forwards them to DingTalk
Download: https://github.com/timonwong/prometheus-webhook-dingtalk/tags
./prometheus-webhook-dingtalk --config.file=config.yml --web.enable-ui
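The following Python sketch illustrates the kind of translation this bridge performs: it takes an Alertmanager webhook payload and builds a DingTalk robot message of type markdown. It is a simplified illustration, not the tool's actual code or template. The function name to_dingtalk_markdown and the message layout are assumptions.

```python
import json


def to_dingtalk_markdown(payload):
    """Build a DingTalk robot 'markdown' message from an Alertmanager webhook payload."""
    alerts = payload.get("alerts", [])
    status = payload.get("status", "firing").upper()
    title = f"[{status}:{len(alerts)}] alerts"
    lines = [f"### {title}"]
    for alert in alerts:
        name = alert["labels"].get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"- **{name}**: {summary}")
    return {"msgtype": "markdown", "markdown": {"title": title, "text": "\n".join(lines)}}


# A trimmed-down example of what Alertmanager posts to a webhook receiver.
sample = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ns1_pod_cpu_use", "namespace": "ns1"},
            "annotations": {"summary": "Pod demo CPU usage is too high"},
        }
    ],
}

message = to_dingtalk_markdown(sample)
print(json.dumps(message, ensure_ascii=False, indent=2))
```

In the real tool the message body comes from the configured template file rather than from hard-coded formatting.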
prometheus-webhook-dingtalk configuration
The config.yml file looks like this:
## Request timeout
# timeout: 5s
## Customizable templates path
templates:
  - contrib/templates/1.tmpl
## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
# default_message:
#   title: '{{ template "legacy.title" . }}'
#   text: '{{ template "legacy.content" . }}'
## Targets, previously was known as "profiles"
targets:
  app1_webhook:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx1
  app2_webhook:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2
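contrib/templates/1.tmpl: holds the alert message template
We assume two alert channels, app1_webhook and app2_webhook. These target names must match the names used in the Alertmanager webhook URLs exactly.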
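Because each target is addressed by name in a URL of the form /dingtalk/<target>/send, a name that does not match a configured target will not resolve. The sketch below is one way to check a list of webhook URLs against the target names. The helper names target_url and undefined_targets are hypothetical, and the base address http://localhost:8060 is taken from the Alertmanager configuration in this article.

```python
import re

# Keys under `targets:` in config.yml.
TARGETS = {"app1_webhook", "app2_webhook"}


def target_url(name, base="http://localhost:8060"):
    """Build the URL Alertmanager should call for a given target name."""
    return f"{base}/dingtalk/{name}/send"


def undefined_targets(urls, targets):
    """Return the URLs whose /dingtalk/<name>/send segment is not a configured target."""
    pattern = re.compile(r"/dingtalk/([^/]+)/send$")
    bad = []
    for url in urls:
        match = pattern.search(url)
        if match is None or match.group(1) not in targets:
            bad.append(url)
    return bad


urls = [
    target_url("app1_webhook"),
    target_url("app2_webhook"),
    "http://localhost:8060/dingtalk/app1-webhook/send",  # hyphen instead of underscore
]

for url in undefined_targets(urls, TARGETS):
    print("no such target:", url)
```

Only the third URL is reported, since app1-webhook is not a key under targets.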
Alertmanager alerting configuration
The alertmanager.yml file looks like this:
global:
  resolve_timeout: 5m
route:
  receiver: ops_default
  group_wait: 3s
  group_interval: 5s
  repeat_interval: 5m
  group_by: ['namespace']
  routes:
    - match_re:
        namespace: ns1|ns3
      receiver: app1_webhook
      group_wait: 10s
    - match_re:
        namespace: "ns2"
      group_wait: 10s
      receiver: app2_webhook
receivers:
  - name: ops_default
    webhook_configs:
      # The default receiver reuses the app1_webhook target
      - url: http://localhost:8060/dingtalk/app1_webhook/send
        send_resolved: false
  - name: app1_webhook
    webhook_configs:
      - url: http://localhost:8060/dingtalk/app1_webhook/send
        send_resolved: false
  - name: app2_webhook
    webhook_configs:
      - url: http://localhost:8060/dingtalk/app2_webhook/send
        send_resolved: false
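This configuration separates alerts by namespace: alerts from ns1 and ns3 go to app1_webhook, and alerts from ns2 go to app2_webhook. Alerts that match neither route fall through to the default receiver, ops_default.
send_resolved: whether to send a notification when an alert is resolved. The default for webhook receivers is true; it is set to false here.
group_wait: how long to wait before sending the first notification for a new group, so that more alerts belonging to the same group can be batched into it (default 30s)
group_by: the labels used to group alerts
group_interval: once a notification has been sent for a group, how long to wait before sending a notification about new alerts added to that group (default 5m)
repeat_interval: once a notification has been sent, how long to wait before sending it again while the alerts are still firing (default 4h)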
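The routing behaviour described above can be approximated in a few lines of Python. This is a sketch under simplifying assumptions, not Alertmanager's routing code: it handles only one level of child routes, assumes the first matching route wins (no continue setting), and uses re.fullmatch because match_re patterns are anchored. The name pick_receiver is made up for this example.

```python
import re

# Child routes from alertmanager.yml, in order.
ROUTES = [
    {"match_re": {"namespace": "ns1|ns3"}, "receiver": "app1_webhook"},
    {"match_re": {"namespace": "ns2"}, "receiver": "app2_webhook"},
]
DEFAULT_RECEIVER = "ops_default"


def pick_receiver(labels):
    """Return the receiver of the first child route whose regexes all match, else the default."""
    for route in ROUTES:
        matchers = route["match_re"].items()
        if all(re.fullmatch(rx, labels.get(name, "")) for name, rx in matchers):
            return route["receiver"]
    return DEFAULT_RECEIVER


for ns in ["ns1", "ns2", "ns3", "ns4", "ns10"]:
    print(ns, "->", pick_receiver({"namespace": ns}))
```

Note that ns10 ends up at ops_default: because the pattern is anchored, ns1|ns3 does not match a longer namespace name that merely starts with ns1.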
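Prometheus alerting rules
Assume Kubernetes metrics are already being scraped and we want to alert when a pod's CPU usage exceeds 80%. Two files need to be changed:
rules/*.yml and prometheus.yml
The prometheus.yml configuration file
Change the following settings: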
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/usr/local/prometheus/rules/*.yml"
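localhost:9093: the address of the Alertmanager service
/usr/local/prometheus/rules/*.yml: the files that hold the alerting rules
scrape_interval: how often metrics are scraped
evaluation_interval: how often alerting rules are evaluated (default 1m)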
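To check the path from Alertmanager to DingTalk without waiting for a rule to fire, one option is to post a hand-made alert to the Alertmanager address configured above. The sketch below uses Alertmanager's v2 API endpoint /api/v2/alerts. The alert name manual_test and the helper names are invented for this example, and the actual POST is left commented out so the script can run without a live server.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Address from the `alertmanagers` section of prometheus.yml.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_test_alert(namespace):
    """Build a one-element alert list in the shape the v2 alerts endpoint accepts."""
    return [
        {
            "labels": {
                "alertname": "manual_test",  # hypothetical name for this test
                "namespace": namespace,
                "severity": "warning",
            },
            "annotations": {"summary": "Manual test alert"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


def post_alerts(alerts, url=ALERTMANAGER_URL):
    """POST the alerts as JSON and return the HTTP status code."""
    body = json.dumps(alerts).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


alerts = build_test_alert("ns1")
print(json.dumps(alerts, indent=2))
# post_alerts(alerts)  # uncomment once Alertmanager is listening on localhost:9093
```

With the routing shown earlier, an alert labelled namespace ns1 should be delivered through the app1_webhook receiver.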
The rules.yml configuration file
Add the following:
groups:
  - name: POD
    rules:
      - alert: ns1_pod_cpu_use
        expr: sum by (pod_name) (rate(container_cpu_usage_seconds_total{namespace="ns1"}[1m])) / sum by (pod_name)(container_spec_cpu_quota{namespace="ns1"} / 100000) * 100 > 80
        for: 5m
        labels:
          namespace: ns1
          severity: "warning"
        annotations:
          summary: "Pod {{ $labels.pod_name }} CPU usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.pod_name }} CPU usage is above 80% (current value: {{ $value }})"
      - alert: ns2_pod_cpu_use
        expr: sum by (pod_name) (rate(container_cpu_usage_seconds_total{namespace="ns2"}[1m])) / sum by (pod_name)(container_spec_cpu_quota{namespace="ns2"} / 100000) * 100 > 80
        for: 5m
        labels:
          namespace: ns2
          severity: "warning"
        annotations:
          summary: "Pod {{ $labels.pod_name }} CPU usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.pod_name }} CPU usage is above 80% (current value: {{ $value }})"
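The arithmetic inside expr can be worked through by hand. The rate of container_cpu_usage_seconds_total is the number of CPU cores in use, and container_spec_cpu_quota divided by 100000 is the number of cores the pod is allowed, assuming the CFS period is 100000 microseconds, which is the value hard-coded in the rule. The Python sketch below reproduces that calculation for a single pod; the function name and sample numbers are illustrative only.

```python
def cpu_usage_percent(cpu_seconds_start, cpu_seconds_end, window_seconds,
                      quota_us, period_us=100_000):
    """Approximate the rule's expression for one pod over one window."""
    # rate(container_cpu_usage_seconds_total[1m]): CPU seconds consumed per second
    cores_used = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    # container_spec_cpu_quota / 100000: cores allowed by the CPU limit
    cores_limit = quota_us / period_us
    return cores_used / cores_limit * 100


# 54 CPU seconds consumed in a 60 s window, with a limit of one core.
pct = cpu_usage_percent(1000.0, 1054.0, 60, 100_000)
print(f"{pct:.1f}% used, alert condition met: {pct > 80}")
```

In this example the pod uses roughly 0.9 of its one allowed core, so the condition in the rule holds.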
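labels: the labels clause specifies a set of additional labels to attach to the alert. Any existing labels that conflict are overwritten. Label values can be templated.
for: at each evaluation Prometheus checks whether the alert condition is still true; the alert fires only after the condition has held continuously for 5 minutes.
annotations: the annotations clause specifies a set of informational labels that can hold longer additional information, such as an alert description or a runbook link. Annotation values can be templated.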
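The effect of the for clause can be pictured as a small state machine: while the condition is true but has not yet held for the full duration, the alert is pending; once it has, the alert is firing; and any evaluation where the condition is false resets it. The following Python sketch models that behaviour under simplified assumptions (evenly spaced evaluations, one alert). The function name alert_states is invented for this example.

```python
def alert_states(condition_samples, eval_interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each evaluation in order."""
    states = []
    active_since = None
    for index, is_true in enumerate(condition_samples):
        now = index * eval_interval_s
        if not is_true:
            active_since = None  # the condition cleared, so the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluate every 60 s with for: 5m; the condition dips once at the third evaluation.
samples = [True, True, False, True, True, True, True, True, True]
states = alert_states(samples, eval_interval_s=60, for_s=300)
for index, state in enumerate(states):
    print(f"t={index * 60:>3}s {state}")
```

Because of the dip at the third evaluation, the timer restarts at t=180s and the alert only reaches firing at t=480s.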