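This article covers the alerting and notification rules of Prometheus and Alertmanager. The Prometheus configuration file is named prometheus.yml, and the Alertmanager configuration file is named alertmanager.yml.
Alerting: Prometheus sends the abnormal events it detects to Alertmanager. It does not mean sending an email notification.
Notification: Alertmanager sends out notifications about those events (email, webhook, and so on).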
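To make the notification side concrete, here is a minimal sketch of an alertmanager.yml with one email receiver and one webhook receiver. The SMTP host, email addresses, receiver names, and webhook URL are placeholder assumptions, not values from this setup.

```yaml
# alertmanager.yml (sketch; all hosts, addresses, and names are placeholders)
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: ops-email            # default receiver for every alert
  group_by: ['alertname']        # batch alerts that share a name
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - match:
      severity: critical         # critical alerts also go to the webhook receiver
    receiver: ops-webhook

receivers:
- name: ops-email
  email_configs:
  - to: 'ops@example.com'
- name: ops-webhook
  webhook_configs:
  - url: 'http://example.com/alert-hook'
```

Most real SMTP servers also need authentication settings, which are left out here to keep the sketch short.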
Alerting rules
Set the interval at which alerting rules are evaluated in prometheus.yml:
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
Specify the rule files in prometheus.yml (wildcards are allowed, for example rules/*.rules):
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "/etc/prometheus/alert.rules"
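Put together, the two settings above might sit in prometheus.yml roughly as follows. This is a sketch: the scrape interval, the wildcard directory, and the Alertmanager address are assumptions added for illustration.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s               # assumed value
  evaluation_interval: 1m            # how often the rules are evaluated

rule_files:
- "/etc/prometheus/alert.rules"
- "/etc/prometheus/rules/*.rules"    # hypothetical wildcard directory

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093'] # assumed Alertmanager host and port
```

The alerting section is what tells Prometheus where to send firing alerts; without it the rules are still evaluated, but nothing reaches Alertmanager.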
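Rules are written from the following template (this is the Prometheus 1.x syntax; the ConfigMap further down uses the YAML rule format of Prometheus 2.0 and later):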
ALERT <alert name>
IF <expression>
[ FOR <duration> ]
[ LABELS <label set> ]
[ ANNOTATIONS <label set> ]
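Where:
Alert name is the identifier of the alert. It does not need to be unique.
Expression is the condition that is evaluated to decide whether the alert fires. It usually uses existing metrics, such as those returned by the /metrics endpoint.
Duration is how long the condition must hold before the alert fires. For example, 5s means 5 seconds.
Label set is a set of labels that will be used in the message templates.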
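Prometheus 2.0 and later express the same fields in YAML, grouped under named rule groups. As a rough illustration, the sketch below maps each part of the template to its YAML key. The group name, alert name, and wording of the annotations are made up for the example; up is the standard metric Prometheus records for every scrape target.

```yaml
groups:
- name: example.rules              # hypothetical group name
  rules:
  - alert: InstanceDown            # <alert name>
    expr: up == 0                  # <expression>
    for: 5m                        # <duration>
    labels:                        # <label set> attached to the alert
      severity: critical
    annotations:                   # <label set> used in message templates
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes"
```

A file in this format can be validated before loading with promtool check rules followed by the file path.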
Create a ruleSelector in prometheus-k8s-statefulset.yaml to mark the role of the alerting rules. The alerting rule file prometheus-k8s-rules.yaml then references it through matching labels:
ruleSelector:
  matchLabels:
    role: prometheus-rulefiles
    prometheus: k8s
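For context, ruleSelector is a field of the Prometheus custom resource managed by Prometheus Operator. The sketch below assumes that setup and shows where the selector sits; the resource name k8s is a guess based on the prometheus: k8s label, and the namespace is taken from the rules ConfigMap.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                  # assumed name
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:             # an object must carry both labels to be selected
      role: prometheus-rulefiles
      prometheus: k8s
```

Older operator releases use this selector to pick up ConfigMaps that contain rule files, which is the approach this article follows.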
In prometheus-k8s-rules.yaml, a ConfigMap carries the prometheus-rulefiles label so that the selector picks it up:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: k8s
data:
  pod.rules.yaml: |+
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 10% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        # container_memory_usage_bytes is a gauge, so it is compared directly; 2*1024^3 bytes = 2 GiB
        expr: avg by(name)(container_memory_usage_bytes{name!=""}) > 2*1024^3
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev Memory load alert
      - alert: Pod_all_network_receive_usage
        # 1024*1024*50 bytes per second = 50 MiB/s
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network receive rate is above 50M per second (current value is {{ $value }})
          summary: network_receive load alert
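Newer Prometheus Operator releases select PrometheusRule resources rather than ConfigMaps. If that applies to your cluster, the CPU alert could be declared roughly as follows. This is a sketch under that assumption, reusing the names and labels from the ConfigMap, not a tested manifest.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:                      # must match the ruleSelector labels
    role: prometheus-rulefiles
    prometheus: k8s
spec:
  groups:
  - name: noah_pod.rules
    rules:
    - alert: Pod_all_cpu_usage
      expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
      for: 5m
      labels:
        severity: critical
        service: pods
      annotations:
        summary: Dev CPU load alert
        description: "Container {{ $labels.name }} CPU usage is above 10% (current value is {{ $value }})"
```

The rule body is the same as in the ConfigMap; only the wrapper changes, with the groups moving under spec instead of living inside a data string.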
Once the configuration files are set up