The alerting rules cover the following aspects:
- Slow log entries
- Memory usage
- Memory fragmentation ratio
- Persistence (RDB) duration
- Abnormal cluster state
- Abnormal cluster slot assignment
- Replica (slave) synchronization lag
- Pod topology violating the anti-affinity constraint
Prometheus rules
groups:
  - name: slowlog
    rules:
      - alert: slowlog
        annotations:
          description: Slow queries detected in the cluster
          summary: >-
            Cluster {{$labels.redis_kun_name}} logged {{ printf "%.2f" $value }}
            slow queries in the last five minutes
        expr: 'delta(redis_slowlog_last_id[5m]) >= 5'
        labels:
          group: xadd-redis
          severity: critical
      - alert: used_memory_increase_too_fast
        annotations:
          description: Based on the growth rate over the last 10 minutes, memory usage is predicted to reach the configured limit within 20 minutes
          summary: 'Memory usage of cluster {{$labels.redis_kun_name}} is growing too fast'
        expr: >-
          predict_linear(redis_memory_used_bytes[10m], 60 * 20) >
          redis_memory_max_bytes
        labels:
          group: xadd-redis
          severity: critical
      - alert: used_memory_exceeds_history_peak
        annotations:
          description: Memory usage exceeds the historical peak
          summary: >-
            Memory usage of cluster {{$labels.redis_kun_name}} exceeds its
            historical peak; current usage is {{ printf "%.2f" $value }} GB
        expr: >-
          (redis_memory_used_bytes / (1024 * 1024 * 1024)) >=
          (redis_memory_used_peak_bytes / (1024 * 1024 * 1024)) and
          (redis_memory_used_bytes / redis_memory_max_bytes) > 0.8
        labels:
          group: xadd-redis
          severity: critical
      - alert: memory_usage_in_high_ratio
        annotations:
          description: Memory usage exceeds 90%; automatic scale-up will be triggered
          summary: 'Memory usage of cluster {{$labels.redis_kun_name}} exceeds 90% of the configured Redis maxmemory'
        expr: >-
          (redis_memory_used_bytes / redis_memory_max_bytes) * 100 > 90 and
          redis_memory_max_bytes >= (1024 * 1024 * 1024)
        labels:
          group: xadd-redis
          severity: critical
      - alert: high_memory_fragmentation_ratio
        annotations:
          description: 'Current memory fragmentation ratio is {{ printf "%.2f" $value }}; the ideal range is 1 to 1.4'
          summary: >-
            Cluster {{$labels.redis_kun_name}} has a high memory fragmentation
            ratio of {{ printf "%.2f" $value }}; the ideal range is 1 to 1.4
        expr: >-
          redis_memory_used_rss_bytes / redis_memory_used_bytes > 1.4 and
          redis_memory_used_bytes >= 1024 * 1024 * 1024
        for: 1w
        labels:
          group: xadd-redis
          severity: warning
      - alert: low_memory_fragmentation_ratio
        annotations:
          description: 'Current memory fragmentation ratio is {{ printf "%.2f" $value }}; the ideal range is 1 to 1.4'
          summary: >-
            The memory fragmentation ratio of cluster {{$labels.redis_kun_name}}
            is below 1; the redis-server process may be using swap, which slows down responses
        expr: >-
          redis_memory_used_rss_bytes / redis_memory_used_bytes < 1 and
          redis_memory_used_bytes >= 512 * 1024 * 1024
        for: 1w
        labels:
          group: xadd-redis
          severity: warning
      - alert: rdb_bgsave_too_slow
        annotations:
          description: The last persistence took {{ $value }} seconds; consider reducing this node's memory usage to below 4GB
          summary: 'RDB persistence of cluster {{$labels.redis_kun_name}} is too slow; the last bgsave took {{ $value }} seconds'
        expr: >-
          redis_rdb_last_bgsave_duration_sec > 120 and time() -
          redis_rdb_last_save_timestamp_seconds < 3600
        labels:
          group: xadd-redis
          severity: critical
      - alert: redis_exporter_maybe_down
        annotations:
          description: Contact the Redis platform operations team to check the service or host status
          summary: 'The exporter of cluster {{$labels.redis_kun_name}} is unavailable'
        expr: up == 0
        for: 5m
        labels:
          addr: '{{ $labels.instance }}'
          alias: ''
          group: xadd-redis
          severity: critical
      - alert: redis_instance_maybe_down
        annotations:
          description: A Redis instance may have failed
          summary: >-
            Instance {{$labels.kubernetes_pod_name}} of cluster
            {{$labels.redis_kun_name}} is down
        expr: redis_up == 0
        for: 5s
        labels:
          group: xadd-redis
          severity: critical
      - alert: redis_master_slave_failover
        annotations:
          description: A master-slave failover occurred in the cluster
          summary: 'A master-slave failover occurred in cluster {{$labels.redis_kun_name}}'
        expr: 'changes(redis_instance_info[5m]) > 1'
        labels:
          group: xadd-redis
          severity: critical
      - alert: abnormal_cluster_state
        annotations:
          description: A node in the cluster has been in an abnormal state for 2 minutes, triggering this alert.
          summary: 'Cluster {{$labels.redis_kun_name}} has been unavailable for 2 minutes.'
        expr: redis_cluster_state != 1
        for: 2m
        labels:
          group: xadd-redis
          severity: critical
      - alert: cluster_slots_not_fully_assigned
        annotations:
          description: Fewer than 16384 slots are assigned in the cluster, triggering this alert.
          summary: 'The slots of cluster {{$labels.redis_kun_name}} are not fully assigned; {{$value}} slots are currently assigned.'
        expr: redis_cluster_slots_assigned < 16384
        for: 2m
        labels:
          group: xadd-redis
          severity: critical
      - alert: slave_lag_too_long
        annotations:
          description: A slave lags its master by more than 10 seconds, triggering this alert; the cluster's cluster-node-timeout is configured to 15 seconds.
          summary: >-
            In cluster {{$labels.redis_kun_name}}, the slave of master
            {{$labels.kubernetes_pod_name}} lags by more than 10 seconds; the
            slave IP is {{$labels.slave_ip}}.
        expr: redis_connected_slave_lag_seconds > 10
        labels:
          group: xadd-redis
          severity: warning
      - alert: cluster_known_nodes_decreased
        annotations:
          description: The number of known nodes in the cluster decreased within two minutes, triggering this alert.
          summary: 'Cluster {{$labels.redis_kun_name}} lost {{$value}} known node(s) within two minutes; please check whether the cluster state is normal.'
        expr: redis_cluster_known_nodes offset 2m - redis_cluster_known_nodes > 0
        labels:
          group: xadd-redis
          severity: critical
      - alert: redis_exporter_disappeared
        annotations:
          description: An exporter that was reporting the up metric earlier has stopped reporting, so it may be down, triggering this alert.
          summary: >-
            An exporter of cluster {{$labels.redis_kun_name}} may be down; the
            exporter instance address is {{$labels.instance}}.
        expr: 'up{namespace="redis"} offset 6m unless up{namespace="redis"}'
        for: 5m
        labels:
          group: xadd-redis
          severity: warning
      - alert: same_master_slave_group_on_same_node
        annotations:
          description: The master and slave of the same group are scheduled on the same node, triggering this alert
          summary: 'The master and slave pods of {{$labels.created_by_name}} are deployed on the same host; the host IP is {{$labels.node}}'
        expr: >-
          (count by(created_by_name, node)
          (kube_pod_info{namespace="redis"})) > 1
        labels:
          group: xadd-redis
          severity: critical
      - alert: multiple_masters_on_same_node
        annotations:
          description: Multiple masters of the same cluster are scheduled on the same node, triggering this alert
          summary: >-
            Cluster {{$labels.redis_kun_name}} has multiple masters deployed on
            the same host; the host IP is {{$labels.node}}
        expr: >-
          count by (redis_kun_name, node) (kube_pod_info{namespace="redis"} *
          on (pod) group_left(redis_kun_name)
          label_replace(redis_instance_info{role="master"}, "pod", "$1",
          "kubernetes_pod_name", "(.*)")) > 1
        labels:
          group: xadd-redis
          severity: critical
Alertmanager config
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'alias']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'redis-alert'
  routes:
    - receiver: 'redis-alert'
      group_wait: 30s
      group_interval: 30s
      repeat_interval: 30m
      match:
        group: xadd-redis
receivers:
  - name: redis-alert
    webhook_configs:
      - url: 'http://172.22.43.230/api/web/alert/webhook'
        send_resolved: false
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
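With send_resolved: false, the webhook receiver only sees firing notifications. The POST body follows Alertmanager's webhook JSON format (version "4"). As a sketch, the endpoint could bind it to plain DTOs like the following; the field names follow Alertmanager's webhook payload, while the class names themselves are hypothetical:

```java
import java.util.List;
import java.util.Map;

// Plain DTOs mirroring the Alertmanager webhook payload (version "4").
// Any JSON library (fastjson, Jackson, Gson) can bind the request body
// to WebhookMessage instead of the raw Object used in the handler below.
class WebhookMessage {
    public String version;                      // "4"
    public String groupKey;                     // key identifying the alert group
    public String status;                       // "firing" or "resolved"
    public String receiver;                     // e.g. "redis-alert"
    public Map<String, String> groupLabels;
    public Map<String, String> commonLabels;
    public Map<String, String> commonAnnotations;
    public String externalURL;
    public List<Alert> alerts;

    static class Alert {
        public String status;                   // per-alert "firing" / "resolved"
        public Map<String, String> labels;      // alertname, severity, redis_kun_name, ...
        public Map<String, String> annotations; // summary, description
        public String startsAt;
        public String endsAt;
    }
}
```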
Webhook
@ResponseBody
@RequestMapping(value = "/webhook", method = RequestMethod.POST)
public void webhook(@RequestBody Object alertInfo) {
    log.info("[alert info]: {}.", JSONObject.toJSON(alertInfo));
}
Once the payload reaches the endpoint, it can be parsed directly as JSON or as a Map; the relevant labels can then be extracted to build the alert message, and different notification channels can be chosen according to the alert severity.
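The dispatch-by-severity step can be sketched as follows; the channel names here are hypothetical placeholders, not part of the platform:

```java
import java.util.Map;

// Minimal sketch: choose a notification channel from the alert's
// severity label, which every rule above sets to critical or warning.
class AlertRouter {
    static String channelFor(Map<String, String> labels) {
        String severity = labels.getOrDefault("severity", "warning");
        switch (severity) {
            case "critical":
                return "phone-call";  // hypothetical: page the on-call engineer
            case "warning":
                return "im-message";  // hypothetical: send an IM notification
            default:
                return "email";       // hypothetical: low-priority fallback
        }
    }
}
```

A caller would invoke AlertRouter.channelFor(alert.labels) for each alert in the webhook payload and hand the message to the matching sender.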