Notes on Prometheus alerting rules
Host and hardware monitoring
Available memory
Less than 10% of the host's memory is available.
- alert: HostOutOfMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host out of memory (instance {{ $labels.instance }})
    description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
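The rules in these notes are bare list items. Prometheus expects them nested under a group inside a rule file. A minimal sketch of that wrapper, assuming a hypothetical file name `host_rules.yml` and group name `host` (neither comes from the notes):

```yaml
# host_rules.yml (hypothetical file name)
groups:
  - name: host            # group name is an assumption
    rules:
      # each "- alert:" entry from these notes goes here, indented under "rules:"
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

A file like this would be listed under `rule_files` in `prometheus.yml`, and its syntax can be checked with `promtool check rules host_rules.yml` before reloading Prometheus.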
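Memory pressure
The node is under heavy memory pressure: the rate of major page faults is high (more than 1000 per second).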
- alert: HostMemoryUnderMemoryPressure
  expr: rate(node_vmstat_pgmajfault[1m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host memory under memory pressure (instance {{ $labels.instance }})
    description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
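Unusual inbound network throughput
The host's network interfaces are probably receiving too much data (> 100 MB/s). Set the threshold according to the capacity of your own machine's network cards.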
- alert: HostUnusualNetworkThroughputIn
  expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host unusual network throughput in (instance {{ $labels.instance }})
    description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
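Because the right threshold depends on the network card, one option is to compare the receive rate with the link speed that node_exporter reports instead of a fixed 100 MB/s. A sketch, assuming `node_network_speed_bytes` is exported for the interface (virtual interfaces often lack it or report a non-positive value). The alert name and the 80% threshold are illustrative, not from the notes:

```yaml
- alert: HostNetworkInterfaceSaturatedIn   # hypothetical alert name
  # Per-device receive rate divided by link speed.
  # "(... > 0)" drops interfaces whose speed is unknown or non-positive.
  expr: rate(node_network_receive_bytes_total[2m]) / (node_network_speed_bytes > 0) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host network interface saturated (instance {{ $labels.instance }})
    description: "Interface {{ $labels.device }} is receiving at more than 80% of its link speed\n VALUE = {{ $value }}"
```

Unlike the rule above, this variant evaluates each interface separately rather than summing all interfaces per instance, so a single saturated link is not hidden by idle ones.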
Unusual outbound network throughput
The host's network interfaces are probably sending too much data (> 100 MB/s).
- alert: HostUnusualNetworkThroughputOut
  expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host unusual network throughput out (instance {{ $labels.instance }})
    description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Host network receive errors
Interface {{ $labels.device }} on {{ $labels.instance }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.
- alert: HostNetworkReceiveErrors
  expr: increase(node_network_receive_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host Network Receive Errors (instance {{ $labels.instance }})
    description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
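The rule above fires on any receive error at all, which can be noisy on busy links. A possible variant alerts on the share of errored packets instead. This is a sketch: the alert name and the 1% threshold are assumptions, and it relies on node_exporter also exposing `node_network_receive_packets_total`:

```yaml
- alert: HostNetworkReceiveErrorRatio   # hypothetical alert name
  # Errors per second divided by packets per second, per interface.
  # Idle interfaces yield NaN (0/0), which never satisfies "> 0.01".
  expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host network receive error ratio (instance {{ $labels.instance }})
    description: "More than 1% of packets received on {{ $labels.device }} are errors\n VALUE = {{ $value }}"
```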
Host network transmit errors
Interface {{ $labels.device }} on {{ $labels.instance }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.
- alert: HostNetworkTransmitErrors
  expr: increase(node_network_transmit_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host Network Transmit Errors (instance {{ $labels.instance }})
    description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Host disk read rate
The disk is probably reading too much data (> 50 MB/s).
- alert: HostUnusualDiskReadRate
  expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk read rate (instance {{ $labels.instance }})
    description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Host disk write rate
The disk is probably writing too much data (> 50 MB/s).
- alert: HostUnusualDiskWriteRate
  expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk write rate (instance {{ $labels.instance }})
    description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Host free disk space
Less than 10% of the disk space is available.
# please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)"
- alert: HostOutOfDiskSpace
  expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host out of disk space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
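If changing the node_exporter flags is not convenient, pseudo filesystems can also be excluded in the query itself. A sketch, assuming `tmpfs` and `overlay` are the filesystem types to skip (adjust the list to the hosts in question):

```yaml
- alert: HostOutOfDiskSpace
  # Same check as above, but filtering by the fstype label on both sides
  # so that tmpfs and overlay mounts never trigger the alert.
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host out of disk space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

As far as I know, newer node_exporter releases rename the exclusion flag to `--collector.filesystem.mount-points-exclude`, so it is worth checking `node_exporter --help` for the version in use.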
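Whether the disk will fill up within a few hours at its current growth rate
Based on the disk's growth over the last hour, predict whether it will be full within 4 hours.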
- alert: HostDiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
  for: 5m
  labels:
    severity: warni