Common Alerting Rules

node_exporter Alerting Rules
NodeCPUUsageHigh:

  - alert: NodeCPUUsageHigh
    expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage detected on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's CPU usage stays above 80% for 5 minutes.
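
Prometheus loads alerting rules from a rule file in which entries like the one above sit inside a `groups` list. A minimal sketch of that wrapper follows; the file name `node_alerts.yml` and the group name `node_exporter_alerts` are assumptions chosen for illustration.

```yaml
# node_alerts.yml (hypothetical file name)
groups:
  - name: node_exporter_alerts   # hypothetical group name
    rules:
      - alert: NodeCPUUsageHigh
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
```

The file is then listed under `rule_files` in prometheus.yml, and its syntax can be checked with `promtool check rules node_alerts.yml`.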

NodeMemoryUsageHigh:

  - alert: NodeMemoryUsageHigh
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Memory usage detected on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's memory usage stays above 90% for 5 minutes.

NodeDiskUsageHigh:

  - alert: NodeDiskUsageHigh
    expr: (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk usage detected on {{ $labels.instance }}"
      description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a filesystem on a node stays more than 85% full for 5 minutes.
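
A fixed threshold reacts only once the disk is already nearly full. As a complement, a trend-based rule can be sketched with PromQL's `predict_linear` function; the rule name, the 6-hour lookback, and the 24-hour horizon below are assumptions for illustration rather than part of the original rule set.

```yaml
- alert: NodeDiskWillFillIn24Hours   # hypothetical rule name
  # Extrapolate the last 6h of available space 24h (86400 s) into the future.
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk predicted to fill on {{ $labels.instance }}"
    description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to run out of space within 24 hours."
```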

NodeFilesystemReadOnly:

  - alert: NodeFilesystemReadOnly
    expr: node_filesystem_readonly{fstype!~"tmpfs|fuse.lxcfs"} == 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Filesystem is read-only on {{ $labels.instance }}"
      description: "Filesystem has been read-only for more than 10 minutes on {{ $labels.instance }}."

Explanation: Fires when a filesystem on a node has been read-only for 10 minutes.

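NodeLoadAverageHigh:

  - alert: NodeLoadAverageHigh
    expr: node_load1 > on (instance) 2 * count by (instance) (node_cpu_seconds_total{mode="system"})
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High load average on {{ $labels.instance }}"
      description: "1-minute load average is more than twice the number of CPUs for over 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's 1-minute load average stays above twice its CPU count for 5 minutes. The CPU count is taken per instance (count by (instance)); without that grouping, the count would cover the CPUs of all nodes together and would not match the per-node load series.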

NodeNetworkDown:

  - alert: NodeNetworkDown
    expr: up{job="node_exporter"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node down: {{ $labels.instance }}"
      description: "Node has been down for more than 10 minutes."

Explanation: Fires when Prometheus has been unable to scrape a node's node_exporter for 10 minutes.
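
The `job="node_exporter"` matcher selects series only if the scrape job actually carries that name. A minimal prometheus.yml sketch is shown below; the target host is a placeholder assumption, and 9100 is node_exporter's default port.

```yaml
scrape_configs:
  - job_name: node_exporter        # must match the job label used in the alert expression
    static_configs:
      - targets:
          - "node1.example.internal:9100"   # hypothetical host
```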

NodeSwapUsageHigh:

  - alert: NodeSwapUsageHigh
    expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High swap usage on {{ $labels.instance }}"
      description: "Swap usage is above 50% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's swap usage stays above 50% for 5 minutes.

NodeFileSystemInodesUsageHigh:

  - alert: NodeFileSystemInodesUsageHigh
    expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High filesystem inodes usage on {{ $labels.instance }}"
      description: "Filesystem inodes usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a filesystem's inode usage on a node stays above 80% for 5 minutes.

NodeTemperatureHigh:

  - alert: NodeTemperatureHigh
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High temperature on {{ $labels.instance }}"
      description: "Node temperature is above 75 degrees Celsius for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a temperature sensor on a node reads above 75 degrees Celsius for 5 minutes.

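NodeProcessCountHigh:

  - alert: NodeProcessCountHigh
    expr: node_processes_pids > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High process count on {{ $labels.instance }}"
      description: "Number of processes is above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node runs more than 500 processes for 5 minutes. The metric node_processes_pids comes from node_exporter's processes collector, which is disabled by default and is enabled with --collector.processes. (Counting node_scrape_collector_duration_seconds would count collectors, not processes.)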

windows_exporter Alerting Rules
WindowsCPUUsageHigh:

  - alert: WindowsCPUUsageHigh
    expr: avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) < 0.2
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's CPU usage stays above 80% (idle fraction below 0.2) for 5 minutes.
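
The same condition can be written as a busy percentage, which makes the alert's `$value` directly readable in annotations. The variant below is a sketch that mirrors the node_exporter CPU rule; the reworded description is an illustrative assumption.

```yaml
- alert: WindowsCPUUsageHigh
  # 100 minus the idle percentage gives the busy percentage.
  expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ printf \"%.1f\" $value }}% on {{ $labels.instance }}."
```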

WindowsMemoryUsageHigh:

  - alert: WindowsMemoryUsageHigh
    expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's memory usage stays above 90% for 5 minutes. The parentheses around the subtraction are required; without them, the division binds first and the expression no longer computes a usage percentage.

WindowsDiskUsageHigh:

  - alert: WindowsDiskUsageHigh
    expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100 < 15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk usage on {{ $labels.instance }}"
      description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a logical disk on a Windows node stays more than 85% full (less than 15% free) for 5 minutes.

WindowsNetworkDown:

  - alert: WindowsNetworkDown
    expr: up{job="windows_exporter"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node down: {{ $labels.instance }}"
      description: "Node has been down for more than 10 minutes."

Explanation: Fires when Prometheus has been unable to scrape a Windows node's windows_exporter for 10 minutes.
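
A rule like this can be exercised before deployment with promtool's rule unit tests. The sketch below assumes the rule is stored in a file named `windows_alerts.yml` and uses a made-up instance address; both are hypothetical.

```yaml
# windows_alerts_test.yml (hypothetical), run with: promtool test rules windows_alerts_test.yml
rule_files:
  - windows_alerts.yml             # hypothetical file holding the WindowsNetworkDown rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The scrape keeps failing: one sample of 0 per minute from 0m to 15m.
      - series: 'up{job="windows_exporter", instance="win1.example.internal:9182"}'
        values: '0x15'
    alert_rule_test:
      - eval_time: 11m             # past the 10m "for" window
        alertname: WindowsNetworkDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: windows_exporter
              instance: win1.example.internal:9182
            exp_annotations:
              summary: "Node down: win1.example.internal:9182"
              description: "Node has been down for more than 10 minutes."
```

The expected annotations must match the rendered rule annotations exactly, so the test also catches templating mistakes.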

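WindowsServiceNotRunning:

  - alert: WindowsServiceNotRunning
    expr: windows_service_state{state="running"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Service not running on {{ $labels.instance }}"
      description: "A critical service is not running for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a service on a Windows node has not been in the running state for 5 minutes. windows_service_state exposes one series per service and state, with value 1 for the current state and 0 otherwise, so a comparison such as == 2 can never match. In practice the selector should be restricted to the services that actually matter.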

WindowsPageFileUsageHigh:

  - alert: WindowsPageFileUsageHigh
    expr: windows_os_paging_free_bytes / windows_os_paging_limit_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Page File usage on {{ $labels.instance }}"
      description: "Page file usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's page file usage stays above 90% (less than 10% free) for 5 minutes.

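WindowsHandleCountHigh:

  - alert: WindowsHandleCountHigh
    expr: windows_process_handles > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Handle count on {{ $labels.instance }}"
      description: "Number of handles is above 10,000 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a process on a Windows node holds more than 10,000 handles for 5 minutes. The metric is reported per process by the process collector, so the threshold applies to each process rather than to the node as a whole.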

WindowsThreadCountHigh:

  - alert: WindowsThreadCountHigh
    expr: windows_process_threads > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Thread count on {{ $labels.instance }}"
      description: "Number of threads is above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a process on a Windows node runs more than 500 threads for 5 minutes. As with handles, the metric is reported per process.

WindowsProcessorQueueLengthHigh:

  - alert: WindowsProcessorQueueLengthHigh
    expr: windows_system_processor_queue_length > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Processor Queue Length on {{ $labels.instance }}"
      description: "Processor queue length is above 10 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's processor queue length stays above 10 for 5 minutes.

WindowsDiskIOWaitHigh:

  - alert: WindowsDiskIOWaitHigh
    expr: avg by (instance) (rate(windows_logical_disk_idle_seconds_total[5m])) < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk I/O Wait on {{ $labels.instance }}"
      description: "Disk I/O wait time is high for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the logical disks of a Windows node are idle less than 10% of the time on average (busy more than 90%) for 5 minutes. The idle-time counter is named windows_logical_disk_idle_seconds_total.

mysqld_exporter Alerting Rules
MySQLHighThreadsRunning:

  - alert: MySQLHighThreadsRunning
    expr: mysql_global_status_threads_running > 50
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High number of running threads in MySQL on {{ $labels.instance }}"
      description: "Number of running threads is above 50 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 50 running threads for 5 minutes.

MySQLSlowQueries:

  - alert: MySQLSlowQueries
    expr: rate(mysql_global_status_slow_queries[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Slow queries detected on {{ $labels.instance }}"
      description: "One or more slow queries detected in the last 10 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance keeps recording new slow queries for 10 minutes.

MySQLReplicationLag:

  - alert: MySQLReplicationLag
    expr: mysql_slave_status_seconds_behind_master > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Replication lag on MySQL slave {{ $labels.instance }}"
      description: "Replication lag is more than 10 seconds for over 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL replica lags more than 10 seconds behind its source for 5 minutes.

MySQLHighConnections:

  - alert: MySQLHighConnections
    expr: mysql_global_status_threads_connected > 200
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High number of connections to MySQL on {{ $labels.instance }}"
      description: "Number of connections is above 200 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 200 open connections for 5 minutes.

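MySQLLowFreeDiskSpace:

  - alert: MySQLLowFreeDiskSpace
    # Assumes the MySQL data directory lives on the filesystem mounted at /var/lib/mysql.
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes{mountpoint="/var/lib/mysql"} * 100 < 10
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Low free disk space for MySQL on {{ $labels.instance }}"
      description: "Free disk space for MySQL data directory is below 10% for more than 10 minutes on {{ $labels.instance }}."

Explanation: Fires when the filesystem holding the MySQL data directory has less than 10% free space for 10 minutes. mysqld_exporter does not export a numeric metric for the data directory (datadir is a path, not a size), so the expression relies on node_exporter's filesystem metrics. The mountpoint /var/lib/mysql is an assumption and should be changed to the filesystem that actually holds the data directory.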

MySQLHighQueryRate:

  - alert: MySQLHighQueryRate
    expr: rate(mysql_global_status_questions[5m]) > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High query rate on MySQL on {{ $labels.instance }}"
      description: "Query rate is above 1000 queries per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance handles more than 1000 queries per second for 5 minutes. The rule measures query rate, not query duration, so it is named MySQLHighQueryRate here rather than MySQLHighQueryTime.

MySQLInnoDBBufferPoolUsageHigh:

  - alert: MySQLInnoDBBufferPoolUsageHigh
    expr: (mysql_global_status_innodb_buffer_pool_bytes_data / mysql_global_variables_innodb_buffer_pool_size) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High InnoDB buffer pool usage on MySQL on {{ $labels.instance }}"
      description: "InnoDB buffer pool usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the data held in a MySQL instance's InnoDB buffer pool stays above 85% of the configured pool size for 5 minutes. The pool size comes from the innodb_buffer_pool_size server variable, since MySQL has no Innodb_buffer_pool_bytes_total status variable.

MySQLMaxConnectionsReached:

  - alert: MySQLMaxConnectionsReached
    expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Max connections threshold reached on MySQL on {{ $labels.instance }}"
      description: "Connections usage is above 90% of the max_connections limit for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance's connection count stays above 90% of max_connections for 5 minutes.

MySQLTableLocksContention:

  - alert: MySQLTableLocksContention
    expr: rate(mysql_global_status_table_locks_waited[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Table locks contention in MySQL on {{ $labels.instance }}"
      description: "More than 10 table locks waited per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance records more than 10 table lock waits per second for 5 minutes.

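MySQLLongRunningQueries:

  - alert: MySQLLongRunningQueries
    expr: mysql_global_status_queries{job="mysql"} > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Long running queries in MySQL on {{ $labels.instance }}"
      description: "More than 100 long running queries for over 10 minutes on {{ $labels.instance }}."

Explanation: Intended to fire when a MySQL instance has more than 100 long-running queries for 10 minutes. Note that mysql_global_status_queries is a cumulative counter of executed statements, not a count of long-running queries, so as written the rule fires as soon as the server has executed more than 100 statements. Counting long-running queries requires processlist data, for example from mysqld_exporter's --collect.info_schema.processlist collector.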

blackbox_exporter Alerting Rules
EndpointDown:

  - alert: EndpointDown
    expr: probe_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Endpoint down: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has been down for more than 5 minutes."

Explanation: Fires when probes against an endpoint have failed for 5 consecutive minutes.
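
The `instance` label in these annotations holds the probed endpoint only if the scrape job relabels it that way. Below is a minimal sketch of the commonly used blackbox_exporter scrape configuration; the job name, the exporter address 127.0.0.1:9115, the target URL, and the `http_2xx` module name are assumptions to adapt.

```yaml
scrape_configs:
  - job_name: blackbox_http            # hypothetical job name
    metrics_path: /probe
    params:
      module: [http_2xx]               # module defined in the exporter's own config
    static_configs:
      - targets:
          - https://example.com        # placeholder endpoint to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Expose the probed endpoint as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox_exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```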

EndpointHighLatency:

  - alert: EndpointHighLatency
    expr: probe_duration_seconds > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has a latency higher than 0.5 seconds for more than 5 minutes."

Explanation: Fires when an endpoint's probe duration stays above 0.5 seconds for 5 minutes.

EndpointDNSResolutionFailure:

  - alert: EndpointDNSResolutionFailure
    expr: probe_dns_lookup_time_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow DNS resolution for endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has DNS resolution time higher than 2 seconds for more than 5 minutes."

Explanation: Fires when DNS resolution for an endpoint takes longer than 2 seconds for 5 minutes. Despite its name, the rule detects slow resolution rather than outright failure.

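EndpointConnectionTimeout:

  - alert: EndpointConnectionTimeout
    expr: probe_http_duration_seconds{phase="connect"} > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connection timeout on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has a connection time higher than 1 second for more than 5 minutes."

Explanation: Fires when establishing a connection to an endpoint takes longer than 1 second for 5 minutes. For HTTP probes, blackbox_exporter reports the connection time as the connect phase of probe_http_duration_seconds.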

EndpointCertificateExpiry:

  - alert: EndpointCertificateExpiry
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
      description: "The SSL certificate for endpoint {{ $labels.instance }} will expire in less than 7 days."

Explanation: Fires when an endpoint's SSL certificate will expire within 7 days.
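
The threshold is 7 days expressed in seconds (86400 * 7 = 604800), and the alert's `$value` is the number of seconds left until expiry. As a sketch, the description could report that remaining time with Prometheus's `humanizeDuration` template function; the reworded description is an illustrative assumption.

```yaml
- alert: EndpointCertificateExpiry
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7   # 86400 s per day * 7 days
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
    # $value holds the left-hand side of the comparison: seconds until expiry.
    description: "The SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}."
```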

HTTPResponseCodeMismatch:

  - alert: HTTPResponseCodeMismatch
    expr: probe_http_status_code != 200
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unexpected HTTP response code on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} returned an unexpected HTTP response code for more than 5 minutes."

Explanation: Fires when an endpoint returns an HTTP status code other than 200 for 5 minutes.

EndpointHTTPRedirects:

  - alert: EndpointHTTPRedirects
    expr: probe_http_redirects > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many HTTP redirects on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} encountered more than 5 HTTP redirects for more than 5 minutes."

Explanation: Fires when probes of an endpoint follow more than 5 HTTP redirects for 5 minutes.

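EndpointDNSLookupFailures:

  - alert: EndpointDNSLookupFailures
    # Assumes DNS probes run in a scrape job named blackbox_dns.
    expr: probe_success{job="blackbox_dns"} == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DNS lookup failures on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} had DNS lookup failures for more than 10 minutes."

Explanation: Fires when DNS probes for an endpoint have failed for 10 minutes. probe_dns_lookup_time_seconds is a gauge holding the lookup duration, so applying increase() to it does not count failures; the rule instead checks probe_success for probes made with the DNS module. The job name blackbox_dns is an assumption and should match the scrape job that runs those probes.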

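EndpointTCPConnectionRefused:

  - alert: EndpointTCPConnectionRefused
    # Assumes TCP probes run in a scrape job named blackbox_tcp.
    expr: probe_success{job="blackbox_tcp"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "TCP connection refused on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} refused TCP connections for more than 5 minutes."

Explanation: Fires when TCP connections to an endpoint have failed for 5 consecutive minutes. blackbox_exporter reports the outcome of a TCP probe through probe_success rather than a dedicated TCP metric; the job name blackbox_tcp is an assumption and should match the scrape job that uses the TCP module.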

EndpointSSLCertificateInvalid:

  - alert: EndpointSSLCertificateInvalid
    expr: probe_ssl_last_chain_expiry_timestamp_seconds - time() < 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Invalid SSL certificate on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has an invalid SSL certificate."

Explanation: Fires when an endpoint's SSL certificate chain has expired, which makes the certificate invalid.

cAdvisor Alerting Rules
ContainerCPUUsageHigh:

  - alert: ContainerCPUUsageHigh
    expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has CPU usage above 80% of one core for more than 5 minutes."

Explanation: Fires when a container uses more than 0.8 CPU cores (80% of a single core) for 5 minutes. The value is measured in cores, not as a share of the container's CPU limit.

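ContainerMemoryUsageHigh:

  - alert: ContainerMemoryUsageHigh
    expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has memory usage above 90% for more than 5 minutes."

Explanation: Fires when a container's memory usage stays above 90% of its memory limit for 5 minutes. The > 0 filter skips containers without a limit, whose limit is reported as 0 and would otherwise produce a division by zero that always exceeds the threshold.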

ContainerDiskIOHigh:

  - alert: ContainerDiskIOHigh
    expr: rate(container_fs_io_time_seconds_total[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk I/O on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has disk I/O usage above 50% for more than 5 minutes."

Explanation: Fires when a container spends more than 50% of the time on disk I/O for 5 minutes.

ContainerNetworkErrors:

  - alert: ContainerNetworkErrors
    expr: rate(container_network_receive_errors_total[5m]) > 0 or rate(container_network_transmit_errors_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network errors on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} is experiencing network errors for more than 5 minutes."

Explanation: Fires when a container keeps seeing network receive or transmit errors for 5 minutes.

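ContainerRestarting:

  - alert: ContainerRestarting
    expr: changes(container_start_time_seconds[5m]) > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container restarting frequently: {{ $labels.container }}"
      description: "Container {{ $labels.container }} has restarted more than once in the last 5 minutes."

Explanation: Fires when a container keeps restarting more than once per 5-minute window for 5 minutes. cAdvisor does not export a container_restart_count metric, so the rule counts changes of container_start_time_seconds instead. In Kubernetes, where every restart creates a new container, kube-state-metrics' kube_pod_container_status_restarts_total is the usual source for restart counts.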

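ContainerOOMKilled:

  - alert: ContainerOOMKilled
    expr: increase(container_oom_events_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container OOMKilled: {{ $labels.container }}"
      description: "Container {{ $labels.container }} was killed due to out of memory in the last 5 minutes."

Explanation: Fires when a container was killed for running out of memory within the last 5 minutes. PromQL cannot compare a metric against a string such as "OOMKilled", so the rule uses cAdvisor's container_oom_events_total counter. The for duration is 0m because a single OOM event keeps the increase above zero for only about 5 minutes.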

ContainerFileSystemFull:

  - alert: ContainerFileSystemFull
    expr: container_fs_usage_bytes / container_fs_limit_bytes > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "File system full on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} file system usage is above 90% for more than 5 minutes."

Explanation: Fires when a container's filesystem usage stays above 90% for 5 minutes.

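ContainerMemoryLeak:

  - alert: ContainerMemoryLeak
    expr: delta(container_memory_usage_bytes[1h]) > 1000000000
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Possible memory leak on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has increased its memory usage by more than 1 GB in the last hour."

Explanation: Fires when a container's memory usage has grown by more than 1 GB over the past hour, which may indicate a memory leak. container_memory_usage_bytes is a gauge, so the growth is measured with delta() rather than increase(), which is meant for counters.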

ContainerHighCPUThrottling:

  - alert: ContainerHighCPUThrottling
    expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU throttling on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} is experiencing CPU throttling more than 20% of the time for more than 5 minutes."

Explanation: Fires when a container is throttled in more than 20% of its CFS scheduling periods for 5 minutes. Both sides of the ratio count periods; dividing throttled seconds by a period count would mix units.

ContainerNetworkTrafficDrop:

  - alert: ContainerNetworkTrafficDrop
    expr: rate(container_network_receive_packets_dropped_total[5m]) > 0 or rate(container_network_transmit_packets_dropped_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network traffic drops on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} is experiencing network packet drops for more than 5 minutes."

Explanation: Fires when a container keeps dropping received or transmitted network packets for 5 minutes.

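These alerting rules build on different exporters and help monitor the health of systems and services. With suitable thresholds and conditions, potential problems can be detected and resolved early, keeping systems stable and performant.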
