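Basic query methods
PromQL aggregation operators
Prometheus provides the following built-in aggregation operators. They act on instant vectors: the samples returned by an instant-vector expression are aggregated into a new vector with fewer elements.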
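- sum (sum)
- min (minimum)
- max (maximum)
- avg (average)
- stddev (population standard deviation)
- stdvar (population standard variance)
- count (number of elements in the vector)
- count_values (number of elements that share the same value)
- bottomk (the k elements with the smallest sample values)
- topk (the k elements with the largest sample values)
- quantile (the φ-quantile over the elements, 0 ≤ φ ≤ 1)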
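The following are not aggregation operators, but they are used together with them throughout these notes:
- increase (function: how much a counter grew over a range vector)
- rate (function: per-second average rate of increase over a range vector)
- offset (modifier: shifts the query back in time; offset 1d means one day earlier)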
For the full syntax, see the official Prometheus documentation.
Example:
count(nginx_ingress_controller_requests)  # number of matching series
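To make the aggregation semantics concrete, here is a minimal Python sketch (not Prometheus code) of what count(...) and sum by (status) (...) do to an instant vector. The label values and numbers are made up for illustration.

```python
from collections import defaultdict

# A hypothetical instant vector: one (labels, value) pair per time series.
samples = [
    ({"ingress": "api", "status": "200"}, 120.0),
    ({"ingress": "api", "status": "500"}, 3.0),
    ({"ingress": "web", "status": "200"}, 80.0),
]

def count(vector):
    """Like count(v): the number of series, not the sum of their values."""
    return len(vector)

def sum_by(vector, label):
    """Like sum by (label) (v): one output value per distinct label value."""
    totals = defaultdict(float)
    for labels, value in vector:
        totals[labels[label]] += value
    return dict(totals)

print(count(samples))             # 3
print(sum_by(samples, "status"))  # {'200': 200.0, '500': 3.0}
```

Note that count returns the number of series, which is why the total request volume later in these notes is computed with sum rather than count.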
Label matching
nginx_ingress_controller_requests{ingress="api",status="101"}  # separate multiple label matchers with ","
Negative and regular-expression matchers are also supported:
http_requests_total{code!="200"}  # series whose code is not "200"
http_requests_total{code=~"2.*"}  # series whose code starts with "2" (2xx)
http_requests_total{code!~"2.*"}  # series whose code does not start with "2"
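Prometheus regular-expression matchers are fully anchored, so "2.*" has to match the whole label value. The sketch below imitates the four matcher operators in Python; it uses Python's re module as a stand-in for the RE2 syntax Prometheus uses, which is an assumption that holds for simple patterns like these. The helper name matches is hypothetical.

```python
import re

def matches(op, pattern, value):
    """Evaluate one label matcher against a label value."""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # fullmatch models the implicit ^...$ anchoring
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")

print(matches("=~", "2.*", "200"))   # True
print(matches("=~", "2.*", "502"))   # False: the value must start with "2"
print(matches("=~", "2..", "2000"))  # False: "2.." allows exactly three characters
print(matches("!~", "2.*", "404"))   # True
```

If only three-digit 2xx codes should match, "2.." is the stricter pattern.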
Simple arithmetic on the result is also supported:
count(nginx_ingress_controller_requests) / 2  # append an arithmetic operation; mostly used for unit conversion
API calls
Syntax
http://prometheus_ipaddr:9090/api/v1/query?query=sum(nginx_ingress_controller_requests%20{ingress=%22api%22,status=~%22301%22}%20)%20/2
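Explanation
Instant queries
URL query parameters:
query=: the PromQL expression.
time=: the evaluation timestamp (RFC 3339 or Unix timestamp). Optional; defaults to the current server time.
timeout=: the evaluation timeout. Optional; defaults to, and is capped by, the global --query.timeout flag.
GET /api/v1/query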
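Hand-encoding the expression in the URL is error-prone. A minimal sketch of building the same kind of instant-query URL with Python's standard library follows; the helper name instant_query_url is hypothetical and prometheus_ipaddr is the placeholder host used above.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

def instant_query_url(base_url, promql, time=None, timeout=None):
    """Build a GET URL for /api/v1/query with a percent-encoded expression."""
    params = {"query": promql}
    if time is not None:
        params["time"] = time        # RFC 3339 string or Unix timestamp
    if timeout is not None:
        params["timeout"] = timeout  # duration string such as "30s"
    # quote_via=quote encodes spaces as %20 instead of "+".
    return f"{base_url}/api/v1/query?{urlencode(params, quote_via=quote)}"

expr = 'sum(nginx_ingress_controller_requests{ingress="api",status=~"301"}) / 2'
url = instant_query_url("http://prometheus_ipaddr:9090", expr, timeout="30s")
print(url)

# Round trip: decoding the query string gives back the original expression.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0] == expr)  # True
```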
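Range queries
query=: the PromQL expression.
start=: the start timestamp.
end=: the end timestamp.
step=: the query resolution step, as a duration or a number of seconds.
timeout=: the evaluation timeout. Optional; defaults to, and is capped by, the global --query.timeout flag.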
GET /api/v1/query_range
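A range query evaluates the expression at start, start + step, and so on up to end, so the step directly determines how many points come back per series. The sketch below builds such a URL and counts the points; the helper names and the timestamps are hypothetical.

```python
from urllib.parse import quote, urlencode

def range_query_url(base_url, promql, start, end, step):
    """Build a GET URL for /api/v1/query_range. start/end are Unix seconds."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    if step <= 0:
        raise ValueError("step must be positive")
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params, quote_via=quote)}"

def points_per_series(start, end, step):
    """Number of evaluation timestamps: start, start+step, ... up to end."""
    return (end - start) // step + 1

start, end, step = 1_700_000_000, 1_700_003_600, 60  # one hour, one point per minute
range_url = range_query_url("http://prometheus_ipaddr:9090", "up", start, end, step)
print(range_url)
print(points_per_series(start, end, step))  # 61
```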
Common production rules
nginx-ingress
1. Total Requests (24h)
Total number of requests in the last 24 hours
sum(increase(nginx_ingress_controller_requests[24h]))
Invalid Requests (24h)
Number of failed (5xx) requests in the last 24 hours (test environments excluded)
sum(increase(nginx_ingress_controller_requests{namespace!~".*test.*",status=~"5.*"}[24h]))
2. HTTP Error Rate (‰) (12h)
Failed (5xx) requests per thousand over the last 12 hours (test environments excluded)
sum(rate(nginx_ingress_controller_requests{namespace!~".*test.*",status=~"5.*"}[12h])) / sum(rate(nginx_ingress_controller_requests{namespace!~".*test.*"}[12h])) * 1000
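The arithmetic behind the per-mille figure can be sketched in Python. This is a simplification: it treats rate() as "counter increase divided by window length" and ignores the extrapolation and counter-reset handling that Prometheus performs. The function names and the request counts are invented.

```python
WINDOW_SECONDS = 12 * 60 * 60  # the [12h] range

def per_second_rate(increase, window_seconds=WINDOW_SECONDS):
    """Simplified rate(): counter increase divided by the window length."""
    return increase / window_seconds

def error_rate_per_mille(error_increase, total_increase):
    """Errors per thousand requests; None when there was no traffic."""
    if total_increase == 0:
        return None
    return per_second_rate(error_increase) / per_second_rate(total_increase) * 1000

# 36 failed requests out of 120,000 is roughly 0.3 per thousand.
print(error_rate_per_mille(36, 120_000))
```

Because both rates use the same window, the window length cancels out, and the ratio is the same as the ratio of the two increases.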
3. Median Response (5m)
Method 1: average (mean, not median) response time in milliseconds over the last 5 minutes (test environments excluded)
sum(increase(nginx_ingress_controller_response_duration_seconds_sum{namespace!~".*test.*",status=~"2.*"}[5m]))
/
sum(increase(nginx_ingress_controller_requests{namespace!~".*test.*",status=~"2.*"}[5m])) * 1000
Method 2: 95th-percentile response time in milliseconds over the last 5 minutes (test environments excluded)
histogram_quantile(0.95, sum(rate(nginx_ingress_controller_response_duration_seconds_bucket{namespace!~".*test.*",status=~"2.*"}[5m])) by (le)) * 1000
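histogram_quantile estimates a quantile from cumulative bucket counts by assuming values are spread linearly inside the bucket that contains the target rank. The Python sketch below is a simplified reimplementation of that idea, not Prometheus's actual code: it assumes non-negative bucket bounds and 0 < q <= 1, and the bucket counts are invented. In the query above the inputs are per-bucket rates rather than raw counts, but the interpolation is the same.

```python
import math

def histogram_quantile(q, buckets):
    """buckets: (le, cumulative count) pairs, including one with le = +Inf."""
    if not 0 < q <= 1:
        raise ValueError("this sketch supports 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("a +Inf bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == math.inf:
        # The rank falls past the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Bounds in seconds, counts cumulative.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95_seconds = histogram_quantile(0.95, buckets)
print(p95_seconds * 1000)  # about 812.5 ms
```

The result can never be more precise than the bucket layout: with these bounds, any 95th percentile between 0.5 s and 1.0 s is an interpolated estimate.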
4. Requests added during the 24 hours ending at this time yesterday (offset by one day)
sum(increase(nginx_ingress_controller_requests[24h] offset 1d))
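The offset modifier moves the whole range window back in time. A small Python sketch of which time span the selector covers, using a made-up evaluation time and a hypothetical helper query_window:

```python
from datetime import datetime, timedelta, timezone

def query_window(eval_time, range_, offset=timedelta(0)):
    """Return (start, end) of the samples a range selector looks at."""
    end = eval_time - offset
    return end - range_, end

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # made-up evaluation time

# [24h]: from yesterday noon to today noon
print(query_window(now, timedelta(hours=24)))
# [24h] offset 1d: from noon two days ago to yesterday noon
print(query_window(now, timedelta(hours=24), offset=timedelta(days=1)))
```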
node_exporter
Less than 2 GB free on an instance mount point
(node_filesystem_avail_bytes{mountpoint !~ "(/run|/tmp|/boot|.*kubelet|.*docker/).*",device!="tmpfs"} / 1024 / 1024 / 1024 < 2 and on(instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left(nodename) node_uname_info{nodename=~".+"}
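The expression above combines three filters (mountpoint regex, device, read-only flag) with a byte-to-GiB conversion. The Python sketch below walks through the same logic on invented filesystem data; it leaves out the final join with node_uname_info, which only attaches the nodename label. The dictionary layout and the function name are assumptions made for illustration.

```python
import re

# Same pattern as the mountpoint matcher in the expression above.
EXCLUDED_MOUNTPOINTS = re.compile(r"(/run|/tmp|/boot|.*kubelet|.*docker/).*")
GIB = 1024 ** 3

def low_space_filesystems(filesystems, threshold_gib=2):
    """Return (mountpoint, free GiB) for writable filesystems below the threshold."""
    result = []
    for fs in filesystems:
        if EXCLUDED_MOUNTPOINTS.fullmatch(fs["mountpoint"]):
            continue  # mountpoint !~ "..."
        if fs["device"] == "tmpfs":
            continue  # device != "tmpfs"
        if fs["readonly"]:
            continue  # and on(...) node_filesystem_readonly == 0
        free_gib = fs["avail_bytes"] / GIB  # / 1024 / 1024 / 1024
        if free_gib < threshold_gib:
            result.append((fs["mountpoint"], free_gib))
    return result

filesystems = [
    {"mountpoint": "/", "device": "/dev/vda1", "avail_bytes": 1 * GIB, "readonly": False},
    {"mountpoint": "/data", "device": "/dev/vdb1", "avail_bytes": 50 * GIB, "readonly": False},
    {"mountpoint": "/var/lib/docker/overlay2", "device": "/dev/vda1", "avail_bytes": GIB, "readonly": False},
    {"mountpoint": "/run/user/0", "device": "tmpfs", "avail_bytes": GIB, "readonly": False},
    {"mountpoint": "/mnt/iso", "device": "/dev/sr0", "avail_bytes": 0, "readonly": True},
]
print(low_space_filesystems(filesystems))  # [('/', 1.0)]
```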
## Purpose: count all alerts
groups:
- name: recording_rules
  rules:
  - record: ALERTS_FOR_STATE:firing
    expr: ALERTS_FOR_STATE and ignoring(alertstate) ALERTS{alertstate="firing"}
- name: recording_rules-1
  rules:
  - record: ALERTS_FOR_STATE:pending
    expr: ALERTS_FOR_STATE and ignoring(alertstate) ALERTS{alertstate="pending"}
- name: example
  rules:
  # Alert for any instance that is unreachable for more than 2 minutes.
  - alert: Instance down
    expr: up == 0
    for: 2m
    labels:
      severity: fatal
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes!"
- name: disk warning
  rules:
  # Alert when free disk space stays below 2GB for more than 2 minutes.
  - alert: Instance directory free space below 2GB
    expr: (node_filesystem_avail_bytes{mountpoint !~ "(/run|/tmp|/boot|.*kubelet|.*docker/).*",device!="tmpfs"} / 1024 / 1024 / 1024 < 2 and on(instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left(nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: fatal
      #severity: deadly
    annotations:
      summary: "Instance directory free space below 2GB"
      description: "{{ $labels.instance }} of job {{ $labels.job }} \nInstance: free space on {{ $labels.mountpoint }} is below 2GB!\nCurrent value = {{ $value }}\n"
# Renamed from "disk warning": group names must be unique within a rule file.
- name: disk warning percent
  rules:
  # Alert when free disk space stays below 5% for more than 2 minutes.
  - alert: Instance directory free space below 5%
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 5 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: fatal
      #severity: deadly
    annotations:
      summary: "Instance directory free space below 5%"
      description: "{{ $labels.instance }} of job {{ $labels.job }} \nInstance: free space on {{ $labels.mountpoint }} is below 5%!\nCurrent value = {{ $value }}\n"
# Alert when available memory stays below 5% for 5 minutes.
- name: free used
  rules:
  - alert: Instance available memory below 5%
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    #expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5) * on(instance) group_left(nodename) (node_uname_info)
    for: 5m
    labels:
      severity: danger
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} of job {{ $labels.job }} instance available memory is below 5% \n Current value = {{ $value }}\n "
# Alert when CPU load stays high for 10 minutes.
- name: cpu used
  rules:
  - alert: CPU load too high for 10m
    expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.95) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 10m
    labels:
      severity: danger
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} of job {{ $labels.job }} instance CPU load is above 95%\n Current value = {{ $value }}\n "
#### Rules for the Tencent Cloud TKE cluster
- name: tke disk warning
  rules:
  # Alert when free disk space stays below 10% for more than 2 minutes.
  - alert: Tencent Cloud TKE production cluster node, instance directory free space below 10%
    expr: ((node_filesystem_avail_bytes{job="tencent-tke-prod-node-exporter"} * 100) / node_filesystem_size_bytes{job="tencent-tke-prod-node-exporter"} < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: fatal
      #severity: deadly
    annotations:
      summary: "Instance directory free space below 10%"
      description: "{{ $labels.instance }} of job {{ $labels.job }} \nInstance: free space on {{ $labels.mountpoint }} is below 10%!\nCurrent value = {{ $value }}\n"
# Alert when available memory stays below 10% for 5 minutes.
- name: tke free used
  rules:
  - alert: Tencent Cloud TKE production cluster node, instance available memory below 10%
    expr: (node_memory_MemAvailable_bytes{job="tencent-tke-prod-node-exporter"} / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: danger
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} of job {{ $labels.job }} instance available memory is below 10% \n Current value = {{ $value }}\n "
# Alert when CPU load stays high for 10 minutes.
- name: tke cpu used tke
  rules:
  - alert: Tencent Cloud TKE production cluster node, CPU load too high for 10m
    expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{job="tencent-tke-prod-node-exporter",mode!="idle"}[2m]))) > 0.90) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 10m
    labels:
      severity: danger
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} of job {{ $labels.job }} instance CPU load is above 90%\n Current value = {{ $value }}\n "
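The CPU rules above first average the per-second rate of each non-idle mode across cores (avg by (mode, instance)) and then add those per-mode averages per instance (sum by (instance)), which gives the busy fraction of the machine as a number between 0 and 1. A minimal Python sketch of that arithmetic for a single instance, with invented rate values and a hypothetical function name:

```python
from collections import defaultdict

def busy_fraction(rates):
    """rates: (cpu, mode, per-second rate) tuples for one instance, idle excluded."""
    by_mode = defaultdict(list)
    for _cpu, mode, rate in rates:
        by_mode[mode].append(rate)
    # avg by (mode, instance): mean across CPU cores for each mode
    avg_by_mode = {mode: sum(v) / len(v) for mode, v in by_mode.items()}
    # sum by (instance): add the per-mode means
    return sum(avg_by_mode.values())

rates = [
    ("0", "user", 0.6), ("0", "system", 0.2),
    ("1", "user", 0.8), ("1", "system", 0.1),
]
load = busy_fraction(rates)
print(round(load, 2))  # 0.85
print(load > 0.95)     # False: this instance would not trigger the alert
```

Since the result is a fraction, the {{ $value }} printed in the annotations is a value such as 0.96 rather than 96.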