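Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.
Official site: https://prometheus.io  Latest version: 2.19.2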
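An exporter is a component that collects monitoring data and exposes it in the format Prometheus expects, giving Prometheus an interface to scrape.
An exporter exposes its collected data through an HTTP endpoint. The Prometheus server fetches the monitoring data it needs by requesting that endpoint. Different exporters cover different kinds of workloads.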
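To make the exposure mechanism concrete, here is a minimal sketch of an exporter-like HTTP endpoint written with only the Python standard library. The metric name `demo_load1`, port 9200, and the helper names are assumptions made for illustration; real exporters such as node_exporter expose far more and are implemented differently.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(load1: float) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_load1 1-minute load average (illustrative metric).\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current value; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # os.getloadavg() is available on Unix-like systems only.
        body = render_metrics(os.getloadavg()[0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9200) -> None:
    """Serve /metrics until interrupted (port 9200 is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` on a host and listing that host and port as a scrape target would, under these assumptions, let the Prometheus server pull `demo_load1` on each scrape.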
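Prometheus: an open-source system monitoring and alerting framework inspired by Google's Borgmon monitoring system.
Alertmanager: handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups alerts, routes them to the correct receiver integration, and also takes care of silencing and inhibiting alerts.
Node_Exporter: an exporter for monitoring the resource usage of each node; it should be deployed on every node that Prometheus monitors.
PushGateway: a push gateway that receives data pushed by the nodes and exposes it to the Prometheus server.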
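As a rough sketch of the push side, the snippet below builds (but does not send) an HTTP request that pushes one metric to a Pushgateway. The gateway address `192.168.30.135:9091`, the job and instance names, and the metric itself are all assumptions for illustration; this series does not deploy a Pushgateway on any of the hosts listed.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway: str, job: str, instance: str, metrics: dict) -> urllib.request.Request:
    """Build a PUT request that replaces all metrics of one job/instance group."""
    # Pushgateway groups pushed metrics by the labels encoded in the URL path.
    url = (
        f"http://{gateway}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    # The body uses the text exposition format: one "name value" pair per line.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical example: a batch job reporting when it last succeeded.
request = build_push_request(
    "192.168.30.135:9091",
    job="backup",
    instance="node1",
    metrics={"backup_last_success_timestamp_seconds": 1593561600},
)
```

Passing `request` to `urllib.request.urlopen` would send it. PUT replaces every metric in the group, while POST replaces only metrics with the same name, so the choice depends on whether the job pushes its full metric set each time.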
Documentation: https://prometheus.io/docs/introduction/overview/
Download the Prometheus components:
https://prometheus.io/download/
Environment preparation
- Host overview:
OS | IP | Role | CPU | Memory | Hostname |
---|---|---|---|---|---|
CentOS 7.8 | 192.168.30.135 | prometheus, node1 | >=2 | >=2G | prometheus |
CentOS 7.8 | 192.168.30.136 | alertmanager, node2 | >=2 | >=2G | altermanager |
CentOS 7.8 | 192.168.30.137 | grafana, node3 | >=2 | >=2G | grafana |
- Disable the firewall and SELinux on all hosts:
systemctl stop firewalld && systemctl disable firewalld
sed -i 's/=enforcing/=disabled/g' /etc/selinux/config && setenforce 0
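Configuring rules
The previous parts deployed Prometheus, node_exporter, Alertmanager, and Grafana, introduced PromQL, and demonstrated queries against the collected metrics.
This part configures the rules needed for recording and alerting.
Prometheus rules come in two kinds: recording rules and alerting rules. The rule_files setting lists the rule files to load; it accepts multiple entries as well as glob patterns (for example rules/*.yml) that match every rule file in a directory.
- Current configuration: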
cat /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.30.136:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.30.135:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['192.168.30.135:9100','192.168.30.136:9100','192.168.30.137:9100']

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['192.168.30.136:9093']
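- Recording rule configuration:
Recording rules precompute expressions that are needed frequently or are expensive to evaluate, and save the result as a new set of time series. Querying the precomputed result is usually much faster than evaluating the original expression every time it is needed. This is especially useful for dashboards, which re-run the same expressions on every refresh.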
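To show what such a precomputed expression works out, the sketch below reproduces in plain Python the arithmetic behind a CPU idle percentage of the form `avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100`, as used in this section's rule file. The sample values are invented, and the sketch assumes no counter reset occurs between the two samples.

```python
from statistics import mean


def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples of a counter."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2:]
    return (v_last - v_prev) / (t_last - t_prev)


def cpu_idle_percent(idle_counters_by_cpu):
    """Average the per-core idle rates of one instance and scale to a percentage."""
    return mean(irate(s) for s in idle_counters_by_cpu.values()) * 100


# Invented idle-seconds counters for a two-core node, scraped 15 seconds apart.
samples = {
    "cpu0": [(0, 100.0), (15, 113.5)],  # 13.5 idle seconds in 15 s
    "cpu1": [(0, 200.0), (15, 212.0)],  # 12.0 idle seconds in 15 s
}

idle = cpu_idle_percent(samples)
used = 100 - idle
print(f"idle: {idle:.1f}%  used: {used:.1f}%")
```

Because `irate` looks only at the last two samples inside the range, the `[5m]` window bounds how far back those samples may lie rather than defining a five-minute average.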
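Recording and alerting rules live inside rule groups. The rules in a group are evaluated sequentially at a regular interval. The names of recording and alerting rules must be valid metric names.
Note: colons are reserved for user-defined recording rules and should not be used by exporters.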
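The naming constraint can be checked mechanically. The following sketch validates candidate names against the metric name pattern documented by Prometheus, `[a-zA-Z_:][a-zA-Z0-9_:]*`; the helper names are hypothetical, and treating "contains a colon" as a marker for recording rules is only the convention described above, not something Prometheus enforces.

```python
import re

# Metric name pattern from the Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def looks_like_recording_rule(name: str) -> bool:
    """Return True for valid names that follow the colon convention."""
    return is_valid_metric_name(name) and ":" in name


for candidate in ["node:cpu:idle:percent", "node_load1", "5m:load", "node-load"]:
    print(candidate, is_valid_metric_name(candidate), looks_like_recording_rule(candidate))
```

Names such as `node:cpu:idle:percent` pass both checks, whereas an exporter metric like `node_load1` is valid but carries no colon, which keeps the two kinds of series easy to tell apart.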
mkdir /usr/local/prometheus/rules
vim /usr/local/prometheus/rules/node-record-rules.yml
groups:
  - name: node-record
    rules:
    # system
    - record: node:up
      expr: up{job="node"}
      labels:
        job: "node"
        unit:
        desc: "Whether the node is online: 1 = online, 0 = offline"
    - record: node:name
      expr: count by (nodename) (node_uname_info)
      labels:
        job: "node"
        unit:
        desc: "Node hostname"
    - record: node:uptime
      expr: time() - node_boot_time_seconds{}
      labels:
        job: "node"
        unit: s
        desc: "Node uptime"
    # cpu
    - record: node:cpu:num
      expr: count by (instance) (node_cpu_seconds_total{job="node",mode='system'})
      labels:
        job: "node"
        unit: v
        desc: "Number of CPU cores on the node"
    - record: node:cpu:idle:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU idle percentage over 5m"
    - record: node:cpu:used:percent
      expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Total CPU usage percentage over 5m"
    - record: node:cpu:system:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="system"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU system usage percentage of the node"
    - record: node:cpu:user:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="user"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU user usage percentage of the node"
    - record: node:cpu:iowait:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="iowait"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU iowait percentage of the node"
    - record: node:cpu:other:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode=~"softirq|nice|irq|steal"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU other usage percentage of the node"
    # memory
    - record: node:memory:total
      expr: node_memory_MemTotal_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Total memory size of the node"
    - record: node:memory:avail
      expr: node_memory_MemAvailable_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Available memory of the node"
    - record: node:memory:free
      expr: node_memory_MemFree_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Free memory of the node"
    - record: node:memory:used
      expr: node_memory_MemTotal_bytes{job="node"} - node_memory_MemFree_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Total used memory of the node"
    - record: node:memory:actuallyused
      expr: node_memory_MemTotal_bytes{job="node"} - node_memory_MemAvailable_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Actually used memory of the node"
    - record: node:memory:avail:percent
      expr: node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"} * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Available memory percentage of the node"
    - record: node:memory:free:percent
      expr: node_memory_MemFree_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"} * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Free memory percentage of the node"
    - record: node:memory:used:percent
      expr: (1 - node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"}) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Used memory percentage of the node"
    # load
    - record: node:load:load1
      expr: node_load1
      labels:
        job: "node"
        unit:
        desc: "Node 1m load"
    - record: node:load:load5
      expr: node_load5
      labels:
        job: "node"
        unit:
        desc: "Node 5m load"
    - record: node:load:load15
      expr: node_load15
      labels:
        job: "node"
        unit:
        desc: "Node 15m load"
    - record: node:load:load1:sum
      expr: sum by (job) (node_load1)
      labels:
        job: "node"
        unit:
        desc: "Overall 1m load of the nodes"
    - record: node:load:load5:sum
      expr: sum by (job) (node_load5)
      labels:
        job: "node"
        unit:
        desc: <