Official site: https://prometheus.io/
Download and Installation
Download page: https://prometheus.io/download/
- Download prometheus, alertmanager, node_exporter, and mysqld_exporter
Prometheus Server
Extract the prometheus archive and start it:
tar -zxvf prometheus-2.36.0.linux-amd64.tar.gz
cd prometheus-2.36.0.linux-amd64
./prometheus
Visit http://192.168.10.129:9090 (192.168.10.129 is the VM's address).
Visit http://192.168.10.129:9090/metrics to view the raw metric data.
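The /metrics endpoint returns data in the Prometheus text exposition format: one `name{labels} value` sample per line, plus `# HELP`/`# TYPE` comment lines. A minimal parsing sketch (the sample lines below are illustrative, not actual node_exporter output):

```python
# Minimal parser for the Prometheus text exposition format.
# The SAMPLE input is illustrative; a real /metrics page has many more series.
import re

SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 890.12
"""

LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Yield (metric_name, labels_dict, float_value) for each sample line."""
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip HELP/TYPE comments and blank lines
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ''))
        yield name, labels, float(value)

samples = list(parse_metrics(SAMPLE))
print(samples[0])  # ('node_cpu_seconds_total', {'cpu': '0', 'mode': 'idle'}, 12345.67)
```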
Exporter
Exporters produce the monitoring data; Prometheus Server pulls (scrapes) it from them.
Full list of exporters: https://prometheus.io/docs/instrumenting/exporters/
Install node_exporter, which exposes host- and OS-level metrics:
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64
./node_exporter
Visit http://192.168.10.129:9100/
node_cpu_xxx: CPU metrics
node_disk_xxx: disk metrics
node_filesystem_xxx: filesystem metrics
node_memory_xxx: memory metrics
and so on.
node_exporter GitHub repo: https://github.com/prometheus/node_exporter
To disable a collector, use --no-collector.<name>; to enable one, use --collector.<name>
# disable the CPU collector
./node_exporter --no-collector.cpu
Edit prometheus.yml on the Prometheus Server:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      # static targets: tell the server where to find the exporter
      - targets: ["localhost:9100"]
Restart Prometheus Server; the node_exporter metrics now show up.
AlertManager Alerting
Alerting rules docs: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Configure the alerting rules in alert.yaml:
groups:
  # name of the rule group
  - name: test-group
    rules:
      # name of the alerting rule
      - alert: TestRule
        # PromQL expression; when it evaluates true, the alert is triggered
        expr: node_disk_read_bytes_total{device="sda", instance="localhost:9100", job="node_exporter"} > 20
        # how long the condition must hold continuously before the alert fires
        for: 10s
        # custom labels
        labels:
          node_disk_read_bytes_total: node_disk_read_bytes_total
        # annotations attach extra information, e.g. the alert details
        annotations:
          # short summary of the alert
          summary: "disk metric abnormal, custom label: {{ $labels.node_disk_read_bytes_total }}, instance: {{ $labels.instance }}"
          # detailed description of the alert
          description: "disk metric on {{ $labels.instance }} is abnormal, value: {{ $value }}"
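The `for: 10s` clause means the rule sits in a "pending" state until the expression has been true continuously for that duration, and only then moves to "firing". A rough sketch of that transition (this mimics the pending/firing semantics, not the actual Prometheus code):

```python
# Sketch of the `for:` semantics in an alerting rule: the alert fires only
# after the expression has held continuously for the given duration.
FOR_SECONDS = 10

def alert_states(samples, threshold=20, for_seconds=FOR_SECONDS):
    """samples: list of (timestamp_seconds, value). Returns the state per evaluation."""
    states = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition became true at this evaluation
            state = 'firing' if ts - pending_since >= for_seconds else 'pending'
        else:
            pending_since = None  # condition broke; the pending timer resets
            state = 'inactive'
        states.append(state)
    return states

# Evaluated every 5s (cf. evaluation_interval); the value crosses the threshold at t=5.
print(alert_states([(0, 10), (5, 25), (10, 30), (15, 31), (20, 5)]))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```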
Add the rule file path to prometheus.yml:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.yaml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      # static targets: tell the server where to find the exporter
      - targets: ["localhost:9100"]
Restart Prometheus Server.
AlertManager
When Prometheus Server finds that an alerting rule's condition is satisfied, it pushes the alert to AlertManager.
AlertManager configuration docs: https://prometheus.io/docs/alerting/latest/configuration/
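What Prometheus pushes is a JSON array of alert objects; you can build one yourself and POST it to AlertManager's v2 API (`POST /api/v2/alerts`) to test routing before wiring up real rules. A sketch of such a payload (field names follow the AlertManager v2 API; the label values are examples):

```python
# Build a test alert payload of the kind Prometheus pushes to AlertManager.
# You could send it with e.g. curl or urllib to http://localhost:9093/api/v2/alerts.
import json
from datetime import datetime, timezone

alert = {
    "labels": {"alertname": "TestRule", "instance": "localhost:9100",
               "severity": "warning"},
    "annotations": {"summary": "manual test alert"},
    "startsAt": datetime.now(timezone.utc).isoformat(),  # RFC 3339 timestamp
}
payload = json.dumps([alert])  # the API expects a JSON *array* of alerts
print(payload)
```

With AlertManager running, something like `curl -XPOST -H 'Content-Type: application/json' -d "$payload" http://localhost:9093/api/v2/alerts` should make the alert appear in its UI.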
tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz
cd alertmanager-0.24.0.linux-amd64
Edit alertmanager.yml:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'email'
    email_configs:
      - to: xxx@qq.com
        from: yyy@qq.com
        smarthost: smtp.qq.com:465
        auth_username: yyy@qq.com
        # authorization code (not the account password)
        auth_password: uinynfsegdlibage
        # must be false when using port 465, otherwise startup fails
        require_tls: false
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
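The `inhibit_rules` entry above means: while a 'critical' alert is firing, suppress 'warning' alerts that carry the same values for every label listed in `equal`. A simplified sketch of that matching logic (real AlertManager matching also supports regexes):

```python
# Sketch of the inhibition rule above: a firing 'critical' alert suppresses
# 'warning' alerts that agree on all labels in EQUAL. (Simplified.)
EQUAL = ['alertname', 'dev', 'instance']

def is_inhibited(target, firing_alerts, equal=EQUAL):
    for source in firing_alerts:
        if source.get('severity') != 'critical' or target.get('severity') != 'warning':
            continue
        # labels missing on both sides compare equal, as in AlertManager
        if all(source.get(k) == target.get(k) for k in equal):
            return True
    return False

critical      = {'alertname': 'DiskFull', 'instance': 'host1:9100', 'severity': 'critical'}
warning_same  = {'alertname': 'DiskFull', 'instance': 'host1:9100', 'severity': 'warning'}
warning_other = {'alertname': 'DiskFull', 'instance': 'host2:9100', 'severity': 'warning'}

print(is_inhibited(warning_same, [critical]))   # True  -- same instance, suppressed
print(is_inhibited(warning_other, [critical]))  # False -- different instance
```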
Start AlertManager:
./alertmanager
Edit prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093
Restart Prometheus Server.
Visit the AlertManager UI (port 9093 by default).
PromQL
Prometheus' built-in query language, used to compute over, aggregate, and summarize the data stored in Prometheus.
Range vectors
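An instant vector selector such as node_cpu_seconds_total returns the latest sample of each matching series, while adding a duration, node_cpu_seconds_total[1m], returns a range vector: all samples within the last minute. Functions like rate() take a range vector and compute the per-second increase of a counter. A rough sketch of what rate() does (simplified: real rate() also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) within the range.
    Per-second increase between the first and last sample (no reset handling)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter sampled every 15s (cf. scrape_interval) over a 1m window:
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0), (60, 220.0)]
print(simple_rate(window))  # 2.0 -- the counter grows by 2 units per second
```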
Grafana
A visualization tool for monitoring metrics.
Official site: https://grafana.com/
Download page: https://grafana.com/grafana/download
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.3.linux-amd64.tar.gz
tar -zxvf grafana-enterprise-8.5.3.linux-amd64.tar.gz
cd grafana-8.5.3/bin
./grafana-server
Add Prometheus as a data source.
Official dashboards: https://grafana.com/dashboards
Cluster
192.168.10.129 : Prometheus Server , node_exporter
192.168.10.130 : node_exporter
192.168.10.131 : node_exporter
1. Building on the setup above, add host-monitoring nodes (every server to be monitored needs its own node_exporter).
Install and start node_exporter on the 192.168.10.130 and 192.168.10.131 VMs.
2. Edit prometheus.yml on the Prometheus Server and restart:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.yaml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      # static targets: tell the server where to find each exporter
      - targets: ["192.168.10.129:9100"]
      - targets: ["192.168.10.130:9100"]
      - targets: ["192.168.10.131:9100"]
Grafana dashboard IDs: 8919, 1860
HTTP API
HTTP API docs: https://prometheus.io/docs/prometheus/latest/querying/api/
Instant queries
The query API evaluates a PromQL expression at a single point in time.
GET /api/v1/query
URL query parameters:
query=<string>: the PromQL expression.
time=<rfc3339 | unix_timestamp>: the timestamp at which to evaluate the expression. Optional; defaults to the current server time.
timeout=<duration>: evaluation timeout. Optional; defaults to the value of the -query.timeout flag.
Example: query CPU info.
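Querying a CPU metric through this API boils down to URL-encoding the PromQL expression into the `query` parameter. A sketch of building such a request URL (the metric selector here is just an example):

```python
# Build an instant-query URL; the PromQL expression must be URL-encoded.
from urllib.parse import urlencode

base = "http://192.168.10.129:9090/api/v1/query"
params = {"query": 'node_cpu_seconds_total{mode="idle"}'}
url = base + "?" + urlencode(params)
print(url)
# http://192.168.10.129:9090/api/v1/query?query=node_cpu_seconds_total%7Bmode%3D%22idle%22%7D
```

Fetching this URL (e.g. with curl) returns a JSON body whose `data.result` holds the matching series and their current values.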
Range queries
GET /api/v1/query_range
URL query parameters:
query=<string>: the PromQL expression.
start=<rfc3339 | unix_timestamp>: start timestamp.
end=<rfc3339 | unix_timestamp>: end timestamp.
step=<duration | float>: query resolution step; the expression is evaluated once every step seconds within the range.
timeout=<duration>: evaluation timeout. Optional; defaults to the value of the -query.timeout flag.
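Because query_range evaluates the expression once per step between start and end, each series in the result contains roughly (end - start) / step + 1 points, which is why a small step over a wide window can get expensive. A quick sketch of that arithmetic:

```python
# Number of evaluation timestamps a range query produces per series:
# one at start, then one every `step` seconds up to and including end.
def range_points(start, end, step):
    return int((end - start) // step) + 1

# A 1-hour window at a 15s step:
print(range_points(0, 3600, 15))  # 241
```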