I. Installing Prometheus
1. Download and extract
mkdir -p /data;cd /data
wget https://github.com/prometheus/prometheus/releases/download/v2.25.1/prometheus-2.25.1.linux-amd64.tar.gz
tar xf prometheus-2.25.1.linux-amd64.tar.gz
mv prometheus-2.25.1.linux-amd64 prometheus
2. Edit the configuration file
# Back up the configuration file
cp /data/prometheus/prometheus.yml{,.bak}
cat /data/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.rules"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Added for node_exporter
  - job_name: 'linux'
    static_configs:
      - targets: ['127.0.0.1:9100','192.168.10.12:9100']
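Because YAML is indentation-sensitive, it is worth validating the file before starting Prometheus. The sketch below uses promtool, which ships in the same release tarball, and its `check config` subcommand. The `PROM_DIR` variable and the `check_config` helper are illustrative names, not part of Prometheus; the path assumes the tarball was unpacked to /data/prometheus as in step 1.

```shell
#!/bin/sh
# Assumption: Prometheus was unpacked to /data/prometheus as in step 1.
PROM_DIR="${PROM_DIR:-/data/prometheus}"

# Hypothetical helper: run promtool against prometheus.yml.
# Returns non-zero if promtool is missing or the config fails validation.
check_config() {
    if [ ! -x "$PROM_DIR/promtool" ]; then
        echo "promtool not found in $PROM_DIR" >&2
        return 1
    fi
    # Change into the install directory so prometheus.yml is found by name.
    (cd "$PROM_DIR" && ./promtool check config prometheus.yml)
}

if check_config; then
    echo "config OK"
else
    echo "config check failed"
fi
```

If the check fails, promtool reports the problem on the terminal; restoring prometheus.yml.bak is a quick way back to a known state.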
3. Write alerting rules
Rule reference: https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware
1) Example 1: instance-down rule
cd prometheus;mkdir rules
cat rules/first.rules
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
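Rule files can be checked on their own before Prometheus loads them. This is a minimal sketch using promtool's `check rules` subcommand; `PROM_DIR` and `check_rules` are illustrative names, and the layout (rule files under /data/prometheus/rules) is assumed from the commands in this section.

```shell
#!/bin/sh
# Assumption: install directory /data/prometheus, rule files under rules/.
PROM_DIR="${PROM_DIR:-/data/prometheus}"

# Hypothetical helper: validate every *.rules file, stopping at the first failure.
check_rules() {
    if [ ! -x "$PROM_DIR/promtool" ]; then
        echo "promtool not found in $PROM_DIR" >&2
        return 1
    fi
    for f in "$PROM_DIR"/rules/*.rules; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ ! -e "$f" ]; then
            echo "no rule files in $PROM_DIR/rules" >&2
            return 1
        fi
        "$PROM_DIR/promtool" check rules "$f" || return 1
    done
}

if check_rules; then
    echo "rules OK"
else
    echo "rule check failed"
fi
```

A template expression split across lines, such as a `{` separated from the rest of `{{ $labels.instance }}`, is the kind of mistake this check is meant to surface early.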
2) Example template
groups:
  - name: Instance alert rules
    rules:
      - alert: Instance liveness alert
        expr: up == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      # Memory alert
      - alert: Memory usage alert
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Server: memory usage exceeds 80%! (current value: {{ $value }}%)"
      # Disk alert
      - alert: Disk usage alert
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Server: disk device: usage exceeds 80%! (mount