Building an Open-Source Monitoring Stack with Prometheus + Pushgateway + VictoriaMetrics + Grafana + Consul
Overall monitoring architecture diagram
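This article draws on the open-source Prometheus ecosystem and hands-on experience at internet companies to show beginners how to build a complete monitoring system from scratch. It takes a practical approach and covers several dimensions, including basic monitoring, business monitoring, process monitoring, and custom metric monitoring.
The alert center in the architecture is an in-house middleware, built mainly to cover what Alertmanager could not do for us: noise reduction, alert escalation, and routing alerts to the responsible person by business line. (Alerts can also be pushed directly through Alertmanager.)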
1. Prometheus installation and configuration
1.1 Installing Prometheus
Official download page: https://prometheus.io/download/
Create the data directory and the service account
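# Create the local data directory for Prometheus (-p also creates /home/data if it is missing)
mkdir -p /home/data/prometheus_data
# Create the group and user that the Prometheus process runs as
groupadd prometheus
useradd -g prometheus -d /home/prometheus prometheus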
Download and extract the release package
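# Change to the software installation directory
cd /usr/local
# Download the release package (v2.14.0 here; pick the latest stable version from the download page)
wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
# Extract the package
tar -xvf prometheus-2.14.0.linux-amd64.tar.gz
# Rename the extracted directory
mv prometheus-2.14.0.linux-amd64 prometheus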
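Before extracting a downloaded tarball it can be worth checking its integrity. The official download page lists a SHA-256 checksum for every release file. The helper below is only a sketch: the function name verify_sha256 is made up for illustration, and the expected checksum has to be copied from the download page by hand.

```shell
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 when the SHA-256 digest of FILE equals EXPECTED_HEX, 1 otherwise.
# (The helper name is illustrative; it is not part of any Prometheus tooling.)
verify_sha256() {
    local file="$1" expected="$2" actual
    [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
    # sha256sum prints "<digest>  <filename>"; keep only the digest
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Usage (paste the checksum shown on the download page):
# verify_sha256 prometheus-2.14.0.linux-amd64.tar.gz "<sha256 from the download page>"
```

Running the check before tar -xvf means a truncated or tampered download is caught before anything lands in /usr/local.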
Standardize the directory layout
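# Change to the prometheus directory
cd /usr/local/prometheus
# Create the config and binary directories
mkdir -p {cfg,bin}
# Move the binaries into bin
mv prometheus promtool bin/
# Move the main config file into cfg
mv prometheus.yml cfg/
# Give the prometheus user ownership of the install directory and the data directory
chown -R prometheus:prometheus /usr/local/prometheus
chown -R prometheus:prometheus /home/data/prometheus_data
# Add the binaries to PATH
cat >> /etc/profile <<'EOF'
export PATH=/usr/local/prometheus/bin:$PATH:$HOME/bin
EOF
source /etc/profile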
Create the systemd service file
# Generate the unit file
cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Restart=always
ExecReload=/bin/kill -HUP $MAINPID
# --storage.tsdb.path sets the local time-series storage directory; --storage.tsdb.retention.time=60d keeps data for 60 days
# To reload the config file through the web API, also add the --web.enable-lifecycle flag
ExecStart=/usr/local/prometheus/bin/prometheus --storage.tsdb.retention.time=60d --config.file=/usr/local/prometheus/cfg/prometheus.yml --storage.tsdb.path=/home/data/prometheus_data
[Install]
WantedBy=multi-user.target
EOF
Start the service with systemctl
# Reload the systemd unit files
systemctl daemon-reload
# Enable start on boot
systemctl enable prometheus
# Start Prometheus
systemctl start prometheus
# Check the Prometheus service status
systemctl status prometheus
# Follow the detailed logs of the Prometheus service
journalctl -u prometheus -f
Once the setup is complete, the status of each monitoring agent can be checked on the http://prometheusIP:9090/targets page.
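Besides the targets page, Prometheus exposes the management endpoints /-/healthy, /-/ready and /-/reload, which are handy in scripts. The sketch below assumes the server listens on 127.0.0.1:9090; the function names are made up for illustration. The HTTP reload only works when --web.enable-lifecycle has been added to ExecStart; without it, systemctl reload prometheus sends the SIGHUP defined in ExecReload instead.

```shell
#!/usr/bin/env bash
# Base URL of the Prometheus server; 127.0.0.1:9090 is an assumption for this sketch.
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# prom_endpoint NAME -> prints the URL of a management endpoint (healthy | ready | reload).
prom_endpoint() {
    case "$1" in
        healthy|ready|reload) echo "${PROM_URL}/-/$1" ;;
        *) echo "unknown endpoint: $1" >&2; return 1 ;;
    esac
}

# prom_is_ready -> exit status 0 once Prometheus answers HTTP 200 on /-/ready.
prom_is_ready() {
    local code
    code="$(curl -s -o /dev/null -w '%{http_code}' "$(prom_endpoint ready)")"
    [ "$code" = "200" ]
}

# prom_reload -> asks Prometheus to re-read its config file over HTTP.
# Requires the server to be started with --web.enable-lifecycle.
prom_reload() {
    curl -s -X POST "$(prom_endpoint reload)"
}

# Usage:
# prom_is_ready && echo "prometheus is ready"
```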
1.2 Prometheus configuration files in detail
1.2.1 prometheus.yml in detail
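This section walks through the common ways of configuring Prometheus targets: static configuration (static_configs), dynamic file-based configuration (file_sd_configs), and registration through Consul (consul_sd_configs).
It also shows how to configure multiple remote_write endpoints for VictoriaMetrics remote storage, alertmanagers for alerting, rule_files for alerting rules, and so on.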
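In the file_sd_configs mode, Prometheus reads its target list from a separate file and picks up changes to that file without a restart. The configuration in this section refers to such a file under the name node_job.yml but does not show its contents. The sketch below writes a minimal one; the addresses, the port and the env label are placeholders, and the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# write_node_job FILE -> writes a sample file_sd target list to FILE.
# Addresses, port and the "env" label are placeholders for illustration.
write_node_job() {
    cat > "$1" <<'EOF'
- targets:
    - '192.168.1.10:9100'
    - '192.168.1.11:9100'
  labels:
    env: 'test'
EOF
}

# Usage (a relative path in prometheus.yml is resolved against the config file's directory):
# write_node_job /usr/local/prometheus/cfg/node_job.yml
```

Each list entry is one target group: every address under targets receives the labels of that group, so nodes with different labels go into separate entries.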
# my global config
global:
  # Scrape interval: pull metrics every 15 seconds
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        # Address of the Alertmanager that handles alerts triggered by the monitoring rules
        - targets: ["127.0.0.1:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # File holding the alerting rule groups; see section 2.3 for the detailed configuration
  - "alert_rules.yml"
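# remote write to VictoriaMetrics
remote_write:
  # Remote storage write address; several Prometheus servers can write to it, and Grafana reads from the remote storage
  - url: http://127.0.0.1:8428/api/v1/write
    remote_timeout: 30s
    queue_config:
      capacity: 500000
      max_shards: 50
      max_samples_per_send: 20000
      batch_send_deadline: 5s
  # To write to several remote storage addresses at once, add one entry per url
  # (this entry repeats the first address; replace it with the address of the second storage)
  - url: http://127.0.0.1:8428/api/v1/write
    remote_timeout: 30s
    queue_config:
      capacity: 500000
      max_shards: 50
      max_samples_per_send: 20000
      batch_send_deadline: 5s
      # max_retries was removed from queue_config in Prometheus 2.11, so v2.14.0 rejects it; kept here only as a comment
      # max_retries: 3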
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # job_name is a user-defined name for a group of similar targets
  # static_configs: static target list, suitable for quick tests
  # (9090 is the port of Prometheus itself; node_exporter listens on 9100 by default)
  - job_name: 'node_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['127.0.0.1:9090']
        labels:
          instance: 127.0.0.1
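  # file_sd_configs: load targets dynamically from a file, suitable for managing nodes through a web UI
  - job_name: 'node_monitor'
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /node
    file_sd_configs:
      - files:
          - node_job.yml
    # Optional: rewrite labels with a regular expression (here the host part of __address__ is copied into the host label)
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):.*'
        replacement: '$1'
        target_label: host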
  # consul_sd_configs: dynamic discovery through Consul, suitable for scraping metrics reported by Java/PHP/Go business services
  - job_name: 'java_metric'
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /actuator/prometheus
    consul_sd_configs:
      - server: 'consul.cn:80'
        services: []
        # Consul authentication token; it only needs read permission on Consul nodes and services
        token: 'ea298607-8e39-686e-7d05-d9068fe7f984'
        tags: ['java-cls']
  # Pushgateway: used for monitoring and for custom business metrics
  - job_name: 'push_metric'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['pushgatewayIP:9091']
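The push_metric job only scrapes Pushgateway; something still has to push samples into it. Pushgateway accepts the Prometheus text format over HTTP on /metrics/job/<job>/<label>/<value>. The sketch below is a minimal illustration, assuming a Pushgateway on 127.0.0.1:9091; the function names, the job name and the metric name are all made up.

```shell
#!/usr/bin/env bash
# Pushgateway address; host and port are placeholders for this sketch.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://127.0.0.1:9091}"

# push_url JOB INSTANCE -> prints the push endpoint for one job/instance grouping key.
push_url() {
    echo "${PUSHGATEWAY_URL}/metrics/job/$1/instance/$2"
}

# metric_line NAME VALUE -> prints one sample in the Prometheus text exposition format.
metric_line() {
    printf '%s %s\n' "$1" "$2"
}

# push_metric JOB INSTANCE NAME VALUE -> sends a single sample to Pushgateway.
push_metric() {
    metric_line "$3" "$4" | curl -s --data-binary @- "$(push_url "$1" "$2")"
}

# Usage (job and metric names are hypothetical):
# push_metric backup_job host01 backup_last_success_timestamp_seconds "$(date +%s)"
```

The Pushgateway documentation recommends setting honor_labels: true on the scrape job so that the job and instance labels of pushed samples are not overwritten by those of the Pushgateway target itself.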