Prometheus: dynamically adding monitoring and alerting targets via Consul service discovery
Deployed with Docker for a quick and convenient setup.
Introduction:
What is Prometheus?
A very popular open-source monitoring and alerting system.
No rambling, let's get straight to it.
I. Deploy the Prometheus service
1. Create the mount directory and the Prometheus configuration file.
mkdir /data/prometheus -p
cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
2. Pull the image, mount the directory, publish the port, and run the container in the background with Docker (using the latest version).
docker pull prom/prometheus &&
docker run -d -p 9090:9090 --name=prometheus -v /data/prometheus/:/etc/prometheus/ prom/prometheus
# The directory is mounted so that the extra files needed later for alerting (rule files) are easy to add.
# After startup, access Prometheus via IP + port 9090: http://127.0.0.1:9090
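A quick sanity check, assuming the container name and mount path used above: validate the mounted config with promtool (shipped inside the prom/prometheus image) and hit the built-in health endpoint.

# Validate the configuration file inside the running container
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
# Liveness check; should report that Prometheus is healthy
curl http://127.0.0.1:9090/-/healthy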
II. Deploy the Consul service
This walkthrough uses a single-node deployment; cluster mode is recommended for production.
1. Run the Consul service directly with Docker
docker run -d --name consul -p 8500:8500 consul:1.14.5
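To confirm Consul is up, query its standard HTTP API (adjust the address to your host):

# Should return the address of the current leader
curl http://127.0.0.1:8500/v1/status/leader
# List registered services; only "consul" itself exists at this point
curl http://127.0.0.1:8500/v1/catalog/services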
2. Deploy ConsulManager
ConsulManager is a web plugin for Consul that makes managing Consul services easier; it is more capable than Consul's built-in UI.
ConsulManager is deployed with docker-compose; docker-compose installation instructions: docker-compose
Write the docker-compose file for ConsulManager:
# mkdir /data/consulManager/tensuns
# cat docker-compose.yml
version: '3.6'
services:
  flask-consul:
    image: swr.cn-south-1.myhuaweicloud.com/starsl.cn/flask-consul:latest
    container_name: flask-consul
    hostname: flask-consul
    restart: always
    volumes:
      - /usr/share/zoneinfo/PRC:/etc/localtime
    environment:
      consul_token: 25f54a-a2c9-4b33-a913-53bf45ccf  # fill in the UUID generated earlier (generate with: uuidgen)
      consul_url: http://192.168.46.130:8500/v1      # set to the address of your Consul server
      admin_passwd: 11111111                         # admin login password for the ConsulManager UI
      log_level: INFO
    networks:
      - TenSunS
  nginx-consul:
    image: swr.cn-south-1.myhuaweicloud.com/starsl.cn/nginx-consul:latest
    container_name: nginx-consul
    hostname: nginx-consul
    restart: always
    ports:
      - "1026:1026"
    volumes:
      - /usr/share/zoneinfo/PRC:/etc/localtime
    depends_on:
      - flask-consul
    networks:
      - TenSunS
networks:
  TenSunS:
    name: TenSunS
    driver: bridge
    ipam:
      driver: default
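The consul_token above is only a placeholder; as the inline comment notes, a token can be generated with uuidgen and pasted into docker-compose.yml:

# Generate a UUID to use as consul_token
uuidgen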
3. Start ConsulManager
docker-compose pull && docker-compose up -d
Access the web UI via IP + port 1026: http://127.0.0.1:1026
4. Modify the Prometheus configuration file (append at the end, under scrape_configs)
# vim prometheus.yml
  - job_name: 'consul'
    consul_sd_configs:
      - server: '192.168.46.130:8500'  # address of the Consul server
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
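For the change to take effect, reload or restart Prometheus. A minimal sketch; the HTTP reload endpoint only works if Prometheus was started with --web.enable-lifecycle, otherwise just restart the container:

# Option 1: HTTP reload (requires --web.enable-lifecycle)
curl -X POST http://127.0.0.1:9090/-/reload
# Option 2: restart the container
docker restart prometheus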
At this point, Prometheus can already monitor the services registered in Consul.
III. Start the exporter component (image pulled via Docker).
1. Start node_exporter
docker pull prom/node-exporter
docker run -d -p 9100:9100 --name=node prom/node-exporter
After startup, the metrics can be viewed at IP + port 9100.
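A quick check that the exporter is actually serving data (run on the node itself):

# Expect node_* metric families in the output
curl -s http://127.0.0.1:9100/metrics | head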
2. Register the node with Consul via the API.
Run on the node to be added:
curl -X PUT http://192.168.46.130:8500/v1/agent/service/register -d '{
  "id": "<IP of the current node>",
  "name": "<name of the current node>",
  "address": "<IP of the current node>",
  "port": 9100,
  "tags": ["exporter"],
  "meta": {"job": "node_exporter", "instance": "Prometheus server"},
  "checks": [{"http": "http://<IP of the current node>:9100/metrics", "interval": "5s"}]
}'
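To confirm the registration took effect, list the services known to the agent; a node can later be removed with the matching deregister endpoint (both are standard Consul APIs):

# The new service should appear in the JSON output
curl http://192.168.46.130:8500/v1/agent/services
# Remove it again using the "id" from the registration payload
curl -X PUT http://192.168.46.130:8500/v1/agent/service/deregister/<service-id>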
At this point, Prometheus can also monitor the node that was just registered in Consul.
IV. Deploy Grafana
docker run -d --name=grafana -p 3000:3000 grafana/grafana
1. After deploying Grafana, add a data source in the UI:
- the Prometheus data source
(any other data source works as well)
2. Import a dashboard: on the official site, find a dashboard that matches the metrics you collect, enter its ID, and import it.
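Adding the data source can also be scripted through Grafana's HTTP API; a minimal sketch, assuming the default admin:admin credentials and Prometheus reachable at 192.168.46.130:9090 (adjust both):

curl -s -u admin:admin -X POST http://127.0.0.1:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://192.168.46.130:9090", "access": "proxy"}'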
V. Deploy Alertmanager
Prometheus alerting is split into two parts. Alerting rules in the Prometheus server send alerts to Alertmanager. Alertmanager then manages those alerts, including silencing, inhibition, aggregation, and sending notifications via email, on-call notification systems, and chat platforms.
Alertmanager is a standalone component developed by the Prometheus community to handle the alerts generated by Prometheus. Its main job is to manage and route alert notifications: making sure alerts are delivered reliably to the right receivers, with deduplication and aggregation along the way.
1. Start Alertmanager
docker run -d --name=alertmanager -p 9093:9093 prom/alertmanager
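A quick liveness check against Alertmanager's standard endpoints:

# Should report Alertmanager as healthy
curl http://127.0.0.1:9093/-/healthy
# List current alerts (an empty JSON array at this point)
curl http://127.0.0.1:9093/api/v2/alerts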
2. On the Prometheus server, modify the configuration file
2.1 Configure Prometheus to add Alertmanager.
# cat /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.16.0.236:9093  # add the address and port of the Alertmanager service
2.2 Configure the Prometheus alerting rules
rule_files:
  - /etc/prometheus/node.yml
  - "first_rules.yml"
The rule file: cat first_rules.yml
# Once the rules are configured, they can be seen in both the Prometheus and Grafana UIs.
groups:
  - name: server-resource-monitoring
    rules:
      - alert: HighMemoryUsage
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: memory usage is too high, please handle it promptly!"
          description: "{{ $labels.instance }} memory usage exceeds 80%; current usage is {{ $value }}%."
      - alert: InstanceDown
        expr: up == 0
        for: 1s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down, please handle it promptly!"
          description: "{{ $labels.instance }} is unreachable (up == 0); current value is {{ $value }}."
      - alert: HighCpuLoad
        expr: 100 - (avg by (instance,job) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: CPU usage is too high, please handle it promptly!"
          description: "{{ $labels.instance }} CPU usage exceeds 90%; current usage is {{ $value }}%."
3. Configure Alertmanager
Alertmanager configuration (alertmanager.yml); note that the route must name a receiver:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  receiver: 'email-alert'
receivers:
  - name: 'email-alert'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
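The prom/alertmanager container reads its config from /etc/alertmanager/alertmanager.yml, so the file has to be mounted in. A sketch, assuming the config was saved as /data/alertmanager/alertmanager.yml (a path chosen for this example); amtool, which ships with Alertmanager, can validate it first:

# Validate the configuration
docker run --rm -v /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  --entrypoint amtool prom/alertmanager check-config /etc/alertmanager/alertmanager.yml
# Re-create the container with the config mounted
docker rm -f alertmanager
docker run -d --name=alertmanager -p 9093:9093 \
  -v /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager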
# The final Prometheus configuration file:
cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.16.0.236:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /etc/prometheus/node.yml
  - "first_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.
  - job_name: master-cop-leads-prod-workerA2
    static_configs:
      - targets: ['10.0.6.143:9090']
  - job_name: prod-environment
    static_configs:
      - targets: ['10.0.6.68:9100']
      - targets: ['10.0.6.69:9100']
      - targets: ['10.0.6.70:9100']
      - targets: ['10.0.6.71:9100']
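To verify the whole pipeline end to end, one simple option is to trigger the InstanceDown rule defined above: stop the exporter, watch the alert fire on the Prometheus /alerts page, and see it arrive in Alertmanager (container names as used earlier):

# Stop the exporter; once the "for" duration has elapsed, the InstanceDown alert fires
docker stop node
# The alert should show up in Alertmanager
curl http://127.0.0.1:9093/api/v2/alerts
# Bring the exporter back
docker start node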