Prometheus: dynamically adding monitoring and alerting targets via Consul service discovery
Deployed with Docker for a quick and convenient setup.
Introduction:
What is Prometheus?
A very popular open-source monitoring and alerting system.
No rambling, let's get straight to it.
I. Deploy the Prometheus service
1. Create the mount directory and the Prometheus configuration file.
mkdir /data/prometheus -p
cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
2. Pull the image, mount the directory, publish the port, and run the container in the background with Docker (using the latest version).
docker pull prom/prometheus &&
docker run -d -p 9090:9090 --name=prometheus -v /data/prometheus/:/etc/prometheus/ prom/prometheus
# The directory is mounted so that the extra files needed later for alerting (rule files) are easy to add.
# After startup, access Prometheus via IP + port 9090: http://127.0.0.1:9090
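A quick sanity check, assuming the container name and mount path used above: validate the mounted config with promtool (shipped inside the prom/prometheus image) and hit the built-in health endpoint.

# Validate the configuration file inside the running container
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
# Liveness check; should report that Prometheus is healthy
curl http://127.0.0.1:9090/-/healthy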
II. Deploy the Consul service
This walkthrough uses a single-node deployment; cluster mode is recommended for production.
1. Run the Consul service directly with Docker
docker run -d --name consul -p 8500:8500 consul:1.14.5
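To confirm Consul is up, query its standard HTTP API (adjust the address to your host):

# Should return the address of the current leader
curl http://127.0.0.1:8500/v1/status/leader
# List registered services; only "consul" itself exists at this point
curl http://127.0.0.1:8500/v1/catalog/services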
2. Deploy ConsulManager
ConsulManager is a web plugin for Consul that makes managing Consul services easier; it is more capable than Consul's built-in UI.
ConsulManager is deployed with docker-compose; docker-compose installation instructions: docker-compose
Write the docker-compose file for ConsulManager:
# mkdir /data/consulManager/tensuns
# cat docker-compose.yml
version: '3.6'
services:
  flask-consul:
    image: swr.cn-south-1.myhuaweicloud.com/starsl.cn/flask-consul:latest
    container_name: flask-consul
    hostname: flask-consul
    restart: always
    volumes:
      - /usr/share/zoneinfo/PRC:/etc/localtime
    environment:
      consul_token: 25f54a-a2c9-4b33-a913-53bf45ccf  # fill in the UUID generated earlier (generate with: uuidgen)
      consul_url: http://192.168.46.130:8500/v1      # set to the address of your Consul server
      admin_passwd: 11111111                         # admin login password for the ConsulManager UI
      log_level: INFO
    networks:
      - TenSunS
  nginx-consul:
    image: swr.cn-south-1.myhuaweicloud.com/starsl.cn/nginx-consul:latest
    container_name: nginx-consul
    hostname: nginx-consul
    restart: always
    ports:
      - "1026:1026"
    volumes:
      - /usr/share/zoneinfo/PRC:/etc/localtime
    depends_on:
      - flask-consul
    networks:
      - TenSunS
networks:
  TenSunS:
    name: TenSunS
    driver: bridge
    ipam:
      driver: default
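The consul_token above is only a placeholder; as the inline comment notes, a token can be generated with uuidgen and pasted into docker-compose.yml:

# Generate a UUID to use as consul_token
uuidgen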
3. Start ConsulManager
docker-compose pull && docker-compose up -d
Access the web UI via IP + port 1026: http://127.0.0.1:1026
4. Modify the Prometheus configuration file (append at the end, under scrape_configs)
# vim prometheus.yml
  - job_name: 'consul'
    consul_sd_configs:
      - server: '192.168.46.130:8500'  # address of the Consul server
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
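For the change to take effect, reload or restart Prometheus. A minimal sketch; the HTTP reload endpoint only works if Prometheus was started with --web.enable-lifecycle, otherwise just restart the container:

# Option 1: HTTP reload (requires --web.enable-lifecycle)
curl -X POST http://127.0.0.1:9090/-/reload
# Option 2: restart the container
docker restart prometheus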
At this point, Prometheus can already monitor the services registered in Consul.
III. Start the exporter component (image pulled via Docker).
1. Start node_exporter
docker pull prom/node-exporter
docker run -d -p 9100:9100 --name=node prom/node-exporter
After startup, the metrics can be viewed at IP + port 9100.
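A quick check that the exporter is actually serving data (run on the node itself):

# Expect node_* metric families in the output
curl -s http://127.0.0.1:9100/metrics | head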
2. Register the node with Consul via the API.
Run on the node to be added:
curl -X PUT http://192.168.46.130:8500/v1/agent/service/register -d '{
  "id": "<IP of the current node>",
  "name": "<name of the current node>",
  "address": "<IP of the current node>",
  "port": 9100,
  "tags": ["exporter"],
  "meta": {"job": "node_exporter", "instance": "Prometheus server"},
  "checks": [{"http": "http://<IP of the current node>:9100/metrics", "interval": "5s"}]
}'
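To confirm the registration took effect, list the services known to the agent; a node can later be removed with the matching deregister endpoint (both are standard Consul APIs):

# The new service should appear in the JSON output
curl http://192.168.46.130:8500/v1/agent/services
# Remove it again using the "id" from the registration payload
curl -X PUT http://192.168.46.130:8500/v1/agent/service/deregister/<service-id>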
At this point, Prometheus can also monitor the node that was just registered in Consul.
IV. Deploy Grafana
docker run -d --name=grafana -p 3000:3000 grafana/grafana
1. After deploying Grafana, add a data source in the UI:
- the Prometheus data source
(any other data source works as well)
2. Import a dashboard: on the official site, find a dashboard that matches the metrics you collect, enter its ID, and import it.
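Adding the data source can also be scripted through Grafana's HTTP API; a minimal sketch, assuming the default admin:admin credentials and Prometheus reachable at 192.168.46.130:9090 (adjust both):

curl -s -u admin:admin -X POST http://127.0.0.1:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://192.168.46.130:9090", "access": "proxy"}'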
V. Deploy Alertmanager
Prometheus alerting is split into two parts. Alerting rules in the Prometheus server send alerts to Alertmanager. Alertmanager then manages those alerts, including silencing, inhibition, aggregation, and sending notifications via email, on-call notification systems, and chat platforms.
Alertmanager is a standalone component developed by the Prometheus community to handle the alerts generated by Prometheus. Its main job is to manage and route alert notifications: making sure alerts are delivered reliably to the right receivers, with deduplication and aggregation along the way.
1. Start Alertmanager
docker run -d --name=alertmanager -p 9093:9093 prom/alertmanager
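A quick liveness check against Alertmanager's standard endpoints:

# Should report Alertmanager as healthy
curl http://127.0.0.1:9093/-/healthy
# List current alerts (an empty JSON array at this point)
curl http://127.0.0.1:9093/api/v2/alerts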
2. On the Prometheus server, modify the configuration file
2.1 Configure Prometheus to add Alertmanager.
# cat /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.16.0.236:9093  # add the address and port of the Alertmanager service
2.2 Configure the Prometheus alerting rules
rule_files:
  - /etc/prometheus/node.yml
  - "first_rules.yml"
The rule file: cat first_rules.yml
# Once the rules are configured, they can be seen in both the Prometheus and Grafana UIs.
groups:
  - name: server-resource-monitoring
    rules:
      - alert: HighMemoryUsage
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: memory usage is too high, please handle it promptly!"
          description: "{{ $labels.instance }} memory usage exceeds 80%; current usage is {{ $value }}%."
      - alert: InstanceDown
        expr: up == 0
        for: 1s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down, please handle it promptly!"
          description: "{{ $labels.instance }} is unreachable (up == 0); current value is {{ $value }}."
      - alert: HighCpuLoad
        expr: 100 - (avg by (instance,job) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: CPU usage is too high, please handle it promptly!"
          description: "{{ $labels.instance }} CPU usage exceeds 90%; current usage is {{ $value }}%."
3. Configure Alertmanager
Alertmanager configuration (alertmanager.yml); note that the route must name a receiver:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  receiver: 'email-alert'
receivers:
  - name: 'email-alert'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
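The prom/alertmanager container reads its config from /etc/alertmanager/alertmanager.yml, so the file has to be mounted in. A sketch, assuming the config was saved as /data/alertmanager/alertmanager.yml (a path chosen for this example); amtool, which ships with Alertmanager, can validate it first:

# Validate the configuration
docker run --rm -v /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  --entrypoint amtool prom/alertmanager check-config /etc/alertmanager/alertmanager.yml
# Re-create the container with the config mounted
docker rm -f alertmanager
docker run -d --name=alertmanager -p 9093:9093 \
  -v /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager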
# The final Prometheus configuration file:
cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.16.0.236:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /etc/prometheus/node.yml
  - "first_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.
  - job_name: master-cop-leads-prod-workerA2
    static_configs:
      - targets: ['10.0.6.143:9090']
  - job_name: prod-environment
    static_configs:
      - targets: ['10.0.6.68:9100']
      - targets: ['10.0.6.69:9100']
      - targets: ['10.0.6.70:9100']
      - targets: ['10.0.6.71:9100']
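To verify the whole pipeline end to end, one simple option is to trigger the InstanceDown rule defined above: stop the exporter, watch the alert fire on the Prometheus /alerts page, and see it arrive in Alertmanager (container names as used earlier):

# Stop the exporter; once the "for" duration has elapsed, the InstanceDown alert fires
docker stop node
# The alert should show up in Alertmanager
curl http://127.0.0.1:9093/api/v2/alerts
# Bring the exporter back
docker start node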