Prometheus + Grafana + Alertmanager Installation and Configuration Guide

1. Component overview:

  • prometheus:
    • The server-side daemon. It pulls the metrics collected by the exporters on each target and stores them in its built-in TSDB (time-series database). Data is kept for 15 days by default; the retention period can be changed with a startup flag.
    • Prometheus officially provides many kinds of exporters.
    • Listens on port 9090 by default and serves a web query UI as well as an HTTP query API.
    • Alerting rules (configured manually) are evaluated by the server; firing alerts are sent to Alertmanager, which dispatches notifications through the receivers configured there.
  • grafana:
    • The graphing UI built into Prometheus is quite bare, so Grafana is used for visualization instead.
    • Grafana is dedicated visualization software that supports many data sources; Prometheus is only one of them.
    • It ships with its own alerting, and alert rules can be configured directly on a panel. However, alert queries do not support template variables (the dashboard variables used for convenient display), so every metric and every host has to be configured separately, which makes this feature of limited practical use.
    • Default listening port: 3000
  • node_exporter:
    • The agent side, one of the many official Prometheus exporters, installed on every monitored host.
    • Collects a comprehensive set of host and OS metrics (CPU, memory, disk, network, filesystem, and so on) and publishes them over HTTP for the Prometheus server to scrape.
    • Default listening port: 9100
  • cadvisor:
    • The agent side for Docker hosts; collects runtime metrics for the host and its Docker containers.
    • Runs as a container itself and listens on port 8080 (the published port can be changed, and mapping it to a different port is recommended).
    • Provides a basic graph UI as well as a metrics endpoint for scraping.
  • alertmanager:
    • Receives the alerts sent by Prometheus, groups them by configurable rules, and controls how notifications go out (frequency, inhibition, routing to different backends, silences, and so on).
    • Supports multiple notification backends, e.g. email, webhook, WeChat Work (企业微信), and some commercial alerting platforms.
    • Default listening port: 9093
  • blackbox_exporter:
    • One of the official Prometheus exporters; probes targets over HTTP, DNS, TCP and ICMP.
    • Can run on the Prometheus server node or on a separate node.
    • Default listening port: 9115
  • nginx:
    • Prometheus and Alertmanager have no built-in authentication, so nginx is placed in front of them to provide basic auth and HTTPS.
    • Each component has to expose its own port, so in the docker-compose deployment all containers are placed on the same network and nginx is the single published entry point, which keeps port management in one place.
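The fronting described above can be sketched as a minimal nginx server block. The server_name, certificate paths, htpasswd location, and the upstream container name `prometheus` are illustrative assumptions, not values from this deployment:

```nginx
# Hypothetical values throughout -- adjust to your environment.
server {
    listen 443 ssl;
    server_name monitor.example.com;

    ssl_certificate     /etc/nginx/certs/monitor.crt;
    ssl_certificate_key /etc/nginx/certs/monitor.key;

    location / {
        auth_basic           "Prometheus";
        auth_basic_user_file /etc/nginx/htpasswd;  # created with: htpasswd -c /etc/nginx/htpasswd USER
        proxy_pass http://prometheus:9090;         # container name on the shared docker-compose network
        proxy_set_header Host $host;
    }
}
```

With this in place, only nginx's port needs to be published; the component ports stay internal to the compose network.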

2.prometheus-server

2.1 Official links:

  • Documentation: https://prometheus.io/docs/introduction/overview/

  • GitHub project (downloads): https://github.com/prometheus/prometheus

2.2 Installing the Prometheus server

2.2.1 Download and install on Linux (CentOS 7)
  • Create a system user to run the Prometheus server process, with /var/lib/prometheus as its home directory and data directory:

    ~]# useradd -r -m -d /var/lib/prometheus prometheus
    
  • Download and install the Prometheus server, using 2.14.0 as an example:

    wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
    tar -xf prometheus-2.14.0.linux-amd64.tar.gz  -C /usr/local/
    cd /usr/local
    ln -sv prometheus-2.14.0.linux-amd64 prometheus
    
  • Create a unit file so that systemd manages Prometheus:

    vim /usr/lib/systemd/system/prometheus.service
    [Unit]
    Description=The Prometheus 2 monitoring system and time series database.
    Documentation=https://prometheus.io
    After=network.target
    [Service]
    EnvironmentFile=-/etc/sysconfig/prometheus
    User=prometheus
    ExecStart=/usr/local/prometheus/prometheus \
    --storage.tsdb.path=/var/lib/prometheus \
    --config.file=/usr/local/prometheus/prometheus.yml \
    --web.listen-address=0.0.0.0:9090 \
    --web.external-url= $PROM_EXTRA_ARGS
    Restart=on-failure
    StartLimitInterval=1
    RestartSec=3
    [Install]
    WantedBy=multi-user.target
    
  • Other runtime flags: ./prometheus --help

  • Start the service:

    systemctl daemon-reload
    systemctl start prometheus.service
    
  • Remember to open the firewall port:

    iptables -I INPUT -p tcp --dport 9090 -s NETWORK/MASK -j ACCEPT
    
  • Access from a browser:

    http://IP:PORT
    
2.2.2 Installing with Docker:
  • image: prom/prometheus
  • Start command (host paths for -v must be absolute, hence $(pwd)):

$ docker run --name prometheus -d \
    -v $(pwd)/prometheus:/etc/prometheus/ \
    -v $(pwd)/db/:/prometheus \
    -p 9090:9090 \
    prom/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --web.listen-address="0.0.0.0:9090" \
    --storage.tsdb.path=/prometheus \
    --web.console.libraries=/usr/share/prometheus/console_libraries \
    --web.console.templates=/usr/share/prometheus/consoles \
    --storage.tsdb.retention=30d

2.3 Prometheus configuration:

2.3.1 Startup flags
  • Commonly used flags:

    --config.file=/etc/prometheus/prometheus.yml # main configuration file
    --web.listen-address="0.0.0.0:9090" 	# listen address and port
    --storage.tsdb.path=/prometheus 	# database directory
    --web.console.libraries=/usr/share/prometheus/console_libraries
    --web.console.templates=/usr/share/prometheus/consoles 	# console libraries and templates
    --storage.tsdb.retention=60d  # data retention period, default 15d
    
2.3.2 Configuration file:
  • The main Prometheus configuration file is prometheus.yml

    It consists mainly of the global, rule_files, scrape_configs, alerting, remote_write and remote_read sections:

    • global: global configuration section;

    • rule_files: paths of the alerting/recording rule files;

    • scrape_configs:
      The collection of scrape configurations, defining the sets of monitored targets and the parameters describing how to scrape their metrics.
      Usually, each scrape configuration corresponds to a single job,
      and its targets can either be listed statically (static_configs) or discovered automatically through one of the service-discovery mechanisms Prometheus supports;

      - job_name: 'nodes'
        static_configs: 	# static targets; each host:port listed under targets is scraped at /metrics
        - targets: ['localhost:9100']
        - targets: ['172.20.94.1:9100']
      
      - job_name: 'docker_host'
        file_sd_configs: 	# file-based service discovery; the host:port entries defined in the files (yml or json format) become scrape targets
          - files:
              - ./sd_files/docker_host.yml
            refresh_interval: 30s
      
    • alerting:

      The set of Alertmanager instances Prometheus sends alerts to, plus the parameters describing how to interact with them;
      each Alertmanager can be listed statically (static_configs) or discovered automatically through one of the supported service-discovery mechanisms;

    • remote_write:
      Configures the "remote write" mechanism; define this section when Prometheus should also store its data in an external system (e.g. InfluxDB).
      Prometheus then sends sample data over HTTP to the adapter specified by the URL;

    • remote_read:
      Configures the "remote read" mechanism; Prometheus hands incoming queries to the adapter specified by the URL.
      The adapter translates them into queries against the remote storage service and converts the responses back into a format Prometheus can use;

  • Monitoring and alerting rule files: *.yml

    • Define the alerting rules
    • They only take effect when listed under rule_files: in the main configuration file:
    rule_files:
      - "test_rules.yml"  # path of the file containing the alerting rules
    
  • Service-discovery target files: both yaml and json formats are supported

    • These also have to be referenced from the main configuration file:
     file_sd_configs:
       - files:
           - ./sd_files/http.yml
         refresh_interval: 30s
    
2.3.3 A minimal configuration example:
  • prometheus.yml example

global:
  scrape_interval:     15s 	 # scrape metrics every 15 seconds
  evaluation_interval: 15s   # evaluate alerting rules every 15 seconds
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]   # where alerts are pushed; normally the Alertmanager address
rule_files:
  - "test_rules.yml"  # path of the file containing the alerting rules
scrape_configs: 
  - job_name: 'node'   # user-defined job name
    static_configs: 	# static targets; scrape the listed ip:port directly
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs: 		# file-based service discovery
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json" # json and yml formats are both supported
        refresh_interval: 30s  # re-read the files every 30s; changes take effect without a manual reload
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
  • Alerting rule file example:

    [root@host40 monitor-bak]# cat prometheus/rules/docker_monitor.yml 
    groups:
      - name: "container monitor"
        rules:
          - alert: "Container down: env1"
            expr: time() - container_last_seen{name="env1"} > 60
            for: 30s
            labels:
              severity: critical
            annotations:
              summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
    
  • File-based service-discovery target files: *.yml

    [root@host40 monitor]# cat prometheus/sd_files/virtual_lan.yml 
    - targets: ['10.10.11.179:9100']
    - targets: ['10.10.11.178:9100']
    
    [root@host40 monitor]# cat prometheus/sd_files/tcp.yml 
    - targets: ['10.10.11.178:8001']
      labels:
        server_name: http_download
    - targets: ['10.10.11.178:3307']
      labels:
        server_name: xiaojing_db
    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web
    
2.3.4 Other configuration
  • Many Prometheus settings are coupled to other components, so they are covered together with the corresponding components below.

2.4 prometheus web-gui

  • Web UI address: http://ip:port, e.g. http://10.10.11.40:9090/
  • alerts: view the alerting rules
  • graph: query the collected metrics, with simple built-in graphing
  • status: runtime configuration and information about the monitored targets
  • Explore the web UI for the details

3.node_exporter

3.1 Overview

  • node_exporter is installed on each monitored node; it collects host metrics and serves them over HTTP for Prometheus to scrape.

  • Project and documentation: https://github.com/prometheus/node_exporter

  • Prometheus officially provides many exporters of different kinds; the list is at: https://prometheus.io/docs/instrumenting/exporters/

3.2 Installing node_exporter

3.2.1 Download and install on Linux (CentOS 7):
  • Download and unpack:

    wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
    tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
    cd /usr/local
    ln -sv node_exporter-0.18.1.linux-amd64/ node_exporter
    
  • Create the user:

    useradd -r -m -d /var/lib/prometheus prometheus
    
  • Create the unit file:

    vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
    Documentation=https://github.com/prometheus/node_exporter
    After=network.target
    [Service]
    EnvironmentFile=-/etc/sysconfig/node_exporter
    User=prometheus
    ExecStart=/usr/local/node_exporter/node_exporter \
    	$NODE_EXPORTER_OPTS
    Restart=on-failure
    StartLimitInterval=1
    RestartSec=3
    [Install]
    WantedBy=multi-user.target	
    
  • Start the service:

    systemctl daemon-reload
    systemctl start node_exporter.service
    
  • Verify manually that metrics can be fetched:

    curl http://localhost:9100/metrics
    
  • Open the firewall:

    iptables -I INPUT -p tcp --dport 9100 -s NET/MASK -j ACCEPT
    
    
3.2.2 Installing with Docker
  • image: quay.io/prometheus/node-exporter , prom/node-exporter

  • Start command:

    docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" --name monitor-node-exporter --restart always quay.io/prometheus/node-exporter --path.rootfs=/host --web.listen-address=:9100
    
  • On some older Docker versions this fails with: Error response from daemon: linux mounts: Could not find source mount of /

    Workaround: change -v "/:/host:ro,rslave" to -v "/:/host:ro"

3.3 Configuring node_exporter

  • Enabling and disabling collectors:

    ./node_exporter --help  # list all supported collectors; enable or disable them according to your needs
    
    Collectors are toggled with --collector.<name> / --no-collector.<name>, e.g. --no-collector.cpu stops collecting CPU metrics
    
  • Textfile collector:

    The startup flag --collector.textfile.directory="DIR" enables the textfile collector.
    It picks up the metrics from every *.prom file in that directory; the metrics must be in the Prometheus exposition format.
    

    Examples (write to a temp file, then mv into place so a half-written file is never read):

    echo my_batch_job_completion_time $(date +%s) > /path/to/directory/my_batch_job.prom.$$
    mv /path/to/directory/my_batch_job.prom.$$ /path/to/directory/my_batch_job.prom			
    echo 'role{role="application_server"} 1' > /path/to/directory/role.prom.$$
    mv /path/to/directory/role.prom.$$ /path/to/directory/role.prom	
    rpc_duration_seconds{quantile="0.5"} 4773
    http_request_duration_seconds_bucket{le="0.5"} 129389
    

    In other words, if node_exporter's built-in collectors cannot produce a metric you need, a script can collect it and write it to a file, and node_exporter will expose it to Prometheus.
    This can remove the need for a pushgateway.

  • The exposition format and the query syntax are covered later.
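The write-then-mv pattern matters because node_exporter may read the directory at any moment. The same atomic-rename trick from a scripting language looks like this (directory and metric names are illustrative):

```python
import os
import tempfile

def write_prom(directory, name, lines):
    """Atomically write metrics to DIR/NAME.prom: write a temp file, then rename.
    On POSIX, rename is atomic, so a reader sees the old or new file, never a partial one."""
    final = os.path.join(directory, name + ".prom")
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".prom.tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.rename(tmp, final)
    return final

out = write_prom(tempfile.gettempdir(), "role", ['role{role="application_server"} 1'])
print(open(out).read())  # → role{role="application_server"} 1
```

The temp file is created in the same directory as the target so the rename stays on one filesystem (a cross-filesystem rename is not atomic).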

3.4 Configuring Prometheus to scrape node_exporter

  • Example: prometheus.yml

    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'nodes'
        static_configs:
        - targets: ['localhost:9100']
        - targets: ['172.20.94.1:9100']
      - job_name: 'node_real_lan'
        params: 		# optional: only scrape the listed collectors
          collect[]:
            - cpu
            - meminfo
            - diskstats
            - netdev
            - netstat
            - filefd
            - filesystem
            - xfs
        file_sd_configs:
          - files:
              - ./sd_files/real_lan.yml
            refresh_interval: 30s
    

4.cadvisor

4.1 Official links:

  • https://github.com/google/cadvisor
  • image: gcr.io/google_containers/cadvisor[:v0.36.0] # requires access to Google's registry
  • image: google/cadvisor:v0.33.0 # Docker Hub image; not as recent as the gcr.io one

4.2 docker run

sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=9080:8080 \
  --detach=true \
  --name=cadvisor \
  --privileged \
  --device=/dev/kmsg \
  google/cadvisor:v0.33.0

4.3 The web UI provides simple single-host graphs

  • http://ip:port

4.4 Configuring Prometheus to scrape it

  • Example configuration:

    - job_name: 'docker'
      static_configs:
      - targets: ['localhost:9080']
    

5.grafana

5.1 Official links

  • Grafana downloads: https://grafana.com/grafana/download
  • Grafana dashboards: https://grafana.com/grafana/dashboards/

5.2 Installing Grafana

5.2.1 Install on Linux (CentOS 7)
  • Download and install:

    wget https://dl.grafana.com/oss/release/grafana-7.2.2-1.x86_64.rpm
    sudo yum install grafana-7.2.2-1.x86_64.rpm
    
  • Prepare the service file:

    [Unit]
    Description=Grafana instance
    Documentation=http://docs.grafana.org
    Wants=network-online.target
    After=network-online.target
    After=postgresql.service mariadb.service mysqld.service
    
    [Service]
    EnvironmentFile=/etc/sysconfig/grafana-server
    User=grafana
    Group=grafana
    Type=notify
    Restart=on-failure
    WorkingDirectory=/usr/share/grafana
    RuntimeDirectory=grafana
    RuntimeDirectoryMode=0750
    ExecStart=/usr/sbin/grafana-server                                                  \
                                --config=${CONF_FILE}                                   \
                                --pidfile=${PID_FILE_DIR}/grafana-server.pid            \
                                --packaging=rpm                                         \
                                cfg:default.paths.logs=${LOG_DIR}                       \
                                cfg:default.paths.data=${DATA_DIR}                      \
                                cfg:default.paths.plugins=${PLUGINS_DIR}                \
                                cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}
    
    LimitNOFILE=10000
    TimeoutStopSec=20
    
    [Install]
    WantedBy=multi-user.target
    
  • Start Grafana:

    systemctl enable grafana-server.service
    systemctl restart grafana-server.service
    

    It listens on port 3000 by default.

  • Open the firewall:

    iptables -I INPUT -p tcp --dport 3000 -s NET/MASK -j ACCEPT
    
5.2.2 Installing with Docker
  • image: grafana/grafana

    docker run -d --name=grafana -p 3000:3000 grafana/grafana:7.2.2
    

5.3 Basic Grafana workflow

  • Web UI:

    http://ip:port
    

    On first login you are asked to set credentials;
    version 7.2 asks you to log in with the initial credentials (both admin) and then reset the password.

  • Workflow:

    • Add a data source
    • Add a dashboard and configure its panels; ready-made dashboards for common services can also be downloaded from https://grafana.com/grafana/dashboards/
    • Import the dashboard by JSON, URL, or dashboard ID
    • View the dashboard
  • Commonly used dashboard IDs:

    • node-exporter: cn/8919, en/11074
    • k8s: 13105
    • docker: 12831
    • alertmanager: 9578
    • blackbox_exporter: 9965
  • Resetting the admin password:

    Check the Grafana configuration file to find the path of grafana.db.
    	Configuration file: /etc/grafana/grafana.ini
    	[paths]
    	;data = /var/lib/grafana
    	[database]
    	# For "sqlite3" only, path relative to data_path setting
    	;path = grafana.db
    From the configuration file, the full path of grafana.db is:
    	/var/lib/grafana/grafana.db
    
    Reset the admin password with sqlite3 (this sets it back to admin):
    	sqlite3 /var/lib/grafana/grafana.db
    	sqlite> update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';
    	.exit
    
    Log in with admin / admin
    

5.4 Grafana alerting:

  • Configure an SMTP server and sender mailbox in grafana-server:

    vim /etc/grafana/grafana.ini
    [smtp]
    enabled =  true
    host = smtp.126.com:465
    user = USER@126.com
    password = PASS
    skip_verify = false
    from_address = USER@126.com
    from_name = Grafana Alert
    
  • Add a Notification Channel in the Grafana UI:

    Alerting -> Notification Channel
    before saving, you can send a test notification
    
  • Open a dashboard and add alert rules

  • As of Grafana 7.2.2, template variables cannot be used in alert queries, which makes Grafana alerting of limited practical use. For production, Alertmanager is recommended.

6.prometheus and PromQL:

6.1 PromQL overview

  • PromQL is the query language Prometheus uses against its database; it turns the metrics collected by the exporters into graphable data and into alerting rules.

  • Prometheus uses a multi-dimensional data model in which time series are identified by a metric name and key/value label pairs.

  • PromQL is a flexible query language that can leverage this dimensionality.

  • Prometheus does not depend on distributed storage; single server nodes are autonomous.

  • Multiple modes of graphing and dashboarding are supported.

6.2 Components that use PromQL:

  • prometheus server
  • client libraries for instrumenting application code
  • push gateway
  • exporters
  • alertmanager

6.3 Metrics

6.3.1 Metric types
  • gauges: a single numerical value that can go up or down, e.g.:

    • node_boot_time_seconds

      node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030

  • counters: a cumulative count that only ever increases (it resets when the process restarts).

  • histograms: observations bucketed into a distribution, from which maxima, minima, medians, percentiles and so on can be derived.

  • summaries: similar to histograms, but the quantiles are computed on the client side.

6.3.2 Labels
  • node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"}

    In the example above, instance and job are labels:

    • job: the job_name defined in prometheus.yml
    • instance: host:port
  • Labels can also be defined in the configuration files, e.g.:

    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web
    

    The added label can then be used when querying:

    metric{server_name="...",}
    

6.4 PromQL expressions

  • PromQL expressions are the basic statements Grafana uses to draw graphs and Prometheus uses to define alerting rules, so being able to read and write PromQL is essential.
6.4.1 An example first:
  • Calculating CPU usage:

    (1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100
    

    The metrics involved:

    node_cpu_seconds_total 		# total CPU time used
    node_cpu_seconds_total{mode="idle"} # idle CPU time; other mode labels: user, system, steal, softirq, irq, nice, iowait, idle
    

    The functions involved:

    increase(  [1m])			# the increase over a 1-minute window
    sum()
    sum() by (TAG) 		# TAG is a label; here instance identifies the host. Sum per host, otherwise multiple hosts collapse into a single line.
    
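What increase() and rate() compute can be mimicked with two plain counter samples; the numbers below are made up for illustration:

```python
# Two samples of a counter metric, (unix_timestamp, value); values are made up.
t0, v0 = 1_700_000_000, 41_000.0   # e.g. node_cpu_seconds_total at the window start
t1, v1 = 1_700_000_060, 41_012.0   # 60 seconds later

increase = v1 - v0           # what increase() approximates over the window
rate = increase / (t1 - t0)  # what rate() approximates: per-second increase

print(increase)  # → 12.0
print(rate)      # → 0.2

# CPU usage %: 1 - (idle increase / total increase), as in the expression above
idle_increase, total_increase = 12.0, 60.0
print((1 - idle_increase / total_increase) * 100)  # → 80.0
```

The real functions also extrapolate to the window boundaries and handle counter resets, but the core idea is this delta arithmetic.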
6.4.2 Label matchers
  • Match operators:

    =		# equal: select labels that are exactly equal to the provided string.
    !=		# not equal: select labels that are not equal to the provided string.
    =~ 		# regex match: select labels that regex-match the provided string.
    !~ 		# negative regex match: select labels that do not regex-match the provided string.	
    
  • Examples:

    node_cpu_seconds_total{mode="idle"} # mode: a label that comes with the metric.
    api_http_requests_total{method="POST", handler="/messages"}
    
    http_requests_total{environment=~"staging|testing|development",method!="GET"}
    
  • Note: a selector must specify a metric name or at least one label matcher that does not match the empty string:

    {job=~".*"} 	# Bad! 
    {job=~".+"}              # Good!
    {job=~".*",method="get"} # Good!
    
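One detail that often surprises: PromQL regex matchers are fully anchored, so `=~"staging|testing"` must match the entire label value, not a substring. Python's `re.fullmatch` mirrors this behaviour (the label values here are made up):

```python
import re

values = ["staging", "staging-eu", "production"]
pattern = "staging|testing"

# re.search is unanchored matching; re.fullmatch mirrors PromQL's anchored matchers
print([v for v in values if re.search(pattern, v)])     # → ['staging', 'staging-eu']
print([v for v in values if re.fullmatch(pattern, v)])  # → ['staging']
```

To match substrings in PromQL, write the pattern accordingly, e.g. `=~"staging.*"`.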
6.4.3 Operators
  • Time durations:

    s - seconds
    m - minutes
    h - hours
    d - days
    w - weeks
    y - years	
    
  • Operators:

    + (addition)
    - (subtraction)
    * (multiplication)
    / (division)
    % (modulo)
    ^ (power/exponentiation)
    == (equal)
    != (not-equal)
    > (greater-than)
    < (less-than)
    >= (greater-or-equal)
    <= (less-or-equal)
    and (intersection)
    or (union)
    unless (complement)	
    
  • Aggregation operators:

    sum (calculate sum over dimensions)
    min (select minimum over dimensions)
    max (select maximum over dimensions)
    avg (calculate the average over dimensions)
    stddev (calculate population standard deviation over dimensions)
    stdvar (calculate population standard variance over dimensions)
    count (count number of elements in the vector)
    count_values (count number of elements with the same value)
    bottomk (smallest k elements by sample value)
    topk (largest k elements by sample value)
    quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)		
    
6.4.4 Functions
  •   sum() by (instance) # sum, optionally grouped by label
    
  •   increase() 		# the increase over a window; for counter-type metrics
    

    Example:

    increase(node_network_receive_bytes_total[30s])  			# received traffic
    
  •   rate() 		# for counters: the per-second average rate of increase over the given window
    

    Example:

    rate(node_network_receive_bytes_total[30s])*8				# inbound bandwidth
    
  •   topk() 		# given a number k, return the k highest-valued elements
    

    Example:

    topk(5,node_cpu_seconds_total)								# the 5 largest values of node_cpu_seconds_total
    topk(5,increase(node_network_receive_bytes_total[10m]))		# top 5 receivers of traffic over the last 10m
    

    Note:

    The set of returned series can change between evaluations, producing scattered graphs.
    In the console it is evaluated once per query.
    
  •   count() 	# count, e.g. count(node_load1 > 5)
    
  •   avg() by () 	# average, grouped by the given label(s)
    
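These aggregations boil down to ordinary operations over a set of (labels, value) samples; a toy sketch with made-up node_load1 values per instance:

```python
# Made-up node_load1 samples keyed by the instance label.
samples = {"10.10.11.40:9100": 0.7, "10.10.11.41:9100": 5.2, "10.10.11.42:9100": 6.1}

# topk(2, node_load1): the 2 highest-valued series
top2 = sorted(samples.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top2)  # → [('10.10.11.42:9100', 6.1), ('10.10.11.41:9100', 5.2)]

# count(node_load1 > 5): the comparison filters first, then count() counts what remains
print(sum(1 for v in samples.values() if v > 5))  # → 2

# avg(node_load1): arithmetic mean over all series
print(sum(samples.values()) / len(samples))  # → 4.0
```

The `by (label)` clauses simply partition the sample set before applying the operation.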
6.4.5 Recording rules

Recording rules precompute expressions over the scraped data into new time series.

Example:

  • The expression to record:

    avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
    
  • Record it in prometheus.rules.yml:

    groups:
    - name: example
      rules:
      - record: job_service:rpc_durations_seconds_count:avg_rate5m
        expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
    
  • Reference the file in prometheus.yml:

    rule_files:
      - 'prometheus.rules.yml'
    

This effectively creates a new metric; it is computed rather than scraped.

7.alertmanager

7.1 The prometheus + alertmanager alerting pipeline

  • Documentation: https://prometheus.io/docs/alerting/latest/configuration/

  • The main steps for setting up alerting and notifications are:

    • Install and configure Alertmanager
    • Configure Prometheus to talk to Alertmanager
    • Create alerting rules in Prometheus

7.2 Installing Alertmanager

7.2.1 Install on Linux (CentOS 7):
  • Downloads: https://github.com/prometheus/alertmanager

  • Download and install:

    wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
    tar -xf alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local
    cd /usr/local && ln -sv alertmanager-0.20.0.linux-amd64/ alertmanager && cd alertmanager
    

    Start it:

    nohup ./alertmanager --config.file="alertmanager.yml" --storage.path="data/" --web.listen-address=":9093" &
    
7.2.2 Installing with Docker:
  • image: prom/alertmanager

  • docker run (host paths for -v must be absolute, hence $(pwd)):

    docker run -d --name monitor-alertmanager \
      -v $(pwd)/alertmanager/db/:/alertmanager \
      -v $(pwd)/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
      -v $(pwd)/alertmanager/templates/:/etc/alertmanager/templates \
      -p 9093:9093 --restart always --privileged \
      prom/alertmanager \
      --config.file="/etc/alertmanager/alertmanager.yml" \
      --storage.path="/alertmanager" \
      --web.listen-address=":9093"
    

7.3 Core concepts

  • Grouping:

    • Grouping categorizes alerts of a similar nature into a single notification. This is especially useful when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.

      Example:

      A network partition occurs while dozens or hundreds of service instances are running in a cluster, and half of them can no longer reach the database.
      The alerting rules in Prometheus fire an alert for each instance that cannot talk to the database, so hundreds of alerts reach Alertmanager.
      As a user you only want a single page, while still being able to see exactly which instances are affected. Alertmanager can therefore be configured
      to group alerts by cluster and alert name, so that it sends one compact notification.
      
    • How alerts are grouped, when grouped notifications are sent, and who receives them is configured via the routing tree in the configuration file.

  • Inhibition:

    • Inhibition suppresses notifications for certain alerts when certain other alerts are already firing.

    • Example:

      An alert is firing that says an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning that cluster, preventing hundreds or thousands of notifications that are unrelated to the actual problem.
      
    • Inhibition is configured in Alertmanager's configuration file.

  • Silences:

    • A silence is a straightforward way to mute alerts for a given time. Silences are configured from matchers, just like the routing tree: an incoming alert is checked against the equality and regular-expression matchers of every active silence.
    • If it matches, no notification is sent for that alert.

7.4 Configuring Prometheus to talk to Alertmanager

  • alerting:

    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["127.0.0.1:9093"]
    
  • rule_files:

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "rules/*.yml"
    
  • scrape_configs:

    scrape_configs:			
      - job_name: 'alertmanager'
        static_configs:
          - targets: ['127.0.0.1:9093']
    

7.5 Writing Prometheus alerting rules

  • Example:

    [root@xiang-03 /usr/local/prometheus]#cat rules/node.yml 
    groups:
      - name: "system info"
        rules:
          - alert: "host down"  		# alert name (alertname)
            expr: up == 0 			   # alert expression; the alert fires when it holds
            for: 1m 				   # pending period, to allow the condition to recover before firing
            labels: 	# unlike metric labels, these are attached for Alertmanager routing: which route a label matches decides which receiver is used
              severity: critical 		# key: value pairs are entirely user-defined
            annotations: 				# the notification content; $labels here refers to the metric's labels
              summary: "{{$labels.instance}}: host down"
              description: "{{$labels.instance}}: host unreachable for more than 1m"
          - alert: "high CPU usage"
            expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))by(instance)*100) > 40
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "{{$labels.instance}}: high CPU usage"
              description: "{{$labels.instance}}: CPU usage above 40%"
              value: "{{$value}}"
          - alert: "CPU usage above 90%"
            expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 90
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.instance}}: CPU usage at 90%"
              description: "{{$labels.instance}}: CPU usage above 90% for more than 5 mins"
              value: "{{$value}}"
    
  • If you use non-ASCII text (e.g. Chinese) in configuration files, make sure they are UTF-8 encoded, otherwise loading fails.

7.6 Configuring Alertmanager

  • Full documentation: https://prometheus.io/docs/alerting/latest/configuration/
  • Main configuration file: alertmanager.yml
  • Template files: *.tmpl
  • Only the settings needed here are covered; see the official documentation for the full reference.
7.6.1 alertmanager.yml
  • The main configuration file needs:

    • global: sender mailbox settings;
    • templates: the notification template files (if unset, Alertmanager's default templates are used);
    • route: the routing rules, i.e. which label matches send an alert to which receiver;
    • receivers: the notification backends: email, wechat, webhook, and so on.
  • An example first:

    vim alertmanager.yml
    global:
      smtp_smarthost: 'xxx'
      smtp_from: 'xxx'
      smtp_auth_username: 'xxx'
      smtp_auth_password: 'xxx'
      smtp_require_tls: false
    templates:
      - '/alertmanager/template/*.tmpl'
    route:
      receiver: 'default-receiver'
      group_wait: 1s #how long to wait before a group's first notification
      group_interval: 1s  #interval between notifications for the same group
      repeat_interval: 1s  #interval before re-sending a notification
      group_by: [cluster, alertname]
      routes:
      - receiver: test
        group_wait: 1s
        match_re:
          severity: test
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'xx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ .CommonAnnotations.summary }}" }
    - name: 'test'
      email_configs:
      - to: 'xxx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: "second-route match test" }	
    
    vim test.tmpl
    {{ define "xx.html" }}
    <table border="5">
        <tr><td>Alert</td>
            <td>Instance</td>
            <td>Value</td>
            <td>Started at</td>
        </tr>
        {{ range $i, $alert := .Alerts }}
            <tr><td>{{ index $alert.Labels "alertname" }}</td>
                <td>{{ index $alert.Labels "instance" }}</td>
                <td>{{ index $alert.Annotations "value" }}</td>
                <td>{{ $alert.StartsAt }}</td>
            </tr>
        {{ end }}
    </table>
    {{ end }}
    
  • Details:

    •   global:	
      
      •   resolve_timeout: 			# how long to wait without the alert firing before declaring it resolved
        
      • the other mail-related settings, as in the example above

    •   route: 	# the root route every alert enters; defines the dispatch policy
      
      •   group_by: ['LABEL_NAME','alertname', 'cluster','job','instance',...]
        

        The listed labels are used to regroup incoming alerts; for example, alerts carrying cluster=A
        and alertname=LatencyHigh will all be batched into a single group.

      •   group_wait: 30s
        

        When a new alert group is created, wait at least group_wait before sending the initial notification; this leaves enough time for more alerts of the same group to arrive so they can fire together.

      •   group_interval: 5m
        

        After the first notification for a group, wait group_interval before sending a notification about new alerts in that group.

      •   repeat_interval: 5m 
        

        Once a notification has been sent successfully, wait repeat_interval before re-sending it.

      •   match: 
          	label_name: NAME
        

        Equality match; alerts that match are sent to the route's receiver.

      •   match_re:
          	label_name: <regex>, ... 
        

        Regular-expression match; alerts that match are sent to the route's receiver.

      •   receiver: receiver_name
        

        Sends the alerts matched by match / match_re to a notification backend (email, webhook, pagerduty, wechat, ...).
        There must be a default receiver, otherwise: err="root route must specify a default receiver"

      •   routes:
          	- <route> ...
        

        Further child routes.

    •   templates:
        	[ - <filepath> ... ]
      

      Notification templates, e.g. the email body template.

    •   receivers:
        	- <receiver> ...	# a list
      
      •   - name: receiver_name 	# the name referenced by route.receiver 
        
      •    email_configs: 		# email notifications
        
        •   - to: <tmpl_string>
              send_resolved: <boolean> | default = false  	# whether to send a recovery notification once the alert resolves
          

          Configures the recipient mailbox; a per-receiver sender mailbox can also be set. See the official documentation:
          https://prometheus.io/docs/alerting/latest/configuration/#email_config

      •   - name: ...
            wechat_configs:
            - send_resolved: <boolean> | default = false
              api_secret: <secret> | default = global.wechat_api_secret
              api_url: <string> | default = global.wechat_api_url
              corp_id: <string> | default = global.wechat_api_corp_id
              message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}'
              agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}'
              to_user: <string> | default = '{{ template "wechat.default.to_user" . }}'
              to_party: <string> | default = '{{ template "wechat.default.to_party" . }}'
              to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}'				
          	# Notes
          		to_user: WeChat Work user ID
          		to_party: ID of the department/group to notify
          		
          		corp_id: the unique ID of the WeChat Work account; see "My Company" in the console
          		agent_id: the application ID; open the custom application under "App Management"
          		api_secret: the application's secret
          		
          		Register a company at https://work.weixin.qq.com
          		WeChat API documentation: https://work.weixin.qq.com/api/doc#90002/90151/90854	
        

        WeChat Work (企业微信) notification settings.

    •   inhibit_rules:
          - source_match:
              severity: 'critical'
            target_match:
              severity: 'warning'
            equal: ['alertname', 'dev', 'instance']
      

      Inhibition settings.

7.6.2 Configuring WeChat Work notifications
  • Register a company: https://work.weixin.qq.com

    An unverified company (up to 200 members) is enough; bind a personal WeChat account to use the web console.

    WeChat API documentation: https://work.weixin.qq.com/api/doc#90002/90151/90854

  • After registering, bind your personal WeChat and scan the QR code to enter the management console.

  • Create a new application for sending alerts; the process is straightforward.

  • Parameters to note:

    • corp_id: the unique ID of the WeChat Work account; see "My Company" in the console
    • agent_id: the application ID; open the custom application under "App Management"
    • api_secret: the application's secret
    • to_user: WeChat Work user ID
    • to_party: ID of the group to notify; in Contacts, click the dots next to the group name to see it
  • Example configuration:

    receivers:
      - name: 'default'
        email_configs:
          - to: 'XXX'
            send_resolved: true
        
        wechat_configs:
          - send_resolved: true
            corp_id: 'XXX'
            api_secret: 'XXX'
            agent_id: 1000002
            to_user: XXX
            to_party: 2
            message: '{{ template "wechat.html" . }}'
    
  • template:

    • Alertmanager's default WeChat template is ugly and verbose, so a custom template is used; the default email template is acceptable.

    • Example 1:

      cat wechat.tmpl
      {{ define "wechat.html" }}
      {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
      [ALERT]
      Instance: {{ .Labels.instance }}
      Summary: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Value: {{ .Annotations.value }}
      Started: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{ end -}}
      {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
      [RESOLVED]
      Instance: {{ .Labels.instance }}
      Summary: {{ .Annotations.summary }}
      Started: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{ end -}}
      {{- end }}
      
7.6.3 Time handling in notification templates:
  • Source: https://blog.csdn.net/knight_zhou/article/details/106323719

  • Custom notification templates format times in UTC by default. Note that Go's reference layout must be "2006-01-02 15:04:05":

    Original: {{ .StartsAt.Format "2006-01-02 15:04:05" }}	
    Shifted to UTC+8: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
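The constant 28800e9 is simply 8 hours expressed in nanoseconds (Go's time.Time.Add takes a nanosecond-based Duration), i.e. the UTC→UTC+8 offset. A quick sanity check in Python:

```python
from datetime import datetime, timedelta, timezone

offset_ns = 28800e9            # the constant used in the template
print(offset_ns / 1e9 / 3600)  # → 8.0 (hours)

# Example alert StartsAt in UTC; adding the offset mimics (.StartsAt.Add 28800e9)
utc = datetime(2020, 5, 25, 6, 30, tzinfo=timezone.utc)
cst = utc + timedelta(seconds=offset_ns / 1e9)
print(cst.strftime("%Y-%m-%d %H:%M:%S"))  # → 2020-05-25 14:30:00
```

The shift only changes the displayed wall-clock time; the underlying instant is unchanged.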

7.7 Commonly used Prometheus alerting rules:

  • An excellent collection of ready-made rules: https://awesome-prometheus-alerts.grep.to/rules
7.7.1 Container metrics: alerting when a container goes down

vim rules/docker_monitor.yml
groups:
  - name: "container monitor" 	
    rules:
      - alert: "Container down: env1"
        expr: time() - container_last_seen{name="env1"} > 60
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{$labels.instance}} name={{$labels.name}}"	

Note:

This metric only detects a container going down; it cannot reliably detect recovery. Even if the container never comes back up successfully, a resolved notification will be sent after a while.
7.7.2 针对磁盘CPU,IO ,磁盘使用、内存使用、TCP、网络流量配置监控告警:
groups:
- name: 主机状态-监控告警
  rules:
  - alert: 主机状态
    expr: up == 0
    for: 1m
    labels:
      status: 非常严重
    annotations:
      summary: "{{$labels.instance}}:服务器宕机"
      description: "{{$labels.instance}}:服务器延时超过5分钟"
  
  - alert: CPU使用情况
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100)
    for: 1m
    labels:
      status: 一般告警
    annotations:
      summary: "{{$labels.mountpoint}} CPU使用率过高!"
      description: "{{$labels.mountpoint }} CPU使用大于60%(目前使用:{{$value}}%)"
- alert: cpu使用率过高告警  # 查询提供了hostname label
  expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 10
nodename) (node_uname_info) > 85
  for: 5m
  labels:
    region: 成都
  annotations:
    summary: "{{$labels.instance}}({{$labels.nodename}})CPU使用率过高!"
    description: '服务器{{$labels.instance}}({{$labels.nodename}})CPU使用率超过85%(
$value}}%)'		  
- alert: 系统负载过高
  expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
  for: 3m
  labels:
    region: 成都
  annotations:
    summary: "{{$labels.instance}}({{$labels.nodename}})系统负载过高!"
    description: '{{$labels.instance}}({{$labels.nodename}})当前负载超标率 {{printf "%.2f" $value}}'
  
- alert: 内存不足告警
  expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
  for: 3m
  labels:
    region: 成都
  annotations:
    summary: "{{$labels.instance}}({{$labels.nodename}})内存使用率过高!"
    description: '服务器{{$labels.instance}}({{$labels.nodename}})内存使用率超过80%({{$value}}%)'
  - alert: IO操作耗时
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 40
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流入磁盘IO使用率过高!"
      description: "{{$labels.mountpoint }} 流入磁盘IO大于60%(目前使用:{{$value}})"
  - alert: 网络流入
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高!"
      description: "{{$labels.mountpoint }}流入网络带宽持续2分钟高于100M. RX带宽使用率{{$value}}"
  - alert: 网络流出
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
      description: "{{$labels.mountpoint }}流出网络带宽持续2分钟高于100M. TX带宽使用率{{$value}}"
  - alert: network in
    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高"
      description: "{{$labels.mountpoint }}流入网络异常,高于100M"
      value: "{{ $value }}" 		          
  - alert: network out
    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 发送网络带宽过高"
      description: "{{$labels.mountpoint }}发送网络异常,高于100M"
      value: "{{ $value }}" 
  
  - alert: TCP会话
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
      description: "{{$labels.mountpoint }} TCP_ESTABLISHED大于1000(目前使用:{{$value}})"
  - alert: 磁盘容量
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
      description: "{{$labels.mountpoint }} 磁盘分区使用大于80%(目前使用:{{$value}}%)"	
- alert: 硬盘空间不足告警  # the join adds hostname and other labels
  expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) )* on(instance) group_left(nodename) (node_uname_info)> 80
  for: 3m
  labels:
    region: 成都
  annotations:
    summary: "{{$labels.instance}}({{$labels.nodename}})硬盘使用率过高!"
    description: '服务器{{$labels.instance}}({{$labels.nodename}})硬盘使用率超过80%({{$value}}%)'
  - alert: volume full in four days  # predict the disk will be full in 4 days
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 5m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 预计主机可用磁盘空间4天后将写满"
      description: "{{$labels.mountpoint }}" 
      value: "{{ $value }}%"  
  - alert: disk write rate
    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "disk write rate (instance {{ $labels.instance }})"
      description: "磁盘写入速率大于50MB/s"
      value: "{{ $value }}%" 
  - alert: disk read latency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk read latency (instance {{ $labels.instance }})"
      description: "磁盘读取延迟大于100毫秒"
      value: "{{ $value }}%" 
  - alert: disk write latency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk write latency (instance {{ $labels.instance }})"
      description: "磁盘写入延迟大于100毫秒"
      value: "{{ $value }}%" 
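
The predict_linear() used in the "volume full in four days" rule fits a least-squares line over the sampled range and extrapolates it. A rough Python equivalent of that arithmetic, with invented sample data:

```python
def predict_linear(samples, horizon):
    """Least-squares fit over (timestamp, value) pairs, extrapolated `horizon`
    seconds past the last sample -- approximating PromQL's predict_linear()."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) \
          / sum((t - mean_t) ** 2 for t, _ in samples)
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon)

gib = 1024 ** 3
# free space shrinking by 1 GiB per hour over a 2 h window, 48 GiB left:
samples = [(t * 3600, (50 - t) * gib) for t in range(3)]
print(predict_linear(samples, 4 * 24 * 3600) < 0)  # → True: the rule would fire
```
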

7.8 alertmanager 管理api

  •   GET /-/healthy	
      GET /-/ready     
      POST /-/reload
    
  • Example:

    curl -u monitor:fosafer.com 127.0.0.1:9093/-/healthy
    	OK
    curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
    [root@host40 monitor]# curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
    failed to reload config: yaml: unmarshal errors:
      line 26: field receiver already set in type config.plain
    

    Equivalent to: docker exec -it monitor-alertmanager kill -1 1 (SIGHUP), except that the /-/reload endpoint reports an error when the reload fails

8.blackbox_exporter

8.1 blackbox_exporter overview

  • blackbox_exporter is one of the exporters provided officially by Prometheus; it collects probe data over http, dns, tcp and icmp.

  • 官方地址: https://github.com/prometheus/blackbox_exporter

  • Use cases:

    HTTP probes
      define request header fields
      check HTTP status / HTTP response headers / HTTP body content
    TCP probes
      listening state of service ports
      application-layer protocol definition and matching
    ICMP probes
      host liveness checks
    POST probes
      API connectivity
    SSL certificate expiry time
    

8.2 Installing blackbox_exporter

8.2.1 Binary install on linux (centos7)
  • Download and unpack

    wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/
    blackbox_exporter-0.18.0.linux-amd64.tar.gz
    tar -xf blackbox_exporter-0.18.0.linux-amd64.tar.gz -C /usr/local/
    cd /usr/local 
    ln -sv blackbox_exporter-0.18.0.linux-amd64 blackbox_exporter
    cd blackbox_exporter
    ./blackbox_exporter --version
    
  • Add a systemd service unit:

    vim /lib/systemd/system/blackbox_exporter.service
    [Unit]
    Description=blackbox_exporter
    After=network.target
    [Service]
    User=root
    Type=simple
    ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target
    
    systemctl daemon-reload
    systemctl enable blackbox_exporter
    systemctl start blackbox_exporter
    
  • Default listening port: 9115

8.2.2 Docker install of blackbox_exporter
  • image: prom/blackbox-exporter:master

  • docker run:

    docker run --rm -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
    

8.3 Configuring blackbox_exporter

  • Default configuration file:

  • The default blackbox_exporter configuration already covers most needs. If you need custom modules later, see the official documentation and the example config in the project:

    • https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
    cat blackbox.yml
    modules:
      http_2xx:
        prober: http
      http_post_2xx:
        prober: http
        http:
          method: POST
      tcp_connect:
        prober: tcp
      pop3s_banner:
        prober: tcp
        tcp:
          query_response:
          - expect: "^+OK"
          tls: true
          tls_config:
            insecure_skip_verify: false
      ssh_banner:
        prober: tcp
        tcp:
          query_response:
          - expect: "^SSH-2.0-"
      irc_banner:
        prober: tcp
        tcp:
          query_response:
          - send: "NICK prober"
          - send: "USER prober prober prober :prober"
          - expect: "PING :([^ ]+)"
            send: "PONG ${1}"
          - expect: "^:[^ ]+ 001"
      icmp:
        prober: icmp
    

8.4 Configuring prometheus: <relabel_config>

  • Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

  • Reference article: https://blog.csdn.net/qq_25934401/article/details/84325356

  • Meta labels:

    labels:
     	job:  job_name
     	__address__: <host>:<port>
     	instance: defaults to __address__ unless relabeled
     	__scheme__: scheme
     	__metrics_path__: path
     	__param_<name>: the first <name> parameter appearing in the URL
    
8.4.1 http/https probe example:
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - http://prometheus.io    # Target to probe with http.
        - https://prometheus.io   # Target to probe with https.
        - http://example.com:8080 # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.	
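
The three relabel rules are easier to follow as explicit steps. A hypothetical Python trace of what Prometheus does to the target's label set (illustration only, not real Prometheus code):

```python
def relabel(labels):
    """Trace the three relabel_configs entries of the blackbox job above."""
    labels = dict(labels)
    labels["__param_target"] = labels["__address__"]  # probed URL becomes ?target=
    labels["instance"] = labels["__param_target"]     # keep a readable instance label
    labels["__address__"] = "127.0.0.1:9115"          # actually scrape the exporter
    return labels

out = relabel({"__address__": "https://prometheus.io"})
print(out["instance"], out["__address__"])
# The resulting scrape is effectively:
#   http://127.0.0.1:9115/probe?target=https://prometheus.io&module=http_2xx
```
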
8.4.2 tcp probe example:
- job_name: "blackbox_telnet_port"
  scrape_interval: 5s
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
      - targets: [ '1x3.x1.xx.xx4:443' ]
        labels:
          group: 'xxxidc机房ip监控'
      - targets: ['10.xx.xx.xxx:443']
        labels:
          group: 'Process status of nginx(main) server'
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.xxx.xx.xx:9115		
8.4.3 icmp probe example:
- job_name: 'blackbox00_ping_idc_ip'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [icmp]  #ping
  static_configs:
      - targets: [ '1x.xx.xx.xx' ]
        labels:
          group: 'xxnginx 虚拟IP'
  relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
        replacement: ${1}
      - source_labels: [__param_target]
        regex: (.*)
        target_label: ping
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 1x.xxx.xx.xx:9115
8.4.4 POST probe example:
- job_name: 'blackbox_http_2xx_post'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [http_post_2xx]
  static_configs:
      - targets:
        - https://xx.xxx.com/api/xx/xx/fund/query.action
        labels:
          group: 'Interface monitoring'
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 1x.xx.xx.xx:9115  # The blackbox exporter's real hostname:port.
8.4.5 SSL certificate expiry monitoring:
cat << 'EOF' > prometheus.yml
rule_files:
  - ssl_expiry.rules
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - example.com  # Target to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.
EOF
cat << 'EOF' > ssl_expiry.rules 
groups: 
  - name: ssl_expiry.rules 
    rules: 
      - alert: SSLCertExpiringSoon 
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30 
        for: 10m
EOF
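
The rule compares the certificate's earliest expiry timestamp against "now plus 30 days". The same arithmetic in Python, with invented timestamps:

```python
import time

def cert_expiring_soon(expiry_ts, now=None, days=30):
    """Same test as: probe_ssl_earliest_cert_expiry - time() < 86400 * 30."""
    now = time.time() if now is None else now
    return expiry_ts - now < days * 86400

now = 1_700_000_000                                   # arbitrary "current" time
print(cert_expiring_soon(now + 10 * 86400, now=now))  # → True  (10 days left)
print(cert_expiring_soon(now + 90 * 86400, now=now))  # → False (90 days left)
```
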
8.5 Inspecting a probe:
  • For example (note the quotes; an unquoted & would background the command):

    curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
    
8.6 Adding alerts:
  • Whether an icmp, tcp, http or post probe succeeds can be observed via the probe_success metric

    probe_success == 0 ## connectivity broken
    probe_success == 1 ## connectivity OK
    
  • The alert rule simply checks whether this metric equals 0; if it does, the alert fires

    [sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules 
    groups:
    - name: blackbox_network_stats
      rules:
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }}  is down"
          description: "This requires immediate action!"
    

9. docker-compose deployment of the complete Prometheus monitoring stack

  • Deployment host: 10.10.11.40

9.1 Deployed components:

  •   prometheus
      alertmanager
      grafana
      nginx
      node_exporter
      cadvisor
      blackbox_exporter
    
  • image:

    prom/prometheus
    prom/alertmanager
    quay.io/prometheus/node-exporter, prom/node-exporter
    gcr.io/google_containers/cadvisor[:v0.36.0]     # requires access to Google registries
    google/cadvisor:v0.33.0    # Docker Hub image; older than the gcr.io version
    grafana/grafana
    nginx
    
  • After pulling the images, re-tag them and push them to the local harbor registry

        image: 10.10.11.40:80/base/nginx:1.19.3
        image: 10.10.11.40:80/base/prometheus:2.22.0
        image: 10.10.11.40:80/base/grafana:7.2.2
        image: 10.10.11.40:80/base/alertmanager:0.21.0
        image: 10.10.11.40:80/base/node_exporter:1.0.1
        image: 10.10.11.40:80/base/cadvisor:v0.33.0
        image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
    
    

9.2 Deployment layout

  • Directory structure at a glance

    mkdir /home/deploy/monitor
    cd /home/deploy/monitor
    
    [root@host40 monitor]# tree
    .
    ├── alertmanager
    │   ├── alertmanager.yml
    │   ├── db
    │   │   ├── nflog
    │   │   └── silences
    │   └── templates
    │       └── wechat.tmpl
    ├── blackbox_exporter
    │   └── blackbox.yml
    ├── docker-compose.yml
    ├── grafana
    │   └── db
    │       ├── grafana.db
    │       ├── plugins
    		...
    ├── nginx
    │   ├── auth
    │   └── nginx.conf
    ├── node-exporter
    │   └── textfiles
    ├── node_exporter_install_docker.sh
    ├── prometheus
    │   ├── db
    │   ├── prometheus.yml
    │   ├── rules
    │   │   ├── docker_monitor.yml
    │   │   ├── system_monitor.yml
    │   │   └── tcp_monitor.yml
    │   └── sd_files
    │       ├── docker_host.yml
    │       ├── http.yml
    │       ├── icmp.yml
    │       ├── real_lan.yml
    │       ├── real_wan.yml
    │       ├── sedFDm5Rw
    │       ├── tcp.yml
    │       ├── virtual_lan.yml
    │       └── virtual_wan.yml
    └── sd_controler.sh
    
  • File required for nginx basic auth:

    [root@host40 monitor-bak]# ls nginx/auth/ -a
    .  ..  .htpasswd
    
  • Permissions of some mounted paths:

    The db directories of prometheus, grafana and alertmanager need 777 permissions.
    Individually mounted config files alertmanager.yml, prometheus.yml and nginx.conf need 666 permissions.
    For better security, put the config files in a dedicated directory, mount that directory, and point the container's command at the config file.
    

9.3 docker-compose.yml

[root@host40 monitor-bak]# cat docker-compose.yml 
version: "3"
services:
  
  nginx:
    image: 10.10.11.40:80/base/nginx:1.19.3
    hostname: nginx
    container_name: monitor-nginx
    restart: always
    privileged: false
    ports:
      - 3001:3000
      - 9090:9090
      - 9093:9093
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./nginx/auth:/etc/nginx/basic_auth
    networks:
      monitor:
        aliases:
          - nginx
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m
  
  prometheus:
    image: 10.10.11.40:80/base/prometheus:2.22.0
    container_name: monitor-prometheus
    hostname: prometheus
    restart: always
    privileged: true
    volumes:
      - ./prometheus/db/:/prometheus/
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - ./prometheus/sd_files/:/etc/prometheus/sd_files/
    command: 
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention=60d'
    networks:
      monitor:
        aliases:
          - prometheus
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m

  grafana:
    image: 10.10.11.40:80/base/grafana:7.2.2
    container_name: monitor-grafana
    hostname: grafana
    restart: always
    privileged: true
    volumes:
      - ./grafana/db/:/var/lib/grafana 
    networks:
      monitor:
        aliases:
          - grafana
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m

  alertmanager:
    image: 10.10.11.40:80/base/alertmanager:0.21.0
    container_name: monitor-alertmanager
    hostname: alertmanager
    restart: always
    privileged: true
    volumes:
      - ./alertmanager/db/:/alertmanager
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - ./alertmanager/templates/:/etc/alertmanager/templates
    networks:
      monitor:
        aliases:
          - alertmanager
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m

  node-exporter:
    image: 10.10.11.40:80/base/node_exporter:1.0.1
    container_name: monitor-node-exporter
    hostname: host40
    restart: always
    privileged: true
    volumes:
      - /:/host:ro,rslave
      - ./node-exporter/textfiles/:/textfiles
    network_mode: "host"
    command: 
      - '--path.rootfs=/host'
      - '--web.listen-address=:9100'
      - '--collector.textfile.directory=/textfiles' 
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m

  cadvisor:
    image: 10.10.11.40:80/base/cadvisor:v0.33.0
    container_name: monitor-cadvisor
    hostname: cadvisor
    restart: always
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - 9080:8080
    networks: 
      monitor:
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m

  blackbox_exporter:
    image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
    container_name: monitor-blackbox
    hostname: blackbox-exporter
    restart: always
    privileged: true
    volumes:
      - ./blackbox_exporter/:/etc/blackbox_exporter
    networks:
      monitor:
        aliases:
          - blackbox
    command:
      - '--config.file=/etc/blackbox_exporter/blackbox.yml'
    logging:
      driver: json-file
      options:
        max-file: '5'
        max-size: 50m

networks:
  monitor:
    ipam:
      config:
        - subnet: 192.168.17.0/24


9.4 nginx

  • Since prometheus and alertmanager have no built-in authentication, nginx sits in front to handle routing and basic auth, proxying all backend listen ports in one place for easier management.

  • Default ports of each component

    prometheus: 9090
    grafana: 3000
    alertmanager: 9093
    node_exporter: 9100
    cadvisor: 8080 (agent side)
    
  • Basic auth with the stock nginx image:

    echo monitor:`openssl passwd -crypt 123456` > .htpasswd
    
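As an aside, `openssl passwd -crypt` uses the old crypt(3) scheme, which silently truncates passwords to 8 characters. An alternative sketch: nginx's auth_basic_user_file also accepts the {SHA} scheme, which the Python standard library can generate:

```python
import base64, hashlib

def htpasswd_sha_line(user, password):
    """Build an .htpasswd line in the {SHA} scheme (also accepted by nginx's
    auth_basic_user_file), avoiding crypt(3)'s 8-character truncation."""
    digest = base64.b64encode(hashlib.sha1(password.encode()).digest()).decode()
    return f"{user}:{{SHA}}{digest}"

print(htpasswd_sha_line("monitor", "123456"))
# → monitor:{SHA}fEqNCco3Yq9h5ZUglD3CZJT4lBs=
```

Unsalted SHA-1 is still weak; apr1 (`openssl passwd -apr1`) is another scheme nginx supports.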
  • A config file mounted as a single file does not get updated inside the container (alternatively, mount the directory rather than the file):

    chmod 666 nginx.conf	
    
  • Reload the configuration inside the nginx container (named monitor-nginx in this compose file):

    docker exec -it monitor-nginx nginx -s reload
    
  • nginx.conf

    [root@host40 monitor-bak]# cat nginx/nginx.conf 
    user nginx;
    worker_processes auto;
    error_log /var/log/nginx/error.log;
    pid /run/nginx.pid;
    include /usr/share/nginx/modules/*.conf;
    events {
        worker_connections 10240;
    }
    http {
        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';
        access_log  /var/log/nginx/access.log  main;
        sendfile            on;
        tcp_nopush          on;
        tcp_nodelay         on;
        keepalive_timeout   65;
        types_hash_max_size 2048;
        include             /etc/nginx/mime.types;
        default_type        application/octet-stream;
        
            proxy_connect_timeout   500ms;
        proxy_send_timeout      1000ms;
        proxy_read_timeout      3000ms;
        proxy_buffers           64 8k;
        proxy_busy_buffers_size    128k;
        proxy_temp_file_write_size 64k;
        proxy_redirect off;
        proxy_next_upstream error invalid_header timeout http_502 http_504;
            proxy_http_version 1.1;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Real-Port $remote_port;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;    
        client_max_body_size 10m;
        client_body_buffer_size 512k;
        client_body_timeout 180;
        client_header_timeout 10;
        send_timeout 240;
        gzip on;
        gzip_min_length 1k;
        gzip_buffers 4 16k;
        gzip_comp_level 2;
        gzip_types application/javascript application/x-javascript text/css text/javascript image/jpeg image/gif image/png;
        gzip_vary off;
        gzip_disable "MSIE [1-6]\.";
     
            server {
                listen       3000;
                server_name  _;
                
                    location / {
                            proxy_pass http://grafana:3000;       
                }
            }
     
            server {
                listen       9090;
                server_name  _;
                
                    location / {
                            auth_basic "auth for monitor";
                            auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
                            proxy_pass http://prometheus:9090;    
                }
            }
     
            server {
                listen       9093;
                server_name  _;
                
                    location / {
                            auth_basic "auth for monitor";
                            auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
                            proxy_pass http://alertmanager:9093;  
                }
            }
    }
    

9.5 prometheus

  • Note: the db directory must be writable; give it 777 permissions
9.5.1 Main config file: prometheus.yml
[root@host40 monitor-bak]# cat prometheus/prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
  - job_name: 'node_real_lan'
    file_sd_configs:
      - files: 
          - ./sd_files/real_lan.yml
        refresh_interval: 30s
  
  - job_name: 'node_virtual_lan'
    file_sd_configs:
      - files:
          - ./sd_files/virtual_lan.yml
        refresh_interval: 30s
  
  - job_name: 'node_real_wan'
    file_sd_configs:
      - files:
          - ./sd_files/real_wan.yml
        refresh_interval: 30s
  
  - job_name: 'node_virtual_wan'
    file_sd_configs:
      - files:
          - ./sd_files/virtual_wan.yml
        refresh_interval: 30s
  
  - job_name: 'docker_host'
    file_sd_configs:
      - files:
          - ./sd_files/docker_host.yml
        refresh_interval: 30s
  - job_name: 'tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - ./sd_files/tcp.yml
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115       
  - job_name: 'http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - ./sd_files/http.yml
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115       
  - job_name: 'icmp'
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
      - files:
          - ./sd_files/icmp.yml
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115 
9.5.2 File-based service discovery for all targets:
  • Just write the targets to be monitored into the corresponding job's sd file. Examples:

  •   ls prometheus/sd_files/
      docker_host.yml  http.yml  icmp.yml  real_lan.yml  real_wan.yml  sedFDm5Rw  tcp.yml  virtual_lan.yml  virtual_wan.yml
    
  •   cat prometheus/sd_files/docker_host.yml
      - targets: ['10.10.11.178:9080']
      - targets: ['10.10.11.99:9080']
      - targets: ['10.10.11.40:9080']
      - targets: ['10.10.11.35:9080']
      - targets: ['10.10.11.45:9080']
      - targets: ['10.10.11.46:9080']
      - targets: ['10.10.11.48:9080']
      - targets: ['10.10.11.47:9080']
      - targets: ['10.10.11.65:9081']
      - targets: ['10.10.11.61:9080']
      - targets: ['10.10.11.66:9080']
      - targets: ['10.10.11.68:9080']
      - targets: ['10.10.11.98:9080']
      - targets: ['10.10.11.75:9080']
      - targets: ['10.10.11.97:9080']
      - targets: ['10.10.11.179:9080']
    
  •   cat prometheus/sd_files/tcp.yml
      - targets: ['10.10.11.178:8001']
        labels:
          server_name: http_download
      - targets: ['10.10.11.178:3307']
        labels:
          server_name: xiaojing_db
      - targets: ['10.10.11.178:3001']
        labels:
          server_name: test_web
    
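Since Prometheus file_sd also accepts JSON files, target files can be managed from scripts with no YAML dependency. A hypothetical stdlib-only helper (the job's `files:` glob would then need to include *.json as well):

```python
import json, os, tempfile

def add_target(path, target, labels=None):
    """Append a target group to a Prometheus file_sd JSON file."""
    groups = []
    if os.path.exists(path):
        with open(path) as f:
            groups = json.load(f)
    groups.append({"targets": [target], "labels": labels or {}})
    tmp = path + ".tmp"                 # write atomically so Prometheus never
    with open(tmp, "w") as f:           # observes a half-written file
        json.dump(groups, f, indent=2)
    os.replace(tmp, path)

path = os.path.join(tempfile.mkdtemp(), "tcp.json")
add_target(path, "10.10.11.178:8001", {"server_name": "http_download"})
add_target(path, "10.10.11.178:3307", {"server_name": "xiaojing_db"})
with open(path) as f:
    print(json.load(f))
```
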
9.5.3 Rules files:
  • docker rules:

    cat prometheus/rules/docker_monitor.yml 
    groups:
      - name: "container monitor"
        rules:
          - alert: "Container down: env1"
            expr: time() - container_last_seen{name="env1"} > 60
            for: 30s
            labels:
              severity: critical
            annotations:
              summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
    
  • tcp rules:

    cat prometheus/rules/tcp_monitor.yml 
    groups:
    - name: blackbox_network_stats
      rules:
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} ,server-name: {{ $labels.server_name }} is down"
          description: "连接不通..."
    
  • system rules: # cpu ,mem, disk, network, filesystem…

    cat prometheus/rules/system_monitor.yml 
    groups:
      - name: "system info"
        rules:
          - alert: "服务器宕机"
            expr: up == 0
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.instance}}:服务器宕机"
              description: "{{$labels.instance}}:服务器无法连接,持续时间已超过3mins"
          - alert: "系统负载过高"
            expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
            for: 3m
            labels:
              severity: warning
            annotations:
              summary: "{{$labels.instance}}:系统负载过高"
              description: "{{$labels.instance}}:系统负载过高."
              value: "{{$value}}"
          - alert: "CPU 使用率超过90%"
            expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 90
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.instance}}:CPU 使用率90%"
              description: "{{$labels.instance}}:CPU 使用率超过90%."
              value: "{{$value}}"
          - alert: "内存使用率超过80%"
            expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.instance}}:内存使用率80%"
              description: "{{$labels.instance}}:内存使用率超过80%"
              value: "{{$value}}"
          
          - alert: "IO操作耗时超过60%"
            expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 40 
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.instance}}:IO操作耗时超过60%"
              description: "{{$labels.instance}}:IO操作耗时超过60%"
              value: "{{$value}}"
              
          - alert: "磁盘分区容量超过85"
            expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) )* on(instance) group_left(nodename) (node_uname_info)> 85
            for: 3m
            labels:
              severity: longtime
            annotations:
              summary: "{{$labels.instance}}:磁盘分区容量超过85%"
              description: "{{$labels.instance}}:磁盘分区容量超过85%"
              value: "{{$value}}"
          
          - alert: "磁盘将在4天后写满"
            expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
            for: 3m
            labels:
              severity: longtime
            annotations:
              summary: "{{$labels.instance}}: 预计将有磁盘分区在4天后写满,"
              description: "{{$labels.instance}}:预计将有磁盘分区在4天后写满,"
              value: "{{$value}}"
    

9.6 alertmanager:

  • Note: the db directory must be writable:

  • Main config file:

    cat alertmanager/alertmanager.yml 
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtphz.qiye.163.com:25'
      smtp_from: 'XXX@fosafer.com'
      smtp_auth_username: 'XXX@fosafer.com'
      smtp_auth_password: 'XXX'
      smtp_hello: 'qiye.163.com'
      smtp_require_tls: true
    route:
      group_by: ['instance']
      group_wait: 30s
      receiver: default
      routes:
        - group_interval: 3m
          repeat_interval: 10m
          match:
            severity: warning
          receiver: 'default'
        - group_interval: 3m
          repeat_interval: 30m
          match:
            severity: critical
          receiver: 'default'
            
        - group_interval: 5m
          repeat_interval: 24h
          match:
            severity: longtime
          receiver: 'default'
    templates:
      - ./templates/*.tmpl
    receivers:
      - name: 'default'
        email_configs:
          - to: 'xiangkaihua@fosafer.com'
            send_resolved: true
        
        wechat_configs:
          - send_resolved: true
            corp_id: 'XXX'
            api_secret: 'XXX'
            agent_id: 1000002
            to_user: XXX
            to_party: 2
            message: '{{ template "wechat.html" . }}'
     
      - name: 'critical'
        email_configs:
          - to: '342382676@qq.com'
            send_resolved: true
          - to: 'xiangkaihua@fosafer.com'
            send_resolved: true
    
  • Alert template file

    cat alertmanager/templates/wechat.tmpl 
    {{ define "wechat.html" }}
    {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
    [@警报~]
    实例: {{ .Labels.instance }}
    信息: {{ .Annotations.summary }}
    详情: {{ .Annotations.description }}
    值: {{ .Annotations.value }}
    时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end -}}
    {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
    [@恢复~]
    实例: {{ .Labels.instance }}
    信息: {{ .Annotations.summary }}
    时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    恢复: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end -}}
    {{- end }}		
    

9.7 grafana

  • Only the volume mount is needed; the config file needs no changes. The db directory stays small and preserves settings and dashboards

10. Client deployment

10.1 Monitored host without docker: install node_exporter standalone

  • Install script:

    http://10.10.11.178:8001/node_exporter_install.sh
    

10.2 Monitored host running docker: install node_exporter and cadvisor via docker

  • Install script:

    http://10.10.11.178:8001/node_exporter_install_docker.sh
    
  • Required images: for docker hosts that have not added the 10.10.11.40:80 registry, download the saved image bundle, docker load it first, then install

    http://10.10.11.178:8001/monitor-client.tgz
    

11. Prometheus usage and maintenance

11.1 Adding and removing monitored targets via script

  • All jobs use file-based service discovery, so adding a target only means writing it into the corresponding sd_file; no configuration reload is required.

  • A text-processing script therefore serves as a front end to the sd_files, adding and removing targets from the command line without editing the files by hand.

  • Script name: sd_controler.sh

  • Usage: run ./sd_controler.sh with no arguments to print the usage.

  • Full script:

    [root@host40 monitor]# cat sd_controler.sh 
    #!/bin/bash
    #version: 1.0
    #Description: add | del | show instances to|from prometheus file_sd files.
    #       rl | vl | dk | rw | vw | tcp | http | icmp : short job names; each one maps to one sd_file.
    #       tcp | http | icmp targets (services with ports) are added with a label (server_name by default) to make alert emails easy to read.
    #       Each run can only add|del one instance.
    #Notes: a bare IP:PORT rarely tells you which service just went down, so tcp | http | icmp targets must carry a server_name label,
    #       letting whoever receives the alert email identify the failed service at a glance.
    #       To add entries in bulk, edit the sd_file directly in vim or wrap this script in a for loop.
    ### vars
    SD_DIR=./prometheus/sd_files
    DOCKER_SD=$SD_DIR/docker_host.yml
    RL_HOST_SD=$SD_DIR/real_lan.yml
    VL_HOST_SD=$SD_DIR/virtual_lan.yml
    RW_HOST_SD=$SD_DIR/real_wan.yml
    VW_HOST_SD=$SD_DIR/virtual_wan.yml
    
    TCP_SD=$SD_DIR/tcp.yml
    HTTP_SD=$SD_DIR/http.yml
    ICMP_SD=$SD_DIR/icmp.yml
    
    SDFILE=
    
    ### funcs
    usage(){
            echo -e "Usage: $0 < rl | vl | dk | rw | vw | tcp | http | icmp >  < add | del | show >  [ IP:PORT | FQDN ] [ server-name ]"
            echo -e " example: \n\t node add:\t $0 rl add | del 10.10.10.10:9100\n\t tcp,http,icmp add:\t $0 tcp add 10.10.10.10:3306 web-mysql\n\t del:\t $0 http del www.baidu.com\n\t show:\t $0 rl | vl | dk | rw | vw | tcp | http | icmp show."
            exit
    }
    
    add(){
            # $1: SDFILE, $2: IP:PORT
            grep -q "$2" $1 ||  echo -e "- targets: ['$2']" >> $1
    }
    
    
    del(){
            # $1: SDFILE, $2: IP:PORT
            sed -i '/'$2'/d' $1
    }
    
    add_with_label(){
            # $1: SDFILE, $2: [IP:[PORT]|FQDN] $3:SERVER-NAME
            LABEL_01="server_name"
            # NOTE: "$2" must be double-quoted; with single quotes grep looks for the
            # literal string $2 and the entry is appended again on every run.
            if ! grep -q "$2" $1;then
                    echo -e "- targets: ['$2']" >> $1
                    echo -e "  labels:" >> $1
                    echo -e "    ${LABEL_01}: $3" >> $1
            fi
    }
    
    del_with_label(){
            # $1: SDFILE, $2: [IP:[PORT]|FQDN]
            # A labeled entry spans 3 lines (target, labels:, server_name), so delete NUM..NUM+2.
            NUM=`cat -n $1 |grep "'$2'"|awk '{print $1}'`
            let ENDNUM=NUM+2
     
            sed -i ${NUM},${ENDNUM}d $1
    }
    
    action(){
            if [ "$1" == "add" ];then
                    add $SDFILE $2
            elif [ "$1" == "del" ];then
                    del $SDFILE $2
            elif [ "$1" == "show" ];then
                    cat $SDFILE
            fi
    }
    
    action_with_label(){
            if [ "$1" == "add" ];then
                    add_with_label $SDFILE $2 $3
            elif [ "$1" == "del" ];then
                    del_with_label $SDFILE $2 $3
            elif [ "$1" == "show" ];then
                    cat $SDFILE
            fi
    }
    
    ### main code
    [ "$2" == "" ] || [[ ! "$2" =~ ^(add|del|show)$ ]] && usage
    
    curl --version &>/dev/null || { echo -e "no curl found. " && exit 15; }
    
    if [[ $1 =~ ^(rl|vl|rw|vw|dk)$ ]] && [ "$2" == "add" ];then
            [ "$3" == "" ] && usage
     
            if [ "$4" != "-f" ];then
                    CODE=`curl -IL -o /dev/null --retry 3 --connect-timeout 3 -s -w "%{http_code}" http://$3/metrics`
                    [ "$CODE" != "200" ] &&  echo -e "http://$3/metrics is not reachable. check it again, or use -f to ignore the check." && exit 11
            fi
    fi
    
    if [[ $1 =~ ^(tcp|http|icmp)$ ]] && [ "$2" == "add" ];then
            [ "$4" == "" ] && echo -e "tcp / http / icmp targets must specify a server-name." && usage
    fi
    
    case $1 in
    rl)
            SDFILE=$RL_HOST_SD
            action $2 $3 && echo $2 OK
            ;;
    vl)
            SDFILE=$VL_HOST_SD
            action $2 $3 && echo $2 OK
            ;;
    dk)
            SDFILE=$DOCKER_SD
            action $2 $3 && echo $2 OK
            ;;
    rw)
            SDFILE=$RW_HOST_SD
            action $2 $3 && echo $2 OK
            ;;
    vw)
            SDFILE=$VW_HOST_SD
            action $2 $3 && echo $2 OK
            ;;
    tcp)
            SDFILE=$TCP_SD
            action_with_label $2 $3 $4 && echo $2 OK
            ;;
    http)
            SDFILE=$HTTP_SD
            action_with_label $2 $3 $4 && echo $2 OK
            ;; 
    icmp)
            SDFILE=$ICMP_SD
            action_with_label $2 $3 $4 && echo $2 OK
            ;; 
    *)
            usage
            ;;
    esac
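The tcp/http/icmp entries written by `add_with_label` span three lines (the target, `labels:`, and the `server_name` label), which is why `del_with_label` deletes lines NUM through NUM+2. A self-contained sketch of that round trip (the target and label values are made up):

```shell
# Recreate a labeled sd_file entry, then delete it the way del_with_label does.
SD=$(mktemp)
printf -- "- targets: ['10.0.0.1:3306']\n  labels:\n    server_name: db-mysql\n" >> "$SD"
NUM=$(cat -n "$SD" | grep "'10.0.0.1:3306'" | awk '{print $1}')
sed -i "${NUM},$((NUM + 2))d" "$SD"
wc -l < "$SD"   # prints 0: the whole 3-line entry is gone
```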
     
    
    

THE END
