Distributed Monitoring with Highly Available Prometheus: exporter+pushgateway+Prometheus+thanos+minio+alertmanager+grafana

happy_king_zi

Published 2024-09-13 11:59:20

Article link: https://blog.csdn.net/happy_king_zi/article/details/142207154

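Architecture design / environment deployment

Component selection

[Figure: component selection]

Architecture diagram

[Figure: architecture diagram]

Prometheus architecture diagram and overview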

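1. Multi-dimensional data model (a time series is identified by a metric name and key/value labels).
2. Flexible query language (PromQL).
3. Storage with no external dependencies, supporting both local and remote models.
4. Data is collected over HTTP in pull mode, which is simple and easy to understand.
5. Monitoring targets can be found through service discovery or static configuration.
6. Supports multiple statistical data models and works well with graphing tools.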
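
To make point 1 concrete, here is a minimal Python sketch of how a metric name plus a label set identifies one time series. `series_key` and the sample metric are hypothetical, chosen only for illustration; they are not part of any Prometheus client library.

```python
def series_key(name, labels):
    """Identity of a time series: the metric name plus its sorted key/value labels."""
    return (name, tuple(sorted(labels.items())))


a = series_key("http_requests_total", {"method": "GET", "status": "200"})
b = series_key("http_requests_total", {"status": "200", "method": "GET"})
c = series_key("http_requests_total", {"method": "POST", "status": "200"})

same_series = a == b        # label order does not matter
different_series = a != c   # any differing label value means a separate series
```

Every distinct label combination is stored as its own series, which is why labels with many possible values increase storage and query cost.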

thanos

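thanos overview

Official website

Official documentation

Architecture diagram

[Figure: thanos architecture diagram]

thanos storage options

thanos storage selection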

minio

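Official website

Official Chinese documentation

MinIO overview

MinIO is an object storage server originally released under Apache License v2.0 (newer releases are licensed under GNU AGPL v3).
It is compatible with the Amazon S3 cloud storage API.
Objects can range in size from a few KB up to a maximum of 5 TB.
The MinIO server is light enough to be bundled with the application stack, similar to NodeJS, Redis and MySQL.
Distributed MinIO uses erasure coding to protect against multiple node/drive failures and bit rot (data protection).
With distributed MinIO, data stays safe as long as n/2 or more disks are online; creating new objects requires at least n/2 + 1 disks (high availability).
MinIO follows strict read-after-write and list-after-write consistency.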
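
The n/2 and n/2 + 1 rule above can be worked through with a small Python sketch. `minio_quorum` is a hypothetical helper that only encodes the rule as stated in this article; the real quorum depends on the MinIO version and its parity settings.

```python
def minio_quorum(total_disks):
    """Return (read_quorum, write_quorum) following the n/2 and n/2 + 1 rule."""
    if total_disks < 2 or total_disks % 2:
        raise ValueError("expects an even number of disks")
    read_quorum = total_disks // 2
    write_quorum = total_disks // 2 + 1
    return read_quorum, write_quorum


# Example: 16 disks, e.g. 4 nodes with 4 drives each.
read_q, write_q = minio_quorum(16)
tolerated_for_reads = 16 - read_q    # disks that may be offline while data stays readable
tolerated_for_writes = 16 - write_q  # disks that may be offline while new objects can be created
```

With 16 disks this gives a read quorum of 8 and a write quorum of 9, so reads survive the loss of 8 disks and writes survive the loss of 7.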

MinIO features

[Figure: MinIO features]

Environment deployment

Deployment plan

    1. Host information
Storage		Hostnames: l-minio[1:4].ops.bj5.test.daling.com
				CPU: 8 cores
				Memory: 16 GB
				Disks:
				  System disk: 40 GB
				  Data disks:
						minio1: 4*300 GB (minio)
						minio2: 4*300 GB (minio)
						minio3: 4*300 GB (minio)
						minio4: 4*300 GB (minio)
		
Prometheus	Hostnames: l-prometheus[1:2].ops.test.bj5.com
				CPU: 8 cores
				Memory: 32 GB
				Disks: 40 GB + 300 GB
thanos		Hostnames: l-thanos[1:2].ops.test.bj5.com
				CPU: 8 cores
				Memory: 16 GB
				Disks: 100 GB
				
	2. Services deployed on each host
		minio1: minio, pushgateway
		minio2: minio, pushgateway
		minio3: minio
		minio4: minio, grafana, alertmanager, webhook-dingtalk
		prometheus1: prometheus, thanos sidecar
		prometheus2: prometheus, thanos sidecar
		thanos1: thanos query, thanos store, thanos compact
		thanos2: thanos query, thanos store
		
		
	3. Access addresses of services with a UI (credentials for the access-controlled ones were sent privately)
		minio: active/standby
			minio.corp.test.com
		pushgateway: active/standby
			pushgateway.corp.test.com
		prometheus: load balanced
			prometheus.corp.test.com
		thanos query: the data source address configured in grafana, load balanced
			thanos.corp.test.com
		alertmanager:
			alertmanager.corp.test.com
	4. Domain names for internal calls
		thanos-store1.srv.test.com	# resolves to the corresponding node
		alertmanager.srv.test.com	# resolves to the corresponding node
		thanos-sidecar1.srv.test.com	# resolves to the corresponding node
		minio.srv.test.com	# resolves to all nodes (DNS round robin)

minio

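	Software notes:
		Software: written in Go, a single executable file
		Management: write a configuration file and pass parameters at startup
		UI: has a web UI with access control
	Notes:
		1. Once the cluster layout is fixed, it cannot be expanded
		2. All nodes must be started one after another; the cluster is ready only when every node has started successfully
		
	Problems and solutions:
		1. Starting with supervisor is problematic
			Solution:
				Option 1, supervisor: check the supervisor configuration and fix any errors; if the configuration is correct and supervisor still cannot manage the process, use option 2
				Option 2, systemd: manage the service with a systemd unit
		2. The minio admin credentials were set in environment variables instead of a configuration file
			Solution:
				Set up a service management directory and files, persist the configuration to a file, and restrict its permissions
		3. thanos query reports **413 Request Entity Too Large**
			Cause: nginx limits the upload size, so requests that pass through it fail with 413 Request Entity Too Large
			Solution: separate minio UI access from programmatic access
				The UI is accessed through the corp domain
				Programs call minio through the srv domain
		
	Environment notes:
		Binary: installed in /usr/local/bin
		UI address: minio.corp.test.com
			Admin credentials:
				admin
				SVxgEcGOmt5hBP06WoaTNCFfR

	
	supervisor program file minio.conf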
[program:minio]
user=root
directory=/

command=/bin/bash /Daling/bash/minio-start.sh

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
stdout_logfile=/var/log/supervisor/minio-out.log
stderr_logfile=/var/log/supervisor/minio-err.log


	Start script /Daling/bash/minio-start.sh
#!/bin/bash

source /etc/profile

/usr/local/bin/minio --compat server http://10.13.114.12/data1 http://10.13.114.12/data2 http://10.13.114.12/data3 http://10.13.114.12/data4 http://10.13.114.13/data1 http://10.13.114.13/data2 http://10.13.114.13/data3 http://10.13.114.13/data4 http://10.13.114.11/data1 http://10.13.114.11/data2 http://10.13.114.11/data3 http://10.13.114.11/data4 http://10.13.114.14/data1 http://10.13.114.14/data2 http://10.13.114.14/data3 http://10.13.114.14/data4

	Start in the background

nohup minio --compat server http://10.13.114.12/data1 http://10.13.114.12/data2 http://10.13.114.12/data3 http://10.13.114.12/data4 http://10.13.114.13/data1 http://10.13.114.13/data2 http://10.13.114.13/data3 http://10.13.114.13/data4 http://10.13.114.11/data1 http://10.13.114.11/data2 http://10.13.114.11/data3 http://10.13.114.11/data4 http://10.13.114.14/data1 http://10.13.114.14/data2 http://10.13.114.14/data3 http://10.13.114.14/data4 &

	Operations:
		Buckets can be created in the UI (follow the bucket naming rules)
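
As a rough illustration of the naming rules mentioned above, the following Python sketch checks a bucket name against a simplified S3-style rule. It is stricter than the full specification (for example, it rejects dots), and `is_valid_bucket_name` is a hypothetical helper, not a MinIO API.

```python
import re

# Simplified rule: 3-63 characters, lowercase letters, digits and hyphens,
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if the name satisfies the simplified rule above."""
    return _BUCKET_RE.fullmatch(name) is not None


ok = is_valid_bucket_name("prometheus-app")
bad_case = is_valid_bucket_name("Prometheus_App")  # uppercase and underscore
too_short = is_valid_bucket_name("ab")             # fewer than 3 characters
```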

thanos

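	Software notes:
		Software: written in Go, a single executable file
		Management: has its own configuration files; parameters are passed at startup (per component)
		UI: has a web UI, no access control
		Persistent data: some data must be persisted, so a data directory has to be configured
		Components:
			thanos sidecar: runs next to Prometheus, uploads its TSDB blocks to object storage and serves recent data to thanos query
			thanos compact: compacts and downsamples the blocks in object storage
			thanos store: serves the historical blocks kept in object storage to thanos query
			thanos query: queries and aggregates data from the sidecars and stores
		Port plan from the official site:
			https://thanos.io/getting-started.md/
			Component	Interface	            Port
			Sidecar	    gRPC	                10901
			Sidecar	    HTTP	                10902
			Query	    gRPC	                10903
			Query	    HTTP	                10904
			Store	    gRPC	                10905
			Store	    HTTP	                10906
			Receive	    gRPC (store API)	    10907
			Receive	    HTTP (remote write API)	10908
			Receive	    HTTP	                10909
			Rule	    gRPC	                10910
			Rule	    HTTP	                10911
			Compact	    HTTP	                10912
	Environment notes:
		Binary: installed in /usr/local/bin
		UI address:
		Software root directory: /etc/thanos
		Persistent directories:
			thanos-compact:
				/data/thanos/compact
			
			
		Alert channel: requires a separately configured service (can run on another machine)
	
		
	Problems and solutions:
		1. thanos compact exits on its own after it finishes processing the data (without the --wait flag it runs as a one-off job), so it is scheduled with cron; see the end of this section
	
	Config file bucket_config.yaml (loaded by the sidecar, store and compact components)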
type: S3
config:
  bucket: "prometheus-app"
  endpoint: "minio.corp.daling.com"  # per the 413 note in the minio section, programmatic access should go through the srv domain
  access_key: "admin"
  secret_key: "SVxgEcGOmt5hBP06WoaTNCFfR"
  region: "cn-north-1"
  insecure: true
  http_config:
    idle_conn_timeout: 2m
    response_header_timeout: 2m
    insecure_skip_verify: true
	
	supervisor program files for each component
		thanos-sidecar.conf
[program:thanos-sidecar]

user=root
directory=/

command=thanos sidecar --tsdb.path=/data/prometheus/data --prometheus.url=http://127.0.0.1:9090 --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10902 --grpc-address=0.0.0.0:10901

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True

		thanos-compact.conf
[program:thanos-compact]
user=root
directory=/

command=thanos compact --data-dir=/data/thanos/compact --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10912

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True

		thanos-store.conf
[program:thanos-store]
user=root
directory=/

command=thanos store --data-dir=/data/thanos/store/ --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10906 --grpc-address=0.0.0.0:10905 --chunk-pool-size=10GB

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True

		thanos-query.conf ## connects to the sidecar and store endpoints to serve queries
[program:thanos-query]
user=root
directory=/

command=thanos query --http-address=0.0.0.0:10904 --grpc-address=0.0.0.0:10903 --store=thanos-sidecar1.srv.daling.com:10901 --store=thanos-sidecar2.srv.daling.com:10901 --store=thanos-store1.srv.daling.com:10905  --store=thanos-store2.srv.daling.com:10905 --query.replica-label replica

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True

        thanos-compact ## optimizes the blocks stored in minio to speed up queries
        Startup: run every eight hours from /etc/cron.d/thanos-compact
0 */8 * * * root thanos compact --data-dir=/data/thanos/thanos-compact --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10912 >> /tmp/thanos-compact.log 2>&1
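
The `--query.replica-label replica` flag in thanos-query.conf tells thanos query to treat series that differ only in the `replica` label as copies of the same data. The Python sketch below shows the idea in a deliberately simplified form: it drops the replica label and keeps one value per timestamp. Thanos's actual deduplication is more involved than this union of samples, and `dedup_by_replica` with its sample data is purely illustrative.

```python
def dedup_by_replica(series_list, replica_label="replica"):
    """Merge series whose labels differ only in the replica label."""
    merged = {}
    for labels, samples in series_list:
        # Series identity without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            bucket.setdefault(ts, value)  # keep the first value seen per timestamp
    return {key: sorted(points.items()) for key, points in merged.items()}


series = [
    ({"__name__": "up", "job": "prometheus", "replica": "A"}, [(0, 1.0), (15, 1.0)]),
    ({"__name__": "up", "job": "prometheus", "replica": "B"}, [(15, 1.0), (30, 1.0)]),
]
result = dedup_by_replica(series)
```

Here the two replicas collapse into a single series covering timestamps 0, 15 and 30, so a gap in one Prometheus instance is filled by the other.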

prometheus

Directory layout

.
├── data
│   ├── 01E0W35BS4ZBCDVPN1MKYFRPK4.tmp
│   ├── 01E2Z8BJSC8KB70HAPB997TQNT
│   ├── lock
│   ├── queries.active
│   ├── thanos
│   ├── thanos.shipper.json
│   └── wal  # data is first buffered in temporary files under this directory
└── databak
    ├── lock
    ├── queries.active
    ├── thanos
    ├── thanos.shipper.json
    └── wal

189 directories, 6 files

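Deployment

	Software notes:
		Software: written in Go, a single executable file
		Management: has its own configuration files (several of them); parameters are passed at startup
		UI: has a web UI, no access control
		Persistent data: some data must be persisted, so a data directory has to be configured
	Environment notes:
		Binary: installed in /usr/local/bin
		UI address: prometheus.corp.daling.com
		Software root directory: /opt/prometheus
		Rule directory layout:
			rules
			├── app
			└── sys
		Persistent directory: /data/prometheus/data/
	
	

	supervisor program file prometheus.conf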

[program:prometheus]

user=root
directory=/

command=prometheus --config.file="/opt/prometheus/prometheus.yml" --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h --web.enable-lifecycle --storage.tsdb.path="/data/prometheus/data/"

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
	
	Config file prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  external_labels:
    region: cn-north-1
  #  monitor: infrastructure
    replica: A

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager.srv.daling.com:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
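  - "/opt/prometheus/rules/app/*.yml"
  # - "/opt/prometheus/rules/sys/*.yaml"  # needed before the rules under rules/sys (e.g. linux_sys.yaml) are loaded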

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'minio'
    metrics_path: '/minio/prometheus/metrics'
    honor_labels: true
    static_configs:
    - targets: ['minio.corp.daling.com']

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
    - targets: ['pushgateway.corp.daling.com']


	Rule example 1: rules/app/http.yml
groups:
- name: http_alert
  rules:
  - alert: http_exception
    expr: ceil(increase(http_server_requests_seconds_count{env="prod", status!="200", exception!="None"}[1m])) > 0
    for: 15s
    labels:
      severity: error
    annotations:
      summary: "[{{$labels.uri}}] raised {{$value}} exception(s)"
      description: "[{{$labels.uri}}] raised {{$value}} exception(s)"
  - alert: http_qps_increase
    expr: (ceil((sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m])) / sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m] offset 1m)))) > 10) and (sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m])) > 1) and  (sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m] offset 1m)) > 1) 
    for: 15s
    labels:
      severity: warn
    annotations:
      summary: "[{{$labels.uri}}] traffic increased {{$value}}x"
      description: "[{{$labels.uri}}] traffic increased {{$value}}x"
  - alert: mybatis_low_operate
    expr: ceil((sum(rate(daling_mybatis_requests_seconds_sum{env="prod",status="success"}[1m])) without (instance) / sum(rate(daling_mybatis_requests_seconds_count{env="prod",status="success"}[1m])) without (instance)) * 1000) > 500
    for: 15s
    labels:
      severity: warn
    annotations:
      summary: "[{{$labels.name}}] took {{$value}} ms to execute"
      description: "[{{$labels.name}}] took {{$value}} ms to execute"
  - alert: hystrix_error_event
    expr: ceil(increase(hystrix_execution_total{event!="success"}[1m])) > 0
    for: 15s
    labels:
      severity: error
    annotations:
      summary: "[{{$labels.key}}] circuit breaking occurred {{$value}} time(s)"
      description: "[{{$labels.key}}] circuit breaking occurred {{$value}} time(s)"
  - alert: hystrix_circuit_breaker_open    
    expr: sum(hystrix_circuit_breaker_open) without (instance) > 0
    for: 15s
    labels:
      severity: error
    annotations:
      summary: "[{{$labels.key}}] circuit breaker opened {{$value}} time(s)"
      description: "[{{$labels.key}}] circuit breaker opened {{$value}} time(s)"
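
The `http_qps_increase` expression above is dense, so here is a Python sketch of the condition it evaluates for one series: the rounded-up ratio between the current and the previous one-minute rate must exceed 10, and both rates must be above 1. `qps_spike` and its parameter names are hypothetical and only mirror the thresholds in the rule.

```python
import math


def qps_spike(current_qps, previous_qps, ratio_threshold=10, min_qps=1):
    """Return the rounded-up growth ratio if the rule would match, else None."""
    # Both rates must be above the floor, so near-idle endpoints are ignored.
    if current_qps <= min_qps or previous_qps <= min_qps:
        return None
    ratio = math.ceil(current_qps / previous_qps)
    return ratio if ratio > ratio_threshold else None


spike = qps_spike(50, 2)          # 25x growth, both rates above 1
idle_before = qps_spike(50, 0.5)  # previous rate too low, ignored
borderline = qps_spike(20, 2)     # exactly 10x, not strictly greater than 10
```

The floor on both rates is what keeps a jump from 0.1 to 2 requests per second from being reported as a 20x spike.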


	Rule example 2: rules/app/minio.yml
groups:
- name: minio_alert
  rules:
  - alert: minio_exception
    expr: minio_offline_disks > 0
    for: 1m
    labels:
    annotations:
      summary: "{{$labels.job}} in {{$labels.instance}}"
      description: "{{$labels.job}} in {{$labels.instance}}"
      
      
    Rule example 3: rules/sys/linux_sys.yaml
groups:
- name: node
  rules:
  - alert: server_status
    expr: up{job="prometheus"} == 0
    for: 15s
    annotations:
      summary: " {{ $labels.instance }} "
      description: "Host {{ $labels.instance }} is down"
  - alert: Memory Usage
    expr: ceil(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100) < 4
    for: 60s
    annotations:
      summary: " {{ $labels.hostname }} "
      description: "Host available memory is below 4%."
      value: "{{ $value }}%"
  - alert: CPU Usage
    expr: ceil(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (hostname)*100) < 10
    for: 300s
    annotations:
      summary: " {{ $labels.hostname }} "
      description: "Host CPU idle rate is below 10%."
      value: "{{ $value }}%"
  - alert: System Load
    expr: ceil((sum(node_load5)by(hostname))/(count(node_cpu_seconds_total{mode="idle"})by(hostname)) * 100) > 400
    for: 300s
    annotations:
      summary: " {{ $labels.hostname }} "
      description: "Host is running at full load."
      value: "{{ $value }}%"
  - alert: Disk Free
    expr: ceil((node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100) < 10
    for: 60s
    annotations:
      summary: " {{ $labels.hostname }} "
      description: "Partition {{ $labels.mountpoint }} is running out of space"
      value: "{{ $value }}%"
  - alert: Disk IO
    expr: ceil(sum(irate(node_disk_writes_completed_total[5m])) by (hostname)) > 2500
    for: 5m
    annotations:
      summary: " {{ $labels.hostname }} "
      description: "Host disk write IOPS is high"
      value: "{{ $value }}iops"
  - alert: Disk IO
    expr: ceil(sum(irate(node_disk_reads_completed_total[5m])) by (hostname)) > 2500
    for: 5m
    annotations:
      summary: " {{ $labels.hostname }} "
      description: "Host disk read IOPS is high"
      value: "{{ $value }}iops"

pushgateway

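	Software notes:
		Software: written in Go, a single executable file
		Management: parameters are passed at startup
		UI: has a web UI, no access control
		Persistent data: some data must be persisted, so a data directory has to be configured
	
	Environment notes:
		Binary: installed in /usr/local/bin
		UI address: pushgateway.corp.daling.com
		Persistent directory: /data/pushgateway
	
	supervisor program file pushgateway.conf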

[program:pushgateway]
user=root
directory=/

command=pushgateway --persistence.file="/data/pushgateway/data" --persistence.interval=24h

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
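
For context on how jobs hand metrics to pushgateway, the Python sketch below builds a push request in the text exposition format against the `/metrics/job/<job>/instance/<instance>` path. The host name, job, instance and metric are made-up examples, and the request is only constructed here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_push_request(base_url, job, instance, metrics):
    """Build (not send) a POST request carrying metrics in the text exposition format."""
    url = "{}/metrics/job/{}/instance/{}".format(
        base_url.rstrip("/"), quote(job, safe=""), quote(instance, safe=""))
    # One "name value" line per metric, each terminated by a newline.
    body = "".join("{} {}\n".format(name, value) for name, value in metrics.items())
    headers = {"Content-Type": "text/plain; version=0.0.4"}
    return Request(url, data=body.encode("utf-8"), headers=headers, method="POST")


req = build_push_request(
    "http://pushgateway.example:9091",  # hypothetical address
    "backup", "host1",
    {"backup_last_success_timestamp_seconds": 1726200000},
)
# To actually push: urllib.request.urlopen(req)
```

Prometheus then scrapes these values from pushgateway through the `pushgateway` job in prometheus.yml, where `honor_labels: true` keeps the pushed `job` and `instance` labels.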

alertmanager

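	Software notes:
		Software: written in Go, a single executable file
		Management: has its own configuration files (several of them); parameters are passed at startup
		UI: has a web UI, no access control
		Persistent data: some data must be persisted, so a data directory has to be configured
	Environment notes:
		Binary: installed in /usr/local/bin
		UI address: alertmanager.corp.daling.com
		Software root directory: /opt/alertmanager
		Persistent directory: /data/alertmanager
		Alert channel: requires a separately configured service (can run on another machine)
	
	supervisor program file alertmanager.conf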

[program:alertmanager]
user=root
directory=/

command=alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/data/alertmanager

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True

	Config file alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 10m
  receiver: 'test_hook'

  routes:
  - receiver: http_qps_increase
    group_by: ['alertname', 'job', 'uri']
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 10m
    match_re:
      alertname: 'http_qps_increase'
      replica: 'A'
  - receiver: http_exception
    group_by: ['alertname', 'job', 'uri', 'exception']
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 10m
    match_re:
      alertname: 'http_exception'
      replica: 'A'
  - receiver: mybatis_low_operate
    group_by: ['alertname', 'job', 'name']
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 10m
    match_re:
      alertname: 'mybatis_low_operate'
      replica: 'A'
  - receiver: hystrix_error_event
    group_by: ['alertname', 'job', 'key', 'event']
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 10m
    match_re:
      alertname: 'hystrix_error_event'
      replica: 'A'
  - receiver: hystrix_circuit_breaker_open
    group_by: ['alertname', 'job', 'key', 'group']
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 10m
    match_re:
      alertname: 'hystrix_circuit_breaker_open'
      replica: 'A'

  - receiver: minio_hook
    group_wait: 10s
    match_re:
      alertname: 'minio_exception'
receivers:
- name: 'http_qps_increase'
  webhook_configs:
  - url: 'http://sgp.srv.daling.com/alert/push/http_qps_increase'
- name: 'http_exception'
  webhook_configs:
  - url: 'http://sgp.srv.daling.com/alert/push/http_exception'
- name: 'mybatis_low_operate'
  webhook_configs:
  - url: 'http://sgp.srv.daling.com/alert/push/mybatis_low_operate'
- name: 'hystrix_error_event'
  webhook_configs:
  - url: 'http://sgp.srv.daling.com/alert/push/hystrix_error_event'
- name: 'hystrix_circuit_breaker_open'
  webhook_configs:
  - url: 'http://sgp.srv.daling.com/alert/push/hystrix_circuit_breaker_open'
- name: 'minio_hook'
  webhook_configs:
  - url: 'http://127.0.0.1:8060/dingtalk/minio_dingding/send'
- name: 'test_hook'
  webhook_configs:
  - url: 'http://10.36.35.128:8520/alert/push/prometheus'
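
To show how the routing tree above picks a receiver, here is a simplified Python sketch: child routes are tried in order, every `match_re` pattern must match the whole label value, and alerts that match no child fall back to the root receiver. Only three of the child routes are reproduced, and `pick_receiver` is an illustration, not Alertmanager code.

```python
import re

# (receiver, match_re) pairs mirroring a few child routes of the config above.
ROUTES = [
    ("http_qps_increase", {"alertname": "http_qps_increase", "replica": "A"}),
    ("http_exception", {"alertname": "http_exception", "replica": "A"}),
    ("minio_hook", {"alertname": "minio_exception"}),
]
DEFAULT_RECEIVER = "test_hook"


def pick_receiver(alert_labels):
    """Return the receiver of the first child route whose regexes all match."""
    for receiver, matchers in ROUTES:
        # A missing label is matched as an empty string.
        if all(re.fullmatch(pattern, alert_labels.get(label, ""))
               for label, pattern in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


from_a = pick_receiver({"alertname": "http_exception", "replica": "A"})
from_b = pick_receiver({"alertname": "http_exception", "replica": "B"})
minio = pick_receiver({"alertname": "minio_exception", "replica": "B"})
```

Because most routes require `replica: 'A'`, the same alert coming from the second Prometheus instance is not delivered twice to the business receivers; it falls through to `test_hook` instead.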


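	Separate alerting service: DingTalk alerts via webhook-dingtalk
		Service notes:
			Software: written in Go, a single executable file
			Management: parameters are passed at startup (the DingTalk robot API URL)
	
		supervisor program file webhook-dingtalk.conf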

[program:webhook-dingtalk]
user=root
directory=/

command=prometheus-webhook-dingtalk --ding.profile=alert_dingding=https://oapi.dingtalk.com/robot/send?access_token=99f5f1477e192a118ac65b0115ff807fdd0d3dbf0fe8ed99120e1b873e24232c

autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
