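Architecture Design / Environment Deployment
Component Selection
Architecture Diagram
Prometheus architecture diagram and overview
1. Multi-dimensional data model (a time series is identified by a metric name and key/value labels).
2. Flexible query language (PromQL).
3. No dependency on external storage; supports both local and remote storage models.
4. Uses HTTP with a pull model to scrape data, which is simple and easy to understand.
5. Monitoring targets can be defined through service discovery or static configuration.
6. Supports multiple statistical data models and works well with graphical dashboards.
thanos
Architecture Diagram
Choosing the thanos storage backend
minio
MinIO overview
- MinIO is an object storage server released under Apache License v2.0.
- It is compatible with the Amazon S3 cloud storage service.
- Objects can range in size from a few KB up to a maximum of 5 TB.
- The MinIO server is light enough to be bundled with the application stack, similar to NodeJS, Redis, and MySQL.
- Distributed MinIO uses erasure coding to protect against multiple node/drive failures and bit rot (data protection).
- In distributed MinIO, data stays safe as long as n/2 or more disks are online; at least n/2 + 1 disks are required to create objects (high availability).
- MinIO follows strict read-after-write and list-after-write semantics to guarantee data consistency.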
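The n/2 and n/2 + 1 rule in the list above can be turned into concrete numbers. The following is a minimal sketch, assuming the rule applies exactly as quoted and that the cluster has the 16 data disks described in the host plan of this document (4 hosts with 4 disks each); the function names are illustrative and not part of MinIO.

```python
from typing import Tuple


def minio_quorums(total_disks: int) -> Tuple[int, int]:
    """Return (read_quorum, write_quorum) under the n/2 and n/2 + 1 rule."""
    if total_disks < 2 or total_disks % 2:
        raise ValueError("this sketch assumes an even number of disks")
    read_quorum = total_disks // 2       # disks that must be online to read data
    write_quorum = total_disks // 2 + 1  # disks that must be online to create objects
    return read_quorum, write_quorum


def tolerated_failures(total_disks: int) -> Tuple[int, int]:
    """Return how many disks may be offline for reads and for writes."""
    read_quorum, write_quorum = minio_quorums(total_disks)
    return total_disks - read_quorum, total_disks - write_quorum


print(minio_quorums(16))       # quorums for 4 hosts x 4 disks
print(tolerated_failures(16))  # offline disks tolerated for reads / writes
```

Under these assumptions, 16 disks give a read quorum of 8 and a write quorum of 9, so reads survive the loss of 8 disks and writes survive the loss of 7.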
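MinIO features
Environment Deployment
Deployment plan
1. Host information
Storage hostnames: l-minio[1:4].ops.bj5.test.daling.com
CPU: 8 cores
Memory: 16 GB
Disks:
System disk: 40 GB
Data disks:
minio1: 4 × 300 GB (minio)
minio2: 4 × 300 GB (minio)
minio3: 4 × 300 GB (minio)
minio4: 4 × 300 GB (minio)
Prometheus hostnames: l-prometheus[1:2].ops.test.bj5.com
CPU: 8 cores
Memory: 32 GB
Disks: 40 GB + 300 GB
thanos hostnames: l-thanos[1:2].ops.test.bj5.com
CPU: 8 cores
Memory: 16 GB
Disk: 100 GB
2. Services deployed on each host
minio1: minio, pushgateway
minio2: minio, pushgateway
minio3: minio
minio4: minio, grafana, alertmanager, webhook-dingtalk
prometheus1: prometheus, thanos sidecar
prometheus2: prometheus, thanos sidecar
thanos1: thanos-query, thanos store, thanos-compact
thanos2: thanos-query, thanos store
3. Access addresses of services with a UI (credentials for the access-controlled ones have been sent privately)
minio: active/standby
minio.corp.test.com
pushgateway: active/standby
pushgateway.corp.test.com
prometheus: load-balanced
prometheus.corp.test.com
thanos query (the data source address configured in grafana): load-balanced
thanos.corp.test.com
alertmanager:
alertmanager.corp.test.com
4. Domain names for internal calls
thanos-store1.srv.test.com # resolves to the corresponding node
alertmanager.srv.test.com # resolves to the corresponding node
thanos-sidecar1.srv.test.com # resolves to the corresponding node
minio.srv.test.com # resolves to all nodes (DNS round robin)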
minio
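Software notes:
Software: written in Go; a single executable file
Management: configuration is written to a file; parameters are specified at startup
UI: has a web UI with access control
Notes:
1. Once the cluster has been set up, it cannot be expanded.
2. At startup, all nodes must be started one after another; the cluster is available only after every node has started successfully.
Problems and solutions:
1. Starting minio with supervisor is problematic.
Solution:
Option 1, supervisor: check the configuration for errors and fix any that are found; if the configuration is correct and supervisor still cannot manage the process, use option 2.
Option 2, systemd: use a systemd service unit.
2. The minio administrator credentials were set in environment variables instead of a configuration file.
Solution:
Set up a service management directory and files, persist the configuration to a file, and restrict access permissions on it.
3. thanos query reports **413 Request Entity Too Large**
Cause: nginx limits the size of uploaded request bodies (thanos query reports the 413 Request Entity Too Large error).
Solution: separate minio UI access from programmatic access.
The UI is accessed through the corp domain.
Programs call minio through the srv domain.
Environment notes:
Binary: installed in /usr/local/bin
UI address: minio.corp.test.com
Administrator username and password: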
admin
SVxgEcGOmt5hBP06WoaTNCFfR
supervisor startup file minio.conf
[program:minio]
user=root
directory=/
command=/bin/bash /Daling/bash/minio-start.sh
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
stdout_logfile=/var/log/supervisor/minio-out.log
stderr_logfile=/var/log/supervisor/minio-err.log
Management script /Daling/bash/minio-start.sh
#!/bin/bash
source /etc/profile
/usr/local/bin/minio --compat server http://10.13.114.12/data1 http://10.13.114.12/data2 http://10.13.114.12/data3 http://10.13.114.12/data4 http://10.13.114.13/data1 http://10.13.114.13/data2 http://10.13.114.13/data3 http://10.13.114.13/data4 http://10.13.114.11/data1 http://10.13.114.11/data2 http://10.13.114.11/data3 http://10.13.114.11/data4 http://10.13.114.14/data1 http://10.13.114.14/data2 http://10.13.114.14/data3 http://10.13.114.14/data4
Start in the background
nohup minio --compat server http://10.13.114.12/data1 http://10.13.114.12/data2 http://10.13.114.12/data3 http://10.13.114.12/data4 http://10.13.114.13/data1 http://10.13.114.13/data2 http://10.13.114.13/data3 http://10.13.114.13/data4 http://10.13.114.11/data1 http://10.13.114.11/data2 http://10.13.114.11/data3 http://10.13.114.11/data4 http://10.13.114.14/data1 http://10.13.114.14/data2 http://10.13.114.14/data3 http://10.13.114.14/data4 &
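A list of 16 endpoints is easy to mistype by hand, so it may be safer to generate the command line. The sketch below assumes the same host order and the /data1 to /data4 mount points used in the commands above; the helper names are hypothetical, and the result should be compared against the running cluster before use because MinIO expects the same endpoint list on every node.

```python
# Host order copied from the commands above.
HOSTS = ["10.13.114.12", "10.13.114.13", "10.13.114.11", "10.13.114.14"]


def minio_endpoints(hosts, disks_per_host=4):
    """Build one http://<host>/data<n> endpoint per disk, host by host."""
    return [
        f"http://{host}/data{n}"
        for host in hosts
        for n in range(1, disks_per_host + 1)
    ]


def minio_command(hosts, binary="/usr/local/bin/minio"):
    """Join the binary, the flags used in this document, and all endpoints."""
    return " ".join([binary, "--compat", "server", *minio_endpoints(hosts)])


print(minio_command(HOSTS))
```

The printed line can be pasted into minio-start.sh or a systemd unit; every endpoint is separated by exactly one space.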
Operations:
Buckets can be created in the UI (follow the bucket naming rules).
thanos
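Software notes:
Software: written in Go; a single executable file
Management: has its own configuration; parameters are specified at startup (per component)
UI: has a web UI without access control
Persistent data: yes; the directory it is written to must be configured separately
Component notes:
thanos sidecar: data collection (runs alongside Prometheus and uploads its blocks to object storage)
thanos compact: data compaction
thanos store: serves the data stored in object storage
thanos query: data querying and aggregation
Port plan from the official site: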
https://thanos.io/getting-started.md/
Component   Interface                 Port
Sidecar     gRPC                      10901
Sidecar     HTTP                      10902
Query       gRPC                      10903
Query       HTTP                      10904
Store       gRPC                      10905
Store       HTTP                      10906
Receive     gRPC (store API)          10907
Receive     HTTP (remote write API)   10908
Receive     HTTP                      10909
Rule        gRPC                      10910
Rule        HTTP                      10911
Compact     HTTP                      10912
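To keep the startup commands consistent with this port plan, the table can be encoded once and the listen flags derived from it. This is a minimal sketch covering only the four components deployed in this document; the helper name `address_flags` and the 0.0.0.0 bind address are illustrative assumptions.

```python
# Ports for the components used in this deployment, taken from the table above.
THANOS_PORTS = {
    "sidecar": {"grpc": 10901, "http": 10902},
    "query": {"grpc": 10903, "http": 10904},
    "store": {"grpc": 10905, "http": 10906},
    "compact": {"http": 10912},
}


def address_flags(component, bind="0.0.0.0"):
    """Return the --grpc-address / --http-address flags for one component."""
    ports = THANOS_PORTS[component]
    return [f"--{iface}-address={bind}:{port}" for iface, port in sorted(ports.items())]


for name in THANOS_PORTS:
    print(name, " ".join(address_flags(name)))
```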
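Environment notes:
Binary: installed in /usr/local/bin
UI address:
Software root directory: /etc/thanos
Persistent directories:
thanos-compact:
/data/thanos/compact
Alert channel: requires a separately configured service (which can run on another machine)
Problems and solutions:
1. After starting, thanos compact processes the available data and then exits on its own (this is its default behavior unless the --wait flag is given).
Configuration file bucket_config.yaml (loaded by the sidecar, compact, and store components)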
type: S3
config:
  bucket: "prometheus-app"
  endpoint: "minio.corp.daling.com"
  access_key: "admin"
  secret_key: "SVxgEcGOmt5hBP06WoaTNCFfR"
  region: "cn-north-1"
  insecure: true
  http_config:
    idle_conn_timeout: 2m
    response_header_timeout: 2m
    insecure_skip_verify: true
supervisor configuration files for each component
thanos-sidecar.conf
[program:thanos-sidecar]
user=root
directory=/
command=thanos sidecar --tsdb.path=/data/prometheus/data --prometheus.url=http://127.0.0.1:9090 --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10902 --grpc-address=0.0.0.0:10901
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
thanos-compact.conf
[program:thanos-compact]
user=root
directory=/
command=thanos compact --data-dir=/data/thanos/compact --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10912
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
thanos-store.conf
[program:thanos-store]
user=root
directory=/
command=thanos store --data-dir=/data/thanos/store/ --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10906 --grpc-address=0.0.0.0:10905 --chunk-pool-size=10GB
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
thanos-query.conf ## connects to the sidecar and store endpoints to query data
[program:thanos-query]
user=root
directory=/
command=thanos query --http-address=0.0.0.0:10904 --grpc-address=0.0.0.0:10903 --store=thanos-sidecar1.srv.daling.com:10901 --store=thanos-sidecar2.srv.daling.com:10901 --store=thanos-store1.srv.daling.com:10905 --store=thanos-store2.srv.daling.com:10905 --query.replica-label replica
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
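The `--query.replica-label replica` flag in the thanos query command above tells the querier which external label distinguishes the two Prometheus replicas, so that series differing only in that label can be treated as one. The sketch below illustrates just the grouping idea; it is a simplification and not Thanos's actual deduplication algorithm, and the replica value "B" for the second Prometheus is an assumption (only `replica: A` appears in this document).

```python
def dedup_key(labels, replica_label="replica"):
    """Identity of a series once the replica label is ignored."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))


def group_replicas(series, replica_label="replica"):
    """Map each deduplicated identity to the replicas that reported it."""
    groups = {}
    for labels in series:
        key = dedup_key(labels, replica_label)
        groups.setdefault(key, []).append(labels.get(replica_label))
    return groups


series = [
    {"__name__": "up", "job": "minio", "region": "cn-north-1", "replica": "A"},
    {"__name__": "up", "job": "minio", "region": "cn-north-1", "replica": "B"},  # assumed value
]
print(group_replicas(series))
```

Both input series collapse into a single group, which is why a dashboard backed by thanos query shows one line per series instead of one per Prometheus instance.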
thanos-compact ## compacts the blocks stored in minio to improve query efficiency
Startup: runs on a schedule every eight hours via /etc/cron.d/thanos-compact
0 */8 * * * root thanos compact --data-dir=/data/thanos/thanos-compact --objstore.config-file=/etc/thanos/bucket_config.yaml --http-address=0.0.0.0:10912 >> /tmp/thanos-compact.log 2>&1
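The hour field `*/8` in the cron entry above expands to every eighth hour starting from 0. A small sketch of that expansion, assuming standard cron step semantics over the hour range 0 to 23 (the function name is illustrative):

```python
def cron_step_values(step, low=0, high=23):
    """Expand a cron '*/step' field over the range low..high (hours by default)."""
    return list(range(low, high + 1, step))


run_hours = cron_step_values(8)
run_times = [f"{hour:02d}:00" for hour in run_hours]
print(run_times)
```

With minute 0 and hour `*/8`, the compaction job starts at 00:00, 08:00, and 16:00 each day.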
prometheus
Directory layout
.
├── data
│   ├── 01E0W35BS4ZBCDVPN1MKYFRPK4.tmp
│   ├── 01E2Z8BJSC8KB70HAPB997TQNT
│   ├── lock
│   ├── queries.active
│   ├── thanos
│   ├── thanos.shipper.json
│   └── wal    # incoming data is first buffered in temporary files under this directory
└── databak
    ├── lock
    ├── queries.active
    ├── thanos
    ├── thanos.shipper.json
    └── wal
189 directories, 6 files
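Deployment
Software notes:
Software: written in Go; a single executable file
Management: has its own configuration, split across several files; parameters are specified at startup
UI: has a web UI without access control
Persistent data: yes; the directory it is written to must be configured separately
Environment notes:
Binary: installed in /usr/local/bin
UI address: prometheus.corp.daling.com
Software root directory: /opt/prometheus
Rules directory layout:
rules
├── app
└── sys
Persistent directory: /data/prometheus/data/
supervisor startup file prometheus.conf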
[program:prometheus]
user=root
directory=/
command=prometheus --config.file="/opt/prometheus/prometheus.yml" --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h --web.enable-lifecycle --storage.tsdb.path="/data/prometheus/data/"
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
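The `--web.enable-lifecycle` flag in the command above enables Prometheus's HTTP lifecycle endpoints, including `POST /-/reload`, so configuration and rule changes can be applied without restarting the process. A minimal sketch using only the standard library, assuming Prometheus listens on http://127.0.0.1:9090 as in the sidecar configuration; the function names are illustrative.

```python
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Prepare the POST request for the configuration reload endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url="http://127.0.0.1:9090", timeout=10):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Building the request has no side effects; call reload_prometheus() on the
# Prometheus host after editing prometheus.yml or a rule file.
request = build_reload_request()
print(request.get_method(), request.full_url)
```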
Configuration file prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    region: cn-north-1
    # monitor: infrastructure
    replica: A
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.srv.daling.com:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/opt/prometheus/rules/app/*.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'minio'
    metrics_path: '/minio/prometheus/metrics'
    honor_labels: true
    static_configs:
      - targets: ['minio.corp.daling.com']
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.corp.daling.com']
Rule example 1: rules/app/http.yml
groups:
  - name: http_alert
    rules:
      - alert: http_exception
        expr: ceil(increase(http_server_requests_seconds_count{env="prod", status!="200", exception!="None"}[1m])) > 0
        for: 15s
        labels:
          severity: error
        annotations:
          summary: "【{{$labels.uri}}】 raised 【{{$value}}】 exceptions"
          description: "【{{$labels.uri}}】 raised 【{{$value}}】 exceptions"
      - alert: http_qps_increase
        expr: (ceil((sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m])) / sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m] offset 1m)))) > 10) and (sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m])) > 1) and (sum without (instance) (rate(http_server_requests_seconds_count{env="prod",exception="None",status="200"}[1m] offset 1m)) > 1)
        for: 15s
        labels:
          severity: warn
        annotations:
          summary: "【{{$labels.uri}}】 traffic increased by a factor of 【{{$value}}】"
          description: "【{{$labels.uri}}】 traffic increased by a factor of 【{{$value}}】"
      - alert: mybatis_low_operate
        expr: ceil((sum(rate(daling_mybatis_requests_seconds_sum{env="prod",status="success"}[1m])) without (instance) / sum(rate(daling_mybatis_requests_seconds_count{env="prod",status="success"}[1m])) without (instance)) * 1000) > 500
        for: 15s
        labels:
          severity: warn
        annotations:
          summary: "【{{$labels.name}}】 execution took 【{{$value}}】 ms"
          description: "【{{$labels.name}}】 execution took 【{{$value}}】 ms"
      - alert: hystrix_error_event
        expr: ceil(increase(hystrix_execution_total{event!="success"}[1m])) > 0
        for: 15s
        labels:
          severity: error
        annotations:
          summary: "【{{$labels.key}}】 circuit breaking occurred 【{{$value}}】 times"
          description: "【{{$labels.key}}】 circuit breaking occurred 【{{$value}}】 times"
      - alert: hystrix_circuit_breaker_open
        expr: sum(hystrix_circuit_breaker_open) without (instance) > 0
        for: 15s
        labels:
          severity: error
        annotations:
          summary: "【{{$labels.key}}】 circuit breaker opened 【{{$value}}】 times"
          description: "【{{$labels.key}}】 circuit breaker opened 【{{$value}}】 times"
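The `http_qps_increase` expression above combines three conditions: the rounded-up ratio of the current request rate to the rate one minute earlier must exceed 10, and both rates must exceed 1 request per second so that near-idle endpoints do not trigger it. A worked sketch of that logic in plain arithmetic, with illustrative function and parameter names; it ignores the `for: 15s` hold time and the per-label grouping done by PromQL.

```python
import math


def qps_increase_fires(rate_now, rate_prev, ratio_threshold=10, min_rate=1):
    """Mirror the three 'and'-ed conditions of the http_qps_increase rule."""
    if rate_prev <= 0:
        # A zero previous rate can never satisfy rate_prev > min_rate.
        return False
    ratio = math.ceil(rate_now / rate_prev)
    return ratio > ratio_threshold and rate_now > min_rate and rate_prev > min_rate


print(qps_increase_fires(25, 2))    # ratio 12.5 rounds up to 13
print(qps_increase_fires(20, 2))    # ratio is exactly 10, not above the threshold
print(qps_increase_fires(20, 0.5))  # large ratio, but the previous rate is too low
```

Because `ceil` is applied before the comparison, any ratio strictly above 10 fires the rule, while a ratio of exactly 10 does not.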
Rule example 2: rules/app/minio.yml
groups:
  - name: minio_alert
    rules:
      - alert: minio_exception
        expr: minio_offline_disks > 0
        for: 1m
        labels:
        annotations:
          summary: "{{$labels.job}} in {{$labels.instance}}"
          description: "{{$labels.job}} in {{$labels.instance}}"
Rule example 3: rules/sys/linux_sys.yaml
groups:
  - name: node
    rules:
      - alert: server_status
        expr: up{job="prometheus"} == 0
        for: 15s
        annotations:
          summary: " {{ $labels.instance }} "
          description: "Host {{ $labels.instance }} is down"
      - alert: Memory Usage
        expr: ceil(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100) < 4
        for: 60s
        annotations:
          summary: " {{ $labels.hostname }} "
          description: "Available host memory is below 10%."
          value: "{{ $value }}%"
      - alert: CPU Usage
        expr: ceil(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (hostname)*100) < 10
        for: 300s
        annotations:
          summary: " {{ $labels.hostname }} "
          description: "Host CPU idle rate is below 10%."
          value: "{{ $value }}%"
      - alert: System Load
        expr: ceil((sum(node_load5)by(hostname))/(count(node_cpu_seconds_total{mode="idle"})by(hostname)) * 100) > 400
        for: 300s
        annotations:
          summary: " {{ $labels.hostname }} "
          description: "The host is running at full load."
          value: "{{ $value }}%"
      - alert: Disk Free
        expr: ceil((node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100) < 10
        for: 60s
        annotations:
          summary: " {{ $labels.hostname }} "
          description: "Partition {{ $labels.mountpoint }} is low on free space"
          value: "{{ $value }}%"
      - alert: Disk IO
        expr: ceil(sum(irate(node_disk_writes_completed_total[5m])) by (hostname)) > 2500
        for: 5m
        annotations:
          summary: " {{ $labels.hostname }} "
          description: "The host's 1-minute average disk write IO load is high"
          value: "{{ $value }}iops"
      - alert: Disk IO
        expr: ceil(sum(irate(node_disk_reads_completed_total[5m])) by (hostname)) > 2500
        for: 5m
        annotations:
          summary: " {{ $labels.hostname }} "
          description: "The host's 1-minute average disk read IO load is high"
          value: "{{ $value }}iops"
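The `System Load` expression above divides the 5-minute load average by the number of CPU cores (one `mode="idle"` series exists per core), converts it to a percentage, and fires above 400, that is, when the load exceeds four times the core count. A worked sketch of that arithmetic, assuming the 8-core hosts from the host plan; the function names are illustrative, and the `for: 300s` hold time is not modeled.

```python
import math


def load_percent(load5, cpu_cores):
    """Load average per core as a rounded-up percentage, as in the rule."""
    return math.ceil(load5 / cpu_cores * 100)


def system_load_fires(load5, cpu_cores, threshold=400):
    """True when the per-core load percentage exceeds the rule threshold."""
    return load_percent(load5, cpu_cores) > threshold


print(load_percent(33, 8), system_load_fires(33, 8))
print(load_percent(32, 8), system_load_fires(32, 8))
```

On an 8-core host the rule therefore needs a sustained 5-minute load above 32 before it can fire.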
pushgateway
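Software notes:
Software: written in Go; a single executable file
Management: parameters are specified at startup
UI: has a web UI without access control
Persistent data: yes; the directory it is written to must be configured separately
Environment notes:
Binary: installed in /usr/local/bin
UI address: pushgateway.corp.daling.com
Persistent directory: /data/pushgateway
supervisor startup file pushgateway.conf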
[program:pushgateway]
user=root
directory=/
command=pushgateway --persistence.file="/data/pushgateway/data" --persistence.interval=24h
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
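Batch jobs deliver metrics to pushgateway by sending the text exposition format to `/metrics/job/<job>`; Prometheus then scrapes pushgateway like any other target. The sketch below only builds such a request with the standard library. The `http://` scheme on the address from this document, the job name `backup`, and the metric name are assumptions for illustration.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url, job, metrics):
    """Prepare a POST that pushes simple untyped samples for one job."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical job and metric; send with urllib.request.urlopen(request).
request = build_push_request(
    "http://pushgateway.corp.daling.com",
    "backup",
    {"backup_last_success_timestamp_seconds": 1700000000},
)
print(request.get_method(), request.full_url)
```

POST replaces only the pushed metric names within the job's group, whereas PUT replaces the whole group, so POST is the gentler choice when several scripts share one job name.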
alertmanager
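Software notes:
Software: written in Go; a single executable file
Management: has its own configuration, split across several files; parameters are specified at startup
UI: has a web UI without access control
Persistent data: yes; the directory it is written to must be configured separately
Environment notes:
Binary: installed in /usr/local/bin
UI address: alertmanager.corp.daling.com
Software root directory: /opt/alertmanager
Persistent directory: /data/alertmanager
Alert channel: requires a separately configured service (which can run on another machine)
supervisor startup file alertmanager.conf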
[program:alertmanager]
user=root
directory=/
command=alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/data/alertmanager
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True
alertmanager.yml configuration file
global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 10m
  receiver: 'test_hook'
  routes:
    - receiver: http_qps_increase
      group_by: ['alertname', 'job', 'uri']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 10m
      match_re:
        alertname: 'http_qps_increase'
        replica: 'A'
    - receiver: http_exception
      group_by: ['alertname', 'job', 'uri', 'exception']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 10m
      match_re:
        alertname: 'http_exception'
        replica: 'A'
    - receiver: mybatis_low_operate
      group_by: ['alertname', 'job', 'name']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 10m
      match_re:
        alertname: 'mybatis_low_operate'
        replica: 'A'
    - receiver: hystrix_error_event
      group_by: ['alertname', 'job', 'key', 'event']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 10m
      match_re:
        alertname: 'hystrix_error_event'
        replica: 'A'
    - receiver: hystrix_circuit_breaker_open
      group_by: ['alertname', 'job', 'key', 'group']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 10m
      match_re:
        alertname: 'hystrix_circuit_breaker_open'
        replica: 'A'
    - receiver: minio_hook
      group_wait: 10s
      match_re:
        alertname: 'minio_exception'
receivers:
  - name: 'http_qps_increase'
    webhook_configs:
      - url: 'http://sgp.srv.daling.com/alert/push/http_qps_increase'
  - name: 'http_exception'
    webhook_configs:
      - url: 'http://sgp.srv.daling.com/alert/push/http_exception'
  - name: 'mybatis_low_operate'
    webhook_configs:
      - url: 'http://sgp.srv.daling.com/alert/push/mybatis_low_operate'
  - name: 'hystrix_error_event'
    webhook_configs:
      - url: 'http://sgp.srv.daling.com/alert/push/hystrix_error_event'
  - name: 'hystrix_circuit_breaker_open'
    webhook_configs:
      - url: 'http://sgp.srv.daling.com/alert/push/hystrix_circuit_breaker_open'
  - name: 'minio_hook'
    webhook_configs:
      - url: 'http://127.0.0.1:8060/dingtalk/minio_dingding/send'
  - name: 'test_hook'
    webhook_configs:
      - url: 'http://10.36.35.128:8520/alert/push/prometheus'
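The routing tree above sends an alert to the first child route whose `match_re` patterns all match its labels, and falls back to the top-level receiver `test_hook` otherwise. The sketch below models only that selection step, assuming fully anchored regular expressions and treating a missing label as an empty string; grouping, timing, and the `continue` option are left out, and the replica value "B" in the usage lines is an assumption.

```python
import re

# (receiver, match_re) pairs in the order they appear in alertmanager.yml.
ROUTES = [
    ("http_qps_increase", {"alertname": "http_qps_increase", "replica": "A"}),
    ("http_exception", {"alertname": "http_exception", "replica": "A"}),
    ("mybatis_low_operate", {"alertname": "mybatis_low_operate", "replica": "A"}),
    ("hystrix_error_event", {"alertname": "hystrix_error_event", "replica": "A"}),
    ("hystrix_circuit_breaker_open", {"alertname": "hystrix_circuit_breaker_open", "replica": "A"}),
    ("minio_hook", {"alertname": "minio_exception"}),
]
DEFAULT_RECEIVER = "test_hook"


def pick_receiver(labels):
    """Return the receiver of the first route whose patterns all match."""
    for receiver, matchers in ROUTES:
        if all(re.fullmatch(pattern, labels.get(name, "")) for name, pattern in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "http_exception", "replica": "A"}))
print(pick_receiver({"alertname": "http_exception", "replica": "B"}))
print(pick_receiver({"alertname": "minio_exception", "replica": "B"}))
```

One consequence of the `replica: 'A'` matchers is that application alerts raised only by the other Prometheus replica fall through to `test_hook`, while `minio_exception` reaches `minio_hook` from either replica.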
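Separately configured alerting service: DingTalk alerts via webhook-dingtalk
Service notes:
Software: written in Go; a single executable file
Management: parameters are specified at startup (the API address of the DingTalk robot)
supervisor startup file webhook-dingtalk.conf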
[program:webhook-dingtalk]
user=root
directory=/
command=prometheus-webhook-dingtalk --ding.profile=alert_dingding=https://oapi.dingtalk.com/robot/send?access_token=99f5f1477e192a118ac65b0115ff807fdd0d3dbf0fe8ed99120e1b873e24232c
autostart=True
autorestart=True
redirect_stderr=True
stopsignal=INT
stopasgroup=True