I. Overview
1. Current pain points
1. Production, development, and test environments are currently scattered across OpenStack and assorted physical machines. Problems are not detected proactively; we usually only learn that a server has failed when the business runs into it, and only then start troubleshooting, so monitoring never gets ahead of the business.
2. Whenever a department or test team needs server performance data, it has to build its own monitoring platform, so metrics are not unified and effort is duplicated.
2. Advantages of Prometheus
1. A multi-dimensional data model (time series of key/value pairs)
2. A flexible query language (PromQL)
3. No reliance on distributed storage; single server nodes are autonomous
4. Time series are collected via a pull model over HTTP
5. Push can be achieved via the Pushgateway (an optional Prometheus component)
3. What do monitoring and alerting solve?
1. Ops staff and service owners find out about problems first instead of waiting for the business to report them, and to some extent issues can be fixed before the business even notices.
2. During departmental testing, the pressure on individual server components and the overall server load can be watched in real time.
Querying the Prometheus API
Total cluster CPU:
curl http://192.1.3.107:32074/api/v1/query?query=sum(kube_node_status_capacity{resource="cpu"})
However, the query string contains characters that are not URL-safe; they must be percent-encoded. For example, total cluster memory:
curl http://192.1.3.107:32074/api/v1/query?query=sum%28kube_node_status_capacity%7Bresource%3D%22memory%22%7D%29
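Alternatively, curl can do the encoding itself with --data-urlencode, which avoids hand-escaping the PromQL:
curl -G http://192.1.3.107:32074/api/v1/query --data-urlencode 'query=sum(kube_node_status_capacity{resource="memory"})'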
II. Monitoring Deployment
Reference documentation
Since some servers do not have Docker installed, binaries are used here (a Docker deployment gives the same result).
Chinese documentation: https://songjiayang.gitbooks.io/prometheus/content/
Download the Prometheus components: https://prometheus.io/download/
Prometheus deployment
wget https://github.com/prometheus/prometheus/releases/download/v2.36.1/prometheus-2.36.1.linux-amd64.tar.gz
tar xf prometheus-2.36.1.linux-amd64.tar.gz
mv prometheus-2.36.1.linux-amd64 /usr/local/prometheus-2.36.1.linux-amd64
ln -s /usr/local/prometheus-2.36.1.linux-amd64 /usr/local/prometheus
mkdir -p /data/prometheus
vim /usr/lib/systemd/system/prometheus.service # create the systemd unit
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=root
Group=root
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention.time=30d \
--web.console.libraries=/usr/local/prometheus/console_libraries \
--web.console.templates=/usr/local/prometheus/consoles \
--web.listen-address=0.0.0.0:9090 \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always
[Install]
WantedBy=multi-user.target
# start Prometheus
systemctl daemon-reload
systemctl enable prometheus && systemctl start prometheus
netstat -alntp | grep 9090
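Since --web.enable-lifecycle is set in the unit above, Prometheus can be health-checked and hot-reloaded over HTTP:
curl -s http://localhost:9090/-/healthy
curl -X POST http://localhost:9090/-/reload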
node_exporter deployment
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64 /usr/local/
ln -s /usr/local/node_exporter-1.3.1.linux-amd64/ /usr/local/node_exporter
vim /usr/lib/systemd/system/node_exporter.service # create the systemd unit
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/node_exporter/node_exporter \
--web.listen-address=0.0.0.0:9100 \
--web.telemetry-path=/metrics \
--log.level=info \
--log.format=logfmt
Restart=always
[Install]
WantedBy=multi-user.target
# start node_exporter
systemctl daemon-reload
systemctl enable node_exporter && systemctl start node_exporter
netstat -alntp | grep node_export
Optional startup flags for changing the listen port and disabling unneeded collectors (append to ExecStart as required):
--web.listen-address=0.0.0.0:32760 --no-collector.hwmon --no-collector.nfs --no-collector.nfsd --no-collector.nvme --no-collector.dmi --no-collector.arp --collector.filesystem.ignored-mount-points="^/(dev|proc|sys|var/lib/containerd/.+|/var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)" --collector.filesystem.ignored-fs-types="^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$"
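A quick check that metrics are being served (default port 9100, or 32760 with the flags above):
curl -s http://localhost:9100/metrics | head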
Pushgateway
Reference: https://juejin.cn/post/7233377767466926117
(The Pushgateway binary itself is deployed in the Flink section below.)
Alertmanager deployment
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64 /usr/local/
ln -s /usr/local/alertmanager-0.24.0.linux-amd64 /usr/local/alertmanager
mkdir -p /data/alertmanager
# create the systemd unit
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager handles alerts sent by client applications such as the Prometheus server
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network.target
[Service]
User=root
Group=root
ExecStart=/usr/local/alertmanager/alertmanager \
--config.file=/usr/local/alertmanager/alertmanager.yml \
--storage.path=/data/alertmanager
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable alertmanager && systemctl start alertmanager
netstat -lntp | grep alertmanager
Alertmanager configuration (/usr/local/alertmanager/alertmanager.yml):
global:
  resolve_timeout: 5m # how long to wait before marking an alert resolved; default 5m
# routing tree
route:
  group_by: ['alertname'] # label(s) used to group alerts
  receiver: 'dingding.webhook1'
  group_wait: 30s # how long to wait before sending the first notification for a new group
  group_interval: 30s # how long to wait before notifying about new alerts added to a group
  repeat_interval: 1h # how long before a notification is repeated; default 1h
  routes:
  - receiver: 'dingding.webhook1'
    group_wait: 30s
    match_re:
      alertname: '实例存活告警|磁盘使用率告警' # routed by the alert names defined in the rules
  - receiver: 'dingding.webhook.all'
    group_wait: 30s
    match_re:
      alertname: '内存使用率告警|CPU使用率告警'
# base receivers
receivers:
- name: 'dingding.webhook1'
  webhook_configs:
  - url: 'http://cdh1:8060/dingtalk/webhook1/send'
    send_resolved: true # also notify when an alert is resolved
# additional receiver
- name: 'dingding.webhook.all'
  webhook_configs:
  - url: 'http://cdh1:8060/dingtalk/webhook1/send'
    send_resolved: true
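amtool, shipped alongside alertmanager, can validate the file after edits:
/usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml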
prometheus-webhook-dingtalk deployment
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/
ln -s /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/dingtalk
cd /usr/local/dingtalk
# create the systemd unit
vim /usr/lib/systemd/system/dingtalk.service
[Unit]
Description=https://github.com/timonwong/prometheus-webhook-dingtalk/releases/
After=network-online.target
[Service]
Restart=on-failure
ExecStart=/usr/local/dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/dingtalk/config.yml
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable dingtalk && systemctl start dingtalk
In the DingTalk group, add a robot via Group Assistant >> Custom Robot (智能群助手 >> 自定义机器人) and enable the "sign" (加签) option: copy the signing secret into the secret field below. Once created, copy the robot's webhook URL into config.yml:
webhook1:
url: https://oapi.dingtalk.com/robot/send?access_token=4de703c270384a7b4702d31b1ab2b32cd7807af52093c62040729148770eaf7d
# secret for signature
secret: SEC8a5ac95d220633aa76ced0186dce637382fea88c2b86cf2cf68ec3a8e999a448
webhook2:
url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
webhook_legacy:
url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
# Customize template content
message:
# Use legacy template
title: '{{ template "legacy.title" . }}'
text: '{{ template "legacy.content" . }}'
webhook_mention_all:
url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
mention:
all: true
webhook_mention_users:
url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
mention:
mobiles: ['156xxxx8827', '189xxxx8325']
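A rough smoke test of the running service with a minimal Alertmanager-style payload (default listen address :8060; the alert content here is made up):
curl -H 'Content-Type: application/json' -d '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"test"},"annotations":{"description":"hello from webhook test"}}]}' http://localhost:8060/dingtalk/webhook1/send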
Flink monitoring deployment
Add the PrometheusPushGatewayReporter settings to flink-conf.yaml:
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: cdh1
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: Job
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
#metrics.reporter.promgateway.groupingKey: k1=v1;k2=v2
metrics.reporter.promgateway.interval: 15 SECONDS
Deploy the Pushgateway that the reporter pushes to:
wget https://github.com/prometheus/pushgateway/releases/download/v1.4.3/pushgateway-1.4.3.linux-amd64.tar.gz
tar -zxvf pushgateway-1.4.3.linux-amd64.tar.gz
mv pushgateway-1.4.3.linux-amd64 /usr/local/
ln -s /usr/local/pushgateway-1.4.3.linux-amd64 /usr/local/pushgateway
cd /usr/local/pushgateway
nohup ./pushgateway > ./pushgateway.log 2>&1 &
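Confirm the gateway accepts pushes (demo_metric is an arbitrary test metric):
echo 'demo_metric 42' | curl --data-binary @- http://localhost:9091/metrics/job/demo
curl -s http://localhost:9091/metrics | grep demo_metric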
Kafka monitoring deployment
wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.4.2/kafka_exporter-1.4.2.linux-amd64.tar.gz
tar -zxvf kafka_exporter-1.4.2.linux-amd64.tar.gz
mv kafka_exporter-1.4.2.linux-amd64 /usr/local
ln -s /usr/local/kafka_exporter-1.4.2.linux-amd64 /usr/local/kafka_exporter
cd /usr/local/kafka_exporter
nohup ./kafka_exporter --kafka.server=10.30.6.67:9092 > ./kafka_exporter.log 2>&1 &
Or run it as a systemd service (paths below are from a different host):
cat /usr/lib/systemd/system/kafka_exporter.service
[Unit]
Description=kafka_exporter
After=network.target
[Service]
Type=simple
WorkingDirectory=/mnt/prometh/kafka_exporter-1.4.2.linux-amd64/
ExecStart=/mnt/prometh/kafka_exporter-1.4.2.linux-amd64/kafka_exporter --kafka.server=10.30.6.70:9092
LimitNOFILE=65536
PrivateTmp=true
RestartSec=2
StartLimitInterval=0
Restart=always
[Install]
WantedBy=multi-user.target
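kafka_exporter listens on :9308 by default; verify with:
curl -s http://localhost:9308/metrics | grep -m 5 '^kafka_'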
Elasticsearch monitoring deployment
wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v1.3.0/elasticsearch_exporter-1.3.0.linux-amd64.tar.gz
tar -zxvf elasticsearch_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/
ln -s /usr/local/elasticsearch_exporter-1.3.0.linux-amd64 /usr/local/elasticsearch_exporter
cd /usr/local/elasticsearch_exporter
nohup ./elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots --es.timeout=10s --web.listen-address=:9114 --web.telemetry-path=/metrics --es.uri http://elastic:Bl666666@10.30.6.70:9200 > ./es_exporter.log 2>&1 &
Or as a systemd service (paths below are from a different host):
cat /usr/lib/systemd/system/elasticsearch_exporter.service
[Unit]
Description=elasticsearch_exporter
After=network.target
[Service]
Type=simple
WorkingDirectory=/root/elasticsearch_exporter-1.3.0.linux-amd64/
ExecStart=/root/elasticsearch_exporter-1.3.0.linux-amd64/elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots --es.timeout=10s --web.listen-address=:9114 --web.telemetry-path=/metrics --es.uri http://elastic:Bl666666@10.30.6.70:9200
LimitNOFILE=65536
PrivateTmp=true
RestartSec=2
StartLimitInterval=0
Restart=always
[Install]
WantedBy=multi-user.target
PostgreSQL monitoring deployment
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.10.1/postgres_exporter-0.10.1.linux-amd64.tar.gz
tar -zxvf postgres_exporter-0.10.1.linux-amd64.tar.gz -C /usr/local/
ln -s /usr/local/postgres_exporter-0.10.1.linux-amd64 /usr/local/postgres_exporter
export DATA_SOURCE_NAME="postgresql://bolean:Bl666666@127.0.0.1:5432/postgres?sslmode=disable"
cd /usr/local/postgres_exporter
nohup ./postgres_exporter > ./pg_exporter.log 2>&1 &
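To survive reboots, postgres_exporter can get a unit like the others; a sketch assuming the paths above (the DSN goes into Environment so systemd exports it):
[Unit]
Description=postgres_exporter
After=network.target
[Service]
Type=simple
Environment=DATA_SOURCE_NAME=postgresql://bolean:Bl666666@127.0.0.1:5432/postgres?sslmode=disable
ExecStart=/usr/local/postgres_exporter/postgres_exporter
Restart=always
[Install]
WantedBy=multi-user.target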
Redis monitoring deployment
wget https://github.com/oliver006/redis_exporter/releases/download/v1.41.0/redis_exporter-v1.41.0.linux-amd64.tar.gz
tar -zxvf redis_exporter-v1.41.0.linux-amd64.tar.gz -C /usr/local/
ln -s /usr/local/redis_exporter-v1.41.0.linux-amd64 /usr/local/redis_exporter
cd /usr/local/redis_exporter
nohup ./redis_exporter -redis.addr 127.0.0.1:6379 -redis.password Bl666666 > ./redis_exporter.log 2>&1 &
Docker deployment
docker pull oliver006/redis_exporter
docker run -d --restart="always" --name redis_exporter -p 9121:9121 oliver006/redis_exporter --redis.addr redis://192.168.0.41:6379 --redis.password 'STUdio2022_linker'
# cat redis-export.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-exporter
labels:
app: redis-exporter
spec:
replicas: 1
selector:
matchLabels:
app: redis-exporter
template:
metadata:
labels:
app: redis-exporter
spec:
containers:
- name: redis-exporter
image: fvhb.fjecloud.com/xsgy/redis_exporter
imagePullPolicy: IfNotPresent
        # add the Redis connection settings here, e.g. address and password
        # when monitoring a Redis outside the k8s cluster, the redis.addr value needs the redis:// prefix, as in the commented example below
        # args: ["-redis.addr", "redis://10.128.27.22:6379", "-redis.password", "123456@redis"]
args: ["-redis.addr", "redis:6379", "-redis.password", "Ff1z@TOFr^iwd%Ra"]
ports:
- containerPort: 9121
---
apiVersion: v1
kind: Service
metadata:
labels:
app: redis-exporter
name: redis-exporter
spec:
type: NodePort
ports:
- name: metrics
port: 9121
protocol: TCP
targetPort: 9121
nodePort: 32763
selector:
app: redis-exporter
Grafana dashboard: import ID 9338
MySQL monitoring deployment
docker run -d -p 9104:9104 --restart="always" --name mysql_exporter -e DATA_SOURCE_NAME="cri:STUdio2022_linker@(192.168.0.41:3306)/" prom/mysqld-exporter
docker run -d -p 9104:9104 --restart="always" --name mysql_exporter -e DATA_SOURCE_NAME="guest:guest@123@(10.8.15.25:3306)/" prom/mysqld-exporter # note: the extra '@' inside this password makes the DSN ambiguous and may fail to parse
Docker Compose deployment
version: '2.3'
services:
mysql_exporter:
image: prom/mysqld-exporter
container_name: mysql_exporter
restart: always
ports:
- 9104:9104
environment:
      - DATA_SOURCE_NAME=cri:STUdio2022_linker@(192.168.0.41:3306)/ # no quotes here: in a compose env list they would become part of the value
volumes:
- /etc/hosts:/etc/hosts
networks:
- vos-exporter
redis_exporter:
image: oliver006/redis_exporter
container_name: redis_exporter
restart: always
ports:
- 9121:9121
command:
- "-redis.password-file=/redis_passwd.json"
volumes:
- /etc/hosts:/etc/hosts
- ./redis_passwd.json:/redis_passwd.json
networks:
- vos-exporter
networks:
vos-exporter:
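The redis_passwd.json mounted above maps each Redis URI to its password; a sketch using the address and password from the docker run example:
{
  "redis://192.168.0.41:6379": "STUdio2022_linker"
}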
K8s deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysqld-exporter
labels:
app: mysqld-exporter
spec:
selector:
matchLabels:
app: mysqld-exporter
template:
metadata:
labels:
app: mysqld-exporter
spec:
containers:
- name: mysqld-exporter
image: prom/mysqld-exporter
env:
- name: DATA_SOURCE_NAME
value: 'readonly:vMjef!3iW2hjv9SK@(mysql:3306)/'
ports:
- containerPort: 9104
name: http
---
apiVersion: v1
kind: Service
metadata:
name: mysqld-exporter
labels:
app: mysqld-exporter
spec:
selector:
app: mysqld-exporter
type: NodePort
ports:
- port: 9104
targetPort: 9104
nodePort: 32765
RocketMQ monitoring deployment
# cat rockemq-export.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: rocketmq-export
spec:
replicas: 1
selector:
matchLabels:
app: rocketmq-export
template:
metadata:
labels:
app: rocketmq-export
spec:
containers:
- name: rocketmq-export
#image: fvhb.fjecloud.com/xsgy/a009_msrcnn:v1.0.18_encrypted
image: slpcat/rocketmq-exporter:latest
command: ["sh", "-c", "java $JAVA_OPTS -jar rocketmq-exporter-0.0.2-SNAPSHOT.jar --rocketmq.config.rocketmqVersion=V4_5_1 --rocketmq.config.namesrvAddr=rmqnamesrv.default:9876"]
#args: ["--rocketmq.config.namesrvAddr=rmqnamesrv.default:9876"]
ports:
- containerPort: 5557
---
apiVersion: v1
kind: Service
metadata:
name: rocketmq-export
labels:
app: rocketmq-export
spec:
selector:
app: rocketmq-export
type: NodePort
ports:
- port: 5557
targetPort: 5557
nodePort: 30134
MongoDB monitoring deployment
docker run -d --network=host --restart="always" --name=mongodb_export -e MONGODB_URI='mongodb://cri:STUdio2022_linker@192.168.0.41:27017/?authSource=cri' bitnami/mongodb-exporter:latest --collect-all --web.listen-address=":9216"
# cat /srv/mongodb-export.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
    k8s-app: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended
  name: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended
spec:
replicas: 1
selector:
matchLabels:
      k8s-app: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended
template:
metadata:
labels:
        k8s-app: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended
spec:
containers:
- args:
        - --collect.database # collect Database metrics
        - --collect.collection # collect Collection metrics
        - --collect.topmetrics # collect table top metrics
        - --collect.indexusage # collect per-index usage stats
        - --collect.connpoolstats # collect MongoDB connpoolstats
env:
- name: MONGODB_URI
valueFrom:
secretKeyRef:
name: mongodb-secret-test
key: datasource
image: ccr.ccs.tencentyun.com/rig-agent/mongodb-exporter:0.10.0
imagePullPolicy: IfNotPresent
name: mongodb-exporter
ports:
- containerPort: 9216
          name: metric-port # this name is referenced when configuring the scrape job
securityContext:
privileged: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: qcloudregistrykey
restartPolicy: Always
schedulerName: default-scheduler
securityContext: { }
terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Secret
metadata:
name: mongodb-secret-test
type: Opaque
stringData:
  datasource: "mongodb://cri:STUdio2022_linker@192.168.0.41:27017/admin" # the connection URI
---
apiVersion: v1
kind: Service
metadata:
name: mongodb-export
labels:
k8s-app: mongodb-exporter
spec:
selector:
k8s-app: mongodb-exporter
type: NodePort
ports:
- port: 9216
targetPort: 9216
nodePort: 30334
Grafana dashboards: import IDs 12079 and 7353, used together.
OpenStack monitoring deployment
Credentials file (admin.novarc):
OS_PROJECT_DOMAIN_NAME=Default
OS_USER_DOMAIN_NAME=Default
OS_PROJECT_NAME=admin
OS_USERNAME=admin
OS_PASSWORD=Bl666666
OS_IDENTITY_API_VERSION=3
OS_AUTH_URL=http://127.0.0.1/identity/v3
docker run -itd \
  --name=openstack -p 9183:9183 \
  --env-file=$(pwd)/admin.novarc \
  --restart=unless-stopped moghaddas/prom-openstack-exporter
# or deploy on a physical host
wget https://github.com/canonical/prometheus-openstack-exporter/archive/refs/tags/0.1.6.tar.gz
sudo apt-get install python-neutronclient python-novaclient python-keystoneclient python-netaddr python-cinderclient
apt-get install python-prometheus-client
# Copy example config in place, edit to your needs
sudo cp prometheus-openstack-exporter.yaml /etc/prometheus/
. /path/to/admin-novarc
./prometheus-openstack-exporter prometheus-openstack-exporter.yaml
ClickHouse monitoring deployment
ClickHouse ships with a built-in metrics endpoint on port 9363 by default; container deployments need this port exposed. If changing the exposed ports of an existing deployment is not convenient, clickhouse_exporter (https://github.com/ClickHouse/clickhouse_exporter) can be deployed instead:
docker pull hotwifi/clickhouse_exporter:latest
docker run -d -p 9116:9116 tkroman/clickhouse_exporter_fresh -scrape_uri=http://bolean:Bl666666@192.168.0.94:8123/
StarRocks monitoring deployment
StarRocks ships with an agent that collects metrics from every host and is Prometheus-compatible, so it can be scraped directly.
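Both endpoints are plain scrape targets; file-SD entries matching the targets/*.yml layout from section III (file names and hosts illustrative; the StarRocks FE serves /metrics on its HTTP port, 8030 by default):
# targets/clickhouse/ch.yml
- targets: ['192.168.0.94:9363']
# targets/starrocks/sr.yml
- targets: ['<fe_host>:8030']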
GPU monitoring
# cat vgpu/gpu-export.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-export
namespace: kube-mon
spec:
replicas: 1
selector:
matchLabels:
app: gpu-export
template:
metadata:
labels:
app: gpu-export
spec:
containers:
- name: gpu-export
image: hzlh-registry.cn-hangzhou.cr.aliyuncs.com/vos/dcgm-exporter:3.1.6-3.1.3-ubuntu20.04
ports:
- containerPort: 9400
hostNetwork: true
API endpoint monitoring (blackbox_exporter)
# cat configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
labels:
app: blackbox-exporter
name: blackbox-exporter
namespace: kube-mon
data:
blackbox.yml: |-
modules:
http_2xx:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: [200,301,302]
method: GET
preferred_ip_protocol: "ip4"
tcp_connect:
prober: tcp
timeout: 10s
---
# cat deployment.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
name: blackbox-exporter
namespace: kube-mon
labels:
app: blackbox-exporter
#annotations:
#deployment.kubernetes.io/revision: 1
spec:
replicas: 1
selector:
matchLabels:
app: blackbox-exporter
template:
metadata:
labels:
app: blackbox-exporter
spec:
volumes:
- name: config
configMap:
name: blackbox-exporter
defaultMode: 420
containers:
- name: blackbox-exporter
image: prom/blackbox-exporter:v0.23.0
imagePullPolicy: IfNotPresent
args:
- --config.file=/etc/blackbox_exporter/blackbox.yml
- --log.level=info
- --web.listen-address=:9115
ports:
- name: blackbox-port
containerPort: 9115
protocol: TCP
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 100m
memory: 50Mi
volumeMounts:
- name: config
mountPath: /etc/blackbox_exporter
readinessProbe:
tcpSocket:
port: 9115
initialDelaySeconds: 5
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
---
Prometheus configuration:
- job_name: 'vos'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets: ['http://192.1.2.238:8317/vql/v1/health/ping']
labels:
instance: 192.1.2.238:8317
project: vos
group: web
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: 192.1.3.107:31004
Grafana dashboard: import ID 7587
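The probe endpoint can also be hit directly through the exporter's NodePort from the relabel config above:
curl 'http://192.1.3.107:31004/probe?target=http://192.1.2.238:8317/vql/v1/health/ping&module=http_2xx'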
VMware monitoring
docker run -d -p 9272:9272 -e VSPHERE_USER=lv_pingping@vsphere.local -e VSPHERE_PASSWORD=ISq59Pg#T2XHaU -e VSPHERE_HOST=192.1.5.200 -e VSPHERE_IGNORE_SSL=True -e VSPHERE_SPECS_SIZE=2000 --name vmware_exporter pryorda/vmware_exporter
K8s monitoring
The kube-state-metrics tarball is in this directory; apply the standard manifests:
kubectl apply -f kube-state-metrics/examples/standard/
Grafana dashboards: import IDs 15661 and 13105
https://grafana.com/grafana/dashboards/15661-1-k8s-for-prometheus-dashboard-20211010/
Alert for pods whose containers are not ready: min_over_time(kube_pod_container_status_ready{instance=~"10.8.22.123:32509",pod!~"rke-.*"}[1m]) != 1
III. Monitoring Configuration and Service Discovery
File-based service discovery
Prometheus re-reads file-SD target files automatically, so adding a host needs no restart. Example targets file (e.g. targets/node/nodes.yml):
- targets: ['10.30.6.70:9100']
labels:
project: 'situation-awareness'
person: 'lvpingping'
vm: 'true'
- targets: ['10.30.6.67:9100']
labels:
project: 'situation-awareness'
person: 'lvpingping'
vm: 'false'
Main prometheus.yml:
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- cdh1:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "/mnt/prometh/prometheus-2.35.0.linux-amd64/conf/rule*.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/node/*.yml
- job_name: "flink exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/flink/*.yml
- job_name: "kafka exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/kafka/*.yml
- job_name: "es exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/es/*.yml
- job_name: "postgresql exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/postgresql/*.yml
- job_name: "redis exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/redis/*.yml
- job_name: "openstack exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/openstack/*.yml
- job_name: "clickhouse exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/clickhouse/*.yml
- job_name: "starrocks exporter"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/starrocks/*.yml
- job_name: "pushgateway"
scrape_interval: 10s
file_sd_configs:
- files:
- targets/pushgateway/*.yml
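Validate and reload after any change:
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload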
Consul-based service discovery
wget https://releases.hashicorp.com/consul/1.12.2/consul_1.12.2_linux_amd64.zip
unzip consul_1.12.2_linux_amd64.zip
cp consul /usr/bin
consul version
# create the systemd unit
mkdir -p /var/log/consul
vim /usr/lib/systemd/system/consul.service
[Unit]
Description=Consul
After=network.target
[Service]
ExecStart=/usr/bin/consul agent -dev -ui -client 0.0.0.0 -log-file=/var/log/consul/
KillSignal=SIGINT
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable consul && systemctl start consul
ss -antlp | grep consul
# or run in a container
docker run --name consul -d -p 8500:8500 consul
# register a service
curl -X PUT -d '{
"ID": "node-exporter-10-30-6-67",
"Name": "node-exporter-10-30-6-67",
"Tags": [
"node"
],
"Address": "10.30.6.67",
"Port": 9100,
"Meta": {
"vm": "fales",
"app": "node",
"person": "lvpingping",
"project": "sa"
},
"EnableTagOverride": false,
"Check": {
"HTTP": "http://10.30.6.67:9100/metrics",
"Interval": "10s"
}
}' http://10.20.0.91:8500/v1/agent/service/register
# deregister a service
curl --request PUT http://10.20.0.91:8500/v1/agent/service/deregister/node-exporter-10-30-6-67
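Confirm what is currently registered:
curl -s http://10.20.0.91:8500/v1/agent/services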
Prometheus configuration. Two variants follow; job_name must be unique within one file, so rename one of them if both are kept. The second keeps only services tagged 'node' and maps Consul service metadata into labels:
- job_name: 'consul-prometheus'
consul_sd_configs:
- server: '10.20.0.91:8500'
services: []
- job_name: 'consul-prometheus'
consul_sd_configs:
- server: '10.20.0.91:8500'
services: []
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: .*node.*
action: keep
- regex: __meta_consul_service_metadata_(.+)
action: labelmap
IV. Grafana
# CentOS / RHEL
cat /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://mirrors.aliyun.com/grafana/yum/rpm
repo_gpgcheck=0
enabled=1
gpgcheck=0
yum -y install grafana
systemctl enable grafana-server && systemctl start grafana-server
netstat -nuptl|grep 3000
# default username and password are both admin
# ubuntu
apt-get install -y apt-transport-https
apt-get install -y software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
apt-get install grafana
systemctl enable grafana-server && systemctl start grafana-server
netstat -nuptl|grep 3000
# docker
docker run -d -p 3000:3000 --name=grafana -v grafana-storage:/var/lib/grafana grafana/grafana
Dashboard templates can be searched at https://grafana.com/grafana/dashboards/ ; a freshly imported dashboard may show no data until its template variables are adjusted.
MySQL: import ID 14057
GPU: import ID 12239
RocketMQ: import ID 14612
node: import IDs 8919 (commonly used), 16098
API blackbox_exporter: import ID 7587
K8s: import ID 15661
V. Alert Configuration
Point rule_files in prometheus.yml at the rule files:
rule_files:
  - "./conf/rule*.yml"
  # - "second_rules.yml"
Basic alerts
- name: 实例存活告警规则
rules:
- alert: 实例存活告警
expr: up == 0
for: 1m
labels:
user: prometheus
severity: warning
annotations:
description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
- name: CPU报警规则
rules:
- alert: CPU使用率告警
expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 90
for: 1m
labels:
user: prometheus
severity: warning
annotations:
description: "服务器: CPU使用超过90%!(当前值: {{ $value }}%)"
- name: 内存报警规则
rules:
- alert: 内存使用率告警
expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 95
for: 1m
labels:
user: prometheus
severity: warning
annotations:
description: "服务器: 内存使用超过80%!(当前值: {{ $value }}%)"
- name: 磁盘报警规则
rules:
- alert: 磁盘使用率告警
expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
for: 1m
labels:
user: prometheus
severity: warning
annotations:
description: "服务器: 磁盘设备: 使用超过80%!(挂载点: {{ $labels.mountpoint }} 当前值: {{ $value }}%)"
Application alerts
- name: flink任务失败告警规则
rules:
- alert: flink任务失败告警
expr: ((flink_jobmanager_job_uptime offset 30s)-(flink_jobmanager_job_uptime))/1000 > 0
for: 15s
labels:
user: prometheus
severity: warning
annotations:
description: "服务器 {{ $labels.host }} 上的flink任务:{{ $labels.job_name }}失败,请关注!"
- name: flink任务延迟告警规则
rules:
- alert: flink任务延迟告警
expr: ((flink_jobmanager_job_uptime offset 30s)-(flink_jobmanager_job_uptime))/1000 > 0
for: 30s
labels:
user: prometheus
severity: warning
annotations:
description: "服务器 {{ $labels.host }} 上的flink任务:{{ $labels.job_name }}发生延迟,请关注!
- name: kafka消息堆积告警规则
rules:
- alert: kafka消息堆积告警
expr: sum(kafka_consumergroup_lag - kafka_consumergroup_lag offset 10m) by (consumergroup, topic) > 500000
for: 1m
labels:
user: prometheus
severity: warning
annotations:
description: "kafka topic: {{ $labels.topic }} 上consumergroup:{{ $labels.consumergroup }}堆积消息超过50万条,请关注!"
VI. List of Targets to Monitor
See Yuque: https://bolean.yuque.com/gdmlxg/fld8zh/cwwqgg#b4sY (still incomplete; to be supplemented). Monitoring covers VMs on the private cloud platform first; in addition, a request will be posted in the company group chat for IP lists of anything else that needs monitoring, and we will add them.
Open issues
Approach: a. Bake node_exporter into the VM image template, together with a script that automatically registers the new VM with Prometheus (see the sketch after this list).
b. A large number of templates and VMs already exist; those VMs are currently registered with Prometheus by hand and can later be migrated to approach (a).
Approach: a. Put a script on every server that checks whether the services above are present and, if so, registers the matching exporter with Prometheus; the script takes the owner and project as parameters.
b. A server-side script scans the relevant ports across subnets and registers any endpoint already serving exporter metrics with Prometheus. Drawback: hosts are only discovered once an exporter is installed; without one they stay unmonitored.
c. Package exporter installation and registration as scripts, publish them in the ops team's knowledge base, and let teams that need monitoring and alerting install and register by themselves.
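A minimal sketch of the self-registration idea in (a)/(c), assuming the file-SD layout from section III (targets/node/*.yml on the Prometheus host); the host name and SSH transport are illustrative:
#!/usr/bin/env bash
# register-node.sh <person> <project>: run on a freshly provisioned VM
set -euo pipefail
PERSON="$1"; PROJECT="$2"
IP=$(hostname -I | awk '{print $1}')
PROM_HOST=prometheus-host   # hypothetical Prometheus server
# write a file-SD entry; file_sd picks it up without a reload
ssh root@"$PROM_HOST" "cat > /usr/local/prometheus/targets/node/${IP}.yml" <<EOF
- targets: ['${IP}:9100']
  labels:
    project: '${PROJECT}'
    person: '${PERSON}'
EOF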