Monitoring and Alerting Implementation Plan

I. Overview

1. Current pain points

 1. Our production, development, and test environments are scattered across OpenStack and assorted physical machines. Problems are not detected promptly: we usually only learn a server has an issue when the business hits it in use and someone starts troubleshooting, so we never find problems ahead of the business.

 2. Every department and test team that needs server performance data has to stand up its own monitoring platform, so metrics are inconsistent across teams and effort is duplicated.

2. Advantages of Prometheus

 1. A multi-dimensional data model (time series identified by key/value pairs)

 2. PromQL, a flexible query and aggregation language

 3. Local storage, with support for distributed (remote) storage

 4. Time-series collection over an HTTP-based pull model

 5. Push mode available through Pushgateway (an optional Prometheus component)

 6. Target discovery via dynamic service discovery or static configuration

 7. Support for a wide range of graphs and dashboards

3. What problems monitoring and alerting solve

 1. Ops staff and service owners learn about problems first instead of waiting for the business to report them; to some extent, issues can be fixed before the business even notices.

 2. During departmental testing, the load on individual server components and the overall server load can be watched in real time.

Querying via the Prometheus API

Total cluster CPU (raw, unencoded form):

curl http://192.1.3.107:32074/api/v1/query?query=sum(kube_node_status_capacity{resource="cpu"})

The query string contains characters ({, }, ", =) that are not URL-safe, so they must be percent-encoded, as in the memory example below.

Total cluster memory (percent-encoded):

curl http://192.1.3.107:32074/api/v1/query?query=sum%28kube_node_status_capacity%7Bresource%3D%22memory%22%7D%29
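Alternatively, curl can do the encoding itself via -G/--data-urlencode; a sketch issuing the same memory query as above:

curl -G 'http://192.1.3.107:32074/api/v1/query' \
  --data-urlencode 'query=sum(kube_node_status_capacity{resource="memory"})'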

II. Monitoring Deployment

Reference documentation

Since some servers do not have Docker installed, the binary deployment is used here (Docker deployment gives the same result).

Official site: https://prometheus.io

Docs: https://prometheus.io/docs/introduction/overview/

Chinese docs: https://songjiayang.gitbooks.io/prometheus/content/

Download the Prometheus components: https://prometheus.io/download/

Prometheus deployment

wget https://github.com/prometheus/prometheus/releases/download/v2.36.1/prometheus-2.36.1.linux-amd64.tar.gz

tar xf prometheus-2.36.1.linux-amd64.tar.gz

mv prometheus-2.36.1.linux-amd64 /usr/local/prometheus-2.36.1.linux-amd64

ln -s /usr/local/prometheus-2.36.1.linux-amd64 /usr/local/prometheus

mkdir -p /data/prometheus

vim /usr/lib/systemd/system/prometheus.service # configure as a service

[Unit]

Description=Prometheus

After=network.target

[Service]

Type=simple

Environment="GOMAXPROCS=4"

User=root

Group=root

ExecReload=/bin/kill -HUP $MAINPID

ExecStart=/usr/local/prometheus/prometheus \

  --config.file=/usr/local/prometheus/prometheus.yml \

  --storage.tsdb.path=/data/prometheus \

  --storage.tsdb.retention=30d \

  --web.console.libraries=/usr/local/prometheus/console_libraries \

  --web.console.templates=/usr/local/prometheus/consoles \

  --web.listen-address=0.0.0.0:9090 \

  --web.read-timeout=5m \

  --web.max-connections=10 \

  --query.max-concurrency=20 \

  --query.timeout=2m \

  --web.enable-lifecycle

PrivateTmp=true

PrivateDevices=true

ProtectHome=true

NoNewPrivileges=true

LimitNOFILE=infinity

ReadWriteDirectories=/data/prometheus

ProtectSystem=full

SyslogIdentifier=prometheus

Restart=always

[Install]

WantedBy=multi-user.target

# start prometheus

systemctl daemon-reload

systemctl enable prometheus && systemctl start prometheus

netstat -alntp | grep 9090
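Since the unit starts Prometheus with --web.enable-lifecycle, configuration changes can be validated and hot-reloaded without restarting the service. A sketch using promtool, which ships in the same tarball:

# validate the config first
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml

# then reload via the lifecycle endpoint
curl -X POST http://localhost:9090/-/reload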

node_exporter deployment

wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz

tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz

mv node_exporter-1.3.1.linux-amd64  /usr/local/

ln -s /usr/local/node_exporter-1.3.1.linux-amd64/ /usr/local/node_exporter

vim /usr/lib/systemd/system/node_exporter.service # configure as a service

[Unit]

Description=node_exporter

After=network.target

[Service]

Type=simple

User=root

Group=root

ExecStart=/usr/local/node_exporter/node_exporter \

  --web.listen-address=0.0.0.0:9100 \

  --web.telemetry-path=/metrics \

  --log.level=info \

  --log.format=logfmt

Restart=always

[Install]

WantedBy=multi-user.target

# start node_exporter

systemctl daemon-reload

systemctl enable node_exporter && systemctl start node_exporter

netstat -alntp | grep node_export
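Beyond checking the port, a quick functional test is to fetch the metrics path directly (run on the exporter host):

curl -s http://localhost:9100/metrics | head -n 5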

An alternative flag set, changing the listen port and disabling collectors that are not needed (in newer node_exporter releases the ignored-* filesystem flags are renamed to --collector.filesystem.mount-points-exclude and --collector.filesystem.fs-types-exclude):

--web.listen-address=0.0.0.0:32760 --no-collector.hwmon --no-collector.nfs --no-collector.nfsd --no-collector.nvme --no-collector.dmi --no-collector.arp --collector.filesystem.ignored-mount-points="^/(dev|proc|sys|var/lib/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)" --collector.filesystem.ignored-fs-types="^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$"

pushgateway

https://juejin.cn/post/7233377767466926117

alertmanager deployment

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz

mv alertmanager-0.24.0.linux-amd64 /usr/local/

ln -s  /usr/local/alertmanager-0.24.0.linux-amd64  /usr/local/alertmanager

mkdir -p /data/alertmanager

# configure as a service

vim  /usr/lib/systemd/system/alertmanager.service

[Unit]

Description=alertmanager handles alerts sent by client applications such as the Prometheus server

Documentation=https://prometheus.io/docs/alerting/alertmanager/

After=network.target

[Service]

User=root

Group=root

ExecStart=/usr/local/alertmanager/alertmanager \

  --config.file=/usr/local/alertmanager/alertmanager.yml \

  --storage.path=/data/alertmanager  

ExecReload=/bin/kill -HUP $MAINPID

Restart=on-failure

[Install]

WantedBy=multi-user.target

systemctl daemon-reload

systemctl enable alertmanager && systemctl start alertmanager

netstat -lntp | grep alertmanager

Configuration file (alertmanager.yml)

# global configuration

global:

  resolve_timeout: 5m # how long to wait before declaring an alert resolved; default 5m

# routing tree

route:

  group_by: ['alertname']  # how alerts are grouped

  receiver: 'dingding.webhook1'

  group_wait: 30s        # how long to wait before sending the first notification for a new group

  group_interval: 30s    # wait before notifying about new alerts added to an existing group

  repeat_interval: 1h    # how often a still-firing alert is re-sent; default 1h

  routes:

  - receiver: 'dingding.webhook1'

    group_wait: 30s

    match_re:

      alertname: '实例存活告警|磁盘使用率告警'   # route by the alert-rule names (instance-down / disk-usage alerts)

  - receiver: 'dingding.webhook.all'

    group_wait: 30s

    match_re:

      alertname: '内存使用率告警|CPU使用率告警'

# default receiver

receivers:

- name: 'dingding.webhook1'

  webhook_configs:

  - url: 'http://cdh1:8060/dingtalk/webhook1/send'

    send_resolved: true  # also notify when an alert is resolved

# second receiver

- name: 'dingding.webhook.all'

  webhook_configs:

  - url: 'http://cdh1:8060/dingtalk/webhook1/send'

    send_resolved: true
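The configuration can be validated before (re)starting with amtool, which ships in the alertmanager tarball; a sketch assuming the paths used above:

/usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml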

prometheus-webhook-dingtalk deployment

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/

ln -s  /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64  /usr/local/dingtalk

cd /usr/local/dingtalk

# configure as a service

vim  /usr/lib/systemd/system/dingtalk.service

[Unit]

Description=https://github.com/timonwong/prometheus-webhook-dingtalk/releases/

After=network-online.target

[Service]

Restart=on-failure

ExecStart=/usr/local/dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/dingtalk/config.yml

[Install]

WantedBy=multi-user.target

systemctl daemon-reload

systemctl enable dingtalk && systemctl start dingtalk

In the DingTalk group, add Group Assistant >> Custom Robot >> enable signing (copy the generated signature into the secret line of config.yml; once the robot is created its webhook URL appears, which goes into the url field of config.yml).

Configuration file (config.yml)

targets:

  webhook1:

    url: https://oapi.dingtalk.com/robot/send?access_token=4de703c270384a7b4702d31b1ab2b32cd7807af52093c62040729148770eaf7d

    # secret for signature

    secret: SEC8a5ac95d220633aa76ced0186dce637382fea88c2b86cf2cf68ec3a8e999a448

  webhook2:

    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx

  webhook_legacy:

    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx

    # Customize template content

    message:

      # Use legacy template

      title: '{{ template "legacy.title" . }}'

      text: '{{ template "legacy.content" . }}'

  webhook_mention_all:

    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx

    mention:

      all: true

  webhook_mention_users:

    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx

    mention:

      mobiles: ['156xxxx8827', '189xxxx8325']

Flink monitoring deployment

 Edit the Flink configuration file (flink-conf.yaml) and add the metrics-reporter settings:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter

metrics.reporter.promgateway.host: cdh1

metrics.reporter.promgateway.port: 9091

metrics.reporter.promgateway.jobName: Job

metrics.reporter.promgateway.randomJobNameSuffix: true

metrics.reporter.promgateway.deleteOnShutdown: false

#metrics.reporter.promgateway.groupingKey: k1=v1;k2=v2

metrics.reporter.promgateway.interval: 15 SECONDS

Download and start pushgateway

wget https://github.com/prometheus/pushgateway/releases/download/v1.4.3/pushgateway-1.4.3.linux-amd64.tar.gz

tar -zxvf pushgateway-1.4.3.linux-amd64.tar.gz

mv pushgateway-1.4.3.linux-amd64 /usr/local/

ln -s  /usr/local/pushgateway-1.4.3.linux-amd64  /usr/local/pushgateway

cd /usr/local/pushgateway

nohup  ./pushgateway > ./pushgateway.log 2>&1 &
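To verify the gateway accepts and exposes data before wiring Flink in, a throwaway metric can be pushed by hand (the metric and job names here are made up):

# push one sample under job "smoke_test", then read it back
echo "smoke_test_metric 1" | curl --data-binary @- http://localhost:9091/metrics/job/smoke_test

curl -s http://localhost:9091/metrics | grep smoke_test_metric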

Kafka monitoring deployment

wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.4.2/kafka_exporter-1.4.2.linux-amd64.tar.gz

tar -zxvf kafka_exporter-1.4.2.linux-amd64.tar.gz

mv kafka_exporter-1.4.2.linux-amd64 /usr/local

ln -s /usr/local/kafka_exporter-1.4.2.linux-amd64 /usr/local/kafka_exporter

cd /usr/local/kafka_exporter

nohup ./kafka_exporter  --kafka.server=10.30.6.67:9092 > ./kafka_exporter.log 2>&1 &

Alternatively run it as a systemd unit (the paths below reflect a different install location):

cat /usr/lib/systemd/system/kafka_exporter.service

[Unit]

Description=kafka_exporter

After=network.target

 

[Service]

Type=simple

WorkingDirectory=/mnt/prometh/kafka_exporter-1.4.2.linux-amd64/

ExecStart=/mnt/prometh/kafka_exporter-1.4.2.linux-amd64/kafka_exporter  --kafka.server=10.30.6.70:9092

LimitNOFILE=65536

PrivateTmp=true

RestartSec=2

StartLimitInterval=0

Restart=always

 

[Install]

WantedBy=multi-user.target

Elasticsearch monitoring deployment

wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v1.3.0/elasticsearch_exporter-1.3.0.linux-amd64.tar.gz

tar -zxvf elasticsearch_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/

ln -s /usr/local/elasticsearch_exporter-1.3.0.linux-amd64 /usr/local/elasticsearch_exporter

nohup ./elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots --es.timeout=10s --web.listen-address=:9114 --web.telemetry-path=/metrics --es.uri http://elastic:Bl666666@10.30.6.70:9200 > ./es_exporter.log 2>&1 &

Alternatively run it as a systemd unit:

cat /usr/lib/systemd/system/elasticsearch_exporter.service

[Unit]

Description=elasticsearch_exporter

After=network.target

 

[Service]

Type=simple

WorkingDirectory=/root/elasticsearch_exporter-1.3.0.linux-amd64/

ExecStart=/root/elasticsearch_exporter-1.3.0.linux-amd64/elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots --es.timeout=10s --web.listen-address=:9114 --web.telemetry-path=/metrics --es.uri http://elastic:Bl666666@10.30.6.70:9200

LimitNOFILE=65536

PrivateTmp=true

RestartSec=2

StartLimitInterval=0

Restart=always

 

[Install]

WantedBy=multi-user.target

PostgreSQL monitoring deployment

wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.10.1/postgres_exporter-0.10.1.linux-amd64.tar.gz

tar -zxvf postgres_exporter-0.10.1.linux-amd64.tar.gz -C /usr/local/

ln -s /usr/local/postgres_exporter-0.10.1.linux-amd64 /usr/local/postgres_exporter

export DATA_SOURCE_NAME="postgresql://bolean:Bl666666@127.0.0.1:5432/postgres?sslmode=disable"

cd /usr/local/postgres_exporter

nohup ./postgres_exporter  >  ./pg_exporter.log 2>&1 &
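The exported DATA_SOURCE_NAME only lives in the current shell, so the exporter will not survive a reboot. A sketch of a systemd unit in the same style as the other exporters, assuming the paths and DSN from the install above:

[Unit]
Description=postgres_exporter
After=network.target

[Service]
Type=simple
Environment="DATA_SOURCE_NAME=postgresql://bolean:Bl666666@127.0.0.1:5432/postgres?sslmode=disable"
ExecStart=/usr/local/postgres_exporter/postgres_exporter
Restart=always

[Install]
WantedBy=multi-user.target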

Redis monitoring deployment

wget https://github.com/oliver006/redis_exporter/releases/download/v1.41.0/redis_exporter-v1.41.0.linux-amd64.tar.gz

tar -zxvf redis_exporter-v1.41.0.linux-amd64.tar.gz -C /usr/local/

ln -s /usr/local/redis_exporter-v1.41.0.linux-amd64 /usr/local/redis_exporter

cd /usr/local/redis_exporter

nohup ./redis_exporter -redis.addr 127.0.0.1:6379  -redis.password Bl666666  >  ./redis_exporter.log 2>&1 &

Docker deployment

 docker pull oliver006/redis_exporter

docker run -d  --restart="always" --name redis_exporter -p 9121:9121 oliver006/redis_exporter --redis.addr redis://192.168.0.41:6379 --redis.password 'STUdio2022_linker'

# cat redis-export.yaml

---

apiVersion: apps/v1

kind: Deployment

metadata:

  name: redis-exporter

  labels:

    app: redis-exporter

spec:

  replicas: 1

  selector:

    matchLabels:

      app: redis-exporter

  template:

    metadata:

      labels:

        app: redis-exporter

    spec:

      containers:

      - name: redis-exporter

        image: fvhb.fjecloud.com/xsgy/redis_exporter

        imagePullPolicy: IfNotPresent

        # add the Redis settings here, e.g. address and password

        # for a Redis instance outside the k8s cluster, the redis.addr value needs the redis:// prefix, as in the commented line below

        # args: ["-redis.addr", "redis://10.128.27.22:6379", "-redis.password", "123456@redis"]

        args: ["-redis.addr", "redis:6379", "-redis.password", "Ff1z@TOFr^iwd%Ra"]

        ports:

        - containerPort: 9121

---

apiVersion: v1

kind: Service

metadata:

  labels:

    app: redis-exporter

  name: redis-exporter

spec:

  type: NodePort

  ports:

  - name: metrics

    port: 9121

    protocol: TCP

    targetPort: 9121

    nodePort: 32763

  selector:

    app: redis-exporter

Grafana dashboard

Import ID: 9338

MySQL monitoring deployment

docker run -d  -p 9104:9104  --restart="always"  --name mysql_exporter   -e DATA_SOURCE_NAME="cri:STUdio2022_linker@(192.168.0.41:3306)/"   prom/mysqld-exporter

docker run -d  -p 9104:9104  --restart="always"  --name mysql_exporter   -e DATA_SOURCE_NAME="guest:guest@123@(10.8.15.25:3306)/"   prom/mysqld-exporter

Docker Compose deployment

version: '2.3'

services:

  mysql_exporter:

    image: prom/mysqld-exporter

    container_name: mysql_exporter

    restart: always

    ports:

      - 9104:9104

    environment:

      - DATA_SOURCE_NAME="cri:STUdio2022_linker@(192.168.0.41:3306)/"

    volumes:

      - /etc/hosts:/etc/hosts

    networks:

      - vos-exporter

  redis_exporter:

    image: oliver006/redis_exporter

    container_name: redis_exporter

    restart: always

    ports:

      - 9121:9121

    command:

      - "-redis.password-file=/redis_passwd.json"

    volumes:

      - /etc/hosts:/etc/hosts

      - ./redis_passwd.json:/redis_passwd.json

    networks:

      - vos-exporter

networks:

  vos-exporter:

K8s deployment

apiVersion: apps/v1

kind: Deployment

metadata:

  name: mysqld-exporter

  labels:

    app: mysqld-exporter

spec:

  selector:

    matchLabels:

      app: mysqld-exporter

  template:

    metadata:

      labels:

        app: mysqld-exporter

    spec:

      containers:

      - name: mysqld-exporter

        image: prom/mysqld-exporter

        env:

        - name: DATA_SOURCE_NAME

          value: 'readonly:vMjef!3iW2hjv9SK@(mysql:3306)/'

        ports:

        - containerPort: 9104

          name: http

---

apiVersion: v1

kind: Service

metadata:

  name: mysqld-exporter

  labels:

    app: mysqld-exporter

spec:

  selector:

    app: mysqld-exporter

  type: NodePort

  ports:

  - port: 9104

    targetPort: 9104

    nodePort: 32765

RocketMQ monitoring deployment

# cat rocketmq-export.yaml

apiVersion: apps/v1

kind: Deployment

metadata:

  name: rocketmq-export

spec:

  replicas: 1

  selector:

    matchLabels:

      app: rocketmq-export

  template:

    metadata:

      labels:

        app: rocketmq-export

    spec:

      containers:

      - name: rocketmq-export

        #image: fvhb.fjecloud.com/xsgy/a009_msrcnn:v1.0.18_encrypted

        image: slpcat/rocketmq-exporter:latest

        command: ["sh", "-c", "java $JAVA_OPTS -jar rocketmq-exporter-0.0.2-SNAPSHOT.jar --rocketmq.config.rocketmqVersion=V4_5_1 --rocketmq.config.namesrvAddr=rmqnamesrv.default:9876"]

        #args: ["--rocketmq.config.namesrvAddr=rmqnamesrv.default:9876"]

        ports:

        - containerPort: 5557

---

apiVersion: v1

kind: Service

metadata:

  name: rocketmq-export

  labels:

    app: rocketmq-export

spec:

  selector:

    app: rocketmq-export

  type: NodePort

  ports:

  - port: 5557

    targetPort: 5557

    nodePort: 30134

MongoDB monitoring deployment

docker run  -d  --network=host --restart="always" --name=mongodb_export -e MONGODB_URI='mongodb://cri:STUdio2022_linker@192.168.0.41:27017/?authSource=cri' bitnami/mongodb-exporter:latest --collect-all --web.listen-address=":9216"

# cat  /srv/mongodb-export.yaml

apiVersion: apps/v1

kind: Deployment

metadata:

  labels:

    k8s-app: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended

  name: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended

spec:

  replicas: 1

  selector:

    matchLabels:

      k8s-app: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended

  template:

    metadata:

      labels:

        k8s-app: mongodb-exporter # rename to suit the business; including the MongoDB instance info is recommended

    spec:

      containers:

        - args:

            - --collect.database       # collect database metrics

            - --collect.collection     # collect collection metrics

            - --collect.topmetrics     # collect table top metrics

            - --collect.indexusage     # collect per-index usage stats

            - --collect.connpoolstats  # collect MongoDB connpoolstats

          env:

            - name: MONGODB_URI

              valueFrom:

                secretKeyRef:

                  name: mongodb-secret-test

                  key: datasource

          image: ccr.ccs.tencentyun.com/rig-agent/mongodb-exporter:0.10.0

          imagePullPolicy: IfNotPresent

          name: mongodb-exporter

          ports:

            - containerPort: 9216

              name: metric-port  # this name is referenced when configuring the scrape job

          securityContext:

            privileged: false

          terminationMessagePath: /dev/termination-log

          terminationMessagePolicy: File

      dnsPolicy: ClusterFirst

      imagePullSecrets:

        - name: qcloudregistrykey

      restartPolicy: Always

      schedulerName: default-scheduler

      securityContext: { }

      terminationGracePeriodSeconds: 30

---

apiVersion: v1

kind: Secret

metadata:

    name: mongodb-secret-test

type: Opaque

stringData:

    datasource: "mongodb://cri:STUdio2022_linker@192.168.0.41:27017/admin"  # the connection URI

---

apiVersion: v1

kind: Service

metadata:

  name: mongodb-export

  labels:

    k8s-app: mongodb-exporter

spec:

  selector:

    k8s-app: mongodb-exporter

  type: NodePort

  ports:

  - port: 9216

    targetPort: 9216

    nodePort: 30334

Grafana dashboards: import IDs 12079 and 7353, used in combination.

OpenStack monitoring deployment

#  admin.novarc

OS_PROJECT_DOMAIN_NAME=Default

OS_USER_DOMAIN_NAME=Default

OS_PROJECT_NAME=admin

OS_USERNAME=admin

OS_PASSWORD=Bl666666

OS_IDENTITY_API_VERSION=3

OS_AUTH_URL=http://127.0.0.1/identity/v3

docker run -itd \
  --name=openstack -p 9183:9183 \
  --env-file=$(pwd)/admin.novarc \
  --restart=unless-stopped moghaddas/prom-openstack-exporter

# or deploy on a physical machine

wget https://github.com/canonical/prometheus-openstack-exporter/archive/refs/tags/0.1.6.tar.gz

sudo apt-get install python-neutronclient python-novaclient python-keystoneclient python-netaddr python-cinderclient

apt-get install python-prometheus-client

# Copy example config in place, edit to your needs

sudo cp prometheus-openstack-exporter.yaml /etc/prometheus/

. /path/to/admin-novarc

./prometheus-openstack-exporter prometheus-openstack-exporter.yaml

ClickHouse monitoring deployment

ClickHouse has a built-in metrics endpoint, default port 9363. For container deployments this port must be exposed; if changing the exposed ports of an existing deployment is impractical, deploy clickhouse_exporter instead (https://github.com/ClickHouse/clickhouse_exporter), a small server that periodically scrapes ClickHouse stats and exposes them over HTTP for Prometheus.
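If the built-in endpoint is reachable, a plain scrape job is all that is needed; a sketch, with the host taken from the exporter example below:

- job_name: "clickhouse"
  scrape_interval: 10s
  static_configs:
  - targets: ['192.168.0.94:9363']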

docker pull hotwifi/clickhouse_exporter:latest

Run an exporter container pointed at the ClickHouse HTTP endpoint (8123):

docker run -d -p 9116:9116 tkroman/clickhouse_exporter_fresh -scrape_uri=http://bolean:Bl666666@192.168.0.94:8123/

StarRocks monitoring deployment

StarRocks ships with an agent that collects monitoring data from each host; the agent is Prometheus-compatible, so Prometheus can be pointed at it directly.

GPU monitoring

# cat  vgpu/gpu-export.yaml

apiVersion: apps/v1

kind: Deployment

metadata:

  name: gpu-export

  namespace: kube-mon

spec:

  replicas: 1

  selector:

    matchLabels:

      app: gpu-export

  template:

    metadata:

      labels:

        app: gpu-export

    spec:

      containers:

      - name: gpu-export

        image: hzlh-registry.cn-hangzhou.cr.aliyuncs.com/vos/dcgm-exporter:3.1.6-3.1.3-ubuntu20.04

        ports:

        - containerPort: 9400

      hostNetwork: true

API endpoint monitoring

# cat configmap.yaml

apiVersion: v1

kind: ConfigMap

metadata:

  labels:

    app: blackbox-exporter

  name: blackbox-exporter

  namespace: kube-mon

data:

  blackbox.yml: |-

    modules:

      http_2xx:

        prober: http

        timeout: 10s

        http:

          valid_http_versions: ["HTTP/1.1", "HTTP/2"]

          valid_status_codes: [200,301,302]

          method: GET

          preferred_ip_protocol: "ip4"

      tcp_connect:

        prober: tcp

        timeout: 10s

---

# cat deployment.yaml

kind: Deployment

apiVersion: apps/v1

metadata:

  name: blackbox-exporter

  namespace: kube-mon

  labels:

    app: blackbox-exporter

  #annotations:

    #deployment.kubernetes.io/revision: 1

spec:

  replicas: 1

  selector:

    matchLabels:

      app: blackbox-exporter

  template:

    metadata:

      labels:

        app: blackbox-exporter

    spec:

      volumes:

      - name: config

        configMap:

          name: blackbox-exporter

          defaultMode: 420

      containers:

      - name: blackbox-exporter

        image: prom/blackbox-exporter:v0.23.0

        imagePullPolicy: IfNotPresent

        args:

        - --config.file=/etc/blackbox_exporter/blackbox.yml

        - --log.level=info

        - --web.listen-address=:9115

        ports:

        - name: blackbox-port

          containerPort: 9115

          protocol: TCP

        resources:

          limits:

            cpu: 200m

            memory: 256Mi

          requests:

            cpu: 100m

            memory: 50Mi

        volumeMounts:

        - name: config

          mountPath: /etc/blackbox_exporter

        readinessProbe:

          tcpSocket:

            port: 9115

          initialDelaySeconds: 5

          timeoutSeconds: 5

          periodSeconds: 10

          successThreshold: 1

          failureThreshold: 3

---

Prometheus configuration

     - job_name: 'vos'

       metrics_path: /probe

       params:

         module: [http_2xx]

       static_configs:

         - targets: ['http://192.1.2.238:8317/vql/v1/health/ping']

           labels:

             instance: 192.1.2.238:8317

             project: vos

             group: web

       relabel_configs:

         - source_labels: [__address__]

           target_label: __param_target

         - target_label: __address__

           replacement: 192.1.3.107:31004

Grafana dashboard import ID: 7587

VMware monitoring

docker run -d -p 9272:9272  -e VSPHERE_USER=lv_pingping@vsphere.local -e VSPHERE_PASSWORD=ISq59Pg#T2XHaU -e VSPHERE_HOST=192.1.5.200  -e VSPHERE_IGNORE_SSL=True  -e VSPHERE_SPECS_SIZE=2000  --name vmware_exporter  pryorda/vmware_exporter

K8s monitoring

The kube-state-metrics archive is under this directory; apply the standard manifests:

kubectl  apply -f kube-state-metrics/examples/standard/

Grafana dashboard import IDs: 15661, 13105

https://grafana.com/grafana/dashboards/15661-1-k8s-for-prometheus-dashboard-20211010/

Abnormal-status alert expression: min_over_time(kube_pod_container_status_ready{instance=~"10.8.22.123:32509",pod!~"rke-.*"}[1m]) != 1
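Wrapped into a rule file following the same conventions as the alerting rules in section V (the for duration and labels here are assumptions):

groups:
- name: pod状态异常告警规则
  rules:
  - alert: pod状态异常告警
    expr: min_over_time(kube_pod_container_status_ready{instance=~"10.8.22.123:32509",pod!~"rke-.*"}[1m]) != 1
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready, please investigate!"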

III. Monitoring Configuration and Service Discovery

File-based service discovery

  1. With file-based discovery, the targets for each service are written into files under the targets directory, e.g. node targets in targets/node/sa-node.yml
  2. Each target should be labeled with the project group it belongs to, for example:

- targets: ['10.30.6.70:9100']

  labels:

    project: 'situation-awareness'

    person:  'lvpingping'

    vm:      'true'

- targets: ['10.30.6.67:9100']

  labels:

    project: 'situation-awareness'

    person:  'lvpingping'

    vm:      'false'

# my global config

global:

  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.

  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration

alerting:

  alertmanagers:

    - static_configs:

        - targets:

           - cdh1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.

rule_files:

  - "/mnt/prometh/prometheus-2.35.0.linux-amd64/conf/rule*.yml"

  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:

# Here it's Prometheus itself.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: "prometheus"

    static_configs:

      - targets: ["localhost:9090"]

  - job_name: "node exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/node/*.yml

  - job_name: "flink exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/flink/*.yml

  - job_name: "kafka exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/kafka/*.yml

  - job_name: "es exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/es/*.yml

  - job_name: "postgresql exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/postgresql/*.yml

  - job_name: "redis exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/redis/*.yml

  - job_name: "openstack exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/openstack/*.yml

  - job_name: "clickhouse exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/clickhouse/*.yml

  - job_name: "starrocks exporter"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/starrocks/*.yml

  - job_name: "pushgateway"

    scrape_interval: 10s

    file_sd_configs:

    - files:

      - targets/pushgateway/*.yml

Consul-based service discovery

# binary install

wget https://releases.hashicorp.com/consul/1.12.2/consul_1.12.2_linux_amd64.zip

unzip consul_1.12.2_linux_amd64.zip

cp consul /usr/bin

consul version

# configure as a service

vim  /usr/lib/systemd/system/consul.service

[Unit]

Description=Consul

After=network.target

[Service]

ExecStart=/usr/bin/consul agent -dev -ui -client 0.0.0.0 -log-file=/var/log/consul/

KillSignal=SIGINT

[Install]

WantedBy=multi-user.target

systemctl daemon-reload

systemctl enable consul && systemctl start consul

ss -antlp | grep consul

 # container

 docker run --name consul -d -p 8500:8500 consul

 

# service registration

 curl -X PUT -d '{

  "ID": "node-exporter-10-30-6-67",

  "Name": "node-exporter-10-30-6-67",

  "Tags": [

    "node"

  ],

  "Address": "10.30.6.67",

  "Port": 9100,

  "Meta": {

    "vm": "fales",

    "app": "node",

    "person": "lvpingping",

    "project": "sa"

  },

  "EnableTagOverride": false,

  "Check": {

    "HTTP": "http://10.30.6.67:9100/metrics",

    "Interval": "10s"

  }

}' http://10.20.0.21:8500/v1/agent/service/register

# service deregistration

curl --request PUT http://10.20.0.91:8500/v1/agent/service/deregister/node-exporter-10-30-6-67

Prometheus configuration

- job_name: 'consul-prometheus'

  consul_sd_configs:

  - server: '10.20.0.91:8500'

    services: []

With relabel_configs added, keep only services tagged node and map Consul service metadata (person, project, vm, app) onto Prometheus labels:

- job_name: 'consul-prometheus'
  consul_sd_configs:
  - server: '10.20.0.91:8500'
    services: []
  relabel_configs:
  - source_labels: [__meta_consul_tags]
    regex: .*node.*
    action: keep
  - regex: __meta_consul_service_metadata_(.+)
    action: labelmap
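To check what Prometheus will discover, the services currently registered on the agent can be listed (same Consul endpoint as above):

curl http://10.20.0.91:8500/v1/agent/services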

 

IV. Grafana

# CentOS

cat /etc/yum.repos.d/grafana.repo

[grafana]

name=grafana

baseurl=https://mirrors.aliyun.com/grafana/yum/rpm

repo_gpgcheck=0

enabled=1

gpgcheck=0

yum -y install grafana

systemctl enable grafana-server  && systemctl start grafana-server

netstat -nuptl|grep 3000

# default username and password are both admin

# Ubuntu

apt-get install -y apt-transport-https

apt-get install -y software-properties-common wget

wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

apt-get install grafana

systemctl enable grafana-server  && systemctl start grafana-server

netstat -nuptl|grep 3000

# Docker

docker run -d -p 3000:3000 --name=grafana -v grafana-storage:/var/lib/grafana grafana/grafana

Dashboard templates can be searched on Grafana Labs (https://grafana.com/grafana/dashboards/); a freshly imported dashboard may show no data until its template variables are adjusted.

For example:

flink: import ID 14911

openstack: import ID 9701

kafka: import ID 14012

es: import ID 4358

redis: import IDs 11692, 14091; 9338 is the one most used

clickhouse: import ID 882

postgres: import ID 9628

mysql: import ID 14057

gpu: import ID 12239

rocketmq: import ID 14612

node: import ID 8919 (most used), 16098

API (blackbox_exporter): import ID 7587

k8s: import ID 15661

V. Alerting Configuration

For alertmanager deployment, see above.

For the Prometheus configuration, see above; rule files are picked up via the rule_files setting:

rule_files:

  - "./conf/rule*.yml"

  # - "second_rules.yml"

Basic alerts

Node-down alert

groups:

- name: 实例存活告警规则

  rules:

  - alert: 实例存活告警

    expr: up == 0

    for: 1m

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

Node CPU utilization alert

groups:

- name: CPU报警规则

  rules:

  - alert: CPU使用率告警

    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 90

    for: 1m

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "服务器: CPU使用超过90%!(当前值: {{ $value }}%)"

Node memory alert

groups:

- name: 内存报警规则

  rules:

  - alert: 内存使用率告警

    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 95

    for: 1m

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "服务器: 内存使用超过80%!(当前值: {{ $value }}%)"

Node disk alert

groups:

- name: 磁盘报警规则

  rules:

  - alert: 磁盘使用率告警

    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80

    for: 1m

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "服务器: 磁盘设备: 使用超过80%!(挂载点: {{ $labels.mountpoint }} 当前值: {{ $value }}%)"

Application alerts

Flink job failure alert

groups:

- name: flink任务失败告警规则

  rules:

  - alert: flink任务失败告警

    expr: ((flink_jobmanager_job_uptime offset 30s)-(flink_jobmanager_job_uptime))/1000 > 0

    for: 15s

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "服务器 {{ $labels.host }} 上的flink任务:{{ $labels.job_name }}失败,请关注!"

Flink job delay alert

groups:

- name: flink任务延迟告警规则

  rules:

  - alert: flink任务延迟告警

    expr: ((flink_jobmanager_job_uptime offset 30s)-(flink_jobmanager_job_uptime))/1000 > 0

    for: 30s

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "服务器 {{ $labels.host }} 上的flink任务:{{ $labels.job_name }}发生延迟,请关注!

Kafka message-backlog alert

groups:

- name: kafka消息堆积告警规则

  rules:

  - alert: kafka消息堆积告警

    expr: sum(kafka_consumergroup_lag - kafka_consumergroup_lag offset 10m) by (consumergroup, topic) > 500000

    for: 1m

    labels:

      user: prometheus

      severity: warning

    annotations:

      description: "kafka topic: {{ $labels.topic }} 上consumergroup:{{ $labels.consumergroup }}堆积消息超过50万条,请关注!"

Other application alerts have not been validated yet; add alerts as business needs require.

VI. List of Targets to Monitor

See Yuque: https://bolean.yuque.com/gdmlxg/fld8zh/cwwqgg#b4sY. The list is not yet complete: VMs on the private cloud platform still need to be added, and a call will go out in the company-wide group for anything else that needs monitoring; teams can send us IP lists to be added.

Open issues

  1. The exporter provided officially by OpenStack only covers cluster-level information; it cannot monitor the individual virtual machines.

 Approach: a. Bake node_exporter into the VM image templates, plus a script that automatically registers each new VM with Prometheus.

 b. A large number of templates, and of VMs, already exist; for now those VMs are registered with Prometheus manually, and can later be migrated to approach (a).

  2. Although Prometheus can pick up new services without a restart, a fair amount of manual work remains, e.g. agent installation and service registration.

 Approach: a. Put a script on every server that checks which of the above services are running and registers the matching exporters with Prometheus; the script takes the owner and project as parameters (see the sketch after this list).

 b. Have a server-side script scan the subnets for exporter ports and register whatever responds with Prometheus. Drawback: a host is only discovered once an exporter is installed on it; hosts without one remain invisible.

 c. Turn exporter installation and registration into scripts, publish them in the ops team's public knowledge base, and let anyone who needs monitoring and alerting install and register on their own.

  3. For DingTalk group alerts, should the @-mention target the project team or the ops person handling it?
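A minimal sketch of approach (a) from issue 2, for node_exporter only, reusing the Consul registration API from section III (the Consul address and Meta keys are taken from that example; extend the same pattern for other exporters):

#!/bin/bash
# register this host's node_exporter in Consul so Prometheus discovers it
# usage: ./register-node.sh <person> <project>
PERSON=$1
PROJECT=$2
IP=$(hostname -I | awk '{print $1}')
# only register when node_exporter is actually answering
if curl -sf -o /dev/null "http://${IP}:9100/metrics"; then
  curl -X PUT -d "{
    \"ID\": \"node-exporter-${IP//./-}\",
    \"Name\": \"node-exporter-${IP//./-}\",
    \"Tags\": [\"node\"],
    \"Address\": \"${IP}\",
    \"Port\": 9100,
    \"Meta\": {\"person\": \"${PERSON}\", \"project\": \"${PROJECT}\"},
    \"Check\": {\"HTTP\": \"http://${IP}:9100/metrics\", \"Interval\": \"10s\"}
  }" http://10.20.0.21:8500/v1/agent/service/register
fi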