Prometheus + Grafana + DingTalk: comprehensive monitoring of servers and Docker containers

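I. Architecture

II. Overview

The following components need to be deployed:

prometheus:      the core monitoring component
cadvisor:        collects Docker container metrics and exposes a port for Prometheus to scrape
node-exporter:   collects host (server) metrics and exposes a port for Prometheus to scrape
grafana:         visualization component for monitoring dashboards
alertmanager:    the alerting component
dingtalk:        Alertmanager has no native DingTalk support, so the prometheus-webhook-dingtalk plugin forwards alerts to DingTalk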

III. Installation

This article deploys everything with Docker.

1. Install node-exporter on each host to be monitored
docker pull prom/node-exporter
docker run -d -p 9100:9100 \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v /:/rootfs:ro \
--name=node-exporter \
prom/node-exporter \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--path.rootfs=/rootfs
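
Before wiring the exporter into Prometheus, it can help to confirm that port 9100 actually serves metrics. The following is a minimal Python sketch, not part of the original setup: `parse_samples` and `fetch_metrics` are hypothetical helper names, and the parser assumes simple text-format lines (no trailing timestamps).

```python
import urllib.request


def parse_samples(text):
    """Parse simple Prometheus text-format lines into (series, value) pairs.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The series
    string keeps its label set, e.g. 'node_load1' or
    'node_cpu_seconds_total{cpu="0",mode="idle"}'.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


def fetch_metrics(host, port=9100, timeout=5):
    """Fetch the raw /metrics page from an exporter (needs network access)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Illustrative sample, shaped like node-exporter output.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

for series, value in parse_samples(SAMPLE):
    print(series, value)
```

Against a real host, `parse_samples(fetch_metrics("10.0.6.99"))` should return a non-empty list if the exporter is reachable.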
2. Install cadvisor on each host to be monitored
docker pull google/cadvisor
docker run \
-v /:/rootfs:ro \
-v /var/run:/var/run:rw \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
-p 9080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor
3. Install prometheus, grafana, alertmanager, and dingtalk on the monitoring host

This article installs them with docker-compose.

docker pull prom/prometheus
docker pull prom/alertmanager
docker pull grafana/grafana
docker pull timonwong/prometheus-webhook-dingtalk

Once the images are pulled, create the following directory structure:

/usr/local/prometheus/
    alert/
        alertmanager.yml
        config.yml
        dingtalk.tmpl
    prome/
        rules/
            rules.yml
        prometheus.yml
    docker-compose.yml

Start with docker-compose.yml, which contains the following:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=30d'
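    volumes:
      - ./prome:/etc/prometheus
      - prometheus_data:/prometheus   # keep TSDB data in the named volume declared at the bottom of this file
    ports:
      - "9090:9090"
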
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
  
      
  dingtalk:
    image: timonwong/prometheus-webhook-dingtalk
    container_name: dingtalk
    hostname: dingtalk
    restart: always
    volumes:
      - ./alert/config.yml:/etc/prometheus-webhook-dingtalk/config.yml
      - ./alert/dingtalk.tmpl:/opt/dingtalk/template/dingtalk.tmpl
    ports:
      - "29016:8060"
    environment:
      - TZ=Asia/Shanghai
   
      
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - ./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "29012:9093"
    environment:
      - TZ=Asia/Shanghai

  
     

volumes:
  prometheus_data:

The prometheus.yml file:

# prometheus.yml
global:
  scrape_interval: 15s
  


alerting:
  alertmanagers:
  - static_configs:
    - targets: ['10.0.6.110:29012']


rule_files:
  - "/etc/prometheus/rules/*.yml"
  
  
scrape_configs:
  - job_name: 'docker'
    static_configs:
      - targets: ['10.0.6.99:9100','10.0.6.98:9100','10.0.6.97:9100']  # replace with the IP:port of each host running node-exporter
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['10.0.6.99:9080','10.0.6.98:9080','10.0.6.97:9080']  # replace with the IP:port of each host running cadvisor (published on host port 9080 above)

The rules.yml file:

# A simple alerting example; write the PromQL expressions to suit your environment

groups:
- name: example_group
  rules:
  - alert: HighCPUUsage
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected on {{ $labels.instance }}."
      description: "The CPU usage on instance {{ $labels.instance }} has been above 80% for the past 10 minutes. Please investigate possible causes such as high workload or inefficient processes."
  - alert: LowDiskSpace
    expr: node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.2
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on root partition ({{ $labels.instance }})"
      description: "The disk space on the root partition of instance {{ $labels.instance }} is less than 20%. Immediate action might be required to avoid system issues. Consider cleaning up unnecessary files or expanding the disk."
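
To make the LowDiskSpace rule concrete, the sketch below reproduces its arithmetic and a simplified model of the `for` clause: the alert is pending while the condition holds and fires only after it has held for the whole `for` duration. This is an illustrative Python model, not Prometheus code; `disk_free_ratio`, `alert_states`, and the `for_evals` parameter (the `for` duration expressed as a number of rule-evaluation intervals) are assumptions made for the example.

```python
def disk_free_ratio(free_bytes, size_bytes):
    """Same arithmetic as node_filesystem_free_bytes / node_filesystem_size_bytes."""
    return free_bytes / size_bytes


def alert_states(ratios, threshold, for_evals):
    """Simplified model of a rule with a `for` clause.

    ratios:     one free-space ratio per rule evaluation
    threshold:  the alert condition is ratio < threshold
    for_evals:  number of evaluation intervals covered by the `for` duration
    Returns 'inactive', 'pending', or 'firing' for each evaluation.
    """
    states = []
    active_for = None  # evaluations elapsed since the condition became true
    for ratio in ratios:
        if ratio < threshold:
            active_for = 0 if active_for is None else active_for + 1
            states.append("firing" if active_for >= for_evals else "pending")
        else:
            active_for = None  # condition cleared, so the timer resets
            states.append("inactive")
    return states


# 15 GiB free on a 100 GiB root partition is below the 0.2 threshold.
print(disk_free_ratio(15 * 2**30, 100 * 2**30) < 0.2)

# The condition must hold across two further evaluations before firing.
print(alert_states([0.25, 0.15, 0.15, 0.15, 0.3], threshold=0.2, for_evals=2))
```

The model shows why a brief dip below the threshold produces no notification: the state returns to inactive as soon as one evaluation no longer matches.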

The alertmanager.yml file:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://10.0.6.110:29016/dingtalk/webhook/send'  # replace the IP with your own
    send_resolved: true
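
The DingTalk path can be exercised without waiting for a real alert by posting a hand-built payload to the same webhook URL. The sketch below is a minimal Python illustration; `build_test_payload` and `send` are hypothetical helpers, the field layout follows Alertmanager's version 4 webhook format, and the label values and `externalURL` are made up for the test.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(alertname="TestAlert", instance="10.0.6.99:9100"):
    """Build a minimal payload shaped like Alertmanager's webhook body."""
    labels = {"alertname": alertname, "instance": instance, "severity": "warning"}
    annotations = {"description": "Manual test alert, please ignore."}
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "version": "4",
        "groupKey": f'{{}}:{{alertname="{alertname}"}}',
        "status": "firing",
        "receiver": "webhook",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://alertmanager:9093",  # placeholder value
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": annotations,
                "startsAt": now,
                "endsAt": "0001-01-01T00:00:00Z",  # zero time: not resolved yet
                "generatorURL": "",
            }
        ],
    }


def send(payload, url="http://10.0.6.110:29016/dingtalk/webhook/send"):
    """POST the payload to prometheus-webhook-dingtalk (needs network access)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


print(json.dumps(build_test_payload(), indent=2))
```

Calling `send(build_test_payload())` from a machine that can reach the dingtalk container should produce a message in the group if the robot URL and secret are configured correctly.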

The config.yml file (configuration for prometheus-webhook-dingtalk):

## Request timeout
## timeout: 5s
### Uncomment following line in order to write template from scratch (be careful!)
##no_builtin_template: true
### Customizable templates path
templates:
  - '/opt/dingtalk/template/dingtalk.tmpl'   # load the custom template mounted in docker-compose.yml
### You can also override default template using `default_message`
### The following example to use the 'legacy' template from v0.3.0
##default_message:
##  title: '{{ template "legacy.title" . }}'
##  text: '{{ template "legacy.content" . }}'
### Targets, previously was known as "profiles"
targets:
  webhook:
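    url: '<webhook URL generated when you add a robot to the DingTalk group>'
    secret: '<signing secret generated for that robot>'      # see the robot's security settings in DingTalk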
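
The `secret` is used to sign each request to the robot URL: the timestamp and secret are joined with a newline, hashed with HMAC-SHA256 using the secret as the key, Base64-encoded, URL-encoded, and appended as `timestamp` and `sign` query parameters. The plugin is expected to do this itself when `secret` is set; the Python sketch below only illustrates the scheme. `signed_webhook_url` is a hypothetical helper, and the token and secret values are placeholders.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def signed_webhook_url(webhook_url, secret, timestamp_ms):
    """Append DingTalk's timestamp and sign parameters to a robot webhook URL."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # Base64 can contain '+', '/' and '=', so it must be URL-encoded.
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder values; use the URL and secret shown for your own robot.
url = signed_webhook_url(
    "https://oapi.dingtalk.com/robot/send?access_token=TOKEN",
    "SECexample",
    int(time.time() * 1000),
)
print(url)
```

Because the timestamp is part of the signature, a signed URL is only valid for a short time and has to be regenerated for each request.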

 

dingtalk.tmpl: custom message template

 {{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}
 
{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}
Alert status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Details: {{ .Annotations.description }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
 
{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}
Alert status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Details: {{ .Annotations.description }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
 
{{ define "default.title" }}
{{ template "__subject" . }}
{{ end }}
{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**Prometheus alert firing**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
**Prometheus alert resolved**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
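
The timestamp lines in the template do two things: `.Add 28800e9` shifts the time by 28800e9 nanoseconds (8 hours, to get from UTC to Beijing time), and `.Format` takes a Go layout string, which is always written in terms of Go's reference time `2006-01-02 15:04:05`. The Python sketch below mirrors that behavior for illustration; `to_beijing_display` is a hypothetical helper and assumes the input is a UTC timestamp with a trailing `Z`.

```python
from datetime import datetime, timedelta, timezone


def to_beijing_display(starts_at):
    """Shift a UTC timestamp by 8 hours and print it as 'YYYY-MM-DD HH:MM:SS'."""
    t = datetime.strptime(starts_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    shifted = t + timedelta(seconds=28800e9 / 1e9)  # 28800e9 ns = 28800 s = 8 h
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


print(to_beijing_display("2024-01-01T02:00:00Z"))  # 2024-01-01 10:00:00
```

A Go layout made of any other date is treated as literal digits rather than placeholders, which is why the reference time matters.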

Once everything above is in place, change to the directory containing docker-compose.yml and start the services:

cd /usr/local/prometheus/
docker-compose up -d

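4. After startup, visit each port to check that the services are running
  1. Open port 9090 and check that Prometheus is healthy

Click Status > Targets to check that metrics are being scraped.

Click Status > Rules to check that the alert rules loaded successfully.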
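
The information on the Status > Targets page is also available from Prometheus's HTTP API at `/api/v1/targets`, which is handy for a scripted check. The following Python sketch is illustrative only: `summarize_targets` and `fetch_targets` are hypothetical helpers, and the sample response is trimmed to the fields the sketch reads.

```python
import json
import urllib.request


def summarize_targets(api_response):
    """Map each job to a list of (scrape URL, health) pairs."""
    summary = {}
    for target in api_response["data"]["activeTargets"]:
        job = target["labels"]["job"]
        summary.setdefault(job, []).append((target["scrapeUrl"], target["health"]))
    return summary


def fetch_targets(base_url="http://localhost:9090"):
    """Query a running Prometheus server (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Trimmed, made-up response in the shape returned by /api/v1/targets.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "docker", "instance": "10.0.6.99:9100"},
                "scrapeUrl": "http://10.0.6.99:9100/metrics",
                "health": "up",
            },
            {
                "labels": {"job": "cadvisor", "instance": "10.0.6.99:9080"},
                "scrapeUrl": "http://10.0.6.99:9080/metrics",
                "health": "down",
            },
        ]
    },
}

for job, targets in summarize_targets(SAMPLE_RESPONSE).items():
    for scrape_url, health in targets:
        print(job, scrape_url, health)
```

A target reported as `down` usually points to a wrong IP or port in prometheus.yml or a firewall between the hosts.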

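Open port 3000, check that you can log in to Grafana, add Prometheus as a data source, and import dashboard templates.

Import dashboards as needed. I imported 8919 (host CPU, memory, and similar metrics) and 14964 (Docker container CPU, memory, and so on). More dashboards are available from StarsL.cn on Grafana Labs.

After importing, open the dashboards:

Dashboard 8919

Dashboard 14964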

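5. Test DingTalk alerting

Modify an alert rule so that it fires under current conditions.

For example, fire when free disk space stays below 90% for 1 minute.

Modified rules.yml: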

groups:
- name: example_group
  rules:
  - alert: HighCPUUsage
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected on {{ $labels.instance }}."
      description: "The CPU usage on instance {{ $labels.instance }} has been above 80% for the past 10 minutes. Please investigate possible causes such as high workload or inefficient processes."
  - alert: LowDiskSpace
    expr: node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.9   # free disk space below 90%
    for: 1m       # sustained for 1 minute
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on root partition ({{ $labels.instance }})"
      description: "The disk space on the root partition of instance {{ $labels.instance }} is less than 90%. Immediate action might be required to avoid system issues. Consider cleaning up unnecessary files or expanding the disk."

After making the change, restart the containers:

docker-compose down
docker-compose up -d
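
Because docker-compose.yml starts Prometheus with `--web.enable-lifecycle`, a rule change can also be picked up without restarting any container by sending an HTTP POST to the `/-/reload` endpoint. The Python sketch below shows one way to do that; `reload_request` and `reload_prometheus` are hypothetical helpers, and `localhost:9090` assumes the script runs on the monitoring host.

```python
import urllib.request


def reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to reload its configuration."""
    return urllib.request.Request(f"{base_url}/-/reload", method="POST")


def reload_prometheus(base_url="http://localhost:9090"):
    """Send the reload request (needs network access); True on HTTP 200."""
    with urllib.request.urlopen(reload_request(base_url), timeout=10) as resp:
        return resp.status == 200


req = reload_request()
print(req.get_method(), req.full_url)
```

After a reload, the Status > Rules page should show the updated expression and `for` duration.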

Wait a few minutes, then check whether the alert arrives in DingTalk.
