Monitoring a Tomcat project with Grafana + Prometheus

Background:

Several of the company's web projects kept crashing with out-of-memory errors. Since Prometheus and Grafana were already set up, we use the Prometheus JMX exporter agent to monitor Tomcat memory in real time and push alerts to WeCom (Enterprise WeChat) when thresholds are crossed, so we learn about problems automatically. The Grafana Tomcat dashboard looks like this (screenshot omitted):

Prerequisites:

1. prometheus

2. alertmanager

3. grafana

4. tomcat

5. Prometheus agent JAR and its config file (download link below):

Link: https://pan.baidu.com/s/1B2PWimrpCQ9MqOedPvXdaA?pwd=yyds
Extraction code: yyds

Steps:

1. Download the files needed for Tomcat monitoring from the network-drive link above:

            jmx_prometheus_javaagent-0.16.1.jar

            config.yaml

2. In the Tomcat project's bin directory, edit catalina.sh and append the following to the export JAVA_OPTS= line:

-javaagent:$CATALINA_HOME/bin/jmx_prometheus_javaagent-0.16.1.jar=38081:$CATALINA_HOME/bin/config.yaml


[ -z "$CATALINA_BASE" ] && CATALINA_BASE="$CATALINA_HOME"

export JAVA_OPTS="-Xmx4g -Xms3g -Xmn1024m -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSMaxAbortablePrecleanTime=300 -XX:+CMSScavengeBeforeRemark  -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=30m -Xloggc:$CATALINA_HOME/logs/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CATALINA_HOME/logs/heapdump -javaagent:$CATALINA_HOME/bin/jmx_prometheus_javaagent-0.16.1.jar=38081:$CATALINA_HOME/bin/config.yaml"

For the exporter port, a handy convention is to prefix the Tomcat port with a 3: my Tomcat listens on 8081, so the JVM metrics endpoint gets 38081. Any free port works; this is just a naming scheme.
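The port convention above is trivial, but if you script the rollout it can be captured in a helper (the function name `metrics_port` is mine, not part of the original setup):

```python
def metrics_port(tomcat_port: int) -> int:
    """Derive the JMX-exporter port by prefixing the Tomcat port with '3'.

    E.g. 8081 -> 38081. This is only the naming convention used in this
    article; the exporter accepts any free port.
    """
    return int("3" + str(tomcat_port))
```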

3. Start the Tomcat project.
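Once Tomcat is up, you can sanity-check the agent by fetching http://hostname:38081/metrics. A minimal sketch for parsing the Prometheus text exposition it returns (the sample lines are illustrative, not captured from a live instance; real parsing should use an official client library):

```python
def parse_exposition(text: str) -> dict:
    """Parse simple 'name{labels} value' lines of Prometheus text format.

    Skips comments (# HELP / # TYPE) and blank lines; ignores escaping
    and timestamps -- good enough only for a quick smoke test.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative sample in the format the agent serves:
sample = """\
# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap"} 1.234567E8
jvm_memory_bytes_used{area="nonheap"} 5.6789E7
"""
```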

4. Add a scrape job on the Prometheus server in /usr/local/prometheus/prometheus.yml:


  - job_name: 'tomcat'
    static_configs:
    - targets: ['hostname:38081']
      labels:
        service: apache
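To confirm Prometheus actually sees the new target, query its HTTP API: `up{job="tomcat"}` should return 1. A sketch of building the instant-query URL (`/api/v1/query` is Prometheus's documented instant-query endpoint; the base URL is whatever your server listens on):

```python
from urllib.parse import urlencode

def instant_query_url(base: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urlencode({"query": promql})
```

Fetching the resulting URL (with curl or urllib) returns JSON whose `data.result` holds the sampled values.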

5. Configure JVM alerting rules: create /usr/local/prometheus/rule_files/jvm.yml (make sure this path is listed under rule_files in prometheus.yml):

# severity levels, from most to least severe: red, orange, yellow, blue
# (the rules below actually use middle / high / critical)
groups:
  - name: jvm-alerting
    rules:

    # Instance down for more than 30 seconds
    - alert: instance-down
      expr: up == 0
      for: 30s
      labels:
        severity: middle
        service: apache
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

    # Instance down for more than 1 minute
    - alert: instance-down
      expr: up == 0
      for: 1m
      labels:
        severity: high
        service: apache
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

    # Instance down for more than 5 minutes
    - alert: instance-down
      expr: up == 0
      for: 5m
      labels:
        severity: critical
        service: apache
        status: 严重告警
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # Heap usage above 50%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 50
      for: 1m
      labels:
        severity: middle
        service: apache
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. current usage ({{ $value }}%)"

    # Heap usage above 70%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 70
      for: 1m
      labels:
        severity: high
        service: apache
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 70%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 70%] for more than 1 minute. current usage ({{ $value }}%)"

    # Heap usage above 90%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 90
      for: 1m
      labels:
        severity: critical
        service: apache
        status: 严重告警
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. current usage ({{ $value }}%)"

    # Old GC took more than 30% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.3
      for: 5m
      labels:
        severity: middle
        service: apache
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. GC time in the last 5m: {{ $value }}s"

    # Old GC took more than 50% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.5
      for: 5m
      labels:
        severity: high
        service: apache
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. GC time in the last 5m: {{ $value }}s"

    # Old GC took more than 80% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.8
      for: 5m
      labels:
        status: 严重告警
        severity: critical
        service: apache
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. GC time in the last 5m: {{ $value }}s"
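The heap-usage tiers above can be mirrored in plain code, which makes it easy to unit-test the threshold logic outside Prometheus (the severity names follow the rules file; the helper itself is mine):

```python
def heap_severity(used_bytes: float, max_bytes: float):
    """Map heap usage to the severity tiers used in jvm.yml.

    Mirrors the rules: >90% -> critical, >70% -> high, >50% -> middle.
    Returns None below the lowest threshold (no alert fires).
    """
    pct = used_bytes / max_bytes * 100
    if pct > 90:
        return "critical"
    if pct > 70:
        return "high"
    if pct > 50:
        return "middle"
    return None
```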

6. Reload Prometheus (this is a configuration reload, not a restart; the /-/reload endpoint only works when Prometheus was started with --web.enable-lifecycle, and the port is whatever your Prometheus listens on):

       curl -XPOST http://hostname:7070/-/reload

7. Import a Grafana dashboard:

    Tomcat dashboard | Grafana Labs

    dashboard IDs: 8704, 8878


Note: alerts reach the WeCom (Enterprise WeChat) group bot through Alertmanager's webhook interface: Alertmanager POSTs each alert to a webhook receiver, which forwards it to the WeCom bot. Setting up that bot will be covered in a separate article.
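To give a flavor of what such a receiver does, here is a sketch that turns an Alertmanager webhook payload into the text-message JSON a WeCom group bot accepts (`{"msgtype": "text", "text": {"content": ...}}` is WeCom's documented bot message format, and the payload fields follow Alertmanager's webhook JSON; the function itself is illustrative, not the receiver used here):

```python
def wecom_message(payload: dict) -> dict:
    """Format an Alertmanager webhook payload as a WeCom bot text message.

    Alertmanager POSTs JSON with an 'alerts' list; each alert carries
    'status', 'labels' and 'annotations'. One summary line per alert.
    """
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        ann = alert.get("annotations", {})
        lines.append(
            f"[{alert.get('status', 'firing')}] "
            f"{labels.get('alertname', '?')} "
            f"({labels.get('severity', '?')}): "
            f"{ann.get('summary', '')}"
        )
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}
```

The receiver would then POST this dict as JSON to the bot's webhook URL.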

 
