Background:
Several of the company's web projects kept crashing with out-of-memory errors. Since we already had Prometheus and Grafana in place, we use the Prometheus JMX agent to monitor Tomcat memory in real time and send alerts to WeChat Work (企业微信) when a threshold is exceeded, so we learn about problems automatically. The Grafana Tomcat dashboard looks like this:
Tools needed:
1. Prometheus
2. Alertmanager
3. Grafana
4. Tomcat
5. Download link for the Prometheus agent and its config file:
Link: https://pan.baidu.com/s/1B2PWimrpCQ9MqOedPvXdaA?pwd=yyds
Extraction code: yyds
Steps:
1. Download the files Tomcat monitoring needs from the network disk link above:
jmx_prometheus_javaagent-0.16.1.jar
config.yaml
2. Edit the catalina.sh file under the Tomcat project's bin directory and append the following after export JAVA_OPTS=:
-javaagent:$CATALINA_HOME/bin/jmx_prometheus_javaagent-0.16.1.jar=38081:$CATALINA_HOME/bin/config.yaml
For reference, the relevant section of my catalina.sh ends up looking like this:
[ -z "$CATALINA_BASE" ] && CATALINA_BASE="$CATALINA_HOME"
export JAVA_OPTS="-Xmx4g -Xms3g -Xmn1024m -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSMaxAbortablePrecleanTime=300 -XX:+CMSScavengeBeforeRemark -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=30m -Xloggc:$CATALINA_HOME/logs/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CATALINA_HOME/logs/heapdump -javaagent:$CATALINA_HOME/bin/jmx_prometheus_javaagent-0.16.1.jar=38081:$CATALINA_HOME/bin/config.yaml"
For the metrics port, my convention is to put a 3 in front of the Tomcat project's port: my Tomcat listens on 8081, so the JVM metrics endpoint is configured as 38081.
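The config.yaml from the download link is not reproduced here. For reference, a minimal jmx_prometheus_javaagent configuration that simply exports all MBeans looks like the sketch below (an assumption, not necessarily the file from the network disk):

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # export every MBean attribute unchanged (a catch-all sketch)
  - pattern: ".*"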
3. Start the Tomcat project.
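Once Tomcat is up, it is worth verifying that the agent is exporting metrics (run on the Tomcat host; 38081 is the metrics port chosen in step 2):

curl -s http://localhost:38081/metrics | head

The output should contain Prometheus-format metrics such as jvm_memory_bytes_used.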
4. Configure monitoring on the Prometheus server by adding a scrape job under scrape_configs in /usr/local/prometheus/prometheus.yml:
  - job_name: 'tomcat'
    static_configs:
      - targets: ['hostname:38081']
        labels:
          service: apache
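For the alert rules created in the next step to be evaluated, prometheus.yml also needs a rule_files entry, plus an alerting block pointing at Alertmanager so the alerts can be routed. A minimal sketch, assuming Alertmanager runs on the same host on its default port 9093:

rule_files:
  - /usr/local/prometheus/rule_files/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']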
5. Configure JVM alerting by creating the file /usr/local/prometheus/rule_files/jvm.yml:
# severity levels used below, from least to most severe: middle, high, critical
groups:
  - name: jvm-alerting
    rules:
      # instance down for more than 30 seconds
      - alert: instance-down
        expr: up == 0
        for: 30s
        labels:
          severity: middle
          service: apache
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

      # instance down for more than 1 minute
      - alert: instance-down
        expr: up == 0
        for: 1m
        labels:
          severity: high
          service: apache
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # instance down for more than 5 minutes
      - alert: instance-down
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          service: apache
          status: critical alert
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      # heap usage above 50%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 50
        for: 1m
        labels:
          severity: middle
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. current usage ({{ $value }}%)"

      # heap usage above 70%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 70
        for: 1m
        labels:
          severity: high
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 70%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 70%] for more than 1 minute. current usage ({{ $value }}%)"

      # heap usage above 90%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 90
        for: 1m
        labels:
          severity: critical
          service: apache
          status: critical alert
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. current usage ({{ $value }}%)"

      # Old GC took more than 30% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.3
        for: 5m
        labels:
          severity: middle
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. current GC time in the last 5 minutes: {{ $value }}s"

      # Old GC took more than 50% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.5
        for: 5m
        labels:
          severity: high
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. current GC time in the last 5 minutes: {{ $value }}s"

      # Old GC took more than 80% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.8
        for: 5m
        labels:
          severity: critical
          service: apache
          status: critical alert
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. current GC time in the last 5 minutes: {{ $value }}s"
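Before reloading Prometheus, the rule file can be validated with promtool, which ships with Prometheus:

promtool check rules /usr/local/prometheus/rule_files/jvm.yml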
6. Reload the Prometheus configuration (the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle):
curl -XPOST http://hostname:7070/-/reload
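After the reload, the new rules appear on the Prometheus Alerts page. The heap-usage expression from jvm.yml can also be sanity-checked directly in the Prometheus expression browser:

jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100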
7. Import the Grafana dashboards: search for "Tomcat dashboard" on Grafana Labs, or import them by ID.
dashboard id: 8704, 8878
8. Note: the WeChat Work (企业微信) alert bot is driven through the webhook interface of Prometheus Alertmanager: Alertmanager posts alerts to a webhook service, which forwards them to WeChat Work. If there is interest, setting up the WeChat Work alert bot will be covered in a separate article. Stay tuned.
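For reference, an Alertmanager route that forwards everything to such a webhook service could look like the sketch below; the webhook URL and grouping intervals are placeholders, not values from this setup:

route:
  receiver: 'wechat-webhook'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'wechat-webhook'
    webhook_configs:
      # placeholder address of the service that forwards alerts to WeChat Work
      - url: 'http://localhost:8060/webhook'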