Background:
Several of the company's web projects kept crashing with out-of-memory errors. Since we already had Prometheus and Grafana in place, we use the Prometheus JMX agent to monitor Tomcat memory in real time and send alerts to WeChat Work (企业微信) when a threshold is exceeded, so we learn about problems automatically. The Grafana Tomcat dashboard looks like this:
Tools needed:
1. Prometheus
2. Alertmanager
3. Grafana
4. Tomcat
5. Download link for the Prometheus agent and its config file:
Link: https://pan.baidu.com/s/1B2PWimrpCQ9MqOedPvXdaA?pwd=yyds
Extraction code: yyds
Steps:
1. Download the files Tomcat monitoring needs from the network disk link above:
jmx_prometheus_javaagent-0.16.1.jar
config.yaml
2. Edit the catalina.sh file under the Tomcat project's bin directory and append the following after export JAVA_OPTS=:
-javaagent:$CATALINA_HOME/bin/jmx_prometheus_javaagent-0.16.1.jar=38081:$CATALINA_HOME/bin/config.yaml
For reference, the relevant section of my catalina.sh ends up looking like this:
[ -z "$CATALINA_BASE" ] && CATALINA_BASE="$CATALINA_HOME"
export JAVA_OPTS="-Xmx4g -Xms3g -Xmn1024m -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSMaxAbortablePrecleanTime=300 -XX:+CMSScavengeBeforeRemark -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=30m -Xloggc:$CATALINA_HOME/logs/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CATALINA_HOME/logs/heapdump -javaagent:$CATALINA_HOME/bin/jmx_prometheus_javaagent-0.16.1.jar=38081:$CATALINA_HOME/bin/config.yaml"
For the metrics port, my convention is to put a 3 in front of the Tomcat project's port: my Tomcat listens on 8081, so the JVM metrics endpoint is configured as 38081.
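The config.yaml from the download link is not reproduced here. For reference, a minimal jmx_prometheus_javaagent configuration that simply exports all MBeans looks like the sketch below (an assumption, not necessarily the file from the network disk):

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # export every MBean attribute unchanged (a catch-all sketch)
  - pattern: ".*"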
3. Start the Tomcat project.
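Once Tomcat is up, it is worth verifying that the agent is exporting metrics (run on the Tomcat host; 38081 is the metrics port chosen in step 2):

curl -s http://localhost:38081/metrics | head

The output should contain Prometheus-format metrics such as jvm_memory_bytes_used.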
4. Configure monitoring on the Prometheus server by adding a scrape job under scrape_configs in /usr/local/prometheus/prometheus.yml:
  - job_name: 'tomcat'
    static_configs:
      - targets: ['hostname:38081']
        labels:
          service: apache
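For the alert rules created in the next step to be evaluated, prometheus.yml also needs a rule_files entry, plus an alerting block pointing at Alertmanager so the alerts can be routed. A minimal sketch, assuming Alertmanager runs on the same host on its default port 9093:

rule_files:
  - /usr/local/prometheus/rule_files/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']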
5. Configure JVM alerting by creating the file /usr/local/prometheus/rule_files/jvm.yml:
# severity levels used below, from least to most severe: middle, high, critical
groups:
  - name: jvm-alerting
    rules:
      # instance down for more than 30 seconds
      - alert: instance-down
        expr: up == 0
        for: 30s
        labels:
          severity: middle
          service: apache
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

      # instance down for more than 1 minute
      - alert: instance-down
        expr: up == 0
        for: 1m
        labels:
          severity: high
          service: apache
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # instance down for more than 5 minutes
      - alert: instance-down
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          service: apache
          status: critical alert
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      # heap usage above 50%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 50
        for: 1m
        labels:
          severity: middle
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. current usage ({{ $value }}%)"

      # heap usage above 70%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 70
        for: 1m
        labels:
          severity: high
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 70%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 70%] for more than 1 minute. current usage ({{ $value }}%)"

      # heap usage above 90%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100 > 90
        for: 1m
        labels:
          severity: critical
          service: apache
          status: critical alert
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. current usage ({{ $value }}%)"

      # Old GC took more than 30% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.3
        for: 5m
        labels:
          severity: middle
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. current GC time in the last 5 minutes: {{ $value }}s"

      # Old GC took more than 50% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.5
        for: 5m
        labels:
          severity: high
          service: apache
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. current GC time in the last 5 minutes: {{ $value }}s"

      # Old GC took more than 80% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="ConcurrentMarkSweep"}[5m]) > 5 * 60 * 0.8
        for: 5m
        labels:
          severity: critical
          service: apache
          status: critical alert
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. current GC time in the last 5 minutes: {{ $value }}s"
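Before reloading Prometheus, the rule file can be validated with promtool, which ships with Prometheus:

promtool check rules /usr/local/prometheus/rule_files/jvm.yml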
6. Reload the Prometheus configuration (the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle):
curl -XPOST http://hostname:7070/-/reload
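After the reload, the new rules appear on the Prometheus Alerts page. The heap-usage expression from jvm.yml can also be sanity-checked directly in the Prometheus expression browser:

jvm_memory_bytes_used{job="tomcat", area="heap"} / jvm_memory_bytes_max * 100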
7. Import the Grafana dashboards: search for "Tomcat dashboard" on Grafana Labs, or import them by ID.
dashboard id: 8704, 8878
8. Note: the WeChat Work (企业微信) alert bot is driven through the webhook interface of Prometheus Alertmanager: Alertmanager posts alerts to a webhook service, which forwards them to WeChat Work. If there is interest, setting up the WeChat Work alert bot will be covered in a separate article. Stay tuned.
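For reference, an Alertmanager route that forwards everything to such a webhook service could look like the sketch below; the webhook URL and grouping intervals are placeholders, not values from this setup:

route:
  receiver: 'wechat-webhook'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'wechat-webhook'
    webhook_configs:
      # placeholder address of the service that forwards alerts to WeChat Work
      - url: 'http://localhost:8060/webhook'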