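Our project recently needed distributed tracing and JVM monitoring for a microservice architecture. After comparing several options, we chose SkyWalking, an application performance monitoring (APM) tool for distributed systems, designed for microservice, cloud-native, and container-based (Docker, Kubernetes, Mesos) architectures.
Official site: https://skywalking.apache.org/
Versions used in this article: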
skywalking-oap:skywalking-oap-server:8.6.0-es7
skywalking-ui:skywalking-ui:8.6.0
Docker image tags: https://hub.docker.com/r/apache/skywalking-oap-server/tags
1. Start skywalking-oap and skywalking-ui with Docker
docker run --name skywalking-oap --restart always -p 11800:11800 -p 12800:12800 -d -e TZ=Asia/Shanghai -e SW_ES_USER=<ES-USER> -e SW_ES_PASSWORD=<ES-PWD> -e SW_STORAGE=elasticsearch7 -e SW_STORAGE_ES_CLUSTER_NODES=<ES-IP>:9200 -v /etc/localtime:/etc/localtime:ro apache/skywalking-oap-server:8.6.0-es7
docker run --name skywalking-ui --restart always -p 9898:8080 -d --link skywalking-oap:skywalking-oap -e TZ=Asia/Shanghai -e SW_OAP_ADDRESS=skywalking-oap:12800 -v /etc/localtime:/etc/localtime:ro apache/skywalking-ui:8.6.0
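Start a single-node ES instance. It should be running before skywalking-oap starts, since the OAP connects to its storage at boot. This instance runs without authentication, so SW_ES_USER and SW_ES_PASSWORD can be omitted from the OAP command when using it: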
docker run -d -p 9200:9200 -p 9300:9300 --name='elasticsearch' -e "discovery.type=single-node" -e ES_JAVA_OPTS="-Xms512m -Xmx512m" -e TZ=Asia/Shanghai -v /etc/localtime:/etc/localtime:ro elasticsearch:7.6.2
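The OAP connects to its storage when it boots, so it can help to wait for Elasticsearch before launching the other containers. The following is a minimal sketch, not part of the original setup: `wait_for_http` is a hypothetical helper name, it assumes `curl` is installed, and it assumes ES answers unauthenticated HTTP requests on port 9200, as with the single-node command above.

```shell
# Poll a URL until it answers with a successful HTTP status,
# or give up after a number of attempts.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
wait_for_http() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failures, -sS: quiet but still print errors
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

One possible use is `wait_for_http "http://<ES-IP>:9200"` followed by the skywalking-oap and skywalking-ui commands from step 1, run only when the function returns successfully.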
[Screenshot: result after startup]
2. Download the Java agent
Download page: https://skywalking.apache.org/downloads/
Choose "v8.6.0 for H2/MySQL/TiDB/InfluxDB/ElasticSearch 7", download it, and extract apache-skywalking-apm-es7-8.6.0.tar.gz.
[Screenshot: directory layout after extraction]
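If only the agent is needed on the application host, the extraction can be limited to that directory. This is a sketch under assumptions: `extract_agent` is a hypothetical helper, and the top-level directory name `apache-skywalking-apm-bin-es7` is inferred from the agent path used in step 2.3, so check it against your archive with `tar -tzf` first.

```shell
# Extract only the agent directory from the SkyWalking APM archive.
# $1 = path to the .tar.gz archive, $2 = destination directory
# Result: $2/agent/skywalking-agent.jar, $2/agent/config/agent.config, ...
extract_agent() {
  archive=$1
  dest=$2
  mkdir -p "$dest" || return 1
  # --strip-components=1 drops the top-level directory (assumed name below)
  tar -xzf "$archive" -C "$dest" --strip-components=1 \
    apache-skywalking-apm-bin-es7/agent
}
```

For example, `extract_agent apache-skywalking-apm-es7-8.6.0.tar.gz /opt/app` would leave the agent under `/opt/app/agent`, where `/opt/app` stands in for the directory that holds the jar to monitor.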
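2.1 Copy the agent directory to the location of the jar you want to monitor.
2.2 In agent/config/agent.config, set collector.backend_service to the IP:PORT of the skywalking-oap started above. Use the gRPC port, which is 11800 in the command above.
2.3 Add the following to the jar's startup script: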
-javaagent:/apache-skywalking-apm-bin-es7/agent/skywalking-agent.jar -Dskywalking.agent.service_name=micro-service
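**Note: the options above must come before -jar, because anything after the jar path is passed to the application as arguments rather than to the JVM.**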
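The ordering rule can be made hard to get wrong by assembling the command line in one place. The sketch below is illustrative only: `build_java_cmd` is a hypothetical helper, and the jar names and address in the example are placeholders. It also passes the backend address as `-Dskywalking.collector.backend_service`, which, as far as I know, overrides the value in agent.config in the same way `-Dskywalking.agent.service_name` does, so editing the file in step 2.2 becomes optional.

```shell
# Print a java command line with the agent options placed before -jar.
# $1 = path to skywalking-agent.jar
# $2 = service name shown in the SkyWalking UI
# $3 = OAP gRPC address (IP:PORT)
# $4 = application jar
build_java_cmd() {
  printf 'java -javaagent:%s -Dskywalking.agent.service_name=%s -Dskywalking.collector.backend_service=%s -jar %s\n' \
    "$1" "$2" "$3" "$4"
}
```

Calling `build_java_cmd /apache-skywalking-apm-bin-es7/agent/skywalking-agent.jar micro-service <OAP-IP>:11800 app.jar` prints the full line, which can be pasted into the startup script.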
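3. Open the UI and check the result
URL: http://<skywalking-ui IP>:9898 (host port 9898 is mapped to the UI container's port 8080 in step 1)
4. Configure alarms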
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 'Service: {name}\nResponse time exceeded 1000 ms in 3 of the last 10 minutes.'
  service_sla_rule:
    # Metrics value need to be long, double or int
    # service_sla is expressed in units of 1/10000, so 8000 means 80%.
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: 'Success rate of service {name} was below 80% in 2 of the last 10 minutes.'
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: 'Percentile response time of service {name} alarmed in 3 of the last 10 minutes, due to one or more of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000.'
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: 'Instance: {name}\nResponse time exceeded 1000 ms in 2 of the last 10 minutes.'
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Database access: {name}\nResponse time exceeded 1000 ms in 2 of the last 10 minutes.'
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Endpoint relation: {name}\nResponse time exceeded 1000 ms in 2 of the last 10 minutes.'
  unhealthy_event_rule:
    metrics-name: Unhealthy
    # Healthiness check is usually a scheduled task,
    # they may be unhealthy for the first few times,
    # and can be unhealthy occasionally due to network jitter,
    # please adjust the threshold as per your actual situation.
    threshold: 3
    op: ">"
    period: 5
    count: 1
    message: 'Instance: {name} has been unhealthy for 5 minutes.'
  instance_jvm_old_gc_count_rule:
    metrics-name: instance_jvm_old_gc_count
    threshold: 1
    op: ">"
    period: 10
    count: 1
    message: 'Instance: {name}\n Metric: old GC count\n Alarm: >= 2 within the last 10 minutes'
  instance_jvm_young_gc_count_rule:
    metrics-name: instance_jvm_young_gc_count
    threshold: 1
    op: ">"
    period: 5
    count: 1
    message: 'Instance: {name}\n Metric: young GC count\n Alarm: >= 2 within the last 5 minutes'
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_avg_rule:
  #   metrics-name: endpoint_avg
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
#webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=<DingTalk robot access_token>
      secret: <DingTalk robot secret>
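To make the `threshold`, `period`, and `count` fields concrete: as I understand the alarm rules, the OAP looks at one value per minute over the last `period` minutes and fires when at least `count` of them satisfy `op` against `threshold`. The sketch below models only that check for the `>` operator with integer values. `should_alarm` is a hypothetical name, it is not how the OAP is implemented, and it ignores `silence-period`.

```shell
# Return 0 (alarm) when at least $2 of the per-minute values are greater than $1.
# $1 = threshold, $2 = count, remaining arguments = one integer value per minute
should_alarm() {
  threshold=$1
  count=$2
  shift 2
  matched=0
  for value in "$@"; do
    if [ "$value" -gt "$threshold" ]; then
      matched=$((matched + 1))
    fi
  done
  [ "$matched" -ge "$count" ]
}
```

With the settings of `service_resp_time_rule` (threshold 1000, count 3, ten values for a 10-minute period), `should_alarm 1000 3 200 1200 300 1500 900 1100 400 500 600 700` succeeds because three of the ten minutes exceed 1000 ms.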
[Screenshot: the resulting alarm notification]
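When the OAP runs in Docker as in step 1, the alarm file has to reach the container somehow. One option is a bind mount, sketched below. The container path `/skywalking/config/alarm-settings.yml` is my assumption about the 8.6.0 image layout, so verify it (for example with `docker exec skywalking-oap ls /skywalking/config`) before relying on it. `start_oap_with_alarms`, `ES_NODES`, and `DOCKER` are hypothetical names introduced for this example; `DOCKER` only exists so the command can be previewed with `DOCKER=echo`.

```shell
# Start the OAP with a local alarm-settings.yml mounted read-only.
# $1 = path to the local alarm-settings.yml
# Requires ES_NODES to be set, e.g. ES_NODES=<ES-IP>:9200
start_oap_with_alarms() {
  alarm_file=$1
  # docker -v needs an absolute host path
  alarm_dir=$(cd "$(dirname "$alarm_file")" && pwd) || return 1
  ${DOCKER:-docker} run --name skywalking-oap --restart always \
    -p 11800:11800 -p 12800:12800 -d \
    -e TZ=Asia/Shanghai \
    -e SW_STORAGE=elasticsearch7 \
    -e SW_STORAGE_ES_CLUSTER_NODES="${ES_NODES:?set ES_NODES to <ES-IP>:9200}" \
    -v "$alarm_dir/$(basename "$alarm_file"):/skywalking/config/alarm-settings.yml:ro" \
    -v /etc/localtime:/etc/localtime:ro \
    apache/skywalking-oap-server:8.6.0-es7
}
```

Running `DOCKER=echo ES_NODES=<ES-IP>:9200 start_oap_with_alarms ./alarm-settings.yml` prints the command instead of executing it. Add the `SW_ES_USER` and `SW_ES_PASSWORD` variables from step 1 if your Elasticsearch requires authentication.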
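Closing words
- Putting this together took some effort; if it was useful to you, please give it a "Wow" (在看). Thanks!
- If anything here is incorrect, please point it out. [W: 编程心声]
- If you have any questions, follow the WeChat official account 编程心声 and leave a message.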