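SkyWalking Alarms: Usage, Tutorial, and Examples

Original post: SkyWalking--Alarms--Usage/Tutorial, from IT利刃出鞘's blog on CSDN

Introduction

Description

This article describes how to use SkyWalking's alarm feature.

SkyWalking can send alarm notifications through WebHook, gRPC, WeChat, DingTalk, Feishu, and other channels.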

Official documentation

Alarm: https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md

OAL rule syntax: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/oal.md

Scopes and fields: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/scope-definitions.md

Events: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/event.md

Configuration example

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 'Service: {name}\n  Metric: response time\n  Detail: exceeded 1000 ms at least 3 times in the last 10 minutes'
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: 'Service: {name}\n  Metric: success rate\n  Detail: below 80% at least 2 times in the last 10 minutes'
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
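    # Matches when at least one condition holds: p50>1000, p75>1000, p90>1000, p95>1000, p99>1000
    message: 'Service: {name}\n  Metric: response time\n  Detail: a percentile exceeded 1000 ms at least 3 times in the last 10 minutes'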
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: 'Instance: {name}\n  Metric: response time\n  Detail: exceeded 1000 ms at least 2 times in the last 10 minutes'
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Database access: {name}\n  Metric: response time\n  Detail: exceeded 1000 ms at least 2 times in the last 10 minutes'
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Endpoint relation: {name}\n  Metric: response time\n  Detail: exceeded 1000 ms at least 2 times in the last 10 minutes'
  instance_jvm_old_gc_count_rule:
    metrics-name: instance_jvm_old_gc_count
    threshold: 1
    op: ">"
    period: 1440
    count: 1
    message: 'Instance: {name}\n  Metric: old GC count\n  Detail: more than 1 in the last day'
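  instance_jvm_young_gc_count_rule:
    metrics-name: instance_jvm_young_gc_count
    # count cannot exceed period (one value per minute), so the limit of 100 belongs in threshold.
    threshold: 100
    op: ">"
    period: 5
    count: 1
    message: 'Instance: {name}\n  Metric: young GC count\n  Detail: more than 100 in one minute within the last 5 minutes'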
  # Requires adding one line to config/oal/core.oal: endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).count();
  endpoint_abnormal_rule:
    metrics-name: endpoint_abnormal
    threshold: 1
    op: ">="
    period: 2
    count: 1
    message: 'Endpoint: {name}\n  Metric: abnormal responses\n  Detail: at least 1 in the last 2 minutes\n'
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/


dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s"
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=<access_token of the DingTalk robot>
      secret: <secret of the DingTalk robot>


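Alarm overview

Description

Apache SkyWalking alarms are driven by a set of rules.

The alarm rules are configured in config/alarm-settings.yml under the SkyWalking server installation directory.

Each rules.xxx_rule.metrics-name in alarm-settings.yml refers to a metric defined in one of the OAL files under config/oal: core.oal, event.oal, java-agent.oal, or browser.oal.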
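As a rough mental model of how the op, threshold, period, count, and silence-period fields of a rule interact, the sketch below keeps a sliding window of per-minute metric values and fires when enough of them match. It is a simplified illustration, not SkyWalking's actual implementation; the class name RuleWindow and the sample values are hypothetical.

```python
import operator
from collections import deque

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


class RuleWindow:
    """Simplified model of one alarm rule evaluated once per minute."""

    def __init__(self, op, threshold, period, count, silence_period):
        self.compare = OPS[op]
        self.threshold = threshold
        self.count = count
        self.silence_period = silence_period
        self.values = deque(maxlen=period)  # keeps the last `period` minutes
        self.silence_left = 0

    def check(self, value):
        """Feed one per-minute metric value; return True if an alarm fires."""
        self.values.append(value)
        matched = sum(1 for v in self.values if self.compare(v, self.threshold))
        if self.silence_left > 0:
            # Still silent after a previous alarm: suppress and count down.
            self.silence_left -= 1
            return False
        if matched >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


# Mirrors service_resp_time_rule: > 1000, period 10, count 3, silence-period 5.
rule = RuleWindow(">", 1000, period=10, count=3, silence_period=5)
samples = [1200, 900, 1500, 1100, 1300, 1300, 1300, 1300, 1300, 1300]
fired = [rule.check(v) for v in samples]
```

In this model the fourth sample is the third match, so the alarm fires there, stays silent for the next five checks, and fires again once the silence has run out.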

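Components of the alarm configuration

The alarm configuration has three parts:

  1. Alarm rules: the conditions under which an alarm is triggered.
  2. WebHooks: the list of service endpoints called when an alarm is triggered.
  3. gRPCHook: the host and port of the remote gRPC method called when an alarm is triggered.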
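A WebHook receiver gets the triggered alarms as an HTTP POST with a JSON array in the body. The sketch below parses such a body into simple records. The field names (scope, name, ruleName, alarmMessage, startTime) are taken from the official alarm documentation as I recall it and may differ between SkyWalking versions; the AlarmMessage class, the parse_alarms function, and the sample body are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class AlarmMessage:
    scope: str
    name: str
    rule_name: str
    alarm_message: str
    start_time: datetime


def parse_alarms(body: str) -> List[AlarmMessage]:
    """Parse the JSON array posted to a webhook into AlarmMessage records."""
    alarms = []
    for item in json.loads(body):
        alarms.append(AlarmMessage(
            scope=item.get("scope", ""),
            name=item.get("name", ""),
            rule_name=item.get("ruleName", ""),
            alarm_message=item.get("alarmMessage", ""),
            # startTime is assumed to be epoch milliseconds.
            start_time=datetime.fromtimestamp(
                item.get("startTime", 0) / 1000, tz=timezone.utc),
        ))
    return alarms


# Sample body for illustration only.
sample_body = json.dumps([{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceA",
    "id0": "12",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "Service: serviceA response time is too high",
    "startTime": 1560524171000,
}])
alarms = parse_alarms(sample_body)
```

Unknown fields are ignored and missing ones fall back to defaults, so the parser tolerates payloads that carry extra keys in other versions.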

Terminology

The entity name depends on the scope:

  • Service: Service name
  • Instance: {Instance name} of {Service name}
  • Endpoint: {Endpoint name} in {Service name}
    • An endpoint is an API, that is, a URL.
    • Endpoint rules consume more memory and other resources than service and instance rules.
  • Database: Database service name
  • Service Relation: {Source service name} to {Dest service name}
  • Instance Relation: {Source instance name} of {Source service name} to {Dest instance name} of {Dest service name}
  • Endpoint Relation: {Source endpoint name} in {Source Service name} to {Dest endpoint name} in {Dest service name}
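The scope-to-name mapping above can be written out as a small lookup, which makes it easier to predict what {name} will look like in an alarm message. This is only an illustration of the list; the entity_name function and its keyword names are hypothetical and not part of SkyWalking.

```python
# Templates copied from the scope list above; placeholder names are illustrative.
ENTITY_NAME_TEMPLATES = {
    "Service": "{service}",
    "Instance": "{instance} of {service}",
    "Endpoint": "{endpoint} in {service}",
    "Database": "{service}",
    "Service Relation": "{source_service} to {dest_service}",
    "Instance Relation": ("{source_instance} of {source_service} to "
                          "{dest_instance} of {dest_service}"),
    "Endpoint Relation": ("{source_endpoint} in {source_service} to "
                          "{dest_endpoint} in {dest_service}"),
}


def entity_name(scope, **parts):
    """Build the entity name for a scope from its named parts."""
    return ENTITY_NAME_TEMPLATES[scope].format(**parts)


endpoint = entity_name("Endpoint", endpoint="/api/order", service="order-service")
relation = entity_name("Service Relation",
                       source_service="gateway", dest_service="order-service")
```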

Rules

The above is only part of the article. For easier maintenance, the full text has moved to: SkyWalking Alarms Usage Tutorial, on 自学精灵
