SkyWalking Alarms

云川之下

Last modified 2022-04-07 16:19:51

Category: spring cloud. Tags: Skywalking

First published 2022-04-07 15:45:53

Article link: https://blog.csdn.net/m0_45406092/article/details/124016643

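This article explains how to set up service performance alarms in SkyWalking, covering key metrics such as average response time, success rate, and the proportion of slow requests. The examples show how to configure rules that detect anomalies within a 10-minute window, such as a response time above 1 second or a success rate below 80%, and the article also explains the SLA concept and its role in performance monitoring.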

Overview

op: "<" triggers when the metric value is below the threshold
op: ">" triggers when the metric value is above the threshold

Delivery method: WebHook
Configuration file location: \config\alarm-settings.yml
Avoid endpoint rules where possible: they consume more memory and resources than service and instance rules
alarm reference: https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md
metrics reference: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/oal.md
oal-scope reference: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/scope-definitions.md
Default OAL: \config\oal\core.oal. Custom metrics can be defined in this file and then used in alarm-settings.yml
Example:
// Calculate the percentage of responses whose status is true, for each endpoint.
endpoint_success = from(Endpoint.*).filter(status == true).percent()
// Calculate the sum of responses with a code in [404, 500, 503], for each endpoint.
endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).sum()
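
The overview names WebHook as the delivery method: the backend POSTs triggered alarms to a URL you configure. The sketch below shows one way a receiver might log them. It assumes the request body is a JSON array whose items carry the fields `scope`, `name`, `ruleName` and `alarmMessage` (field names as recalled from the backend-alarm document linked above; check them against your SkyWalking version). The function names and the port 8090 are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def summarize_alarms(body: bytes) -> List[str]:
    """Turn a webhook request body (assumed: JSON array of alarms) into log lines."""
    alarms = json.loads(body.decode("utf-8"))
    return [
        f"[{a.get('scope', '?')}] {a.get('name', '?')} "
        f"({a.get('ruleName', '?')}): {a.get('alarmMessage', '')}"
        for a in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    """Minimal receiver: print every alarm, then answer 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alarms(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8090) -> None:
    """Blocking helper; call it yourself to start listening (hypothetical port)."""
    HTTPServer(("", port), AlarmHandler).serve_forever()
```

Calling `serve()` starts the listener; the URL of this receiver would then be registered as a webhook in alarm-settings.yml.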

1. Examples

1.1 Average service response time exceeds 1000 ms in 3 of the last 10 minutes:

Focus: response time

  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
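
To make the interplay of `period`, `count`, `op` and `threshold` concrete, here is a simplified model of the rule above. It is a sketch of the idea only, not SkyWalking's implementation: it assumes one metric value per minute and ignores `silence-period`. The function name `should_alarm` is hypothetical.

```python
import operator
from typing import Sequence

# The comparison operators mentioned in this article.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def should_alarm(values: Sequence[float], op: str, threshold: float,
                 period: int, count: int) -> bool:
    """values holds one metric value per minute, oldest first."""
    window = values[-period:]  # only the last `period` minutes are considered
    matches = sum(1 for v in window if OPS[op](v, threshold))
    return matches >= count


# Ten minutes of average response time in ms; three minutes exceed 1000.
resp_time = [200, 300, 1200, 250, 1500, 400, 300, 1100, 280, 310]
print(should_alarm(resp_time, ">", 1000, period=10, count=3))  # True
print(should_alarm(resp_time, ">", 1000, period=10, count=4))  # False
```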

1.2 Service success rate is below 80% in 2 of the last 10 minutes

Focus: success rate

  service_sla_rule:
    # Metrics value must be long, double or int
    metrics-name: service_sla
    op: "<"
    # service_sla is stored as percentage x 100, so 8000 means 80%
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before the alarm triggers
    count: 2
    # How many checks the alarm stays silent after it triggers; defaults to the same value as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
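
The rule pairs `threshold: 8000` with a message about 80%, which implies that the success-rate metric is scaled by 100 (so 10000 would stand for 100%). Assuming that scaling, two small hypothetical helpers can convert between the human-readable percentage and the value to put in the rule:

```python
def percent_to_threshold(percent: float) -> int:
    """Convert a percentage such as 80 into the scaled threshold (assumed x100)."""
    return round(percent * 100)


def threshold_to_percent(threshold: int) -> float:
    """Convert a scaled threshold such as 8000 back into a percentage."""
    return threshold / 100


print(percent_to_threshold(80))    # 8000
print(threshold_to_percent(8000))  # 80.0
print(percent_to_threshold(99.5))  # 9950
```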

For SLA, see the article "Do you really understand the SLA in performance stress testing?"

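1.2.1 SLA definition

Put simply, an SLA is a measure of service quality, usually assessed as a percentage.

A service-level agreement (SLA) is a formal commitment defined between a service provider and a customer [Wikipedia definition]. For an internet company, the SLA is essentially a guarantee of site or service availability.

An SLA has two elements, the SLI and the SLO: the SLI defines the indicator that is measured, and the SLO defines the state the service is expected to deliver.

SLI: an SLI (service-level indicator) is a carefully defined measurement. What to measure depends on the characteristics of the system, and choosing SLIs is a complex process. When settling on a concrete indicator, check that it describes service quality accurately and that it can be measured reliably.

SLO: an SLO (service-level objective) specifies an expected state of the functionality the service provides, and contains everything needed to describe what the service should deliver. Typical statements are: average QPS per minute > 100k/s; 99% of access latency < 500ms; 99% of per-minute bandwidth > 200MB/s.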
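
The relationship between the two can be sketched in a few lines: the SLI is a number you compute from measurements, and the SLO is the target you compare it with. The class and function names below are hypothetical, and treating a period with no traffic as healthy is a choice made only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for one indicator; is_met() compares a measured SLI with it."""
    name: str
    target: float
    higher_is_better: bool = True

    def is_met(self, sli: float) -> bool:
        if self.higher_is_better:
            return sli >= self.target
        return sli <= self.target


def success_rate(total: int, failed: int) -> float:
    """SLI: percentage of successful requests."""
    if total == 0:
        return 100.0  # no traffic: treated as healthy in this sketch
    return 100.0 * (total - failed) / total


availability = Slo("success rate (%)", target=80.0)
latency = Slo("p99 latency (ms)", target=500.0, higher_is_better=False)

print(availability.is_met(success_rate(1000, 150)))  # True  (85%)
print(availability.is_met(success_rate(1000, 250)))  # False (75%)
print(latency.is_met(420.0))                         # True
```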

1.3 The 90th-percentile (p90) service response time exceeds 1 second in 3 of the last 10 minutes

Focus: the proportion of slow requests


  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
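
For readers unfamiliar with percentiles, the sketch below computes a p90 with the nearest-rank method: the smallest sample such that at least 90% of all samples are less than or equal to it. This is an illustration of the concept; SkyWalking's own percentile calculation may differ, and the function name and sample values are made up.

```python
import math
from typing import Sequence


def percentile(latencies_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of the given samples."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


samples = [120, 150, 180, 200, 220, 250, 300, 400, 1200, 1500]
print(percentile(samples, 90))  # 1200
print(percentile(samples, 50))  # 220
```

Here the p90 is 1200 ms, so a minute with these samples would count as one match against `threshold: 1000` even though most requests were fast.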

1.4 Average response time of a service instance exceeds 1 second in 2 of the last 10 minutes

service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes

Rule parameters explained

Rule name. A unique name shown in the alarm message. It must end with _rule; the rest of the name is chosen by the user.

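Attribute	Meaning
metrics-name	Name of a metric defined in the OAL script; it cannot be made up freely
threshold	Threshold value, interpreted together with metrics-name and the comparison operator below
op	Comparison operator; can be set to >, <, =
period	Length of the window whose metric data is checked against the alarm rule, in minutes
count	Number of matches required before an alarm message is sent
silence-period	How long identical alarm messages are suppressed after one has been sent
message	Content of the alarm message
include-names	List of services to which this rule applies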
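
The constraints in the table lend themselves to a quick sanity check before deploying a rule file. The sketch below works on a rule already loaded into a Python dict. Which keys are mandatory is an assumption made for this example (the article does not say), and `validate_rule` is a hypothetical helper, not part of SkyWalking.

```python
from typing import Dict, List

# Assumption for this sketch: these keys are treated as mandatory.
REQUIRED_KEYS = {"metrics-name", "op", "threshold", "period", "count"}
# Operators listed in the table above.
ALLOWED_OPS = {">", "<", "="}


def validate_rule(name: str, rule: Dict[str, object]) -> List[str]:
    """Return a list of problems found in one alarm rule (empty if none)."""
    problems = []
    if not name.endswith("_rule"):
        problems.append("rule name must end with _rule")
    missing = sorted(REQUIRED_KEYS - set(rule))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if "op" in rule and rule["op"] not in ALLOWED_OPS:
        problems.append(f"unsupported op: {rule['op']}")
    for key in ("period", "count"):
        value = rule.get(key)
        if key in rule and (not isinstance(value, int) or value <= 0):
            problems.append(f"{key} must be a positive integer")
    return problems


rule = {"metrics-name": "service_resp_time", "op": ">",
        "threshold": 1000, "period": 10, "count": 3}
print(validate_rule("service_resp_time_rule", rule))  # []
print(validate_rule("service_resp_time", rule))
```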

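Metrics name. This is the metric name defined in the OAL script. Only the long, double and int types are supported. The available metrics are listed in the core.oal file of the extracted package, at /your-path/apache-skywalking-apm-bin-es7\config\oal\core.oal: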

service_instance_sla = from(ServiceInstance.*).percent(status == true);
service_instance_resp_time = from(ServiceInstance.latency).longAvg();
service_instance_cpm = from(ServiceInstance.*).cpm();
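
Since `metrics-name` must match a name defined in OAL, it can help to list the names a core.oal file actually defines. The following sketch extracts them with a regular expression. It only handles single-line definitions of the form `name = from(...)` and skips `//` comment lines; `metric_names` is a hypothetical helper.

```python
import re
from typing import List

# Matches "metric_name = from(" at the start of a line.
_METRIC = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*from\(")


def metric_names(oal_text: str) -> List[str]:
    """Collect the metric names defined in a piece of OAL text."""
    names = []
    for line in oal_text.splitlines():
        if line.lstrip().startswith("//"):
            continue  # skip comment lines
        match = _METRIC.match(line)
        if match:
            names.append(match.group(1))
    return names


oal = """
// instance-level metrics
service_instance_sla = from(ServiceInstance.*).percent(status == true);
service_instance_resp_time= from(ServiceInstance.latency).longAvg();
service_instance_cpm = from(ServiceInstance.*).cpm();
"""
print(metric_names(oal))
# ['service_instance_sla', 'service_instance_resp_time', 'service_instance_cpm']
```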