Alertmanager alerting rules explained

Please credit the source when reposting. Original link: http://tailnode.tk/2017/03/al...

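Overview

This article covers the alerting and notification rules of Prometheus and Alertmanager. The Prometheus configuration file is prometheus.yml and the Alertmanager configuration file is alertmanager.yml.
Alert: Prometheus sending a detected abnormal event to Alertmanager. It does not mean sending an email notification.
Notification: Alertmanager sending out a notification about an abnormal event (email, webhook, and so on).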

Alerting rules

Set the interval at which alerting rules are evaluated in prometheus.yml:

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]

Specify the rule files in prometheus.yml (wildcards are allowed, e.g. rules/*.rules):

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/mengyuan.rules

Add mengyuan.rules to the rules directory:

ALERT goroutines_gt_70
  IF go_goroutines > 70
  FOR 5s  
  LABELS { status = "yellow" }
  ANNOTATIONS {
    summary = "goroutines exceeds 70, current value {{ $value }}",
    description = "current instance {{ $labels.instance }}",
  }

ALERT goroutines_gt_90
  IF go_goroutines > 90
  FOR 5s  
  LABELS { status = "red" }
  ANNOTATIONS {
    summary = "goroutines exceeds 90, current value {{ $value }}",
    description = "current instance {{ $labels.instance }}",
  }

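Once the configuration files are in place, Prometheus has to re-read them. There are two ways to do this:

  1. Send a POST request to /-/reload through the HTTP API, e.g.: curl -X POST http://localhost:9090/-/reload

  2. Send a SIGHUP signal to the prometheus process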

Compare the resulting email notification with the rules above (alertmanager.yml must also be configured before any email is received).

Notification rules

Configure route and receivers in alertmanager.yml:

route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 5s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: mengyuan

receivers:
- name: 'mengyuan'
  webhook_configs:
  - url: http://192.168.0.53:8080
  email_configs:
  - to: 'mengyuan@tenxcloud.com'

Terminology

Route

The route property defines how alerts are dispatched. It is a tree, and matching walks it depth-first, from left to right.

// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
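
To make the depth-first, left-to-right search concrete, here is a minimal sketch. It is not Alertmanager's implementation: the Route type below is a simplified stand-in that only supports exact label matches, and the receiver names "default", "pager" and "mail" are made up for illustration. The "status" label is the one set by the rules earlier in this article.

```go
package main

import "fmt"

// Route is a simplified, hypothetical routing node. It matches an alert when
// every key/value pair in Match is present in the alert's labels.
type Route struct {
	Receiver string
	Match    map[string]string
	Continue bool // keep checking later siblings after this node matched
	Routes   []*Route
}

// matches reports whether all of the node's matchers hold for the labels.
func (r *Route) matches(labels map[string]string) bool {
	for k, v := range r.Match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// Find walks the tree depth-first, left to right, and returns the deepest
// matching nodes. A node with no matching child is itself the result.
func (r *Route) Find(labels map[string]string) []*Route {
	if !r.matches(labels) {
		return nil
	}
	var all []*Route
	for _, child := range r.Routes {
		found := child.Find(labels)
		all = append(all, found...)
		// Stop at the first matching child unless it asks to continue.
		if len(found) > 0 && !child.Continue {
			break
		}
	}
	if len(all) == 0 {
		all = append(all, r)
	}
	return all
}

// exampleTree builds a root with two children keyed on the "status" label.
// The receiver names are invented for this example.
func exampleTree() *Route {
	return &Route{
		Receiver: "default",
		Routes: []*Route{
			{Receiver: "pager", Match: map[string]string{"status": "red"}},
			{Receiver: "mail", Match: map[string]string{"status": "yellow"}},
		},
	}
}

func main() {
	root := exampleTree()
	for _, status := range []string{"red", "yellow", "green"} {
		for _, r := range root.Find(map[string]string{"status": status}) {
			fmt.Printf("status=%s -> %s\n", status, r.Receiver)
		}
	}
}
```

In this sketch an alert that matches no child falls back to the root node, which is why a root route needs a default receiver.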

Alert

An Alert is an alert received by Alertmanager. Its type is defined as follows:

// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
    // Label value pairs for purpose of aggregation, matching, and disposition
    // dispatching. This must minimally include an "alertname" label.
    Labels LabelSet `json:"labels"`

    // Extra key/value information which does not define alert identity.
    Annotations LabelSet `json:"annotations"`

    // The known time range for this alert. Both ends are optional.
    StartsAt     time.Time `json:"startsAt,omitempty"`
    EndsAt       time.Time `json:"endsAt,omitempty"`
    GeneratorURL string    `json:"generatorURL"`
}

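Only Alerts with identical Labels (every key and every value equal) are treated as the same alert. A single rule in a Prometheus rules file can therefore produce several distinct alerts, for example one per instance.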
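
The identity rule can be sketched as follows. This is only an illustration under stated assumptions: Alertmanager itself compares a hash (fingerprint) of the label set rather than a string, and the identity helper and the instance values host-a:8080 / host-b:8080 are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// identity builds a canonical string from a label set by sorting the keys,
// so two alerts share an identity only if all keys and values agree.
func identity(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return strings.Join(parts, ",")
}

func main() {
	// One rule (goroutines_gt_70) firing on two instances gives two alerts.
	a := map[string]string{"alertname": "goroutines_gt_70", "instance": "host-a:8080", "status": "yellow"}
	b := map[string]string{"alertname": "goroutines_gt_70", "instance": "host-b:8080", "status": "yellow"}
	fmt.Println(identity(a))
	fmt.Println(identity(b))
	fmt.Println("same alert:", identity(a) == identity(b))
}
```

Annotations are deliberately left out of the identity: as the struct comment says, they do not define what an alert is.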

Group

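Alertmanager groups Alerts according to the group_by setting. With the rules below, three alerts arrive when go_goroutines equals 4, and Alertmanager splits them into two groups before notifying the receivers. This count assumes group_by names a label on which two of the alerts agree, such as label1; with group_by: ['alertname'] each alert would end up in its own group.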

ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}
ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}
ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}
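
A small sketch of the grouping, assuming a single monitored instance and a group key built only from the group_by label values. The groupKey and groupAlerts helpers are hypothetical and not part of Alertmanager; the label values are copied from test1, test2 and test3 above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey joins the values of the group_by labels into one key.
func groupKey(labels map[string]string, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+labels[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alerts by group key and returns the alert names
// that landed in each group.
func groupAlerts(alerts []map[string]string, groupBy []string) map[string][]string {
	groups := make(map[string][]string)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a["alertname"])
	}
	return groups
}

// exampleAlerts mirrors the labels of test1, test2 and test3.
func exampleAlerts() []map[string]string {
	return []map[string]string{
		{"alertname": "test1", "label1": "l1", "label2": "l2", "status": "test"},
		{"alertname": "test2", "label1": "l2", "label2": "l2", "status": "test"},
		{"alertname": "test3", "label1": "l2", "label2": "l1", "status": "test"},
	}
}

func main() {
	for _, by := range [][]string{{"label1"}, {"alertname"}} {
		groups := groupAlerts(exampleAlerts(), by)
		keys := make([]string, 0, len(groups))
		for k := range groups {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		fmt.Printf("group_by=%v: %d group(s)\n", by, len(groups))
		for _, k := range keys {
			fmt.Printf("  %s -> %v\n", k, groups[k])
		}
	}
}
```

Grouping by label1 puts test2 and test3 together and leaves test1 alone, which gives two groups; grouping by alertname gives three.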

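Main processing flow

  1. On receiving an Alert, determine from its labels which Routes it belongs to (there can be several Routes; a Route holds multiple Groups, and a Group holds multiple Alerts)

  2. Assign the Alert to a Group, creating the Group if it does not exist yet

  3. A new Group waits for the time set by group_wait (more Alerts for the same Group may arrive in the meantime), decides from resolve_timeout whether each Alert is resolved, and then sends the notification

  4. An existing Group waits for the time set by group_interval and checks whether its Alerts are resolved; a notification is sent when more than repeat_interval has passed since the last one or when the Group has changed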

TODO

  • How a restart affects the sending of alerts and notifications

  • Whether Alertmanager can be run as a cluster
