hualinux Advanced prom 1-2.12: Configuring Alertmanager for Prometheus

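Contents

1. Alertmanager configuration overview

2. global (global settings)

3. templates (notification templates)

4. route (alert routing)

4.1 Description

4.2 Example

5. receivers

6. inhibit_rules (inhibition rules)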


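The previous article covered installing Alertmanager. This chapter gives a brief walkthrough of the Alertmanager configuration; you may also want to read the official Prometheus Alertmanager configuration documentation beforehand.

1. Alertmanager configuration overview

The Alertmanager configuration file has the following format: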

global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ smtp_from: <tmpl_string> ]
  [ smtp_smarthost: <string> ]
  [ smtp_hello: <string> | default = "localhost" ]
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <secret> ]
  [ smtp_auth_identity: <string> ]
  [ smtp_auth_secret: <secret> ]
  [ smtp_require_tls: <bool> | default = true ]
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ http_config: <http_config> ]

templates:
  [ - <filepath> ... ]
route: <route>
receivers:
  - <receiver> ...
inhibit_rules:
  [ - <inhibit_rule> ... ]

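As shown, an Alertmanager configuration file is made up of five main sections: global (global settings), templates (notification templates), route (alert routing), receivers, and inhibit_rules (inhibition rules).

2. global (global settings)

The global section holds settings shared across the whole configuration file: every option set here serves as the default for the other sections and can be overridden by them. resolve_timeout is the time after which Alertmanager declares an alert resolved if the alert has not been updated; it defaults to 5 minutes. Its value can affect how soon recovery notifications arrive, so tune it based on experience with your own production environment. If alerts should be delivered by email, set the SMTP server and related notification options here. The corresponding configuration is as follows: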

global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

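·smtp_smarthost: address of the SMTP server (smarthost) used to send mail, including the port.
·smtp_from: the sender address placed in the From header.
·smtp_auth_username: the mailbox user name.
·smtp_auth_password: the mailbox authorization password.

smtp_require_tls controls whether TLS is used: true (the default) requires TLS, and false turns the requirement off.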
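
As a rough illustration of the options listed above, a global section for email delivery might look like the sketch below. The server name, addresses, and password are placeholders invented for this example, not values from a real setup.

```yaml
global:
  resolve_timeout: 5m
  # Hypothetical SMTP server; 587 is the usual port for SMTP over STARTTLS.
  smtp_smarthost: 'smtp.example.org:587'
  # Hypothetical sender address used in the From header.
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager@example.org'
  # Placeholder; use the mailbox authorization password here.
  smtp_auth_password: 'changeme'
  # Default value, shown explicitly for clarity.
  smtp_require_tls: true
```

Because these are global defaults, an individual email receiver only needs to name a recipient unless it must override one of them.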

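For more global options, see the global configuration section of the official Prometheus Alertmanager documentation.

3. templates (notification templates)

Notification templates customize the appearance of alert notifications and the alert data they contain. The templates section lists the paths of existing template files (the last path component may be a wildcard), for example: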

templates:
- '/disk1/alertData/alertmanager/template/*.tmpl'

Alertmanager loads the template files under this path at startup, so you can define your own notification templates.
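
To show how a loaded template is typically referenced, the sketch below points an email receiver at a named template. The receiver name, the address, and the template name custom_email.html are all hypothetical; the template itself is assumed to be defined in one of the loaded .tmpl files.

```yaml
templates:
  - '/disk1/alertData/alertmanager/template/*.tmpl'

receivers:
  - name: 'mail-team'                  # hypothetical receiver name
    email_configs:
      - to: 'team@example.org'         # hypothetical recipient
        # "custom_email.html" is a hypothetical template name. It has to be
        # defined in one of the loaded files, in the form
        # {{ define "custom_email.html" }} ... {{ end }}.
        html: '{{ template "custom_email.html" . }}'
```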

 

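4. route (alert routing)

4.1 Description

The route section describes the rules by which alerts received from the Prometheus server are sent to the destination specified by a receiver. Alertmanager matches each incoming alert against the defined rules. Taken together, the routes form a tree: the top-level route is the root node and the routes nested under it are child nodes. Every alert enters the tree at the root and is matched depth-first, left to right, stopping at the node it matches. If an alert matches none of a node's children, it is handled according to that node's own settings. The options of a route are as follows: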

[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

# Zero or more child routes.
routes:
  [ - <route> ... ]

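The options above are explained below:

·receiver: name of the receiver used to send the alert.
·group_by: labels to group by. Alerts that share the same values for the label names listed in group_by are merged into a single notification to the receiver, which is how alert grouping is implemented.
·continue: if false (the default), matching stops at the first node the alert matches and later sibling nodes are ignored; if true, the alert goes on to be matched against subsequent sibling nodes.
·match: exact string matching; the alert must carry the label labelname with a value equal to labelvalue.
·match_re: regular-expression matching; the value of the alert's label must match the given regular expression.
·group_wait: how long to wait between receiving the first alert of a group and sending the notification. Alerts that arrive for the same group during this wait are merged into one notification. Default: 30 seconds.
·group_interval: how long to wait before notifying about new alerts added to a group that has already been notified. Default: 5 minutes.
·repeat_interval: how long to wait before sending an identical alert again after it has been sent successfully. Default: 4 hours.
·routes: child route nodes for further matching.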
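
The continue option is the easiest one to misread, so here is a minimal sketch of it. The receiver names and label values are hypothetical, and a complete configuration would also have to define these receivers.

```yaml
route:
  receiver: 'default-receiver'       # hypothetical fallback receiver
  group_by: ['alertname']
  routes:
    # Every critical alert is sent to the on-call pager ...
    - match:
        severity: critical
      receiver: 'oncall-pager'
      continue: true                 # ... and is then tested against the next sibling too
    # ... so a critical alert for mysql or redis also reaches the database team.
    - match_re:
        service: mysql|redis
      receiver: 'database-pager'
```

With continue left at its default of false, a critical mysql alert would stop at the first child route and never reach database-pager.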

4.2 Example

route:
  receiver: 'admin-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - match:
        team: developers
      group_by: [product, environment]
      receiver: 'developer-pager'
    - match_re:
        service: mysql|redis
      receiver: 'database-pager'

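In this routing configuration, every alert goes to the administrator receiver admin-receiver by default, and the root route groups alerts by cluster and alertname. In the first child route, alerts whose team label equals developers are grouped by product and environment before the notification is sent, which helps developers locate the fault quickly. Regular expressions can be used as well: alerts carrying a service label whose value matches mysql or redis trigger a notification to the database administrators' receiver database-pager.

Because Alertmanager routing is a tree, a configuration file with many child routes is painful to manage. To help with this, Prometheus publishes a browser-based tool called the Routing tree editor, which can be used to edit and pre-check the routes in a configuration file.

Open https://prometheus.io/webtools/alerting/routing-tree-editor/ in a browser, copy and paste the contents of the Alertmanager configuration file into the editor, and click the "Draw Routing Tree" button to see the routing structure. The route above serves as the demonstration input.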

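5. receivers

Receiver is a general term: each receiver needs a globally unique name and maps to one or more notification channels, including email, WeChat, PagerDuty, HipChat, and Webhook. The receiver options currently provided officially are: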

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:                 # officially recommended for custom notification integrations
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:                  # WeChat notifications are supported
  [ - <wechat_config>, ... ]

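Alertmanager thus offers many kinds of receivers, and the official recommendation is to implement custom notification integrations through the webhook receiver, which leaves room for user-specific handling. Details are in the official documentation at https://prometheus.io/docs/alerting/configuration/#receiver.

6. inhibit_rules (inhibition rules)

The inhibit_rules section implements alert inhibition: it specifies alerts to ignore under particular conditions. This lets you set priorities, for example handling certain alerts first and ignoring the other alerts in the same group when they fire at the same time. Well-chosen inhibition rules cut down on "junk" alerts. A single inhibit_rule is configured as follows: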
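
As a small sketch of one receiver with more than one notification channel, the fragment below combines email and a webhook. The receiver name and email address are hypothetical, and the webhook URL simply reuses a local address as a stand-in for a custom integration endpoint.

```yaml
receivers:
  - name: 'ops-team'                   # hypothetical; must be unique within the file
    email_configs:
      - to: 'ops@example.org'          # hypothetical recipient
        send_resolved: true            # also send a notification when the alert resolves
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'  # stand-in endpoint for a custom integration
        send_resolved: true
```

A route selects this receiver by setting receiver: 'ops-team'; both channels are then notified for every alert group routed there.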

# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]

Example

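inhibit_rules:                     # inhibition rules
  - source_match:                  # while an alert matching the source labels is firing, mute alerts matching the target labels
      alertname: 'TORouterDown'
    target_match_re:
      alertname: '.*Unreachable'   # regex match on the target label value, e.g. RedisUnreachable
    equal: ['dc', 'rack']          # source and target must have equal values for these labels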

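In the example above, while a source alert with alertname: 'TORouterDown' is firing, target alerts whose alertname matches the regular expression '.*Unreachable' are inhibited and not sent, provided the target and source alerts have identical values for the labels listed in equal. Avoid overlap between source_match and target_match, otherwise the rules become hard to understand and maintain. Use this feature with care: with symptom-based alerting, dependency chains between alerts are rarely needed. Inhibition rules are best reserved for large-scale failures such as a data center outage.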

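[root@vm82 alertmanager]# cat alertmanager.yml
global:                                       # global settings
  resolve_timeout: 5m                         # time after which an alert is declared resolved; default is 5 minutes

route:                                        # routing section
  group_by: ['alertname']                     # alert grouping
  group_wait: 10s                             # alerts of the same group received within 10s are sent in one notification
  group_interval: 10s                         # interval between notifications for the same group
  repeat_interval: 1h                         # interval before an identical alert is sent again
  receiver: 'web.hook'                        # use the receiver named web.hook
receivers:                                    # receivers section
- name: 'web.hook'                            # receiver name is web.hook
  webhook_configs:                            # webhook address
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:                                # alert inhibition section
  - source_match:
      severity: 'critical'                    # while an alert matching the source labels is firing, mute alerts matching the target labels
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']   # source and target must have equal values for these labels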
