Prometheus (5): Alertmanager Configuration and Prometheus Configuration Notes

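1 Overview

Alerting with Prometheus is separated into two parts.

Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts,
including:

  1. silencing,
  2. inhibition,
  3. aggregation,
  4. and sending out notifications via methods such as email, PagerDuty and HipChat.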

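2 Alertmanager

2.1 Grouping

Grouping categorizes alerts of a similar nature into a single notification.

This is especially useful during larger outages, when many systems fail at once and hundreds to thousands of alerts may fire simultaneously.

Example:

  1. Dozens or hundreds of instances of a service are running in a cluster when a network partition occurs.
  2. Half of the service instances can no longer reach the database. Alerting rules in Prometheus were configured to send an alert for each service instance that cannot communicate with the database, so hundreds of alerts are sent to Alertmanager.
  3. As a user, you only want to receive a single page, while still being able to see exactly which service instances were affected. Without grouping, the same information arrives as many scattered notifications.
  4. Alertmanager can therefore be configured to group alerts by their cluster and alertname labels, so that it sends a single compact notification.

How to configure:

  • Grouping of alerts, timing of the grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.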

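2.2 Inhibition

Inhibition is the concept of suppressing notifications for certain alerts if certain other alerts are already firing.

Example:

  1. An alert is firing that reports an entire cluster as unreachable. Alertmanager can be configured to mute all other alerts concerning that cluster while it fires.
  2. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.

How to configure:

  • Inhibitions are configured through the Alertmanager configuration file.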

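2.3 Silencing

Silences are a straightforward way to mute alerts for a given time.

A silence is configured based on matchers, just like the routing tree.

  1. Incoming alerts are checked against the equality or regular-expression matchers of every active silence.
  2. If an alert matches, no notifications are sent out for it.

How to configure:

  • Silences are configured in the web interface of Alertmanager.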

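2.4 Client behavior

Alertmanager has special requirements for the behavior of its clients. These are only relevant for advanced use cases in which Prometheus is not used to send the alerts.

2.5 High Availability

Alertmanager supports configuration to create a cluster for high availability. This is configured with the --cluster-* flags.

It is important not to load balance traffic between Prometheus and its Alertmanagers, but instead to point Prometheus at a list of all Alertmanagers.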
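
As a rough illustration of pointing Prometheus at every Alertmanager instead of a load balancer, the Prometheus side could look like the sketch below. The three host names are hypothetical placeholders, and 9093 is assumed to be the Alertmanager listen port; adjust both to the actual deployment.

```yaml
# prometheus.yml (fragment): a sketch, not a complete configuration.
alerting:
  alertmanagers:
    - static_configs:
        # List every cluster member explicitly (hypothetical hosts).
        # Do not replace this list with a single load-balanced address.
        - targets:
            - 'alertmanager-1.example.org:9093'
            - 'alertmanager-2.example.org:9093'
            - 'alertmanager-3.example.org:9093'
```

With this layout Prometheus sends each alert to all listed instances, and the Alertmanager cluster is left to deduplicate the resulting notifications.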

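3 Configuration

Alertmanager is configured via command-line flags and a configuration file.

  1. The command-line flags configure immutable system parameters.

    To view all available flags, run alertmanager -h

  2. The configuration file defines inhibition rules, notification routing and notification receivers.

The visual editor can assist in building routing trees.

Alertmanager can reload its configuration at runtime.

  1. If the new configuration is not well-formed, the changes are not applied and an error is logged.
  2. A configuration reload is triggered by sending a SIGHUP signal to the process or by sending an HTTP POST request to the /-/reload endpoint.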

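3.1 Configuration file

Use the --config.file flag to specify which configuration file to load:

./alertmanager --config.file=simple.yml

The file is written in YAML format. In the schema below, brackets indicate that a parameter is optional. For non-list parameters, the value is set to the specified default.

  1. <duration>: a duration made of one or more [0-9]+(ms|[smhdwy]) components in descending unit order, matching the regular expression
    ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)
    for example: 1d, 1h30m, 5m, 10s

  2. <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*

  3. <labelvalue>: a string of Unicode characters

  4. <filepath>: a valid path in the current working directory

  5. <boolean>: a boolean that can take the values true or false

  6. <string>: a regular string

  7. <secret>: a regular string that is a secret, such as a password

  8. <tmpl_string>: a string that is template-expanded before usage

  9. <tmpl_secret>: a string that is template-expanded before usage and that is a secret

  10. <int>: an integer value

The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections.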

global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]
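
To make the schema above more concrete, here is a minimal sketch of a complete configuration file that uses only a few of the top-level sections. The SMTP host, the addresses and the receiver name are hypothetical placeholders, and SMTP authentication is left out on the assumption that the relay does not require it.

```yaml
# alertmanager.yml: a minimal sketch with placeholder values.
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'   # hypothetical relay
  smtp_from: 'alertmanager@example.org'    # hypothetical sender

templates:
  - 'templates/*.tmpl'

# The root route: it has no matchers, so it matches every alert.
route:
  receiver: 'team-email'
  group_by: ['cluster', 'alertname']

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'oncall@example.org'           # hypothetical recipient
```

Everything not set here, such as group_wait or repeat_interval, falls back to the defaults listed in the schema.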

3.2 <route>

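A route block defines a node in the routing tree and its children. Its optional configuration parameters are inherited from its parent node if not set.

Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it must not have any matchers configured). The alert then traverses the child nodes.

  1. If continue is set to false, traversal stops after the first matching child.
  2. If continue is set to true on a matching node, the alert continues matching against subsequent sibling nodes.
  3. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node.

[ receiver: <string> ]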
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# DEPRECATED: Use matchers below.
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use matchers below.
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that an alert has to fulfill to match the node. 
matchers:
  [ - <matcher> ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section. 
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  [ - <string> ...]

# Zero or more child routes.
routes:
  [ - <route> ... ]

Example

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    matchers:
    - team="frontend"
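
The effect of continue and of nested child routes may be easier to see in a second sketch. The receiver names 'audit-log' and 'database-critical-pager' are hypothetical, and the severity label is an assumed convention of the alerting rules, not something Alertmanager defines.

```yaml
route:
  receiver: 'default-receiver'
  routes:
    # No matchers: this child matches every alert. Because continue is
    # true, the alert is also tested against the following siblings.
    - receiver: 'audit-log'
      continue: true
    # Database alerts additionally reach this node.
    - receiver: 'database-pager'
      matchers:
        - service=~"mysql|cassandra"
      routes:
        # Critical database alerts go to the nested child. Other database
        # alerts match no child and are handled by 'database-pager'.
        - receiver: 'database-critical-pager'
          matchers:
            - severity="critical"
```

In this sketch an alert with service="mysql" and severity="critical" would be delivered to both 'audit-log' and 'database-critical-pager'. An alert that matches only the first child would be delivered to 'audit-log' alone, since it matched a child of the root node.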

3.3 <mute_time_interval>

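A mute_time_interval specifies a named interval of time that can be referenced in the routing tree to mute particular routes for particular times of the day.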

name: <string>
time_intervals:
  [ - <time_interval> ... ]

3.4 <time_interval>

A time_interval contains the actual definition of an interval of time. The syntax supports the following fields:

- times:
  [ - <time_range> ...]
  weekdays:
  [ - <weekday_range> ...]
  days_of_month:
  [ - <days_of_month_range> ...]
  months:
  [ - <month_range> ...]
  years:
  [ - <year_range> ...]

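All fields are lists.
Within each non-empty list, at least one element must be satisfied for the field to match.
If a field is left unspecified, any value matches it. For an instant of time to match a complete time interval, all fields must match.
All definitions are in UTC; other time zones are not currently supported.

3.4.1 time_range

Ranges are inclusive of the start time and exclusive of the end time, which makes it easy to represent times that start or end on hour boundaries.

For example, start_time: '17:00' and end_time: '24:00' begins at 17:00 and finishes immediately before 24:00.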

    times:
    - start_time: HH:MM
      end_time: HH:MM

3.4.2 days_of_month_range

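A list of numerical days in the month. Days begin at 1. Negative values are also accepted; they count from the end of the month.

For example:

  1. -1 during January represents January 31.
  2. ['1:5', '-3:-1']. A range that extends past the start or end of the month is clamped.
  3. ['1:31'] during February has its actual end date clamped to 28 or 29, depending on leap years. Ranges are inclusive on both ends.

3.4.3 month_range

A list of calendar months, identified by case-insensitive name (e.g. 'January') or by number,
where:

  1. January = 1.

Ranges are also accepted.

For example:

  1. ['1:3', 'may:august', 'december']. Inclusive on both ends.

3.4.4 year_range

A numerical list of years. Ranges are accepted.

For example:

  1. ['2020:2022', '2030']. Inclusive on both ends.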
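
Putting the range types together, a mute time interval and a route that references it might be sketched as follows. The interval name, the receiver names and the team label are hypothetical. The weekday range syntax ('monday:friday', 'saturday') is not described in this text and is assumed here to follow the same name-and-range pattern as months.

```yaml
mute_time_intervals:
  - name: 'evenings-and-weekends'
    time_intervals:
      # Weekday evenings: 17:00 up to, but not including, 24:00 (UTC).
      - times:
          - start_time: '17:00'
            end_time: '24:00'
        weekdays: ['monday:friday']
      # Whole weekends.
      - weekdays: ['saturday', 'sunday']
      # The last three days of December, in any year.
      - days_of_month: ['-3:-1']
        months: ['december']

route:
  receiver: 'default-receiver'
  routes:
    # Only child routes may be muted; the root node cannot be.
    - receiver: 'frontend-pager'
      matchers:
        - team="frontend"
      mute_time_intervals:
        - 'evenings-and-weekends'
```

Within one time_interval every specified field has to match. The three list entries are separate intervals, and the sketch assumes that any one of them matching is enough to mute the route.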

3.5 <inhibit_rule>

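An inhibition rule mutes an alert (the target) matching one set of matchers while an alert (the source) exists that matches another set of matchers.

Both the target and the source alert must have the same label values for the label names listed in equal.

Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in equal are missing from both the source and the target alert, the inhibition rule applies.

To prevent an alert from inhibiting itself, an alert that matches both the target and the source side of a rule cannot be inhibited by alerts for which the same is true (including itself).

It is recommended to choose target and source matchers so that alerts never match both sides. This is much easier to reason about and does not trigger this special case.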

# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target 
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have 
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
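
A common way to apply this, shown here only as a sketch, is to let a critical alert suppress the matching warning. The severity, cluster and service labels are assumed conventions of the alerting rules, not labels that Alertmanager itself defines.

```yaml
inhibit_rules:
  # While a critical alert is firing, mute warning alerts that share
  # the same alertname, cluster and service label values.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'service']
```

The two matcher sets are disjoint, so no alert can match both the source and the target side, which avoids the self-inhibition special case described above.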

3.6 <http_config>

An http_config configures the HTTP client that a receiver uses to communicate with HTTP-based API services.

# Note that `basic_auth` and `authorization` options are mutually exclusive.

# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Optional the `Authorization` header configuration.
authorization:
  # Sets the authentication type.
  [ type: <string> | default = Bearer ]
  # Sets the credentials. It is mutually exclusive with
  # `credentials_file`.
  [ credentials: <secret> ]
  # Sets the credentials with the credentials read from the configured file.
  # It is mutually exclusive with `credentials`.
  [ credentials_file: <filename> ]

# Optional OAuth 2.0 configuration.
# Cannot be used at the same time as basic_auth or authorization.
oauth2:
  [ <oauth2> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]

# Configures the TLS settings.
tls_config:
  [ <tls_config> ]

3.6.1 oauth2

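OAuth 2.0 authentication using the client credentials grant type.

Alertmanager fetches an access token from the specified endpoint with the given client ID and client secret.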

client_id: <string>
[ client_secret: <secret> ]

# Read the client secret from a file.
# It is mutually exclusive with `client_secret`.
[ client_secret_file: <filename> ]

# Scopes for the token request.
scopes:
  [ - <string> ... ]

# The URL to fetch the token from.
token_url: <string>

# Optional parameters to append to the token URL.
endpoint_params:
  [ <string>: <string> ... ]
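
As a sketch of how these fields fit together inside an http_config block, consider the fragment below. The client ID, file path, scope, URLs and the audience parameter are all hypothetical values for an imagined identity provider.

```yaml
http_config:
  oauth2:
    client_id: 'alertmanager'
    # Read the secret from a file; do not set client_secret as well.
    client_secret_file: '/etc/alertmanager/oauth2-client-secret'
    scopes:
      - 'alerts.write'
    token_url: 'https://auth.example.org/oauth2/token'
    # Extra parameters appended to the token request (assumed to be
    # required by the hypothetical provider).
    endpoint_params:
      audience: 'https://hooks.example.org'
```

Because oauth2 cannot be combined with basic_auth or authorization, the fragment sets none of those.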

3.6.2 <tls_config>

Allows configuring TLS connections.

# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]
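
The following sketch shows where an http_config and its tls_config would sit in a receiver. It assumes a webhook integration whose url field follows the upstream webhook_config schema, which is not reproduced in this text; the receiver name, URL and all file paths are hypothetical.

```yaml
receivers:
  - name: 'internal-webhook'
    webhook_configs:
      - url: 'https://hooks.example.org/alertmanager'
        http_config:
          # Bearer token read from a file (mutually exclusive with basic_auth).
          authorization:
            type: Bearer
            credentials_file: '/etc/alertmanager/webhook-token'
          tls_config:
            # Private CA used to validate the server certificate.
            ca_file: '/etc/alertmanager/tls/ca.crt'
            # Client certificate and key for mutual TLS.
            cert_file: '/etc/alertmanager/tls/client.crt'
            key_file: '/etc/alertmanager/tls/client.key'
            server_name: 'hooks.example.org'
```

insecure_skip_verify is left at its default of false so that the server certificate is still validated against the CA.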

3.7 <receiver>

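A receiver is a named configuration of one or more notification integrations.

Note: As part of lifting the past moratorium on new receivers, it was agreed that, in addition to the existing requirements, new notification integrations must have a committed maintainer with push access.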

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]

3.7.1 <email_config>

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
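
For illustration, a receiver that overrides some of the global SMTP defaults could be sketched like this. The addresses, host, user name and password placeholder are hypothetical, and the Subject template is only one possible choice built from the standard notification template data.

```yaml
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'oncall@example.org'
        from: 'alertmanager@example.org'
        smarthost: 'smtp.example.org:587'
        auth_username: 'alertmanager'
        auth_password: '<smtp-password>'   # placeholder, not a real secret
        require_tls: true
        # Also send a notification when the alerts resolve.
        send_resolved: true
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
```

Any field omitted here, such as hello or html, falls back to the corresponding global value or to the default shown in the schema.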

3.8 Other

For everything else, see the official documentation:
https://prometheus.io/docs/alerting/latest/configuration/
