Note: Alertmanager alert notifications not being sent

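This post describes how, after moving a monitoring setup from a local environment to production, Prometheus alert emails stopped arriving. The cause turned out to be unsynchronized server clocks, which led Alertmanager to mark incoming alerts as resolved. Aligning the clocks fixed alert delivery; lengthening resolve_timeout did not help.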

Problem description

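I had already set up Prometheus + Alertmanager monitoring and alerting locally, with email notifications configured. Testing went fine and the alert emails arrived. After deploying to production, however, no emails were received.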

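Production servers:

Datacenter 1 and Datacenter 2 are on different network segments. Applications on the two networks can only reach each other once an access policy has been opened, and that policy was already in place.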

Environment     IP                    Service
Datacenter 1    172.19.57.28:9090     Prometheus
Datacenter 2    172.20.134.67:8025    Alertmanager

In theory this deployment should work. Prometheus can also monitor Alertmanager's status, which shows that the access policy is working.

Troubleshooting

  1. Call the add-alert API manually with curl
curl -X POST -H "Content-Type: application/json" -d '[{"labels":{"alertname":"系统连续崩溃,已经出现雪崩状况!","dev":"sda1","instance":"实例1","msgtype":"testing"}}]' http://172.17.255.142:8025/api/v2/alerts

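The alert email arrived.

This shows that Alertmanager itself and the email configuration are fine, so the problem is probably in the step where Prometheus sends alerts to Alertmanager.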

  2. Capture packets on the Alertmanager server with Wireshark

Install Wireshark:

yum -y install wireshark

Capture packets whose URL contains /api/v2/alerts:

tshark -i eth0 -Y "ip.dst == 172.20.134.67 and tcp.dstport == 8025 and http.request.uri contains \"/api/v2/alerts\"" -V

Packet output:

Frame 1130: 607 bytes on wire (4856 bits), 607 bytes captured (4856 bits) on interface 0
    Interface id: 0
    Encapsulation type: Ethernet (1)
    Arrival Time: Jan 23, 2024 11:40:36.058137187 CST
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1705981236.058137187 seconds
    [Time delta from previous captured frame: 0.245109469 seconds]
    [Time delta from previous displayed frame: 0.000000000 seconds]
    [Time since reference or first frame: 45.302038112 seconds]
    Frame Number: 1130
    Frame Length: 607 bytes (4856 bits)
    Capture Length: 607 bytes (4856 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ip:tcp:http:json]
Ethernet II, Src: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06), Dst: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
    Destination: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
        Address: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
        Address: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: IP (0x0800)
Internet Protocol Version 4, Src: 172.18.1.130 (172.18.1.130), Dst: 172.20.134.67 (172.20.134.67)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
    Total Length: 593
    Identification: 0xefeb (61419)
    Flags: 0x02 (Don't Fragment)
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    Fragment offset: 0
    Time to live: 56
    Protocol: TCP (6)
    Header checksum: 0x70cf [validation disabled]
        [Good: False]
        [Bad: False]
    Source: 172.18.1.130 (172.18.1.130)
    Destination: 172.20.134.67 (172.20.134.67)
Transmission Control Protocol, Src Port: 56798 (56798), Dst Port: ca-audit-da (8025), Seq: 1, Ack: 2, Len: 541
    Source port: 56798 (56798)
    Destination port: ca-audit-da (8025)
    [Stream index: 10]
    Sequence number: 1    (relative sequence number)
    [Next sequence number: 542    (relative sequence number)]
    Acknowledgment number: 2    (relative ack number)
    Header length: 32 bytes
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
    Window size value: 229
    [Calculated window size: 229]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x2ef0 [validation disabled]
        [Good Checksum: False]
        [Bad Checksum: False]
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        No-Operation (NOP)
            Type: 1
                0... .... = Copy on fragmentation: No
                .00. .... = Class: Control (0)
                ...0 0001 = Number: No-Operation (NOP) (1)
        No-Operation (NOP)
            Type: 1
                0... .... = Copy on fragmentation: No
                .00. .... = Class: Control (0)
                ...0 0001 = Number: No-Operation (NOP) (1)
        Timestamps: TSval 2167850496, TSecr 4192078536
            Kind: Timestamp (8)
            Length: 10
            Timestamp value: 2167850496
            Timestamp echo reply: 4192078536
    [SEQ/ACK analysis]
        [Bytes in flight: 541]
Hypertext Transfer Protocol
    POST /api/v2/alerts HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): POST /api/v2/alerts HTTP/1.1\r\n]
            [Message: POST /api/v2/alerts HTTP/1.1\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: POST
        Request URI: /api/v2/alerts
        Request Version: HTTP/1.1
    Host: 172.17.255.142:8025\r\n
    User-Agent: Prometheus/2.45.0\r\n
    Content-Length: 398\r\n
        [Content length: 398]
    Content-Type: application/json\r\n
    \r\n
    [Full request URI: http://172.17.255.142:8025/api/v2/alerts]
    [HTTP request 1/1]
JavaScript Object Notation: application/json
    Array
        Object
            Member Key: "annotations"
                Object
                    Member Key: "description"
                        String value: Mysql数据库宕机,请检查
                    Member Key: "summary"
                        String value: 您的 172.19.57.19:9105 的Mysql已停止运行!
            Member Key: "endsAt"
                String value: 2024-01-23T03:39:32.337Z
            Member Key: "startsAt"
                String value: 2024-01-23T02:49:17.337Z
            Member Key: "generatorURL"
                String value: http://zonghe31:9090/graph?g0.expr=mysql_up+%3D%3D+0&g0.tab=1
            Member Key: "labels"
                Object
                    Member Key: "alertname"
                        String value: Mysql status
                    Member Key: "instance"
                        String value: 172.19.57.19:9105
                    Member Key: "job"
                        String value: mysql8_node
                    Member Key: "severity"
                        String value: error


Frame 1507: 570 bytes on wire (4560 bits), 570 bytes captured (4560 bits) on interface 0
    Interface id: 0
    Encapsulation type: Ethernet (1)
    Arrival Time: Jan 23, 2024 11:40:54.797106494 CST
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1705981254.797106494 seconds
    [Time delta from previous captured frame: 0.132284603 seconds]
    [Time delta from previous displayed frame: 18.738969307 seconds]
    [Time since reference or first frame: 64.041007419 seconds]
    Frame Number: 1507
    Frame Length: 570 bytes (4560 bits)
    Capture Length: 570 bytes (4560 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ip:tcp:http:json]
Ethernet II, Src: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06), Dst: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
    Destination: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
        Address: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
        Address: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: IP (0x0800)
Internet Protocol Version 4, Src: 172.18.1.130 (172.18.1.130), Dst: 172.20.134.67 (172.20.134.67)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
    Total Length: 556
    Identification: 0xefef (61423)
    Flags: 0x02 (Don't Fragment)
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    Fragment offset: 0
    Time to live: 56
    Protocol: TCP (6)
    Header checksum: 0x70f0 [validation disabled]
        [Good: False]
        [Bad: False]
    Source: 172.18.1.130 (172.18.1.130)
    Destination: 172.20.134.67 (172.20.134.67)
Transmission Control Protocol, Src Port: 56798 (56798), Dst Port: ca-audit-da (8025), Seq: 542, Ack: 116, Len: 504
    Source port: 56798 (56798)
    Destination port: ca-audit-da (8025)
    [Stream index: 10]
    Sequence number: 542    (relative sequence number)
    [Next sequence number: 1046    (relative sequence number)]
    Acknowledgment number: 116    (relative ack number)
    Header length: 32 bytes
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
    Window size value: 229
    [Calculated window size: 229]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0xfb7d [validation disabled]
        [Good Checksum: False]
        [Bad Checksum: False]
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        No-Operation (NOP)
            Type: 1
                0... .... = Copy on fragmentation: No
                .00. .... = Class: Control (0)
                ...0 0001 = Number: No-Operation (NOP) (1)
        No-Operation (NOP)
            Type: 1
                0... .... = Copy on fragmentation: No
                .00. .... = Class: Control (0)
                ...0 0001 = Number: No-Operation (NOP) (1)
        Timestamps: TSval 2167869235, TSecr 4192104776
            Kind: Timestamp (8)
            Length: 10
            Timestamp value: 2167869235
            Timestamp echo reply: 4192104776
    [SEQ/ACK analysis]
        [Bytes in flight: 504]
Hypertext Transfer Protocol
    POST /api/v2/alerts HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): POST /api/v2/alerts HTTP/1.1\r\n]
            [Message: POST /api/v2/alerts HTTP/1.1\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: POST
        Request URI: /api/v2/alerts
        Request Version: HTTP/1.1
    Host: 172.17.255.142:8025\r\n
    User-Agent: Prometheus/2.45.0\r\n
    Content-Length: 361\r\n
        [Content length: 361]
    Content-Type: application/json\r\n
    \r\n
    [Full request URI: http://172.17.255.142:8025/api/v2/alerts]
    [HTTP request 2/2]
    [Prev request in frame: 1130]
JavaScript Object Notation: application/json
    Array
        Object
            Member Key: "annotations"
                Object
                    Member Key: "description"
                        String value: 节点断联已超过1分钟!
                    Member Key: "summary"
                        String value: 节点失联
            Member Key: "endsAt"
                String value: 2024-01-23T03:39:51.075Z
            Member Key: "startsAt"
                String value: 2024-01-23T02:49:36.075Z
            Member Key: "generatorURL"
                String value: http://zonghe31:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1
            Member Key: "labels"
                Object
                    Member Key: "alertname"
                        String value: 实例存活告警
                    Member Key: "instance"
                        String value: 172.19.57.28:9100
                    Member Key: "job"
                        String value: server_node
                    Member Key: "severity"
                        String value: Disaster

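Starting at line 106 of the first frame's output (the "JavaScript Object Notation: application/json" section), the alert payload is visible, so the data Prometheus sends is fine as well.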
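One detail in the capture can be checked by hand: whether an alert's endsAt is still in the future at the moment the request arrives. The sketch below does this for Frame 1130, using the "Epoch Time" and endsAt values copied from the output above. It assumes a date command that understands -d (GNU date, for example); the helper name to_epoch is made up for this illustration.

```shell
#!/bin/sh
# Compare an alert's endsAt with the time the request reached Alertmanager.
# Both values are copied from Frame 1130 in the capture above.

# to_epoch (hypothetical helper): convert an RFC 3339 UTC timestamp such as
# 2024-01-23T03:39:32.337Z to whole epoch seconds.
to_epoch() {
  _ts="${1%Z}"                                # drop the trailing Z
  _ts="${_ts%%.*}"                            # drop fractional seconds
  _ts=$(printf '%s' "$_ts" | tr 'T' ' ')      # "date time" form for date -d
  date -u -d "$_ts" +%s
}

arrival_epoch=1705981236                      # "Epoch Time" of Frame 1130
ends_at_epoch=$(to_epoch "2024-01-23T03:39:32.337Z")

# A positive value means endsAt had already passed, by the capturing
# host's clock, when the alert arrived.
seconds_past_ends_at=$(( arrival_epoch - ends_at_epoch ))
echo "endsAt is ${seconds_past_ends_at}s before arrival"
```

For this frame the difference is positive: by the clock of the host that captured the packet, the alert had already expired when it was received.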

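  3. Switch the log level to debug (for Alertmanager, the --log.level=debug startup flag) and check the logs

The checks above show that Prometheus and Alertmanager are both running normally, so where is the problem?

The logs show that every alert Alertmanager received had been marked as resolved.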

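Normally the state should be active. Alertmanager has a resolve_timeout setting (default 5 minutes): an alert that carries no endsAt and is not received again within that time is marked resolved, and no further alert notifications are sent for it. Alerts from Prometheus do carry an endsAt, as the capture above shows, and Alertmanager treats an alert whose endsAt is already in the past as resolved.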
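The capture above shows that Prometheus sends an explicit endsAt with each alert, so one way to see how that timestamp affects the alert state, without involving Prometheus, is to post a test alert with a chosen endsAt through the same API used in step 1. The sketch below only builds the payload. The helper build_alert and the label values are placeholders invented for this example, and it assumes a date command that accepts -d "@epoch".

```shell
#!/bin/sh
# Build a test alert whose endsAt is offset from the local clock.

# build_alert (hypothetical helper)
#   $1: seconds added to "now" to obtain endsAt (negative means in the past)
build_alert() {
  _now=$(date -u +%s)
  _ends=$(( _now + $1 ))
  _starts=$(( _ends - 600 ))                  # keep startsAt before endsAt
  _fmt='+%Y-%m-%dT%H:%M:%SZ'
  _ends_at=$(date -u -d "@${_ends}" "$_fmt")
  _starts_at=$(date -u -d "@${_starts}" "$_fmt")
  printf '[{"labels":{"alertname":"ClockSkewTest","msgtype":"testing"},"startsAt":"%s","endsAt":"%s"}]\n' \
    "$_starts_at" "$_ends_at"
}

# endsAt five minutes ahead: expected to be treated as firing.
build_alert 300
# endsAt five minutes behind: expected to be treated as resolved.
build_alert -300

# To send a payload, pipe it to curl, for example:
#   build_alert -300 | curl -X POST -H "Content-Type: application/json" \
#     -d @- http://172.17.255.142:8025/api/v2/alerts
```

Comparing how Alertmanager logs the two alerts at debug level should show whether a past endsAt alone is enough to produce the resolved state.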

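That suggested the clocks on the two servers might differ. Running timedatectl on each one showed that they were 6 minutes apart. After setting both servers to the same time, the alert emails arrived.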
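Instead of reading timedatectl output on each machine and comparing by eye, the difference can be computed from epoch seconds. This is a minimal sketch: the helper clock_skew and the 60-second threshold are assumptions made for illustration, and a fixed 6-minute offset stands in for the value that would come from the other server.

```shell
#!/bin/sh
# Report the clock difference between two hosts from their epoch seconds.

# clock_skew (hypothetical helper): prints remote minus local, in seconds.
#   $1: local epoch seconds   $2: remote epoch seconds
clock_skew() {
  echo $(( $2 - $1 ))
}

local_epoch=$(date -u +%s)
# On a real setup the second value would be read from the other server, e.g.
#   remote_epoch=$(ssh user@172.20.134.67 date -u +%s)
# Here a fixed 6-minute offset stands in for it.
remote_epoch=$(( local_epoch + 360 ))

skew=$(clock_skew "$local_epoch" "$remote_epoch")
echo "remote clock is ${skew}s ahead of local"

# ${skew#-} strips a leading minus sign, giving the absolute value.
if [ "${skew#-}" -gt 60 ]; then
  echo "clock skew exceeds 60s; alerts may arrive already expired"
fi
```

Once the clocks agree, running a time synchronization service such as chronyd or ntpd on both servers should keep them from drifting apart again.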

Command to set the time:

date -s "yyyy-MM-dd HH:mm:ss"

PS: I tried setting resolve_timeout to a longer value and it made no difference. I still have more to learn here 📖
