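Problem description
I had previously set up Prometheus + Alertmanager monitoring and alerting locally, with email notifications configured. Testing went fine and the alert emails arrived. After deploying to production, however, the emails stopped arriving.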
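Production servers:
Data center 1 and data center 2 are on two different network segments. Applications on the two networks can only reach each other once a network policy has been opened, and that policy had already been opened.
Environment | IP | Service |
Data center 1 | 172.19.57.28:9090 | Prometheus |
Data center 2 | 172.20.134.67:8025 | Alertmanager |
In theory this deployment should work. Prometheus can also monitor Alertmanager's status, which suggests the opened policy is fine.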
Troubleshooting
- Manually call the add-alert API with curl
curl -X POST -H "Content-Type: application/json" -d '[{"labels":{"alertname":"系统连续崩溃,已经出现雪崩状况!","dev":"sda1","instance":"实例1","msgtype":"testing"}}]' http://172.17.255.142:8025/api/v2/alerts
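Result: the alert email arrived.
This shows that Alertmanager itself and the email configuration are fine, so the problem is probably in the step where Prometheus sends alerts to Alertmanager.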
- Capture packets on the Alertmanager server with wireshark
Install wireshark:
yum -y install wireshark
Capture the packets whose URL contains /api/v2/alerts:
tshark -i eth0 -Y "ip.dst == 172.20.134.67 and tcp.dstport == 8025 and http.request.uri contains \"/api/v2/alerts\"" -V
Packet output:
Frame 1130: 607 bytes on wire (4856 bits), 607 bytes captured (4856 bits) on interface 0
Interface id: 0
Encapsulation type: Ethernet (1)
Arrival Time: Jan 23, 2024 11:40:36.058137187 CST
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1705981236.058137187 seconds
[Time delta from previous captured frame: 0.245109469 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
[Time since reference or first frame: 45.302038112 seconds]
Frame Number: 1130
Frame Length: 607 bytes (4856 bits)
Capture Length: 607 bytes (4856 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ip:tcp:http:json]
Ethernet II, Src: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06), Dst: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
Destination: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
Address: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Source: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
Address: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Type: IP (0x0800)
Internet Protocol Version 4, Src: 172.18.1.130 (172.18.1.130), Dst: 172.20.134.67 (172.20.134.67)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
Total Length: 593
Identification: 0xefeb (61419)
Flags: 0x02 (Don't Fragment)
0... .... = Reserved bit: Not set
.1.. .... = Don't fragment: Set
..0. .... = More fragments: Not set
Fragment offset: 0
Time to live: 56
Protocol: TCP (6)
Header checksum: 0x70cf [validation disabled]
[Good: False]
[Bad: False]
Source: 172.18.1.130 (172.18.1.130)
Destination: 172.20.134.67 (172.20.134.67)
Transmission Control Protocol, Src Port: 56798 (56798), Dst Port: ca-audit-da (8025), Seq: 1, Ack: 2, Len: 541
Source port: 56798 (56798)
Destination port: ca-audit-da (8025)
[Stream index: 10]
Sequence number: 1 (relative sequence number)
[Next sequence number: 542 (relative sequence number)]
Acknowledgment number: 2 (relative ack number)
Header length: 32 bytes
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
Window size value: 229
[Calculated window size: 229]
[Window size scaling factor: -1 (unknown)]
Checksum: 0x2ef0 [validation disabled]
[Good Checksum: False]
[Bad Checksum: False]
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
No-Operation (NOP)
Type: 1
0... .... = Copy on fragmentation: No
.00. .... = Class: Control (0)
...0 0001 = Number: No-Operation (NOP) (1)
No-Operation (NOP)
Type: 1
0... .... = Copy on fragmentation: No
.00. .... = Class: Control (0)
...0 0001 = Number: No-Operation (NOP) (1)
Timestamps: TSval 2167850496, TSecr 4192078536
Kind: Timestamp (8)
Length: 10
Timestamp value: 2167850496
Timestamp echo reply: 4192078536
[SEQ/ACK analysis]
[Bytes in flight: 541]
Hypertext Transfer Protocol
POST /api/v2/alerts HTTP/1.1\r\n
[Expert Info (Chat/Sequence): POST /api/v2/alerts HTTP/1.1\r\n]
[Message: POST /api/v2/alerts HTTP/1.1\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: POST
Request URI: /api/v2/alerts
Request Version: HTTP/1.1
Host: 172.17.255.142:8025\r\n
User-Agent: Prometheus/2.45.0\r\n
Content-Length: 398\r\n
[Content length: 398]
Content-Type: application/json\r\n
\r\n
[Full request URI: http://172.17.255.142:8025/api/v2/alerts]
[HTTP request 1/1]
JavaScript Object Notation: application/json
Array
Object
Member Key: "annotations"
Object
Member Key: "description"
String value: Mysql数据库宕机,请检查
Member Key: "summary"
String value: 您的 172.19.57.19:9105 的Mysql已停止运行!
Member Key: "endsAt"
String value: 2024-01-23T03:39:32.337Z
Member Key: "startsAt"
String value: 2024-01-23T02:49:17.337Z
Member Key: "generatorURL"
String value: http://zonghe31:9090/graph?g0.expr=mysql_up+%3D%3D+0&g0.tab=1
Member Key: "labels"
Object
Member Key: "alertname"
String value: Mysql status
Member Key: "instance"
String value: 172.19.57.19:9105
Member Key: "job"
String value: mysql8_node
Member Key: "severity"
String value: error
Frame 1507: 570 bytes on wire (4560 bits), 570 bytes captured (4560 bits) on interface 0
Interface id: 0
Encapsulation type: Ethernet (1)
Arrival Time: Jan 23, 2024 11:40:54.797106494 CST
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1705981254.797106494 seconds
[Time delta from previous captured frame: 0.132284603 seconds]
[Time delta from previous displayed frame: 18.738969307 seconds]
[Time since reference or first frame: 64.041007419 seconds]
Frame Number: 1507
Frame Length: 570 bytes (4560 bits)
Capture Length: 570 bytes (4560 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ip:tcp:http:json]
Ethernet II, Src: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06), Dst: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
Destination: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
Address: d0:0d:e5:3a:2f:59 (d0:0d:e5:3a:2f:59)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Source: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
Address: c0:ff:a8:25:d2:06 (c0:ff:a8:25:d2:06)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Type: IP (0x0800)
Internet Protocol Version 4, Src: 172.18.1.130 (172.18.1.130), Dst: 172.20.134.67 (172.20.134.67)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
Total Length: 556
Identification: 0xefef (61423)
Flags: 0x02 (Don't Fragment)
0... .... = Reserved bit: Not set
.1.. .... = Don't fragment: Set
..0. .... = More fragments: Not set
Fragment offset: 0
Time to live: 56
Protocol: TCP (6)
Header checksum: 0x70f0 [validation disabled]
[Good: False]
[Bad: False]
Source: 172.18.1.130 (172.18.1.130)
Destination: 172.20.134.67 (172.20.134.67)
Transmission Control Protocol, Src Port: 56798 (56798), Dst Port: ca-audit-da (8025), Seq: 542, Ack: 116, Len: 504
Source port: 56798 (56798)
Destination port: ca-audit-da (8025)
[Stream index: 10]
Sequence number: 542 (relative sequence number)
[Next sequence number: 1046 (relative sequence number)]
Acknowledgment number: 116 (relative ack number)
Header length: 32 bytes
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
Window size value: 229
[Calculated window size: 229]
[Window size scaling factor: -1 (unknown)]
Checksum: 0xfb7d [validation disabled]
[Good Checksum: False]
[Bad Checksum: False]
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
No-Operation (NOP)
Type: 1
0... .... = Copy on fragmentation: No
.00. .... = Class: Control (0)
...0 0001 = Number: No-Operation (NOP) (1)
No-Operation (NOP)
Type: 1
0... .... = Copy on fragmentation: No
.00. .... = Class: Control (0)
...0 0001 = Number: No-Operation (NOP) (1)
Timestamps: TSval 2167869235, TSecr 4192104776
Kind: Timestamp (8)
Length: 10
Timestamp value: 2167869235
Timestamp echo reply: 4192104776
[SEQ/ACK analysis]
[Bytes in flight: 504]
Hypertext Transfer Protocol
POST /api/v2/alerts HTTP/1.1\r\n
[Expert Info (Chat/Sequence): POST /api/v2/alerts HTTP/1.1\r\n]
[Message: POST /api/v2/alerts HTTP/1.1\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: POST
Request URI: /api/v2/alerts
Request Version: HTTP/1.1
Host: 172.17.255.142:8025\r\n
User-Agent: Prometheus/2.45.0\r\n
Content-Length: 361\r\n
[Content length: 361]
Content-Type: application/json\r\n
\r\n
[Full request URI: http://172.17.255.142:8025/api/v2/alerts]
[HTTP request 2/2]
[Prev request in frame: 1130]
JavaScript Object Notation: application/json
Array
Object
Member Key: "annotations"
Object
Member Key: "description"
String value: 节点断联已超过1分钟!
Member Key: "summary"
String value: 节点失联
Member Key: "endsAt"
String value: 2024-01-23T03:39:51.075Z
Member Key: "startsAt"
String value: 2024-01-23T02:49:36.075Z
Member Key: "generatorURL"
String value: http://zonghe31:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1
Member Key: "labels"
Object
Member Key: "alertname"
String value: 实例存活告警
Member Key: "instance"
String value: 172.19.57.28:9100
Member Key: "job"
String value: server_node
Member Key: "severity"
String value: Disaster
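From line 106 of the output onward (the "JavaScript Object Notation: application/json" section), you can see that the alert payload arrived, so what Prometheus sends is fine as well.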
- Set the log level to debug and check the logs
The checks above show that Prometheus and Alertmanager are both running normally, so where is the problem?
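The logs showed that every alert Alertmanager received was being marked as resolved, whereas the normal state should be active. Alertmanager has a resolve_timeout setting, 5 minutes by default: an alert that arrives without its own endsAt and is not received again within that time is marked resolved, and once an alert is resolved no further notifications are sent for it.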
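The requests captured above carry an endsAt field, so one way to see why they would count as resolved on arrival is to compare that field with the receive time on the Alertmanager host (the "Epoch Time" of the frame). The following is a minimal sketch, assuming bash and GNU date; the function name alert_state is made up for illustration and is not part of Alertmanager.

```shell
#!/usr/bin/env bash
# alert_state ENDS_AT RECEIVED_EPOCH   (hypothetical helper)
#   ENDS_AT         UTC timestamp as sent by Prometheus, e.g. 2024-01-23T03:39:32.337Z
#   RECEIVED_EPOCH  Unix time on the receiving host (a fractional part is ignored)
alert_state() {
  local ends="${1%Z}"      # drop the trailing "Z"
  ends="${ends%%.*}"       # drop fractional seconds
  ends="${ends/T/ }"       # "2024-01-23 03:39:32"
  local ends_epoch
  ends_epoch=$(date -u -d "$ends UTC" +%s) || return 1
  local received="${2%%.*}"
  if (( ends_epoch <= received )); then
    echo "resolved: endsAt is $(( received - ends_epoch ))s in the past"
  else
    echo "active: endsAt is $(( ends_epoch - received ))s in the future"
  fi
}

# endsAt and Epoch Time taken from frame 1130 of the capture.
alert_state "2024-01-23T03:39:32.337Z" "1705981236.058137187"
# prints: resolved: endsAt is 64s in the past
```

By the receiving server's clock, the alert in frame 1130 had already ended about a minute before it arrived, which is consistent with the resolved marking seen in the debug log.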
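That made me suspect that the clocks on the two servers were out of sync. Checking with timedatectl showed that they were 6 minutes apart. After setting both servers to the same time, the alert emails arrived.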
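timedatectl has to be run on each host separately. As a rough cross-check from a single host, the Date header of an HTTP response from Alertmanager can be compared with the local clock. This is a minimal sketch assuming bash, GNU date and awk; the function name http_date_skew and the ALERTMANAGER_URL placeholder are made up for illustration, and the result is only accurate to about a second plus network latency.

```shell
#!/usr/bin/env bash
# http_date_skew [LOCAL_EPOCH]   (hypothetical helper)
# Reads HTTP response headers on stdin and prints
# (remote Date header) - (local time) in seconds.
# LOCAL_EPOCH defaults to the current local Unix time.
http_date_skew() {
  local local_epoch="${1:-$(date +%s)}"
  local header remote_epoch
  # Strip CRs, then take the value of the first "Date:" header line.
  header=$(tr -d '\r' | awk 'tolower($1) == "date:" { sub(/^[^:]*:[ \t]*/, ""); print; exit }')
  [ -n "$header" ] || { echo "no Date header found" >&2; return 1; }
  remote_epoch=$(date -u -d "$header" +%s) || return 1
  echo $(( remote_epoch - local_epoch ))
}

# Hypothetical usage from the Prometheus host (ALERTMANAGER_URL is a placeholder):
# curl -s -o /dev/null -D - "$ALERTMANAGER_URL/-/healthy" | http_date_skew
```

A positive result means the remote clock is ahead of the local one. A value of several hundred seconds would point at the kind of drift described here.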
Command to set the time:
date -s "yyyy-MM-dd HH:mm:ss"
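PS: Setting resolve_timeout to a longer value made no difference. That matches the Alertmanager documentation, which says resolve_timeout has no impact on alerts from Prometheus because they always include endsAt. I still have more to learn 📖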
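One possible reason the manual curl test succeeded while the Prometheus alerts were dropped: the curl payload carries only labels, whereas the captured Prometheus requests also carry endsAt. To mimic Prometheus more closely, a test alert can be given an explicit endsAt. The sketch below assumes bash and GNU date; build_alert, the label values, the 240-second offset and ALERTMANAGER_URL are all made up for illustration.

```shell
#!/usr/bin/env bash
# build_alert ENDS_IN_SECONDS [NOW_EPOCH]   (hypothetical helper)
# Prints a one-element alert array whose endsAt is NOW + ENDS_IN_SECONDS in UTC.
# NOW_EPOCH defaults to the sender's current Unix time.
build_alert() {
  local now="${2:-$(date +%s)}"
  local ends_at
  ends_at=$(date -u -d "@$(( now + $1 ))" +%Y-%m-%dT%H:%M:%SZ) || return 1
  printf '[{"labels":{"alertname":"clock-skew-test","msgtype":"testing"},"endsAt":"%s"}]\n' "$ends_at"
}

# Hypothetical usage (ALERTMANAGER_URL is a placeholder):
# build_alert 240 | curl -X POST -H "Content-Type: application/json" -d @- "$ALERTMANAGER_URL/api/v2/alerts"
```

If the receiver's clock is ahead of the sender's by more than the offset, an alert built this way would presumably arrive already ended, which should reproduce the behaviour without waiting for a real alert to fire.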