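1. Abstract
This article describes how to monitor service status with Prometheus and generate alerts so that operations staff can respond quickly.
2. Overall architecture
The setup uses three components: the Prometheus server, Alertmanager, and Blackbox Exporter.
All of them can be downloaded from the official site: https://prometheus.io/download/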
3. Deploying Prometheus
- Download and extract
$ tar xvf prometheus-2.35.0-rc0.linux-amd64.tar.gz
$ ls prometheus-2.35.0-rc0.linux-amd64
console_libraries consoles LICENSE NOTICE prometheus prometheus.yml promtool
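- Start Prometheus. prometheus.yml is not explained in detail here; see the official documentation.
$ cd prometheus-2.35.0-rc0.linux-amd64
$ ./prometheus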
ts=2022-04-12T06:20:30.952Z caller=main.go:488 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-04-12T06:20:30.953Z caller=main.go:525 level=info msg="Starting Prometheus" version="(version=2.35.0-rc0, branch=HEAD, revision=5b73e518260d8bab36ebb1c0d0a5826eba8fc0a0
- Open http://localhost:9090 in a browser
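If you prefer a scripted check over opening the browser, the sketch below polls the server's /-/healthy endpoint. It assumes Prometheus listens on the default localhost:9090; the function name is_healthy and its opener parameter are my own choices for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(base_url="http://localhost:9090", opener=urllib.request.urlopen):
    """Return True if GET <base_url>/-/healthy answers with HTTP 200."""
    url = base_url.rstrip("/") + "/-/healthy"
    try:
        with opener(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as unhealthy.
        return False
```

Calling is_healthy() on the machine that runs Prometheus should return True once the server has finished starting. The opener parameter only exists so the function can be exercised without a live server.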
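4. Deploying Blackbox Exporter
Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community. It lets you probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
- Download and extract:
$ tar xvf blackbox_exporter-0.20.0.linux-amd64.tar.gz
$ ls blackbox_exporter-0.20.0.linux-amd64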
blackbox_exporter blackbox.yml LICENSE NOTICE
- Below is a simplified probe configuration file, blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
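Note: for more options, such as HTTP request methods, HTTP headers, request parameters, authentication, and certificate validation, see the official documentation.
- Start a Blackbox Exporter instance with the following command, which specifies the probe configuration file to use: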
blackbox_exporter --config.file=/etc/prometheus/blackbox.yml
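Before wiring the exporter into Prometheus, it can help to call its /probe endpoint directly. The sketch below builds that URL and reads the probe_success metric from the response text. It assumes the exporter listens on its default port 9115 on the local machine; probe_url and probe_succeeded are names I made up for this example.

```python
from urllib.parse import urlencode


def probe_url(target, module="http_2xx", exporter="127.0.0.1:9115"):
    """Build the /probe URL that asks the exporter to probe `target`."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


def probe_succeeded(metrics_text):
    """Return True if the metrics text reports `probe_success 1`."""
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and unrelated metrics.
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False
```

Fetching probe_url("http://www.baidu.com") with curl or urllib.request and passing the body to probe_succeeded tells you whether the http_2xx module considered the target reachable.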
- Integrate with Prometheus: add the following job under scrape_configs in prometheus.yml to probe http://www.123.com and http://www.baidu.com
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # module name defined in blackbox.yml
  static_configs:
    - targets:
      - http://www.123.com    # http
      - http://www.baidu.com  # http
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # address of the Blackbox Exporter
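The relabeling above is the least obvious part of this job, so here is a small Python sketch that mimics what the three relabel_configs steps do to one target. It is an illustration only, not Prometheus code; relabel and scrape_url are hypothetical helper names.

```python
from urllib.parse import urlencode


def relabel(target, module="http_2xx", exporter="127.0.0.1:9115"):
    """Apply the three relabel steps to one static target."""
    # Prometheus starts with the target as __address__ and exposes each
    # configured URL parameter as a __param_<name> label.
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the original address into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: expose the probed URL as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the Blackbox Exporter.
    labels["__address__"] = exporter
    return labels


def scrape_url(labels, metrics_path="/probe"):
    """Build the URL Prometheus would request after relabeling."""
    params = {
        name[len("__param_"):]: value
        for name, value in labels.items()
        if name.startswith("__param_")
    }
    return f"http://{labels['__address__']}{metrics_path}?{urlencode(params)}"
```

In other words, Prometheus never scrapes http://www.123.com itself. It scrapes the exporter at 127.0.0.1:9115 and passes the site as the target parameter, while the instance label keeps the site URL so that graphs and alerts stay readable.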
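- Restart Prometheus and open the Prometheus web page to verify that the new targets are being probed
5. Deploying Alertmanager
- Download and extract:
$ tar xvf alertmanager-$VERSION.linux-amd64.tar.gz
$ ls alertmanager-$VERSION.linux-amd64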
alertmanager alertmanager.yml amtool data LICENSE NOTICE
- The extracted archive includes a default alertmanager.yml configuration file with the following content:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'  # service that receives the alerts
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
An Alertmanager configuration has two main parts: the route (route) and the receivers (receivers). Every alert enters the routing tree at the top-level route and is then sent to the matching receiver according to the routing rules.
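To make the idea of a routing tree concrete, here is a deliberately simplified Python model of how an alert's labels select a receiver. It only handles exact-match child routes and ignores features of the real Alertmanager such as regex matchers and continue; the oncall receiver in the usage note is hypothetical and does not appear in the configuration above.

```python
def find_receiver(route, labels):
    """Walk a simplified routing tree and return the receiver for an alert."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(labels.get(key) == value for key, value in match.items()):
            # A child route inherits the parent's receiver unless it sets its own.
            inherited = {"receiver": route["receiver"], **child}
            return find_receiver(inherited, labels)
    # No child matched: the alert stays with the current route's receiver.
    return route["receiver"]
```

With the default file there are no child routes, so find_receiver({"receiver": "web.hook"}, labels) returns web.hook for every alert. Adding a child such as {"match": {"severity": "critical"}, "receiver": "oncall"} under a routes key would divert only critical alerts to that receiver.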
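- Start Alertmanager
./alertmanager
You can also override settings with command-line flags when starting Alertmanager: --config.file sets the path to the Alertmanager configuration file, and --storage.path sets the data storage directory.
- Check that it is running
Once started, Alertmanager listens on port 9093: http://localhost:9093
- Connect Prometheus to Alertmanager
In the Prometheus architecture, alerting is split into two independent parts: Prometheus generates the alerts, and Alertmanager handles everything that happens after an alert has been generated. Once Alertmanager is deployed, Prometheus therefore has to be told where to find it.
Edit the Prometheus configuration file prometheus.yml and add the following:
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']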
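- Configure the alerting rules
Create a new file, blackbox_rules.yml:
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: up == 0  # any PromQL expression; up == 0 means the scrape itself failed
    for: 1m        # fire only after the condition has held for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Job {{ $labels.job }} {{ $labels.instance }}.'
      summary: '{{ $labels.instance }} down ! ! !'
Note: for targets scraped through Blackbox Exporter, up only shows whether Prometheus could reach the exporter. A failed probe is reported by the probe_success metric, so probe_success == 0 is the expression to use for alerting on the probed sites themselves.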
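The for clause is easy to misread, so here is a toy Python model of how an alert moves between inactive, pending, and firing. It is a simplified sketch of the behaviour, not the Prometheus rule engine; the function name alert_states and the sample format are assumptions made for this example.

```python
def alert_states(samples, for_seconds=60):
    """Yield (timestamp, state) for a series of (timestamp, up_value) samples.

    Models a rule with `expr: up == 0` and `for: 1m`.
    """
    active_since = None
    for timestamp, value in samples:
        if value != 0:
            active_since = None           # condition is false again: reset
            state = "inactive"
        else:
            if active_since is None:
                active_since = timestamp  # first evaluation where up == 0
            held = timestamp - active_since
            state = "firing" if held >= for_seconds else "pending"
        yield timestamp, state
```

With one evaluation every 15 seconds, an unbroken outage stays pending for the first four failing evaluations and turns firing on the fifth, when the condition has held for 60 seconds. A single successful sample in between resets the timer, which is what keeps short blips from paging anyone.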
- Edit the Prometheus configuration file prometheus.yml and add the following:
rule_files:
  - "blackbox_rules.yml"
- Restart Prometheus. The configuration is now complete.
Open http://localhost:9093/#/alerts to view the alerts
6. Verifying alert delivery
To verify that alerts are actually delivered, I wrote a simple HTTP service and tested it through Alertmanager's webhook receiver.
- Modify alertmanager.yml and restart Alertmanager
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:8981/'  # address of the service that receives the alerts
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
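The inhibit_rules section is copied from the default file without comment, so here is a rough Python reading of it: a warning alert is muted while a critical alert with the same alertname, dev, and instance labels is firing. This is a simplified sketch, not Alertmanager's implementation; alerts are modelled as plain label dictionaries and is_inhibited is a hypothetical name.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by one of `firing_alerts` under `rule`."""
    def matches(alert, wanted):
        return all(alert.get(key, "") == value for key, value in wanted.items())

    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label is treated like an empty value, so labels absent
        # on both sides (such as dev here) still count as equal.
        if all(source.get(label, "") == target.get(label, "") for label in rule["equal"]):
            return True
    return False
```

Since the only rule in this article labels its alerts critical, nothing is inhibited in practice. The rule starts to matter once a second, warning-level rule is added for the same instances.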
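- A simple HTTP service, written in Python, that receives the alerts (saved as server.py)
# coding=utf-8
import json
from http.server import HTTPServer, BaseHTTPRequestHandler


class Request(BaseHTTPRequestHandler):
    def do_POST(self):
        # Alertmanager posts the notification as a JSON document,
        # so decode the body with json.loads rather than as form data.
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length).decode('utf-8')
        try:
            post_data = json.loads(body)
        except json.JSONDecodeError:
            post_data = body  # keep the raw text if it is not valid JSON
        data = {"Method:": self.command,
                "Path:": self.path,
                "Post Data": post_data}
        print(data)
        # Echo what was received so that the sender gets a 200 response.
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())


def start_server():
    host = ('localhost', 8981)  # must match the webhook url in alertmanager.yml
    server = HTTPServer(host, Request)
    print("Starting server, listen at: %s:%s" % host)
    server.serve_forever()  # blocks until the process is stopped


if __name__ == '__main__':
    start_server()
Start the service and watch for incoming alerts. The capture below comes from a run in which the request body was decoded as form data with urllib.parse.parse_qs, which is why the JSON payload appears split at its first '=' into a single key and value; decoded with json.loads, the same payload prints as a nested dictionary.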
$ python server.py
Starting server, listen at: localhost:8981
{'Method:': 'POST', 'Post Data': {'{"receiver":"web\\\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"blackbox_network_stats","instance":"172.17.0.1:8001","job":"kong","severity":"critical"},"annotations":{"description":"Job kong 172.17.0.1:8001.","summary":"172.17.0.1:8001 down ! ! !"},"startsAt":"2022-04-12T06:21:45.185Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://ubuntu:9090/graph?g0.expr': ['up == 0\\u0026g0.tab=1","fingerprint":"5776c946d916f29c"}],"groupLabels":{"alertname":"blackbox_network_stats"},"commonLabels":{"alertname":"blackbox_network_stats","instance":"172.17.0.1:8001","job":"kong","severity":"critical"},"commonAnnotations":{"description":"Job kong 172.17.0.1:8001.","summary":"172.17.0.1:8001 down ! ! !"},"externalURL":"http://ubuntu:9093","version":"4","groupKey":"{}:{alertname=\\"blackbox_network_stats\\"}","truncatedAlerts":0}\n']}, 'Path:': '/'}
127.0.0.1 - - [12/Apr/2022 14:30:30] "POST / HTTP/1.1" 200 -
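A real receiver usually wants a few fields from the notification rather than the whole document. The sketch below pulls one summary line per alert out of a webhook body, using only field names that are visible in the payload above (alerts, status, labels, annotations); summarize_webhook is a name I chose for illustration.

```python
import json


def summarize_webhook(body):
    """Return one summary line per alert in an Alertmanager webhook body."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines
```

Inside do_POST this could be called on the decoded request body, and the resulting lines forwarded to whatever channel the operations team actually watches.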