1 Synchronize the clocks
Before setting up alerting, synchronize the time on every machine and verify it once more.
ntpdate cn.ntp.org.cn
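2 Deploying on Linux
Step 1: Download the installation package
Download the package: alertmanager-0.16.2.linux-amd64.tar.gz
Link: https://pan.baidu.com/s/1kRDIZ8zPByhjs11JP30e5A
Extraction code: l3i1
Step 2: Upload the archive and extract it into the target directory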
[root@localhost ~]# mv alertmanager-0.16.2.linux-amd64.tar.gz /opt/prometheus/
[root@localhost ~]# cd /opt/prometheus/
[root@localhost prometheus]# ls
alertmanager-0.16.2.linux-amd64.tar.gz prometheus-2.6.1.linux-amd64
grafana-5.3.4-1.x86_64.rpm prometheus-2.6.1.linux-amd64.tar.gz
[root@localhost prometheus]# tar -zxvf alertmanager-0.16.2.linux-amd64.tar.gz
alertmanager-0.16.2.linux-amd64/
alertmanager-0.16.2.linux-amd64/LICENSE
alertmanager-0.16.2.linux-amd64/alertmanager.yml
alertmanager-0.16.2.linux-amd64/alertmanager
alertmanager-0.16.2.linux-amd64/amtool
alertmanager-0.16.2.linux-amd64/NOTICE
[root@localhost prometheus]#
[root@localhost prometheus]# ls
alertmanager-0.16.2.linux-amd64 prometheus-2.6.1.linux-amd64
alertmanager-0.16.2.linux-amd64.tar.gz prometheus-2.6.1.linux-amd64.tar.gz
grafana-5.3.4-1.x86_64.rpm
[root@localhost prometheus]# mv alertmanager-0.16.2.linux-amd64 alertmanager
[root@localhost prometheus]# ls
alertmanager prometheus-2.6.1.linux-amd64
alertmanager-0.16.2.linux-amd64.tar.gz prometheus-2.6.1.linux-amd64.tar.gz
grafana-5.3.4-1.x86_64.rpm
[root@localhost prometheus]#
Check that the installation succeeded
[root@localhost alertmanager]# ./alertmanager --version
alertmanager, version 0.16.2 (branch: HEAD, revision: 308b7620642dc147794e6686a3f94d1b6fc8ef4d)
build user: root@1e9a48272b38
build date: 20190405-12:27:40
go version: go1.11.6
[root@localhost alertmanager]#
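Step 3: Start Alertmanager
Start Alertmanager so that it can receive the alerts sent by Prometheus and deliver notifications through the configured channels.
Run the following from the Alertmanager installation directory: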
[root@localhost alertmanager]# ./alertmanager --config.file=alertmanager.yml
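Alertmanager listens on port 9093 by default. Once it has started, open http://<IP>:9093 in a browser
to see the built-in UI. Because no alerting rules have been configured to trigger alerts yet, no alerts are shown at this point.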
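Besides opening the UI in a browser, the startup can be checked from the command line. The following is a minimal sketch, assuming curl is installed and Alertmanager is reachable at 127.0.0.1:9093; the variable and function names (AM_HOST, AM_PORT, am_endpoint, am_check) are illustrative and not part of Alertmanager. It relies on the /-/healthy and /-/ready management endpoints.

```shell
#!/bin/sh
# Assumed address of the Alertmanager instance; override via environment.
AM_HOST="${AM_HOST:-127.0.0.1}"
AM_PORT="${AM_PORT:-9093}"

# Print the URL of a management endpoint such as "healthy" or "ready".
am_endpoint() {
    printf 'http://%s:%s/-/%s\n' "$AM_HOST" "$AM_PORT" "$1"
}

# Succeed only if the endpoint answers with HTTP 200 within 3 seconds.
am_check() {
    code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' "$(am_endpoint "$1")")
    [ "$code" = "200" ]
}

# Example: am_check healthy && am_check ready && echo "Alertmanager is up"
```

The functions only build and probe URLs, so they can be sourced into a deployment script without side effects.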
3 Configuring alert notifications
Inspect the directory layout
[root@localhost prometheus]# cd alertmanager/
[root@localhost alertmanager]# ls
alertmanager alertmanager.yml amtool LICENSE NOTICE
[root@localhost alertmanager]# ll
总用量 38964
-rwxr-xr-x. 1 3434 3434 23072841 4月 5 2019 alertmanager
-rw-r--r--. 1 3434 3434 380 4月 5 2019 alertmanager.yml
-rwxr-xr-x. 1 3434 3434 16801752 4月 5 2019 amtool
-rw-r--r--. 1 3434 3434 11357 4月 5 2019 LICENSE
-rw-r--r--. 1 3434 3434 457 4月 5 2019 NOTICE
[root@localhost alertmanager]#
3.1 View the default configuration
[root@localhost alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
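3.2 What the main sections do
- global: global settings, including the timeout after which an alert is considered resolved, SMTP settings, and the API endpoints of the various notification channels.
- route: defines how alerts are dispatched. It is a tree structure that is matched depth-first, from left to right.
- receivers: defines who receives the alert notifications, for example through common channels such as email, wechat, slack, and webhook.
- inhibit_rules: inhibition rules. When alerts matching one set of matchers (the source) exist, an inhibition rule mutes the alerts matching another set of matchers (the target).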
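3.3 Configuring email alerts
The alert configuration in detail:
global:
  resolve_timeout: 5m # timeout, 5 minutes by default
  # SMTP server of QQ Mail. The official address is smtp.qq.com on port 465 or 587; the POP3/SMTP service must also be enabled for the mailbox.
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'xxx@qq.com'
  smtp_auth_username: 'xxx@qq.com'
  smtp_auth_password: 'xxxxxx' # the mailbox authorization code, not the login password
  smtp_require_tls: false
  # Whether to use TLS; enable or disable it depending on the environment.
  # If you get the error "email.loginAuth failed: 530 Must issue a STARTTLS command first", set it to true.
  # If TLS is enabled and you get "starttls failed: x509: certificate signed by unknown authority", set tls_config.insecure_skip_verify: true under email_configs to skip TLS verification.
  smtp_hello: 'qq.com'
route: # defines how alerts are dispatched
  group_by: ['alertname'] # label(s) used to group alerts
  # How long to wait for a group: after an alert fires, wait 5s so that other alerts of the same group can be sent together.
  group_wait: 5s
  group_interval: 5s # how long to wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 5m # how long to wait before re-sending a notification for an alert that is still firing; reduces duplicate emails
  receiver: 'email' # default receiver
receivers: # who receives the alert notifications
- name: 'email' # receiver name
  email_configs:
  # address that receives the alerts
  - to: 'xxxxxxxx@qq.com'
    send_resolved: true # also notify when the alert is resolved
# Inhibition rules: when alerts matching one set of matchers (the source) exist, mute the alerts matching another set of matchers (the target).
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']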
3.4 Applying the configuration
[root@localhost alertmanager]# vim alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '15***775@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '154***75@qq.com'
  smtp_auth_password: 'y***bhjhi'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '154***5@qq.com'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
~
"alertmanager.yml" 25L, 566C 已写入
[root@localhost alertmanager]#
3.5 Check the configuration with amtool
After editing the configuration file, you can validate it with the amtool utility:
[root@localhost alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 1 receivers
- 0 templates
[root@localhost alertmanager]#
3.6 Restart Alertmanager
[root@localhost alertmanager]# ./alertmanager --config.file=alertmanager.yml
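As an alternative to stopping and starting the process, a running Alertmanager can be asked to re-read its configuration. The sketch below combines the amtool check from 3.5 with an HTTP POST to the /-/reload endpoint. It assumes curl is installed and that the paths and URL match this walkthrough; AM_DIR, AM_URL, and am_reload are illustrative names, not part of Alertmanager.

```shell
#!/bin/sh
# Assumed locations; adjust to your installation.
AM_DIR="${AM_DIR:-/opt/prometheus/alertmanager}"
AM_URL="${AM_URL:-http://127.0.0.1:9093}"

# Validate the given config file, then ask the running Alertmanager to
# re-read the file it was started with. Nothing is reloaded if the check fails.
am_reload() {
    if ! "$AM_DIR/amtool" check-config "$1"; then
        echo "config check failed, not reloading" >&2
        return 1
    fi
    curl -s -X POST "$AM_URL/-/reload"
}

# Example: am_reload "$AM_DIR/alertmanager.yml"
```

Note that the reload re-reads the file passed to --config.file at startup, so the file being checked should be that same file.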
4 Configuring Alertmanager alerting rules in Prometheus
In Prometheus, configure the Alertmanager address and the alerting rules. Create a new rule file, node-up.rules,
as follows:
4.1 The node-up.rules rule file
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="node-exporter"} == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
4.2 Hands-on steps
[root@localhost prometheus-2.6.1.linux-amd64]# ls
console_libraries consoles data LICENSE NOTICE prometheus prometheus.yml promtool
[root@localhost prometheus-2.6.1.linux-amd64]# mkdir rules
[root@localhost prometheus-2.6.1.linux-amd64]# cd rules/
[root@localhost rules]# vim node-up.rules
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="agent1"} == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
~
~
"node-up.rules" [新] 11L, 237C 已写入
[root@localhost rules]#
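This rule checks whether a node is alive:
- expr: a PromQL expression that checks whether the target with job="agent1" is up.
- for: once the alert enters the Pending state, it waits 15s before moving to Firing; as soon as it is Firing, the alert is sent to Alertmanager.
- labels and annotations: attach extra identifying information to the alert. All labels and annotations added here, together with any labels already set for the job in prometheus.yml, are automatically included in the email body.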
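Before pointing Prometheus at the rule file, it can be validated with promtool, which ships in the same directory as the prometheus binary. This is a small sketch under the assumption that the installation path matches this walkthrough; PROMTOOL and check_rules are illustrative names.

```shell
#!/bin/sh
# Assumed path to promtool; adjust to your installation.
PROMTOOL="${PROMTOOL:-/opt/prometheus/prometheus-2.6.1.linux-amd64/promtool}"

# Run "promtool check rules" on every *.rules file in a directory.
# Returns 1 on the first invalid file and 2 if the directory has no rule files.
check_rules() {
    found=0
    for f in "$1"/*.rules; do
        [ -e "$f" ] || continue   # the glob matched nothing
        found=1
        "$PROMTOOL" check rules "$f" || return 1
    done
    if [ "$found" -eq 0 ]; then
        echo "no .rules files in $1" >&2
        return 2
    fi
}

# Example: check_rules /opt/prometheus/prometheus-2.6.1.linux-amd64/rules
```

Checking the whole directory mirrors the *.rules glob used in prometheus.yml, so every file Prometheus would load gets validated.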
4.3 Edit prometheus.yml to add the rule files
[root@localhost ~]# cd /opt/prometheus/prometheus-2.6.1.linux-amd64/
[root@localhost prometheus-2.6.1.linux-amd64]# vim prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.156.133:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/opt/prometheus/prometheus-2.6.1.linux-amd64/rules/*.rules"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'agent1'
    static_configs:
    - targets: ['192.168.156.133:9100']
"prometheus.yml" 32L, 1074C 已写入
4.4 Restart Prometheus
[root@localhost prometheus-2.6.1.linux-amd64]# pkill prometheus
[root@localhost prometheus-2.6.1.linux-amd64]# lsof -i:9090
[root@localhost prometheus-2.6.1.linux-amd64]# ./prometheus --config.file=prometheus.yml &
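A full restart is not strictly required after changing prometheus.yml or the rule files, because Prometheus re-reads its configuration when it receives SIGHUP. The following sketch assumes pgrep is available and that the process is named prometheus, as in this walkthrough; reload_prometheus is an illustrative name.

```shell
#!/bin/sh
# Send SIGHUP to every running process named exactly "prometheus".
# Returns 1 if no such process is found.
reload_prometheus() {
    pids=$(pgrep -x prometheus) || {
        echo "prometheus is not running" >&2
        return 1
    }
    # Intentionally unquoted so that several PIDs become separate arguments.
    kill -HUP $pids
}

# Example: reload_prometheus && echo "reload signal sent"
```

If the new configuration is invalid, Prometheus keeps running with the old one and logs an error, so it is worth watching the log output after sending the signal.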
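4.5 Check that the configuration took effect
Follow the steps shown below to reach the following page.
This confirms that the configuration was applied successfully.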
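4.6 The three alert states
A Prometheus alert is in one of three states: Inactive, Pending, or Firing.
- Inactive: the rule is being evaluated, but no alert has been triggered yet.
- Pending: the rule's expression is true, but it has not yet held for the duration set by for. Once it has held for that long, the alert moves to Firing.
- Firing: the alert is sent to Alertmanager, which applies grouping, inhibition, and silencing and then notifies the receivers according to its configuration. Once the alert clears, the state returns to Inactive, and the cycle repeats.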
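The transitions between the three states can be sketched as a small function. This is a simplified model for illustration only, not how Prometheus is implemented; alert_state and its three arguments are hypothetical names.

```shell
#!/bin/sh
# alert_state EXPR_TRUE ACTIVE_SECONDS FOR_SECONDS
#   EXPR_TRUE       1 if the rule expression currently returns a result, else 0
#   ACTIVE_SECONDS  how long the expression has been continuously true
#   FOR_SECONDS     the rule's "for" duration
alert_state() {
    if [ "$1" -ne 1 ]; then
        echo Inactive
    elif [ "$2" -lt "$3" ]; then
        echo Pending
    else
        echo Firing
    fi
}

alert_state 0 0 15    # prints Inactive
alert_state 1 5 15    # prints Pending
alert_state 1 15 15   # prints Firing
```

With for: 15s, as in node-up.rules, the alert stays Pending until the expression has been true for 15 seconds and then becomes Firing.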
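5 Triggering an alert
The rule defined above checks whether the node with job="agent1" is alive,
so stopping the node-exporter service simulates a node going down, which satisfies the alert condition and triggers the rule.
Check the configuration to confirm the port and other details of the monitored node, then stop the corresponding process: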
[root@localhost prometheus-2.6.1.linux-amd64]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.156.133:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/opt/prometheus/prometheus-2.6.1.linux-amd64/rules/*.rules"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'agent1'
    static_configs:
    - targets: ['192.168.156.133:9100']
Find the process listening on that port
[root@localhost prometheus-2.6.1.linux-amd64]# lsof -i:9100
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node_expo 67601 root 3u IPv6 629243 0t0 TCP *:jetdirect (LISTEN)
node_expo 67601 root 5u IPv6 783547 0t0 TCP localhost.localdomain:jetdirect->localhost.localdomain:47248 (ESTABLISHED)
prometheu 76836 root 15u IPv4 783055 0t0 TCP localhost.localdomain:47248->localhost.localdomain:jetdirect (ESTABLISHED)
Stop the process of the agent1 node
[root@localhost prometheus-2.6.1.linux-amd64]# kill 67601
[root@localhost prometheus-2.6.1.linux-amd64]# lsof -i:9100
[root@localhost prometheus-2.6.1.linux-amd64]#
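After the service is stopped:
- After about 15s, the node-exporter target on the Prometheus targets page is shown as unhealthy.
- After another 15s, the alerts page changes from the green node-up (0 active) Inactive state to the yellow node-up (1 active) Pending state.
- After a further 15s, the state turns red (Firing) and the alert is sent to Alertmanager, which then emails the receiver according to its configuration.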
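The same progression can be followed from the command line by querying the ALERTS series that Prometheus exposes for pending and firing alerts. This is a minimal sketch, assuming curl is installed and Prometheus listens on 127.0.0.1:9090; PROM_URL, alerts_selector, and query_alert are illustrative names.

```shell
#!/bin/sh
# Assumed address of the Prometheus server; override via environment.
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# Print the PromQL selector for the ALERTS series of one alerting rule.
alerts_selector() {
    printf 'ALERTS{alertname="%s"}\n' "$1"
}

# Ask Prometheus for the current ALERTS series of the rule. The alertstate
# label in the JSON response is "pending" or "firing"; an empty result
# means the alert is inactive.
query_alert() {
    curl -s -G "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$(alerts_selector "$1")"
}

# Example: query_alert node-up
```

Running query_alert node-up every few seconds after stopping node_exporter should show the result going from empty to pending and then to firing.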
Check the mailbox
Restart the node exporter
[root@localhost node_export]# nohup ./node_exporter &
[2] 81062
[1] 已终止 nohup ./node_exporter
[root@localhost node_export]# nohup: 忽略输入并把输出追加到"nohup.out"
[root@localhost node_export]#
Because send_resolved is set to true, another email is sent once the node is back up, as shown below
5 Sending notifications with a custom template
5.1 Write the template file
Create a template directory inside the Alertmanager installation directory and write the template file there.
The template file is as follows:
{{ define "email.from" }}xxxxxxxx@qq.com{{ end }}
{{ define "email.to" }}xxxxxxxx@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
Hands-on steps
[root@localhost alertmanager]#
[root@localhost alertmanager]# mkdir template
[root@localhost alertmanager]# ls
alertmanager alertmanager.yml amtool data LICENSE NOTICE template
[root@localhost alertmanager]# cd template/
[root@localhost template]# vim email1.tepl
{{ define "email.from" }}xxxxxxxx@qq.com{{ end }}
{{ define "email.to" }}xxxxxxxx@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
~
"email1.tepl" [新] 15L, 550C 已写入
[root@localhost template]#
5.2 Add a new Alertmanager configuration file for testing
global:
  resolve_timeout: 5m
  smtp_from: '{{ template "email.from" . }}'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '{{ template "email.from" . }}'
  smtp_auth_password: 'xxxxxx' # the mailbox authorization code, not the login password
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/opt/prometheus/alertmanager/template/email1.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '{{ template "email.to" . }}'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
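email.from, email.to, and email.to.html are three named templates that can be referenced directly in alertmanager.yml.
email.to.html is the body of the email to send. Both HTML and plain-text formats are supported; HTML is used here to give a simple, readable layout.
{{ range .Alerts }} is a loop that iterates over the matched alerts. The information shown is the same as in the default email above, with only some of the core values extracted for display.
Note that .StartsAt.Format takes a Go reference-time layout such as 2006-01-02 15:04:05, not an example date.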
Hands-on steps:
[root@localhost alertmanager]# ls
alertmanager alertmanager.yml amtool data LICENSE NOTICE template
[root@localhost alertmanager]# vim alertmanager1.yml
global:
  resolve_timeout: 5m
  smtp_from: '{{ template "email.from" . }}'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '{{ template "email.from" . }}'
  smtp_auth_password: 'ymbww****xbhjhi'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/etc/alertmanager-tmpl/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '{{ template "email.to" . }}'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
"alertmanager1.yml" [新] 29L, 719C 已写入
[root@localhost alertmanager]# ls
alertmanager alertmanager1.yml alertmanager.yml amtool data LICENSE NOTICE template
[root@localhost alertmanager]#
5.3 Check that the configuration file is valid
[root@localhost alertmanager]# ./amtool check-config alertmanager1.yml
Checking 'alertmanager1.yml' SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 1 receivers
- 1 templates
SUCCESS
[root@localhost alertmanager]#
5.4 Start Alertmanager
[root@localhost alertmanager]# ./alertmanager --config.file=alertmanager1.yml
level=info ts=2022-02-10T03:24:22.679397949Z caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.2, branch=HEAD, revision=308b7620642dc147794e6686a3f94d1b6fc8ef4d)"
level=info ts=2022-02-10T03:24:22.679510727Z caller=main.go:178 build_context="(go=go1.11.6, user=root@1e9a48272b38, date=20190405-12:27:40)"
level=info ts=2022-02-10T03:24:22.68530334Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.156.133 port=9094
level=info ts=2022-02-10T03:24:22.689931066Z caller=cluster.go:632 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2022-02-10T03:24:22.703779166Z caller=main.go:334 msg="Loading configuration file" file=alertmanager1.yml
level=info ts=2022-02-10T03:24:22.707237841Z caller=main.go:428 msg=Listening address=:9093
level=info ts=2022-02-10T03:24:24.690305758Z caller=cluster.go:657 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000287352s
level=info ts=2022-02-10T03:24:32.693591832Z caller=cluster.go:649 component=cluster msg="gossip settled; proceeding" elapsed=10.003586882s
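5.5 Update node-up.rules
The template references the {{ .Annotations.description }} variable, but node-up.rules did not define a description annotation earlier, so no value would be available for it.
The rule file configured earlier therefore needs to be updated in the Prometheus installation directory: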
[root@localhost prometheus-2.6.1.linux-amd64]# ls
console_libraries consoles data LICENSE NOTICE prometheus prometheus.yml promtool rules
[root@localhost prometheus-2.6.1.linux-amd64]# cd rules/
[root@localhost rules]# ls
node-up.rules
[root@localhost rules]# vim node-up.rules
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="agent1"} == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
      description: "{{ $labels.instance }} stopped unexpectedly! Please investigate!"
~
"node-up.rules" 12L, 323C 已写入
[root@localhost rules]#
5.6 Restart the Prometheus service
[root@localhost rules]#
[root@localhost rules]# cd ..
[root@localhost prometheus-2.6.1.linux-amd64]# ls
console_libraries consoles data LICENSE NOTICE prometheus prometheus.yml promtool rules
[root@localhost prometheus-2.6.1.linux-amd64]# pkill prometheus
level=warn ts=2022-02-10T03:28:40.638273674Z caller=main.go:405 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2022-02-10T03:28:40.638327573Z caller=main.go:430 msg="Stopping scrape discovery manager..."
level=info ts=2022-02-10T03:28:40.638335586Z caller=main.go:444 msg="Stopping notify discovery manager..."
level=info ts=2022-02-10T03:28:40.63834017Z caller=main.go:466 msg="Stopping scrape manager..."
level=info ts=2022-02-10T03:28:40.638359536Z caller=main.go:426 msg="Scrape discovery manager stopped"
level=info ts=2022-02-10T03:28:40.638369808Z caller=main.go:440 msg="Notify discovery manager stopped"
level=info ts=2022-02-10T03:28:40.638431616Z caller=manager.go:664 component="rule manager" msg="Stopping rule manager..."
level=info ts=2022-02-10T03:28:40.638478552Z caller=manager.go:670 component="rule manager" msg="Rule manager stopped"
level=info ts=2022-02-10T03:28:40.638521662Z caller=main.go:460 msg="Scrape manager stopped"
[root@localhost prometheus-2.6.1.linux-amd64]# level=info ts=2022-02-10T03:28:40.640008618Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2022-02-10T03:28:40.640035125Z caller=main.go:615 msg="Notifier manager stopped"
level=info ts=2022-02-10T03:28:40.640192411Z caller=main.go:627 msg="See you next time!"
[1]+ 完成 ./prometheus --config.file=prometheus.yml
[root@localhost prometheus-2.6.1.linux-amd64]# lsof -i:9090
[root@localhost prometheus-2.6.1.linux-amd64]# ./prometheus --config.file=prometheus.yml &
[1] 81615
[root@localhost prometheus-2.6.1.linux-amd64]# level=info ts=2022-02-10T03:28:53.958420258Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.1, branch=HEAD, revision=b639fe140c1f71b2cbad3fc322b17efe60839e7e)"
level=info ts=2022-02-10T03:28:53.95851453Z caller=main.go:244 build_context="(go=go1.11.4, user=root@4c0e286fe2b3, date=20190115-19:12:04)"
level=info ts=2022-02-10T03:28:53.958534672Z caller=main.go:245 host_details="(Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 localhost.localdomain (none))"
level=info ts=2022-02-10T03:28:53.958548683Z caller=main.go:246 fd_limits="(soft=1024, hard=4096)"
level=info ts=2022-02-10T03:28:53.95855905Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-02-10T03:28:53.959002719Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2022-02-10T03:28:53.959671934Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-02-10T03:28:53.959878293Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1644301801123 maxt=1644364800000 ulid=01FVEDMKCQGGJ3F9NDEETVAZW0
level=info ts=2022-02-10T03:28:53.959919384Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1644364800000 maxt=1644429600000 ulid=01FVGFR6499R9A354RPZ3BC6ET
level=info ts=2022-02-10T03:28:53.95993753Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1644451200000 maxt=1644458400000 ulid=01FVGS5JZ95MA7N14KF461PTZ5
level=info ts=2022-02-10T03:28:53.959958412Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1644429600000 maxt=1644451200000 ulid=01FVGS5K4K55VJQEH45W337PQK
level=warn ts=2022-02-10T03:28:54.114211565Z caller=head.go:434 component=tsdb msg="unknown series references" count=320781
level=info ts=2022-02-10T03:28:54.116993838Z caller=main.go:571 msg="TSDB started"
level=info ts=2022-02-10T03:28:54.117041776Z caller=main.go:631 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2022-02-10T03:28:54.11854499Z caller=main.go:657 msg="Completed loading of configuration file" filename=prometheus.yml
level=info ts=2022-02-10T03:28:54.118568236Z caller=main.go:530 msg="Server is ready to receive web requests."
[root@localhost prometheus-2.6.1.linux-amd64]# lsof -i:9090
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
prometheu 81615 root 3u IPv6 838619 0t0 TCP *:websm (LISTEN)
prometheu 81615 root 7u IPv4 838620 0t0 TCP localhost:43908->localhost:websm (ESTABLISHED)
prometheu 81615 root 8u IPv6 838621 0t0 TCP localhost:websm->localhost:43908 (ESTABLISHED)
[root@localhost prometheus-2.6.1.linux-amd64]#
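5.7 Testing
The configuration above has some problems, and testing produces the issue shown below.
It appears that the content defined in the template cannot be retrieved. One inconsistency in the steps above may be related: the template file was saved as email1.tepl, while alertmanager1.yml loads /etc/alertmanager-tmpl/email.tmpl. Treat the steps above as a reference; the final result looks like this: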
[root@localhost node_export]# lsof -i:9100
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
prometheu 83165 root 23u IPv4 870803 0t0 TCP localhost.localdomain:55416->localhost.localdomain:jetdirect (ESTABLISHED)
node_expo 84543 root 3u IPv6 870789 0t0 TCP *:jetdirect (LISTEN)
node_expo 84543 root 5u IPv6 870804 0t0 TCP localhost.localdomain:jetdirect->localhost.localdomain:55416 (ESTABLISHED)
[root@localhost node_export]# kill 84543
[root@localhost node_export]#
After the node exporter is restarted, an email is sent as well
[root@localhost node_export]# nohup ./node_exporter &
[1] 84685
[root@localhost node_export]# nohup: 忽略输入并把输出追加到"nohup.out"
[root@localhost node_export]#