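Single-node setup:
- Review how a Zabbix server is set up.
- Scraped data is stored in ./data by default, and by default every 2h of data is written as one block: https://www.ctolib.com/docs/sfile/prometheus-book/ha/prometheus-local-storage.html
- How does the alerting configuration take effect, and how do you find out what is wrong with the current alerting configuration?
Why the configuration did not take effect, and the main points to get right:
- The labels and annotations defined in the rules file can all be shown in the notification that is sent. The link setting used for Slack lets the alert recipients click straight through from the message in Slack to where the alert fired, or to whatever page you want them to see.
- The username in the rule file may be written in Chinese.
- In the Slack section of alertmanager.yml, api_url is written without quotes and channel must be specified; otherwise Alertmanager reports the following error: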
level=error ts=2018-10-19T08:42:36.63691218Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="cancelling noretry for \"slack\" due to unrecoverable error: unexpected status code 404"
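One way to narrow down a 404 like this is to post a test message straight to the webhook, without Alertmanager in between. The following is only a sketch under assumptions: the `SLACK_WEBHOOK_URL` environment variable and both helper names are hypothetical, and the payload fields (`channel`, `text`) follow Slack's incoming-webhook JSON format.

```python
import json
import urllib.error
import urllib.request


def build_payload(channel, text):
    """Return the JSON body for a Slack incoming-webhook test message."""
    return json.dumps({"channel": channel, "text": text}).encode("utf-8")


def send_test_message(webhook_url, channel, text):
    """POST a test message and return the HTTP status code Slack answers with."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(channel, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A non-2xx answer (for example 404) arrives as HTTPError; return its code.
        return err.code


# Example call (hypothetical variable name, needs network access):
#   import os
#   print(send_test_message(os.environ["SLACK_WEBHOOK_URL"],
#                           "#test-alermanager", "webhook test"))
```

If this call also returns 404, the problem is likely in the webhook URL or the channel name rather than in the Alertmanager templates.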
Example:
# prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/usr/local/prometheus-2.4.3/rules/test.yml" # use an absolute path, or a path relative to the directory containing prometheus.yml; other relative paths are not found
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
        labels:
          instance: localhost
  - job_name: 'linux'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: node1
      - targets: ['172.18.2.28:9090']
        labels:
          instance: node2
      - targets: ['172.18.2.28:1234']
        labels:
          instance: node3
# rules/test.yml
groups:
  - name: test
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
          summary: 'Instance {{ $labels.instance }} down'
          link: 'http://172.18.2.27:9090/alerts'
          color: "#D00000" # color of the message when it is sent; #D00000 is red
          username: "刘蓉"
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'lori_liurong@163.com'
  smtp_auth_username: 'lori_liurong@163.com'
  smtp_auth_password: 'liurong199686'
  smtp_require_tls: false
route:
  group_by: ['ip','id','type']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2h # interval before an alert is sent again, provided the previous send succeeded
  receiver: 'liurong'
receivers:
  - name: 'liurong'
    email_configs:
      - to: 'lori_liurong@163.com'
        headers: { Subject: "[WARN] alert email test" }
    slack_configs:
      - send_resolved: true
        api_url: https://hooks.slack.com/services/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 # the Slack webhook URL; every channel has a different webhook URL
        channel: "#test-alermanager"
        text: "{{ range .Alerts }} {{ .Annotations.description}}\n {{end}} @{{ .CommonAnnotations.username}} <{{.CommonAnnotations.link}}| click here>"
        title: "{{.CommonAnnotations.summary}}"
        title_link: "{{.CommonAnnotations.link}}"
        color: "{{.CommonAnnotations.color}}"
Once the alerting rules are picked up and evaluated, the alerts that currently have a problem show up. For a detailed explanation see: http://blog.51cto.com/xujpxm/2055970
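Before restarting anything, the three files in the example can be syntax-checked with the tools shipped alongside Prometheus and Alertmanager (`promtool check config`, `promtool check rules`, `amtool check-config`). A minimal sketch, assuming both binaries are on `PATH`; the function names and the paths in the commented call are illustrative, not part of these notes.

```python
import subprocess


def build_check_commands(prometheus_cfg, rules_file, alertmanager_cfg):
    """Return the three validation commands as argument lists."""
    return [
        ["promtool", "check", "config", prometheus_cfg],
        ["promtool", "check", "rules", rules_file],
        ["amtool", "check-config", alertmanager_cfg],
    ]


def run_checks(commands):
    """Run each command, print its output, and return the ones that failed."""
    failed = []
    for cmd in commands:
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
        except FileNotFoundError:
            # The binary is not installed or not on PATH.
            print("not found:", cmd[0])
            failed.append(cmd)
            continue
        print(result.stdout)
        if result.returncode != 0:
            failed.append(cmd)
    return failed


# Example call (paths are assumptions):
#   run_checks(build_check_commands("prometheus.yml",
#                                   "rules/test.yml",
#                                   "alertmanager.yml"))
```

An empty list from `run_checks` means every tool exited with status 0; it does not prove that notifications are actually delivered.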
Log output
Where can I find Prometheus logs?
https://github.com/prometheus/prometheus/issues/2363
Start Prometheus from a script and specify the log output path in that script.
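Prometheus writes its log to standard error, so a start script only needs to redirect that stream into a file. A minimal sketch of such a launcher; the function name and the log path in the commented call are assumptions, and the binary path simply reuses the install directory mentioned in the example configuration.

```python
import subprocess


def start_with_log(command, log_path):
    """Start `command` in the background, appending stdout and stderr to log_path."""
    log_file = open(log_path, "ab")
    process = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
    # The child process keeps its own copy of the descriptor, so the parent can close it.
    log_file.close()
    return process


# Example call (log path is an assumption):
#   start_with_log(
#       ["/usr/local/prometheus-2.4.3/prometheus",
#        "--config.file=/usr/local/prometheus-2.4.3/prometheus.yml"],
#       "/var/log/prometheus.log",
#   )
```

Opening the file in append mode keeps earlier log lines across restarts; log rotation is left out of this sketch.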