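Getting started with monitoring and alerting: Prometheus, Alertmanager and Grafana
Installation on Windows
I. Download
Download links for prometheus, alertmanager and the exporters: https://prometheus.io/download
Grafana download link: https://grafana.com/grafana/download/8.2.2?edition=enterprise&platform=windows
- Download prometheus, alertmanager, the exporters you need (node_exporter, mysqld_exporter) and grafana
II. Installation and startup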
1.exporter
There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics.
That is how the official site describes them: an exporter is exactly this kind of server.
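To make the idea concrete, here is a minimal sketch of an exporter using only the Python standard library. It serves a single gauge in the Prometheus text exposition format at /metrics. The metric name `demo_queue_length`, its value and port 8000 are hypothetical choices for illustration; real exporters are normally built with the official `prometheus_client` library rather than by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_length: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    # HELP and TYPE lines describe the metric; the last line is the sample itself.
    return (
        "# HELP demo_queue_length Number of items waiting in a hypothetical queue.\n"
        "# TYPE demo_queue_length gauge\n"
        f"demo_queue_length {queue_length}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(queue_length=42).encode("utf-8")  # hypothetical value
        self.send_response(200)
        # Content type of the text exposition format, version 0.0.4.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding `localhost:8000` as a scrape target would let Prometheus collect `demo_queue_length` the same way it collects metrics from the exporters below.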
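- node_exporter exports host (node) metrics. On Windows, use its counterpart windows_exporter:
https://github.com/prometheus-community/windows_exporter
Start command:
.\windows_exporter-0.18.1-amd64.exe
Visit: http://localhost:9182/ (the metrics themselves are served at http://localhost:9182/metrics)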
- mysqld_exporter exports MySQL metrics:
https://github.com/prometheus/mysqld_exporter
Start command:
.\mysqld_exporter --config.my-cnf 'C:\Program Files\MySQL\MySQL Server 5.7\my.ini'
Point it at the MySQL configuration file my.ini, and put the database user, password, host and port into that file.
Link: a detailed guide to the my.ini configuration file
Create the user and grant privileges:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'XXXXXXXX' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Add a [client] section to my.ini:
[client]
# pipe=
# socket=MYSQL
user=exporter
password=XXXXXXXX
host=localhost
port=3306
Visit: http://localhost:9104/
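Both exporters answer with plain text in the Prometheus exposition format, so a quick way to check that they work is to fetch /metrics and read a few values. The sketch below is a deliberately simplified parser, assuming Python 3.9+: it skips HELP/TYPE comments, keeps the label set as part of the key, drops the optional timestamp, and does not handle an escaped `}` inside label values.

```python
from urllib.request import urlopen


def parse_samples(text: str) -> dict[str, float]:
    """Parse exposition text into {series (name plus labels): value}."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        if "}" in line:
            # Series with labels: everything up to the last '}' is the key.
            close = line.rindex("}") + 1
            series, rest = line[:close], line[close:].split()
        else:
            series, *rest = line.split()
        samples[series] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples


def fetch_samples(url: str = "http://localhost:9182/metrics") -> dict[str, float]:
    """Fetch and parse an exporter's /metrics page (port 9182 or 9104 here)."""
    with urlopen(url, timeout=5) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

For example, `fetch_samples("http://localhost:9104/metrics").get("mysql_up")` should be `1.0` when mysqld_exporter can log in to MySQL with the [client] credentials.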
2.prometheus
- Edit the configuration file \prometheus-2.35.0.windows-amd64\prometheus.yml and add the two exporters above as scrape targets:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "mysql"
    static_configs:
      - targets: ["localhost:9104"]

  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9182"]
- In \prometheus-2.35.0.windows-amd64, start prometheus.exe
- Visit http://localhost:9090/
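To confirm that Prometheus is actually scraping the three jobs, look at Status --> Targets in the web UI, or ask the HTTP API. The sketch below reads `/api/v1/targets` and maps each `job/instance` pair to its health (`up`, `down` or `unknown`); the default base URL assumes Prometheus on localhost:9090 as configured above.

```python
import json
from urllib.request import urlopen


def summarize_targets(payload: dict) -> dict[str, str]:
    """Map 'job/instance' to the target's health from an /api/v1/targets response."""
    result = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        result[f'{labels["job"]}/{labels["instance"]}'] = target["health"]
    return result


def fetch_target_health(base_url: str = "http://localhost:9090") -> dict[str, str]:
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return summarize_targets(json.load(resp))
```

A target reported as `down` usually means the exporter is not running or is listening on a different port than the one in prometheus.yml.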
3.grafana
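- In \grafana-8.2.2\bin, start grafana-server.exe
- Visit http://localhost:3000/; the default username and password are admin/admin
- The sample configuration file is \grafana-8.2.2\conf\sample.ini. You can copy it to a custom configuration (custom.ini) to change the port and other settings; the defaults are used here.
- Configure a dashboard:
Search https://grafana.com/grafana/dashboards/?search=windows for a suitable template.
Copy the dashboard ID, then in Grafana choose Import, paste the ID and click Load.
Select the Prometheus data source (add one pointing at http://localhost:9090 first if it does not exist yet) and click Import.
All the monitoring metrics are now in view. This is the Windows node dashboard: in short, you import a template someone else has written and use it to display the data.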
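The import step needs a Prometheus data source in Grafana. Instead of clicking through the UI, it can be created with Grafana's HTTP API (`POST /api/datasources`). This sketch assumes Grafana on localhost:3000, the default admin/admin credentials, and Prometheus on localhost:9090; the data source name "Prometheus" is an arbitrary choice.

```python
import base64
import json
from urllib.request import Request, urlopen


def build_datasource_request(grafana_url: str = "http://localhost:3000",
                             prometheus_url: str = "http://localhost:9090",
                             user: str = "admin",
                             password: str = "admin") -> Request:
    """Build (but do not send) the request that adds a Prometheus data source."""
    payload = {
        "name": "Prometheus",   # display name, chosen for this sketch
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",      # Grafana's backend queries Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


def create_datasource(**kwargs) -> dict:
    """Send the request and return Grafana's JSON reply."""
    with urlopen(build_datasource_request(**kwargs), timeout=5) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the payload easy to inspect before anything touches the server.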
4.alertmanager
Link: https://prometheus.io/docs/alerting/latest/overview/
Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
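Prometheus delivers alerts to Alertmanager as a JSON list of labelled alerts. Once Alertmanager is running, you can post the same kind of payload by hand to test the Alertmanager-to-email path without waiting for a rule to fire. This sketch assumes Alertmanager's v2 API on its default port 9093; the alert name, instance and annotations are hypothetical test values.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def build_test_alert(alertname: str = "TestAlert",
                     instance: str = "localhost:9182",
                     severity: str = "warning") -> list[dict]:
    """Build an alert list in the shape accepted by POST /api/v2/alerts."""
    now = datetime.now(timezone.utc).isoformat()  # RFC 3339 timestamp
    return [{
        "labels": {"alertname": alertname, "instance": instance, "severity": severity},
        "annotations": {
            "summary": f"{instance}: test alert",
            "description": "Sent by hand to check the notification path.",
        },
        "startsAt": now,
    }]


def send_alerts(alerts: list[dict], base_url: str = "http://localhost:9093") -> int:
    """Post the alerts and return the HTTP status (200 means accepted)."""
    req = Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return resp.status
```

Because the labels include `alertname`, such a test alert is grouped and routed exactly like one coming from Prometheus.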
The main steps to setting up alerting and notifications are:
- Setup and configure the Alertmanager
- Configure Prometheus to talk to the Alertmanager
- Create alerting rules in Prometheus
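For the second step, Prometheus reports which Alertmanagers it is currently sending to via `/api/v1/alertmanagers`. An empty list means the `alerting` section of prometheus.yml still has no target. A sketch, assuming Prometheus on localhost:9090:

```python
import json
from urllib.request import urlopen


def active_alertmanagers(payload: dict) -> list[str]:
    """Return the Alertmanager URLs listed in an /api/v1/alertmanagers response."""
    return [am["url"] for am in payload["data"]["activeAlertmanagers"]]


def fetch_active_alertmanagers(base_url: str = "http://localhost:9090") -> list[str]:
    with urlopen(f"{base_url}/api/v1/alertmanagers", timeout=5) as resp:
        return active_alertmanagers(json.load(resp))
```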
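- Configure alertmanager
In QQ Mail, go to Settings --> Account and enable the POP3/SMTP service. After sending the verification SMS you receive an authorization code; put it into smtp_auth_password in alertmanager.yml.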
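alertmanager.yml:
# Global settings
global:
  resolve_timeout: 5m # mark an alert resolved if it has not been updated for this long (default 5m)
  # QQ Mail SMTP service
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'xxx@qq.com'
  smtp_auth_username: 'xxx@qq.com'
  smtp_auth_password: 'XXXXXXXX' # the authorization code, not the mailbox password
  smtp_require_tls: false

# Notification templates
templates:
  - 'template/*.tmpl' # template file path

# Routing
route:
  group_by: ['alertname'] # group alerts by this label
  group_wait: 10s # wait before sending the first notification for a new group
  group_interval: 10s # wait before notifying about new alerts added to an already-notified group
  repeat_interval: 1h # resend interval for alerts that are still firing
  receiver: 'mail' # default receiver

# Receivers
receivers:
  - name: 'mail' # receiver name
    email_configs:
      - to: '{{ template "email.to" . }}' # recipient address
        html: '{{ template "email.to.html" . }}' # email body template
        send_resolved: true # also notify when the alert is resolved

# Inhibition: a firing critical alert suppresses matching warning alerts
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']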
Email template email.tmpl (placed in the template directory referenced above). Note that Go time formatting uses the reference time 2006-01-02 15:04:05 as its layout; any other date string produces wrong output.
{{ define "email.from" }}xxxxx@qq.com{{ end }}
{{ define "email.to" }}yyyyy@vip.qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
Start alertmanager.exe (by default it listens on http://localhost:9093/)
- Create a rule (memory as an example: alert when memory usage exceeds 50%)
In the \prometheus-2.35.0.windows-amd64\rules directory, create the rule file test_rule.yml:
groups:
  - name: example
    rules:
      - alert: NodeMemoryUsage
        expr: 100 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100 > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 50% (current value is: {{ $value }})"
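- Add the rules to prometheus.yml. Also make sure the alerting section points at Alertmanager; otherwise the rules are evaluated but no alerts are delivered:
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
- Restart prometheus; after a short wait, the alert email arrives.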
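Because of `for: 1m`, the rule first puts the alert in the `pending` state and only switches it to `firing` once the condition has held for a full minute; only firing alerts are sent to Alertmanager. While waiting for the email you can watch this transition on the Alerts page or through `/api/v1/alerts`. A sketch, assuming Prometheus on localhost:9090:

```python
import json
from urllib.request import urlopen


def alert_states(payload: dict) -> dict[str, str]:
    """Map 'alertname/instance' to state ('pending' or 'firing')."""
    states = {}
    for alert in payload["data"]["alerts"]:
        labels = alert["labels"]
        states[f'{labels["alertname"]}/{labels.get("instance", "")}'] = alert["state"]
    return states


def fetch_alert_states(base_url: str = "http://localhost:9090") -> dict[str, str]:
    with urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return alert_states(json.load(resp))
```

If the alert stays `pending` and never fires, the memory usage probably dropped back below 50% within the minute.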
Installation on Linux
Pull the images
- docker pull prom/prometheus
- docker pull prom/alertmanager
- docker pull grafana/grafana-enterprise
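After pulling, the containers still have to be started with the configuration files mounted. The sketch below builds `docker run` argument lists, assuming the default ports used above and the images' default config paths (/etc/prometheus/prometheus.yml and /etc/alertmanager/alertmanager.yml); the container names and the host config directory are hypothetical. Two caveats: inside a container, `localhost` in prometheus.yml refers to the container itself, so the scrape and Alertmanager targets must be changed to addresses reachable from it; and the email template directory must be mounted as well if you use email.tmpl.

```python
import subprocess
from pathlib import Path
from typing import Optional


def docker_run_args(name: str, image: str, port: int,
                    config_file: Optional[Path] = None,
                    container_config_path: Optional[str] = None) -> list[str]:
    """Build a detached `docker run` command publishing one port."""
    args = ["docker", "run", "-d", "--name", name, "-p", f"{port}:{port}"]
    if config_file is not None and container_config_path is not None:
        # Bind-mount the host config file over the image's default config path.
        args += ["-v", f"{config_file.resolve()}:{container_config_path}"]
    return args + [image]


def all_run_commands(config_dir: Path) -> list[list[str]]:
    return [
        docker_run_args("prometheus", "prom/prometheus", 9090,
                        config_dir / "prometheus.yml", "/etc/prometheus/prometheus.yml"),
        docker_run_args("alertmanager", "prom/alertmanager", 9093,
                        config_dir / "alertmanager.yml", "/etc/alertmanager/alertmanager.yml"),
        docker_run_args("grafana", "grafana/grafana-enterprise", 3000),
    ]


def run_all(config_dir: Path) -> None:
    """Start all three containers; raises CalledProcessError if docker fails."""
    for cmd in all_run_commands(config_dir):
        subprocess.run(cmd, check=True)
```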