Monitoring with Prometheus + Alertmanager + PrometheusAlert

This post records how to set up the monitoring stack.


Prometheus

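Prometheus stores time-series data. A series is identified by its metric name and label set, and holds the samples recorded for that name and label set in time order.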
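To make "same name and same labels" concrete, here is a minimal Python sketch of how samples group into series. The `series_key` and `append` helpers and the in-memory `store` are hypothetical and only illustrate the data model; they are not how Prometheus's storage engine is implemented.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by the metric name plus the full label set."""
    return (name, tuple(sorted(labels.items())))


store = defaultdict(list)  # series key -> list of (timestamp, value) samples


def append(name, labels, timestamp, value):
    store[series_key(name, labels)].append((timestamp, value))


node = {"instance": "10.72.88.200:9100", "job": "node"}
append("up", node, 1000, 1.0)
# Same name and labels (label order does not matter): same series, next sample.
append("up", {"job": "node", "instance": "10.72.88.200:9100"}, 1015, 1.0)
# A different label set creates a separate series.
append("up", {"instance": "10.72.88.200:9090", "job": "prometheus"}, 1000, 1.0)

for key, samples in store.items():
    print(key, samples)
```

Two series come out of the three samples, because the first two share the name and every label value.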

Scrape targets can be registered and discovered through Consul.
Installing Consul:

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo yum -y install consul
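Set the parameters in /etc/consul/consul.hcl to match the flags of the command below (the HashiCorp RPM normally installs this file as /etc/consul.d/consul.hcl). The HTTP API and UI listen on port 8500.
consul agent -server -ui -bootstrap-expect=1 -data-dir=/opt/consul  -node=consul-1 -client=0.0.0.0  -bind=10.72.88.200 -datacenter=dc1

Register a node with Consul and attach tags and metadata. "id" identifies the service registration, and "name" is used here for the instance address:
curl -X PUT -d '{"id": "test-key-value","name": "10.72.88.200","address": "10.72.88.200","port": 9100,"tags": ["node","hf004"],"meta":{"cloud":"geely","project":"bond"},"checks": [{"http": "http://10.72.88.200:9100/metrics", "interval": "5s"}]}'  http://10.72.88.200:8500/v1/agent/service/register

Deregister a node from Consul by its service id (node-exporter1 in this example):
curl -X PUT http://10.72.88.200:8500/v1/agent/service/deregister/node-exporter1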
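The same registration can be scripted, which helps when many nodes need to be added. The sketch below is a minimal Python version that uses only the standard library. The helper names (`build_registration`, `register_request`, `deregister_request`) are made up for this example, and the requests are only built, not sent, so the snippet runs without a Consul agent.

```python
import json
import urllib.request

CONSUL = "http://10.72.88.200:8500"  # Consul agent address used in this post


def build_registration(service_id, name, address, port, tags, meta):
    """Return the JSON body for PUT /v1/agent/service/register."""
    return {
        "id": service_id,
        "name": name,
        "address": address,
        "port": port,
        "tags": list(tags),
        "meta": dict(meta),
        # Health check: Consul fetches the exporter's /metrics page every 5s.
        "checks": [{"http": f"http://{address}:{port}/metrics", "interval": "5s"}],
    }


def register_request(payload, consul=CONSUL):
    """Build (but do not send) the registration request."""
    return urllib.request.Request(
        f"{consul}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def deregister_request(service_id, consul=CONSUL):
    """Build (but do not send) the deregistration request."""
    return urllib.request.Request(
        f"{consul}/v1/agent/service/deregister/{service_id}", method="PUT"
    )


payload = build_registration(
    "test-key-value", "10.72.88.200", "10.72.88.200", 9100,
    tags=["node", "hf004"], meta={"cloud": "geely", "project": "bond"},
)
request = register_request(payload)
print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
# To send it for real: urllib.request.urlopen(request, timeout=5)
```

Building the body with `json.dumps` avoids the quoting mistakes that are easy to make in a hand-written curl payload.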

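relabel_configs:
  # Drop targets without tags, which removes Consul's own service (port 8300)
  - source_labels: [__meta_consul_tags]
    regex: ","
    action: drop
  # Keep the custom metadata registered in Consul as regular labels
  - regex: __meta_consul_service_metadata_(.+)
    action: labelmap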
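What labelmap does to the Consul metadata can be shown with a small simulation. This is a simplified Python model, not Prometheus code: `labelmap` and `drop_internal` are hypothetical helpers, the default `instance` label is not modeled, and the `discovered` labels are an example based on the registration shown earlier.

```python
import re


def labelmap(labels, regex):
    """Copy each label whose name fully matches `regex` to the name in group 1."""
    pattern = re.compile(regex)
    result = dict(labels)
    for name, value in labels.items():
        match = pattern.fullmatch(name)  # relabel regexes are anchored at both ends
        if match:
            result[match.group(1)] = value
    return result


def drop_internal(labels):
    """After relabeling, labels whose names start with "__" are discarded."""
    return {k: v for k, v in labels.items() if not k.startswith("__")}


discovered = {
    "__address__": "10.72.88.200:9100",
    "__meta_consul_service_metadata_cloud": "geely",
    "__meta_consul_service_metadata_project": "bond",
    "job": "consul_test",
}
final = drop_internal(labelmap(discovered, r"__meta_consul_service_metadata_(.+)"))
print(final)
```

Without the labelmap rule the `cloud` and `project` values would be lost, because every `__meta_*` label is removed once relabeling finishes.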

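Deploying Prometheus

1. Download the release tarball from the official site and unpack it. Most of the work is in the configuration and the rules.

./prometheus --web.enable-lifecycle    # enables reloading the configuration over HTTP
curl -X POST http://IP:9090/-/reload   # 9090 is the default Prometheus port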
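If the reload is triggered from a deployment script rather than by hand, a small wrapper like the following may be convenient. It is a sketch that assumes Prometheus listens on 127.0.0.1:9090 and was started with --web.enable-lifecycle. The function names are made up for this example, and only the request is built when the snippet runs.

```python
import urllib.error
import urllib.request


def reload_request(base_url):
    """Build the POST request for the /-/reload endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url, timeout=5):
    """Send the reload request and return True on HTTP 200."""
    try:
        with urllib.request.urlopen(reload_request(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print(f"reload failed: {err}")
        return False


request = reload_request("http://127.0.0.1:9090")  # assumed address
print(request.get_method(), request.full_url)
# Call reload_prometheus("http://127.0.0.1:9090") against a running server.
```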

global:
  scrape_interval: 15s                  # how often targets are scraped
  evaluation_interval: 15s              # how often rules are evaluated
  # scrape_timeout is set to the global default (10s).

# Alertmanager instances (several can be configured)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - 127.0.0.1:9093

# Rule files (several can be listed)
rule_files:
  # - "rules/first_rules.yml"
  # - "rules/second_rules.yml"

# Scrape targets
scrape_configs:
  - job_name: "consul_test"
    consul_sd_configs:
    - server: '172.30.12.167:8500'
      services: []
  - job_name: "prometheus1"
    static_configs:   # targets added by hand
      - targets: ["localhost:9090"]
      - targets: ["localhost:9100"]
        # custom labels; they can be used later in Alertmanager
        labels:
          idc: shanghai
          system: baidu
          owner: xxx
  - job_name: "prometheus2"
    file_sd_configs:
      - files:
          - /usr/local/prometheus/test.yaml
        refresh_interval: 5s
    # relabel_configs can rewrite existing labels
    relabel_configs:
      - action: replace
        source_labels: ["__address__"]
        regex: "(.*)"
        target_label: "instance"   # the label Prometheus adds automatically
        replacement: "$1"
      # or write the value to a new label
      - source_labels: ["__address__"]
        regex: "(.*)"
        target_label: "test"
        replacement: "$1"

 
Contents of test.yaml:
- targets:
    - 10.1.9.1xx
    - 10.1.9.2xx
  labels:
    service: aaa
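Since Prometheus re-reads the discovery file on its own, the file can be generated by a script. File-based discovery also accepts JSON with the same structure, so the sketch below writes JSON using only the Python standard library. The target addresses are hypothetical placeholders, the helper names are made up, and the output goes to a temporary directory. To use it for real, the `files` entry of the job would have to point at the generated .json file.

```python
import json
import os
import tempfile


def file_sd_groups(targets, labels):
    """Build the list-of-groups structure used by file-based service discovery."""
    return [{"targets": list(targets), "labels": dict(labels)}]


def write_atomically(path, groups):
    """Write via a temporary file and rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical targets; the post elides the real addresses.
groups = file_sd_groups(["192.0.2.10:9100", "192.0.2.11:9100"], {"service": "aaa"})
out_path = os.path.join(tempfile.mkdtemp(), "test.json")
write_atomically(out_path, groups)

with open(out_path, encoding="utf-8") as handle:
    written = json.load(handle)
print(written)
```

The rename step matters because a refresh can happen while the script is still writing.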

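To monitor endpoints such as HTTP interfaces or TCP ports, run blackbox_exporter and add a probe job:
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['10.72.88.200:80']
        labels:
          instance: 'port_status'
          group: 'port'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.72.88.200:9115   # address and port of blackbox_exporter
The relabel_configs block is required here. blackbox_exporter probes whatever is passed in the target URL parameter, so the rules copy the real target into __param_target (and into instance) and then point __address__ at the exporter, which is what Prometheus actually scrapes.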
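The effect of the three relabel rules is easier to see when they are applied step by step. The Python sketch below simulates them for the target in this job and prints the URL that Prometheus ends up requesting. `blackbox_relabel` and `scrape_url` are hypothetical helpers written for this illustration, not part of Prometheus.

```python
from urllib.parse import urlencode


def blackbox_relabel(target, exporter, module):
    """Apply the three relabel rules to one statically configured target."""
    labels = {"__address__": target, "__param_module": module}
    labels["__param_target"] = labels["__address__"]  # rule 1: target becomes ?target=
    labels["instance"] = labels["__param_target"]     # rule 2: keep a readable instance label
    labels["__address__"] = exporter                  # rule 3: scrape the exporter instead
    return labels


def scrape_url(labels, metrics_path="/probe"):
    """Build the URL that would be requested for the relabeled target."""
    params = sorted(
        (name[len("__param_"):], value)
        for name, value in labels.items()
        if name.startswith("__param_")
    )
    return f"http://{labels['__address__']}{metrics_path}?{urlencode(params)}"


labels = blackbox_relabel("10.72.88.200:80", "10.72.88.200:9115", "tcp_connect")
url = scrape_url(labels)
print(labels["instance"])
print(url)
```

Without the rules, Prometheus would request /probe from 10.72.88.200:80 itself, which is the service being checked and not the exporter.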

2. rules/first_rules.yml

promtool check rules /path/to/example.rules.yml    # checks that the rule file is valid

groups:
- name: node_monitor
  rules:

  # Alert for any instance that is unreachable for more than 1 minute.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: 'critical'
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: " {{ $labels.instance }} has been down for more than 1 minute. {{$labels.test}}"

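- name: cpu_test
  rules:
  - alert: CPU
    # The threshold matches the 90% in the summary; a low value such as "> 1" is only useful for test-firing the alert.
    expr: (1-rate(node_cpu_seconds_total{mode="idle"}[1m]))*100 > 90
    for: 5s
    labels:
      severity: 'warning'
    annotations:
      summary: "CPU utilization above 90%, {{ $labels.instance }} current value: {{ $value }}%"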
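The arithmetic behind the CPU expression can be checked by hand. node_cpu_seconds_total{mode="idle"} counts the seconds a core has spent idle, so its per-second rate is the idle fraction, and one minus that is the busy fraction. The Python sketch below uses two made-up samples. It is a simplification: the real rate() also handles counter resets and extrapolates to the window edges.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def busy_percent(idle_first, idle_last):
    """(1 - rate(idle)) * 100, the same arithmetic as the alert expression."""
    return (1 - per_second_rate(idle_first, idle_last)) * 100


# Hypothetical samples for one core: idle time grew by 3 s over a 60 s window.
busy = busy_percent((0, 1000.0), (60, 1003.0))
print(f"{busy:.1f}% busy")
```

Note that the expression is evaluated per core, so a single saturated core can fire the alert even when the host as a whole is mostly idle.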

3. alertmanager.yaml

global:
  resolve_timeout: 5m
  smtp_from: "archive@qq.com"
  smtp_smarthost: "smtp.partner.com:587"
  smtp_auth_username: "archive@qq.com"
  smtp_auth_password: "mi1PooI7F%Ht9m0#"
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5s
  receiver: 'email'  # the default receiver

  routes:
  - match:         # exact match
      service: foo1
    receiver: "email1"
  - match_re:      # regular-expression match
      owner: "xxxx"
    receiver: "email"

receivers:    # define the receivers here: email, webhook, and so on
- name: 'email'
  email_configs:
  - to: 'test@qq.com'
    send_resolved: true    # also notify when the alert is resolved

- name: 'email1'        # one receiver can contain several notification integrations
  webhook_configs:
  - url: 'http://prometheus-webhook-dingtalk.kube.com'
  email_configs:
  - to: 'test@qq.com'
    send_resolved: true

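inhibit_rules: # inhibition rules
  - source_match: # while an alert matching these labels is firing, alerts matching target_match are muted
      severity: 'warning'  # this label apparently has to be covered by the top-level route, otherwise an error about a missing key is reported
    target_match:
      severity: 'critical' # value of the target label; to match with a regular expression such as ".*MySQL.*", use target_match_re
    equal: ['alertname','instance'] # inhibition applies only when the source and target alerts have identical values for all of these labels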
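How source_match, target_match and equal interact is easier to follow with a concrete case. The Python sketch below is a simplified model of the rule as configured above, not Alertmanager's implementation. `matches` and `is_inhibited` are hypothetical helpers, and the three alerts are invented for the illustration.

```python
def matches(alert, matcher):
    """True if every label in `matcher` has exactly that value on the alert."""
    return all(alert.get(name) == value for name, value in matcher.items())


def is_inhibited(target, firing_alerts, rule):
    """True if some other firing alert silences `target` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label and an empty label count as equal.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "warning"},
    "target_match": {"severity": "critical"},
    "equal": ["alertname", "instance"],
}
warning = {"alertname": "CPU", "instance": "host-a", "severity": "warning"}
same_alert_critical = {"alertname": "CPU", "instance": "host-a", "severity": "critical"}
other_host_critical = {"alertname": "CPU", "instance": "host-b", "severity": "critical"}
firing = [warning, same_alert_critical, other_host_critical]

for alert in firing:
    print(alert["instance"], alert["severity"], is_inhibited(alert, firing, RULE))
```

With this rule the warning silences the critical alert on the same host. Many setups want the reverse, which means swapping the two severity values.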

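4. PrometheusAlert
Search for feiyu563/PrometheusAlert on GitHub or Gitee.
After downloading, edit app.conf and start PrometheusAlert.
Log in to the web UI with the username and password from app.conf, then open the template page to edit the templates.
Note that Alertmanager needs a matching receiver:
receivers:
  - name: 'PrometheusAlert'
    webhook_configs:
      - url: <the address shown after the template in PrometheusAlert>

Send a test alert by hand, or wait for Prometheus to fire one, then check the message logged by PrometheusAlert. Use the keys in that JSON to adjust the fields in the template.
If the time format differs, set it in the template with TimeFormat $v.startsAt "2006/01/02 15:04:05", or call GetCSTtime $v.startsAt directly to get the time in CST.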
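When editing a template it helps to have a sample message at hand. The Python sketch below builds a hand-written payload in the general shape of an Alertmanager webhook message, with the keys the templates in this post read (externalURL, alerts, status, labels, annotations, startsAt, endsAt). All values are invented. `format_utc8` shows what the Go layout "2006/01/02 15:04:05" looks like after converting a UTC timestamp to UTC+8. It illustrates the format only, is not PrometheusAlert's code, and parses only whole-second UTC timestamps.

```python
import json
from datetime import datetime, timedelta, timezone


def sample_webhook_payload():
    """A hand-written payload in the shape of an Alertmanager webhook message."""
    return {
        "status": "firing",
        "externalURL": "http://alertmanager.example:9093",  # hypothetical
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "severity": "critical",
                       "instance": "10.72.88.200:9100"},
            "annotations": {"description": "test alert sent by hand"},
            "startsAt": "2024-01-01T00:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }],
    }


def format_utc8(rfc3339_utc):
    """Render a whole-second UTC timestamp as YYYY/MM/DD HH:MM:SS in UTC+8."""
    utc = datetime.strptime(rfc3339_utc, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(timezone(timedelta(hours=8))).strftime("%Y/%m/%d %H:%M:%S")


payload = sample_webhook_payload()
print(json.dumps(payload, indent=2))
started = format_utc8(payload["alerts"][0]["startsAt"])
print(started)
```

The printed JSON can be posted to the template address with curl to see how the template renders it.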

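5. Endpoint monitoring

**prometheus:**
- job_name: "http"
  metrics_path: /probe
  static_configs:
    - targets:
        - 10.72.88.200:80
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 1x.xx.xx.xx:9115    # address of blackbox_exporter
Endpoint probes through blackbox_exporter need this relabel_configs block: Prometheus scrapes the exporter at __address__, and the exporter probes the host given in the target parameter.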


**rules:**
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}:80  is down"
      description: "This requires immediate action!"

6. PrometheusAlert templates
WeCom (Enterprise WeChat) template:
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}

Prometheus recovery notice
[{{$v.labels.alertname}}]({{$var}})
Severity: {{$v.labels.severity}}
Start time: {{TimeFormat $v.startsAt "2006/01/02 15:04:05"}}
End time: {{TimeFormat $v.endsAt "2006/01/02 15:04:05"}} <font color="…{{$v.labels.instance}}
Region: {{$v.labels.region}} **{{$v.annotations.description}}**
{{else}}> Prometheus alert notice
[{{$v.labels.alertname}}]({{$var}})
Severity: {{$v.labels.severity}}
Start time: {{TimeFormat $v.startsAt "2006/01/02 15:04:05"}}
End time: {{TimeFormat $v.endsAt "2006/01/02 15:04:05"}} <font color="…{{$v.labels.instance}}
Region: {{$v.labels.region}} **{{$v.annotations.description}}**
{{end}}{{ end }}

Email template:
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}

Prometheus recovery notice

{{$v.labels.alertname}}

Severity: {{$v.labels.severity}}
Start time: {{TimeFormat $v.startsAt "2006/01/02 15:04:05"}}
End time: {{TimeFormat $v.endsAt "2006/01/02 15:04:05"}}
Host IP: {{$v.labels.instance}}

{{$v.annotations.description}}

{{else}}

Prometheus alert notice

{{$v.labels.alertname}}

Severity: {{$v.labels.severity}}
Start time: {{TimeFormat $v.startsAt "2006/01/02 15:04:05"}}
Host IP: {{$v.labels.instance}}

{{$v.annotations.description}}

{{end}} {{ end }}

If Alertmanager sends the email itself over SMTP, the configuration and template look like this:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.partner.outlo:25'
  smtp_from: 'gitlab_notific.com'
  smtp_auth_username: 'gitla'
  smtp_auth_password: 'Joq3440'
  smtp_require_tls: true

templates:
  - '/usr/local/alertmanager/*.tmp'

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 3m
  receiver: email
  routes:
    - receiver: email
      group_wait: 30s
      match:
        severity: critical
    - receiver: web-hook
      group_wait: 30s
      match:
        severity: warning

receivers:
  - name: 'web-hook'
    webhook_configs:
      - url: 'http://10.172.88.200:8888/prometheusalert?type=email&tpl=prometheus-email&email=minglo@tech.com'
        send_resolved: true
  - name: 'email'
    email_configs:
      - to: 'minglo@tech.com'
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonLabels.alertname}}" }
        send_resolved: true
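To check which receiver an alert would reach under the routing tree in this configuration, the matching can be simulated. The Python sketch below is a simplified model: the first matching child route wins, otherwise the parent's receiver is used, and the `continue` option is not modeled. `matches` and `pick_receiver` are hypothetical helpers, and `ROUTE` mirrors the severity-based routes from this configuration.

```python
import re


def matches(labels, route):
    """True if the alert satisfies the route's match and match_re entries."""
    exact = route.get("match", {})
    regexes = route.get("match_re", {})
    return (
        all(labels.get(name) == value for name, value in exact.items())
        and all(re.fullmatch(pattern, labels.get(name, "")) for name, pattern in regexes.items())
    )


def pick_receiver(labels, route, inherited=None):
    """Walk the tree: the first matching child wins, else use this node's receiver."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if matches(labels, child):
            return pick_receiver(labels, child, receiver)
    return receiver


ROUTE = {
    "receiver": "email",
    "routes": [
        {"receiver": "email", "match": {"severity": "critical"}},
        {"receiver": "web-hook", "match": {"severity": "warning"}},
    ],
}

for severity in ("critical", "warning", "info"):
    print(severity, "->", pick_receiver({"severity": severity}, ROUTE))
```

An alert with a severity that no child route matches falls back to the default receiver at the top of the tree.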

alert.tmp
{{ define "email.from" }}12345671@qq.com{{ end }}
{{ define "email.to 1" }}minglo@tech.com{{ end }}
{{ define "email.to 2" }}minglo@tech.com{{ end }}
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}

@Alert notification

Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Fired at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}} {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}

@Alert resolved

Alert source: prometheus_alert
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Fired at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}} {{- end }}