Deploying and Using Prometheus on Kubernetes
This guide assumes Kubernetes is already installed.
Versions used: Kubernetes 1.25 with kube-prometheus release-0.12.
The deployment below is based on kube-prometheus.
1. Architecture Overview
Prometheus Server: scrapes and stores time-series data, and provides query capabilities plus alerting-rule configuration management
Alertmanager: Prometheus Server sends alerts to Alertmanager, which routes them to the configured recipients or groups; it supports email, webhook, WeChat, DingTalk, SMS, and more
Grafana: used to visualize the data
Push Gateway: Prometheus pulls data, but some metrics are short-lived and may be missed between scrapes; the Push Gateway can receive pushed data instead
ServiceMonitor: monitoring configuration that selects Services and scrapes their /metrics endpoints
Exporter: collects metrics from non-cloud-native systems, e.g. host metrics via node_exporter and MySQL metrics via mysql_exporter
PromQL: the query language for the collected data (see the example query after this list)
Service Discovery: automatic discovery of monitoring targets, commonly based on Kubernetes, Consul, Eureka, files, etc.
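As a first taste of PromQL once the stack from section 2 is running, queries can be issued through the Prometheus HTTP API; a minimal sketch using a port-forward to the prometheus-k8s Service that kube-prometheus creates:
# Forward the Prometheus UI/API locally, then run a PromQL query over the HTTP API
kubectl -n monitoring port-forward svc/prometheus-k8s 9090 &
curl -G 'http://127.0.0.1:9090/api/v1/query' --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'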
2. Installation
# Find the kube-prometheus branch matching your Kubernetes version on GitHub; for 1.25 it is release-0.12
git clone -b release-0.12 https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
# Some of the images cannot be pulled directly, so edit the image addresses in the corresponding files
vi manifests/kubeStateMetrics-deployment.yaml
vi manifests/prometheusAdapter-deployment.yaml
# manifests/kubeStateMetrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 2.7.0
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/part-of: kube-prometheus
template:
metadata:
annotations:
kubectl.kubernetes.io/default-container: kube-state-metrics
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 2.7.0
spec:
automountServiceAccountToken: true
containers:
- args:
- --host=127.0.0.1
- --port=8081
- --telemetry-host=127.0.0.1
- --telemetry-port=8082
# image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0
image: registry.cn-hangzhou.aliyuncs.com/ialso/kube-state-metrics:v2.7.0
name: kube-state-metrics
resources:
limits:
cpu: 100m
memory: 250Mi
requests:
cpu: 10m
memory: 190Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsUser: 65534
- args:
- --logtostderr
- --secure-listen-address=:8443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- --upstream=http://127.0.0.1:8081/
image: quay.io/brancz/kube-rbac-proxy:v0.14.0
name: kube-rbac-proxy-main
ports:
- containerPort: 8443
name: https-main
resources:
limits:
cpu: 40m
memory: 40Mi
requests:
cpu: 20m
memory: 20Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsGroup: 65532
runAsNonRoot: true
runAsUser: 65532
- args:
- --logtostderr
- --secure-listen-address=:9443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- --upstream=http://127.0.0.1:8082/
image: quay.io/brancz/kube-rbac-proxy:v0.14.0
name: kube-rbac-proxy-self
ports:
- containerPort: 9443
name: https-self
resources:
limits:
cpu: 20m
memory: 40Mi
requests:
cpu: 10m
memory: 20Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsGroup: 65532
runAsNonRoot: true
runAsUser: 65532
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: kube-state-metrics
# manifests/prometheusAdapter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 0.10.0
name: prometheus-adapter
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: kube-prometheus
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 0.10.0
spec:
automountServiceAccountToken: true
containers:
- args:
- --cert-dir=/var/run/serving-cert
- --config=/etc/adapter/config.yaml
- --logtostderr=true
- --metrics-relist-interval=1m
- --prometheus-url=http://prometheus-k8s.monitoring.svc:9090/
- --secure-port=6443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
# image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.10.0
image: registry.cn-hangzhou.aliyuncs.com/ialso/prometheus-adapter:v0.10.0
livenessProbe:
failureThreshold: 5
httpGet:
path: /livez
port: https
scheme: HTTPS
initialDelaySeconds: 30
periodSeconds: 5
name: prometheus-adapter
ports:
- containerPort: 6443
name: https
readinessProbe:
failureThreshold: 5
httpGet:
path: /readyz
port: https
scheme: HTTPS
initialDelaySeconds: 30
periodSeconds: 5
resources:
limits:
cpu: 250m
memory: 180Mi
requests:
cpu: 102m
memory: 180Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumeMounts:
- mountPath: /tmp
name: tmpfs
readOnly: false
- mountPath: /var/run/serving-cert
name: volume-serving-cert
readOnly: false
- mountPath: /etc/adapter
name: config
readOnly: false
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: prometheus-adapter
volumes:
- emptyDir: {}
name: tmpfs
- emptyDir: {}
name: volume-serving-cert
- configMap:
name: adapter-config
name: config
kubectl apply --server-side -f manifests/setup
kubectl wait \
--for condition=Established \
--all CustomResourceDefinition \
--namespace=monitoring
kubectl apply -f manifests/
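Once applied, give the images a few minutes to pull; a quick sanity check:
# All pods in the monitoring namespace should eventually be Running
kubectl get pods -n monitoring
kubectl get svc -n monitoring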
3. Configure External Access
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-grafana
namespace: monitoring
annotations:
kubernetes.io/ingress.class: "nginx"
spec:
  # routing rules
rules:
- host: grafana.ialso.cn
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: grafana
port:
number: 3000
- host: alertmanager.ialso.cn
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: alertmanager-main
port:
number: 9093
- host: prometheus.ialso.cn
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-k8s
port:
number: 9090
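Assuming the Ingress above is saved as prometheus-ingress.yaml (the filename is mine) and an NGINX ingress controller is installed, apply it and point DNS or /etc/hosts entries for the three hosts at the ingress controller:
kubectl apply -f prometheus-ingress.yaml
kubectl get ingress -n monitoring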
4. Configure Grafana
Dashboard gallery: https://grafana.com/grafana/dashboards/
To add a dashboard: import a dashboard, enter the template ID from the gallery, and complete the configuration.
5. Handling the ControllerManager Alert
# Check whether the target ServiceMonitor exists
kubectl get serviceMonitor -n monitoring kube-controller-manager
# Check which Service labels the ServiceMonitor selects
kubectl get serviceMonitor -n monitoring kube-controller-manager -o yaml
# Check whether the matching Service exists (in my case it does not)
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
# Check the kube-controller-manager listening port
netstat -lntp|grep "kube-controll"
# Create a Service & Endpoints for the target
vi cm-prometheus.yaml
apiVersion: v1
kind: Endpoints
metadata:
labels:
app.kubernetes.io/name: kube-controller-manager
name: cm-prometheus
namespace: kube-system
subsets:
- addresses:
- ip: 10.10.0.15
ports:
- name: https-metrics
    port: 10257 # kube-controller-manager serves metrics on the secure port 10257 in Kubernetes 1.25; the old insecure port 10252 no longer exists
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: kube-controller-manager
name: cm-prometheus
namespace: kube-system
spec:
type: ClusterIP
ports:
- name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
# Check again whether the target Service exists
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
The kube-controller-manager target now appears in Prometheus, but its state is DOWN, most likely because the Controller Manager listens on 127.0.0.1 and cannot be reached from outside the node.
# Edit the kube-controller-manager manifest and set - --bind-address=0.0.0.0
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
The kube-controller-manager target in Prometheus should now report UP.
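kube-controller-manager runs as a static pod, so the kubelet restarts it automatically after the manifest is edited. One quick way to confirm the new bind address (the same check applies to kube-scheduler on port 10259 in the next section):
# The secure metrics port should now be bound to 0.0.0.0 rather than 127.0.0.1
ss -lntp | grep 10257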
6. Handling the Scheduler Alert
# Check whether the target ServiceMonitor exists
kubectl get serviceMonitor -n monitoring kube-scheduler
# Check which Service labels the ServiceMonitor selects
kubectl get serviceMonitor -n monitoring kube-scheduler -o yaml
# Check whether the matching Service exists (in my case it does not)
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-scheduler
# Check the kube-scheduler listening port
netstat -lntp|grep "kube-scheduler"
# Create a Service & Endpoints for the target
vi scheduler-prometheus.yaml
apiVersion: v1
kind: Endpoints
metadata:
labels:
app.kubernetes.io/name: kube-scheduler
name: scheduler-prometheus
namespace: kube-system
subsets:
- addresses:
- ip: 10.10.0.15
ports:
- name: https-metrics
port: 10259
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: kube-scheduler
name: scheduler-prometheus
namespace: kube-system
spec:
type: ClusterIP
ports:
- name: https-metrics
port: 10259
protocol: TCP
targetPort: 10259
# Check again whether the target Service exists
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-scheduler
The kube-scheduler target now appears in Prometheus, but its state is DOWN, most likely because the Scheduler listens on 127.0.0.1 and cannot be reached from outside the node.
# Edit the kube-scheduler manifest and set - --bind-address=0.0.0.0
vi /etc/kubernetes/manifests/kube-scheduler.yaml
The kube-scheduler target in Prometheus should now report UP.
7. Monitoring etcd
Configure the Service & Endpoints
# Find the certificate file locations
cat /etc/kubernetes/manifests/etcd.yaml
# Look for --cert-file and --key-file
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--key-file=/etc/kubernetes/pki/etcd/server.key
# Try the metrics endpoint
curl --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://10.10.0.15:2379/metrics -k | tail -1
# List the current ServiceMonitors
kubectl get ServiceMonitor -n monitoring
# Create the etcd Service & Endpoints
vi etcd-prometheus.yaml
kubectl apply -f etcd-prometheus.yaml
# Test again; 10.96.27.17 is the ClusterIP of the etcd-prometheus Service
curl --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://10.96.27.17:2379/metrics -k | tail -1
apiVersion: v1
kind: Endpoints
metadata:
labels:
app: etcd-prometheus
name: etcd-prometheus
namespace: kube-system
subsets:
- addresses:
- ip: 10.10.0.15
ports:
- name: https-metrics
port: 2379
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
labels:
app: etcd-prometheus
name: etcd-prometheus
namespace: kube-system
spec:
type: ClusterIP
ports:
- name: https-metrics
port: 2379
protocol: TCP
targetPort: 2379
Create a certificate Secret for Prometheus
# Create a Secret from the etcd certificates
kubectl create secret generic etcd-ssl \
--from-file=/etc/kubernetes/pki/etcd/ca.crt \
--from-file=/etc/kubernetes/pki/etcd/server.crt \
--from-file=/etc/kubernetes/pki/etcd/server.key \
-n monitoring
# Mount it into Prometheus (after the edit, the prometheus-k8s-<n> pods restart automatically)
kubectl edit prometheus k8s -n monitoring
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"monitoring.coreos.com/v1","kind":"Prometheus","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"prometheus","app.kubernetes.io/instance":"k8s","app.kubernetes.io/name":"prometheus","app.kubernetes.io/part-of":"kube-prometheus","app.kubernetes.io/version":"2.41.0"},"name":"k8s","namespace":"monitoring"},"spec":{"alerting":{"alertmanagers":[{"apiVersion":"v2","name":"alertmanager-main","namespace":"monitoring","port":"web"}]},"enableFeatures":[],"externalLabels":{},"image":"quay.io/prometheus/prometheus:v2.41.0","nodeSelector":{"kubernetes.io/os":"linux"},"podMetadata":{"labels":{"app.kubernetes.io/component":"prometheus","app.kubernetes.io/instance":"k8s","app.kubernetes.io/name":"prometheus","app.kubernetes.io/part-of":"kube-prometheus","app.kubernetes.io/version":"2.41.0"}},"podMonitorNamespaceSelector":{},"podMonitorSelector":{},"probeNamespaceSelector":{},"probeSelector":{},"replicas":2,"resources":{"requests":{"memory":"400Mi"}},"ruleNamespaceSelector":{},"ruleSelector":{},"securityContext":{"fsGroup":2000,"runAsNonRoot":true,"runAsUser":1000},"serviceAccountName":"prometheus-k8s","serviceMonitorNamespaceSelector":{},"serviceMonitorSelector":{},"version":"2.41.0"}}
creationTimestamp: "2023-05-22T07:44:05Z"
generation: 1
labels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: k8s
app.kubernetes.io/name: prometheus
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 2.41.0
name: k8s
namespace: monitoring
resourceVersion: "105405"
uid: 9f6efc98-0f83-4d4d-b8e6-c77bd857efa0
spec:
alerting:
alertmanagers:
- apiVersion: v2
name: alertmanager-main
namespace: monitoring
port: web
enableFeatures: []
evaluationInterval: 30s
externalLabels: {}
image: quay.io/prometheus/prometheus:v2.41.0
nodeSelector:
kubernetes.io/os: linux
podMetadata:
labels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: k8s
app.kubernetes.io/name: prometheus
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 2.41.0
podMonitorNamespaceSelector: {}
podMonitorSelector: {}
probeNamespaceSelector: {}
probeSelector: {}
replicas: 2
  # add the etcd certificate Secret (this is the edit to make)
secrets:
- etcd-ssl
resources:
requests:
memory: 400Mi
ruleNamespaceSelector: {}
ruleSelector: {}
scrapeInterval: 30s
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
serviceAccountName: prometheus-k8s
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
version: 2.41.0
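Once the prometheus-k8s pods have restarted, you can confirm the Secret is mounted; the Prometheus Operator exposes Secrets listed in spec.secrets under /etc/prometheus/secrets/<secret-name>/:
# The three certificate files should be listed
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/secrets/etcd-ssl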
Create the ServiceMonitor
# Create the etcd ServiceMonitor
vi etcd-servermonitor.yaml
kubectl apply -f etcd-servermonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: etcd
namespace: monitoring
labels:
app: etcd
spec:
jobLabel: k8s-app
endpoints:
- interval: 30s
    port: https-metrics # matches Service.spec.ports[].name
scheme: https
tlsConfig:
      # Note: Secrets from the Prometheus CR are mounted under /etc/prometheus/secrets/<secret-name>/
      caFile: /etc/prometheus/secrets/etcd-ssl/ca.crt # CA certificate path
      certFile: /etc/prometheus/secrets/etcd-ssl/server.crt
      keyFile: /etc/prometheus/secrets/etcd-ssl/server.key
      insecureSkipVerify: true # skip certificate verification
selector:
matchLabels:
      app: etcd-prometheus # must match the Service's labels
namespaceSelector:
matchNames:
- kube-system
The etcd target should now be visible in Prometheus.
Grafana dashboard (dashboard ID: 3070)
8. Monitoring MySQL
This assumes MySQL is already installed in the cluster; MySQL 5.7 is used here.
Also, only a single instance is monitored here; I am not yet sure how to monitor multiple nodes (presumably one mysql_exporter per instance?).
Create a monitoring user
create user 'exporter'@'%' identified by '123456';
grant process,replication client,select on *.* to 'exporter'@'%';
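Before wiring up the exporter, the new account can be sanity-checked from inside the cluster with a throwaway client pod (the mysql:5.7 image and the mysql-master.mysql address are assumptions matching the exporter config below):
# Run a one-off MySQL client and list the grants
kubectl run mysql-client --rm -it --image=mysql:5.7 -- \
  mysql -h mysql-master.mysql -uexporter -p123456 -e "SHOW GRANTS FOR CURRENT_USER();"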
Configure mysql_exporter
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql-exporter
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
k8s-app: mysql-exporter
template:
metadata:
labels:
k8s-app: mysql-exporter
spec:
containers:
- name: mysql-exporter
image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter
env:
- name: DATA_SOURCE_NAME
          # format: username:password@(ip-or-service.namespace:3306)/
value: "exporter:123456@(mysql-master.mysql:3306)/"
imagePullPolicy: IfNotPresent
ports:
- containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: mysql-exporter
name: mysql-exporter
namespace: monitoring
spec:
type: ClusterIP
ports:
  # name this port 'api'; the ServiceMonitor below references port 9104 by this name
- name: api
protocol: TCP
port: 9104
selector:
k8s-app: mysql-exporter
# Test the exporter; 10.96.136.65 is the ClusterIP of the mysql-exporter Service
curl 10.96.136.65:9104/metrics
Configure the ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: mysql-exporter
namespace: monitoring
labels:
k8s-app: mysql-exporter
namespace: monitoring
spec:
jobLabel: k8s-app
endpoints:
- port: api
interval: 30s
scheme: http
selector:
matchLabels:
k8s-app: mysql-exporter
namespaceSelector:
matchNames:
- monitoring
The mysql-exporter target should now be visible in Prometheus.
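As a quick check, the exporter's mysql_up metric should report 1 once scraping works; reusing the port-forward sketch from section 1:
curl -G 'http://127.0.0.1:9090/api/v1/query' --data-urlencode 'query=mysql_up'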
Grafana dashboard (dashboard ID: 7362)
If something goes wrong, troubleshoot as in the ControllerManager and Scheduler sections above; if everything there checks out, inspect the exporter logs for errors: kubectl logs -n monitoring mysql-exporter-6559759477-m8tqc
9. Alerting (Email)
https://prometheus.io/docs/alerting/latest/alertmanager/
https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml
- Global: global configuration for common settings such as the email account and password, SMTP server, WeChat alerting, etc.
- Templates: where custom notification templates are placed
- Route: alert routing configuration; groups alerts and routes different groups to different receivers
- Inhibit_rules: alert inhibition, used to cut down notification volume and prevent alert storms
- Receivers: the alert receiver definitions (a quick way to validate the finished config is sketched after this list)
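If amtool (shipped with Alertmanager) is available locally, the alertmanager.yaml body can be validated before it is applied; a minimal sketch, assuming the config has been saved to a local file:
# Validate the Alertmanager configuration
amtool check-config alertmanager.yaml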
# Edit the alerting configuration
vi kube-prometheus/manifests/alertmanager-secret.yaml
# Apply the new alerting configuration
kubectl replace -f kube-prometheus/manifests/alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
labels:
app.kubernetes.io/component: alert-router
app.kubernetes.io/instance: main
app.kubernetes.io/name: alertmanager
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 0.25.0
name: alertmanager-main
namespace: monitoring
stringData:
  alertmanager.yaml: |-
    # global: global settings, mainly the notification channels (email, webhook, etc.)
    global:
      # how long to wait before marking an alert as resolved; default 5m
      resolve_timeout: 5m
      # email settings
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '2750955630@qq.com'
      smtp_auth_username: '2750955630@qq.com'
      smtp_auth_password: 'puwluaqcmkrdddge'
      smtp_require_tls: false
    # custom notification templates
    templates:
    - '/usr/local/alertmanager/*.tmp'
    # routing
    route:
      # alert grouping keys
      group_by: [ 'namespace', 'job', 'alertname' ]
      # wait group_wait after a new alert group is created before sending the initial notification, to avoid alert storms
      group_wait: 30s
      # after the initial notification, wait group_interval before sending notifications for new alerts added to the group
      group_interval: 2m
      # if an alert is still firing after a successful notification, re-send every repeat_interval
      repeat_interval: 10m
      # default receiver
      receiver: 'Default'
      # child routes
      routes:
      - receiver: 'email'
        match:
          alertname: "Watchdog"
    # receiver definitions
    receivers:
    - name: 'Default'
      email_configs:
      # email address that receives the alerts
      - to: 'xumeng03@bilibili.com'
        # also notify once the issue is resolved
        send_resolved: true
    - name: 'email'
      email_configs:
      # email address that receives the alerts
      - to: '2750955630@qq.com'
        # also notify once the issue is resolved
        send_resolved: true
    # inhibition rules
    inhibit_rules:
    - source_matchers:
      - severity="critical"
      target_matchers:
      - severity=~"warning|info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - severity="warning"
      target_matchers:
      - severity="info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - alertname="InfoInhibitor"
      target_matchers:
      - severity="info"
      equal:
      - namespace
type: Opaque
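The Alertmanager pods pick up the new configuration automatically once the Secret changes; to double-check what was actually stored:
# Dump the live configuration from the Secret
kubectl -n monitoring get secret alertmanager-main -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d | head -n 20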
10. Alerting (Enterprise WeChat / WeCom)
The following information is required:
# Corp ID: wwe86504f797d306ce
# Department ID: 4
# Application AgentId: 1000002
# Application Secret: FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-xxxx
You must also configure "Web authorization & JS-SDK" and "Trusted corporate IPs" in the WeCom admin console.
If nginx serves the trusted domain, the configuration below serves the domain-verification file without disturbing normal traffic:
server {
    # listen port
    listen 443 ssl;
    # server name
    server_name ialso.cn;
    # certificate files
    ssl_certificate /etc/nginx/ssl/ialso.cn_bundle.pem;
    ssl_certificate_key /etc/nginx/ssl/ialso.cn.key;
    ssl_session_cache shared:SSL:1m;
    ssl_session_timeout 5m;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    root /etc/nginx/wechat; # the MP_verify_7UJT32UzCOGkaUNB.txt verification file lives in /etc/nginx/wechat
    location / {
        # https://ialso.cn/WW_verify_wEY0iTPFwKQAen0a.txt --> try_files $uri --> /etc/nginx/wechat/WW_verify_wEY0iTPFwKQAen0a.txt exists --> served directly
        # https://ialso.cn/<any-other-uri> --> try_files $uri --> no match under /etc/nginx/wechat --> falls through to @gateway --> proxy_pass http://ialso_index
        try_files $uri @gateway;
    }
    # forward everything else to the Kubernetes ingress upstream
    location @gateway {
        proxy_pass http://ialso_index;
        # keep the original request Host header when forwarding
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
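Before saving the settings in the WeCom console, you can confirm the verification file is reachable (the filename comes from the comments above):
# Should return HTTP 200
curl -I https://ialso.cn/WW_verify_wEY0iTPFwKQAen0a.txt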
With the above in place, verify that messages can be sent:
python3 wechat.py Warning "warning message"
#!/usr/bin/env python3
# wechat.py
import json
import sys
import urllib.error
import urllib.request

def gettoken(corpid, corpsecret):
    # exchange the corp ID and application secret for an access token
    gettoken_url = ('https://qyapi.weixin.qq.com/cgi-bin/gettoken'
                    '?corpid=' + corpid + '&corpsecret=' + corpsecret)
    try:
        token_file = urllib.request.urlopen(gettoken_url)
    except urllib.error.HTTPError as e:
        print(e.code)
        print(e.read().decode('utf-8'))
        sys.exit(1)
    token_json = json.loads(token_file.read().decode('utf-8'))
    return token_json['access_token']

def senddata(access_token, subject, content):
    send_url = ('https://qyapi.weixin.qq.com/cgi-bin/message/send'
                '?access_token=' + access_token)
    # toparty: department ID; agentid: application AgentId
    send_values = {
        "toparty": "4",
        "msgtype": "text",
        "agentid": "1000002",
        "text": {
            "content": subject + '\n' + content
        },
        "safe": "0"
    }
    send_data = json.dumps(send_values, ensure_ascii=False).encode('utf-8')
    send_request = urllib.request.Request(send_url, send_data)
    response = json.loads(urllib.request.urlopen(send_request).read())
    print(str(response))

if __name__ == '__main__':
    # message title and body
    subject = str(sys.argv[1])
    content = str(sys.argv[2])
    # corp ID
    corpid = 'wwe86504f797d306ce'
    # application secret
    corpsecret = 'FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-wl3UY'
    accesstoken = gettoken(corpid, corpsecret)
    senddata(accesstoken, subject, content)
Once the message arrives successfully, update the Alertmanager configuration (if it fails, follow the hints in the API response).
apiVersion: v1
kind: Secret
metadata:
labels:
app.kubernetes.io/component: alert-router
app.kubernetes.io/instance: main
app.kubernetes.io/name: alertmanager
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 0.25.0
name: alertmanager-main
namespace: monitoring
stringData:
  alertmanager.yaml: |-
    # global: global settings, mainly the notification channels (email, webhook, etc.)
    global:
      # how long to wait before marking an alert as resolved; default 5m
      resolve_timeout: 5m
      # email settings
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '2750955630@qq.com'
      smtp_auth_username: '2750955630@qq.com'
      smtp_auth_password: 'puwluaqcmkrdddge'
      smtp_require_tls: false
      # Enterprise WeChat settings
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      # corp ID
      wechat_api_corp_id: 'wwe86504f797d306ce'
      # application secret
      wechat_api_secret: 'FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-wl3UY'
    # custom notification templates
    templates:
    - '/usr/local/alertmanager/*.tmp'
    # routing
    route:
      # alert grouping keys
      group_by: [ 'namespace', 'job', 'alertname' ]
      # wait group_wait after a new alert group is created before sending the initial notification, to avoid alert storms
      group_wait: 30s
      # after the initial notification, wait group_interval before sending notifications for new alerts added to the group
      group_interval: 2m
      # if an alert is still firing after a successful notification, re-send every repeat_interval
      repeat_interval: 10m
      # default receiver
      receiver: 'Default'
      # child routes
      routes:
      - receiver: 'wechat'
        match:
          alertname: "Watchdog"
    # receiver definitions
    receivers:
    - name: 'Default'
      email_configs:
      # email address that receives the alerts
      - to: 'xumeng03@bilibili.com'
        # also notify once the issue is resolved
        send_resolved: true
    - name: 'email'
      email_configs:
      # email address that receives the alerts
      - to: '2750955630@qq.com'
        # also notify once the issue is resolved
        send_resolved: true
    - name: 'wechat'
      wechat_configs:
      # department ID that receives the alerts (4, per the information above)
      - to_party: 4
        # AgentId of the alerting application
        agent_id: 1000002
        # also notify once the issue is resolved
        send_resolved: true
    # inhibition rules
    inhibit_rules:
    - source_matchers:
      - severity="critical"
      target_matchers:
      - severity=~"warning|info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - severity="warning"
      target_matchers:
      - severity="info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - alertname="InfoInhibitor"
      target_matchers:
      - severity="info"
      equal:
      - namespace
type: Opaque
11. Custom Alert Templates
Add a custom template
# Add the custom template to the Alertmanager Secret
vi kube-prometheus/manifests/alertmanager-secret.yaml
# Apply the new configuration
kubectl replace -f kube-prometheus/manifests/alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
labels:
app.kubernetes.io/component: alert-router
app.kubernetes.io/instance: main
app.kubernetes.io/name: alertmanager
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 0.25.0
name: alertmanager-main
namespace: monitoring
stringData:
  # In the template below, .Add 28800e9 (nanoseconds) shifts the UTC timestamps by +8 hours to CST
  wechat.tmpl: |-
    {{ define "wechat.default.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 -}}
    # ALERT FIRING!!!
    {{- end }}
    Alert status: {{ .Status }}
    Severity: {{ .Labels.severity }}
    Alert name: {{ $alert.Labels.alertname }}
    Affected host: {{ $alert.Labels.instance }}
    Summary: {{ $alert.Annotations.summary }}
    Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }}
    Trigger value: {{ .Annotations.value }}
    Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 -}}
    # ALERT RESOLVED!!!
    {{- end }}
    Alert name: {{ .Labels.alertname }}
    Alert status: {{ .Status }}
    Summary: {{ $alert.Annotations.summary }}
    Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }}
    Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    Instance: {{ $alert.Labels.instance }}
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
  alertmanager.yaml: |-
    # global: global settings, mainly the notification channels (email, webhook, etc.)
    global:
      # how long to wait before marking an alert as resolved; default 5m
      resolve_timeout: 5m
      # email settings
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '2750955630@qq.com'
      smtp_auth_username: '2750955630@qq.com'
      smtp_auth_password: 'puwluaqcmkrdddge'
      smtp_require_tls: false
      # Enterprise WeChat settings
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      wechat_api_corp_id: 'wwe86504f797d306ce'
      wechat_api_secret: 'FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-wl3UY'
    # custom notification templates; the operator mounts every key of this Secret under /etc/alertmanager/config/
    templates:
    - '/etc/alertmanager/config/*.tmpl'
    # routing
    route:
      # alert grouping keys
      group_by: [ 'namespace', 'job', 'alertname' ]
      # wait group_wait after a new alert group is created before sending the initial notification, to avoid alert storms
      group_wait: 30s
      # after the initial notification, wait group_interval before sending notifications for new alerts added to the group
      group_interval: 2m
      # if an alert is still firing after a successful notification, re-send every repeat_interval
      repeat_interval: 5m
      # default receiver
      receiver: 'Default'
      # child routes
      routes:
      - receiver: 'wechat'
        match:
          alertname: "Watchdog"
    # receiver definitions
    receivers:
    - name: 'Default'
      email_configs:
      # email address that receives the alerts
      - to: 'xumeng03@bilibili.com'
        # also notify once the issue is resolved
        send_resolved: true
    - name: 'email'
      email_configs:
      # email address that receives the alerts
      - to: '2750955630@qq.com'
        # also notify once the issue is resolved
        send_resolved: true
    - name: 'wechat'
      wechat_configs:
      # department ID that receives the alerts
      - to_party: 4
        # AgentId of the alerting application
        agent_id: 1000002
        # also notify once the issue is resolved
        send_resolved: true
        # use the custom template defined above
        message: '{{ template "wechat.default.message" . }}'
    # inhibition rules
    inhibit_rules:
    - source_matchers:
      - severity="critical"
      target_matchers:
      - severity=~"warning|info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - severity="warning"
      target_matchers:
      - severity="info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - alertname="InfoInhibitor"
      target_matchers:
      - severity="info"
      equal:
      - namespace
type: Opaque
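To confirm that Watchdog alerts actually take the wechat route, amtool can walk the routing tree against a local copy of the rendered config; a sketch (the /tmp path is arbitrary):
# Print which receiver an alert with these labels would reach
kubectl -n monitoring get secret alertmanager-main -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/alertmanager.yaml
amtool config routes test --config.file=/tmp/alertmanager.yaml alertname=Watchdog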