1. Configure the Alertmanager ConfigMap
Write the Alertmanager configuration into a ConfigMap and create it in Kubernetes.
# Alertmanager configuration
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: cattle-prometheus
data:
  alertmanager.yml: |-   # Alertmanager configuration file
    route:
      group_by: ['alertname']   # group alerts by alert name
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1m       # interval before a still-firing alert is sent again
      receiver: 'web.hook'      # name of the default receiver
      routes:
      - receiver: web.hook
        group_wait: 10s
    receivers:                  # receiver definitions
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://172.200.96.145:8080/api/cloudnative/container/clusters/alert'
        send_resolved: true
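
The webhook URL above must be served by an application that accepts Alertmanager's JSON POST body. As a minimal sketch of the receiving side, the Python function below turns such a body into one readable line per alert. The function name summarize_alerts and the sample payload are hypothetical and only illustrate the payload shape (a top-level "alerts" list whose entries carry "status", "labels" and "annotations"); the actual service behind the URL in this document is not shown here and may work differently.

```python
from typing import List

# Hypothetical sample shaped like an Alertmanager webhook body.
# Only the fields used below are included; real payloads carry more.
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "receiver": "web.hook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "DeploymentNotReady", "severity": "critical"},
            "annotations": {"message": "Deployment is unhealthy"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "StatefulSetNotReady", "severity": "critical"},
            "annotations": {"message": "StatefulSet is unhealthy"},
        },
    ],
}


def summarize_alerts(payload: dict) -> List[str]:
    """Return one summary line per alert found in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")
        name = labels.get("alertname", "<no alertname>")
        message = annotations.get("message", "")
        lines.append(f"[{status}] {name}: {message}")
    return lines
```

Because send_resolved is true in the configuration, the receiver should expect entries with status "resolved" as well as "firing", which is why the sketch prints the status of every alert.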
2. Start Alertmanager in Kubernetes and create a Service
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: alertmanager-deployment
  name: alertmanager
  namespace: cattle-prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - image: yanfa-harbor.51iwifi.com/rancher/mirrored-prometheus-alertmanager:v0.22.2
        name: alertmanager
        ports:
        - containerPort: 9093
          protocol: TCP
        volumeMounts:
        - mountPath: "/alertmanager"
          name: data
        - mountPath: "/etc/alertmanager"
          name: config-volume
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 200m
            memory: 200Mi
      volumes:
      - name: data
        emptyDir: {}
      - name: config-volume
        configMap:
          name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: alertmanager
  annotations:
    prometheus.io/scrape: 'true'
  name: alertmanager-operated   # Service name
  namespace: cattle-prometheus
spec:
  type: NodePort
  ports:
  - port: 9093
    targetPort: 9093
    nodePort: 30001
  selector:
    app: alertmanager
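
Once the Service is up, one way to check the whole path (Alertmanager, routing, webhook) is to push a hand-made alert through the NodePort. The Python sketch below builds a body for Alertmanager's v2 API endpoint /api/v2/alerts and posts it. The helper names build_test_alert and post_alerts, the alert name ManualTest and the node address in the usage note are assumptions made for illustration, not part of the original setup.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name, severity="critical", now=None, minutes=5):
    """Build a one-element alert list in the shape /api/v2/alerts accepts."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {"alertname": name, "severity": severity},
            "annotations": {"message": "Test alert sent by hand"},
            # RFC 3339 timestamps; the alert expires after `minutes`.
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def post_alerts(base_url, alerts):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A call such as post_alerts("http://NODE_IP:30001", build_test_alert("ManualTest")), with NODE_IP replaced by the address of any cluster node, should make the alert appear in the Alertmanager UI and, after group_wait, reach the configured webhook.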
3. Add alerting rules to Prometheus in Kubernetes
Use the Prometheus Operator CRD (PrometheusRule) to add alerting rules to Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: exporter-kube-controller-manager
    chart: exporter-kube-controller-manager-0.0.1
    io.cattle.field/appId: cluster-monitoring
    release: cluster-monitoring
    source: rancher-monitoring
  name: prometheus-k8s
  namespace: cattle-prometheus
spec:
  groups:
  - name: prometheus-k8s
    rules:
    # - alert: PodNotReady
    #   annotations:
    #     message: 'Pod is unhealthy! Namespace: {{ $labels.namespace }}, Pod: {{ $labels.pod }}.'
    #   expr: |
    #     sum by(namespace, pod) (kube_pod_status_phase{phase!~"Running|Succeeded",namespace=~"default|kafka|kube-system|logging|monitoring|k8s-dcm|ai"}) > 0
    #   for: 5m
    #   labels:
    #     severity: critical
    - alert: DeploymentNotReady   # alerting rule
      annotations:
        message: 'Cluster 166: Deployment is unhealthy! Namespace: {{ $labels.namespace }}, Deployment: {{ $labels.deployment }}.'
      expr: |
        kube_deployment_spec_replicas{job="expose-kubernetes-metrics",namespace="default"} != kube_deployment_status_replicas_available{job="expose-kubernetes-metrics",namespace="default"}
      for: 10s
      labels:
        severity: critical
    - alert: StatefulSetNotReady
      annotations:
        message: 'Cluster 166: StatefulSet is unhealthy! Namespace: {{ $labels.namespace }}, StatefulSet: {{ $labels.statefulset }}.'
      expr: |
        kube_statefulset_status_replicas_ready{job="kube-state-metrics",namespace=~"default|kafka|kube-system|logging|monitoring|k8s-dcm|ai"} != kube_statefulset_status_replicas{job="kube-state-metrics",namespace=~"default|kafka|kube-system|logging|monitoring|k8s-dcm|ai"}
      for: 5m
      labels:
        severity: critical
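
Both active rules rely on the PromQL != comparison between two instant vectors: series are paired by identical label sets, and only left-hand series whose value differs from the paired right-hand value are kept. The Python sketch below imitates that filter on plain dictionaries to show which Deployments would trigger DeploymentNotReady. It is a simplified model, not how Prometheus evaluates rules; the function name not_equal and the sample data are hypothetical, and the "for" duration is not modeled.

```python
def not_equal(left, right):
    """Keep left-hand samples that have a match on the right with a different value.

    Keys stand in for label sets, values for sample values. Series without
    a match on the other side are dropped, as in PromQL one-to-one matching.
    """
    return {
        labels: value
        for labels, value in left.items()
        if labels in right and value != right[labels]
    }


# Hypothetical samples keyed by (namespace, deployment).
spec_replicas = {
    ("default", "web"): 3,
    ("default", "api"): 2,
    ("default", "new"): 1,  # no matching series on the right-hand side
}
available_replicas = {
    ("default", "web"): 3,
    ("default", "api"): 1,
}

# Deployments whose desired replica count differs from the available count.
pending = not_equal(spec_replicas, available_replicas)
```

Here only ("default", "api") remains in pending: "web" has matching counts, and "new" has no partner series, so it silently produces no alert. In the real rules the alert fires only after the mismatch has lasted for the configured "for" duration (10s or 5m).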