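Add a robot for kube-eventer to a DingTalk group (this walkthrough uses cluster1 as the robot's custom keyword).
Once it is created, copy the generated webhook URL for later use: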
https://oapi.dingtalk.com/robot/send?access_token=xxxxxx
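Before deploying anything, it can help to confirm that the robot accepts messages. The sketch below assumes the robot uses the custom keyword security setting with the keyword cluster1, and that the webhook URL is exported in a variable named DINGTALK_WEBHOOK (a name chosen here only for illustration). It builds a DingTalk text message that contains the keyword and posts it with curl only when that variable is set.

```shell
# Build the JSON body for a DingTalk "text" message.
# The content must contain the robot's custom keyword, or DingTalk rejects it.
# Note: the message is not JSON-escaped, so keep it free of double quotes.
build_payload() {
  keyword="$1"
  message="$2"
  printf '{"msgtype":"text","text":{"content":"[%s] %s"}}' "$keyword" "$message"
}

# DINGTALK_WEBHOOK is a hypothetical variable name, not something kube-eventer reads.
if [ -n "${DINGTALK_WEBHOOK:-}" ]; then
  # Send the test message to the robot.
  curl -sS -H 'Content-Type: application/json' \
    -d "$(build_payload cluster1 'kube-eventer test message')" \
    "$DINGTALK_WEBHOOK"
else
  # No webhook configured: just show what would be sent.
  build_payload cluster1 'kube-eventer test message'
  echo
fi
```

If the message shows up in the group, the webhook and keyword are usable for the --sink configuration.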
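Save the following YAML to kube-event.yaml. In the startup arguments, change --sink to the webhook URL you just copied, set label to the custom keyword you defined for the robot (cluster1), and use level to choose which event level triggers alerts, here Warning.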
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: kube-eventer
  name: kube-eventer
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eventer
  template:
    metadata:
      labels:
        app: kube-eventer
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccount: kube-eventer
      containers:
        - image: registry.aliyuncs.com/acs/kube-eventer-amd64:v1.2.0-484d9cd-aliyun
          name: kube-eventer
          command:
            - "/kube-eventer"
            - "--source=kubernetes:https://kubernetes.default"
            ## .e.g,dingtalk sink demo
            # - --sink=dingtalk:https://xxxxx&label=cluster1&level=Normal
            - --sink=dingtalk:[your_webhook_url]&label=[your_cluster_id]&level=[Normal or Warning(default)]
          env:
            # If TZ is assigned, set the TZ value as the time zone
            - name: TZ
              value: "Asia/Shanghai"
          volumeMounts:
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: zoneinfo
              mountPath: /usr/share/zoneinfo
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 250Mi
      volumes:
        - name: localtime
          hostPath:
            path: /etc/localtime
        - name: zoneinfo
          hostPath:
            path: /usr/share/zoneinfo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-eventer
rules:
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-eventer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-eventer
subjects:
  - kind: ServiceAccount
    name: kube-eventer
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-eventer
  namespace: kube-system
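The label here is the keyword of your DingTalk robot, cluster1. For level I chose Normal. I started with Warning and used the stress command to trigger an OOM, but no alert arrived, and for a while I thought my configuration was wrong.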
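To make the placeholder line in the Deployment concrete, here is a small sketch that assembles the --sink value from its three parts. The function name build_sink_arg is made up for illustration, and the access token is the same xxxxxx placeholder used earlier, not a real one.

```shell
# Assemble the kube-eventer DingTalk sink argument from webhook, label and level.
# build_sink_arg is a hypothetical helper, not part of kube-eventer.
build_sink_arg() {
  webhook="$1"   # full robot webhook URL, including access_token
  label="$2"     # the robot's custom keyword, cluster1 in this post
  level="$3"     # Normal or Warning
  printf '%s\n' "--sink=dingtalk:${webhook}&label=${label}&level=${level}"
}

build_sink_arg 'https://oapi.dingtalk.com/robot/send?access_token=xxxxxx' cluster1 Normal
# prints: --sink=dingtalk:https://oapi.dingtalk.com/robot/send?access_token=xxxxxx&label=cluster1&level=Normal
```

The printed value is what goes after the dash in the command list of the Deployment. In a shell the value has to be quoted because of the & characters.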
Apply the YAML:
kubectl apply -f kube-event.yaml
View the pod logs:
kubectl logs -f ${pod_name} -n kube-system
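The pod name carries a generated suffix, so ${pod_name} has to be looked up first. As an alternative sketch, kubectl logs can select the pod through the app=kube-eventer label set in the Deployment above. The function name eventer_logs is hypothetical.

```shell
# Follow the kube-eventer logs without knowing the generated pod name.
# Selects pods by the app=kube-eventer label from the Deployment template.
# Extra arguments (for example --tail=20) are passed through to kubectl.
eventer_logs() {
  kubectl logs -f -n kube-system -l app=kube-eventer "$@"
}
```

Calling eventer_logs --tail=20 follows the log starting from the last 20 lines.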
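When a Pod in a Kubernetes cluster restarts because of an OOM, a failed image pull, a failing health check, or a similar error, the cluster administrator usually does not notice, because Kubernetes heals itself: when a Pod goes down, a new one is started. With event alerts, the administrator can spot service problems promptly and fix them.
Commands for viewing events manually: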
kubectl get events --all-namespaces
kubectl get events --field-selector type=Warning
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get events --field-selector involvedObject.kind=Node
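I ran this in a test environment, where every release set off a flood of alerts, so you may need to adjust the keyword to suit your own setup.
To separate alerts by namespace, you can configure several webhooks of the same kind, each for a different namespace.
Official documentation: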
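One way to sketch the per-namespace setup is to generate one --sink line per namespace. This assumes that the DingTalk sink accepts a namespaces parameter and that --sink may be given more than once. That is my reading of the kube-eventer documentation, so check it against the official documentation for your version. The helper name and the example URLs are placeholders.

```shell
# Print one "- --sink=..." command entry per namespace.
# Each argument is a "namespace=webhook_url" pair; the split is on the first "=".
# Assumption: the dingtalk sink supports a "namespaces" filter parameter.
build_namespace_sinks() {
  for pair in "$@"; do
    ns="${pair%%=*}"    # text before the first "="
    url="${pair#*=}"    # text after the first "="
    printf '%s\n' "- --sink=dingtalk:${url}&label=cluster1&level=Warning&namespaces=${ns}"
  done
}

build_namespace_sinks \
  'dev=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx' \
  'prod=https://oapi.dingtalk.com/robot/send?access_token=yyyyyy'
```

The printed lines can be pasted into the command list of the Deployment, one robot per namespace.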
https://github.com/AliyunContainerService/kube-eventer