Preface:
Events are a useful but easily overlooked resource in k8s. This article looks at what an event actually is, and at how events can be used to operate and maintain our k8s platform.
What is an event
Let's start by running a deployment. Once it is created successfully, describe it; the last part of the output looks like this:
deployment:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 18m deployment-controller Scaled up replica set jingtao-event-deploy-5dd6495cfb to 1
The fields and their values are self-explanatory, so I won't go through them here.
Now describe the pod created by that deployment; the last part of its output looks like this:
pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned default/jingtao-event-deploy-5dd6495cfb-6csj4 to 10.50.42.236
Normal Pulling 19m kubelet Pulling image "ubuntu:xenial"
Normal Pulled 19m kubelet Successfully pulled image "ubuntu:xenial" in 1.114824009s
Normal Created 19m kubelet Created container jingtao-event-deploy
Normal Started 19m kubelet Started container jingtao-event-deploy
These are just as easy to read, and they tell us exactly what happened during the rollout.
So what exactly is an event? Let's run the get events command and see:
> kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
20m Normal Scheduled pod/jingtao-event-deploy-5dd6495cfb-6csj4 Successfully assigned default/jingtao-event-deploy-5dd6495cfb-6csj4 to 10.50.42.236
20m Normal Pulling pod/jingtao-event-deploy-5dd6495cfb-6csj4 Pulling image "ubuntu:xenial"
20m Normal Pulled pod/jingtao-event-deploy-5dd6495cfb-6csj4 Successfully pulled image "ubuntu:xenial" in 1.114824009s
20m Normal Created pod/jingtao-event-deploy-5dd6495cfb-6csj4 Created container jingtao-event-deploy
20m Normal Started pod/jingtao-event-deploy-5dd6495cfb-6csj4 Started container jingtao-event-deploy
20m Normal SuccessfulCreate replicaset/jingtao-event-deploy-5dd6495cfb Created pod: jingtao-event-deploy-5dd6495cfb-6csj4
20m Normal ScalingReplicaSet deployment/jingtao-event-deploy Scaled up replica set jingtao-event-deploy-5dd6495cfb to 1
41m Normal Killing pod/jingtao-self-deploy-674fff788d-x9kt8 Stopping container jingtao-self-deploy
41m Normal SuccessfulDelete replicaset/jingtao-self-deploy-674fff788d Deleted pod: jingtao-self-deploy-674fff788d-x9kt8
41m Normal ScalingReplicaSet deployment/jingtao-self-deploy Scaled down replica set jingtao-self-deploy-674fff788d to 1
As you can see, events are managed in k8s as a standalone resource, just like pods and deployments, so an event is not merely a log line: it has its own object model.
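An event is itself an API object that can be fetched like any other resource. The snippet below is a trimmed, illustrative sketch of one item from the list returned for the pod above; the event name suffix, timestamps and field ordering will differ in a real cluster, but the fields shown are those of the core/v1 Event type:

> kubectl get events --field-selector involvedObject.name=jingtao-event-deploy-5dd6495cfb-6csj4 -o yaml

apiVersion: v1
kind: Event
metadata:
  name: jingtao-event-deploy-5dd6495cfb-6csj4.16...   # object name plus a unique suffix
  namespace: default
type: Normal
reason: Pulled
message: Successfully pulled image "ubuntu:xenial" in 1.114824009s
count: 1
firstTimestamp: "..."            # first time this event was observed
lastTimestamp: "..."             # most recent occurrence
involvedObject:                  # the resource the event is about
  kind: Pod
  namespace: default
  name: jingtao-event-deploy-5dd6495cfb-6csj4
source:                          # the component that reported it
  component: kubelet
  host: 10.50.42.236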
Why events matter
Let's run another deployment, but this time with an image that does not exist, and then describe it:
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
OldReplicaSets: <none>
NewReplicaSet: jingtao-err-deploy-685dd8cd78 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 110s deployment-controller Scaled up replica set jingtao-err-deploy-685dd8cd78 to 1
The Available condition is False, yet the deployment's events reveal nothing about why. Let's describe the pod as well:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m12s default-scheduler Successfully assigned default/jingtao-err-deploy-685dd8cd78-clbk5 to 10.50.42.236
Normal Pulling 86s (x4 over 3m12s) kubelet Pulling image "ubuntu:err"
Warning Failed 81s (x4 over 3m9s) kubelet Failed to pull image "ubuntu:err": rpc error: code = Unknown desc = Error response from daemon: manifest for ubuntu:err not found: manifest unknown: manifest unknown
Warning Failed 81s (x4 over 3m9s) kubelet Error: ErrImagePull
Warning Failed 70s (x6 over 3m9s) kubelet Error: ImagePullBackOff
Normal BackOff 55s (x7 over 3m9s) kubelet Back-off pulling image "ubuntu:err"
This is where the value of events really shows: we can see clearly that the rollout is stuck pulling the image, and the type is Warning rather than Normal.
Note the special format "81s (x4 over 3m9s)". After some digging, it turns out to mean: the event has occurred 4 times since it was first seen 3m9s ago, with the most recent occurrence 81 seconds ago. OK, let's run get events and check whether that matches:
> kubectl get events --sort-by='{.metadata.creationTimestamp}'
LAST SEEN TYPE REASON OBJECT MESSAGE
58m Normal Scheduled pod/jingtao-event-deploy-5dd6495cfb-6csj4 Successfully assigned default/jingtao-event-deploy-5dd6495cfb-6csj4 to 10.50.42.236
58m Normal SuccessfulCreate replicaset/jingtao-event-deploy-5dd6495cfb Created pod: jingtao-event-deploy-5dd6495cfb-6csj4
58m Normal ScalingReplicaSet deployment/jingtao-event-deploy Scaled up replica set jingtao-event-deploy-5dd6495cfb to 1
58m Normal Pulling pod/jingtao-event-deploy-5dd6495cfb-6csj4 Pulling image "ubuntu:xenial"
58m Normal Started pod/jingtao-event-deploy-5dd6495cfb-6csj4 Started container jingtao-event-deploy
58m Normal Created pod/jingtao-event-deploy-5dd6495cfb-6csj4 Created container jingtao-event-deploy
58m Normal Pulled pod/jingtao-event-deploy-5dd6495cfb-6csj4 Successfully pulled image "ubuntu:xenial" in 1.114824009s
7m37s Normal SuccessfulCreate replicaset/jingtao-err-deploy-685dd8cd78 Created pod: jingtao-err-deploy-685dd8cd78-clbk5
7m37s Normal ScalingReplicaSet deployment/jingtao-err-deploy Scaled up replica set jingtao-err-deploy-685dd8cd78 to 1
7m37s Normal Scheduled pod/jingtao-err-deploy-685dd8cd78-clbk5 Successfully assigned default/jingtao-err-deploy-685dd8cd78-clbk5 to 10.50.42.236
5m50s Normal Pulling pod/jingtao-err-deploy-685dd8cd78-clbk5 Pulling image "ubuntu:err"
5m34s Warning Failed pod/jingtao-err-deploy-685dd8cd78-clbk5 Error: ImagePullBackOff
2m26s Normal BackOff pod/jingtao-err-deploy-685dd8cd78-clbk5 Back-off pulling image "ubuntu:err"
5m45s Warning Failed pod/jingtao-err-deploy-685dd8cd78-clbk5 Error: ErrImagePull
5m45s Warning Failed pod/jingtao-err-deploy-685dd8cd78-clbk5 Failed to pull image "ubuntu:err": rpc error: code = Unknown desc = Error response from daemon: manifest for ubuntu:err not found: manifest unknown: manifest unknown
What? Only a single record for each of these there as well? Not what we expected. It turns out this behavior stems from k8s's deliberately "frugal" design.
All k8s resources are ultimately persisted in etcd, and for performance reasons etcd capacity is a precious resource for k8s; spending it on large volumes of "log" data would be a poor trade-off. So k8s does two things with events.
First: when events are queried, kubectl aggregates and post-processes them, returning different formats and content for different queries.
Second: Kubernetes retains events for only one hour by default (the kube-apiserver's --event-ttl flag defaults to 1h); expired events are simply purged.
So for "81s (x4 over 3m9s)", does k8s internally keep one event object or four? I lean toward one: get events really does show only a single record, and the repetition is tracked as attributes of that record rather than as duplicate objects (the Event object carries count, firstTimestamp and lastTimestamp fields for exactly this purpose). Storing multiple copies of the same thing would waste resources and serve no purpose.
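Those aggregation fields can be printed directly with custom columns. A sketch for the failing pod above (the reason filter matches its BackOff events; the timestamps are placeholders):

> kubectl get events --field-selector reason=BackOff -o custom-columns='REASON:.reason,COUNT:.count,FIRST:.firstTimestamp,LAST:.lastTimestamp'

REASON    COUNT   FIRST              LAST
BackOff   7       <firstTimestamp>   <lastTimestamp>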
What kinds of events are there
Failure events: as in the case above, an error is reported explicitly.
Eviction events: k8s's built-in cleanup and protection mechanism tells you after it has evicted something, so you can remediate quickly if the loss matters.
Storage events: for example, problems mounting a PV into a container.
Scheduling events: for example, a pod cannot be scheduled because memory or CPU is insufficient (a filtering sketch follows this list).
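Most problem events of these kinds are emitted with type Warning, so a quick way to surface them across the whole cluster is to filter on that field. For the failing deployment above this returns rows like:

> kubectl get events --all-namespaces --field-selector type=Warning

NAMESPACE   LAST SEEN   TYPE      REASON   OBJECT                                    MESSAGE
default     81s         Warning   Failed   pod/jingtao-err-deploy-685dd8cd78-clbk5   Error: ErrImagePull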
How to make use of events
If events are only used for alerting, the one-hour lifetime is generally good enough; but to do tracing and analysis on top of them, event persistence has to be solved first. The usual approach is to forward the events produced through the kubernetes API to an upstream system with a tool such as kubernetes-event-exporter, where other tools then aggregate and analyze them.
kubernetes-event-exporter is deployed into the cluster as a deployment and can ship events to Elasticsearch, Kafka and the like; essentially this comes back to ordinary distributed data collection, much like the ELK stack.
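The exporter's job, in essence, is to watch the event stream and ship it somewhere durable before the one-hour TTL removes it. As a minimal stand-in sketch of that pattern (not the exporter's actual configuration), the same stream can be captured with kubectl and appended to a file or piped into a log shipper:

> kubectl get events --all-namespaces --watch -o json >> events.json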
Shortcomings of native events
While using events to monitor and manage our k8s platform, we found that native events are only tied to the resource objects k8s itself manages. An event can tell us, for example, that a deployment failed because resources were insufficient, but in practice we also want node-level monitoring to surface problems earlier, and events about the health of the node itself are largely missing from native k8s events.
This is where node-problem-detector comes in. It runs on each node as a DaemonSet, generates k8s events for node problems and pushes them to the k8s APIServer. These events likewise live for only one hour, but they can be persisted with the kubernetes-event-exporter approach described above.
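Once node-problem-detector is running, node problems appear as ordinary events attached to Node objects, so they can be queried, exported and persisted exactly like the pod events above. For example:

> kubectl get events --all-namespaces --field-selector involvedObject.kind=Node

The reasons and messages you will see depend on which problem daemons are enabled on the nodes.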
Event analysis and processing
With all of that in place, the collected events still need a platform, something like an "event center", to be put to good use. Its responsibilities include dashboards, querying, analysis, alerting and subscriptions. Producing, transporting and collecting events follows a fairly uniform methodology; it is how the events are used afterwards that truly tests the depth of a platform operations team.