NPD Deployment & Testing
Introduction
Project repository
https://github.com/kubernetes/node-problem-detector
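node-problem-detector collects node problems that arise in Kubernetes cluster management and reports them to the apiserver. It is a daemon that runs on every node, either as a DaemonSet or standalone. Currently, it is enabled by default as an add-on in GCE clusters.
Background
Many node problems can affect the pods running on a node, for example:
- Infrastructure daemon issues: the ntp service is down;
- Hardware issues: bad CPU, memory, or disk;
- Kernel issues: kernel deadlock, corrupted file system;
- Container runtime issues: unresponsive runtime daemon;
- and so on.
When these problems occur on a node, the Kubernetes control-plane components are not aware of them, so pods continue to be scheduled onto the faulty node.
To solve this, the node-problem-detector daemon was introduced. It collects node problems from the various daemons and makes them visible to the upstream layers. Once the upstream layers can see the problems, remediation can be discussed.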
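API types used for reporting
node-problem-detector reports problems to the apiserver using Event and NodeCondition.
- NodeCondition: permanent problems that make the node unavailable for pods should be reported as a NodeCondition.
- Event: temporary problems that have limited impact on pods should be reported as an Event.
Problem detection types
Problem detection types currently supported: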
- SystemLogMonitor
- SystemStatsMonitor
- CustomPluginMonitor
- HealthChecker
Each detection type runs in its own goroutine. For example configurations, see https://github.com/kubernetes/node-problem-detector/tree/master/config. Configuration files use the .json extension.
Example: kernel-monitor.json. Here plugin is the name of the log watcher plugin, logPath is the log path, and source identifies what is being monitored. (JSON does not allow inline comments, so these notes are kept outside the file.)
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    },
    {
      "type": "ReadonlyFilesystem",
      "reason": "FilesystemIsNotReadOnly",
      "message": "Filesystem is not read-only"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "pattern": "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
    },
    {
      "type": "permanent",
      "condition": "ReadonlyFilesystem",
      "reason": "FilesystemIsReadOnly",
      "pattern": "Remounting filesystem read-only"
    }
  ]
}
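To make the rules section more concrete, here is a minimal Python sketch of how a log line could be matched against the two rules above and routed to an Event or a NodeCondition. It is an illustration only, not NPD's implementation: NPD is written in Go, Python's re module stands in for Go's regexp here, and the classify function and the sample log lines are hypothetical.

```python
import json
import re

# The two rules from the kernel-monitor.json example above.
# The raw string keeps the JSON escapes (\\d) intact until json.loads decodes them.
CONFIG = json.loads(r'''
{
  "source": "kernel-monitor",
  "rules": [
    {"type": "temporary", "reason": "OOMKilling",
     "pattern": "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"},
    {"type": "permanent", "condition": "ReadonlyFilesystem",
     "reason": "FilesystemIsReadOnly",
     "pattern": "Remounting filesystem read-only"}
  ]
}
''')


def classify(line, config=CONFIG):
    """Return (channel, condition, reason) for the first matching rule, else None.

    Hypothetical helper: "temporary" rules map to an Event,
    "permanent" rules map to the NodeCondition named in the rule.
    """
    for rule in config["rules"]:
        if re.search(rule["pattern"], line):
            if rule["type"] == "permanent":
                return ("NodeCondition", rule["condition"], rule["reason"])
            return ("Event", None, rule["reason"])
    return None


if __name__ == "__main__":
    # Hand-written sample lines, not captured from a real node.
    samples = [
        "Killed process 1234 (java) total-vm:2048kB, anon-rss:1024kB, file-rss:0kB",
        "EXT4-fs (sda1): Remounting filesystem read-only",
        "usb 1-1: new high-speed USB device",
    ]
    for line in samples:
        print(classify(line), "<-", line)
```

Trying a pattern this way before rolling out a new rule is a cheap way to catch escaping mistakes, although the final check should still be done against NPD itself because the two regex engines are not identical.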
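Some of the default monitor JSON configuration files need their logPath changed to the real path in your environment; otherwise no useful information is collected.
For example, docker-monitor.json (excerpt):
  "plugin": "journald",
  "pluginConfig": {
    "source": "dockerd"
  },
  "logPath": "/var/log/journal",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "docker-monitor",
  "metricsReporting": true,
The environment I deployed to has no /var/log/journal. /var/log/messages can be used instead, or Docker's own log path. Note that reading a plain-text log file rather than the journal also means switching plugin from journald to filelog and supplying a pluginConfig that matches the file's line format.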
Deployment
The version deployed in this test is 0.8.7.
Image preparation
Pull the image from an environment that can reach k8s.gcr.io:
k8s.gcr.io/node-problem-detector/node-problem-detector:v0.8.7
YAML file preparation
RBAC files
npd-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    meta.helm.sh/release-name: npd
    meta.helm.sh/release-namespace: kube-system
  labels:
    app.kubernetes.io/instance: npd
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: node-problem-detector
    helm.sh/chart: node-problem-detector-2.2.3
  name: npd-node-problem-detector
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
npd-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/instance: npd
    app.kubernetes.io/name: node-problem-detector
  name: npd-node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: npd-node-problem-detector
subjects:
- kind: ServiceAccount
  name: npd-node-problem-detector
  namespace: kube-system
daemonsets.apps
npd-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "2"
  generation: 2
  labels:
    app.kubernetes.io/instance: npd
    app.kubernetes.io/name: node-problem-detector
  name: npd-node-problem-detector
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-problem-detector
      app.kubernetes.io/instance: npd
      app.kubernetes.io/name: node-problem-detector
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-problem-detector
        app.kubernetes.io/instance: npd
        app.kubernetes.io/name: node-problem-detector
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json --prometheus-address=0.0.0.0
          --prometheus-port=20257 --k8s-exporter-heartbeat-period=5m0s
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: k8s.gcr.io/node-problem-detector/node-problem-detector:v0.8.7
        imagePullPolicy: IfNotPresent
        name: node-problem-detector
        ports:
        - containerPort: 20257
          name: exporter
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/
          name: log
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /custom-config
          name: custom-config
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: npd-node-problem-detector
      serviceAccountName: npd-node-problem-detector
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/log/
          type: ""
        name: log
      - hostPath:
          path: /etc/localtime
          type: FileOrCreate
        name: localtime
      - configMap:
          defaultMode: 420
          name: npd-node-problem-detector-custom-config
        name: custom-config
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
ConfigMap file
npd-cm.yaml: used to add the JSON configuration for custom monitor plugins
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/instance: npd
    app.kubernetes.io/name: node-problem-detector
  name: npd-node-problem-detector-custom-config
  namespace: kube-system
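Command parameters in the DaemonSet
- For system log monitors
  --config.system-log-monitor: comma-separated list of paths to system log monitor configuration files, e.g. config/kernel-monitor.json. node-problem-detector starts a separate log monitor for each configuration. You can use different log monitors to watch different system logs.
- For system stats monitors
  --config.system-stats-monitor: comma-separated list of paths to system stats monitor configuration files, e.g. config/system-stats-monitor.json. node-problem-detector starts a separate system stats monitor for each configuration. You can use different system stats monitors to watch different problem-related system stats.
- For custom plugin monitors
  --config.custom-plugin-monitor: comma-separated list of paths to custom plugin monitor configuration files, e.g. config/custom-plugin-monitor.json. node-problem-detector starts a separate custom plugin monitor for each configuration. You can use different custom plugin monitors to watch different node problems.
Configure different JSON monitors as needed.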
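As a small illustration of the comma-separated convention described above, the following Python sketch assembles the container command from lists of config paths. The helper build_npd_command is hypothetical, and /custom-config/my-plugin-monitor.json is an assumed file name used only for the example; the three flag names are the ones listed above.

```python
def build_npd_command(system_log=(), system_stats=(), custom_plugin=()):
    """Assemble an NPD argument list; each flag takes a comma-separated path list."""
    args = ["/node-problem-detector", "--logtostderr"]
    flag_groups = [
        ("--config.system-log-monitor", system_log),
        ("--config.system-stats-monitor", system_stats),
        ("--config.custom-plugin-monitor", custom_plugin),
    ]
    for flag, paths in flag_groups:
        if paths:  # omit the flag entirely when no config of that kind is given
            args.append(f"{flag}={','.join(paths)}")
    return args


if __name__ == "__main__":
    command = build_npd_command(
        system_log=["/config/kernel-monitor.json", "/config/docker-monitor.json"],
        # Hypothetical file that would be mounted from the custom-config ConfigMap.
        custom_plugin=["/custom-config/my-plugin-monitor.json"],
    )
    print(" ".join(command))
```

Each path in a list results in its own monitor inside NPD, so adding a monitor is a matter of appending one more path to the relevant flag rather than adding another flag.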
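Deploy
The ClusterRoleBinding and the DaemonSet both reference the ServiceAccount npd-node-problem-detector in kube-system, and none of the files above creates it. If it does not already exist, create it first:
kubectl create serviceaccount npd-node-problem-detector -n kube-system
kubectl apply -f npd-clusterrole.yaml
kubectl apply -f npd-clusterrolebinding.yaml
kubectl apply -f npd-cm.yaml
kubectl apply -f npd-ds.yaml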
Check status
[root@master test]# kubectl get pod -n kube-system -l app.kubernetes.io/name=node-problem-detector
NAME                              READY   STATUS    RESTARTS   AGE
npd-node-problem-detector-58gqs   1/1     Running   0          3m7s
npd-node-problem-detector-9lmcq   1/1     Running   0          3m7s
npd-node-problem-detector-rtc7b   1/1     Running   0          3m7s
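Simulating a fault
Because this kind of fault is hard to trigger for real, it is simulated by writing a matching message to /dev/kmsg. The resulting event shows that the message matched a KernelOops rule, which is not part of the shortened kernel-monitor.json listing shown earlier.
# On a worker node
# sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
# kubectl describe nodes node2
Events:
  Type     Reason      Age   From            Message
  ----     ------      ----  ----            -------
  Warning  KernelOops  4s    kernel-monitor  kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING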
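When there are many nodes, reading kubectl describe output by hand gets tedious. As a rough sketch, the Python below filters the JSON produced by kubectl get events -o json for events reported by a given source. The function events_from is hypothetical, and the SAMPLE payload is hand-written to mirror the event shown above plus one made-up kubelet event; it is not captured from a real cluster.

```python
import json


def events_from(events_json, component):
    """Return (node, type, reason, message) for events whose source.component matches."""
    doc = json.loads(events_json)
    rows = []
    for item in doc.get("items", []):
        if item.get("source", {}).get("component") == component:
            rows.append((
                item["involvedObject"]["name"],
                item["type"],
                item["reason"],
                item["message"],
            ))
    return rows


# Hand-written sample shaped like an Event list; values are illustrative only.
SAMPLE = json.dumps({
    "items": [
        {
            "involvedObject": {"kind": "Node", "name": "node2"},
            "source": {"component": "kernel-monitor", "host": "node2"},
            "type": "Warning",
            "reason": "KernelOops",
            "message": "kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING",
        },
        {
            "involvedObject": {"kind": "Node", "name": "node2"},
            "source": {"component": "kubelet", "host": "node2"},
            "type": "Normal",
            "reason": "NodeReady",
            "message": "Node node2 status is now: NodeReady",
        },
    ]
})

if __name__ == "__main__":
    for row in events_from(SAMPLE, "kernel-monitor"):
        print(row)
```

In practice the input would come from piping kubectl get events -o json into the script instead of using the embedded sample.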