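Common commands for troubleshooting pods
List pods
kubectl get pod -o wide
kubectl get pods --namespace kube-system
View a pod's container logs
- kubectl logs <pod name>: prints the logs of the pod's container
List the events related to a pod
- kubectl describe pod <pod name>: shows the pod's details, including the list of events related to it
- kubectl get pod <pod name> -o yaml: prints the pod's YAML definition as stored in Kubernetes
Run an interactive command
- kubectl exec -it <pod name> -- bash: runs an interactive command in one of the pod's containers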
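The commands above are often run together when a pod misbehaves. The following is a minimal sketch of that routine, assuming a hypothetical helper name `pod_triage` and a namespace argument that defaults to `default`; `--tail` and `--previous` are standard `kubectl logs` flags.

```shell
# pod_triage POD [NAMESPACE]
# Hypothetical helper: prints placement, events, and recent logs for one pod.
pod_triage() {
  pod="$1"
  ns="${2:-default}"
  # Which node is the pod on, and what state is it in?
  kubectl get pod "$pod" -n "$ns" -o wide
  # Events (scheduling failures, image pull errors, failed probes) are listed at the end.
  kubectl describe pod "$pod" -n "$ns"
  # Last 50 log lines of the current container instance.
  kubectl logs "$pod" -n "$ns" --tail=50
  # Logs of the previous instance, useful after a crash.
  # This call fails when the container has never restarted, so the error is ignored.
  kubectl logs "$pod" -n "$ns" --tail=50 --previous || true
}
```

For example, `pod_triage cloud-bmp-7d688998f8-qprvw cos` would inspect one of the pods used later in this article.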
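Node maintenance: taking a worker node out of scheduling
Three kubectl commands, cordon, drain, and delete, all stop new pods from being scheduled onto a node. The main difference between them is how disruptive they are.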
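Cordon
- Temporarily isolates the node from the Kubernetes cluster
- Has the least impact: it only marks the node as SchedulingDisabled, that is, unschedulable
- Pods created afterwards are not scheduled onto this node
- Pods already running on the node are unaffected and keep serving traffic
- Command:
kubectl cordon [node name]
- To resume scheduling:
kubectl uncordon [node name]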
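Cordoning sets the node's `spec.unschedulable` field to true, which is what the SchedulingDisabled status reflects. A minimal sketch that cordons a node and then confirms the field, assuming hypothetical helper names `is_unschedulable` and `cordon_and_check`:

```shell
# is_unschedulable NODE
# Succeeds (exit 0) when the node's spec.unschedulable field is "true".
# On a schedulable node the field is absent, so the jsonpath output is empty.
is_unschedulable() {
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]
}

# cordon_and_check NODE
# Hypothetical helper: cordons the node, then verifies the result.
cordon_and_check() {
  kubectl cordon "$1" >/dev/null || return 1
  if is_unschedulable "$1"; then
    echo "$1: scheduling disabled"
  else
    echo "$1: still schedulable" >&2
    return 1
  fi
}
```

The same check works in reverse after `kubectl uncordon`: `is_unschedulable` should then fail.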
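Drain
Overview
- Goal: mark the node unschedulable first, then evict (drain) the pods already running on it
- First, the node is marked SchedulingDisabled, so no new pods are scheduled onto it
- Then, the existing pods are evicted; pods managed by a controller are recreated on other nodes
- Command:
kubectl drain [node name] --force --ignore-daemonsets --delete-local-data
--force: also delete pods that are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet. Without --force, drain refuses to proceed if such unmanaged pods exist; with it, they are deleted as well, and nothing recreates them.
--ignore-daemonsets: skip pods managed by a DaemonSet. Without this flag, drain refuses to proceed if DaemonSet-managed pods exist. Deleting them would achieve nothing, because the DaemonSet controller recreates them on the node right away.
--delete-local-data: proceed even if pods use emptyDir volumes; the emptyDir data is deleted together with the pod. In kubectl 1.20 and later this flag is deprecated in favor of --delete-emptydir-data.
- To resume scheduling:
kubectl uncordon [node name]
- drain is comparatively safe: it waits for each pod's containers to shut down gracefully before the pod is removed
- In detail: a pod is deleted on the current node first, and only then is a replacement created on another node. To avoid interrupting service during a drain, the pods being evicted must therefore have more than one replica, and anti-affinity rules should spread those replicas across different nodes. That way the eviction does not affect the service.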
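A more conservative drain invocation can be sketched as follows. It assumes kubectl 1.20 or later (where `--delete-emptydir-data` is the current name of the emptyDir flag) and a hypothetical helper name `drain_node`; the 300-second default timeout is an arbitrary choice for illustration.

```shell
# drain_node NODE [TIMEOUT]
# Hypothetical wrapper around `kubectl drain`.
# --force is deliberately left out, so the drain stops with an error
# if the node runs pods that no controller would recreate.
drain_node() {
  node="$1"
  timeout="${2:-300s}"
  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout="$timeout"
}
```

After maintenance is finished, `kubectl uncordon [node name]` makes the node schedulable again; drain does not do this on its own.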
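Notes:
- 1. Before a kernel upgrade, hardware maintenance, or similar work on a node, use kubectl drain to safely evict the pods running on it
- 2. drain evicts pods through the Eviction API, so it honors any PodDisruptionBudgets that apply to them and terminates the containers' processes gracefully, within each pod's termination grace period
- 3. kubectl drain reports the pods it has evicted successfully
- 4. After that, powering off the physical machine, or deleting the virtual machine on a cloud platform, does not affect the rest of the cluster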
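The settings usually described as ideal for eviction are maxUnavailable set to 0 and maxSurge set to 1. Note that these two fields belong to a Deployment's rolling-update strategy, not to a PodDisruptionBudget: they keep the full replica count available during a rollout. A PodDisruptionBudget is a separate resource that limits voluntary evictions, such as those made by drain, through its own minAvailable or maxUnavailable field. The Deployment settings look like this:
replicas: 2
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0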
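To protect a two-replica workload like this one during a drain, a PodDisruptionBudget can be created next to the Deployment. A minimal sketch using `kubectl create poddisruptionbudget`, assuming a hypothetical helper name `create_pdb`; the PDB name and label selector in the usage line are illustrative only.

```shell
# create_pdb NAME NAMESPACE SELECTOR MIN_AVAILABLE
# Hypothetical helper: creates a PodDisruptionBudget that keeps at least
# MIN_AVAILABLE pods matching SELECTOR running during voluntary evictions.
create_pdb() {
  kubectl create poddisruptionbudget "$1" \
    --namespace "$2" \
    --selector="$3" \
    --min-available="$4"
}
```

For example, `create_pdb cloud-bmp-pdb cos app=cloud-bmp 1` would let drain evict only one of two replicas at a time. With a single replica and a minimum of 1, the same budget would block the drain entirely.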
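Delete
- Pods that were running on the node are recreated on other nodes, provided a controller manages them
- Sequence: the pods on the current node are deleted first, and then replacements are created on other nodes
- Once the Node object is deleted, the control plane no longer manages the node, and it is removed from the cluster
- delete is a very forceful way to remove a node: container processes are killed rather than terminated gracefully. drain is comparatively safe.
- Command:
kubectl delete node [node name]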
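Because delete does not terminate pods gracefully, a common approach is to drain the node first and delete it only if the drain succeeds. A minimal sketch, assuming kubectl 1.20 or later and a hypothetical helper name `remove_node`:

```shell
# remove_node NODE
# Hypothetical helper: drains the node so pods stop gracefully,
# then deletes the Node object. Nothing is deleted if the drain fails.
remove_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data || {
    echo "drain failed; node $1 was not deleted" >&2
    return 1
  }
  kubectl delete node "$1"
}
```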
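Using the kubectl CLI
By default, both minikube and a regular Kubernetes installation create a ~/.kube/config file in the user's home directory, and kubectl reads the cluster information from that file.
Listing resource types
The resource types the cluster supports, including those added through CRDs, can be listed with the following command: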
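When several clusters are in play, it helps to confirm which configuration kubectl is actually reading before running anything destructive. A minimal sketch, assuming a hypothetical helper name `kube_ctx`; the `KUBECONFIG` environment variable, when set, overrides the default file.

```shell
# kube_ctx
# Hypothetical helper: prints the kubeconfig path(s) in effect and the active context.
# KUBECONFIG may hold several paths separated by colons; they are printed as-is.
kube_ctx() {
  echo "kubeconfig: ${KUBECONFIG:-$HOME/.kube/config}"
  echo "context: $(kubectl config current-context)"
}
```

`kubectl config get-contexts` lists every context in the file, and `kubectl config use-context [name]` switches between them.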
[root@master ~]# kubectl api-resources
NAME SHORTNAMES APIVERSION NAMESPACED KIND
bindings v1 true Binding
componentstatuses cs v1 false ComponentStatus
configmaps cm v1 true ConfigMap
endpoints ep v1 true Endpoints
events ev v1 true Event
limitranges limits v1 true LimitRange
namespaces ns v1 false Namespace
nodes no v1 false Node
persistentvolumeclaims pvc v1 true PersistentVolumeClaim
persistentvolumes pv v1 false PersistentVolume
pods po v1 true Pod
podtemplates v1 true PodTemplate
replicationcontrollers rc v1 true ReplicationController
resourcequotas quota v1 true ResourceQuota
secrets v1 true Secret
serviceaccounts sa v1 true ServiceAccount
services svc v1 true Service
mutatingwebhookconfigurations admissionregistration.k8s.io/v1 false MutatingWebhookConfiguration
validatingwebhookconfigurations admissionregistration.k8s.io/v1 false ValidatingWebhookConfiguration
agents agent agent.k8s.elastic.co/v1alpha1 true Agent
customresourcedefinitions crd,crds apiextensions.k8s.io/v1 false CustomResourceDefinition
apiservices apiregistration.k8s.io/v1 false APIService
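- NAME: the API resource name
- SHORTNAMES: short names for the resource, which can be used in queries
- APIVERSION: the API group and version of the resource
- NAMESPACED: whether the resource is namespace-scoped; for example, pv shows false, meaning PersistentVolumes are cluster-wide rather than tied to a particular namespace
- KIND: the resource kind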
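The listing can be narrowed with the `--namespaced` and `--api-group` flags of `kubectl api-resources`. A minimal sketch, assuming hypothetical helper names `cluster_scoped_kinds` and `kinds_in_group`:

```shell
# cluster_scoped_kinds
# Prints the names of resource types that are not namespaced (NAMESPACED = false).
cluster_scoped_kinds() {
  kubectl api-resources --namespaced=false -o name
}

# kinds_in_group GROUP
# Prints the resource types served by one API group, for example "apps".
kinds_in_group() {
  kubectl api-resources --api-group="$1" -o name
}
```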
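Looking up the manifest structure of a resource
When writing a YAML manifest, if you are unsure of a field's path, value type, or whether it is required, kubectl explain can tell you. For example, for a pod:
# Show the top-level fields of a pod; each field comes with a detailed description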
[root@master ~]# kubectl explain pod
KIND: Pod
VERSION: v1
DESCRIPTION:
Pod is a collection of containers that can run on a host. This resource is
created by clients and scheduled onto hosts.
FIELDS:
apiVersion <string>
APIVersion defines the versioned schema of this representation of an
object. Servers should convert recognized schemas to the latest internal
value, and may reject unrecognized values. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind <string>
Kind is a string value representing the REST resource this object
represents. Servers may infer this from the endpoint the client submits
requests to. Cannot be updated. In CamelCase. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
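`kubectl explain` also accepts a dotted path to a nested field, and `--recursive` prints the whole field tree without descriptions. A minimal sketch, assuming hypothetical helper names `explain_field` and `explain_tree`:

```shell
# explain_field PATH
# Shows the documentation for a nested field, e.g. pod.spec.containers.resources
explain_field() {
  kubectl explain "$1"
}

# explain_tree KIND
# Prints every field name of the resource as an indented tree, without descriptions.
explain_tree() {
  kubectl explain "$1" --recursive
}
```

For example, `explain_field pod.spec.containers.resources` describes the limits and requests fields that appear in the pod details later in this article.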
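Querying resources of a given type
Pods are used as the example here; other resource types are queried the same way, with only the type name changed:
# kubectl get pod -n [namespace]; without -n, only the default namespace is queried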
[root@master ~]# kubectl get pod -n cos
NAME READY STATUS RESTARTS AGE
cloud-bmp-7d688998f8-qprvw 1/1 Running 0 9d
cloud-bsm-5df444986b-r9vmb 1/1 Running 0 9d
cloud-component-elasticsearch-server-75c494957-n7phx 1/1 Running 0 16d
# -o wide prints additional columns
[root@master ~]# kubectl get pod -n cos -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cloud-bmp-7d688998f8-qprvw 1/1 Running 0 9d 192.168.219.78 master <none> <none>
cloud-bsm-5df444986b-r9vmb 1/1 Running 0 9d 192.168.219.82 master <none> <none>
cloud-component-elasticsearch-server-75c494957-n7phx 1/1 Running 0 16d 192.168.186.149 k8s-prod-node2 <none> <none>
# Show the details of a specific pod
[root@master ~]# kubectl describe pod cloud-bmp-7d688998f8-qprvw -n cos
Name: cloud-bmp-7d688998f8-qprvw
Namespace: cos
Priority: 0
Node: master/172.28.105.220
Start Time: Tue, 12 Jul 2022 18:33:47 +0800
Labels: app=cloud-bmp
pod-template-hash=7d688998f8
Annotations: cni.projectcalico.org/containerID: 8adb5ce7ffa7a891b28612646dacfd4f3a05084e1abd86dec9f9d5e1013ba869
cni.projectcalico.org/podIP: 192.168.219.78/32
cni.projectcalico.org/podIPs: 192.168.219.78/32
Status: Running
IP: 192.168.219.78
IPs:
IP: 192.168.219.78
Controlled By: ReplicaSet/cloud-bmp-7d688998f8
Init Containers:
sw-agent-sidecar:
Container ID: docker://6b24e1fa2b7768ce233926813d4c7e2ea7c220a660c6e293cfcfeac3866e3ef9
Image: reg.kolla.org/brs-dev/skywalking-agent-sidecar:8.9.0
Image ID: docker-pullable://reg.kolla.org/brs-dev/skywalking-agent-sidecar@sha256:6178f1bc6454523900f6d8a7bec15da3f086a175b213dc96af312e09690d8c26
Port: <none>
Host Port: <none>
Command:
sh
Args:
-c
mkdir -p /skywalking/agent && cp -r /usr/skywalking/agent/* /skywalking/agent
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 12 Jul 2022 18:34:03 +0800
Finished: Tue, 12 Jul 2022 18:34:04 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/skywalking/agent from sw-agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hdpg5 (ro)
Containers:
cloud-bmp:
Container ID: docker://0bc7917aff37fcd791162c9e27c908060240805a9abdf107b574fdc6b02a81d4
Image: reg.kolla.org/cos/bmp:ddc41bc5
Image ID: docker-pullable://reg.kolla.org/cos/bmp@sha256:291afcd930ca57e70171f26d7f73c8ecbe032db77ea4f731af542b6a5f248651
Port: 8080/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 12 Jul 2022 18:34:09 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 200m
memory: 1Gi
......
# Print the pod's configuration as YAML
[root@master ~]# kubectl get pod cloud-bmp-7d688998f8-qprvw -n cos -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 8adb5ce7ffa7a891b28612646dacfd4f3a05084e1abd86dec9f9d5e1013ba869
    cni.projectcalico.org/podIP: 192.168.219.78/32
    cni.projectcalico.org/podIPs: 192.168.219.78/32
  creationTimestamp: "2022-07-12T10:33:47Z"
  generateName: cloud-bmp-7d688998f8-
  labels:
    app: cloud-bmp
    pod-template-hash: 7d688998f8
  name: cloud-bmp-7d688998f8-qprvw
  namespace: cos
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
......
# Filter by label selector
[root@master ~]# kubectl get pod -n cos -l app=cloud-bmp
NAME READY STATUS RESTARTS AGE
cloud-bmp-7d688998f8-qprvw 1/1 Running 0 9d
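Two further query patterns are often useful alongside label selectors: filtering on a field with `--field-selector`, and extracting specific values with `-o jsonpath`. A minimal sketch, assuming hypothetical helper names `not_running_pods` and `pod_ips`:

```shell
# not_running_pods
# Lists pods in all namespaces whose phase is not Running.
# Note that this also includes pods that finished normally (phase Succeeded).
not_running_pods() {
  kubectl get pods --all-namespaces --field-selector='status.phase!=Running'
}

# pod_ips NAMESPACE SELECTOR
# Prints one "name<TAB>IP" line for each pod matching the label selector.
pod_ips() {
  kubectl get pods -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
}
```

For example, `pod_ips cos app=cloud-bmp` would print the name and IP of the pod matched above.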
Pod logs and shell access
- View pod logs
# kubectl logs [pod name] -n [namespace]
[root@master ~]# kubectl logs cloud-bmp-7d688998f8-qprvw -n cos
[INFO ] 2022-07-21 17:50:12.695 [org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer#2-1raceId] [org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer#2-1] TaskServiceImpl - received message: {"host": "172.28.102.242", "message": {"changed": false, "skipped": true, "task": "TASK [command - shown /nasfs/orabaknas owner]"}, "playbook": true, "recordId": 116006, "success": true}
[INFO ] 2022-07-21 17:50:12.790 [org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer#2-1raceId] [org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer#2-1] TaskServiceImpl - received message: {"host": "172.28.102.242", "message": {"changed": false, "skipped": true, "task": "TASK [template - generate create oracle instance shell]"}, "playbook": true, "recordId": 116006, "success": true}
[INFO ] 2022-07-21 17:50:12.893 [org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer#2-1raceId] [org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer#2-1] TaskServiceImpl - received message: {"host": "172.28.102.242", "message": {"changed": false, "skipped": true, "task": "TASK [copy - copy pw.dmp]"}, "playbook": true, "recordId": 116006, "success": true}
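For a busy application like this one, printing the full log is rarely practical. A minimal sketch of two narrower variants, assuming hypothetical helper names `tail_logs` and `recent_logs`; `--tail`, `-f`, `--since`, `--timestamps`, and `-c` are standard `kubectl logs` flags.

```shell
# tail_logs POD NAMESPACE [CONTAINER]
# Follows the log stream, starting from the last 100 lines.
# The container argument is needed only for pods with more than one container.
tail_logs() {
  if [ -n "${3:-}" ]; then
    kubectl logs "$1" -n "$2" -c "$3" --tail=100 -f
  else
    kubectl logs "$1" -n "$2" --tail=100 -f
  fi
}

# recent_logs POD NAMESPACE [WINDOW]
# Prints log lines from the last WINDOW (default 1h), prefixed with timestamps.
recent_logs() {
  kubectl logs "$1" -n "$2" --since="${3:-1h}" --timestamps
}
```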
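- Open a shell in a pod's container
# kubectl exec -it [pod name] -c [container name] -n [namespace] -- [command] (the command can be a shell such as bash for an interactive session, or any command to run remotely; the transcript below uses the older form without --, which is why kubectl prints a deprecation warning)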
[root@master ~]# kubectl exec -it cloud-bmp-7d688998f8-qprvw -c cloud-bmp -n cos bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-4.4# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if6385: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1440 qdisc noqueue state UP
link/ether 6e:d6:24:df:1f:ee brd ff:ff:ff:ff:ff:ff
inet 192.168.219.78/32 scope global eth0
valid_lft forever preferred_lft forever
bash-4.4#
# Run a remote command without opening a shell (here, ls)
[root@master ~]# kubectl exec -it cloud-bmp-7d688998f8-qprvw -c cloud-bmp -n cos ls
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
BUILDING.txt README.md conf native-jni-lib
CONTRIBUTING.md RELEASE-NOTES include temp
LICENSE RUNNING.txt lib webapps
NOTICE bin logs work
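The deprecation warning in the transcripts above goes away once the command is separated from the kubectl options with `--`. A minimal sketch of both uses, assuming hypothetical helper names `pod_exec` and `pod_shell`:

```shell
# pod_exec POD NAMESPACE CONTAINER COMMAND [ARGS...]
# Runs a single command in the container. Everything after "--" is passed
# to the container untouched, so options such as "ls -l" are not parsed by kubectl.
pod_exec() {
  pod="$1"; ns="$2"; container="$3"
  shift 3
  kubectl exec "$pod" -n "$ns" -c "$container" -- "$@"
}

# pod_shell POD NAMESPACE CONTAINER
# Opens an interactive shell: bash if the image ships it, otherwise sh.
pod_shell() {
  kubectl exec -it "$1" -n "$2" -c "$3" -- \
    sh -c 'command -v bash >/dev/null && exec bash || exec sh'
}
```

For example, `pod_exec cloud-bmp-7d688998f8-qprvw cos cloud-bmp ls -l` is the warning-free equivalent of the remote ls shown above.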
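Copying files between a pod and the host
- Copy a file from a pod to the host
# kubectl cp [namespace]/[pod name]:[path in container] [destination path on host] (the leading / of the container path is omitted here, which avoids tar's "Removing leading /" warning; a path without it is resolved relative to the container's working directory)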
[root@master ~]# kubectl cp default/node:etc/hosts ./hosts
[root@master ~]# ll
total 258400
-rw-------. 1 root root 1461 Nov 10 2021 anaconda-ks.cfg
-rw-r--r-- 1 root root 217525 Nov 10 2021 calico.yaml
-rw------- 1 root root 7463424 Feb 23 13:30 curl.tar
drwxr-xr-x 3 root root 17 Dec 25 2021 go
-rw-r--r-- 1 root root 204 Jul 21 23:28 hosts
drwxr-xr-x 2 root root 6 Dec 24 2021 images
-rwxr-xr-x. 1 root root 444 Nov 10 2021 image.sh
drwxr-xr-x 2 root root 57 Nov 10 2021 ingress
-rw-r--r-- 1 root root 248 Nov 10 2021 ingress-http.yaml
-rw-r--r-- 1 root root 465 Dec 16 2021 kubeadm.config
-rw-r--r-- 1 root root 37261 Feb 28 11:12 nginx-9.5.18.tgz
drwxr-xr-x 3 root root 19 Jul 21 23:28 ssl
-rw-r--r-- 1 root root 1179 Mar 24 20:35 test.json
-rw-r--r-- 1 root root 256823296 Jun 8 17:50 test.tar.gz
-rw-r--r-- 1 root root 663 Dec 29 2021 test.yaml
-rw-r--r-- 1 root root 19539 Jul 21 23:26 xx.txt
drwxr-xr-x 7 root root 285 Jun 1 21:43 yy_work
[root@master ~]# more hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
192.168.219.189 node
- Copy a file from the host into a container
# kubectl cp [file path on host] [namespace]/[pod name]:[destination path in container]
[root@master ~]# kubectl cp /root/test.yaml default/node:/etc
[root@master ~]# kubectl exec -it node -n default ls more /etc/test.yaml
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
ls: more: No such file or directory
/etc/test.yaml
command terminated with exit code 1
[root@master ~]# kubectl exec -it node -n default more /etc/test.yaml
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: saturn-executor
  name: saturn-executor
spec:
  selector:
    matchLabels:
      app: saturn-executor
  template:
    metadata:
      labels:
        app: saturn-executor
    spec:
      containers:
      - name: saturn-executor
        image: reg.kolla.org/saturn/demo-java-job:0.0.2-saturn-v3.5.1
        imagePullPolicy: IfNotPresent
        #resources:
        #  limits: # maximum usage
        #    cpu: 500m
        #    memory: 2Gi
--More-- (85% of 663 bytes)
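Both copy directions can be wrapped so that the argument order is harder to mix up. A minimal sketch, assuming hypothetical helper names `pod_pull` and `pod_push`. Note that `kubectl cp` relies on the tar binary being present inside the container.

```shell
# pod_pull NAMESPACE POD REMOTE_PATH LOCAL_PATH
# Copies a file or directory from the pod to the host.
pod_pull() {
  kubectl cp "$1/$2:$3" "$4"
}

# pod_push NAMESPACE POD LOCAL_PATH REMOTE_PATH
# Copies a file or directory from the host into the pod.
pod_push() {
  kubectl cp "$3" "$1/$2:$4"
}
```

For example, `pod_push default node /root/test.yaml /etc/test.yaml` repeats the copy shown above. For a pod with several containers, `kubectl cp` accepts `-c [container name]` to pick the target.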