Author: Peng Jingtian
A Kubernetes node is composed mainly of four components: kubelet, kube-proxy, flannel, and dockerd. This article analyzes the function and inner workings of the kube-proxy component. A Pod is the smallest unit of resource allocation in Kubernetes, and also the smallest entity that runs a task.
Every Pod gets its own IP on the flannel overlay network. Pod-to-Pod communication within a node goes through the docker0 bridge; communication across nodes is handled by flannel.
Figure 1: Kubernetes cluster architecture
A Pod cannot, by itself, serve requests coming from outside the Kubernetes cluster. A Service is an abstraction over a group of Pods performing the same task: the group is exposed through the Service's IP, and the Service forwards incoming requests to the Pods.
A Pod labels itself by defining key/value pairs under metadata.labels; a Service then selects Pods carrying matching labels through spec.selector. This is how Pods performing the same task are discovered as one service.
In short, a Service routes requests from outside the cluster to the Pods behind it, and likewise serves requests that originate from other Pods inside the cluster.
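The same label-selector mechanism can be exercised by hand; a quick check, assuming the name=mnist label used in the manifests later in this article:
# List the Pods that carry the label a Service would select on:
$ kubectl get pods -l name=mnist
This prints exactly the set of Pods that a Service with spec.selector name: mnist would pick up.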
Kubernetes provides three types of Service for different scenarios (a quick kubectl illustration follows this list).
- ClusterIP: reachable only from inside the cluster, on an automatically allocated cluster IP; this is the default type.
- NodePort: reachable from outside the cluster at <NodeIP>:<NodePort>; the NodePort can be specified explicitly.
- LoadBalancer: for cloud environments; an external load balancer is provisioned and requests are served at <LoadBalancerIP>:<Port>.
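The type is just a field on the Service object. As an illustration (hypothetical pod name, not from the original article), kubectl expose can set it imperatively:
# Expose a hypothetical pod "my-pod". Omitting --type yields the default
# ClusterIP; NodePort and LoadBalancer select the other two types.
$ kubectl expose pod my-pod --port=6666 --target-port=6666 --type=NodePort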
Whichever type is used, the Service's functionality is actually implemented by the kube-proxy component.
kube-proxy currently has two implementations: userspace and iptables.
In userspace mode, proxying and load balancing are performed in user space by the kube-proxy process itself. Kubernetes originally shipped with this mode, but for performance reasons the default later moved to the iptables mode.
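The mode is chosen with kube-proxy's --proxy-mode flag; a minimal sketch (other required flags, such as the API server address, are omitted):
# Program NAT rules into the kernel (the current default):
$ kube-proxy --proxy-mode=iptables
# Legacy mode: proxy connections in user space:
$ kube-proxy --proxy-mode=userspace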
iptables
The iptables mode is built mainly on the NAT capabilities of Linux iptables. The rest of this article uses the prediction service of our deep-learning platform, TensorFlow Serving, as an example to analyze how it works.
TensorFlow Serving consists of a server and a client program. On our platform, the server runs inside the Kubernetes cluster and the clients run outside it.
Pod
Below is the Pod definition for the TensorFlow Serving server in Kubernetes, using an MNIST CNN model as the example; call the file inference-pod.yaml.
kind: Pod
apiVersion: v1
metadata:
  name: inference-pod-0
  labels:
    name: mnist
spec:
  containers:
  - name: tf-serving
    image: mind/tf-serving:0.4
    ports:
    - containerPort: xxxx
    command:
    - "./tensorflow_model_server"
    args:
    - "--model_name=mnist_CNN"
    - "--port=6666"
    - "--model_base_path=/mnt/tf_models/mnist_CNN"
    volumeMounts:
    - name: mynfs
      mountPath: /mnt
  volumes:
  - name: mynfs
    nfs:
      path: /
      server: xx.xx.xx.xx
Create inference-pod-0:
$ kubectl create -f inference-pod.yaml
Check that it is running:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
inference-pod-0 1/1 Running 0 42m
Look up the IP of inference-pod-0 (kubectl matches resource names by prefix):
$ kubectl describe po inference | grep IP
IP: xx.xx.xx.xx
Service
Define the corresponding Service, here of type NodePort; call the file inference-service.yaml:
kind: Service
apiVersion: v1
metadata:
  name: inference-service-0
spec:
  selector:
    name: mnist
  ports:
  - protocol: TCP
    port: xxxx
    targetPort: xxxx
    nodePort: xxxx
  type: NodePort
The three ports mean the following (an access-path sketch follows this list):
- port: the port on which the Service is reached from inside the cluster;
- targetPort: the container port in the Pod to which the Service forwards traffic;
- nodePort: the port on each node at which the Service is exposed to the outside.
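Putting the three together, the same service is reachable along three paths. A sketch using the article's client invocation and hypothetical addresses (the real ones are redacted above):
# Hypothetical addresses for illustration only:
#   cluster IP 10.0.0.100, node IP 192.168.1.10, pod IP 172.16.1.2
# Inside the cluster, via <ClusterIP>:<port>:
$ python tf_predict.py --server_host=10.0.0.100:6666 --model_name=mnist_CNN --input_img=sample_0.png
# From outside, via <NodeIP>:<nodePort> on any node:
$ python tf_predict.py --server_host=192.168.1.10:32000 --model_name=mnist_CNN --input_img=sample_0.png
# Directly to a single Pod, bypassing the Service, via <PodIP>:<targetPort>:
$ python tf_predict.py --server_host=172.16.1.2:6666 --model_name=mnist_CNN --input_img=sample_0.png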
Create inference-service-0:
$ kubectl create -f inference-service.yaml
Check that it is running:
$ kubectl get svc
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
inference-service-0 xx.xx.xx.xx <nodes> xxx:xxx/TCP 2h
Verify that inference-service-0 has discovered the matching inference-pod-0:
$ kubectl describe svc inference
Name: inference-service-0
Namespace: default
Labels: <none>
Selector: name=mnist
Type: NodePort
IP: xx.xx.xx.xx
Port: <unset> xxxx/TCP
NodePort: <unset> xxx/TCP
Endpoints: xx.xx.xx.xx:xxxx
Session Affinity: None
No events.
The Endpoints field shows that inference-service-0 has discovered inference-pod-0 (xx.xx.xx.xx:xxx); the IP and port at which the service is exposed are xx.xx.xx.xx and xxxx respectively.
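The same backend list is stored in an Endpoints object that shares the Service's name, and can be read directly:
# The Endpoints object lists the pod IP:port pairs behind the Service:
$ kubectl get endpoints inference-service-0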
Client
The test image is sample_0.png, a handwritten digit (the image itself is omitted here; the prediction below shows it is a 7).
On a machine outside the cluster with TensorFlow installed, issue a request directly:
$ python tf_predict.py --server_host=xx.xx.x.x:xxxx --model_name=mnist_CNN --input_img=sample_0.png
7
The correct prediction is returned.
How kube-proxy implements service discovery
kube-proxy implements service discovery with iptables NAT rules.
At this point inference-service-0 fronts a single Pod with IP xx.xx.xx.xx. Here are the iptables rules kube-proxy has written for it:
$ sudo iptables -S -t nat | grep KUBE
-N KUBE-MARK-DROP
-N KUBE-MARK-MASQ
-N KUBE-NODEPORTS
-N KUBE-POSTROUTING
-N KUBE-SEP-GYCDLIYS6Q7266WO
-N KUBE-SEP-RVISLOLI7KKADQKA
-N KUBE-SERVICES
-N KUBE-SVC-CAVPFFD4EDKETLMK
-N KUBE-SVC-NPX46M4PTMTKRN6Y
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport 32000 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport 32000 -j KUBE-SVC-CAVPFFD4EDKETLMK
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SEP-GYCDLIYS6Q7266WO -s xx.xx.xx.xx/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-GYCDLIYS6Q7266WO -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-GYCDLIYS6Q7266WO --mask xx.xx.xx.xx --rsource -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxxx
-A KUBE-SEP-RVISLOLI7KKADQKA -s xx.xx.xx.xx/32 -m comment --comment "default/inference-service-0:" -j KUBE-MARK-MASQ
-A KUBE-SEP-RVISLOLI7KKADQKA -p tcp -m comment --comment "default/inference-service-0:" -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxx
-A KUBE-SERVICES -d xx.xx.xx.xx/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d xx.xx.xx.xx/32 -p tcp -m comment --comment "default/inference-service-0: cluster IP" -m tcp --dport xxxx -j KUBE-SVC-CAVPFFD4EDKETLMK
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -j KUBE-SEP-RVISLOLI7KKADQKA
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-GYCDLIYS6Q7266WO --mask xx.xx.xx.xx --rsource -j KUBE-SEP-GYCDLIYS6Q7266WO
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-GYCDLIYS6Q7266WO
Let's now walk through these iptables rules in detail.
First, a request arriving at port xxxx of any node hits two rules in the KUBE-NODEPORTS chain: the first jumps to KUBE-MARK-MASQ, which marks the packet with 0x4000 so that the MASQUERADE rule in KUBE-POSTROUTING will SNAT it on the way out; the second jumps to KUBE-SVC-CAVPFFD4EDKETLMK.
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport xxxx -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport xxxx -j KUBE-SVC-CAVPFFD4EDKETLMK
The request therefore continues into the KUBE-SVC-CAVPFFD4EDKETLMK chain:
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -j KUBE-SEP-RVISLOLI7KKADQKA
From there it jumps to the KUBE-SEP-RVISLOLI7KKADQKA chain:
-A KUBE-SEP-RVISLOLI7KKADQKA -s xx.xx.xx.xx/32 -m comment --comment "default/inference-service-0:" -j KUBE-MARK-MASQ
-A KUBE-SEP-RVISLOLI7KKADQKA -p tcp -m comment --comment "default/inference-service-0:" -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxxx
Finally, the DNAT rule rewrites the destination, delivering the request to port xxxx of the backend Pod at xx.xx.xx.xx.
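The DNAT can be observed in the kernel's connection-tracking table while a request is in flight. A sketch, assuming the conntrack tool is installed and using the NodePort 32000 that appears in the rules above:
# Each entry shows the original destination (node IP:32000) next to the
# rewritten destination (the backend Pod's IP and port):
$ sudo conntrack -L -p tcp --dport 32000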
Now consider access from inside the cluster directly via the cluster IP; the cluster IP of inference-service-0 is xx.xx.xx.xx.
A request from inside the cluster to port xxxx of that IP jumps to the KUBE-SVC-CAVPFFD4EDKETLMK chain:
-A KUBE-SERVICES -d xx.xx.xx.xx/32 -p tcp -m comment --comment "default/inference-service-0: cluster IP" -m tcp --dport xxxx -j KUBE-SVC-CAVPFFD4EDKETLMK
It then jumps to the KUBE-SEP-RVISLOLI7KKADQKA chain and, just as in the NodePort case, DNAT delivers the request to the backend Pod.
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -j KUBE-SEP-RVISLOLI7KKADQKA
How kube-proxy implements load balancing
To see how kube-proxy uses iptables rules to implement simple load balancing, create a second Pod, inference-pod-1:
$ kubectl create -f inference-pod-1.yaml
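The article does not show inference-pod-1.yaml; a plausible reconstruction, assuming it is identical to inference-pod.yaml except for the Pod name — in particular, it must keep the name: mnist label so that the Service's selector still matches:
kind: Pod
apiVersion: v1
metadata:
  name: inference-pod-1
  labels:
    name: mnist    # same label, so inference-service-0 selects this Pod too
spec:
  containers:
  - name: tf-serving
    image: mind/tf-serving:0.4
    ports:
    - containerPort: xxxx
    command:
    - "./tensorflow_model_server"
    args:
    - "--model_name=mnist_CNN"
    - "--port=6666"
    - "--model_base_path=/mnt/tf_models/mnist_CNN"
    volumeMounts:
    - name: mynfs
      mountPath: /mnt
  volumes:
  - name: mynfs
    nfs:
      path: /
      server: xx.xx.xx.xx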
Check that inference-pod-1 is running:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
inference-pod-0 1/1 Running 0 1h
inference-pod-1 1/1 Running 0 3m
Look up the IP of inference-pod-1:
$ kubectl describe po inference-pod-1 | grep IP
IP: xx.xx.xx.xx
Check whether the backends of inference-service-0 have been updated:
$ kubectl describe svc inference
Name: inference-service-0
Namespace: default
Labels: <none>
Selector: name=mnist
Type: NodePort
IP: xx.xx.xx.xx
Port: <unset> xxxx/TCP
NodePort: <unset> xxxx/TCP
Endpoints: xx.xx.xx.xx:xxx, xx.xx.xx.xx:xxx
Session Affinity: None
No events.
The update succeeded: the IP and port of inference-pod-1 (xx.xx.xx.xx:xxx) have been appended to the endpoints.
Now look at the updated iptables rules:
-N KUBE-MARK-DROP
-N KUBE-MARK-MASQ
-N KUBE-NODEPORTS
-N KUBE-POSTROUTING
-N KUBE-SEP-D6IZJMBUD3SKR4IF
-N KUBE-SEP-GYCDLIYS6Q7266WO
-N KUBE-SEP-RVISLOLI7KKADQKA
-N KUBE-SERVICES
-N KUBE-SVC-CAVPFFD4EDKETLMK
-N KUBE-SVC-NPX46M4PTMTKRN6Y
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport xxxx -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport xxxx -j KUBE-SVC-CAVPFFD4EDKETLMK
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SEP-D6IZJMBUD3SKR4IF -s xx.xx.xx.xx/32 -m comment --comment "default/inference-service-0:" -j KUBE-MARK-MASQ
-A KUBE-SEP-D6IZJMBUD3SKR4IF -p tcp -m comment --comment "default/inference-service-0:" -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxx
-A KUBE-SEP-GYCDLIYS6Q7266WO -s xx.xx.xx.xx/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-GYCDLIYS6Q7266WO -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-GYCDLIYS6Q7266WO --mask xx.xx.xx.xx --rsource -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxxx
-A KUBE-SEP-RVISLOLI7KKADQKA -s xx.xx.xx.xx/32 -m comment --comment "default/inference-service-0:" -j KUBE-MARK-MASQ
-A KUBE-SEP-RVISLOLI7KKADQKA -p tcp -m comment --comment "default/inference-service-0:" -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxx
-A KUBE-SERVICES -d xx.xx.xx.xx/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d xx.xx.xx.xx/32 -p tcp -m comment --comment "default/inference-service-0: cluster IP" -m tcp --dport xxxx -j KUBE-SVC-CAVPFFD4EDKETLMK
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-RVISLOLI7KKADQKA
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -j KUBE-SEP-D6IZJMBUD3SKR4IF
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-GYCDLIYS6Q7266WO --mask xx.xx.xx.xx --rsource -j KUBE-SEP-GYCDLIYS6Q7266WO
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-GYCDLIYS6Q7266WO
Let's trace how a NodePort request now reaches the two backends, inference-pod-0 and inference-pod-1.
First, a request to port xxxx still jumps to the KUBE-SVC-CAVPFFD4EDKETLMK chain:
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/inference-service-0:" -m tcp --dport xxxx -j KUBE-SVC-CAVPFFD4EDKETLMK
This time, however, the chain does not jump straight to KUBE-SEP-RVISLOLI7KKADQKA. It uses the probability feature of the iptables statistic module: with probability 0.5 a request enters KUBE-SEP-RVISLOLI7KKADQKA; otherwise it falls through to the next rule and enters KUBE-SEP-D6IZJMBUD3SKR4IF.
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-RVISLOLI7KKADQKA
-A KUBE-SVC-CAVPFFD4EDKETLMK -m comment --comment "default/inference-service-0:" -j KUBE-SEP-D6IZJMBUD3SKR4IF
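More generally (an assumption about kube-proxy's rule generation, consistent with the rules above): for n endpoints, the i-th of the n rules matches with probability 1/(n-i), counting i from 0, so every endpoint receives an equal 1/n share of new connections. The 50% fall-through behaviour for two endpoints can be sanity-checked with a quick simulation:
# Emulate the rule pair: a fair coin decides between the first endpoint
# chain and the fall-through; the two counts come out roughly equal.
$ for i in $(seq 1 10000); do
    if [ $((RANDOM % 2)) -eq 0 ]; then echo pod-0; else echo pod-1; fi
  done | sort | uniq -c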
The KUBE-SEP-RVISLOLI7KKADQKA chain corresponds to inference-pod-0 at xx.xx.xx.xx:
-A KUBE-SEP-RVISLOLI7KKADQKA -p tcp -m comment --comment "default/inference-service-0:" -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxx
while the KUBE-SEP-D6IZJMBUD3SKR4IF chain corresponds to inference-pod-1 at xx.xx.xx.xx:
-A KUBE-SEP-D6IZJMBUD3SKR4IF -p tcp -m comment --comment "default/inference-service-0:" -m tcp -j DNAT --to-destination xx.xx.xx.xx:xxx
In summary, by reading the iptables rules we have gained a clear picture of how kube-proxy implements service discovery and load balancing.