Autoscale InferenceService with Inference Workload
https://kserve.github.io/website/0.10/modelserving/autoscaling/autoscaling/
InferenceService with target concurrency
Create InferenceService
Apply the tensorflow example CR with the scaling target set to 1. The annotation autoscaling.knative.dev/target is a soft limit rather than a strictly enforced limit; if there is a sudden burst of requests, this value can be exceeded.
scaleTarget and scaleMetric were introduced in version 0.9 of KServe and are available in both the new and the old schema. They are the preferred way of defining autoscaling options.
Old schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  annotations:
    autoscaling.knative.dev/target: "1"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
New schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: concurrency
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
Apply autoscale.yaml to create the autoscaled InferenceService.
kubectl apply -f autoscale.yaml
Expected Output
$ inferenceservice.serving.kserve.io/flowers-sample created
Predict InferenceService with concurrent requests
The first step is to determine the ingress IP and port, and set INGRESS_HOST and INGRESS_PORT.
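For example, with an Istio ingress gateway exposed through a LoadBalancer, the values can be looked up as follows (a sketch; adjust the namespace and service name to match your ingress setup):

INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')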
Send traffic for 30 seconds, maintaining 5 in-flight requests.
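The load below is generated with the hey load generator; if it is not already installed, it can be fetched with Go:

go install github.com/rakyll/hey@latest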
MODEL_NAME=flowers-sample
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
Summary:
  Total:        30.0193 secs
  Slowest:      10.1458 secs
  Fastest:      0.0127 secs
  Average:      0.0364 secs
  Requests/sec: 137.4449

  Total data:   1019122 bytes
  Size/request: 247 bytes

Response time histogram:
  0.013  [1]    |
  1.026  [4120] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.039  [0]    |
  3.053  [0]    |
  4.066  [0]    |
  5.079  [0]    |
  6.093  [0]    |
  7.106  [0]    |
  8.119  [0]    |
  9.133  [0]    |
  10.146 [5]    |

Latency distribution:
  10% in 0.0178 secs
  25% in 0.0188 secs
  50% in 0.0199 secs
  75% in 0.0210 secs
  90% in 0.0231 secs
  95% in 0.0328 secs
  99% in 0.1501 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0002 secs, 0.0127 secs, 10.1458 secs
  DNS-lookup:   0.0002 secs, 0.0000 secs, 0.1502 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0020 secs
  resp wait:    0.0360 secs, 0.0125 secs, 9.9791 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 4126 responses
Check the number of running pods now. KServe uses the Knative Serving autoscaler, which scales based on the average number of in-flight requests per pod (concurrency). As the scaling target is set to 1 and we load the service with 5 concurrent requests, the autoscaler tries scaling up to 5 pods. Notice that out of all the requests there are 5 requests on the histogram that take around 10s: that is the cold start cost of initially spawning the pods and downloading the model before the service is ready. The cold start may take longer (to pull the serving image) if the image is not cached on the node that the pod is scheduled onto.
$ kubectl get pods
NAME                                                       READY   STATUS    RESTARTS   AGE
flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd   3/3     Running   0          42s
flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5   3/3     Running   0          62s
flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf   3/3     Running   0          62s
flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9   3/3     Running   0          64s
flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d   3/3     Running   0          42s
Check the dashboard
View the Knative Serving Scaling dashboards (if configured).
kubectl port-forward --namespace knative-monitoring $(kubectl get pods --namespace knative-monitoring --selector=app=grafana --output=jsonpath="{.items..metadata.name}") 3000
InferenceService with target QPS
Create the InferenceService
Apply the same tensorflow example CR.
kubectl apply -f autoscale.yaml
Expected Output
$ inferenceservice.serving.kserve.io/flowers-sample created
Predict InferenceService with target QPS
The first step is to determine the ingress IP and port, and set INGRESS_HOST and INGRESS_PORT.
Send traffic for 30 seconds, maintaining 50 QPS.
MODEL_NAME=flowers-sample
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
hey -z 30s -q 50 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
Summary:
  Total:        30.0264 secs
  Slowest:      10.8113 secs
  Fastest:      0.0145 secs
  Average:      0.0731 secs
  Requests/sec: 683.5644

  Total data:   5069675 bytes
  Size/request: 247 bytes

Response time histogram:
  0.014  [1]     |
  1.094  [20474] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.174  [0]     |
  3.254  [0]     |
  4.333  [0]     |
  5.413  [0]     |
  6.493  [0]     |
  7.572  [0]     |
  8.652  [0]     |
  9.732  [0]     |
  10.811 [50]    |

Latency distribution:
  10% in 0.0284 secs
  25% in 0.0334 secs
  50% in 0.0408 secs
  75% in 0.0527 secs
  90% in 0.0765 secs
  95% in 0.0949 secs
  99% in 0.1334 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0001 secs, 0.0145 secs, 10.8113 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0196 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0031 secs
  resp wait:    0.0728 secs, 0.0144 secs, 10.7688 secs
  resp read:    0.0000 secs, 0.0000 secs, 0.0031 secs

Status code distribution:
  [200] 20525 responses
Check the number of running pods now. We load the service with 50 requests per second, and from the dashboard you can see that it hits an average concurrency of 10, so the autoscaler tries scaling up to 10 pods.
Check the dashboard
View the Knative Serving Scaling dashboards (if configured).
kubectl port-forward --namespace knative-monitoring $(kubectl get pods --namespace knative-monitoring --selector=app=grafana --output=jsonpath="{.items..metadata.name}") 3000
The autoscaler calculates the average concurrency over a 60 second window, so it takes a minute to stabilize at the desired concurrency level. However, it also calculates a 6 second panic window and enters panic mode if that window reaches 2x the target concurrency. From the dashboard you can see that it entered panic mode, in which the autoscaler operates on the shorter and more sensitive window. Once the panic conditions are no longer met for 60 seconds, the autoscaler returns to the 60 second stable window.
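Both windows are tunable cluster-wide through Knative's config-autoscaler ConfigMap. A minimal sketch, assuming Knative Serving is installed in the knative-serving namespace; the values shown are the Knative defaults, which produce the 60s/6s behavior described above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  stable-window: "60s"                  # averaging window for stable mode
  panic-window-percentage: "10.0"       # panic window as a percentage of the stable window (6s here)
  panic-threshold-percentage: "200.0"   # enter panic mode at 2x the target concurrency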
Autoscaling on GPU!
Autoscaling on GPUs is hard to do with GPU metrics; however, thanks to Knative's concurrency-based autoscaler, scaling on GPUs is simple and effective!
Create the InferenceService with GPU resource
Apply the tensorflow gpu example CR.
Old schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
New schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
Apply autoscale-gpu.yaml.
kubectl apply -f autoscale-gpu.yaml
Predict InferenceService with concurrent requests
The first step is to determine the ingress IP and port, and set INGRESS_HOST and INGRESS_PORT.
Send traffic for 30 seconds, maintaining 5 in-flight requests.
MODEL_NAME=flowers-sample-gpu
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
Summary:
  Total:        30.0152 secs
  Slowest:      9.7581 secs
  Fastest:      0.0142 secs
  Average:      0.0350 secs
  Requests/sec: 142.9942

  Total data:   948532 bytes
  Size/request: 221 bytes

Response time histogram:
  0.014 [1]    |
  0.989 [4286] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.963 [0]    |
  2.937 [0]    |
  3.912 [0]    |
  4.886 [0]    |
  5.861 [0]    |
  6.835 [0]    |
  7.809 [0]    |
  8.784 [0]    |
  9.758 [5]    |

Latency distribution:
  10% in 0.0181 secs
  25% in 0.0189 secs
  50% in 0.0198 secs
  75% in 0.0210 secs
  90% in 0.0230 secs
  95% in 0.0276 secs
  99% in 0.0511 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0142 secs, 9.7581 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0291 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0023 secs
  resp wait:    0.0348 secs, 0.0141 secs, 9.7158 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 4292 responses
Autoscaling Customization
Autoscaling with containerConcurrency
containerConcurrency determines the number of simultaneous requests each replica of the InferenceService can process at any given time. It is a hard limit: once concurrency reaches this limit, surplus requests are buffered and must wait until enough capacity is free to execute them.
Old schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    containerConcurrency: 10
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
New schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    containerConcurrency: 10
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
Apply autoscale-custom.yaml.
kubectl apply -f autoscale-custom.yaml
Enable scale down to zero
KServe sets minReplicas to 1 by default. If you want to enable scaling down to zero, especially for use cases like serving on GPUs, you can set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received.
Old schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    minReplicas: 0
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
New schema
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
Apply scale-down-to-zero.yaml.
kubectl apply -f scale-down-to-zero.yaml
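Once the service has been idle long enough for the stable window to elapse (plus a grace period), the predictor pods should terminate. One way to watch this happen, assuming the serving.kserve.io/inferenceservice pod label that KServe applies to InferenceService pods (verify against your KServe version):

kubectl get pods -l serving.kserve.io/inferenceservice=flowers-sample -w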
Autoscaling configuration at component level
Autoscaling options can also be configured at the component level. This allows more flexibility in the autoscaling configuration: in a typical deployment, transformers may require a different autoscaling configuration than predictors. This feature allows the user to scale individual components as required.
Old schema
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transformer
spec:
  predictor:
    scaleTarget: 2
    scaleMetric: concurrency
    pytorch:
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier
  transformer:
    scaleTarget: 8
    scaleMetric: rps
    containers:
      - image: kserve/image-transformer:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist
New schema
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transformer
spec:
  predictor:
    scaleTarget: 2
    scaleMetric: concurrency
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier
  transformer:
    scaleTarget: 8
    scaleMetric: rps
    containers:
      - image: kserve/image-transformer:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist
Apply autoscale-adv.yaml to create the autoscaled InferenceService. The default value of scaleMetric is concurrency; the possible values are concurrency, rps, cpu and memory.
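As a sketch of one of the non-default metrics, the spec below (hypothetical values) scales the predictor on average CPU utilization instead of concurrency. For the cpu and memory metrics, scaling is handled by the Kubernetes HPA rather than the default Knative KPA, so CPU utilization is measured relative to the container's resource request, which must therefore be set:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "flowers-sample"
spec:
  predictor:
    scaleTarget: 80      # hypothetical: target 80% average CPU utilization
    scaleMetric: cpu
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      resources:
        requests:
          cpu: "1"       # utilization is computed relative to this request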