Autoscaling-KServe

https://kserve.github.io/website/0.10/modelserving/autoscaling/autoscaling/

InferenceService with Target Concurrency

Create the InferenceService

Apply the tensorflow example CR with the scaling target set to 1. The annotation autoscaling.knative.dev/target is a soft limit rather than a strictly enforced one; if there is a sudden burst of requests, this value can be exceeded.
The scaleTarget and scaleMetric fields were introduced in KServe 0.9 and should be available in both the new schema and the old schema. This is the preferred way to define autoscaling options.

Old Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  annotations:
    autoscaling.knative.dev/target: "1"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

New Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: concurrency
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

Apply autoscale.yaml to create the autoscaled InferenceService.

kubectl

kubectl apply -f autoscale.yaml

Expected Output

$ inferenceservice.serving.kserve.io/flowers-sample created

Predict InferenceService with Concurrent Requests

The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT.
Then send traffic for 30 seconds, maintaining 5 in-flight requests, using hey (https://github.com/rakyll/hey), a simple HTTP load generator.
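A common way to set these, assuming an Istio ingress gateway exposed through a LoadBalancer Service in the istio-system namespace (adjust the namespace and Service name to your environment):

INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')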

MODEL_NAME=flowers-sample
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)

hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict

Expected Output

Summary:
  Total:    30.0193 secs
  Slowest:  10.1458 secs
  Fastest:  0.0127 secs
  Average:  0.0364 secs
  Requests/sec: 137.4449

  Total data:   1019122 bytes
  Size/request: 247 bytes

Response time histogram:
  0.013 [1] |
  1.026 [4120]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.039 [0] |
  3.053 [0] |
  4.066 [0] |
  5.079 [0] |
  6.093 [0] |
  7.106 [0] |
  8.119 [0] |
  9.133 [0] |
  10.146 [5]    |


Latency distribution:
  10% in 0.0178 secs
  25% in 0.0188 secs
  50% in 0.0199 secs
  75% in 0.0210 secs
  90% in 0.0231 secs
  95% in 0.0328 secs
  99% in 0.1501 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0002 secs, 0.0127 secs, 10.1458 secs
  DNS-lookup:   0.0002 secs, 0.0000 secs, 0.1502 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0020 secs
  resp wait:    0.0360 secs, 0.0125 secs, 9.9791 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 4126 responses

Now check the number of running pods. KServe uses the Knative Serving autoscaler, which scales based on the average number of in-flight requests per pod (concurrency). Since the scaling target is set to 1 and we load the service with 5 concurrent requests, the autoscaler tries to scale up to 5 pods. Notice that out of all the requests there are 5 on the histogram that take around 10s: that is the cold-start cost of initially spawning the pods and downloading the model before it is ready to serve. The cold start may take longer (to pull the serving image) if the image is not cached on the node that the pod is scheduled on.
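While hey is running, you can watch the pods being spawned in real time; the label selector below assumes KServe's standard serving.kserve.io/inferenceservice pod label and the default namespace:

kubectl get pods -l serving.kserve.io/inferenceservice=flowers-sample -w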

$ kubectl get pods
NAME                                                       READY   STATUS            RESTARTS   AGE
flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd         3/3     Running       0          42s
flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5         3/3     Running       0          62s
flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf         3/3     Running       0          62s
flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9         3/3     Running       0          64s
flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d         3/3     Running       0          42s

Check Dashboard

View the Knative Serving Scaling dashboards (if configured).
kubectl

kubectl port-forward --namespace knative-monitoring $(kubectl get pods --namespace knative-monitoring --selector=app=grafana  --output=jsonpath="{.items..metadata.name}") 3000
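Then open http://localhost:3000 in a browser to reach Grafana and look for the Knative Serving scaling dashboards.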

[Screenshot: Knative Serving Scaling dashboard]

InferenceService with Target QPS

Create the InferenceService

Apply the same tensorflow example CR.
kubectl

kubectl apply -f autoscale.yaml

Expected Output

$ inferenceservice.serving.kserve.io/flowers-sample created

Predict InferenceService with Target QPS

The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT, as shown above.

Send traffic for 30 seconds, maintaining 50 qps.

MODEL_NAME=flowers-sample
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)

hey -z 30s -q 50 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict

Expected Output

Summary:
  Total:    30.0264 secs
  Slowest:  10.8113 secs
  Fastest:  0.0145 secs
  Average:  0.0731 secs
  Requests/sec: 683.5644

  Total data:   5069675 bytes
  Size/request: 247 bytes

Response time histogram:
  0.014 [1] |
  1.094 [20474] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.174 [0] |
  3.254 [0] |
  4.333 [0] |
  5.413 [0] |
  6.493 [0] |
  7.572 [0] |
  8.652 [0] |
  9.732 [0] |
  10.811 [50]   |


Latency distribution:
  10% in 0.0284 secs
  25% in 0.0334 secs
  50% in 0.0408 secs
  75% in 0.0527 secs
  90% in 0.0765 secs
  95% in 0.0949 secs
  99% in 0.1334 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0001 secs, 0.0145 secs, 10.8113 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0196 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0031 secs
  resp wait:    0.0728 secs, 0.0144 secs, 10.7688 secs
  resp read:    0.0000 secs, 0.0000 secs, 0.0031 secs

Status code distribution:
  [200] 20525 responses

Now check the number of running pods. We are loading the service with 50 requests per second; from the dashboard you can see that it hits an average concurrency of 10, and the autoscaler tries to scale up to 10 pods.

Check Dashboard

View the Knative Serving Scaling dashboards (if configured).

kubectl port-forward --namespace knative-monitoring $(kubectl get pods --namespace knative-monitoring --selector=app=grafana  --output=jsonpath="{.items..metadata.name}") 3000

[Screenshot: Knative Serving Scaling dashboard]

The autoscaler calculates average concurrency over a 60-second window, so it takes a minute for it to stabilize at the desired concurrency level. However, it also calculates a 6-second panic window, and if that window reaches 2 times the target concurrency, it enters panic mode. On the dashboard you can see that it entered panic mode, in which the autoscaler operates on the shorter and more sensitive window. Once the panic conditions are no longer met for 60 seconds, it reverts back to the 60-second stable window.
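Both windows can be tuned per service with Knative's autoscaling annotations; a minimal sketch (the values shown are Knative's defaults, spelled out for illustration):

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  annotations:
    autoscaling.knative.dev/window: "60s"                        # stable window
    autoscaling.knative.dev/panic-window-percentage: "10.0"      # 10% of 60s = 6s panic window
    autoscaling.knative.dev/panic-threshold-percentage: "200.0"  # panic at 2x target concurrency
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"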

Autoscaling on GPU!

Autoscaling on GPU is hard with GPU metrics; however, thanks to Knative's concurrency-based autoscaler, scaling on GPU is pretty easy and effective!

Create the InferenceService with GPU Resource

Apply the tensorflow gpu example CR.

Old Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1

New Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1

Apply autoscale-gpu.yaml.

kubectl

kubectl apply -f autoscale-gpu.yaml

Predict InferenceService with Concurrent Requests

The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT, as shown above.

Send traffic for 30 seconds, maintaining 5 in-flight requests.

MODEL_NAME=flowers-sample-gpu
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)

hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict

Expected Output

Summary:
  Total:    30.0152 secs
  Slowest:  9.7581 secs
  Fastest:  0.0142 secs
  Average:  0.0350 secs
  Requests/sec: 142.9942

  Total data:   948532 bytes
  Size/request: 221 bytes

Response time histogram:
  0.014 [1] |
  0.989 [4286]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.963 [0] |
  2.937 [0] |
  3.912 [0] |
  4.886 [0] |
  5.861 [0] |
  6.835 [0] |
  7.809 [0] |
  8.784 [0] |
  9.758 [5] |


Latency distribution:
  10% in 0.0181 secs
  25% in 0.0189 secs
  50% in 0.0198 secs
  75% in 0.0210 secs
  90% in 0.0230 secs
  95% in 0.0276 secs
  99% in 0.0511 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0142 secs, 9.7581 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0291 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0023 secs
  resp wait:    0.0348 secs, 0.0141 secs, 9.7158 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 4292 responses

Autoscaling Customization

Autoscaling with containerConcurrency

containerConcurrency determines the number of simultaneous requests that each replica of the InferenceService can process at any given time. It is a hard limit: if concurrency reaches the hard limit, surplus requests are buffered and must wait until enough capacity is free to execute them.
Old Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    containerConcurrency: 10
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

New Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    containerConcurrency: 10
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

Apply autoscale-custom.yaml.
kubectl

kubectl apply -f autoscale-custom.yaml
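To verify that the hard limit took effect, you can inspect the generated Knative revision; this sketch assumes the revision carries KServe's serving.kserve.io/inferenceservice label:

kubectl get revisions -l serving.kserve.io/inferenceservice=flowers-sample -o jsonpath='{.items[*].spec.containerConcurrency}'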

Enable Scale To Zero

KServe sets minReplicas to 1 by default. If you want to enable scaling down to zero, especially for use cases like serving on GPUs, you can set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received.
Old Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    minReplicas: 0
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

New Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

Apply scale-down-to-zero.yaml.
kubectl

kubectl apply -f scale-down-to-zero.yaml
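Once the service has been idle for the stable window plus Knative's scale-to-zero grace period, the pods should terminate; you can confirm with (same pod-label assumption as above):

kubectl get pods -l serving.kserve.io/inferenceservice=flowers-sample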

Autoscaling Configuration at the Component Level

Autoscaling options can also be configured at the component level. This allows for more flexibility in the autoscaling configuration. In a typical deployment, a transformer may require a different autoscaling configuration than a predictor. This feature allows the user to scale individual components as required.

Old Schema

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transformer
spec:
  predictor:
    scaleTarget: 2
    scaleMetric: concurrency
    pytorch:
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier
  transformer:
    scaleTarget: 8
    scaleMetric: rps
    containers:
      - image: kserve/image-transformer:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist

New Schema

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transformer  
spec:
  predictor:
    scaleTarget: 2
    scaleMetric: concurrency
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier
  transformer:
    scaleTarget: 8
    scaleMetric: rps
    containers:
      - image: kserve/image-transformer:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist

Apply autoscale-adv.yaml to create the autoscaled InferenceService. The default value for scaleMetric is concurrency; the possible values are concurrency, rps, cpu, and memory.
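For example, to scale on CPU utilization instead of concurrency, a sketch (assuming that for the cpu metric scaleTarget is interpreted as a target utilization percentage, handled by the Kubernetes HPA rather than the Knative KPA):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample
spec:
  predictor:
    scaleTarget: 80          # illustrative: target 80% CPU utilization
    scaleMetric: cpu
    model:
      modelFormat:
        name: tensorflow
      storageUri: gs://kfserving-examples/models/tensorflow/flowers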
