基于亚马逊云科技云原生服务，构建生产级别的 LLM 推理环境

亚马逊云开发者

于 2024-06-28 17:28:58 发布

阅读量78

点赞数

文章标签：科技云原生

原文链接：https://mp.weixin.qq.com/s?__biz=Mzg4NjU5NDUxNg==&mid=2247570445&idx=2&sn=6c5bd9cb48dad0207068287a3a2a97e7&chksm=ce45ebc0e5c425ef12688d222e5fd9499bdc41779432d704bf813945eb00e9192d9f848a8d1b&scene=126&sessionid=0

版权

背景介绍

近年来，随着计算力和数据量的不断提升，大语言模型（ LLM ）在自然语言处理领域取得了令人瞩目的进展，展现出了广阔的应用前景。在企业场景中，LLM 可以被应用于多个领域，例如智能问答、文本摘要、内容创作、代码生成等，为提高工作效率、优化客户体验等带来全新的可能性。越来越多的企业开始探索将 LLM 引入业务环境中。然而，企业在自有环境中部署和运行 LLM 面临诸多挑战：

部署复杂性：LLM 模型通常规模庞大，需要大量算力资源，部署运维较为困难。
扩展性限制：企业难以随着业务发展灵活扩展 LLM 推理服务的计算资源。
可观测性缺失：缺乏对 LLM 服务的监控和运维能力，难以保证服务质量。
存储管理成本高：LLM 模型文件往往体量庞大，分布式存储和管理成本高昂。

如何在企业自有环境中平滑部署并高效运行 LLM ，满足业务需求，是当前企业急需解决的问题。

总体架构

为解决企业在自有环境中部署和运行 LLM 面临的诸多挑战，我们提出了一种基于亚马逊云科技云原生服务的解决方案。该解决方案旨在为企业提供一个生产级别的 LLM 推理环境，具备良好的扩展性、可观测性以及存储管理能力。整体架构设计遵循云原生的理念，充分利用了亚马逊云科技的各种托管服务和开源工具，构建了一个可靠、可扩展、易于管理和可观测的 LLM 部署运行平台。架构图如下所示：

我们可以从 4 个层面来看这个架构设计，分别是基础设施、服务网格、应用和可观测性。基础设施层提供了云原生的资源管理能力；服务网格层负责流量管控和部署策略；应用层包含了 LLM 推理的核心功能；而可观测性层则确保了整个平台的可视化和可维护性。每层的具体组件和作用如下：

基础设施层

Amazon EKS 作为 Kubernetes 集群的基础承载层
Amazon Elastic Load Balancer 提供应用层负载均衡能力
Amazon EFS 统一管理 LLM 模型数据持久化存储
Amazon Karpenter 实现计算资源的弹性伸缩

服务网格层

Kong 作为 API 网关，实现流量控制和基本认证
Istio Service Mesh 支持灰度/金丝雀发布等

应用层

自研应用网关层处理请求转发及适配
Text Generation WebUI 、 vLLM 和 Text Generation Inference 等开源方案作为 LLM 推理引擎（算力单元）

可观测性层

Prometheus/Grafana/Loki 实现指标和日志监控
KubeSphere 和 KubeCost 提供集群管理和费用管控能力

使用该方案用户只需专注于 LLM 模型的选型和应用，底层的基础设施和运维管理完全由解决方案自动化处理，大幅降低了 LLM 服务的运维复杂度。同时该方案具备水平扩展能力，能够随时根据业务发展灵活扩缩容 LLM 推理资源，保证高性能和高可用。此外，还提供可观测性能力，确保服务质量，统一的 LLM 模型存储也使得存储管理更加便捷高效。整体部署方案代码和配置已经发布在 Github 仓库，可扫码查看。

Github 仓库

扫码了解更多

创新点

作为大语言模型在企业级场景落地的先驱性实践，该解决方案在多个技术层面做出了创新，以确保 LLM 服务的高性能、高可靠以及良好的运维体验：

1. 算力单元支持多种开源框架，包括 Text Generation WebUI 、 vLLM 和 Text Generation Inference 等，并且对开源 LLM 框架 Text Generation WebUI 进行改造，支持在 Kubernetes 环境下运行

原生的 Text Generation WebUI 项目是为单机环境设计的，不支持在 Kubernetes 集群中部署运行。我们对其进行了深度定制化改造，使其能在 Kubernetes 环境下顺利部署和运行。主要的改造工作包括：

封装成 Docker 容器化应用，实现无状态部署
将模型数据持久化到外部存储（Amazon EFS），解除和宿主机的耦合
优化配置加载和资源请求方式，适配 Kubernetes 的调度机制
修改日志输出，与 Kubernetes 日志收集组件对接

2. 支持利用 Amazon Neuron 芯片加速 LLM 推理

大语言模型的计算复杂度很高，对算力需求极大。利用亚马逊云科技推出的 Amazon Neuron 芯片（Inferentia 和 Trainium），可以大幅降低推理的延迟和成本。我们对原生的 Text Generation WebUI 进行了以下关键改造：

修改模型加载模块，添加新的加载器以支持 Amazon Neuron SDK
重构推理逻辑，使用 Amazon Neuron 的 Python API 进行推理计算
优化显存使用策略，充分发挥 Amazon Neuron 芯片的算力

3. 构建统一的 LLM 应用网关层

为了实现 LLM 服务的高可用、可扩展，我们自研了一个应用网关层，作为与 LLM 推理引擎的适配层。该网关层由 Golang 语言编写，基于 Fiber 框架构建，充当反向代理，接收客户端请求并进行协议转换和负载分发。网关层的主要创新点有：

内置服务发现机制，自动探索可用的 LLM 推理实例
实现请求级别的负载均衡和故障转移策略
支持限流、认证等网关常见功能
提供统一的指标和日志输出，与监控系统集成

实施步骤

前提条件

1、安装 kubectl

ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
eksctl version

左右滑动查看完整示意

2、安装 eksctl

# 下载1.28版本
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.28.5/2024-01-04/bin/linux/amd64/kubectl


# 下载验证文件
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.28.5/2024-01-04/bin/linux/amd64/kubectl.sha256


# 验证
sha256sum -c kubectl.sha256


# 更改执行
chmod +x ./kubectl


mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH


# 设置PATH环境变量
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc


# 查看版本
kubectl version --client

左右滑动查看完整示意

3、安装 helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version

左右滑动查看完整示意

4、安装 Amazon CLI 、配置 Amazon 权限

5、创建 Amazon EKS 集群

（1）集群创建配置文件 eks-cluster.yaml ，下述配置以 us-west-2 、 1.28 版本为例，可以根据需要自行配置

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig


metadata:
  name: eks-cluster-prod
  region: us-west-2
  version: "1.28"


kubernetesNetworkConfig:
  ipFamily: IPv4


managedNodeGroups:
- name: managed-node
  labels:
    role: co-worker
  instanceType: c6i.large
  minSize: 1
  desiredCapacity: 1
  maxSize: 3

左右滑动查看完整示意

（2）创建集群命令

eksctl create cluster -f eks-cluster.yaml

左右滑动查看完整示意

准备基础环境

安装 Amazon LoadBalancer Controller

1、创建 OIDC Provider ，需配置 CLUSTER_NAME

export cluster_name=${CLUSTER_NAME}
eksctl utils associate-iam-oidc-provider --cluster $cluster_name --approve

左右滑动查看完整示意

2、获取并创建 IAM Policy

# 获取IAM Policy
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.0/docs/install/iam_policy.json


# 创建IAM Policy
aws iam create-policy \
    --policy-name AWSLoadBalancerControllerIAMPolicy \
    --policy-document file://iam-policy.json

左右滑动查看完整示意

3、创建 Service Account ，需配置 CLUSTER_NAME 、 REGION 和

ACCOUNT_ID 等信息

export cluster_name=${CLUSTR_NAME}
export region=${REGION}
export AWS_ACCOUNT_ID=${ACCOUNT_ID}


eksctl create iamserviceaccount \
  --cluster=$cluster_name \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
  --override-existing-serviceaccounts \
  --region $region \
  --approve

左右滑动查看完整示意

4、添加并 helm repo

helm repo add eks https://aws.github.io/eks-charts
helm repo update eks

左右滑动查看完整示意

5、安装 Amazon Load Balancer Controller

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=$cluster_name \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

左右滑动查看完整示意

安装 EFS Driver

1、下载并创建 IAM Policy

curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/master/docs/iam-policy-example.json
aws iam create-policy \
    --policy-name EKS_EFS_CSI_Driver_Policy \
    --policy-document file://iam-policy-example.json

左右滑动查看完整示意

2、创建 Service Account ，需配置 CLUSTER_NAME 信息

export cluster_name=${CLUSTER_NAME}
export role_name=AmazonEKS_EFS_CSI_DriverRole
eksctl create iamserviceaccount \
    --name efs-csi-controller-sa \
    --namespace kube-system \
    --cluster $cluster_name \
    --role-name $role_name \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
    --approve
TRUST_POLICY=$(aws iam get-role --role-name $role_name --query 'Role.AssumeRolePolicyDocument' | \
    sed -e 's/efs-csi-controller-sa/efs-csi-*/' -e 's/StringEquals/StringLike/')
aws iam update-assume-role-policy --role-name $role_name --policy-document "$TRUST_POLICY"

左右滑动查看完整示意

3、添加并更新 helm repo

# 1. 添加helm repo
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/


# 2. 更新helm repo
helm repo update aws-efs-csi-driver

左右滑动查看完整示意

4、安装 EFS Driver

helm upgrade --install aws-efs-csi-driver --namespace kube-system aws-efs-csi-driver/aws-efs-csi-driver \
  --set controller.serviceAccount.create=false \
  --set controller.serviceAccount.name=efs-csi-controller-sa

左右滑动查看完整示意

5、安装 EFS Storage Class

（1）配置文件 efs-sc.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

左右滑动查看完整示意

（2）部署命令

kubectl apply -f efs-sc.yaml

左右滑动查看完整示意

安装 EBS Driver

1、创建 Service Account，需配置 CLUSTER_NAME 和 ACCOUNT_ID 信息

export cluster_name=${CLUSTER_NAME}
export AWS_ACCOUNT_ID=${ACCOUNT_ID}


eksctl create iamserviceaccount \
    --name ebs-csi-controller-sa \
    --namespace kube-system \
    --cluster $cluster_name \
    --role-name AmazonEKS_EBS_CSI_DriverRole \
    --role-only \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
    --approve

左右滑动查看完整示意

2、安装 EBS Driver

eksctl create addon --name aws-ebs-csi-driver --cluster $cluster_name --service-account-role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole  --force

左右滑动查看完整示意

3、安装 EBS Storage Class

（1）配置文件 ebs-sc.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

左右滑动查看完整示意

（2）安装命令

kubectl apply -f gp3-sc.yaml

左右滑动查看完整示意

安装 Karpenter

Karpenter 主要用于 Amazon EKS 集群的 CA 扩展
Karpenter 的安装主要包括两种方式：直接安装（适用于创建全新集群）和 Migration（适用于集群已经存在）
文章中采用第二种方式安装 Karpenter ，具体步骤可扫码参考
集群安装完成后需创建 NodePool 以指导 Karpenter 如何扩缩集群，示例如下

具体步骤

扫码了解更多

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "KarpenterNodeRole-eks-cluster-prod" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-cluster-prod # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-cluster-prod # replace with your cluster name

左右滑动查看完整示意

准备控制面环境

安装 KubeSphere

1、下载安装文件

wget https://github.com/kubesphere/ks-installer/releases/download/v3.4.1/kubesphere-installer.yaml
wget https://github.com/kubesphere/ks-installer/releases/download/v3.4.1/cluster-configuration.yaml

左右滑动查看完整示意

2、安装命令

kubectl apply -f kubesphere-installer.yaml

左右滑动查看完整示意

3、监控安装进程

kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l 'app in (ks-install, ks-installer)' -o jsonpath='{.items[0].metadata.name}') -f

左右滑动查看完整示意

4、查看安装结果

kubectl get svc -n kubesphere-system

左右滑动查看完整示意

5、查看 UI 页面：对应 ks-console 服务

安装 KubeCost

1、安装命令

helm upgrade -i kubecost oci://public.ecr.aws/kubecost/cost-analyzer --version 2.0.2 \
    --namespace kubecost --create-namespace \
    -f https://raw.githubusercontent.com/kubecost/cost-analyzer-helm-chart/develop/cost-analyzer/values-eks-cost-monitoring.yaml

左右滑动查看完整示意

2、查看安装结果

kubectl get svc -n kubecost

3、查看 UI 页面：对应 kubecost-cost-analyzer 服务

安装 Prometheus 和 Grafana

1、添加并更新 helm repo

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

左右滑动查看完整示意

2、安装 Prometheus Stack

# 1. 创建命名空间 - monitoring
kubectl create ns monitoring


# 2. 安装prometheus stack
helm install eks-kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring

左右滑动查看完整示意

3、查看安装情况

kubectl get svc -n monitoring

4、查看 UI 页面

（1）Prometheus

（2）Grafana

安装 Loki

1、Loki 架构如下，可以理解为类 Prometheus ，只不过存储的是日志而不是指标，Loki 非常轻量级且适配 K8S ，数据可以存储到 Amazon S3 实现存算分离架构

2、添加并更新 helm repo

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

左右滑动查看完整示意

3、创建配置文件 loki-custom-values.yaml ，示例如下，可以根据具体情况进行配置

loki:
  auth_enabled: false
  storage:
    type: "s3"
    s3:
      region: "us-west-2"
      accessKeyId: "xxx"
      secretAccessKey: "xxx"
    bucketNames:
      chunks: "loki-chunks"
      ruler: "loki-ruler"
      admin: "loki-admin"
  commonConfig:
    replication_factor: 1


read:
  persistence:
    storageClass: ebs-sc
  replicas: 1


write:
  persistence:
    storageClass: ebs-sc
  replicas: 1


backend:
  persistence:
    storageClass: ebs-sc
  replicas: 1


gateway:
  enabled: true
  basicAuth:
      enabled: false

左右滑动查看完整示意

4、安装 Loki ，此处选择安装 5.42.2 版本

helm upgrade loki --values loki-custom-values.yaml --namespace loki grafana/loki --version 5.42.2

左右滑动查看完整示意

5、安装 Promtail

helm install promtail --namespace loki grafana/promtail

左右滑动查看完整示意

6、登录 Grafana 查看应用日志，下图是访问应用网关时打印的日志

准备数据面环境

安装 Kong

1、安装 Gateway 和 GatewaClass

（1）配置文件 – gateway.yaml

---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: kong
  annotations:
    konghq.com/gatewayclass-unmanaged: 'true'


spec:
  controllerName: konghq.com/kic-gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kong
spec:
  gatewayClassName: kong
  listeners:
  - name: proxy
    port: 80
    protocol: HTTP

左右滑动查看完整示意

（2）安装命令

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
kubectl apply -f gateway.yaml

左右滑动查看完整示意

2、添加并更新 helm repo

helm repo add kong https://charts.konghq.com
helm repo update

左右滑动查看完整示意

3、安装 Kong

helm install kong kong/ingress. -n kong --create-namespace

左右滑动查看完整示意

部署服务网格

1、添加 helm repo 并更新

helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

左右滑动查看完整示意

2、创建命名空间

kubectl create namespace istio-system

左右滑动查看完整示意

3、安装 istio-base

helm install istio-base istio/base -n istio-system --set defaultRevision=default

左右滑动查看完整示意

4、安装 istio discovery

helm install istiod istio/istiod -n istio-system --wait

左右滑动查看完整示意

5、安装 ingress gateway

（1）配置文件 ingressgateway.yaml

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-attributes: "load_balancing.cross_zone.enabled=true"

左右滑动查看完整示意

（2）安装命令

helm install istio-ingressgateway istio/gateway -n istio-system -f ingressgateway.yaml

左右滑动查看完整示意

6、安装 istio 对应的 add-on

for ADDON in kiali jaeger prometheus grafana
do
    ADDON_URL="https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/$ADDON.yaml"
    kubectl apply -f $ADDON_URL
done

左右滑动查看完整示意

7、使用 Kiali 查看部署的应用：此处部署的 Service 仅作演示目的

部署应用网关

应用官方是使用 Fiber 框架、Go 语言开发的应用网关，项目 github （后续集成到 Amazon-examples 中并开源），可扫码查看
应用网关可以响应客户端请求、适配后端算力单元（例如 Text Generation WebUI + Nvidia GPU）组成的大语言模型推理框架、记录 API 调用日志和指标并以此实现算力单元的自动扩缩
下载代码后编译镜像

github 地址

扫码了解更多

./scripts/build_and_push.sh

4、部署 Service ，需要根据情况更改 Image

（1）配置文件 service-text-generation-webui-proxy.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-webui-proxy # Service名称
  labels:
    app: text-generation-webui-proxy    # Service自身标签
spec:
  ports:
  - port: 3000  # K8S集群内部访问Service时使用的端口
    protocol: TCP
    targetPort: 3000  # 目标Pod的监听端口
    name: http
  selector:
    app: text-generation-webui-proxy
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: text-generation-webui-proxy
  name: text-generation-webui-proxy
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 10  # 滚动更新后, 保留的历史版本数
  selector: # 找到匹配的RS
    matchLabels:
      app: text-generation-webui-proxy
  strategy: # 更新策略
    rollingUpdate:
      maxSurge: 25% 
      maxUnavailable: 25%
    type: RollingUpdate # 更新类型, 滚动更新
  template:
    metadata:
      labels:
        app: text-generation-webui-proxy
    spec:
      containers:
      - image: xxxx
        imagePullPolicy: IfNotPresent
        name: text-generation-webui-proxy
      restartPolicy: Always
      terminationGracePeriodSeconds: 30

左右滑动查看完整示意

（2）部署命令

kubectl apply -f service-text-generation-webui-proxy.yaml

左右滑动查看完整示意

5、查看部署的 Service

kubectl get all -l app=text-generation-webui-proxy

左右滑动查看完整示意

6、配置 Kong Route ，使得 Kong 接收到的请求转发给该应用网关达到暴露服务的目的

（1）配置文件 kong-route-ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: text-generation-webui-proxy
  annotations:
    konghq.com/strip-path: 'true'
spec:
  ingressClassName: kong
  rules:
  - http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: text-generation-webui-proxy
            port:
              number: 3000

左右滑动查看完整示意

（2）命令

kubectl apply -f kong-route-ingress.yaml

左右滑动查看完整示意

部署 Text Generation WebUI

1、Text Generation WebUI 是最近比较热门的开源项目，目标是 LLM 界的 SD-WebUI ，目前部分客户已经使用其作为 LLM 的推理框架

2、该方案进行了优化，使得 Text Generation WebUI 可以部署于 K8S 且对部分支持的模型可以使用 Amazon Neuron 芯片作为推理引擎

3、Amazon Neuron 芯片支持，核心修改如下：

（1）modules/models.py 加入 neuron_loader() 以支持使用 Amazon Neuron 芯片加载模型

def neuron_loader(model_name):
    from transformers_neuronx.llama.model import LlamaForSampling


    path_to_model = Path(f'{shared.args.model_dir}/{model_name}/model')
    path_to_neuron = Path(f'{shared.args.model_dir}/{model_name}/neuron_artifacts')
    path_to_tokenizer = Path(f'{shared.args.model_dir}/{model_name}/tokenizer')


    model = LlamaForSampling.from_pretrained(path_to_model, batch_size=1, tp_degree=12, amp='f16')
    model.load(path_to_neuron) # Load the compiled Neuron artifacts
    model.to_neuron() # will skip compile


    tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer)


    return model, tokenizer
  


def load_model(model_name, loader=None):
    logger.info(f"Loading {model_name}")
    t0 = time.time()


    shared.is_seq2seq = False
    shared.model_name = model_name
    load_func_map = {
        'Transformers': huggingface_loader,
        'AutoGPTQ': AutoGPTQ_loader,
        'GPTQ-for-LLaMa': GPTQ_loader,
        'llama.cpp': llamacpp_loader,
        'llamacpp_HF': llamacpp_HF_loader,
        'RWKV': RWKV_loader,
        'ExLlama': ExLlama_loader,
        'ExLlama_HF': ExLlama_HF_loader,
        'ExLlamav2': ExLlamav2_loader,
        'ExLlamav2_HF': ExLlamav2_HF_loader,
        'ctransformers': ctransformers_loader,
        'AutoAWQ': AutoAWQ_loader,
        'QuIP#': QuipSharp_loader,
        'HQQ': HQQ_loader,
        'Neuron': neuron_loader
    }

左右滑动查看完整示意

（2）modules/models_setting.py ，修改模型设置以支持采用 neuron_loader 进行模型加载

def infer_loader(model_name, model_settings):
    path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
    if not path_to_model.exists():
        loader = None
    elif (path_to_model / 'quantize_config.json').exists() or ('wbits' in model_settings and type(model_settings['wbits']) is int and model_settings['wbits'] > 0):
        loader = 'ExLlama_HF'
    elif (path_to_model / 'quant_config.json').exists() or re.match(r'.*-awq', model_name.lower()):
        loader = 'AutoAWQ'
    elif len(list(path_to_model.glob('*.gguf'))) > 0:
        loader = 'llama.cpp'
    elif re.match(r'.*\.gguf', model_name.lower()):
        loader = 'llama.cpp'
    elif re.match(r'.*rwkv.*\.pth', model_name.lower()):
        loader = 'RWKV'
    elif re.match(r'.*exl2', model_name.lower()):
        loader = 'ExLlamav2_HF'
    elif re.match(r'.*-hqq', model_name.lower()):
        return 'HQQ'
    elif re.match(r'.*-neuron', model_name.lower()):
        return 'Neuron'
    else:
        loader = 'Transformers'


    return loader

左右滑动查看完整示意

（3）modules/text_generation.py ，修改 encode() 函数，不使用 CUDA device 加载

def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_length=None):
    ...
    return input_ids

左右滑动查看完整示意

（4）modules/text_generation.py ，修改 generate_reply_HF() 函数，使用 Amazon Neuron SDK 进行模型推理

def generate_with_callback(callback=None, *args, **kwargs):
  neuron_kwargs = dict()
    neuron_kwargs['input_ids'] = kwargs['inputs']
    neuron_kwargs['top_k'] = kwargs['top_k']
    neuron_kwargs['top_p'] = kwargs['top_p']
    neuron_kwargs['temperature'] = kwargs['temperature'] 
    neuron_kwargs['eos_token_override']=kwargs['eos_token_id']
    neuron_kwargs['sequence_length']=kwargs['max_new_tokens']


    kwargs['stopping_criteria'].append(Stream(callback_func=callback))


    clear_torch_cache()
    with torch.inference_mode():
      shared.model.sample(neuron_kwargs)

左右滑动查看完整示意

4、K8S 支持

（1）生成 Dockerfile 文件

# BUILDER
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 as builder
WORKDIR /builder
ARG TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST:-3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX}"
ARG BUILD_EXTENSIONS="${BUILD_EXTENSIONS:-}"
ARG APP_UID="${APP_UID:-6972}"
ARG APP_GID="${APP_GID:-6972}"


RUN apt update && \
    apt install --no-install-recommends -y git vim build-essential python3-dev pip bash curl net-tools && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/app/
# RUN git clone https://github.com/oobabooga/text-generation-webui.git
COPY . /home/app/text-generation-webui
WORKDIR /home/app/text-generation-webui
RUN GPU_CHOICE=A USE_CUDA118=FALSE LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
# COPY CMD_FLAGS.txt /home/app/text-generation-webui/
EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
WORKDIR /home/app/text-generation-webui
# set umask to ensure group read / write at runtime
CMD umask 0002 && export HOME=/home/app/text-generation-webui && ./start_linux.sh

左右滑动查看完整示意

（2）生成镜像并上传 – bash 脚本

#!/usr/bin/env bash


aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com


# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.


# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
# The name of our algorithm
algorithm_name=text-generation-webui


account=$(aws sts get-caller-identity --query Account --output text)


# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}


fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"


# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1


if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi


# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}


# Build the docker image locally with the image name and then push it to ECR
# with the full name.


docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}


docker push ${fullname}

左右滑动查看完整示意

5、部署服务

（1）配置文件（service-text-generation-webui.yaml）：根据具体情况修改 Image 和 EFS Claim

---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-webui # Service名称
  namespace: default
  labels:
    app: text-generation-webui    # Service自身标签
spec:
  ports:
  - port: 5000  # K8S集群内部访问Service时使用的端口
    protocol: TCP
    targetPort: 5000  # 目标Pod的监听端口
    name: api
  - port: 5005
    protocol: TCP
    targetPort: 5005
    name: api-stream
  - port: 7860
    protocol: TCP
    targetPort: 7860
    name: web
  selector:
    app: text-generation-webui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: text-generation-webui
  name: text-generation-webui
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 10  # 滚动更新后, 保留的历史版本数
  selector: # 找到匹配的RS
    matchLabels:
      app: text-generation-webui
  strategy: # 更新策略
    rollingUpdate:
      maxSurge: 25% 
      maxUnavailable: 25%
    type: RollingUpdate # 更新类型, 滚动更新
  template:
    metadata:
      labels:
        app: text-generation-webui
    spec:
      tolerations:
        - key: "gpu-load"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
      - image: xxx
        imagePullPolicy: IfNotPresent
        name: text-generation-webui
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: persistent-storage-for-models
            mountPath: /home/app/text-generation-webui/models
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      volumes:
      - name: persistent-storage-for-models
        persistentVolumeClaim:
          claimName: efs-claim-text-generation-webui

左右滑动查看完整示意

（2）将模型文件下载并拷贝到 EFS 对应路径，例如截图中是 “TheBloke_Llama-2-7B-Chat-AWQ” 模型

（3）部署命令

kubectl apply -f service-text-generation-webui.yaml

左右滑动查看完整示意

集群自动扩缩

1、大语言模型推理的自动扩缩是一个比较复杂的问题，需要考虑方方面面的因素，包括但不限于底层的算力机类型、模型的大小、模型是否量化、输入和输出 token 数量、推理的参数设定以及特定框架的参数等，因此设计一个适配各种场景的完美方案非常困难。比较合理的扩缩指标信息包括“单位时间请求数”、“请求响应时长”等等

2、本方案中采用“单位时间请求数”作为 Kubernetes deployment 扩缩的依据，下面以 Text Generation Inference 为例讲解，配置文件如下。由于只是作为展示，因此设置的扩展指标非常敏感 – 每 2 秒有一个请求就会扩展。生产中应根据实际情况配置

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
metadata:
  name: llama2-13b-chat-awq
  namespace: tgi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-engine-llama2-13b-chat-awq
  minReplicas: 1
  maxReplicas: 3
  metrics:
  # use a "Pods" metric, which takes the average of the
  # given metric across all pods controlled by the autoscaling target
  - type: Pods
    pods:
      metric:
        name: tgi_request_per_second
      # target 500 milli-requests per second,
      # which is 1 request every two seconds
      target:
        type: Value
        averageValue: 500m
  behavior: # 这里是重点
    scaleDown:
      stabilizationWindowSeconds: 300 # 需要缩容时，先观察5分钟，如果一直持续需要缩容才执行缩容
      policies:
      - type: Percent
        value: 100 # 允许全部缩掉
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0 # 需要扩容时，立即扩容
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15 # 每15s最大允许扩容当前1倍数量的Pod
      - type: Pods
        value: 4
        periodSeconds: 15 # 每15s最大允许扩容 4 个 Pod
      selectPolicy: Max # 使用以上两种扩容策略中算出来扩容Pod数量最大的

左右滑动查看完整示意

3、查看相关 Deployment 初始配置 – 此时没有任何请求，初始值为 0

4、当压力增加，自动扩展 deploy 的 pod 数量，同时通过 Karpenter 出发点集群节点的扩展

5、当压力下降持续 5 分钟后， pod 自动进行收缩，同时通过 Karpenter 触发集群节点的收缩

方案验证

1、目前方案提供了 HTTP 接口（ OpenAI-Compatible ）以便于调用 LLM 推理能力

2、调用命令 v1/completions/ 接口

curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'

左右滑动查看完整示意

3、调用截图

总结

通过基于亚马逊云科技云原生服务的解决方案，我们为企业在自有环境中平滑部署和高效运行大型语言模型提供了一种创新的实践方式。该解决方案遵循云原生理念，融合了多种亚马逊云科技基础服务和开源工具，构建了一个功能完备、灵活可扩展、易于运维的 LLM 部署运行平台。

在技术层面，我们将开源 LLM 框架 Text Generation WebUI 、 vLLM 和 Text Generation Inference 部署于 Amazon EKS 集群实现 LLM 的推理，同时对开源 LLM 框架 Text Generation WebUI 进行了多方位的创新改造，使其能够在 Amazon EKS 集群中稳定运行，同时充分利用 Amazon Neuron 加速芯片的算力，极大提升了推理性能。另外，我们自研的统一 LLM 应用网关层为服务注入了高可用、负载均衡等生产级能力，形成了一个端到端的 LLM 部署和运维体系，显著降低了企业应用 LLM 能力的复杂度和总体拥有成本。

本篇作者

高郁

亚马逊云科技解决方案架构师，主要负责企业客户上云，帮助客户进行云架构设计和技术咨询，专注于智能湖仓、人工智能与机器学习等技术方向。

梁宇辉

亚马逊云科技机器学习产品技术专家，负责基于亚马逊云科技的机器学习方案的咨询与设计，专注于机器学习的推广与应用，深度参与了很多真实客户的机器学习项目的构建以及优化。对于深度学习模型分布式训练，推荐系统和计算广告等领域具有丰富经验。

刘磊

亚马逊云科技解决方案架构师。

星标不迷路，开发更极速！

关注后记得星标「亚马逊云开发者」

点击阅读原文查看博客，获得更详细内容

听说，点完下面4个按钮

就不会碰到bug了！

亚马逊云开发者

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
基于亚马逊云科技云原生服务，构建生产级别的 LLM 推理环境

背景介绍近年来，随着计算力和数据量的不断提升，大语言模型（ LLM ）在自然语言处理领域取得了令人瞩目的进展，展现出了广阔的应用前景。在企业场景中，LLM 可以被应用于多个领域，例如智能问答、文本摘要、内容创作、代码生成等，为提高工作效率、优化客户体验等带来全新的可能性。越来越多的企业开始探索将 LLM 引入业务环境中。然而，企业在自有环境中部署和运行 LLM 面临诸多挑战：部署复杂性：LLM 模...
复制链接

扫一扫