
1. Introduction
As LLMs keep growing, a single GPU can no longer hold an entire model. Take Qwen-14B-Chat as an example: its weights take roughly 28 GB, while a single NVIDIA A10 has only 24 GB of GPU memory. To serve Qwen-14B-Chat on A10s, we therefore have to split the model and deploy it across two A10 nodes, with each card loading half of the model. This approach is called distributed inference.
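A quick back-of-the-envelope check of those numbers (a rough sketch; the exact parameter count and on-disk checkpoint size differ slightly):

params = 14e9        # ~14B parameters in Qwen-14B-Chat
bytes_per_param = 2  # FP16 / BF16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"total weights  : ~{weights_gb:.0f} GB")      # ~26 GB, close to the ~28 GB checkpoint
print(f"per GPU (TP=2) : ~{weights_gb / 2:.0f} GB")  # fits a 24 GB A10, leaving room for the KV cache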
The community has produced a number of frameworks that support distributed inference, such as vllm, deepspeed-mii, and rtp-llm. This article picks vllm and analyzes, from the source code, how vllm + Ray implement distributed inference for LLMs.
2. Deploying the Distributed vllm Inference Application in K8s
2.1 Preparing the Model
Download Qwen-14B-Chat to an OSS bucket and create the corresponding PV and PVC in the cluster. The PVC is named llm-model.
kubectl apply -f- << EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${your-accesskey-id} # AccessKey ID used to access OSS
  akSecret: ${your-accesskey-secret} # AccessKey Secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${your-bucket-name}
      url: ${your-bucket-endpoint} # e.g. oss-cn-hangzhou.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
2.2 Deploying the Distributed vllm Application

1. Run the following command to deploy the vllm application:
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - vllm
            topologyKey: kubernetes.io/hostname
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model
      containers:
      - name: vllm
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        command:
        - "sh"
        - "-c"
        - "sleep 7d"
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: 4
            memory: 8Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models
          name: model
EOF
2. Run the following commands to start the vllm application.

- Start Ray.

On Pod1, run:

ray start --head
# After it starts, the ray-head-address is printed in the log

On Pod2, run:

# Set <ray-head-address> to the address printed in Pod1's log
ray start --address=<ray-head-address>
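As an optional sanity check (not part of the original steps), you can attach to the cluster from a Python shell on either pod and confirm that both nodes and both GPUs are visible; the equivalent CLI check is ray status.

import ray

# Attach to the cluster started by `ray start` above and inspect it.
ray.init(address="auto")
print(ray.nodes())               # should list two alive nodes (Pod1 and Pod2)
print(ray.cluster_resources())   # should include 'GPU': 2.0 once both pods have joined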
- Run the following command to initialize the local model on Pod2:

python3 model_init.py

where model_init.py contains:

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

# Load the Qwen config and tokenizer from the shared model mount,
# with trust_remote_code enabled for the custom Qwen model code.
config = AutoConfig.from_pretrained(
    "/mnt/models/Qwen-14B-Chat",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/models/Qwen-14B-Chat", trust_remote_code=True)
- On Pod1, run the following command to start the qwen model:

python3 -m vllm.entrypoints.openai.api_server \
    --port 8080 \
    --trust-remote-code \
    --served-model-name qwen \
    --model /mnt/models/Qwen-14B-Chat \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2
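As an aside not covered in the original walkthrough, the same tensor-parallel setting can be exercised through vllm's offline Python API. This is a minimal sketch assuming the Ray cluster above is already running and using the same model path:

from vllm import LLM, SamplingParams

# tensor_parallel_size=2 makes vllm shard the model across the 2 GPUs of the Ray cluster.
llm = LLM(model="/mnt/models/Qwen-14B-Chat",
          trust_remote_code=True,
          gpu_memory_utilization=0.95,
          tensor_parallel_size=2)

outputs = llm.generate(["你好"],
                       SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512))
print(outputs[0].outputs[0].text)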
- Log in to Pod1 and call the application:

kubectl -n <your-namespace> exec -it <pod1-name> -- bash

curl -H "Content-Type: application/json" \
    http://localhost:8080/v1/chat/completions -X POST \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "你好"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'
3. Analysis of the Overall Distributed Inference Flow
1. Entry point: main in vllm/entrypoints/openai/api_server.py
if __name__ == "__main__":
# Build the engine args
engin