NVIDIA Dynamo初体验(二）—— vllm部署之aggregate&disaggregate

原创已于 2025-07-07 11:25:26 修改 · 844 阅读

31 ·

CC 4.0 BY-SA版权

文章标签：

#llama #人工智能

于 2025-07-04 18:55:03 首次发布

作为AI小白，初体验dynamo之后，想要继续探索dynamo工具，对dynamo在RAG架构的优化有个初步理解，再探索dynamo两种部署方式，计算它们之间的性能差异。

一、RAG架构总览&Dynamo优化方向

一个典型的分布式RAG服务架构，核心组件分为4个部分：

.
├── frontend/          # OpenAI兼容的API层
├── router/            # 路由与负载均衡
├── workers/           
│   ├── prefill/       # 预填充计算（完整Prompt处理）
│   └── decode/        # Token生成（流式输出）
└── (其他如向量数据库)

一个请求的生命周期组件协作：

请求接收
→ Frontend接收用户请求 POST /chat
路由分配
→ Router识别为"预填充请求"，分配至Prefill Worker
检索增强
→ Prefill Worker执行：
a. 向量检索获取相关文档
b. 拼接Prompt：[System] + Retrieved Docs + User Query
c. 运行LLM前向计算，生成KV Cache
生成调度
→ Router将KV Cache转移至空闲Decode Worker
流式输出
→ Decode Worker逐个生成Token，通过Frontend流式返回
结果返回
→ Frontend包装为OpenAI格式响应，包含finish_reason: stop

dynamo的核心聚焦于Prefill阶段计算效率、KV Cache传输效率 和 Decode阶段资源调度的优化

Prefill阶段计算效率提升——拆分prefill计算图，将子图调度到不通CUDA流中；采用异步预取技术，提高GPU利用率；采用稀疏注意力优化注意力矩阵的计算复杂度
KV Cache传输改进——针对大集群部署，实现GPUDirect RDMA零拷贝传输、动态分片压缩、构建网络拓扑减少跨节点传输次数
Decode阶段性能提升——合并多个Decode请求为微批次提高吞吐量、持久化decode线程避免重复初始化、对长序列启用重计算机制，仅存储关键KV Cache节点
Prefill Decode分离——避免高负载Prefill阻塞Decode关键路径

二、Dynamo部署

配置见configs/目录下的yaml文件

Aggregated（集成式）：
将prefill和docode功能模块部署在同一物理节点或紧密耦合的集群中，共享硬件资源（如GPU、内存）

部署方法：

资源下载

git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo

搭建docker镜像

./container/build.sh --framework vllm

启动基础docker服务
启动etcd和NATS，用于分布式协调和消息队列

docker compose -f deploy/metrics/docker-compose.yml up -d

运行vlm docker容器

./container/run.sh -it --framework vllm  
#可指定所需参数，./container/run.sh -h 查看可指定的参数
./container/run.sh -it --framework vllm --hf-cache /var/model-cache/ngc/hub/ -v /mnt:/mnt

运行dynamo进程

#已在步骤4进入到容器交互中
cd examples/llm
vim configs/agg.yaml # 修改配置文件，将model改为本地对应的路径
dynamo serve graphs.agg:Frontend -f configs/agg.yaml # 执行集中式配置部署

Disaggregated（分离式部署）：
将Prefill Worker和Decode Worker分开部署，分配不同的GPU资源

部署方法：
1-4步与aggregated的部署方法一致
5. 运行dynamo进程

#已在步骤4进入到容器交互中
cd examples/llm
vim configs/disagg.yaml # 修改配置文件，将model改为本地对应的路径
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml # 执行解耦式配置部署

三、运行结果

服务运行记录
启动服务后，可以观测到日志中有如下打印：

启动时打印
在另一个会话终端执行：

curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream":false,
    "max_tokens": 10000
  }' | jq

日志中打印利用率：
在这里插入图片描述

aggregate & disaggregate测试
https://github.com/vllm-project/production-stack/tree/main/benchmarks/multi-round-qa

dynamo-aggregate

==================== Performance summary ======================
QPS: 10.0000 reqs/s

Processing speed: 2.4505 reqs/s

Requests on-the-fly: 0

Input tokens per second: 7456.9953 tokens/s

Output tokens per second: 245.0541 tokens/s

Average generation throughput (per request): 5.8770 tokens/req/s

Average TTFT: 92.5261s

Time range: 1751624676.4032784 - 1751624917.574545 (241.17s)
===============================================================

dynamo-disaggregate

==================== Performance summary ======================
  QPS: 10.0000 reqs/s

  Processing speed: 2.3079 reqs/s

  Requests on-the-fly: 0

  Input tokens per second: 7022.8796 tokens/s

  Output tokens per second: 230.7880 tokens/s

  Average generation throughput (per request): 29.4287 tokens/req/s

  Average TTFT: 98.8401s

Time range: 1751625616.446478 - 1751625872.092325 (255.65s)
===============================================================

虽然不太懂每个参数的含义，，但对比看出disaggregate中的Average generation throughput有提升，还得继续学习呀！