NVIDIA Dynamo初体验(二)—— vllm部署之aggregate&disaggregate

作为AI小白,初体验dynamo之后,想要继续探索dynamo工具,对dynamo在RAG架构的优化有个初步理解,再探索dynamo两种部署方式,计算它们之间的性能差异。

一、RAG架构总览&Dynamo优化方向

一个典型的分布式RAG服务架构,核心组件分为4个部分:

.
├── frontend/          # OpenAI兼容的API层
├── router/            # 路由与负载均衡
├── workers/           
│   ├── prefill/       # 预填充计算(完整Prompt处理)
│   └── decode/        # Token生成(流式输出)
└── (其他如向量数据库)

一个请求的生命周期组件协作:

  1. 请求接收
    → Frontend接收用户请求 POST /chat

  2. 路由分配
    → Router识别为"预填充请求",分配至Prefill Worker

  3. 检索增强
    → Prefill Worker执行:
    a. 向量检索获取相关文档
    b. 拼接Prompt:[System] + Retrieved Docs + User Query
    c. 运行LLM前向计算,生成KV Cache

  4. 生成调度
    → Router将KV Cache转移至空闲Decode Worker

  5. 流式输出
    → Decode Worker逐个生成Token,通过Frontend流式返回

  6. 结果返回
    → Frontend包装为OpenAI格式响应,包含finish_reason: stop

dynamo的核心聚焦于Prefill阶段计算效率KV Cache传输效率Decode阶段资源调度的优化

  • Prefill阶段计算效率提升——拆分prefill计算图,将子图调度到不通CUDA流中;采用异步预取技术,提高GPU利用率;采用稀疏注意力优化注意力矩阵的计算复杂度
  • KV Cache传输改进——针对大集群部署,实现GPUDirect RDMA零拷贝传输、动态分片压缩、构建网络拓扑减少跨节点传输次数
  • Decode阶段性能提升——合并多个Decode请求为微批次提高吞吐量、持久化decode线程避免重复初始化、对长序列启用重计算机制,仅存储关键KV Cache节点
  • Prefill Decode分离——避免高负载Prefill阻塞Decode关键路径

二、Dynamo部署

配置见configs/目录下的yaml文件

  • Aggregated(集成式):
    将prefill和docode功能模块部署在同一物理节点或紧密耦合的集群中,共享硬件资源(如GPU、内存)
    同时处理
    HTTP请求
    Frontend
    Aggregated Worker
    Prefill+Decode

部署方法:

  1. 资源下载
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
  1. 搭建docker镜像
./container/build.sh --framework vllm
  1. 启动基础docker服务
    启动etcd和NATS,用于分布式协调和消息队列
docker compose -f deploy/metrics/docker-compose.yml up -d
  1. 运行vlm docker容器
./container/run.sh -it --framework vllm  
#可指定所需参数,./container/run.sh -h 查看可指定的参数
./container/run.sh -it --framework vllm --hf-cache /var/model-cache/ngc/hub/ -v /mnt:/mnt
  1. 运行dynamo进程
#已在步骤4进入到容器交互中
cd examples/llm
vim configs/agg.yaml # 修改配置文件,将model改为本地对应的路径
dynamo serve graphs.agg:Frontend -f configs/agg.yaml # 执行集中式配置部署
  • Disaggregated(分离式部署):
    将Prefill Worker和Decode Worker分开部署,分配不同的GPU资源
    HTTP请求
    Frontend
    Router
    Prefill Worker
    Decode Worker

部署方法:
1-4步与aggregated的部署方法一致
5. 运行dynamo进程

#已在步骤4进入到容器交互中
cd examples/llm
vim configs/disagg.yaml # 修改配置文件,将model改为本地对应的路径
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml # 执行解耦式配置部署

三、运行结果

  1. 服务运行记录
    启动服务后,可以观测到日志中有如下打印:

启动时打印
在另一个会话终端执行:

curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream":false,
    "max_tokens": 10000
  }' | jq

日志中打印利用率:
在这里插入图片描述

  1. aggregate & disaggregate测试
    https://github.com/vllm-project/production-stack/tree/main/benchmarks/multi-round-qa
  • dynamo-aggregate

    ==================== Performance summary ======================
    QPS: 10.0000 reqs/s
    
    Processing speed: 2.4505 reqs/s
    
    Requests on-the-fly: 0
    
    Input tokens per second: 7456.9953 tokens/s
    
    Output tokens per second: 245.0541 tokens/s
    
    Average generation throughput (per request): 5.8770 tokens/req/s
    
    Average TTFT: 92.5261s
    
    Time range: 1751624676.4032784 - 1751624917.574545 (241.17s)
    ===============================================================
    
  • dynamo-disaggregate

    ==================== Performance summary ======================
      QPS: 10.0000 reqs/s
    
      Processing speed: 2.3079 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 7022.8796 tokens/s
    
      Output tokens per second: 230.7880 tokens/s
    
      Average generation throughput (per request): 29.4287 tokens/req/s
    
      Average TTFT: 98.8401s
    
    Time range: 1751625616.446478 - 1751625872.092325 (255.65s)
    ===============================================================
    

虽然不太懂每个参数的含义,,但对比看出disaggregate中的Average generation throughput有提升,还得继续学习呀!

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值