vLLM for LLM Inference Serving and Client Access, Part 3: Streaming

flyfish

Streaming access

This post shows how to request streaming output from a vLLM OpenAI-compatible server, including how to parse the server-sent events (SSE) it returns.

```python
import requests
import json

# Configuration
BASE_URL = "http://0.0.0.0:8000/v1"
API_KEY = "token-abc123"
MODEL_NAME = "LLM-Research/Meta-Llama-3-8B-Instruct"
INPUT_CONTENT = "Who are you?"

# Request headers
HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

# Request body
DATA = {
    "model": MODEL_NAME,
    "messages": [
        {"role": "user", "content": INPUT_CONTENT}
    ],
    "top_k": 50,  # sample only from the 50 highest-probability tokens (a vLLM extension to the OpenAI schema)
    "temperature": 0.7,  # randomness of the generated text
    "max_tokens": 10,  # maximum number of tokens to generate
    "presence_penalty": 0.1,  # penalize tokens that have already appeared
    "frequency_penalty": 0.1,  # penalize tokens in proportion to their frequency
    "stream": True  # request a streamed (SSE) response
}

def send_request(url, headers, data):
    """Send the POST request and return the streaming response."""
    response = requests.post(url, headers=headers, json=data, stream=True)
    return response

def process_stream_response(response):
    """Process the streaming response and print the generated text in real time."""

    generated_text = ""
    for line in response.iter_lines():
        if line:
            decoded_line = line.decode('utf-8')
            print(f"Received data: {decoded_line}")
            # SSE payload lines are prefixed with "data:"; the stream ends with "[DONE]"
            if decoded_line.startswith('data:'):
                json_data = decoded_line[len('data:'):].strip()
                if json_data == '[DONE]':
                    break
                data = json.loads(json_data)
                choices = data.get('choices', [])
                if choices:
                    # each chunk carries the newly generated fragment in delta.content
                    delta = choices[0].get('delta', {})
                    new_text = delta.get('content', '')
                    generated_text += new_text
                    print(new_text, end='')
    return generated_text



def main():
    url = f"{BASE_URL}/chat/completions"
    response = send_request(url, HEADERS, DATA)
    
    if response.status_code == 200:
        generated_text = process_stream_response(response)
        print("Final Generated Text:", generated_text)
    else:
        print(f"Request failed with status code {response.status_code}: {response.text}")

if __name__ == "__main__":
    main()
```
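The same endpoint can also be consumed with the official `openai` Python SDK (v1 interface), which handles the SSE parsing for you. A minimal sketch, assuming `pip install openai` and reusing the server address, API key, and model name from the script above:

```python
from openai import OpenAI

# Point the SDK at the vLLM server instead of api.openai.com
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="token-abc123")

stream = client.chat.completions.create(
    model="LLM-Research/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=10,
    temperature=0.7,
    stream=True,  # request chunked deltas instead of one final message
)

generated_text = ""
for chunk in stream:
    if not chunk.choices:  # some chunks may carry no choices
        continue
    delta = chunk.choices[0].delta.content or ""
    generated_text += delta
    print(delta, end="", flush=True)

print("\nFinal Generated Text:", generated_text)
```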

Received data

The script prints every raw line it receives and also prints each token with `print(new_text, end='')`, so the two interleave in the console output:

Received data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
Received data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"I"},"logprobs":null,"finish_reason":null}]}
IReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" am"},"logprobs":null,"finish_reason":null}]}
 amReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" L"},"logprobs":null,"finish_reason":null}]}
 LReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"La"},"logprobs":null,"finish_reason":null}]}
LaReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"MA"},"logprobs":null,"finish_reason":null}]}
MAReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":","},"logprobs":null,"finish_reason":null}]}
,Received data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" an"},"logprobs":null,"finish_reason":null}]}
 anReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" AI"},"logprobs":null,"finish_reason":null}]}
 AIReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" assistant"},"logprobs":null,"finish_reason":null}]}
 assistantReceived data: data: {"id":"chat-1","object":"chat.completion.chunk","created":1730428960,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" developed"},"logprobs":null,"finish_reason":"length","stop_reason":null}]}
 developedReceived data: data: [DONE]
Final Generated Text: I am LLaMA, an AI assistant developed


Streamed data

The raw SSE lines, exactly as the server sends them:

data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"I"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" am"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" L"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"La"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"MA"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":","},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" an"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" AI"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-1","object":"chat.completion.chunk","created":1730429430,"model":"LLM-Research/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" developed"},"logprobs":null,"finish_reason":"length","stop_reason":null}]}
data: [DONE]
Final Generated Text: I am LLaMA, an AI assistant developed

After parsing

The `delta.content` fragment extracted from each chunk:

I
 am
 L
La
MA
,
 an
 AI
 assistant
 developed
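Each chunk also carries a `finish_reason`; it stays `null` until the final content chunk, where it is `"length"` here because generation hit `max_tokens: 10`. A small check, using the last raw line above as input:

```python
import json

# The final data chunk from the raw stream above (payload after the "data: " prefix)
raw = ('{"id":"chat-1","object":"chat.completion.chunk","created":1730429430,'
       '"model":"LLM-Research/Meta-Llama-3-8B-Instruct",'
       '"choices":[{"index":0,"delta":{"content":" developed"},'
       '"logprobs":null,"finish_reason":"length","stop_reason":null}]}')

choice = json.loads(raw)["choices"][0]
print(choice["delta"]["content"])  # " developed"
print(choice["finish_reason"])     # "length": the max_tokens limit was reached
```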
### On vLLM's streaming inference mechanism

#### Overview

vLLM enables streaming inference through a request parameter, letting the client receive the model's output in real time as it is generated. For a large model such as Qwen2, the feature is switched on by passing `"stream": true` over the HTTP interface[^1].

#### How it works

To use streaming inference, structure the request as follows:

- **API request structure**: besides the model to use (`model`) and the conversation history (`messages`), the request must explicitly select streaming transfer mode (`stream: true`).

```json
{
  "model": "qwen-7b-chat",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about the weather today"},
    {"role": "assistant", "content": ""}
  ],
  "stream": true
}
```

- **Response handling**: once streaming output is enabled, the server returns the partially completed answer to the frontend in batches; each newly received fragment is pushed to the user interface immediately, producing a continuous, fluid text display (a reusable generator version of this loop is sketched after this section).

```python
import requests

url = 'http://localhost:8000/v1/chat/completions'
data = {
    "model": "qwen-7b-chat",
    "messages": [{"role": "user", "content": "What is your favorite color?"}],
    "stream": True,
}

# stream=True is needed on the client side too, so chunks can be read as they arrive
response = requests.post(url, json=data, stream=True)
for chunk in response.iter_lines():
    if chunk:
        decoded_line = chunk.decode('utf-8')
        print(decoded_line)
```

The code above sends a simple POST request to a deployed vLLM instance and prints its incrementally produced reply, line by line.
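For UI integration, it is often convenient to wrap this parsing loop in a generator that yields only the text deltas. A minimal sketch under the same assumptions (OpenAI-compatible chunk format, `requests` on the client side); `stream_deltas` is an illustrative helper name, not a vLLM or OpenAI API:

```python
import json
import requests

def stream_deltas(url, payload, headers=None):
    """Yield content fragments from an OpenAI-compatible streaming endpoint."""
    with requests.post(url, headers=headers, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            decoded = line.decode('utf-8')
            if not decoded.startswith('data:'):
                continue
            body = decoded[len('data:'):].strip()
            if body == '[DONE]':  # end-of-stream sentinel
                break
            chunk = json.loads(body)
            choices = chunk.get('choices', [])
            if choices:
                text = choices[0].get('delta', {}).get('content', '')
                if text:
                    yield text

# Usage: print fragments as they arrive, or collect the full answer:
# full_text = "".join(stream_deltas(url, data))
```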