vLLM (1) - Qwen2 Inference & Deployment

Series Contents

vLLM (1) - Qwen2 Inference & Deployment
vLLM (2) - Architecture Overview



Preface

vLLM is an excellent inference framework for large language models. It is easy to use and offers state-of-the-art serving throughput, efficient attention key-value memory management via PagedAttention, continuous batching of incoming requests, optimized CUDA kernels, and more (summarized from the Qwen user manual).
To understand vLLM in depth, I will write a series of articles covering: 1) a first hands-on attempt, using vLLM to run inference and deploy a large model; 2) a deep dive into the source code. Since Qwen2 performs well among models of its generation and comes with a usage manual (cited above), this post uses vLLM to run inference and deploy Qwen2; the vLLM source-code analysis is left to later posts.


一、Qwen2 Preparation

  1. Clone the project code:
git clone https://github.com/QwenLM/Qwen2.git
  2. Set up the environment: the official repo does not seem to provide a complete requirements.txt; just install whatever packages are reported missing when you run the code. I reused the environment I already had for glm-4-9b-chat.
torch
transformers>=4.40.0
# ...
  3. Download the model; here I choose to pull Qwen2-7B-Instruct from ModelScope (an SDK-based alternative is sketched after these steps):
# git lfs install
git clone https://www.modelscope.cn/qwen/Qwen2-7B-Instruct.git

(Figure: downloaded model files)

  4. Modify and run the code:
  • Modify the code: the file is examples/demo/cli_demo.py; replace DEFAULT_CKPT_PATH with the local path of the model so the downloaded weights are loaded;
# DEFAULT_CKPT_PATH = 'Qwen/Qwen2-7B-Instruct'
DEFAULT_CKPT_PATH = '/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct'
  • Run the code; you can now chat with Qwen2 on the command line (this is plain native inference, not our focus, and can be skipped).
# run from the Qwen2 project root
python examples/demo/cli_demo.py --cpu-only False
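
For reference, the model can also be downloaded with the ModelScope Python SDK instead of git. This is a minimal sketch assuming the modelscope package is installed (pip install modelscope); the cache_dir value is just an example path:

# alternative download via the ModelScope SDK (assumes: pip install modelscope)
from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen2-7B-Instruct',
                              cache_dir='/home/ubuntu/Projects_ubuntu')
print(model_dir)  # point DEFAULT_CKPT_PATH at this directory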

二、Installing vLLM

Note that, as the official docs suggest, it is best to create a fresh virtual environment to avoid conflicts with your existing one. Then the latest version can be installed with:

pip install vllm

# check the installation
>>> import vllm
>>> print(vllm.__version__)
0.5.0.post1
>>> import torch
>>> print(torch.__version__)
2.3.0+cu121

三、Offline Inference

vLLM offline inference supports both single-prompt and batch inference (see the linked reference); the code is as follows:

# qwen_vllm.py
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

DEFAULT_CKPT_PATH = '/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct'

# Initialize the tokenizer
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(DEFAULT_CKPT_PATH)    # replaced with the local path

# Pass the default decoding hyperparameters of Qwen2-7B-Instruct
# max_tokens is for the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Input the model name or path. Can be GPTQ or AWQ models.
# llm = LLM(model="Qwen/Qwen2-7B-Instruct")
llm = LLM(model=DEFAULT_CKPT_PATH)   # replaced with the local path

# =================== single-prompt inference ====================
# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# generate outputs
outputs = llm.generate([text], sampling_params)
# =================== single-prompt inference ====================

# # =================== batch inference ====================
# # Prepare your prompts
# prompts = ["Tell me something about large language models.", "Please tell me in detail how to be a scientist.", "Happy birthday to me!"]
# messages = []
# for prompt in prompts:
#     message = [
#         {"role": "system", "content": "You are a helpful assistant."},
#         {"role": "user", "content": prompt}
#     ]
#     messages.append(message)

# text = tokenizer.apply_chat_template(
#     messages,
#     tokenize=False,
#     add_generation_prompt=True
# )

# # generate outputs
# outputs = llm.generate(text, sampling_params)
# # =================== batch inference ====================

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  • Single-prompt inference: [00:10<00:00, 10.44s/it, est. speed input: 2.59 toks/s, output: 49.02 toks/s]; the speed does not look particularly fast;
# console output for single-prompt inference
(vllm) ubuntu@ubuntu:~/Projects_ubuntu/Qwen2$ python qwen_vllm.py 
# this notice comes from transformers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
# v0.5.0.post1 is the vLLM version; the rest are model-related configuration options worth studying later
INFO 07-08 11:20:49 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-08 11:20:53 model_runner.py:160] Loading model weights took 14.2487 GB
INFO 07-08 11:20:59 gpu_executor.py:83] # GPU blocks: 2725, # CPU blocks: 4681
INFO 07-08 11:21:01 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-08 11:21:01 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-08 11:21:05 model_runner.py:965] Graph capturing finished in 5 secs.
Processed prompts: 100%|██████████| 1/1 [00:10<00:00, 10.44s/it, est. speed input: 2.59 toks/s, output: 49.02 toks/s]
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me something about large language models.<|im_end|>\n<|im_start|>assistant\n', Generated text: "Large language models (LLMs) are sophisticated artificial intelligence systems designed to understand and generate human-like text. They are trained on vast amounts of textual data, enabling them to learn patterns, contexts, and linguistic nuances that allow them to produce coherent and contextually appropriate responses.\n\nHere are some key aspects of large language models:\n\n1. **Training Data**: LLMs are typically trained on massive datasets containing billions of words. This includes a wide variety of texts such as books, articles, web pages, and more. The size and diversity of the training data contribute significantly to the model's ability to generalize and understand different types of language.\n\n2. **Architecture**: Modern LLMs often use transformer architectures, which were introduced by researchers at Google in 2017. Transformers enable efficient parallel processing and have been crucial in achieving state-of-the-art results in various natural language processing tasks. Other architectures like RNNs (Recurrent Neural Networks) and LSTM (Long Short-Term Memory) networks were previously used but have been largely replaced by transformers due to their superior performance and efficiency.\n\n3. **Size and Complexity**: Large language models can have billions or even trillions of parameters. These models are so complex that they require significant computational resources for both training and inference. Training an LLM can take weeks or even months on powerful hardware like GPUs (Graphics Processing Units).\n\n4. **Applications**: LLMs have a wide range of applications, including but not limited to:\n   - **Text generation**: From writing articles, stories, and poems to summarizing texts and answering questions.\n   - **Translation**: Translating text from one language to another.\n   - **Code generation**: Automatically generating code based on specifications or examples.\n   - **Dialogue systems**: Creating conversational AI for customer support, chatbots, and more.\n   - **Content creation**: Assisting content creators in generating ideas, outlines, and even full articles.\n   - **Language understanding**: Improving the accuracy of tasks like sentiment analysis, named entity recognition, and more.\n\n5. **Ethical Considerations**: As with any AI technology, large language models raise ethical concerns. These include issues related to bias, misinformation, privacy, and the potential misuse of these technologies. Ensuring that LLMs are trained on diverse and inclusive data, and developing robust mechan
  • Switching to the batch-inference branch, the three prompts together still take about 10 s [00:10<00:00, 3.51s/it, est. speed input: 7.69 toks/s, output: 102.39 toks/s], so the parallelism appears to work quite well:
# console output for batch inference
(vllm) ubuntu@ubuntu:~/Projects_ubuntu/Qwen2$ python qwen_vllm.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-08 13:36:01 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-08 13:36:03 model_runner.py:160] Loading model weights took 14.2487 GB
INFO 07-08 13:36:10 gpu_executor.py:83] # GPU blocks: 2725, # CPU blocks: 4681
INFO 07-08 13:36:11 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-08 13:36:11 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-08 13:36:15 model_runner.py:965] Graph capturing finished in 4 secs.
Processed prompts: 100%|█████████| 3/3 [00:10<00:00,  3.51s/it, est. speed input: 7.69 toks/s, output: 102.39 toks/s]
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me something about large language models.<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Large language models (LLMs) are sophisticated artificial intelligence systems designed to understand and generate human-like text. They are trained on vast amounts of textual data, which enables them to learn patterns, context, and semantics in language. Here are some key aspects of large language models:\n\n1. **Training Data**: LLMs are typically trained on massive datasets containing billions of words. This includes web pages, books, articles, and other written content. The training process involves feeding the model sequences of text, allowing it to learn statistical relationships between words and their contexts.\n\n2. **Hierarchical Structure**: These models often have multiple layers of neural networks that enable them to capture different levels of abstraction. Lower layers may focus on basic word meanings and syntactic structures, while higher layers can understand more complex concepts like narrative coherence and argumentation.\n\n3. **Generative Capabilities**: One of the most notable features of large language models is their ability to generate new text. Given a prompt or input, an LLM can produce coherent and contextually relevant responses. This capability has been leveraged in various applications, from chatbots and content generation to creative writing and code completion.\n\n4. **Continual Learning**: Some advanced LLMs can be fine-tuned with additional data or tasks, allowing them to adapt and improve their performance over time. This is known as continual learning, where the model learns from new data without forgetting what it has learned previously.\n\n5. **Ethical Considerations**: As with any AI technology, there are ethical concerns surrounding large language models. These include issues like bias, misinformation, privacy, and the potential misuse of such powerful tools. Developers and users must consider how these models impact society and work to mitigate negative consequences.\n\n6. **Applications**: Large language models find use in a wide range of applications:\n   - **Natural Language Processing (NLP)**: Used for tasks like language translation, sentiment analysis, and question answering.\n   - **Content Generation**: Writing articles, stories, and even poetry.\n   - **Code Generation**: Helping developers write code based on specifications or examples.\n   - **Research and Education**: Assisting in research by summarizing large volumes of text or generating hypotheses.\n\n7. **Accessibility and Availability**: With advancements in computing power and the democratization of AI tools, large language models are increasingly accessible to researchers, developers, and individuals. Tools like GPT-3, developed by OpenAI, and BERT by Google, have made it easier for people to experiment with and utilize these models.\n\n8. **'
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPlease tell me in detail how to be a scientist.<|im_end|>\n<|im_start|>assistant\n', Generated text: "Becoming a scientist involves a combination of education, training, and experience in a specific scientific field. Here are the steps you can follow to become a scientist:\n\n1. **Choose a field of interest**: The first step is to choose a field of science that interests you. This could be biology, chemistry, physics, environmental science, astronomy, psychology, or any other field of your choice. Researching and exploring different fields can help you find what truly interests you.\n\n2. **Obtain a bachelor's degree**: Most scientists have at least a bachelor's degree in their chosen field of science. Completing a bachelor's program will provide you with foundational knowledge in your field and introduce you to research methods and basic laboratory skills.\n\n3. **Pursue advanced degrees**: To advance your career as a scientist, you typically need to obtain a master's degree and/or a Ph.D. These advanced degrees involve more specialized study, original research, and often include teaching responsibilities. A Ph.D. is particularly important for those seeking careers in academia or research positions in industry.\n\n4. **Gain research experience**: While pursuing your degree, seek out opportunities to conduct research under the guidance of experienced scientists. This hands-on experience will help you develop critical thinking skills, learn new techniques, and contribute to the scientific community.\n\n5. **Attend conferences and workshops**: Networking with other scientists and attending conferences and workshops can help you stay up-to-date on the latest research, meet potential mentors, and expand your professional connections.\n\n6. **Publish your work**: Publishing research findings in reputable scientific journals is crucial for advancing your career. It demonstrates your ability to conduct independent research, analyze data, and communicate complex ideas effectively.\n\n7. **Seek additional training or certifications**: Depending on your field, additional training or certifications may be required or beneficial. For example, chemists might pursue certifications in specific areas of chemistry, while biologists might seek certification in laboratory safety.\n\n8. **Build a professional network**: Establishing relationships with other scientists, industry professionals, and academics can open doors to job opportunities, collaborations, and mentorship.\n\n9. **Consider interdisciplinary approaches**: Many modern scientific challenges require expertise from multiple disciplines. Consider learning about related fields to enhance your problem-solving skills and broaden your career prospects.\n\n10. **Stay committed to lifelong learning**: Science is an ever-evolving field, so continuous learning and staying informed about the latest developments in your field are essential.\n\nRemember, becoming a scientist is not just about acquiring knowledge and skills; it also involves passion, curiosity, and"
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHappy birthday to me!<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Happy Birthday to you! May this special day bring you joy, laughter, and all the happiness you deserve. Enjoy celebrating another year of your life with loved ones, and may the coming year be filled with success, good health, and memorable experiences. Cheers to you!'

四、OpenAI-Compatible API Service

Serving a large model as an API with vLLM is very convenient: a single command is enough. You can customize the address via the --host and --port arguments. There is no need to worry about the chat template, since the server uses the chat template provided by the tokenizer by default.

# --model takes the model name or path
python -m vllm.entrypoints.openai.api_server --model /home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct

The vLLM API is built on FastAPI; visit http://localhost:8000/docs to see the available endpoints:
(Figure: FastAPI docs page)
Since this is a chat scenario, we pick /v1/chat/completions; the input (Request body) is:

// API request
{
  "model": "Qwen/Qwen2-7B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ]
}

The server logs are shown below; they mainly report input/output throughput and GPU/CPU KV-cache usage.

INFO 07-08 13:52:04 async_llm_engine.py:564] Received request cmpl-4b019ccda77a482083537dc3a3dee79f: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me something about large language models.<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32741, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 40451, 752, 2494, 911, 3460, 4128, 4119, 13, 151645, 198, 151644, 77091, 198], lora_request: None.
INFO 07-08 13:52:04 metrics.py:341] Avg prompt throughput: 2.8 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-08 13:52:09 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.7%, CPU KV cache usage: 0.0%.
INFO 07-08 13:52:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.2%, CPU KV cache usage: 0.0%.
INFO 07-08 13:52:15 async_llm_engine.py:133] Finished request cmpl-4b019ccda77a482083537dc3a3dee79f.

The output is the Response body below. Besides the generated content, it contains token usage, the finish reason, tool_calls for function calling, and so on, essentially matching the OpenAI API response format.

// Response body
{
  "id": "cmpl-4b019ccda77a482083537dc3a3dee79f",
  "object": "chat.completion",
  "created": 1720417924,
  "model": "/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Large language models (LLMs) are sophisticated artificial intelligence systems designed to understand and generate human-like language. They are trained on massive amounts of text data, which enables them to learn patterns and context, making them capable of performing a wide range of natural language processing (NLP) tasks. Here are a few key aspects of large language models:\n\n1. **Training Data**: Large language models are typically trained on vast datasets, often containing billions of words. This allows them to understand a wide variety of contexts and topics, enhancing their ability to produce coherent and relevant text.\n\n2. **Size**: The term \"large\" in large language models refers to the size of the model, which can range from hundreds of millions to trillions of parameters. Larger models, like those in the trillion-parameter range, are often referred to as \"ultralarge\" models. These models can capture more complex language patterns due to their greater capacity.\n\n3. **Generative Capabilities**: LLMs can generate new text that is similar to the text they were trained on. This capability is used in various applications, such as text completion, question answering, and creative writing.\n\n4. **Context Understanding**: A key feature of LLMs is their ability to understand context. This means they can interpret the nuances of language, including sarcasm, humor, and tone, which makes their output more natural and human-like.\n\n5. **Applications**: Large language models are used in a variety of applications, including chatbots, language translation, text summarization, and content generation. They are also used in research to explore the capabilities and limitations of AI in understanding human language.\n\n6. **Ethical Considerations**: The use of large language models raises concerns about privacy, bias, and the potential misuse of the technology. Ensuring that these models are trained on diverse and inclusive datasets, and that they do not perpetuate or amplify biases, is crucial.\n\n7. **Advancements and Innovations**: The field of large language models is rapidly evolving. New architectures, techniques for fine-tuning models on specific tasks, and methods for improving their efficiency and interpretability are constantly being developed.\n\n8. **Limitations**: Despite their impressive capabilities, large language models still have limitations. They can struggle with tasks that require very specific knowledge or that involve complex reasoning beyond the scope of their training data. Additionally, they can sometimes generate text that is factually incorrect or even harmful.\n\nIn summary, large language models are powerful tools in the field of AI, designed to understand and generate human language. They are used in a variety of applications, but also require careful consideration of ethical and practical implications.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "total_tokens": 562,
    "completion_tokens": 535
  }
}
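
Beyond the docs UI, the same endpoint can be called programmatically. Below is a minimal client sketch using the openai Python package (v1+); the base_url, the placeholder api_key, and the model value (the served model name defaults to the --model path) are assumptions that match the setup above:

# minimal OpenAI-compatible client sketch (assumes: pip install openai>=1.0)
from openai import OpenAI

# any api_key string is accepted unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
)
print(response.choices[0].message.content)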

In the request above we only supplied model and messages, the two required parameters; all other parameters used their defaults. Below is the default Request body template, which also contains generation-related parameters such as top_p (an example that overrides some of them follows the template).

{
  "messages": [
    {
      "content": "string",
      "role": "system",
      "name": "string"
    },
    {
      "content": "string",
      "role": "user",
      "name": "string"
    },
    {
      "content": "string",
      "role": "assistant",
      "name": "string",
      "function_call": {
        "arguments": "string",
        "name": "string"
      },
      "tool_calls": [
        {
          "id": "string",
          "function": {
            "arguments": "string",
            "name": "string"
          },
          "type": "function"
        }
      ]
    },
    {
      "content": "string",
      "role": "tool",
      "name": "string",
      "tool_call_id": "string"
    },
    {
      "content": "string",
      "role": "function",
      "name": "string"
    },
    {
      "content": "string",
      "role": "system",
      "name": "string"
    }
  ],
  "model": "string",
  "frequency_penalty": 0,
  "logit_bias": {
    "additionalProp1": 0,
    "additionalProp2": 0,
    "additionalProp3": 0
  },
  "logprobs": false,
  "top_logprobs": 0,
  "max_tokens": 0,
  "n": 1,
  "presence_penalty": 0,
  "response_format": {
    "type": "text"
  },
  "seed": 0,
  "stop": "string",
  "stream": false,
  "stream_options": {
    "include_usage": true
  },
  "temperature": 0.7,
  "top_p": 1,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "string",
        "description": "string",
        "parameters": {}
      }
    }
  ],
  "tool_choice": "none",
  "user": "string",
  "best_of": 0,
  "use_beam_search": false,
  "top_k": -1,
  "min_p": 0,
  "repetition_penalty": 1,
  "length_penalty": 1,
  "early_stopping": false,
  "ignore_eos": false,
  "min_tokens": 0,
  "stop_token_ids": [
    0
  ],
  "skip_special_tokens": true,
  "spaces_between_special_tokens": true,
  "echo": false,
  "add_generation_prompt": true,
  "add_special_tokens": false,
  "include_stop_str_in_output": false,
  "guided_json": "string",
  "guided_regex": "string",
  "guided_choice": [
    "string"
  ],
  "guided_grammar": "string",
  "guided_decoding_backend": "string",
  "guided_whitespace_pattern": "string"
}
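
As a usage sketch, the request below overrides a few of the generation-related fields from this template (temperature, top_p, repetition_penalty, max_tokens, mirroring the offline SamplingParams earlier); it assumes the server from this section is running on the default localhost:8000 and uses the requests package:

# hedged example: override some generation defaults via the raw HTTP API
import requests

payload = {
    "model": "/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    # generation-related fields taken from the default template above
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])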

五、Quantization

Taking GPTQ as an example, first download the quantized model Qwen2-7B-Instruct-GPTQ-Int4:

git clone https://www.modelscope.cn/qwen/Qwen2-7B-Instruct-GPTQ-Int4.git

1. Offline Inference

The code is reused; only the following changes are needed:

# change the path to the quantized model
DEFAULT_CKPT_PATH = '/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct-GPTQ-Int4'
# specify quantization="gptq"
llm = LLM(model=DEFAULT_CKPT_PATH, quantization="gptq") 

Run the code:

(vllm) (base) ubuntu@ubuntu:~/Projects_ubuntu/Qwen2$ python qwen_vllm.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-08 15:20:34 gptq_marlin.py:137] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
WARNING 07-08 15:20:34 config.py:217] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-08 15:20:34 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-08 15:20:35 model_runner.py:160] Loading model weights took 5.2066 GB
INFO 07-08 15:20:39 gpu_executor.py:83] # GPU blocks: 13143, # CPU blocks: 4681
INFO 07-08 15:20:40 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-08 15:20:40 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-08 15:20:45 model_runner.py:965] Graph capturing finished in 5 secs.
Processed prompts: 100%|███████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  1.45s/it, est. speed input: 18.58 toks/s, output: 224.79 toks/s]
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me something about large language models.<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Large language models, also known as generative pre-trained models (GPTs) or large-scale language models, are sophisticated artificial intelligence systems that have been trained on massive amounts of text data to understand and generate human-like language. These models excel in various natural language processing tasks, including but not limited to:\n\n1. **Text generation**: Large language models can generate coherent and contextually relevant text based on given prompts. This capability has applications in creative writing, content generation, and even helping to write code.\n\n2. **Translation**: They can translate text from one language to another with varying degrees of accuracy, depending on the quality of the training data and the specific task at hand.\n\n3. **Summarization**: These models can provide concise summaries of lengthy documents or texts, which is useful for quickly understanding the main points of an article or report.\n\n4. **Question answering**: They can answer questions posed in natural language by searching for relevant information within the text they were trained on or in external knowledge sources.\n\n5. **Code generation**: Some models can even generate code snippets based on textual descriptions or specifications, which can be helpful for developers and programmers.\n\n6. **Dialogue systems**: They can participate in conversations, responding to questions and engaging in discussions in a way that mimics human interaction.\n\n7. **Text classification**: Large language models can categorize text into predefined classes, such as sentiment analysis (positive vs. negative), topic classification, or genre identification.\n\n8. **Named entity recognition**: They can identify and classify named entities like people, places, organizations, etc., within text.\n\n9. **Dependency parsing**: They can analyze the grammatical structure of sentences, identifying subject, object, and other parts of speech.\n\nThe development of large language models is an active area of research, driven by the need for AI systems that can better understand and interact with humans in complex and nuanced ways. Models like GPT-2, GPT-3, and their variants have pushed the boundaries of what is possible with AI-generated text, although they also raise significant ethical concerns related to bias, misinformation, and privacy.'
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPlease tell me in detail how to be a scientist.<|im_end|>\n<|im_start|>assistant\n', Generated text: "Becoming a scientist involves a combination of education, training, and experience in a specific scientific field. Here are the steps you can follow to become a scientist:\n\n1. **Education**: The first step to becoming a scientist is to complete an undergraduate degree in a science-related field such as biology, chemistry, physics, or mathematics. You can choose to study at a community college, university, or technical school. If you want to specialize further, you can pursue a graduate degree (Master's or Ph.D.) in your chosen field.\n\n2. **Gain Experience**: While studying, it's important to gain practical experience by participating in internships, research projects, or laboratory work. This will help you build skills and gain knowledge that is directly applicable to your field. Additionally, working with experienced scientists can provide valuable mentorship and guidance.\n\n3. **Develop Skills**: As a scientist, you'll need to develop various skills such as critical thinking, problem-solving, data analysis, communication, and teamwork. These skills can be honed through coursework, research projects, and practical experiences.\n\n4. **Specialize**: Choose a specific area within your chosen field where you want to focus your expertise. This could be anything from genetics to astrophysics. Specialization allows you to become an expert in a particular area and contribute unique insights to the scientific community.\n\n5. **Publish Research**: One of the key aspects of being a scientist is publishing your findings in reputable journals. This not only helps you gain recognition but also contributes to the body of knowledge in your field. Publishing requires conducting original research, analyzing data, writing papers, and presenting your findings at conferences.\n\n6. **Attend Conferences and Networking**: Attend scientific conferences, workshops, and seminars to learn about new developments in your field, network with other scientists, and potentially collaborate on future projects.\n\n7. **Obtain Funding**: Scientific research often requires funding. Learn how to write grant proposals, seek out funding opportunities, and manage grants effectively to support your research.\n\n8. **Continue Learning**: Science is a continuously evolving field. Stay updated with the latest research and developments by reading scientific literature, attending workshops, and participating in professional development activities.\n\n9. **Ethics and Integrity**: Adhere to ethical standards in scientific research, including proper data handling, acknowledging sources, and avoiding conflicts of interest. Integrity is crucial for maintaining credibility in the scientific community.\n\n10. **Career Path**: Depending on your interests and goals, you can choose various career paths such as academic research, industry research, government research"
Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHappy birthday to me!<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Happy Birthday! Wishing you a day filled with joy, love, and laughter. May this year bring you health, happiness, success, and all the dreams you aspire for. Enjoy your special day!'

The log shows that the quantization module gptq_marlin.py is used, and it hints: Use quantization=gptq_marlin for faster inference; I will check later whether that is indeed faster (a one-line sketch follows the table below). The table below compares performance before and after quantization: the model's memory footprint and the throughput each improve by more than 2x.

| model | batch_size | GPU memory (model) | input throughput | output throughput |
| --- | --- | --- | --- | --- |
| Qwen2-7B-Instruct | 3 | 14.25 GB | 7.69 toks/s | 102.39 toks/s |
| Qwen2-7B-Instruct-GPTQ-Int4 | 3 | 5.2 GB | 18.58 toks/s | 224.79 toks/s |
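
As mentioned above, the log suggests the gptq_marlin backend for this checkpoint; an untested one-line variant of the offline script would be:

# untested variant: switch to the gptq_marlin backend hinted at in the log above
from vllm import LLM

DEFAULT_CKPT_PATH = '/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct-GPTQ-Int4'
llm = LLM(model=DEFAULT_CKPT_PATH, quantization="gptq_marlin")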

2. API Service

Same as before, just changing the command to:

python -m vllm.entrypoints.openai.api_server --model /home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct-GPTQ-Int4 --quantization gptq

六、Actual GPU Memory Usage

In practice, the measured GPU memory usage is not simply proportional to the memory needed to load the model, as the table below shows (measured before any inference starts). This is a consequence of vLLM's memory pre-allocation mechanism (a sketch of the relevant knobs follows the table):
1) Besides loading the model, vLLM also initializes the KV cache; this is the pre-allocated memory, and it is why vLLM occupies more GPU memory than transformers.
2) Why is the total exactly this number, e.g. the 18 GB below, and not something else? vLLM caps the maximum memory usage with gpu_memory_utilization, which defaults to 0.9 and here works out to roughly 21.6 GB. That budget consists of the memory used to load the model (A), the peak memory consumed during inference (B), and the KV-cache memory (C). The observed 18 GB is A + C: A is known, B can be obtained from one simulated forward pass, so C can be computed; pre-allocating that amount up front yields the 18 GB footprint.

| Inference framework | Invocation | Qwen2-7B-Instruct | Qwen2-7B-Instruct-GPTQ-Int4 |
| --- | --- | --- | --- |
| vLLM | API | 18 GB | 18.5 GB |
| vLLM | API + enforce eager | 17.2 GB | 17.2 GB |
| vLLM | offline | 18 GB | 18.5 GB |
| vLLM | offline + enforce eager | 17.2 GB | 17.2 GB |
| transformers | - | 15.3 GB | 6.8 GB |
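
As a minimal sketch of the knobs discussed above (both are constructor arguments of vllm.LLM; the 0.85 value is only an example):

from vllm import LLM

DEFAULT_CKPT_PATH = '/home/ubuntu/Projects_ubuntu/Qwen2-7B-Instruct'

llm = LLM(
    model=DEFAULT_CKPT_PATH,
    gpu_memory_utilization=0.85,  # cap on the fraction of GPU memory vLLM pre-allocates (default 0.9)
    enforce_eager=True,           # skip CUDA graph capture, avoiding the extra 1~3 GiB noted in the logs
)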

Summary

  1. Some vLLM topics are not covered here, such as multi-GPU distributed deployment; I will try them once I (or my company) can afford the hardware;
  2. Judging from the official benchmark results, use a model with more parameters whenever you can: even after int4 quantization it is far stronger than a small model;
  3. The API service apparently cannot do batch inference the way the offline interface does;
  4. The int4-quantized model saves a great deal of GPU memory under transformers;
  5. Real-world testing confirms that vLLM has a clear performance advantage when handling many concurrent requests. Starting from the next post, we will analyze the source code and lift its veil.