LLMs for Code: Qwen2.5-Coder, a detailed guide to its introduction, installation, usage, and example applications

Author: 一个处女座的程序猿

Last modified: 2024-11-17 11:30:25

Category: NLP/LLMs. Tags: Qwen2.5-Coder, large language models

First published: 2024-11-17 00:04:52

Article link: https://blog.csdn.net/qq_41185868/article/details/143825828

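Overview: The paper introduces the Qwen2.5-Coder series, a family of powerful open-source large language models for code.

>> Background and pain points: shortcomings of existing code LLMs. Open-source code LLMs such as StarCoder, CodeLlama, DeepSeek-Coder, CodeQwen1.5, and CodeStral perform well on coding evaluations, but they still trail the latest closed-source models (Claude-3.5-Sonnet, GPT-4o) in code generation, completion, reasoning, and repair.

>> Proposed solution: the paper presents the Qwen2.5-Coder series, six models of different sizes (0.5B/1.5B/3B/7B/14B/32B). The series builds on the Qwen2.5 architecture with the following improvements:

● Large-scale pretraining data: a code-specific pretraining corpus of more than 5.5T tokens, covering source code, text-code grounding data, synthetic data, math data, and text data. Data cleaning uses multi-stage filtering that combines weak model classifiers and scorers. Pretraining proceeds in two stages, file-level and repository-level, to ensure comprehensive coverage.

● Carefully designed instruction-tuning dataset: to turn the model into a coding assistant, the paper builds a high-quality instruction-tuning dataset of code-related problems and solutions, drawn from real-world applications and from synthetic data generated by code LLMs. Techniques used to build it include multilingual programming-language identification, instruction synthesis from GitHub, multilingual code instruction generation (a multi-agent collaboration framework), checklist-based scoring of instruction data, and multilingual sandbox code verification.

● Data mixing strategy: to balance coding ability against general language understanding, the paper carefully mixes code, math, and text data, settling on a final ratio of 70% code, 20% text, and 10% math.

● Decontamination: to avoid inflated results from test-set leakage, the pretraining and post-training datasets were decontaminated by removing key datasets such as HumanEval, MBPP, GSM8K, and MATH.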
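
The decontamination step above can be illustrated with a minimal sketch. It assumes an n-gram overlap check, which is one common way to decontaminate training data; the window size n=10, the whitespace tokenization, and the function names are assumptions made for illustration, not details stated in this summary.

```python
def ngrams(text, n=10):
    """Return the set of word-level n-grams in text (whitespace tokenization)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(train_samples, test_samples, n=10):
    """Drop every training sample that shares an n-gram with any test sample."""
    test_grams = set()
    for sample in test_samples:
        test_grams |= ngrams(sample, n)
    return [s for s in train_samples if ngrams(s, n).isdisjoint(test_grams)]


# Toy data with n=3 so the overlap is easy to see.
test_set = ["def add(a, b): return a + b"]
train_set = [
    "def add(a, b): return a + b  # copied from a benchmark",
    "def mul(x, y): return x * y",
]
clean = decontaminate(train_set, test_set, n=3)
print(clean)
```

Only the second training sample survives here, because the first one repeats word sequences from the toy test set.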

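>> Core steps: Qwen2.5-Coder is trained in three stages:

● Stage 1, file-level pretraining: pretrains on sequences of at most 8,192 tokens, with next-token prediction and Fill-in-the-Middle (FIM) as objectives.

● Stage 2, repository-level pretraining: extends the context length to 32,768 tokens and uses the YaRN mechanism to support sequences of up to 131,072 tokens; the objectives are again next-token prediction and repository-level FIM.

● Stage 3, instruction tuning: uses the curated instruction-tuning dataset with a coarse-to-fine fine-tuning strategy, combining supervised fine-tuning with Direct Preference Optimization (DPO); DPO draws on a multilingual code sandbox and an LLM acting as judge.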
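
To make the FIM objective in stages 1 and 2 more concrete, here is a minimal sketch of turning one document into a prefix-suffix-middle training string. The sentinel tokens are the ones listed later in this article; the random character-level split and the function name make_fim_sample are assumptions for illustration, and the sampling strategy actually used in training may differ.

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"


def make_fim_sample(document, rng):
    """Split document at two random character offsets and reorder it as prefix, suffix, middle."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees the prefix and suffix first and learns to generate the middle last.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


doc = "def square(x):\n    return x * x\n"
sample = make_fim_sample(doc, random.Random(0))
print(sample)
```

Because the middle segment is moved to the end, an ordinary left-to-right model can learn infilling with the same next-token loss it uses everywhere else.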

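>> Advantages: the Qwen2.5-Coder series achieves strong results on code generation tasks, reaching state-of-the-art levels on multiple benchmarks and, on some tasks, even surpassing larger models.

● Strong code generation: state-of-the-art performance on multiple code benchmarks covering generation, completion, reasoning, and repair. It leads models of comparable size and, according to the paper, outperforms some larger ones.

● Multilingual support: performs well across many programming languages, with balanced performance between them.

● Solid math and general language ability: also performs well on mathematical reasoning and general natural-language understanding.

● Long context: supports inputs of up to 128K tokens.

● Open source: released under a permissive license, making it easy for developers to use.

>> Conclusions and takeaways:

● Large-scale, high-quality data and a carefully designed training strategy are essential for building a strong code LLM.

● Scaling up (both data and model size) is key to building a strong code LLM. The open release of the models should advance research in code intelligence and support wider adoption by developers in real-world applications.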

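Translation and commentary on the "Qwen2.5-Coder Technical Report"

Introduction to Qwen2.5-Coder

1. Qwen2.5-Coder features

2. Model list

3. Special tokens and their token ids

4. Model evaluation

5. Training strategy

Installing and using Qwen2.5-Coder

1. Installation

2. Usage

(1) Chatting with Qwen2.5-Coder-32B-Instruct (instruct model)

(2) Code completion with Qwen2.5-Coder-32B (base model)

(3) Handling long inputs (more than 32,768 tokens) with YaRN

(4) File-level code completion ("fill-in-the-middle")

(5) Repository-level code completion

3. Deployment

(1) Deploying Qwen2.5-Coder with vLLM

Offline batched inference

Multi-GPU distributed serving

(2) Gradio interface for a better experience

Example applications of Qwen2.5-Coder

1. Basic usage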
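Translation and commentary on the "Qwen2.5-Coder Technical Report"

Link	Paper: https://arxiv.org/abs/2409.12186
Date	September 18, 2024
Authors	Qwen (Tongyi Qianwen) team, Alibaba
Abstract	In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. The series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built on the Qwen2.5 architecture and continues pretraining on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. The models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, and consistently outperforming even larger models. We believe the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.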

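Introduction to Qwen2.5-Coder

Released in November 2024, Qwen2.5-Coder is the code version of the Qwen2.5 large language model series developed by the Qwen team at Alibaba Cloud. It is a powerful, diverse, and practical open-source code LLM (Open CodeLLM), formerly known as CodeQwen1.5.

GitHub repository: QwenLM/Qwen2.5-Coder

1. Qwen2.5-Coder features

>> Powerful: Qwen2.5-Coder-32B-Instruct is presented as the state-of-the-art open-source code model at the time of release, with coding ability comparable to GPT-4o. It shows strong, well-rounded coding ability along with good general and math skills.
>> Diverse: building on the two previously open-sourced sizes, 1.5B and 7B, this release adds four more: 0.5B, 3B, 14B, and 32B. Qwen2.5-Coder now covers six mainstream model sizes to meet the needs of different developers.
>> Practical: the project explores the usefulness of Qwen2.5-Coder in two scenarios, code assistants and Artifacts, and provides examples that show its potential in real-world applications.
>> Long-context understanding and generation: supports a context length of 128K tokens.
>> Many programming languages: supports 92 programming languages (listed below) while retaining the math and general-purpose strengths of the base model.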
['ada', 'agda', 'alloy', 'antlr', 'applescript', 'assembly', 'augeas', 'awk', 'batchfile', 'bluespec', 'c', 'c#', 'c++', 'clojure', 'cmake', 'coffeescript', 'common-lisp', 'css', 'cuda', 'dart', 'dockerfile', 'elixir', 'elm', 'emacs-lisp', 'erlang', 'f#', 'fortran', 'glsl', 'go', 'groovy', 'haskell', 'html', 'idris', 'isabelle', 'java', 'java-server-pages', 'javascript', 'json', 'julia', 'jupyter-notebook', 'kotlin', 'lean', 'literate-agda', 'literate-coffeescript', 'literate-haskell', 'lua', 'makefile', 'maple', 'markdown', 'mathematica', 'matlab', 'objectc++', 'ocaml', 'pascal', 'perl', 'php', 'powershell', 'prolog', 'protocol-buffer', 'python', 'r', 'racket', 'restructuredtext', 'rmarkdown', 'ruby', 'rust', 'sas', 'scala', 'scheme', 'shell', 'smalltalk', 'solidity', 'sparql', 'sql', 'stan', 'standard-ml', 'stata', 'swift', 'systemverilog', 'tcl', 'tcsh', 'tex', 'thrift', 'typescript', 'verilog', 'vhdl', 'visual-basic', 'vue', 'xslt', 'yacc', 'yaml', 'zig']

2. Model list

Model name	Type	Context length	Download
Qwen2.5-Coder-0.5B	base	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-1.5B	base	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-3B	base	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-7B	base	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-14B	base	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-32B	base	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-0.5B-Instruct	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-1.5B-Instruct	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-3B-Instruct	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-7B-Instruct	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-14B-Instruct	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-32B-Instruct	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-0.5B-Instruct-AWQ	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-0.5B-Instruct-GGUF	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int4	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int8	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-1.5B-Instruct-AWQ	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-1.5B-Instruct-GGUF	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int8	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-3B-Instruct-AWQ	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-3B-Instruct-GGUF	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-3B-Instruct-GPTQ-Int4	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-3B-Instruct-GPTQ-Int8	instruct	32k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-7B-Instruct-AWQ	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-7B-Instruct-GGUF	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-7B-Instruct-GPTQ-Int4	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-7B-Instruct-GPTQ-Int8	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-14B-Instruct-AWQ	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-14B-Instruct-GGUF	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-14B-Instruct-GPTQ-Int4	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-14B-Instruct-GPTQ-Int8	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-32B-Instruct-AWQ	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-32B-Instruct-GGUF	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-32B-Instruct-GPTQ-Int4	instruct	128k	🤗 Hugging Face • 🤖 ModelScope
Qwen2.5-Coder-32B-Instruct-GPTQ-Int8	instruct	128k	🤗 Hugging Face • 🤖 ModelScope

3. Special tokens and their token ids

To stay consistent with Qwen2.5, the special tokens and their token ids were updated. The new special tokens are:

{
  "<|fim_prefix|>": 151659, 
  "<|fim_middle|>": 151660, 
  "<|fim_suffix|>": 151661, 
  "<|fim_pad|>": 151662, 
  "<|repo_name|>": 151663, 
  "<|file_sep|>": 151664, 
  "<|im_start|>": 151644, 
  "<|im_end|>": 151645
}

4. Model evaluation

5. Training strategy

Figure 2: The three-stage training pipeline of Qwen2.5-Coder.

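Installing and using Qwen2.5-Coder

1. Installation

Requires Python 3.9 or later and transformers>4.37.0 (the Qwen2 code has been integrated into transformers since version 4.37.0). Install the required packages from a clone of the GitHub repository with: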

pip install -r requirements.txt
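
As a quick sanity check of the requirements above, the following sketch reports whether the running interpreter and the installed transformers package look new enough. The helper names parse_version and check_environment are hypothetical, and the simple numeric version parsing is an assumption that ignores pre-release ordering.

```python
import sys
from importlib import metadata


def parse_version(version):
    """Turn '4.37.0' or '4.46.0.dev0' into a tuple of leading integers, e.g. (4, 37, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(min_python=(3, 9), min_transformers="4.37.0"):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python 3.9 or later is required")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        # The stated requirement is strictly greater than 4.37.0.
        if parse_version(installed) <= parse_version(min_transformers):
            problems.append(f"transformers {installed} is too old; need > {min_transformers}")
    return problems


print(check_environment() or "environment looks fine")
```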

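2. Usage

The models are mainly called through the transformers library, and usage differs by task type.
Qwen2.5-Coder-[0.5-32]B-Instruct are instruction models for chatting;
Qwen2.5-Coder-[0.5-32]B are base models typically used for completion, and they serve as a better starting point for fine-tuning.

(1) Chatting with Qwen2.5-Coder-32B-Instruct (instruct model)

Load the model and tokenizer with AutoModelForCausalLM and AutoTokenizer, convert the messages into the format the model expects with apply_chat_template, then call generate to produce the reply. The max_new_tokens argument caps the length of the response.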

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "write a quick sort algorithm."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

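The apply_chat_template() function converts the messages into a format the model understands. The add_generation_prompt argument appends a generation prompt, namely <|im_start|>assistant\n, to the input. Notably, the chat models use the ChatML template, following earlier practice. The max_new_tokens argument sets the maximum length of the response, and tokenizer.batch_decode() decodes the response. As for the input, the messages above are an example of how to format the dialogue history and the system prompt. Instruct models of other sizes can be used in the same way.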
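
To show roughly what the ChatML format looks like, here is a simplified sketch of the rendering. It is not the tokenizer's actual template, which ships with the model's tokenizer configuration and may handle more cases (for example a default system prompt or tool calls); the function name render_chatml is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in ChatML style."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its reply.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "write a quick sort algorithm."},
]
print(render_chatml(messages))
```

In practice, prefer tokenizer.apply_chat_template, since it always matches the template the model was trained with.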

(2) Code completion with Qwen2.5-Coder-32B (base model)

Load the model and tokenizer, then call generate to complete the code. The max_new_tokens argument caps the output length.

from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" # the device to load the model onto

# Now you do not need to add "trust_remote_code=True"
TOKENIZER = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B")
MODEL = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B", device_map="auto").eval()

# tokenize the input into tokens
input_text = "#write a quick sort algorithm"
model_inputs = TOKENIZER([input_text], return_tensors="pt").to(device)

# Use `max_new_tokens` to control the maximum output length.
generated_ids = MODEL.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=False)[0]
# The generated_ids include prompt_ids, so we only need to decode the tokens after prompt_ids.
output_text = TOKENIZER.decode(generated_ids[len(model_inputs.input_ids[0]):], skip_special_tokens=True)

print(f"Prompt: {input_text}\n\nGenerated text: {output_text}")

(3) Handling long inputs (more than 32,768 tokens) with YaRN

Inputs longer than 32,768 tokens can be handled with YaRN. To enable it, add the following configuration to config.json:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
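
The configuration change above could also be applied programmatically. The sketch below is one way to do it with the standard library; the helper name enable_yarn and the throwaway demo file are assumptions for illustration. The returned value, factor times original_max_position_embeddings, is the nominal extended window (4.0 × 32,768 = 131,072 tokens).

```python
import json
import os
import tempfile


def enable_yarn(config_path, factor=4.0, original_max=32768):
    """Add the rope_scaling block to a config.json file and return the nominal extended length."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config["rope_scaling"] = {
        "factor": factor,
        "original_max_position_embeddings": original_max,
        "type": "yarn",
    }
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return int(factor * original_max)


# Demo on a throwaway file; a real config.json has many more keys.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"model_type": "qwen2"}, f)
    extended = enable_yarn(path)
    with open(path, encoding="utf-8") as f:
        patched = json.load(f)

print(extended)
```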

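(4) File-level code completion ("fill-in-the-middle")

The code insertion task, also called the "fill-in-the-middle" challenge, requires inserting a code segment that bridges the gap within a given code context. As a best-practice approach, the recommended format follows the guidelines in the paper "Efficient Training of Language Models to Fill in the Middle" (arXiv). It uses three special tokens, <|fim_prefix|>, <|fim_suffix|>, and <|fim_middle|>, to mark the respective segments of the code structure. The prompt should be structured as follows: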

prompt = '<|fim_prefix|>' + prefix_code + '<|fim_suffix|>' + suffix_code + '<|fim_middle|>'

from transformers import AutoTokenizer, AutoModelForCausalLM
# load model
device = "cuda" # the device to load the model onto

TOKENIZER = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B")
MODEL = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B", device_map="auto").eval()

input_text = """<|fim_prefix|>def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    <|fim_suffix|>
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)<|fim_middle|>"""

model_inputs = TOKENIZER([input_text], return_tensors="pt").to(device)

# Use `max_new_tokens` to control the maximum output length.
generated_ids = MODEL.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=False)[0]
# The generated_ids include prompt_ids, we only need to decode the tokens after prompt_ids.
output_text = TOKENIZER.decode(generated_ids[len(model_inputs.input_ids[0]):], skip_special_tokens=True)

print(f"Prompt: {input_text}\n\nGenerated text: {output_text}")
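
In an editor-style workflow, the prefix and suffix usually come from splitting a file at the cursor. The sketch below builds the prompt that way; the helper name build_fim_prompt, the sample source, and the cursor position are hypothetical.

```python
def build_fim_prompt(source, cursor):
    """Build a fill-in-the-middle prompt for inserting code at character offset cursor."""
    prefix_code, suffix_code = source[:cursor], source[cursor:]
    return "<|fim_prefix|>" + prefix_code + "<|fim_suffix|>" + suffix_code + "<|fim_middle|>"


source = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = source.index("    ") + 4  # hypothetical cursor: end of the empty function body line
prompt = build_fim_prompt(source, cursor)
print(prompt)
```

The text the model generates after <|fim_middle|> would then be spliced back into the file at the same cursor offset.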

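(5) Repository-level code completion

Two special tokens, <|repo_name|> and <|file_sep|>, represent the repository structure.

Important: Qwen2.5-Coder-[0.5-32]B-Instruct are instruction models for chatting; Qwen2.5-Coder-[0.5-32]B are base models, typically used for code completion and a better starting point for fine-tuning. The special tokens and their ids were updated to stay consistent with Qwen2.5; they are listed in section 3 of the introduction. A repository-level prompt has the following structure: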

input_text = f'''<|repo_name|>{repo_name}
<|file_sep|>{file_path1} 
{file_content1}
<|file_sep|>{file_path2} 
{file_content2}'''
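
For repositories with more than two files, the template above can be generated from a list. This is a minimal sketch; the helper name build_repo_prompt and the toy file contents are hypothetical, and the last file is deliberately left unfinished so the model continues it.

```python
def build_repo_prompt(repo_name, files):
    """Join (path, content) pairs into a repository-level prompt."""
    parts = [f"<|repo_name|>{repo_name}"]
    for path, content in files:
        parts.append(f"<|file_sep|>{path}\n{content}")
    return "\n".join(parts)


files = [
    ("utils.py", "def double(x):\n    return 2 * x\n"),
    ("main.py", "from utils import double\n\nprint("),  # last file is left unfinished
]
prompt = build_repo_prompt("demo-repo", files)
print(prompt)
```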



from transformers import AutoTokenizer, AutoModelForCausalLM
device = "cuda" # the device to load the model onto

# Now you do not need to add "trust_remote_code=True"
TOKENIZER = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B")
MODEL = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B", device_map="auto").eval()

# tokenize the input into tokens
input_text = """<|repo_name|>library-system
<|file_sep|>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<|file_sep|>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<|file_sep|>main.py
from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)
    
    # Set up a student
    student = Student("Alice", "S1")
    
    # Student borrows a book
"""
model_inputs = TOKENIZER([input_text], return_tensors="pt").to(device)

# Use `max_new_tokens` to control the maximum output length.
generated_ids = MODEL.generate(model_inputs.input_ids, max_new_tokens=1024, do_sample=False)[0]
# The generated_ids include prompt_ids, so we only need to decode the tokens after prompt_ids.
output_text = TOKENIZER.decode(generated_ids[len(model_inputs.input_ids[0]):], skip_special_tokens=True)

print(f"Prompt: \n{input_text}\n\nGenerated text: \n{output_text}")



The expected output is:
Generated text:
    book = library.find_book("1234567890")
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")

    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")

    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()

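3. Deployment

(1) Deploying Qwen2.5-Coder with vLLM

Offline batched inference

As a member of the Qwen2.5 family, Qwen2.5-Coder is also supported by vLLM. A detailed tutorial is available in the Qwen tutorial. Below is a simple example of offline batched inference with vLLM.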

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B")

# Pass the default decoding hyperparameters of Qwen1.5-32B-Chat
# max_tokens is for the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=1024)

# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model="Qwen/Qwen2.5-Coder-32B")

# Prepare your prompts
prompt = "#write a quick sort algorithm.\ndef quick_sort("

# generate outputs
outputs = llm.generate([prompt], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

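Multi-GPU distributed serving

To scale up serving throughput, distributed serving lets you use more GPU devices. Inference on ultra-long sequences may also exhaust GPU memory. The line below runs Qwen2.5-Coder-32B with tensor parallelism by passing the tensor_parallel_size argument.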

llm = LLM(model="Qwen/Qwen2.5-Coder-32B", tensor_parallel_size=8)
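
As a rough illustration of why tensor parallelism eases memory pressure, the sketch below estimates the weight memory per GPU. It assumes a nominal 32 billion parameters stored in 16-bit precision (2 bytes each) and an even split across GPUs; it ignores the KV cache, activations, and framework overhead, so real usage will be higher. The function name is hypothetical.

```python
def weight_gib_per_gpu(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate weight memory per GPU in GiB, assuming an even split of the weights."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 2**30


for tp in (1, 2, 4, 8):
    gib = weight_gib_per_gpu(32e9, tensor_parallel_size=tp)
    print(f"tensor_parallel_size={tp}: ~{gib:.1f} GiB of weights per GPU")
```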

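(2) Gradio interface for a better experience

# Switch to the chatbot demo directory
cd demo/chatbot/
# Linux and Windows users, and macOS users on Intel processors, run:
python app.py

# macOS users on Apple Silicon run the following (Intel not supported; this may be 20x slower than an RTX 4090):
PYTORCH_ENABLE_MPS_FALLBACK=1 python app.py

# Switch to the artifacts demo directory, which also provides a Gradio interface
cd demo/artifacts/
# Run the app
python app.py

# Options such as --server_port, --share, and --server_name can be specified as needed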