Deploying LLMs with TensorRT-LLM

xiaomu_347

Last modified: 2024-06-20 10:52:25

First published: 2024-06-16 16:15:41

Article link: https://blog.csdn.net/xiaomu_347/article/details/139720850

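As large models have surged in popularity, the models deployed to production have grown steadily larger (from billions to hundreds of billions of parameters), and inference cost has risen sharply with them. A number of inference frameworks have appeared in response, aiming to reduce inference latency and raise throughput. TensorRT-LLM gives users an easy-to-use Python API to define large language models (LLMs) and build TensorRT engines that run inference efficiently on NVIDIA GPUs. It also contains Python and C++ runtime components for executing those engines, plus a backend for integrating with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can run on anything from a single GPU to multiple nodes with multiple GPUs, using tensor parallelism and pipeline parallelism.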
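To make the idea of tensor parallelism concrete, here is a toy sketch in plain Python. It is not how TensorRT-LLM implements it; the function names are made up for illustration, and the "GPUs" are just list shards. The weight matrix is split by columns, each shard computes its slice of the output, and the slices are concatenated (the role an all-gather plays on real hardware).

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split the columns of w into `parts` contiguous, equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must be divisible by the shard count"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Simulate column-wise tensor parallelism across `parts` devices."""
    shards = split_columns(w, parts)
    # Each simulated device multiplies the full input by its own weight shard.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenate the per-device output slices along the column axis.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, parts=2))
print(matmul(x, w))  # same result without sharding
```

Each shard only needs its own slice of the weights, which is why this style of parallelism lets a model that does not fit on one GPU be spread over several.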

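The architecture of the TensorRT-LLM Python API looks similar to the PyTorch API. It offers a functional module containing functions such as einsum, softmax, matmul, and view. The layers module bundles useful building blocks for assembling LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, such as GPTAttention or BertAttention, can be found in the models module.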

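To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed in different quantization modes. It supports INT4 or INT8 weight quantization (also called INT4/INT8 weight-only quantization) as well as a complete implementation of the SmoothQuant technique. TensorRT-LLM also optimizes the performance of a range of well-known models on NVIDIA GPUs. GitHub link: https://github.com/NVIDIA/TensorRT-LLM/tree/main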

I. Background: why TensorRT-LLM was created

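First, large models have many parameters and are expensive to serve. Take a 10B-parameter model as an example: deploying it in FP16 needs about 20 GB for the weights alone (10B parameters × 2 bytes each), and more once the KV cache and other runtime buffers are counted.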
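The arithmetic behind that figure can be sketched as follows. This is a back-of-the-envelope estimate only; the function names and the model shape used for the KV cache (40 layers, 40 KV heads, head dimension 128) are hypothetical values chosen for illustration, not taken from any specific model.

```python
# Bytes used to store one value in each data type.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e9


def kv_cache_memory_gb(num_layers, num_kv_heads, head_dim,
                       seq_len, batch_size, dtype="fp16"):
    """Memory for the KV cache: one K and one V tensor per layer."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * BYTES_PER_VALUE[dtype] / 1e9


weights = weight_memory_gb(10e9, "fp16")
# Hypothetical shape, for illustration only.
kv = kv_cache_memory_gb(num_layers=40, num_kv_heads=40, head_dim=128,
                        seq_len=2048, batch_size=8)
print(f"weights: {weights:.1f} GB, KV cache: {kv:.1f} GB")
```

The weights are a fixed cost, while the KV cache grows with both batch size and sequence length, which is why it often dominates at high concurrency.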

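Second, plain TensorRT is fairly complex to use, and ONNX has a size limit. Deep learning models are usually trained and deployed with various frameworks (PyTorch, TensorFlow, Keras, and so on), each with its own model representation and storage format, so developers commonly use ONNX to solve interoperability between frameworks. With TensorRT, for example, a PyTorch model is first converted to ONNX, and the ONNX model is then converted to TensorRT. Beyond that, the data usually has to be aligned as well, which means writing plugins and modifying the ONNX graph to fit the TensorRT plugins. In addition, ONNX uses Protobuf as the serialization format for its model files. Protobuf is a lightweight, efficient data interchange format, but a single serialized message cannot exceed 2 GB. When you try to load or modify an ONNX model larger than 2 GB, you run into that limit.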

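Third, plain FasterTransformer has a high barrier to entry. FasterTransformer is implemented in C++, and its interfaces and documentation are relatively sparse, so users may need a deeper understanding of its internals and usage, which makes it harder for beginners to learn and use. Its ecosystem is also small, with few resources and little support available, which adds to the difficulty of understanding and applying it. Compared with deploying and integrating a Python application, it therefore involves more technical detail and more challenges, and may require more systems programming knowledge and experience to integrate FasterTransformer smoothly with other systems or applications.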

These problems led to TensorRT-LLM. It can be seen as a combination of TensorRT and FasterTransformer, built specifically to accelerate large-model inference.

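FasterTransformer and the Triton inference framework are both tools for high-performance inference, but they differ in focus and purpose. The main differences are:

FasterTransformer focuses on Transformer models, whereas Triton supports many model formats.
FasterTransformer concentrates on optimizing and accelerating the model, whereas Triton concentrates on deploying and scaling it.
FasterTransformer is optimized mainly for NVIDIA GPUs, whereas Triton supports a range of hardware, NVIDIA GPUs included.
Besides deploying with a C++ backend, FasterTransformer also integrates with TensorFlow (via a TensorFlow op), PyTorch (via a PyTorch op), and Triton as backend frameworks.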

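II. Basic usage

(1) Basic features

Beyond the optimizations FasterTransformer applies to Transformers (attention optimization, softmax optimization, operator fusion, and so on), TensorRT-LLM adds many features aimed at large-model inference: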

Multi-head Attention (MHA)
Multi-query Attention (MQA)
Grouped-query Attention (GQA)
In-flight Batching
Paged KV Cache for the Attention
Tensor Parallelism
Pipeline Parallelism
INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
SmoothQuant
GPTQ
AWQ
FP8
Greedy-search
Beam-search
RoPE
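The first three items in the list differ mainly in how many key/value heads the query heads share, which directly sets the size of the KV cache. A small sketch of that relationship follows; the helper names and the model shape (32 layers, 32 query heads, head dimension 128) are assumptions for illustration.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes stored per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM = 32, 32, 128  # hypothetical shape

for name, kv_heads in [("MHA", NUM_Q_HEADS), ("GQA", 8), ("MQA", 1)]:
    size = kv_bytes_per_token(NUM_LAYERS, kv_heads, HEAD_DIM)
    print(f"{name}: {kv_heads} KV heads, {size // 1024} KiB per token")

# With 8 query heads and 2 KV heads, query heads 0-3 share KV head 0.
print([kv_head_for_query_head(q, 8, 2) for q in range(8)])
```

MHA gives every query head its own K/V head, MQA shares a single K/V head across all query heads, and GQA sits in between.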

It also ships worked examples for many open-source large models, including but not limited to:

Baichuan
Bert
Blip2
BLOOM
ChatGLM-6B
ChatGLM2-6B
Falcon
GPT
GPT-J
GPT-Nemo
GPT-NeoX
LLaMA
LLaMA-v2
MPT
OPT
SantaCoder
StarCoder

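Usage keeps TensorRT's two-stage workflow, build + run:
build: serialize the model into a TensorRT engine file according to the configured parameters
run: load the engine file, feed in data, and run inference

In short, the flow is:

Hugging Face model -> TensorRT-LLM checkpoint (model conversion) -> TensorRT engine -> inference with the engine.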

Why TensorRT-LLM is fast:

1. The model is precompiled and its kernels are optimized
2. The model can be quantized
3. In-flight batching
4. Paged attention and efficient caching of K and V
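To see why in-flight batching helps, here is a toy simulation, not TensorRT-LLM's scheduler. It assumes each request needs a known number of decode steps (at least one) and that one step advances every active sequence by one token. With static batching a batch runs until its longest request finishes; with in-flight batching a finished request frees its slot immediately for the next one in the queue.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest request is done."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def inflight_batching_steps(lengths, batch_size):
    """Finished requests leave the batch and waiting ones join right away."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that finished.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2]  # decode steps needed per request (made-up values)
print("static:   ", static_batching_steps(lengths, batch_size=2))
print("in-flight:", inflight_batching_steps(lengths, batch_size=2))
```

The gap widens as output lengths become more uneven, which is the common case for chat-style workloads.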

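(2) Usage in detail

Early releases of TensorRT-LLM had to be built from source; the steps below instead install the prebuilt pip wheel inside a CUDA Docker container. For detailed guidance on setting up the environment with Docker, see the official guide:

docs/source/quick-start-guide.md in the NVIDIA/TensorRT-LLM repository on GitHub (main branch)

The steps are as follows: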

-> Prepare the environment

### Pull the base image
docker pull nvidia/cuda:12.4.0-devel-ubuntu22.04


### Run the image
# Obtain and start the basic docker image environment (optional).
docker run --rm --ipc=host --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.4.0-devel-ubuntu22.04

-> Build/install TensorRT-LLM

# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

### Or
###pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121

# Check installation
python3 -c "import tensorrt_llm"
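If the bare import above fails, it can help to see which version, if any, is installed. A small standard-library sketch (the helper name is made up for illustration) that reports the version without importing the package itself:

```python
import importlib.util
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not found."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a source checkout).
        return "unknown version"


for name in ("tensorrt_llm", "tensorrt"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```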

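Note that TensorRT-LLM depends on TensorRT. If an earlier installation included TensorRT 8, upgrading in place may require explicitly running pip uninstall tensorrt to remove the old version first.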

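-> Model conversion, engine build, and inference

This step clones the TensorRT-LLM source, whose ./examples directory contains many reference examples. First install the libraries the examples need: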

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -r examples/bloom/requirements.txt
git lfs install

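Then, taking the LLaMA 7B model as an example, run the following:

### Convert the HF model to a TensorRT-LLM checkpoint (convert_checkpoint.py is in examples/llama)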
python convert_checkpoint.py --model_dir ./meta-llama/Llama-2-7b-chat-hf \
                              --output_dir ./tllm_checkpoint_1gpu_bf16 \
                              --dtype bfloat16

### Build the engine
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

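After the script finishes, the ./tmp/llama/7B/trt_engines/bf16/1-gpu folder contains the following files:

Llama_float16_tp1_rank0.engine: the main output of the build script, containing the executable graph of operations with the model weights embedded (the exact file name varies with the TensorRT-LLM version and build options).
config.json: details about the model, such as its general structure and precision, and which plugins were built into the engine.
model.cache: cached timing and optimization information from model compilation, which makes subsequent builds faster.

Run inference with the built engine: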

python examples/llama/run.py \
   --engine_dir=./tmp/llama/7B/trt_engines/bf16/1-gpu \
   --max_output_len 100 \
   --tokenizer_dir meta-llama/Llama-2-7b-chat-hf \
   --input_text "How do I count to nine in French?"
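The three commands above (convert, build, run) can be assembled from a single set of paths so that the checkpoint and engine directories stay consistent between steps. The sketch below only builds and prints the commands; it reuses the flags shown in this section and nothing else. The helper name is made up, and the script paths assume the layout used above, which may differ between TensorRT-LLM versions.

```python
import shlex


def build_pipeline(hf_model_dir, ckpt_dir, engine_dir, dtype="bfloat16"):
    """Return the convert, build, and run commands as argument lists."""
    convert = ["python", "convert_checkpoint.py",
               "--model_dir", hf_model_dir,
               "--output_dir", ckpt_dir,
               "--dtype", dtype]
    build = ["trtllm-build",
             "--checkpoint_dir", ckpt_dir,
             "--output_dir", engine_dir,
             "--gpt_attention_plugin", dtype,
             "--gemm_plugin", dtype]
    run = ["python", "examples/llama/run.py",
           "--engine_dir", engine_dir,
           "--max_output_len", "100",
           "--tokenizer_dir", hf_model_dir,
           "--input_text", "How do I count to nine in French?"]
    return [convert, build, run]


commands = build_pipeline(
    hf_model_dir="./meta-llama/Llama-2-7b-chat-hf",
    ckpt_dir="./tllm_checkpoint_1gpu_bf16",
    engine_dir="./tmp/llama/7B/trt_engines/bf16/1-gpu",
)
for cmd in commands:
    print(shlex.join(cmd))  # shell-safe rendering of each command
```

To execute the steps rather than print them, each list can be passed to subprocess.run(cmd, check=True), which stops the pipeline at the first failing step.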

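-> Deploying with Triton server

In addition to running locally, you can use the NVIDIA Triton Inference Server to create a production-ready deployment of your LLM. NVIDIA provides a Triton Inference Server backend for TensorRT-LLM that uses the TensorRT-LLM C++ runtime for fast inference and includes techniques such as in-flight batching and paged KV caching. Triton Inference Server with the TensorRT-LLM backend is available as a prebuilt container through NGC.

First, create a model repository so that Triton Inference Server can read the model and any associated metadata. The tensorrtllm_backend repository includes a skeleton of a suitable model repository under all_models/inflight_batcher_llm/. That directory contains four subfolders holding artifacts for different parts of the model execution process:

/preprocessing and /postprocessing: scripts for the Triton Inference Server Python backend that tokenize text inputs and detokenize model outputs, converting between strings and the token IDs the model operates on.
/tensorrt_llm: where you place the model engine you compiled earlier.
/ensemble: defines a model ensemble that links the previous three components together and tells Triton Inference Server how to pass data through them.

Pull down the example model repository and copy the model you compiled in the previous step into it: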

# After exiting the TensorRT-LLM Docker container
cd ..
git clone -b release/0.8.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/examples/llama/out/*   all_models/inflight_batcher_llm/tensorrt_llm/1/

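Next, modify some configuration files in the repository skeleton with the following information:

The location of the compiled model engine
Which tokenizer to use
How to handle memory allocation for the KV cache when running inference in batches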

python tools/fill_template.py --in_place \
      all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
      decoupled_mode:true,engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\
max_tokens_in_paged_kv_cache:,batch_scheduler_policy:guaranteed_completion,kv_cache_free_gpu_mem_fraction:0.2,\
max_num_sequences:4
 
python tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_type:llama,tokenizer_dir:meta-llama/Llama-2-7b-chat-hf
 
python tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_type:llama,tokenizer_dir:meta-llama/Llama-2-7b-chat-hf
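To get a feel for what kv_cache_free_gpu_mem_fraction:0.2 means in practice, the sketch below estimates how many tokens fit in the KV cache. It assumes the fraction applies to the GPU memory still free after the engine is loaded, and it ignores paging granularity and other runtime buffers, so treat the result as a rough upper bound. The helper name and the 10 GiB of free memory are made up; the model shape (32 layers, 32 KV heads, head dimension 128, FP16) is assumed for Llama-2-7B.

```python
def kv_cache_token_budget(free_gpu_bytes, fraction, num_layers,
                          num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate number of tokens whose K/V vectors fit in the cache."""
    # One K and one V vector per layer for every cached token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return int(free_gpu_bytes * fraction) // bytes_per_token


free_bytes = 10 * 1024**3  # hypothetical: 10 GiB free after loading the engine
budget = kv_cache_token_budget(free_bytes, fraction=0.2,
                               num_layers=32, num_kv_heads=32, head_dim=128)
print(f"about {budget} tokens shared across all in-flight sequences")
```

That budget is shared by every sequence being served at once, so a small fraction limits either the number of concurrent requests or their context length.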

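Now start the Docker container and launch the Triton server. Specify world_size, the number of GPUs the model was built for, and point model_repo at the repository you just set up.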

docker run -it --rm --gpus all --network host --shm-size=1g \
     -v $(pwd)/all_models:/all_models \
     -v $(pwd)/scripts:/opt/scripts \
     nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
 
# Log in to huggingface-cli to get tokenizer
huggingface-cli login --token *****
 
# Install python dependencies
pip install sentencepiece protobuf
 
# Launch Server
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1

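To send requests to the running server and interact with it, you can use one of the Triton Inference Server client libraries or send HTTP requests to the generate endpoint. To get started, you can use the more fully featured client script or the following curl command: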

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
"text_input": "How do I count to nine in French?",
"parameters": {
"max_tokens": 100,
"bad_words":[""],
"stop_words":[""]
}
}'
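The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl command above (same URL and JSON body); the function names are made up, and the response is returned as parsed JSON without assuming any particular field names.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"


def build_generate_request(prompt, max_tokens=100, url=GENERATE_URL):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "bad_words": [""],
            "stop_words": [""],
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the Triton server from the previous step to be running.
    print(generate("How do I count to nine in French?"))
```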

-> Quantization

If you want to quantize the model with INT8 KV cache + AWQ, run the following script:

python ../quantization/quantize.py --model_dir /tmp/llama-7b-hf \
                                   --output_dir ./tllm_checkpoint_1gpu_awq_int8_kv_cache \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --kv_cache_dtype int8 \
                                   --calib_size 32

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_awq_int8_kv_cache \
            --output_dir ./tmp/llama/7B/trt_engines/int8_kv_cache_int4_AWQ/1-gpu/ \
            --gemm_plugin auto
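The --awq_block_size 128 flag means weights are quantized in groups of 128 that share one scale. The sketch below shows plain group-wise symmetric 4-bit quantization and its storage cost. It is not AWQ itself, which additionally rescales weight channels using activation statistics before quantizing. The function names are made up, and the storage estimate counts only the 4-bit weights plus one FP16 scale per group.

```python
def quantize_group(values, bits=4):
    """Symmetric quantization of one group to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_group(quantized, scale):
    """Recover approximate floating-point values from integers and the scale."""
    return [q * scale for q in quantized]


def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Average storage per weight, including the shared per-group scale."""
    return bits + scale_bits / group_size


group = [0.7, -0.35, 0.1, 0.0]  # a tiny made-up group for illustration
q, scale = quantize_group(group)
print(q, [round(v, 3) for v in dequantize_group(q, scale)])

avg_bits = bits_per_weight(bits=4, group_size=128)
print(f"{avg_bits} bits per weight, "
      f"about {7e9 * avg_bits / 8 / 1e9:.1f} GB for 7B parameters")
```

A smaller group size tracks the weight distribution more closely but spends more bits on scales, which is the trade-off behind the block size setting.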

Then summarize test articles from the cnn_dailymail dataset and compute ROUGE scores:

python ../summarize.py --test_trt_llm \
                       --hf_model_dir /tmp/llama-7b-hf \
                       --data_type fp16 \
                       --engine_dir ./tmp/llama/7B/trt_engines/int8_kv_cache_int4_AWQ/1-gpu \
                       --test_hf
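For readers unfamiliar with ROUGE, here is a simplified ROUGE-1 in plain Python: it counts overlapping unigrams between a generated summary and a reference. Real evaluations, including the script above, rely on a full ROUGE implementation with proper tokenization and further variants such as ROUGE-2 and ROUGE-L, so this is only a sketch of the idea. The function name and example sentences are made up.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, F1) based on unigram overlap."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped overlap: each word counts at most as often as in the other text.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f1 = rouge1("the cat sat on the mat", "the cat is on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Comparing the scores of the TensorRT-LLM engine against the Hugging Face model on the same inputs shows how much quality, if any, quantization cost.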

References:

1. https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama