Accelerating Qwen1.5-14B-Chat inference with TensorRT-LLM

mzixi

Last modified: 2024-04-11 20:33:15

Tags: python

First published: 2024-04-10 15:26:13

Article link: https://blog.csdn.net/weixin_53215849/article/details/137597005

This walkthrough is based on Tlntin's code: https://github.com/Tlntin/Qwen-TensorRT-LLM.git

I. Pull the image

docker pull qliang1014/tensorrt-llm:v1.0.0

II. Download the repositories

git clone https://github.com/Tlntin/Qwen-TensorRT-LLM.git
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.8.0

III. Create the container

docker run -d \
    --name triton_server \
    --net host \
    --shm-size=2g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -v "$(pwd)/tensorrtllm_backend":/tensorrtllm_backend \
    -v "$(pwd)/Qwen-TensorRT-LLM/examples/qwen2":/root/qwen \
    qliang1014/tensorrt-llm:v1.0.0 sleep 864000

IV. Convert the model

cd /root/qwen
python3 build.py --hf_model_dir ./qwen1.5-14b-chat \
                --dtype float16 \
                --remove_input_padding \
                --gpt_attention_plugin float16 \
                --gemm_plugin float16 \
                --enable_context_fmha \
                --use_weight_only \
                --weight_only_precision int4 \
                --output_dir ./tmp/Qwen1.5/14B/trt_engines/int4_weight_only/2-gpu/inflight \
                --use_inflight_batching \
                --paged_kv_cache \
                --world_size 2 \
                --tp_size 2

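hf_model_dir: path to the local Hugging Face model
use_weight_only, weight_only_precision int4: quantize the weights to INT4 (weight-only quantization)
output_dir: directory where the built engine is saved
use_inflight_batching, paged_kv_cache: enable in-flight batching with a paged KV cache, which the streaming setup in section V relies on
world_size 2, tp_size 2: use two GPUs (tensor parallelism across two cards)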
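
To get a feel for why INT4 weight-only quantization plus two-way tensor parallelism makes a 14B model fit on consumer cards, the arithmetic can be sketched as follows. This is only a rough estimate: `weight_memory_gib` and `NOMINAL_PARAMS` are hypothetical names, the parameter count is the nominal "14B" figure, and the result ignores quantization scales, any layers left in float16, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int, tp_size: int) -> float:
    """Approximate weight memory per GPU in GiB (weights only, evenly split)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / tp_size / 2**30


# Assumption: nominal parameter count of a "14B" model, not the exact figure.
NOMINAL_PARAMS = 14e9

fp16_per_gpu = weight_memory_gib(NOMINAL_PARAMS, 16, tp_size=2)
int4_per_gpu = weight_memory_gib(NOMINAL_PARAMS, 4, tp_size=2)

print(f"float16 weights: {fp16_per_gpu:.1f} GiB per GPU")
print(f"int4 weights:    {int4_per_gpu:.1f} GiB per GPU")
```

The real footprint will be somewhat higher than the INT4 figure, so treat it as a lower bound when checking whether a given card has room left for the KV cache.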

Once the build finishes, run a quick test to confirm the engine works:

mpirun -n 2 --allow-run-as-root  \
python3 run.py \
    --tokenizer_dir ./qwen1.5-14b-chat \
    --engine_dir=./tmp/Qwen1.5/14B/trt_engines/int4_weight_only/2-gpu/inflight
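
Before launching the test, it can help to sanity-check the engine directory. The sketch below assumes that the build step writes a `config.json` plus one `.engine` file per rank into `output_dir`; that layout is an assumption about this version of the build script, and `check_engine_dir` is a hypothetical helper, not part of TensorRT-LLM.

```python
from __future__ import annotations

from pathlib import Path


def check_engine_dir(engine_dir: str, world_size: int) -> list[str]:
    """Return a list of problems; an empty list means the layout looks plausible."""
    root = Path(engine_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: the build script stores its settings in config.json.
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Assumption: one serialized engine per rank, each ending in .engine.
    engines = sorted(root.glob("*.engine"))
    if len(engines) != world_size:
        problems.append(f"expected {world_size} .engine files, found {len(engines)}")
    return problems


if __name__ == "__main__":
    path = "./tmp/Qwen1.5/14B/trt_engines/int4_weight_only/2-gpu/inflight"
    issues = check_engine_dir(path, world_size=2)
    print("looks OK" if not issues else "\n".join(issues))
```

If the check reports a mismatch in the number of engine files, the most likely cause is that `--engine_dir` does not point at the same directory that was passed as `--output_dir`.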

V. Triton deployment

1. Copy the engine files built in the previous section

cd /root/qwen/tmp/Qwen1.5/14B/trt_engines/int4_weight_only/2-gpu/inflight/
cp -r ./* /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/

2. Copy the tokenizer files (from the original, pre-conversion model)

cd /root/qwen/
cp -r qwen1.5-14b-chat /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_14b_chat

# Delete the Hugging Face weight files from the tokenizer directory (optional)
rm /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_14b_chat/*.safetensors

3. Set up Triton's preprocessing and postprocessing configs by editing triton_model_repo/preprocessing/config.pbtxt and triton_model_repo/postprocessing/config.pbtxt

Before:

parameters {
  key: "tokenizer_dir"
  value: {
	string_value: "${tokenizer_dir}"
  }
}

parameters {
  key: "tokenizer_type"
  value: {
	string_value: "${tokenizer_type}"
  }
}

After:

parameters {
  key: "tokenizer_dir"
  value: {
	string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_14b_chat"
  }
}

parameters {
  key: "tokenizer_type"
  value: {
	string_value: "auto"
  }
}
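
Editing the two files by hand works, but the same substitution can be scripted. The following is a minimal sketch using Python's `string.Template`, whose `${name}` syntax happens to match the placeholders shown above; `fill_placeholders` is a hypothetical helper, and the tokenizer path is the directory created in step 2.

```python
from __future__ import annotations

from string import Template


def fill_placeholders(pbtxt_text: str, values: dict[str, str]) -> str:
    """Replace ${name} placeholders; unknown placeholders are left untouched."""
    return Template(pbtxt_text).safe_substitute(values)


snippet = '''parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "${tokenizer_dir}"
  }
}

parameters {
  key: "tokenizer_type"
  value: {
    string_value: "${tokenizer_type}"
  }
}
'''

filled = fill_placeholders(
    snippet,
    {
        "tokenizer_dir": "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_14b_chat",
        "tokenizer_type": "auto",
    },
)
print(filled)
```

To apply it to a real file, read the config.pbtxt contents into a string, pass it through `fill_placeholders`, and write the result back. Because `safe_substitute` is used, any placeholder you have not supplied a value for stays in the file and can be filled in later.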

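4. Make a small change to the initialize function in the preprocessing and postprocessing model.py files. The stock example targets LLaMA, so the tokenizer setup has to be switched to Qwen's.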

Before:

self.tokenizer.pad_token = self.tokenizer.eos_token
self.pad_id = self.tokenizer.encode(self.tokenizer.pad_token,
                                    add_special_tokens=False)[0]

After (you may need to adapt this to the model you are using):

# Needs `json` and `os` to be imported at the top of model.py.
gen_config_path = os.path.join(tokenizer_dir, 'generation_config.json')
with open(gen_config_path, 'r') as f:
    gen_config = json.load(f)
# eos_token_id is a list: use its first entry for both padding and end.
if isinstance(gen_config["eos_token_id"], list):
    pad_id = end_id = gen_config["eos_token_id"][0]
else:
    # Base model: eos_token_id is a single id, so pad with bos_token_id.
    pad_id = gen_config["bos_token_id"]
    end_id = gen_config["eos_token_id"]
self.tokenizer_pad_id = pad_id
self.tokenizer_end_id = end_id
# Point the tokenizer's EOS and PAD tokens at the resolved end id.
eos_token = self.tokenizer.decode(end_id)
self.tokenizer.eos_token = self.tokenizer.pad_token = eos_token
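
The branching on `eos_token_id` is easy to get wrong, so it may be worth isolating it in a small function that can be checked without loading a tokenizer. This is a sketch of the same logic: `resolve_pad_end_ids` is a hypothetical name and the token ids are made-up placeholders, not values from any real generation_config.json.

```python
from __future__ import annotations


def resolve_pad_end_ids(gen_config: dict) -> tuple[int, int]:
    """Return (pad_id, end_id) following the rule used in initialize."""
    eos = gen_config["eos_token_id"]
    if isinstance(eos, list):
        # Several EOS ids are listed: the first one serves as both pad and end.
        return eos[0], eos[0]
    # A single EOS id (base model): pad with BOS, end on EOS.
    return gen_config["bos_token_id"], eos


# Placeholder ids for illustration only; read the real ones from your model.
chat_like = {"bos_token_id": 101, "eos_token_id": [105, 101]}
base_like = {"bos_token_id": 101, "eos_token_id": 103}

print(resolve_pad_end_ids(chat_like))
print(resolve_pad_end_ids(base_like))
```

Keeping the rule in one function also means the preprocessing and postprocessing model.py files cannot drift apart if the logic is changed later.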

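5. Following the tensorrtllm_backend 0.7.0 README, fill in the variables listed in its table (file triton_model_repo/tensorrt_llm/config.pbtxt), such as the batch size and whether streaming is enabled. The details differ slightly between versions, so adjust as needed. The main settings are:

triton_max_batch_size: should be at least 4;

decoupled_mode: setting this to true is recommended;

gpt_model_path: the engine directory, i.e. /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1;

gpt_model_type: set to inflight_fused_batching to enable streaming inference. The in-flight batching modes require the engine to have been built with use_inflight_batching and paged_kv_cache; the default is v1;

preprocessing_instance_count and postprocessing_instance_count: how many instances (and therefore CPU cores) handle tokenization; these can be set to your CPU core count;

max_queue_delay_microseconds: the maximum time, in microseconds, that a request may wait in the queue so it can be batched with other requests; 1000 is a reasonable value;

bls_instance_count: can likewise be set according to the number of CPU cores;

exclude_input_in_output: set to true so that the response does not echo the input;

kv_cache_free_gpu_mem_fraction: the fraction of free GPU memory handed to the KV cache; the default is 0.9.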
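
To see what a given kv_cache_free_gpu_mem_fraction means in practice, the KV cache budget can be converted into a rough token capacity. The sketch below is an estimate under stated assumptions: `kv_cache_token_capacity` is a hypothetical helper, the model shape (40 layers, 40 KV heads split over 2 GPUs, head dimension 128) and the float16 cache are assumptions that should be checked against the model's config.json, and the 10 GiB of free memory is an arbitrary example figure.

```python
def kv_cache_token_capacity(
    free_gpu_bytes: float,
    fraction: float,
    num_layers: int,
    num_kv_heads_per_gpu: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Approximate number of tokens whose keys and values fit in the cache budget."""
    budget = free_gpu_bytes * fraction
    # Per token: one key and one value vector for every layer and KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads_per_gpu * head_dim * bytes_per_value
    return int(budget // bytes_per_token)


# Assumed values; verify the model shape in config.json before relying on them.
capacity = kv_cache_token_capacity(
    free_gpu_bytes=10 * 2**30,   # example: 10 GiB free after loading the engine
    fraction=0.9,                # the default mentioned above
    num_layers=40,
    num_kv_heads_per_gpu=40 // 2,  # KV heads split across tp_size=2
    head_dim=128,
)
print(f"roughly {capacity} tokens of KV cache per GPU")
```

The capacity is shared by all requests that are in flight at the same time, so a larger triton_max_batch_size or longer sequences both draw from the same pool.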

6. Start the server

python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo
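
Once the launch script returns, the server still needs some time to load the engines. A small readiness probe can confirm when it is up. This sketch uses Triton's standard `/v2/health/ready` HTTP endpoint and assumes the default HTTP port 8000 on the same host; `ready_url` and `is_server_ready` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def ready_url(host: str = "localhost", http_port: int = 8000) -> str:
    """Build the URL of Triton's readiness endpoint."""
    return f"http://{host}:{http_port}/v2/health/ready"


def is_server_ready(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready.
        return False


if __name__ == "__main__":
    print("ready" if is_server_ready(ready_url()) else "not ready yet")
```

Running the probe in a loop with a short sleep is usually enough to gate any client script on the server being fully loaded.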

VI. Deployment issues

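1. For multi-GPU use you need to build the image yourself. With the official image the model builds, but it cannot be run.

2. Images built for different GPUs may not be compatible with each other. The image used here supports the 3080 and 3090.

3. Engines built on different GPUs may not be compatible with each other.

4. The parameters used at deployment time must match the ones used for the build.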