TigerBot-70b-4k-v4 Inference Deployment

GoldenSeS

Article link: https://blog.csdn.net/m0_73854133/article/details/134788180

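Local model deployment

In my tests, loading the model takes about 120 GB of GPU memory, so at least six RTX 3090 cards are needed (pipeline parallelism).

To accelerate inference with vLLM (tensor parallelism), plan for eight 3090 cards or four A100-40G cards, because tensor parallelism constrains how the model can be split across GPUs.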
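
The GPU counts above can be reproduced with a rough calculation. The sketch below is not taken from the TigerBot or vLLM documentation. It assumes that about 90% of each card's memory is usable, and that the model has 64 attention heads (the Llama-2-70B layout), a number the tensor-parallel size has to divide evenly. Both values, and the helper names `min_gpus` and `smallest_tp_size`, are assumptions made for illustration.

```python
import math


def min_gpus(model_gb: float, gpu_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest number of cards whose usable memory covers the model."""
    return math.ceil(model_gb / (gpu_gb * usable_fraction))


def smallest_tp_size(at_least: int, num_heads: int = 64) -> int:
    """Smallest GPU count >= at_least that divides the attention-head count."""
    for n in range(at_least, num_heads + 1):
        if num_heads % n == 0:
            return n
    raise ValueError("no valid tensor-parallel size found")


# 120 GB is the measured footprint reported in this article.
for name, gpu_gb in [("RTX 3090 (24 GB)", 24), ("A100-40G", 40)]:
    pipeline = min_gpus(120, gpu_gb)
    tensor = smallest_tp_size(pipeline)
    print(f"{name}: pipeline parallel >= {pipeline} cards, tensor parallel = {tensor} cards")
```

Under these assumptions the 3090 needs six cards for pipeline parallelism, but six does not divide 64, so the tensor-parallel setup moves up to eight. Four A100-40G cards satisfy both conditions.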

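Model download

As of this writing (2023-12-04), the model weights are hosted only on Hugging Face. On Hengyuan Cloud, they can be downloaded as follows:

Enable the Hengyuan Cloud proxy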

export https_proxy=http://turbo.gpushare.com:30000 http_proxy=http://turbo.gpushare.com:30000

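Go to the model download page

I recommend downloading the model files with wget, because its -c option can resume an interrupted download. An example:

wget -c https://huggingface.co/TigerResearch/tigerbot-70b-chat-v4-4k/resolve/main/pytorch_model-00001-of-00015.bin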

Disable the Hengyuan Cloud proxy

unset http_proxy && unset https_proxy
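
The wget example in this section fetches only the first weight shard. As a minimal sketch, the remaining URLs can be generated rather than typed by hand. It assumes all 15 shards follow the same `pytorch_model-XXXXX-of-00015.bin` naming pattern as the example file; `shard_urls` is a hypothetical helper, not part of any TigerBot tooling.

```python
BASE_URL = "https://huggingface.co/TigerResearch/tigerbot-70b-chat-v4-4k/resolve/main"
NUM_SHARDS = 15  # inferred from the "-of-00015" suffix of the example file name


def shard_urls(base_url: str = BASE_URL, num_shards: int = NUM_SHARDS) -> list[str]:
    """Build one download URL per weight shard (assumed naming pattern)."""
    return [
        f"{base_url}/pytorch_model-{i:05d}-of-{num_shards:05d}.bin"
        for i in range(1, num_shards + 1)
    ]


# Print one URL per line so the output can be redirected into a file.
print("\n".join(shard_urls()))
```

Redirecting the output to a file such as `urls.txt` and running `wget -c -i urls.txt` while the proxy is still enabled downloads the shards one after another. The configuration and tokenizer files in the same repository are not covered by this list and still need to be fetched separately.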

Installing dependencies

Clone the official GitHub repository

git clone https://github.com/TigerResearch/TigerBot.git && cd TigerBot

Install the required libraries

pip install -r requirements.txt

Model inference

For ordinary multi-GPU inference, an example command is:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer.py --model_path /path/to/your/model --max_input_length 1024 --max_generate_length 1024 --streaming True

Quantization

The official repository provides a script that quantizes the model on the fly and then runs inference:

CUDA_VISIBLE_DEVICES=0,1,2,3 python other_infer/quant_infer.py --model_path /path/to/your/model --wbit 8

Accelerated inference with vLLM

Install vLLM

pip install vllm

Create a new inference .py file

from vllm import LLM, SamplingParams

# Number of GPUs to shard the model across (tensor parallelism)
num_gpus = 8  # Change this to the number of GPUs you have

# Define the prompt and the sampling parameters
prompts = """
### Instruction:
first instruction

### Instruction:
second instruction

### Response:
"""
sampling_params = SamplingParams(temperature=1, top_p=0.9, top_k=50, max_tokens=512, stop="</s>")

# Initialize the vLLM model; vLLM distributes it across the GPUs itself
# according to tensor_parallel_size
llm = LLM(model="/hy-tmp/tigerbot-70b-chat-v4-4k", tensor_parallel_size=num_gpus, trust_remote_code=True)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print each prompt together with its generated text
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that the prompt format here differs from Llama 2's. TigerBot prompts follow this format:

### Instruction:
first instruction

### Instruction:
second instruction

### Response:
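
Writing this template by hand is error-prone once the instructions come from variables. A minimal sketch of a helper that assembles it is shown here. `build_prompt` is a hypothetical function, not part of TigerBot or vLLM, and it simply mirrors the layout shown in this article.

```python
def build_prompt(instructions: list[str]) -> str:
    """Join instructions into the '### Instruction:' / '### Response:' layout."""
    if not instructions:
        raise ValueError("at least one instruction is required")
    # One block per instruction, each followed by a blank line.
    blocks = [f"### Instruction:\n{text.strip()}\n" for text in instructions]
    # The trailing '### Response:' line is where the model starts generating.
    return "\n".join(blocks) + "\n### Response:\n"


print(build_prompt(["first instruction", "second instruction"]))
```

The returned string can be passed to `generate` in place of a hand-written prompt.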

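Troubleshooting

Most errors during installation come from dependency version mismatches and can be resolved by adjusting the versions.

flash-attn installation error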

/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymIntltEl

Fix: rebuild the flash-attn library

pip uninstall flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
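
An undefined-symbol error like this one usually means the compiled extension was built against a different PyTorch version than the one installed. After rebuilding, a quick import check confirms whether the problem is gone before launching a full inference run. The sketch below uses only the standard library; `can_import` is a hypothetical helper name.

```python
import importlib


def can_import(module_name: str = "flash_attn") -> tuple[bool, str]:
    """Try to import a module and report whether it loaded, with the error text if not."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        # Missing modules and undefined-symbol failures both surface as ImportError.
        return False, str(exc)
    return True, "ok"


ok, detail = can_import()
print(f"flash_attn import ok: {ok} ({detail})")
```

If the check still reports the undefined symbol, the installed PyTorch version and the flash-attn build are likely still mismatched.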