Using text-generation-inference

I still …

Last modified: 2024-04-10 19:27:00


Tags: TGI, large language models

First published: 2023-12-30 20:10:24

Article link: https://blog.csdn.net/qq_44370676/article/details/135309011


Using TGI

1. Docker installation
2. Local installation
3. Client usage
4. Updates

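My recent work requires running LLMs. LLMs are usually served from multiple processes, and for now I only need inference, so separating the LLM part from the business logic makes the project much easier to maintain. For that I used text-generation-inference (TGI), Hugging Face's official LLM serving library.

For usage, see the Hugging Face tutorial and the text-generation-inference repository.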

1. Docker installation

Using CodeLlama and WizardCoder as examples:

sudo docker run --gpus '"device=0,1,2,3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=<your token> -p 8989:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id codellama/CodeLlama-34b-Instruct-hf

sudo docker run --gpus all -e NVIDIA_VISIBLE_DEVICES=0,1 --shm-size 1g -p 8989:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id WizardLM/WizardCoder-15B-V1.0

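Installing with Docker is convenient, but it requires nvidia-container-toolkit, which is troublesome if you do not have administrator rights on the machine. In that case, try a local installation.

2. Local installation

2.1. Rust + Anaconda3

First install Rust and Anaconda. Once Anaconda3 is installed, create an environment named text-generation-inference (or any other name).

# Prepare the conda environment; 3.9 <= Python version < 3.13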
conda create -n text-generation-inference python=3.9
conda activate text-generation-inference

Install Rust:

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf -o rustup-init.sh https://sh.rustup.rs
chmod +x rustup-init.sh
export CARGO_HOME=/your/custom/path/to/cargo
export RUSTUP_HOME=/your/custom/path/to/rustup
./rustup-init.sh # use sudo if the installer needs to modify environment variables

# You can uninstall at any time with `rustup self uninstall` and
# these changes will be reverted.

# After installation, run the commands below to initialize the environment variables
export PATH="$CARGO_HOME/bin:$PATH"
rustup default stable

After installation, the installer prints:

To get started you may need to restart your current shell. This would reload your PATH environment variable to include Cargo's bin directory (`$HOME/.cargo/bin`).

To configure your current shell, run:

source "$CARGO_HOME/env", which is equivalent to export PATH=$CARGO_HOME/bin:$PATH.

You also need protoc and OpenSSL; skip this step if they are already installed:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

# With administrator rights, just use apt-get install; without them, you have to build and install from source
sudo apt-get install libssl-dev

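2.2. Installing the server

Download the source. I did not use git clone; the latest version at the time of writing is 1.3.4.

wget https://github.com/huggingface/text-generation-inference/archive/refs/tags/v1.3.4.zip

Unzip the archive, enter the text-generation-inference-1.3.4 folder, and install the server. The installation runs several pip install commands, so remember to activate the text-generation-inference conda environment first.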

BUILD_EXTENSIONS=True make install

Details of the Makefile: running make install inside text-generation-inference-1.3.4 is equivalent to running make install-server install-router install-launcher install-custom-kernels. Specifically:

1. make install-server: runs cd server && make install. According to server/Makefile, this executes the following:
- 1.1. gen-server, which generates the server's Python gRPC code under text_generation_server/pb:
  - 1.1.1. pip install grpcio-tools==1.51.1 mypy-protobuf==3.4.0 'types-protobuf>=3.20.4' --no-cache-dir
  - 1.1.2. mkdir text_generation_server/pb || true, which creates the text_generation_server/pb directory; if it already exists, the following commands still run.
  - 1.1.3. python -m grpc_tools.protoc -I../proto --python_out=text_generation_server/pb --grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/generate.proto, which generates the Python code from generate.proto and places it under text_generation_server/pb.
  - 1.1.4. find text_generation_server/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;, which, in every .py file under text_generation_server/pb/, prefixes each line that starts with import and contains pb2 with from ., so that the module is imported from the current package.
  - 1.1.5. touch text_generation_server/pb/__init__.py, which turns text_generation_server/pb/ into a Python package.
- 1.2. pip install pip --upgrade
- 1.3. pip install -r requirements_cuda.txt; the requirements in requirements_cuda.txt include the Hugging Face transformers library.
- 1.4. pip install -e ".[bnb, accelerate, quantize, peft]"
2. make install-router: runs cd router && cargo install --path ., which installs the binary into the $CARGO_HOME/bin directory.
3. make install-launcher: runs cd launcher && cargo install --path .
4. make install-custom-kernels: runs the following script, so custom_kernels is installed only when BUILD_EXTENSIONS=True:

if [ "$$BUILD_EXTENSIONS" = "True" ]; then 
	cd server/custom_kernels && python setup.py install; 
else 
	echo "Custom kernels are disabled, you need to set the BUILD_EXTENSIONS environment variable to 'True' in order to build them. (Please read the docs, kernels might not work on all hardware)"; 
fi
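
To make the import rewrite in step 1.1.4 concrete, here is a minimal Python sketch of what that sed expression does to a generated file. The function name make_imports_relative and the sample text are illustrative assumptions; the Makefile itself uses sed, not Python.

```python
import re

# Same idea as the sed expression: capture a line that starts with "import"
# and runs through "pb2", then prefix the captured text with "from . ".
IMPORT_PATTERN = re.compile(r"^(import.*pb2)", flags=re.MULTILINE)


def make_imports_relative(source: str) -> str:
    """Rewrite `import ..._pb2` lines as package-relative imports."""
    return IMPORT_PATTERN.sub(r"from . \1", source)


# Hypothetical excerpt of a generated gRPC module.
generated = (
    "import grpc\n"
    "import generate_pb2 as generate__pb2\n"
)
print(make_imports_relative(generated))
```

Only the line that mentions pb2 is rewritten; the plain import grpc line is left untouched, which is why the generated modules can then be imported as part of the text_generation_server.pb package.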

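2.3. Downloading a model and starting the server

The official example is make run-falcon-7b-instruct, which is equivalent to text-generation-launcher --model-id tiiuae/falcon-7b-instruct --port 8080, so the command actually used is text-generation-launcher. The text-generation-launcher binary itself is installed in $CARGO_HOME/bin, so from this point on you can run it from any directory, not only from text-generation-inference-1.3.4.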

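Running text-generation-launcher --help lists the options, including:

--model-id: the name of the model to load. It can be a model id from the Hugging Face Hub, such as gpt2 or OpenAssistant/oasst-sft-1-pythia-12b, or a local path, ideally one written by save_pretrained(...).
--sharded: whether to shard the model across multiple GPUs; accepts [true, false]. By default all available GPUs are used. If set to false, the --num-shard option has no effect.
--num-shard: the number of shards to use if you do not want to use all GPUs. For example, on a 4-GPU machine, running CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher ... --num-shard 2 and CUDA_VISIBLE_DEVICES=2,3 text-generation-launcher ... --num-shard 2 launches 2 copies with 2 shards each.
--max-input-length: the maximum length of the input token sequence; the default is 1024.
--huggingface-hub-cache: where models are stored; it can also be set through the HUGGINGFACE_HUB_CACHE environment variable. Locally, the Hugging Face API downloads models into the directory this variable points to by default.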
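-e: sets an environment variable. Note that this is an option of docker run rather than of the launcher: when a model such as Llama requires an HF token for download, pass it to the container with -e HUGGING_FACE_HUB_TOKEN=<token>. For a local launch, export HUGGING_FACE_HUB_TOKEN in the shell instead.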
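
The --num-shard description above can be sketched as a small planner that splits a GPU list into groups and builds one launcher command per copy. This is an illustrative helper, not part of TGI: plan_copies and its parameters are hypothetical names, and only the --model-id, --num-shard, and --port flags and the CUDA_VISIBLE_DEVICES variable come from the text.

```python
import shlex
from typing import Dict, List, Tuple


def plan_copies(model_id: str, gpu_ids: List[int], num_shard: int,
                base_port: int) -> List[Tuple[Dict[str, str], List[str]]]:
    """Split gpu_ids into groups of num_shard and build one command per group."""
    if num_shard <= 0 or len(gpu_ids) % num_shard != 0:
        raise ValueError("the number of GPUs must be a positive multiple of num_shard")
    plans = []
    for copy_index, start in enumerate(range(0, len(gpu_ids), num_shard)):
        group = gpu_ids[start:start + num_shard]
        # Each copy only sees its own GPUs.
        env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in group)}
        # Each copy listens on its own port (base_port, base_port + 1, ...).
        argv = ["text-generation-launcher", "--model-id", model_id,
                "--num-shard", str(num_shard),
                "--port", str(base_port + copy_index)]
        plans.append((env, argv))
    return plans


# Print the shell commands for 2 copies with 2 shards each on 4 GPUs.
for env, argv in plan_copies("WizardLM/WizardCoder-15B-V1.0", [0, 1, 2, 3], 2, 8989):
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(argv))
```

When several copies run on one machine, other launcher options (for example the master port and the shard socket path) probably need distinct values as well; check text-generation-launcher --help for your version.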

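Hugging Face models can also be downloaded with the following script:

from huggingface_hub import snapshot_download
import sys

if __name__ == '__main__':
    # Usage: python <this script> <repo_id> <cache_dir> [token]
    repo_id = sys.argv[1]
    cache_dir = sys.argv[2]
    # The token is optional; it is only needed for gated models
    token = None
    if len(sys.argv) > 3:
        token = sys.argv[3]
    snapshot_download(repo_id=repo_id, cache_dir=cache_dir, token=token)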

Suppose I want to serve the WizardCoder model on 2 GPUs at port 8989. The command could be:

CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher --model-id WizardLM/WizardCoder-15B-V1.0 --max-input-length=4096 --max-total-tokens=5120 --huggingface-hub-cache=<your/path> --port 8989
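
The two length limits in this command interact: as far as I understand the server's request validation, the prompt may not exceed --max-input-length tokens, and prompt tokens plus max_new_tokens may not exceed --max-total-tokens. The sketch below only illustrates that arithmetic; max_new_tokens_budget is a hypothetical helper, and its defaults are the values from the command above.

```python
def max_new_tokens_budget(input_tokens: int,
                          max_input_length: int = 4096,
                          max_total_tokens: int = 5120) -> int:
    """Return the largest max_new_tokens a prompt of this size could request."""
    if input_tokens > max_input_length:
        raise ValueError(
            f"prompt has {input_tokens} tokens, limit is {max_input_length}"
        )
    # Whatever the prompt does not use is left for generation.
    return max_total_tokens - input_tokens


print(max_new_tokens_budget(4096))  # 1024
print(max_new_tokens_budget(1000))  # 4120
```

With these settings, a prompt of the maximum length leaves room for exactly 1024 generated tokens, which matches the max_new_tokens value used in the client examples.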

To use Llama or CodeLlama, first request download access with your Hugging Face account, then create a Hugging Face token and pass it in as an environment variable.
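
Loading a large model can take a while, so a client may want to wait until the server is up before sending requests. Below is a minimal sketch that polls the server's /health route using only the standard library. It assumes the route returns HTTP 200 once the model is ready; health_ok, wait_until_ready, and the probe parameter are hypothetical names introduced for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_ok(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(address: str, timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     probe: Callable[[str], bool] = health_ok) -> bool:
    """Poll http://<address>/health until probe succeeds or timeout_s elapses."""
    url = "http://" + address + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (requires a running server):
# ready = wait_until_ready("127.0.0.1:8989")
```

The probe is passed in as a parameter so the polling loop can be exercised without a live server.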

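3. Client usage

From Python there are two ways to query the server: send a POST request directly to the REST API, or use the API wrapped by Hugging Face. See the official tutorial and the REST API documentation.

Two examples follow. Note that address has the format ip:port.

REST API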

import requests
import json

def query(address: str, prompt: str, double: bool=False):
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 1024,
                       "temperature": 0.5}
    }
    server_url = "http://" + address + "/generate"  # replace with the actual server address and port
    # Send a POST request with the JSON payload to the server
    response = requests.post(server_url, headers=headers, data=json.dumps(data))

    # Check the response status code
    if response.status_code == 200 and not response.text.startswith("Invalid:"):
        # Parse the server's JSON string response
        response_data_json: dict = json.loads(response.text)
        # The generated text
        return response_data_json['generated_text']
    else:
        print("Error: Server returned a non-200 status code or encountered an invalid result")
        return ""
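
Besides /generate, the server also offers a streaming route, /generate_stream, which sends tokens as server-sent events. The sketch below shows one way to turn such event lines into text. The sample lines are hand-written to mirror the event shape I have seen from the server (a "data:" prefix followed by a JSON object with a token field), so treat the exact fields as an assumption; iter_stream_text is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each non-special token from server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        if not token.get("special", False):
            yield token.get("text", "")


# Hand-written sample events, assumed to resemble the server's output.
sample = [
    'data:{"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false},'
    '"generated_text":null,"details":null}',
    "",
    'data:{"token":{"id":2,"text":" world","logprob":-0.2,"special":false},'
    '"generated_text":"Hello world","details":null}',
]
print("".join(iter_stream_text(sample)))
```

With requests, the lines would typically come from a call such as requests.post(url, json=data, stream=True) followed by response.iter_lines(decode_unicode=True).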

Hugging Face API

from huggingface_hub import InferenceClient

def query(address: str, prompt: str, double: bool=False):
    client = InferenceClient(model="http://" + address)
    # Pass the prompt argument (not the built-in input function) to the server
    response: str = client.text_generation(prompt, max_new_tokens=1024, temperature=0.5)
    return response
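
Since the server runs separately from the business code, several prompts can be sent concurrently and the server left to batch them. The following is a minimal sketch using a thread pool; query_many is a hypothetical helper that accepts any function with the (address, prompt) calling convention of the query functions above, and the stub stands in for a real server call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def query_many(query_fn: Callable[[str, str], str], address: str,
               prompts: List[str], max_workers: int = 4) -> List[str]:
    """Send prompts concurrently; results keep the order of prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda prompt: query_fn(address, prompt), prompts))


def fake_query(address: str, prompt: str) -> str:
    """Stub used for illustration only; replace with a real query function."""
    return f"[{address}] {prompt}"


print(query_many(fake_query, "127.0.0.1:8989", ["first prompt", "second prompt"]))
```

Threads are sufficient here because each call spends its time waiting on the network rather than computing.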