Problem background
I have two GPT-2 models. Model 1 has only about 100 million parameters and is stored as 16-bit floats, i.e. roughly 250 MB on disk; model 2 has 3.5 billion parameters, also stored as 16-bit floats, i.e. roughly 7 GB.
I expected that, once loaded into GPU memory for inference, a model would occupy roughly the same amount of space as it does on disk. Yet after loading the 100M-parameter model into TorchServe it occupied 957 MB, and I could not tell where the extra 700-odd MB came from.
$ nvidia-smi
Sun Mar 19 13:54:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 1 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
| 23% 28C P8 9W / 250W | 959MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 3267643 C /home/venv/bin/python 957MiB |
+-----------------------------------------------------------------------------+
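Incidentally, a quick back-of-the-envelope check of what the fp16 weights alone should occupy (a sketch using the nominal parameter counts; the exact checkpoint sizes differ slightly):

# Expected size of the weights alone, at 2 bytes per fp16 parameter
def fp16_weight_size_mb(num_params: int) -> float:
    return num_params * 2 / 1024 ** 2

print(f"{fp16_weight_size_mb(110_000_000):.0f} MB")    # ~210 MB for the 110M model
print(f"{fp16_weight_size_mb(3_500_000_000):.0f} MB")  # ~6700 MB (~6.5 GB) for the 3.5B model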
Explanation
The extra memory is in fact the overhead of the CUDA context, which is created the first time a CUDA-related operation is executed.
To find out how much GPU memory the CUDA context takes on your own card, simply create a trivial tensor, move it to the GPU, and look at the memory usage.
$ python
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> data = [1]
>>> x_data = torch.tensor(data)
>>> x_data.cuda()
tensor([1], device='cuda:0')
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:04:00.0 Off | N/A |
| 23% 33C P8 9W / 250W | 437MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
| 23% 28C P8 9W / 250W | 959MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3307828 C python 435MiB |
| 1 N/A N/A 3267643 C /home/venv/bin/python 957MiB |
+-----------------------------------------------------------------------------+
Notice that even though the tensor we created holds a single element, GPU memory usage still reaches 435 MB.
The GPU memory occupied by the CUDA context has nothing to do with model size; it depends only on the GPU itself. In other words, a large model does not incur a noticeably larger CUDA context than a small one.
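One way to see this (a minimal sketch; the exact numbers depend on the GPU, driver and PyTorch/CUDA build) is to compare what PyTorch's allocator reports with what nvidia-smi reports for the same process:

import torch

x = torch.tensor([1]).cuda()              # the first CUDA operation creates the context
print(torch.cuda.memory_allocated())      # bytes held by tensors: a few hundred bytes
print(torch.cuda.memory_reserved())       # bytes cached by PyTorch's allocator: about 2 MiB
# nvidia-smi still shows ~435 MiB for this process; the difference is the CUDA context itself.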
Experiment code
Preparing the model
# Save the 110M-parameter model in fp16
from transformers import GPT2Tokenizer, GPT2LMHeadModel

hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
tokenizer = GPT2Tokenizer.from_pretrained(hf_model_path)
model = GPT2LMHeadModel.from_pretrained(hf_model_path)
model.half()  # convert the weights to 16-bit floats
model.save_pretrained("Wenzhong-GPT2-110M")
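Continuing the script above, an optional sanity check (a sketch) of the parameter count and the resulting file size:

import os

num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.0f}M")                       # ~110M
print(f"expected fp16 size: {num_params * 2 / 1024 ** 2:.0f} MiB")
print(f"on disk: {os.path.getsize('Wenzhong-GPT2-110M/pytorch_model.bin') / 1024 ** 2:.0f} MiB")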
TorchServe
Model handler script: Transformer_handler_generalized.py
import torch as th
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from ts.torch_handler.base_handler import BaseHandler


class TransformersGpt2Handler(BaseHandler):
    def __init__(self):
        super(TransformersGpt2Handler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = th.device(
            "cuda:" + str(properties.get("gpu_id"))
            if th.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )

        # Load the packaged fp16 weights onto the worker's GPU
        self.model = GPT2LMHeadModel.from_pretrained(model_dir, torch_dtype=th.float16)
        self.model.to(self.device)

        # The tokenizer is fetched from the Hugging Face Hub rather than packaged in the .mar
        hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
        self.tokenizer = GPT2TokenizerFast.from_pretrained(hf_model_path)
        # add_special_tokens returns the number of tokens that were added
        self.end_token_id = self.tokenizer.add_special_tokens({"pad_token": "<|endoftext|>"})

        self.model.eval()
        self.initialized = True

    def preprocess(self, requests):
        # Tokenize the prompt of each request; only the last one is kept, i.e. batch size 1
        inputs = None
        for idx, data in enumerate(requests):
            input_text = data.get("body").get("prompt")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")
            inputs = self.tokenizer(input_text, return_tensors="pt")
        return inputs

    def inference(self, data, *args, **kwargs):
        # Generate with a hard-coded stop token id (91) and the sampling settings below
        generation_output = self.model.generate(
            **data.to(self.device), return_dict_in_generate=True, top_k=4, penalty_alpha=0.6,
            output_scores=True, do_sample=True, eos_token_id=91)
        return generation_output

    def postprocess(self, inference_output):
        # Decode every generated sequence back to text
        inferences = []
        for idx, sentence in enumerate(inference_output.sequences):
            output = self.tokenizer.decode(sentence)
            inferences.append(output)
        return [inferences]
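Before packaging, the generation settings can be smoke-tested outside TorchServe with a short standalone script (a sketch; it assumes a CUDA device is available, since the weights are loaded in fp16):

import torch as th
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = th.device("cuda:0")
tokenizer = GPT2TokenizerFast.from_pretrained("IDEA-CCNL/Wenzhong-GPT2-110M")
model = GPT2LMHeadModel.from_pretrained("Wenzhong-GPT2-110M", torch_dtype=th.float16).to(device).eval()

inputs = tokenizer("你是谁?", return_tensors="pt").to(device)
output = model.generate(**inputs, return_dict_in_generate=True, top_k=4, penalty_alpha=0.6,
                        output_scores=True, do_sample=True, eos_token_id=91)
print(tokenizer.decode(output.sequences[0]))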
Package the model
torch-model-archiver --model-name Wenzhong-GPT2-110M --force --version 1.0 --serialized-file Wenzhong-GPT2-110M/pytorch_model.bin --handler Transformer_handler_generalized.py --export-path model_store/ --extra-files "Wenzhong-GPT2-110M/config.json"
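The resulting .mar file is a zip archive, so its contents can be checked quickly (a sketch; the file name assumes the command above):

import zipfile

with zipfile.ZipFile("model_store/Wenzhong-GPT2-110M.mar") as mar:
    # expect pytorch_model.bin, config.json, the handler script and MAR-INF/MANIFEST.json
    print(mar.namelist())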
Pull the TorchServe image
docker pull pytorch/torchserve:latest-gpu
The config.properties file
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
cors_allowed_origin=*
cors_allowed_methods=*
install_py_dep_per_model=true
default_response_timeout=600
Start TorchServe
docker run --rm -it -d --name Wenzhong --gpus all -p 18080:8080 -p 18081:8081 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu
Install transformers
docker exec Wenzhong pip install -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com transformers
Register the model and run inference
curl -X POST "http://localhost:18081/models?url=Wenzhong-GPT2-110M.mar"
curl -X PUT "http://localhost:18081/models/Wenzhong-GPT2-110M?min_worker=1"
curl -X POST 'http://localhost:18080/predictions/Wenzhong-GPT2-110M' -H 'Content-Type: application/json' --data '{"prompt": "你是谁?"}'
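The same prediction request can also be sent from Python (a sketch using the requests package; the json= argument sets the application/json content type for us):

import requests

resp = requests.post(
    "http://localhost:18080/predictions/Wenzhong-GPT2-110M",
    json={"prompt": "你是谁?"},  # the handler reads the prompt from the parsed JSON body
)
print(resp.json())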
References
GitHub Issue: The memory occupied by the model becomes larger after it is loaded into the GPU