GPU Memory Usage During Model Training and Inference

Latest recommended article published on 2025-04-27 14:13:04

Matrix 工作室

Article link: https://blog.csdn.net/weixin_43336281/article/details/129643995

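Background

I have two GPT2 models. Model 1 has only about 100 million parameters stored as 16-bit floats, which comes to roughly 250 MB. Model 2 has 3.5 billion parameters, also stored as 16-bit floats, which comes to roughly 7 GB.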
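
The size estimates above follow from multiplying the parameter count by the bytes per parameter. A minimal sketch of that arithmetic is shown here; the helper name `weights_size_bytes` is made up for illustration. For exactly 100 million FP16 parameters the result is 200 MB, so the quoted figure of roughly 250 MB presumably reflects the real checkpoint, which (this is an assumption) has somewhat more parameters plus some non-parameter buffers.

```python
def weights_size_bytes(num_params, bits_per_param=16):
    """Raw size of the weights: one value of `bits_per_param` bits per parameter."""
    return num_params * bits_per_param // 8


if __name__ == "__main__":
    # Parameter counts as quoted in the text (rounded figures).
    for name, num_params in [("model 1", 100_000_000), ("model 2", 3_500_000_000)]:
        size = weights_size_bytes(num_params)
        # nvidia-smi reports MiB (2**20 bytes), so print both units.
        print(f"{name}: {size / 1e6:.0f} MB = {size / 2**20:.0f} MiB")
```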

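I assumed that, at inference time, a model loaded onto the GPU would occupy about as much memory as its weights. However, after the 100-million-parameter model was loaded into TorchServe, the worker process occupied 957 MiB, and it was not obvious where the extra 700 MiB or so came from.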

$ nvidia-smi
Sun Mar 19 13:54:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |    959MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   3267643      C   /home/venv/bin/python             957MiB |
+-----------------------------------------------------------------------------+

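Explanation

The extra memory is the overhead of the CUDA context, which is created when the process executes its first CUDA-related operation.

To find out how much GPU memory the CUDA context takes on your own card, create a very small tensor, move it to the GPU, and check the memory usage.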

$ python
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> data = [1]
>>> x_data = torch.tensor(data)
>>> x_data.cuda()
tensor([1], device='cuda:0')

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   33C    P8     9W / 250W |    437MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |    959MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3307828      C   python                            435MiB |
|    1   N/A  N/A   3267643      C   /home/venv/bin/python             957MiB |
+-----------------------------------------------------------------------------+

As the output shows, even though we created a tensor with a single element, the process still occupies 435 MiB of GPU memory.
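
To compare the two processes without reading the table by eye, the per-process rows can be parsed with a small script. This is only a sketch: the regular expression assumes the row layout shown in the output above and process names without spaces, and `parse_process_rows` is a hypothetical helper, not part of any NVIDIA tool. The two processes also run on different GPUs and load different libraries, so the difference is indicative rather than exact.

```python
import re

# Per-process rows copied from the nvidia-smi output above.
SMI_ROWS = """\
|    0   N/A  N/A   3307828      C   python                            435MiB |
|    1   N/A  N/A   3267643      C   /home/venv/bin/python             957MiB |
"""

# Columns: GPU index, GI ID, CI ID, PID, type, process name, memory in MiB.
ROW_PATTERN = re.compile(
    r"\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|"
)


def parse_process_rows(text):
    """Map each PID to the GPU memory (in MiB) that nvidia-smi reports for it."""
    usage = {}
    for match in ROW_PATTERN.finditer(text):
        _gpu, pid, _name, mib = match.groups()
        usage[int(pid)] = int(mib)
    return usage


usage = parse_process_rows(SMI_ROWS)
bare_context_mib = usage[3307828]  # interpreter that only moved one tiny tensor
serving_mib = usage[3267643]       # TorchServe worker holding the model
print(f"memory beyond a bare context: {serving_mib - bare_context_mib} MiB")
# prints "memory beyond a bare context: 522 MiB"
```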

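The memory taken by the CUDA context is not related to the model size; it depends on the GPU model (and, in practice, on the driver, CUDA, and PyTorch versions as well). In other words, a large model does not pay a noticeably larger CUDA context overhead than a small one.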
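
The same effect can be observed from inside Python. `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` only count memory managed by PyTorch's caching allocator, while `torch.cuda.mem_get_info()` reports device-wide usage, so the gap between them roughly corresponds to the context overhead. The sketch below assumes no other process is using the same GPU; the function name `cuda_memory_report` is made up for illustration, and it returns `None` when PyTorch or CUDA is unavailable.

```python
def cuda_memory_report():
    """Return allocator-level and device-level GPU memory in MiB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    mib = 1024 ** 2
    x = torch.tensor([1]).cuda()  # first CUDA operation: this creates the context
    torch.cuda.synchronize()
    # Device-wide numbers: they include the context and any other process on the GPU.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    report = {
        "allocated_by_tensors": torch.cuda.memory_allocated() / mib,
        "reserved_by_allocator": torch.cuda.memory_reserved() / mib,
        "device_used": (total_bytes - free_bytes) / mib,
    }
    del x
    return report


if __name__ == "__main__":
    print(cuda_memory_report())
```

On an otherwise idle GPU, `device_used` should be hundreds of MiB while `allocated_by_tensors` stays close to zero.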

Experiment code

Model used in the experiment

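# Save the ~100M-parameter model with FP16 weights
from transformers import GPT2Tokenizer, GPT2LMHeadModel
hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
tokenizer = GPT2Tokenizer.from_pretrained(hf_model_path)
model = GPT2LMHeadModel.from_pretrained(hf_model_path)
model.half()  # cast the weights to 16-bit floats in place
# Note: recent transformers releases write model.safetensors by default, while the
# packaging step below expects pytorch_model.bin; passing safe_serialization=False
# to save_pretrained keeps the .bin format.
model.save_pretrained("Wenzhong-GPT2-110M")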

TorchServe

Model handler script: Transformer_handler_generalized.py

import torch as th
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from ts.torch_handler.base_handler import BaseHandler


class TransformersGpt2Handler(BaseHandler):
	def __init__(self):
		super(TransformersGpt2Handler, self).__init__()
		self.initialized = False

	def initialize(self, ctx):
		self.manifest = ctx.manifest
		properties = ctx.system_properties
		model_dir = properties.get("model_dir")

		self.device = th.device(
			"cuda:" + str(properties.get("gpu_id"))
			if th.cuda.is_available() and properties.get("gpu_id") is not None
			else "cpu"
		)

		self.model = GPT2LMHeadModel.from_pretrained(model_dir, torch_dtype=th.float16)
		self.model.to(self.device)

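		# The tokenizer files are not packaged in the .mar, so they are fetched from the Hugging Face Hub.
		hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
		self.tokenizer = GPT2TokenizerFast.from_pretrained(hf_model_path)
		# GPT-2 has no pad token, so reuse <|endoftext|>. add_special_tokens returns
		# the number of tokens added, not a token id, so the result is not stored.
		self.tokenizer.add_special_tokens({"pad_token": "<|endoftext|>"})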

		self.model.eval()
		self.initialized = True

	def preprocess(self, requests):
		# Only the last request in the batch is tokenized, which is sufficient
		# for TorchServe's default batch size of 1.
		inputs = None
		for idx, data in enumerate(requests):
			input_text = data.get("body").get("prompt")
			if isinstance(input_text, (bytes, bytearray)):
				input_text = input_text.decode("utf-8")
			inputs = self.tokenizer(input_text, return_tensors="pt")
		return inputs

	def inference(self, data, *args, **kwargs):
		generation_output = self.model.generate(
			**data.to(self.device), return_dict_in_generate=True, top_k=4, penalty_alpha=0.6,
			output_scores=True, do_sample=True, eos_token_id=91)
		return generation_output

	def postprocess(self, inference_output):
		inferences = []
		for idx, sentence in enumerate(inference_output.sequences):
			output = self.tokenizer.decode(sentence)
			inferences.append(output)
		return [inferences]
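
The `preprocess` method above keeps only the last request of a batch. If batching were enabled, the prompts would first have to be collected from every request. The sketch below shows one way to do that in plain Python; `extract_prompts` is a hypothetical helper, and the assumption that each request carries its payload under `"body"` (or `"data"`) as a dict, a JSON string, or JSON bytes is not confirmed by the original handler.

```python
import json


def extract_prompts(requests):
    """Collect one prompt string per request in a TorchServe-style batch."""
    prompts = []
    for request in requests:
        body = request.get("body") or request.get("data")
        # The payload may arrive as raw bytes or as a JSON string.
        if isinstance(body, (bytes, bytearray)):
            body = body.decode("utf-8")
        if isinstance(body, str):
            body = json.loads(body)
        prompt = body.get("prompt")
        if isinstance(prompt, (bytes, bytearray)):
            prompt = prompt.decode("utf-8")
        prompts.append(prompt)
    return prompts
```

The resulting list could then be tokenized in one call, for example `self.tokenizer(prompts, return_tensors="pt", padding=True)`, and `postprocess` would need to return one result per request.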

Packaging the model

torch-model-archiver --model-name Wenzhong-GPT2-110M --force --version 1.0 --serialized-file Wenzhong-GPT2-110M/pytorch_model.bin  --handler Transformer_handler_generalized.py    --export-path model_store/ --extra-files "Wenzhong-GPT2-110M/config.json"

Pull the TorchServe image

docker pull pytorch/torchserve:latest-gpu

The config.properties configuration file

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store

cors_allowed_origin=*
cors_allowed_methods=*

install_py_dep_per_model=true

default_response_timeout=600

Start TorchServe

docker run --rm -it -d --name Wenzhong --gpus all -p 18080:8080 -p 18081:8081 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu

Install transformers

docker exec Wenzhong pip install -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com transformers

Register the model

curl -X POST "http://localhost:18081/models?url=Wenzhong-GPT2-110M.mar"
curl -X PUT  "http://localhost:18081/models/Wenzhong-GPT2-110M?min_worker=1"
curl -X POST 'http://localhost:18080/predictions/Wenzhong-GPT2-110M' -H 'Content-Type: application/json' --data '{"prompt": "你是谁？"}'
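
The prediction request can also be sent from Python using only the standard library. This is a sketch under the same assumptions as the curl commands above (inference port mapped to 18080, model registered as Wenzhong-GPT2-110M); the helper names `build_prediction_request` and `send` are made up for illustration.

```python
import json
import urllib.request


def build_prediction_request(prompt, base_url="http://localhost:18080",
                             model_name="Wenzhong-GPT2-110M"):
    """Build a POST request carrying {"prompt": ...} as a JSON body."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predictions/{model_name}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the response body as text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, once TorchServe is up and the model is registered:
#   print(send(build_prediction_request("你是谁？")))
```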