Methods for GPU Memory Management

宇宙计算机

Published 2024-07-14 22:54:09

Tags: artificial intelligence, GPU memory management

Article link: https://blog.csdn.net/weixin_44151034/article/details/140378393

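I ran out of memory while calling a local large model over and over inside a for/while loop. GPU memory management matters a great deal in both training and inference of deep learning models, especially with large models. Poorly managed GPU memory can lead to out-of-memory errors that crash the program, or to degraded performance. The methods and tips below can help you manage GPU memory better when running model code in a loop.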

The code I ran is below (shown first as an example, so the later sections are easier to follow)

import os
import sys
import json
import torch
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

# Read inputs from command-line arguments
model_path = sys.argv[1]
data_file = sys.argv[2]
output_file = sys.argv[3]

# Load the pretrained model
def load_model(model_path):
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
        # load_4bit=True,
        # load_8bit=True     # without 8-bit loading, the model takes about 30944 MiB
    )
    return tokenizer, model, image_processor, context_len

# Read the JSON data file
with open(data_file, 'r') as f:
    data = json.load(f)

# Open the output file
with open(output_file, 'w') as out_f:
    # Iterate over the data for each image
    for image in data:
        image_name = image['image_path']
        pred_classes = image['pred_classes']
        image_file = f"/home/data/yjgroup/fsy/VG_100K/{image_name}"

        prompt = (
            "Generate relation triples for the objects in the image. Relation categories: "
            "Input image: '{}', objects: {}. Generate triples and confidences."
        ).format(image_name, pred_classes)

        args = type('Args', (), {
            "model_path": model_path,
            "model_base": None,
            "model_name": get_model_name_from_path(model_path),
            "query": prompt,
            "conv_mode": None,
            "image_file": image_file,
            "sep": ",",
            "temperature": 0.2,
            "top_p": None,
            "num_beams": 1,
            'max_new_tokens': 512
        })()

        # Load the model (this runs again on every loop iteration)
        tokenizer, model, image_processor, context_len = load_model(model_path)

        # Run model evaluation
        try:
            response = eval_model(args)
        except Exception as e:
            response = f"Error: {str(e)}"
        
        # Append the generated output to the combined output file
        out_f.write(f"Image: {image_name}\n")
        out_f.write(f"Response: {response}\n")
        out_f.write("\n")

        # I already try to free GPU memory here; more approaches follow below
        del model
        torch.cuda.empty_cache()
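
One likely contributor to the memory growth in the script above is that the model is loaded again on every iteration. In the LLaVA repository, eval_model(args) also appears to load its own model from args.model_path, so the extra load_model call inside the loop may be redundant. The sketch below shows the load-once pattern with stand-in functions; load_model_stub and run_inference_stub are hypothetical names used only for illustration, not LLaVA APIs.

```python
load_calls = 0  # counts how many times the (expensive) load runs


def load_model_stub(model_path):
    """Hypothetical stand-in for an expensive model load."""
    global load_calls
    load_calls += 1
    return {"path": model_path}


def run_inference_stub(model, item):
    """Hypothetical stand-in for one inference call."""
    return f"{model['path']}:{item}"


def process_all(items, model_path):
    # Load once, before the loop, and reuse the same object for every item.
    model = load_model_stub(model_path)
    return [run_inference_stub(model, item) for item in items]


results = process_all(["img1.jpg", "img2.jpg", "img3.jpg"], "demo-model")
print(load_calls, results)
```

With a real model the same shape applies: load before the loop, reuse inside it, and release after the loop finishes.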

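1. Release GPU memory promptly

After using the model inside a loop, release its GPU memory promptly.
For example, at the end of each for-loop iteration that calls the large model (this is a simplified version; it is worth digging into further):

# Free GPU memory
del model
torch.cuda.empty_cache()

del model removes only this one reference; the weights are freed once no other reference to the model remains. torch.cuda.empty_cache() then returns cached blocks that PyTorch is no longer using to the GPU driver. It does not free memory held by live tensors, and it does not increase the memory available to PyTorch itself, although it can reduce fragmentation in some cases.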
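
A point that is easy to miss here is that del removes a name, not the object. The following minimal sketch uses a weak reference to observe when an object is really gone; FakeModel is a hypothetical stand-in for a large model, and the behavior shown assumes CPython's reference counting.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object."""


model = FakeModel()
alias = model               # a second reference, e.g. held by a list or a cache
probe = weakref.ref(model)  # lets us check whether the object is still alive

del model
gc.collect()
alive_after_first_del = probe() is not None   # alias still keeps it alive

del alias
gc.collect()
alive_after_second_del = probe() is not None  # no references remain

print(alive_after_first_del, alive_after_second_del)
```

If GPU memory does not drop after del model, it is worth checking whether another variable, list, or closure still refers to the model.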

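2. Use a context manager

A context manager guarantees that cleanup runs when the block exits, even if an exception is raised. For model loading and inference, it can make sure GPU memory is released promptly:

import gc
from contextlib import contextmanager

@contextmanager
def load_and_unload_model(model_path):
    tokenizer, model, image_processor, context_len = load_model(model_path)
    try:
        yield tokenizer, model, image_processor, context_len
    finally:
        # This drops only the reference held here; the names bound by the
        # caller's `with ... as (...)` keep the model alive until rebound or deleted.
        del model
        gc.collect()
        torch.cuda.empty_cache()

Then use this context manager in the loop: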

for image in data:
    # image_name and args are built for each image as in the full example above
    with load_and_unload_model(model_path) as (tokenizer, model, image_processor, context_len):
        # Run model evaluation
        try:
            response = eval_model(args)
        except Exception as e:
            response = f"Error: {str(e)}"

        # Append the generated output to the combined output file
        out_f.write(f"Image: {image_name}\n")
        out_f.write(f"Response: {response}\n")
        out_f.write("\n")
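
To see why try/finally inside a context manager is useful, the sketch below records the order of events when one iteration fails. managed_resource is a hypothetical stand-in for a loader like the one above; no real model is involved.

```python
from contextlib import contextmanager

events = []  # records what happened, in order


@contextmanager
def managed_resource(name):
    """Hypothetical stand-in for loading and releasing a model."""
    events.append(f"load {name}")
    try:
        yield name
    finally:
        # Runs on normal exit and when the body raises.
        events.append(f"release {name}")


for name in ["a", "b"]:
    try:
        with managed_resource(name) as res:
            if res == "b":
                raise RuntimeError("inference failed")
            events.append(f"use {res}")
    except RuntimeError:
        events.append(f"error {name}")

print(events)
```

The release step for "b" is recorded before the error is handled, which is the guarantee that makes this pattern attractive for GPU resources.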

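3. Process data in batches

If the dataset is large, consider processing it in batches to reduce the GPU memory needed at any one time. Split the data into small batches and handle one batch at a time:

batch_size = 10
for i in range(0, len(data), batch_size):
    batch_data = data[i:i + batch_size]
    # Load the model once per batch and release it when the batch is done
    with load_and_unload_model(model_path) as (tokenizer, model, image_processor, context_len):
        for image in batch_data:
            # Run model evaluation
            try:
                response = eval_model(args)
            except Exception as e:
                response = f"Error: {str(e)}"

            # Append the generated output to the combined output file
            out_f.write(f"Image: {image_name}\n")
            out_f.write(f"Response: {response}\n")
            out_f.write("\n")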
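
The slicing logic can be pulled into a small reusable generator. This is a minimal sketch; iter_batches is a hypothetical helper name, and the integers stand in for image records.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


records = list(range(25))  # stand-in for 25 image records
batches = list(iter_batches(records, 10))
sizes = [len(batch) for batch in batches]
print(sizes)
```

The last batch is simply shorter when the total is not a multiple of the batch size, so no record is dropped.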

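4. Use a smaller model

If GPU memory is still insufficient, consider a smaller model or a quantized one. For example, the load_4bit=True and load_8bit=True options that are commented out in the example code can significantly reduce the model's memory footprint:

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    load_4bit=True,  # or load_8bit=True
)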
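
A rough back-of-the-envelope estimate shows why quantization helps. The sketch below counts weight storage only; the 13-billion parameter count is an assumption chosen for illustration, and real usage is higher because of activations, the KV cache, and quantization overhead.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024 ** 3


num_params = 13_000_000_000  # assumed parameter count, for illustration only

estimates = {
    name: round(weight_memory_gib(num_params, bits), 1)
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]
}
print(estimates)
```

Halving the bits per weight halves the weight storage, which is why 8-bit and 4-bit loading can make a model fit on a card where the 16-bit version does not.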

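5. Adjust GPU memory allocation dynamically

PyTorch provides tools for monitoring and tuning GPU memory allocation. torch.cuda.memory_allocated() and torch.cuda.memory_reserved() do not change how memory is allocated; they report how much memory is occupied by tensors and how much the caching allocator has reserved, so you can monitor usage and adjust accordingly. To change the allocator's behavior itself, PyTorch offers torch.cuda.set_per_process_memory_fraction() and the PYTORCH_CUDA_ALLOC_CONF environment variable.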
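
A small reporting helper makes it easier to see whether memory grows from one iteration to the next. This is a minimal sketch: cuda_memory_report is a hypothetical helper name, and it returns None when PyTorch or a CUDA device is not available so that it can run anywhere.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def cuda_memory_report(device=0):
    """Return current CUDA memory statistics in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        # Memory occupied by live tensors
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        # Memory held by the caching allocator (allocated plus cached)
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        # Highest tensor memory seen so far in this process
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
    }


report = cuda_memory_report()
print(report)
```

Calling it at the end of every loop iteration and comparing the numbers is a simple way to tell a genuine leak (allocated keeps rising) from allocator caching (only reserved stays high).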

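6. Use mixed-precision training

Mixed precision reduces GPU memory use and usually speeds up computation. In PyTorch it is enabled with autocast. GradScaler is needed only when training with gradients, so the inference loop below leaves it out:

import torch

for image in data:
    with load_and_unload_model(model_path) as (tokenizer, model, image_processor, context_len):
        # Inference only: disable gradient tracking and run eligible ops in float16
        with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
            # Run model evaluation
            try:
                response = eval_model(args)
            except Exception as e:
                response = f"Error: {str(e)}"

            # Append the generated output to the combined output file
            out_f.write(f"Image: {image_name}\n")
            out_f.write(f"Response: {response}\n")
            out_f.write("\n")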