LLaVA Project Usage Notes (1): Running the Demo

qq_58400270

Last modified 2023-12-18 22:56:08

Categories: Multimodal Large Models, LLaVA. Tags: server, learning

First published 2023-12-18 22:23:40

Article link: https://blog.csdn.net/qq_58400270/article/details/135072952

LLaVA Usage Notes (1): Downloading the Project and Running the Demo

Running the project requires a conda environment and a GPU. For the related setup, see my two other notes.

1. Download the project

First, change to the target directory and use git clone to download the project from the GitHub repository.

git clone git@github.com:haotian-liu/LLaVA.git

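[Figure: downloading the project with git]

Before cloning over SSH, the server must be authorized to access GitHub, which means registering the server's SSH public key with your GitHub account.

Some basic Git configuration: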

git config --global user.name "your-username"
git config --global user.email "your-email@example.com"
git config --list # verify the settings

Then generate an SSH key pair; the -C option adds a comment.
```
ssh-keygen -t rsa -C "comment"
```
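Next, copy the contents of id_rsa.pub (the public key) into GitHub. Once the key has been added successfully, the server can access the repository.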

2. Set up the runtime environment

2.1 Configure the Anaconda environment

Create the environment with conda. The Python version used here is 3.10.13, and torch.__version__ is 2.0.1+cu117.

conda create -n llava python=3.10 -y  # -y answers yes to all prompts
conda activate llava

2.2 Install the package

This part mostly follows the instructions in the LLaVA project's own files.

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

Install the required packages.

pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install the additional packages needed for training.

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

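Installing the training packages produced an error.
[Figure: screenshot of the installation error]

As for what the command does: pip install flash-attn installs the Python package flash-attn, a fast, memory-efficient implementation of attention. --no-build-isolation turns off build isolation while the package is built and installed, so the build uses the packages already present in the environment (such as torch). After some research, the likely cause of the error is a torch or CUDA version mismatch. The requirements are listed below; the setup procedure is covered in another note.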

CUDA 11.6 and above
PyTorch 1.12 and above

At this point the packages the program needs are mostly installed.
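
To check an environment against the two minimums listed above before building flash-attn, the installed versions can be compared programmatically. The following is only a sketch: the helper names `parse_version` and `meets_minimum` are made up for this note, and the thresholds are simply the two values quoted above. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system toolkit.

```python
import re

def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints.

    "2.0.1+cu117" -> (2, 0, 1); "11.7" -> (11, 7)
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))

def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (tuple comparison)."""
    return parse_version(installed) >= parse_version(minimum)

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("PyTorch", torch.__version__, "ok:", meets_minimum(torch.__version__, "1.12"))
        cuda = torch.version.cuda  # None for CPU-only builds
        print("CUDA", cuda, "ok:", cuda is not None and meets_minimum(cuda, "11.6"))
```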

3. Quick start

3.1 Code example (loading the model)

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

# Use a local path if the weights are already downloaded; use this Hub ID if Hugging Face is reachable
model_path = "liuhaotian/llava-v1.5-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

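To use the model, the weight files have to be downloaded from Hugging Face (download link); the official documentation also describes how to download files. On the server, however, the connection to Hugging Face kept failing, so I downloaded the files on my local machine and copied them to the server. After the download, open config.json, download the vision encoder weights it specifies, and change the vision encoder path in the config to the local path. Then try running quickstart.py. If it succeeds, output like the following appears, which means the model has been loaded.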

(llava) .... /LLaVA#python3.10 /path/to/your/code/quickstart.py       
[2023-12-17 00:36:54,368] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.07s/it]
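
The config.json edit described above can also be scripted. The sketch below assumes that the vision encoder entry is stored under the key `mm_vision_tower`; check your own config.json before relying on that. The function name `set_vision_tower` and both example paths are placeholders invented for this note.

```python
import json
from pathlib import Path

def set_vision_tower(config_path, local_tower_dir, key="mm_vision_tower"):
    """Point the vision encoder entry of config.json at a local directory.

    Returns the previous value so it can be restored if needed.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if key not in config:
        raise KeyError(f"{key!r} not found in {config_path}")
    previous = config[key]
    config[key] = str(local_tower_dir)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous

# Example call (both paths are placeholders, not real locations):
# set_vision_tower("/path/to/weights/llava-v1.5-7b/config.json",
#                  "/path/to/weights/vision-encoder")
```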

Next, try evaluating the model (evaluate).

model_path = "liuhaotian/llava-v1.5-7b"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
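
The `type('Args', (), {...})()` expression above may look unusual: it creates a throwaway class whose attributes come from the dict and instantiates it at once, so that eval_model can read `args.query`, `args.temperature`, and so on. A minimal standalone sketch of the idiom follows, next to `types.SimpleNamespace`, which should be interchangeable as long as the consumer only reads and sets attributes (an assumption about eval_model, not something verified here). The field values are illustrative.

```python
from types import SimpleNamespace

fields = {"query": "What is in the image?", "temperature": 0, "num_beams": 1}

# type(name, bases, namespace) creates a class; the trailing () instantiates it.
args_a = type("Args", (), fields)()

# SimpleNamespace stores the same values as instance attributes.
args_b = SimpleNamespace(**fields)

print(args_a.query, args_a.temperature)
print(args_b.query, args_b.temperature)
```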

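3.2 Fixing the CUDA and cuDNN configuration

Evaluating the model raised an error: a CUDA library (cuBLAS) failed to initialize, which is probably related to the environment configuration.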

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Select the GPU by prefixing the command, as follows.

CUDA_VISIBLE_DEVICES=0 python3.10 /path/to/your/file/quickstart.py

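A different error then appeared. The message says that CUDA memory is exhausted: PyTorch tried to allocate 20.00 MiB on GPU 0, but 10.22 GiB of the 10.75 GiB capacity was already allocated and only 4.50 MiB was free. PyTorch had reserved 10.22 GiB in total. The message suggests setting max_split_size_mb to avoid fragmentation when reserved memory is much larger than allocated memory, and points to the Memory Management and PYTORCH_CUDA_ALLOC_CONF documentation. Here the reserved and allocated amounts are equal, so fragmentation is probably not the main cause.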

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.75 GiB total capacity; 10.22 GiB already allocated; 4.50 MiB free; 10.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

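Further reading shows that this problem is mainly caused by insufficient GPU memory. Possible solutions include:

Reduce batch_size; evaluating and training with smaller batches lowers memory use;
Disable gradient tracking during inference; with torch.no_grad() saves a significant amount of memory;
Reduce the input size;
…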
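
A back-of-the-envelope estimate helps to see why a single card of this size runs out of memory. The sketch below counts only the weights and assumes roughly 7 billion parameters (taken from the "7b" in the model name) stored in half precision; activations, the KV cache, and the vision encoder are ignored, so real usage is higher.

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB

N_PARAMS = 7e9            # assumption: roughly 7 billion parameters
GPU_CAPACITY_GIB = 10.75  # total capacity reported in the error message above

for bits in (16, 8, 4):
    need = weight_memory_gib(N_PARAMS, bits)
    verdict = "fits" if need < GPU_CAPACITY_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: {need:5.2f} GiB -> {verdict} in {GPU_CAPACITY_GIB} GiB")
```

Under these assumptions the 16-bit weights alone need about 13 GiB, more than the 10.75 GiB card offers, while 8-bit or 4-bit weights would leave some headroom.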

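Following the suggestion, I set max_split_size_mb. In a Linux shell the variable is set with export (set VAR=value is the Windows cmd form and does not set an environment variable in bash):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32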
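
The same allocator setting can also be applied from inside the Python script rather than the shell. This is a minimal sketch; the value 32 is just the one used in this note, and the assignment has to happen before PyTorch makes its first CUDA allocation, so placing it before `import torch` is the safe choice.

```python
import os

# Configure the CUDA caching allocator before torch is imported.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```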

Another error appeared:

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

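Further reading suggests that this may be a configuration problem among cuDNN, CUDA, and PyTorch. Since I installed matching versions during the initial setup, a compatibility problem seems unlikely. I therefore suspected the environment variables (I have not pinned down the exact cause) and added CUDNN_HOME, CUDA_HOME, and the library path to the environment.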

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDNN_HOME=/usr/local/cuda
export CUDA_HOME=/usr/local/cuda

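After saving the configuration and running source, restart the terminal so that the changes take effect (I kept getting the error because I had not restarted the terminal, and I found no other fix; restarting really does solve 90% of problems). Then check the environment variables.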

echo $LD_LIBRARY_PATH
echo $CUDNN_HOME
echo $CUDA_HOME

Once they are confirmed to be correct, run the program. The output is as follows:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00,  8.08s/it]
When visiting this location, which features a pier extending over a large body of water, there are a few things to be cautious about. First, be mindful of the weather conditions, as the pier may be affected by strong winds or storms, which could make it unsafe to walk on. Second, be aware of the water depth and any potential hazards, such as submerged rocks or debris, that could pose a risk to your safety. Additionally, be cautious of the presence of wildlife in the area, as there might be birds or other animals that could pose a threat or disturbance. Finally, be respectful of the environment and other visitors, and follow any posted rules or guidelines to ensure a safe and enjoyable experience for everyone.

It ran successfully.

3.3 `eval_model` code analysis

In the evaluation code above, the model specified by model_path is asked the question prompt about the image specified by image_file.

model_path = "liuhaotian/llava-v1.5-7b"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

Image	Answer
	When visiting this location, which features a pier extending over a large body of water, there are a few things to be cautious about. First, be mindful of the weather conditions, as the pier may be affected by strong winds or storms, which could make it unsafe to walk on. Second, be aware of the water depth and any potential hazards, such as submerged rocks or debris, that could pose a risk to your safety. Additionally, be cautious of the presence of wildlife in the area, as there might be birds or other animals that could pose a threat or disturbance. Finally, be respectful of the environment and other visitors, and follow any posted rules or guidelines to ensure a safe and enjoyable experience for everyone.

Since eval_model is the only part of the code that produces output, this function is examined next.

# function path: llava/eval/run_llava.py /eval_model
def eval_model(args):
    ...

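First, the model is loaded with the load_pretrained_model function provided by llava/model/builder.py.

    # Skip torch's redundant default weight initialization to speed up model creation
    disable_torch_init()
    # Load the pretrained model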
    model_name = get_model_name_from_path(args.model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        args.model_path, args.model_base, model_name
    )

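Next, the query (the prompt in the test example above) is processed together with the image placeholder, so that the model always receives input with the same structure. With the downloaded files, the query here contains no explicit image placeholder, so one of the else branches prepends the image token to the query.

    # Process the query: check whether it contains the image placeholder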
    qs = args.query
    image_token_se = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    if IMAGE_PLACEHOLDER in qs:
        if model.config.mm_use_im_start_end:
            qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
        else:
            qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
    else:
        if model.config.mm_use_im_start_end:
            qs = image_token_se + "\n" + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
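
The four branches above can be tried on plain strings without loading a model. In the sketch below the constant values are the ones I recall from llava/constants.py and should be treated as assumptions; `build_query` is a helper name made up for this note.

```python
import re

# Values as recalled from llava/constants.py (treat as assumptions).
IMAGE_PLACEHOLDER = "<image-placeholder>"
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"

def build_query(qs, use_im_start_end):
    """Mirror the four branches above on plain strings."""
    wrapped = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    token = wrapped if use_im_start_end else DEFAULT_IMAGE_TOKEN
    if IMAGE_PLACEHOLDER in qs:
        # Replace an explicit placeholder wherever the user put it.
        return re.sub(re.escape(IMAGE_PLACEHOLDER), token, qs)
    # No placeholder: put the image token on its own line before the question.
    return token + "\n" + qs

print(repr(build_query("What is shown here?", False)))
print(repr(build_query("Compare <image-placeholder> with the text.", True)))
```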

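Next, the conversation mode is chosen (the default for the model, or the one the user specifies), and the matching conversation template is selected. conv contains two roles (conv.roles: one that asks and one that answers).

    # Infer the conversation mode from the model name
    ...
    # If the user passed `--conv-mode`, use that conversation mode instead
    ...
    # Select the conversation template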
    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
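
To make the role of the template concrete, here is a deliberately simplified stand-in. It is not the class from llava/conversation.py: the class name `MiniConversation`, the system text, the role names, and the separators are illustrative assumptions. It only shows how appending a question and an empty answer slot yields a prompt that ends where the model should continue.

```python
from dataclasses import dataclass, field

@dataclass
class MiniConversation:
    """Simplified, hypothetical stand-in for one conversation template."""
    system: str
    roles: tuple = ("USER", "ASSISTANT")
    sep: str = " "       # appended after the system text and after user turns
    sep2: str = "</s>"   # appended after assistant turns
    messages: list = field(default_factory=list)

    def append_message(self, role, message):
        self.messages.append((role, message))

    def get_prompt(self):
        seps = [self.sep, self.sep2]
        prompt = self.system + seps[0]
        for i, (role, message) in enumerate(self.messages):
            if message:
                prompt += f"{role}: {message}{seps[i % 2]}"
            else:
                # An empty slot ends the prompt with "ROLE:" so the model continues from there.
                prompt += f"{role}:"
        return prompt

conv = MiniConversation(system="A chat between a user and an assistant.")
conv.append_message(conv.roles[0], "<image>\nWhat is shown here?")
conv.append_message(conv.roles[1], None)
print(repr(conv.get_prompt()))
```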

The fourth step loads the image and adds the image token to the input token sequence.

    # Parse, load, and preprocess the image(s)
    image_files = image_parser(args)
    images = load_images(image_files)
    images_tensor = process_images(
        images,
        image_processor,
        model.config
    ).to(model.device, dtype=torch.float16)

    # Insert the image token into the input tokens
    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .cuda()
    )

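tokenizer_image_token is a function defined in mm_utils.py. It tokenizes a prompt that contains image placeholders and inserts the image token index at each position where a placeholder stood. Its parameters are:

prompt: a prompt that contains image placeholders;
tokenizer: the tokenizer used to tokenize the text;
image_token_index: the index assigned to the special token that represents an image;
return_tensors: whether the function returns a tensor ('pt') or a list of ints (None).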

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
    # Tokenize each text chunk separated by '<image>'
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    # Check if the first chunk starts with the BOS token
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])

    # Insert image_token_index as a separator between prompt chunks
    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])

    # Return tokenized input as either a list or a PyTorch tensor
    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids
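
The behavior of the function is easiest to see on a toy input. The sketch below repeats the list-returning path of the function (the tensor branch is left out so that torch is not needed) and feeds it a made-up whitespace tokenizer. `ToyTokenizer` is hypothetical and exists only for this example, and the value -200 for IMAGE_TOKEN_INDEX is what I recall from llava/constants.py, so treat it as an assumption.

```python
from types import SimpleNamespace

IMAGE_TOKEN_INDEX = -200  # assumed value; see llava/constants.py

class ToyTokenizer:
    """Whitespace tokenizer that prepends a BOS id, just enough to exercise the logic."""
    bos_token_id = 1

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        ids = [self.bos_token_id]
        for word in text.split():
            # Unseen words get the next free id, starting at 2.
            ids.append(self.vocab.setdefault(word, len(self.vocab) + 2))
        return SimpleNamespace(input_ids=ids)

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX):
    """List-only copy of the function above (the tensor branch is omitted)."""
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]

    def insert_separator(chunks, sep):
        # [a, b, c] -> [a, sep, b, sep, c]
        return [ele for pair in zip(chunks, [sep] * len(chunks)) for ele in pair][:-1]

    input_ids = []
    offset = 0
    if prompt_chunks and prompt_chunks[0] and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1  # every chunk starts with BOS; keep only the first one
        input_ids.append(prompt_chunks[0][0])
    for chunk in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(chunk[offset:])
    return input_ids

ids = tokenizer_image_token("USER: <image>\nWhat is this? ASSISTANT:", ToyTokenizer())
print(ids)  # [1, 2, -200, 3, 4, 5, 6]
```

The single BOS id stays at the front, the text on each side of `<image>` keeps its own ids, and the placeholder itself becomes one entry with the image token index.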

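Next, a stopping criterion is set up so that generation stops once a keyword is detected.

    # Stop generating when the stop keyword appears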
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

Finally, model.generate produces the output tokens. Because the code uses model_base=None, it is called directly:

    # Run under `torch.inference_mode()` to reduce memory use
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=images_tensor,
            do_sample=True if args.temperature > 0 else False,
            temperature=args.temperature,
            top_p=args.top_p,
            num_beams=args.num_beams,
            max_new_tokens=args.max_new_tokens,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )

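For the generated tokens, the code counts how many of the first input_token_len output positions differ from the input tokens. In this code the output is expected to begin with the prompt tokens, followed by the new tokens, so this is a sanity check that the prompt came back unchanged; only then does the slice output_ids[:, input_token_len:] taken afterwards contain exactly the newly generated tokens.

    # Count the positions where the output prefix differs from the input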
    input_token_len = input_ids.shape[1]
    n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
    if n_diff_input_output > 0:
        print(
            f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids"
        )

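Finally, the result is decoded and printed. batch_decode decodes a batch of token sequences in one call; the batch here holds a single sequence, so element [0] is taken. A trailing stop string, if present, is removed.

    # Decode the newly generated tokens and print the result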
    outputs = tokenizer.batch_decode(
        output_ids[:, input_token_len:], skip_special_tokens=True
    )[0]
    outputs = outputs.strip()
    if outputs.endswith(stop_str):
        outputs = outputs[: -len(stop_str)]
    outputs = outputs.strip()
    print(outputs)
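
The last two steps (the prefix check and the removal of the stop string) can be reproduced with plain Python lists and strings. This is a sketch under stated assumptions: the token ids are invented, `</s>` is only an example stop string, and `split_generation` and `trim_stop` are helper names made up for this note.

```python
def split_generation(input_ids, output_ids):
    """Return (number of mismatching prompt positions, newly generated ids)."""
    n = len(input_ids)
    n_diff = sum(a != b for a, b in zip(input_ids, output_ids[:n]))
    return n_diff, output_ids[n:]

def trim_stop(text, stop_str):
    """Strip whitespace and one trailing stop string."""
    text = text.strip()
    # Guard against an empty stop string, for which text[:-0] would erase everything.
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()

prompt_ids = [1, 2, -200, 3]
generated = [1, 2, -200, 3, 40, 41, 42]  # prompt echoed back, then three new ids
n_diff, new_ids = split_generation(prompt_ids, generated)
print(n_diff, new_ids)  # 0 [40, 41, 42]
print(repr(trim_stop(" A pier over a lake.</s> ", "</s>")))  # 'A pier over a lake.'
```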

4. Running the demo

Running the demo requires the model checkpoint to be downloaded locally; see the Model Zoo.

Start a controller:
```
python3.10 -m llava.serve.controller --host 0.0.0.0 --port 10000
```
Once it has started, it prints status messages as it receives requests.
Open a new terminal and start the Gradio web server:
```
python3.10 -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
```
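Once it has started, it also prints status messages.

The messages show that you can click the link or open localhost:7860 to use the web interface for question answering (a model_worker must be running).
Open a new terminal and start a model_worker: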
```
python3.10 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /path/to/weights/llava-v1.5-7b
```
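Options:
- --model-path: the path of the model; each worker serves exactly one model.
- CUDA_VISIBLE_DEVICES: set this before the command to choose which GPUs the worker may use; several GPUs can be listed.
- --load-4bit and --load-8bit: load the model with quantized weights to reduce GPU memory use. 4-bit uses fewer bits per weight and therefore needs the least memory, but it may lower the model's accuracy; 8-bit keeps more precision and still needs less memory than the unquantized weights.
- --model-base: the base language model (LLM) that the LoRA weights were trained on.
Multiple models can be served, but each worker needs its own port.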
```
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path <ckpt2>
```
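
When several workers are needed, the commands can be generated rather than typed by hand. The sketch below only builds the argument lists from the flags shown above; `worker_command` is a helper name made up for this note, and the checkpoint paths are placeholders.

```python
import sys

def worker_command(model_path, port, controller="http://localhost:10000", load_4bit=False):
    """Build the argv list for one llava.serve.model_worker process."""
    cmd = [
        sys.executable, "-m", "llava.serve.model_worker",
        "--host", "0.0.0.0",
        "--controller", controller,
        "--port", str(port),
        "--worker", f"http://localhost:{port}",
        "--model-path", model_path,
    ]
    if load_4bit:
        cmd.append("--load-4bit")
    return cmd

# Placeholder checkpoint paths; each worker gets its own port.
checkpoints = ["/path/to/weights/llava-v1.5-7b", "/path/to/ckpt2"]
for offset, path in enumerate(checkpoints):
    print(" ".join(worker_command(path, 40000 + offset)))

# To actually launch them (requires LLaVA installed and a running controller):
# import subprocess
# procs = [subprocess.Popen(worker_command(p, 40000 + i)) for i, p in enumerate(checkpoints)]
```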

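Use the command-line client. This client does not depend on the Gradio interface; it also supports multiple GPUs and a choice of quantization bit width.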

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit

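Even after these changes the problem persisted, and I have not yet found a suitable solution. In the error below PyTorch had allocated only about 1 GiB, yet just 10.50 MiB of the 10.75 GiB was free, which suggests that other processes were already occupying most of GPU 6.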

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 6; 10.75 GiB total capacity; 1012.33 MiB already allocated; 10.50 MiB free; 1014.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF