[Notes] Pitfalls of vLLM Multi-GPU Inference


Preface

For personal study and record-keeping only. This post mainly documents problems encountered when running vLLM inference on multiple GPUs.

Setup

vllm version: 0.5.1
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Problem 1

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-16GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.

Solution

See the official documentation:

dtype – The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the torch_dtype attribute specified in the model config file. However, if the torch_dtype in the config is float32, we will use float16 instead.

Therefore, add the following to the code:

    llm = LLM(
        ...
        dtype='float16',
        ...
    )
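
For reference, a minimal end-to-end call might look like the following sketch; the model path and prompt are placeholders, not part of the original setup:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/path/to/your/model",  # placeholder checkpoint path
        dtype="float16",              # bfloat16 needs compute capability >= 8.0; V100 is 7.0
    )
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)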

Problem 2

After fixing Problem 1, when the tensor_parallel_size argument of vllm.LLM is set to a value greater than 1 (i.e., multi-GPU), the program hangs and raises a RuntimeError:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

Solution

Following the replies in https://github.com/vllm-project/vllm/issues/6152, set:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Or add this in the code:

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
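
A minimal sketch of how this fits together with the multi-GPU call (the model path is a placeholder); the variable has to be set before the LLM engine spawns its worker processes:

    import os
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    from vllm import LLM

    llm = LLM(
        model="/path/to/your/model",  # placeholder
        dtype="float16",
        tensor_parallel_size=2,       # shard across the two V100s listed above
    )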

Problem 3

After fixing Problem 2, the program may still hang and raise a RuntimeError about the bootstrapping phase of a new process:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
ERROR 11-28 20:12:42 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3449547 died, exit code: 1
INFO 11-28 20:12:42 multiproc_worker_utils.py:123] Killing local vLLM worker processes

解决办法

按照 https://github.com/vllm-project/vllm/issues/5637 的解决办法,将spawn改成fork可能不会奏效,这是因为:

 It seems some tests will initialize cuda before launching vllm worker, which makes fork not possible.

In this case, a workaround to try is to set the following before any command that might touch CUDA:

VLLM_WORKER_MULTIPROC_METHOD=spawn

Or add this at the very top of the script:

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
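
Putting it together, a sketch of the overall script layout (paths are placeholders). Since the workers are started with spawn, the `if __name__ == '__main__':` guard mentioned in the error message above is needed as well:

    import os
    # Set before importing torch/vllm or anything else that may initialize CUDA.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    from vllm import LLM

    def main():
        llm = LLM(
            model="/path/to/your/model",  # placeholder
            dtype="float16",
            tensor_parallel_size=2,
        )
        for out in llm.generate(["Hello"]):
            print(out.outputs[0].text)

    if __name__ == "__main__":
        main()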

Problem 4

After fixing Problem 3 and re-running, the log gets stuck at:

(VllmWorkerProcess pid=166876) INFO 11-25 20:57:27 pynccl.py:63] vLLM is using nccl==2.20.5

Solution

Following https://docs.vllm.ai/en/stable/getting_started/debugging.html, turn on more verbose vLLM logging:

export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging.

export CUDA_LAUNCH_BLOCKING=1 to identify which CUDA kernel is causing the problem.

export NCCL_DEBUG=TRACE to turn on more logging for NCCL.

export VLLM_TRACE_FUNCTION=1 to record all function calls for inspection in the log files to tell which function crashes or hangs.

You can also set these in the code:

os.environ["VLLM_LOGGING_LEVEL"]="DEBUG"
os.environ["NCCL_DEBUG"]="TRACE"
os.environ["VLLM_TRACE_FUNCTION"]="1"

This should produce more detailed error information.

If there is no further error output and your symptoms match the description above exactly, the cause may be that P2P (peer-to-peer) communication between the GPUs is not working properly.
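
Before disabling P2P outright, a quick diagnostic sketch (not from the original post) is to ask the driver, via PyTorch, whether peer access is reported between the two cards; `nvidia-smi topo -m` also prints the link topology. Even when peer access is reported as available, transfers can still misbehave on some PCIe setups, which is what the workaround below addresses:

    import torch

    # True means GPU 0 can directly access GPU 1's memory (and vice versa).
    print(torch.cuda.can_device_access_peer(0, 1))
    print(torch.cuda.can_device_access_peer(1, 0))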

As described in https://github.com/NVIDIA/nccl/issues/631, you can try adding:

export NCCL_P2P_DISABLE=1

Or add this at the very top of the script:

os.environ["NCCL_P2P_DISABLE"]="1"