[Notes] Pitfalls of vLLM Multi-GPU Inference


Preface

For personal study and record-keeping only. This post mainly documents problems encountered when running vLLM inference on multiple GPUs.

Setup

vllm version: 0.5.1
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Problem 1

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-16GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.

Solution

See the official documentation:

dtype – The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the torch_dtype attribute specified in the model config file. However, if the torch_dtype in the config is float32, we will use float16 instead.

Therefore, add the following to the code:

    llm = LLM(
        ...
        dtype='float16',
        ...
    )
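
For reference, a minimal end-to-end call might look like the following sketch; the model path and prompt are placeholders, not part of the original setup:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/path/to/your/model",  # placeholder checkpoint path
        dtype="float16",              # bfloat16 needs compute capability >= 8.0; V100 is 7.0
    )
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)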

Problem 2

After fixing Problem 1, when the tensor_parallel_size argument of vllm.LLM is set to a value greater than 1 (i.e., multi-GPU), the program hangs and raises a RuntimeError:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

Solution

Following the replies in https://github.com/vllm-project/vllm/issues/6152, set:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Or add this in the code:

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
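
A minimal sketch of how this fits together with the multi-GPU call (the model path is a placeholder); the variable has to be set before the LLM engine spawns its worker processes:

    import os
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    from vllm import LLM

    llm = LLM(
        model="/path/to/your/model",  # placeholder
        dtype="float16",
        tensor_parallel_size=2,       # shard across the two V100s listed above
    )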

Problem 3

After fixing Problem 2, the program may still hang and raise a RuntimeError about the bootstrapping phase of a new process:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
ERROR 11-28 20:12:42 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3449547 died, exit code: 1
INFO 11-28 20:12:42 multiproc_worker_utils.py:123] Killing local vLLM worker processes

解决办法

按照 https://github.com/vllm-project/vllm/issues/5637 的解决办法,将spawn改成fork可能不会奏效,这是因为:

 It seems some tests will initialize cuda before launching vllm worker, which makes fork not possible.

In this case, a workaround to try is to set the following before any command that might touch CUDA:

VLLM_WORKER_MULTIPROC_METHOD=spawn

Or add this at the very top of the script:

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
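
Putting it together, a sketch of the overall script layout (paths are placeholders). Since the workers are started with spawn, the `if __name__ == '__main__':` guard mentioned in the error message above is needed as well:

    import os
    # Set before importing torch/vllm or anything else that may initialize CUDA.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    from vllm import LLM

    def main():
        llm = LLM(
            model="/path/to/your/model",  # placeholder
            dtype="float16",
            tensor_parallel_size=2,
        )
        for out in llm.generate(["Hello"]):
            print(out.outputs[0].text)

    if __name__ == "__main__":
        main()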

Problem 4

After fixing Problem 3 and re-running, the log gets stuck at:

(VllmWorkerProcess pid=166876) INFO 11-25 20:57:27 pynccl.py:63] vLLM is using nccl==2.20.5

Solution

Following https://docs.vllm.ai/en/stable/getting_started/debugging.html, turn on more verbose vLLM logging:

export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging.

export CUDA_LAUNCH_BLOCKING=1 to identify which CUDA kernel is causing the problem.

export NCCL_DEBUG=TRACE to turn on more logging for NCCL.

export VLLM_TRACE_FUNCTION=1 to record all function calls for inspection in the log files to tell which function crashes or hangs.

You can also set these in the code:

os.environ["VLLM_LOGGING_LEVEL"]="DEBUG"
os.environ["NCCL_DEBUG"]="TRACE"
os.environ["VLLM_TRACE_FUNCTION"]="1"

This should produce more detailed error information.

If there is no further error output and your symptoms match the description above exactly, the cause may be that P2P (peer-to-peer) communication between the GPUs is not working properly.
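
Before disabling P2P outright, a quick diagnostic sketch (not from the original post) is to ask the driver, via PyTorch, whether peer access is reported between the two cards; `nvidia-smi topo -m` also prints the link topology. Even when peer access is reported as available, transfers can still misbehave on some PCIe setups, which is what the workaround below addresses:

    import torch

    # True means GPU 0 can directly access GPU 1's memory (and vice versa).
    print(torch.cuda.can_device_access_peer(0, 1))
    print(torch.cuda.can_device_access_peer(1, 0))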

As described in https://github.com/NVIDIA/nccl/issues/631, you can try adding:

export NCCL_P2P_DISABLE=1

Or add this at the very top of the script:

os.environ["NCCL_P2P_DISABLE"]="1"