Lowering the --gpu-memory-utilization ratio (default 0.9) avoids this problem:
from vllm import LLM

model = LLM(
    args.model_name_or_path,
    trust_remote_code=True,
    tensor_parallel_size=num_gpus,
    max_model_len=2048,
    gpu_memory_utilization=0.8,  # lowered from the 0.9 default to leave headroom
)
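If you serve the model from the command line instead of the Python API, the same knob is exposed as a flag. A minimal sketch, assuming a recent vLLM with the `vllm serve` entrypoint; the model name and parallel size are placeholders, substitute your own:

```shell
# Launch the vLLM OpenAI-compatible server with a reduced memory fraction
vllm serve Qwen/Qwen2-7B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.8
```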