(vllm) zhzx@zhzx-S2600WF-LS:/media/zhzx/ssd2/Qwen3-32B$ vllm serve "/media/zhzx/ssd2/Qwen3-32B" --host 0.0.0.0 --port 8060 --dtype bfloat16 --tensor-parallel-size 2 --cpu-offload-gb 20 --gpu-memory-utilization 0.8 --max-model-len 8126 --api-key token-abc123 --enable-prefix-caching --trust-remote-code
INFO 05-10 20:06:33 [__init__.py:239] Automatically detected platform cuda.
INFO 05-10 20:06:36 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-10 20:06:36 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='/media/zhzx/ssd2/Qwen3-32B', config='', host='0.0.0.0', port=8060, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/zhzx/ssd2/Qwen3-32B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', max_model_len=8126, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.8, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', cpu_offload_gb=20.0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f625a071bc0>)
WARNING 05-10 20:06:36 [config.py:2972] Casting torch.float16 to torch.bfloat16.
INFO 05-10 20:06:42 [config.py:717] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 05-10 20:06:42 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 05-10 20:06:42 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-10 20:06:42 [api_server.py:246] Started engine process with PID 73230
INFO 05-10 20:06:45 [__init__.py:239] Automatically detected platform cuda.
INFO 05-10 20:06:47 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/media/zhzx/ssd2/Qwen3-32B', speculative_config=None, tokenizer='/media/zhzx/ssd2/Qwen3-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8126, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/media/zhzx/ssd2/Qwen3-32B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 05-10 20:06:48 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 05-10 20:06:48 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-10 20:06:48 [cuda.py:289] Using XFormers backend.
INFO 05-10 20:06:50 [__init__.py:239] Automatically detected platform cuda.
(VllmWorkerProcess pid=73270) INFO 05-10 20:06:53 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=73270) INFO 05-10 20:06:53 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=73270) INFO 05-10 20:06:53 [cuda.py:289] Using XFormers backend.
ERROR 05-10 20:06:53 [engine.py:448] Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Quadro RTX 6000 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
ERROR 05-10 20:06:53 [engine.py:448] Traceback (most recent call last):
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 05-10 20:06:53 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 05-10 20:06:53 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 05-10 20:06:53 [engine.py:448] return cls(
ERROR 05-10 20:06:53 [engine.py:448] ^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 05-10 20:06:53 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 05-10 20:06:53 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 275, in __init__
ERROR 05-10 20:06:53 [engine.py:448] self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 05-10 20:06:53 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 286, in __init__
ERROR 05-10 20:06:53 [engine.py:448] super().__init__(*args, **kwargs)
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-10 20:06:53 [engine.py:448] self._init_executor()
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 124, in _init_executor
ERROR 05-10 20:06:53 [engine.py:448] self._run_workers("init_device")
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 05-10 20:06:53 [engine.py:448] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 05-10 20:06:53 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-10 20:06:53 [engine.py:448] return func(*args, **kwargs)
ERROR 05-10 20:06:53 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 05-10 20:06:53 [engine.py:448] self.worker.init_device() # type: ignore
ERROR 05-10 20:06:53 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 177, in init_device
ERROR 05-10 20:06:53 [engine.py:448] _check_if_gpu_supports_dtype(self.model_config.dtype)
ERROR 05-10 20:06:53 [engine.py:448] File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 546, in _check_if_gpu_supports_dtype
ERROR 05-10 20:06:53 [engine.py:448] raise ValueError(
ERROR 05-10 20:06:53 [engine.py:448] ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Quadro RTX 6000 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
Process SpawnProcess-1:
ERROR 05-10 20:06:53 [multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 73270 died, exit code: -15
INFO 05-10 20:06:53 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
    raise e
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
    return cls(
           ^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 275, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 286, in __init__
    super().__init__(*args, **kwargs)
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 124, in _init_executor
    self._run_workers("init_device")
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 2456, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 177, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 546, in _check_if_gpu_supports_dtype
    raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Quadro RTX 6000 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
Traceback (most recent call last):
  File "/home/zhzx/miniconda3/envs/vllm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
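The root cause is the ValueError repeated throughout the log: bfloat16 requires a GPU with compute capability 8.0 or newer (Ampere and later), but the Quadro RTX 6000 is a Turing card with compute capability 7.5, so the engine process aborts while initializing the device. You can confirm the capability of your card with a one-liner (a minimal check via PyTorch's public API; device index 0 assumed):

python -c "import torch; print(torch.cuda.get_device_capability(0))"  # prints (7, 5) on a Quadro RTX 6000

The fix is the one the error message itself suggests: serve in float16 instead of bfloat16. A corrected invocation, keeping every other flag from the original command verbatim:

vllm serve "/media/zhzx/ssd2/Qwen3-32B" --host 0.0.0.0 --port 8060 --dtype half --tensor-parallel-size 2 --cpu-offload-gb 20 --gpu-memory-utilization 0.8 --max-model-len 8126 --api-key token-abc123 --enable-prefix-caching --trust-remote-code

Note the warning "Casting torch.float16 to torch.bfloat16" near the top of the log: the checkpoint on disk is already stored in float16, so serving it with --dtype half loses no precision on this hardware.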