Pre-deployment preparation
Download the vLLM Docker image
vLLM provides an official Docker image for deployment. It runs an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
# https://hub.docker.com/r/vllm/vllm-openai
docker pull vllm/vllm-openai
# The test machine in mainland China cannot pull this directly (several domestic registry mirrors were tried without success). Pull the image on an overseas machine, push it to your own domestic registry, then pull it from there (see the re-tag/push sketch after this block).
# Pull from our own domestic registry
docker pull swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai:latest
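The re-tag-and-push step mentioned in the comment above might look roughly like this; the registry address and namespace are placeholders for your own repository:
# On the overseas machine: re-tag the pulled image and push it to your own registry
docker tag vllm/vllm-openai:latest <your-registry>/<your-namespace>/vllm-openai:latest
docker login <your-registry>
docker push <your-registry>/<your-namespace>/vllm-openai:latest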
Download the models
# Overseas machine
huggingface-cli login --token ****  # after logging in to HF, the token can be found under Settings -> Access Tokens (top right)
huggingface-cli download --resume-download meta-llama/Llama-3.1-70B-Instruct --local-dir /data/meta-llama/Llama-3.1-70B-Instruct
# Domestic test machine
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
apt-get update
apt-get install -y aria2
aria2c --version
apt-get install git-lfs
./hfd.sh meta-llama/Llama-3.2-11B-Vision-Instruct --hf_username Dong-Hua --hf_token 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 aria2c -x 4
./hfd.sh hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 --hf_username bifeng --hf_token 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 aria2c -x 4
# Resume an interrupted download
aria2c --header='Authorization: Bearer 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' --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c 'https://hf-mirror.com/hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/resolve/main/model-00004-of-00009.safetensors' -d '.' -o 'model-00004-of-00009.safetensors'
Qwen2.5-72B-Instruct-GPTQ-Int4
Qwen's latest 72B model, strong at Chinese; this build is GPTQ INT4-quantized because the half-precision weights are too large to deploy on a single GPU.
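A rough back-of-the-envelope check of why the INT4 quantization is needed; the ~72.7B parameter count is taken from the published Qwen2.5-72B model card, and real GPTQ checkpoints keep some tensors in FP16, so the actual footprint is slightly higher:
PARAMS=72700000000
echo "FP16 weights: ~$(( PARAMS * 2 / 1000000000 )) GB"   # ~145 GB, far beyond a single 80 GB GPU
echo "INT4 weights: ~$(( PARAMS / 2 / 1000000000 )) GB"   # ~36 GB; the startup log below reports 38.5492 GB actually loaded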
Launch command
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name qwen_llm_3 \
--ipc=host \
swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
--model /data/Qwen2.5-72B-Instruct-GPTQ-Int4 \
--max-model-len 102400
# docker inspect qwen_llm_3
# python3 -m vllm.entrypoints.openai.api_server --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
API docs: http://localhost:8001/docs
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/Qwen2.5-72B-Instruct-GPTQ-Int4",
"messages": [
{"role": "system", "content": "你是一个喜剧人"},
{"role": "user", "content": "给我讲个短笑话"}
],
"max_tokens": 1024,
"stop": "<|eot_id|>",
"temperature": 0.7,
"top_p": 1,
"top_k": -1
}'
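Besides the docs page, the OpenAI-compatible server also exposes a /v1/models endpoint; it is a quick way to confirm the service is up and to see the exact model name that requests must use:
curl http://localhost:8001/v1/models
# The "id" field in the response (here the --model path, /data/Qwen2.5-72B-Instruct-GPTQ-Int4) is what goes into the "model" field of chat requests.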
Issues encountered
- Setting a 128k context for Qwen: https://qwen.readthedocs.io/en/latest/deployment/vllm.html#extended-context-support
Note that passing --max-model-len 131072 at launch is not enough by itself; the following must also be added to the model's config file (a sketch for patching config.json follows the snippet below):
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
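A minimal sketch for patching the model's config.json with jq (assuming jq is installed on the host and the model sits under the directory mounted into the container above; back up the file first):
cd /data1/data_vllm/Qwen2.5-72B-Instruct-GPTQ-Int4
cp config.json config.json.bak
jq '.rope_scaling = {"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}' config.json.bak > config.json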
- After enabling 128k, launching with the following command reports insufficient KV-cache space
# Launch arguments
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name qwen_llm_3 \
--ipc=host \
swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
--model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
As the arguments above show, GPU memory utilization (--gpu-memory-utilization) defaults to 90%, and when --max-model-len is not set it defaults to the 128k read from the model config. Per the KV-cache memory formula (see the vLLM performance benchmark article in this series), when the KV cache is too small you can raise GPU memory utilization, lower the maximum context length, quantize the KV cache, or add more GPUs. For example:
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name qwen_llm_3 \
--ipc=host \
swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
--model /data/Qwen2.5-72B-Instruct-GPTQ-Int4 \
--max-model-len 102400 \
--kv-cache-dtype fp8_e4m3 \
--gpu-memory-utilization 0.95
# --max-model-len 102400        : cap the maximum context length at 102400 tokens
# --kv-cache-dtype fp8_e4m3     : quantize the KV cache to FP8
# --gpu-memory-utilization 0.95 : raise GPU memory utilization to 95%
PS: The test environment is a single GPU with 80 GB of memory, and the Qwen2.5-72B-Instruct-GPTQ-Int4 weights alone already take 38.5492 GB.
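A rough sanity check of the numbers in the error below, using the usual per-token KV-cache estimate of 2 × layers × kv_heads × head_dim × bytes; the 80 layers, 8 KV heads and head_dim 128 are assumptions taken from the published Qwen2.5-72B config, with an FP16 KV cache:
KV_BYTES_PER_TOKEN=$(( 2 * 80 * 8 * 128 * 2 ))   # K and V * 80 layers * 8 KV heads * head_dim 128 * 2 bytes (FP16) = 327680
KV_BUDGET_GIB=33                                 # ~80 GiB * 0.9 minus ~38.5 GiB of weights, ignoring activation overhead
echo "approx. max KV tokens: $(( KV_BUDGET_GIB * 1024 * 1024 * 1024 / KV_BYTES_PER_TOKEN ))"
# Prints roughly 108k tokens; vLLM's own profiling arrives at 103136, so a 131072-token context cannot fit.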
Log:
(base) [root@iv-ycl6gxrcwka8j6ujk4bc data_vllm]# docker run --runtime nvidia --gpus all \
> -v /data1/data_vllm:/data \
> -p 8001:8000 \
> --name qwen_llm_3 \
> --ipc=host \
> swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
> --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
INFO 10-11 07:16:12 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-11 07:16:12 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 10-11 07:16:12 gptq_marlin.py:87] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
WARNING 10-11 07:16:12 arg_utils.py:762] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-11 07:16:12 config.py:806] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-11 07:16:12 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/Qwen2.5-72B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 10-11 07:16:13 model_runner.py:680] Starting to load model /data/Qwen2.5-72B-Instruct-GPTQ-Int4...
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:01<00:12, 1.21s/it]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:02<00:11, 1.29s/it]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:03<00:10, 1.35s/it]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:05<00:09, 1.40s/it]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:06<00:08, 1.38s/it]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:08<00:06, 1.36s/it]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:09<00:05, 1.36s/it]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:10<00:04, 1.35s/it]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:12<00:02, 1.37s/it]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:13<00:01, 1.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:14<00:00, 1.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:14<00:00, 1.31s/it]
INFO 10-11 07:16:28 model_runner.py:692] Loading model weights took 38.5492 GB
INFO 10-11 07:16:29 gpu_executor.py:102] # GPU blocks: 6446, # CPU blocks: 819
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]: run_server(args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in initialize_cache
[rank0]: self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 372, in raise_if_cache_size_invalid
[rank0]: raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (103136). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Llama-3.2-11B-Vision-Instruct
Llama 3.2's latest vision-capable model; audio input is not supported.
Launch command
Engine arguments: https://docs.vllm.ai/en/stable/models/engine_args.html
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name llama_audio \
--ipc=host \
crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
--model /data/Llama-3.2-11B-Vision-Instruct \
--max_num_seqs 16 \
--enforce-eager
API docs: http://localhost:8001/docs
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/Qwen2.5-72B-Instruct-GPTQ-Int4",
"messages": [
{"role": "system", "content": "你是一个喜剧人"},
{"role": "user", "content": "给我讲个短笑话"}
],
"max_tokens": 1024,
"stop": "<|eot_id|>",
"temperature": 0.7,
"top_p": 1,
"top_k": -1
}'
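The text-only request above only exercises the language side. To test the vision side, the OpenAI-compatible server accepts multimodal content parts; a minimal sketch (the image URL is a placeholder and must be reachable from inside the container):
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/Llama-3.2-11B-Vision-Instruct",
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
{"type": "text", "text": "描述一下这张图片"}
]}
],
"max_tokens": 256
}'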
Issues encountered
- The Transformers version is too old and needs to be upgraded to the latest (re-pulling the latest vLLM image is enough; see the version-check sketch after the log below)
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name llama_audio \
--ipc=host \
swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
--model /data/Llama-3.2-11B-Vision-Instruct
Log:
INFO 10-11 07:52:36 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-11 07:52:36 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 989, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 691, in __getitem__
raise KeyError(key)
KeyError: 'mllama'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
run_server(args)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
if llm_engine is not None else AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
model_config = ModelConfig(
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 152, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 59, in get_config
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 44, in get_config
config = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 991, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `mllama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
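A quick way to check whether the Transformers packaged in a given vLLM image already recognizes the mllama architecture before launching; a sketch using the newer Alibaba Cloud registry image from later in this section:
docker run --rm --entrypoint python3 \
crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
-c "import transformers; from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES; print(transformers.__version__, 'mllama' in CONFIG_MAPPING_NAMES)"
# Prints the Transformers version and True/False for mllama; False means the image is too old for Llama 3.2 Vision.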
- OOM error
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name llama_audio \
--ipc=host \
crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
--model /data/Llama-3.2-11B-Vision-Instruct
KV-cache accounting for a vision model differs from a plain language model. Try lowering the batch size (--max_num_seqs, default 256, reduced here to 16); you can also tune GPU memory utilization (or add more GPUs):
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name llama_audio \
--ipc=host \
crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
--model /data/Llama-3.2-11B-Vision-Instruct \
--max_num_seqs 16
Log:
INFO 10-11 00:54:54 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 00:54:54 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 00:54:54 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/6dd3d033-6ec0-4c32-aefe-4665201f0154 for IPC Path.
INFO 10-11 00:54:54 api_server.py:177] Started engine process with PID 26
WARNING 10-11 00:54:54 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-11 00:54:58 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 10-11 00:54:58 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/data/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 00:54:59 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-11 00:54:59 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-11 00:54:59 model_runner.py:1014] Starting to load model /data/Llama-3.2-11B-Vision-Instruct...
INFO 10-11 00:55:00 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:06, 1.72s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:03<00:05, 1.79s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:05<00:03, 1.81s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:07<00:01, 1.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.58s/it]
INFO 10-11 00:55:08 model_runner.py:1025] Loading model weights took 19.9073 GB
INFO 10-11 00:55:08 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward
cross_attention_states = self.vision_model(pixel_values,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 556, in forward
output = self.transformer(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 430, in forward
hidden_states = encoder_layer(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 398, in forward
hidden_state = self.mlp(hidden_state)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/clip.py", line 278, in forward
hidden_states, _ = self.fc1(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 367, in forward
output_parallel = self.quant_method.apply(self, input_, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply
return F.linear(x, layer.weight, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.70 GiB. GPU 0 has a total capacity of 79.35 GiB of which 334.94 MiB is free. Process 2468 has 79.02 GiB memory in use. Of the allocated memory 62.55 GiB is allocated by PyTorch, and 15.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
- This vision model requires eager-mode PyTorch
--enforce-eager:
- Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.
Adding --enforce-eager lets it start normally:
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name llama_audio \
--ipc=host \
crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
--model /data/Llama-3.2-11B-Vision-Instruct \
--max_num_seqs 16 \
--enforce-eager
Log:
INFO 10-11 01:01:19 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 01:01:19 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=16, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 01:01:19 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/2a5bad77-628d-4503-884e-cd690e78f044 for IPC Path.
INFO 10-11 01:01:19 api_server.py:177] Started engine process with PID 26
WARNING 10-11 01:01:19 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-11 01:01:22 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 10-11 01:01:22 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/data/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 01:01:23 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-11 01:01:23 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-11 01:01:24 model_runner.py:1014] Starting to load model /data/Llama-3.2-11B-Vision-Instruct...
INFO 10-11 01:01:24 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:06, 1.72s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:03<00:05, 1.79s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:05<00:03, 1.80s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:07<00:01, 1.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.58s/it]
INFO 10-11 01:01:32 model_runner.py:1025] Loading model weights took 19.9073 GB
INFO 10-11 01:01:32 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
INFO 10-11 01:01:48 gpu_executor.py:122] # GPU blocks: 10025, # CPU blocks: 1638
INFO 10-11 01:01:50 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-11 01:01:50 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1724, in capture
output_hidden_or_intermediate_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1078, in forward
skip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 266, in initialize_cache
self._warm_up_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 282, in _warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1448, in capture_model
graph_runner.capture(**capture_inputs)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in capture
with torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 185, in __exit__
self.cuda_graph.capture_end()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 83, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
Qwen2-Audio-7B-Instruct
Qwen's audio model.
Launch command
Engine arguments: https://docs.vllm.ai/en/stable/models/engine_args.html
docker run --runtime nvidia --gpus all \
-v /data1/data_vllm:/data \
-p 8001:8000 \
--name qwen_audio \
--ipc=host \
crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
--model /data/Qwen2-Audio-7B-Instruct
Issues encountered
- 'Qwen2AudioConfig' object has no attribute 'hidden_size'
vLLM does not support this model yet: https://github.com/vllm-project/vllm/issues/8394
As of 2024/10/14, the qwen2-audio support branch is still being merged (under review): https://github.com/vllm-project/vllm/pull/9248
Log:
INFO 10-11 01:20:25 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 01:20:25 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Qwen2-Audio-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 01:20:25 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/cf23126e-3f74-4d96-be11-35016eaa9ef4 for IPC Path.
INFO 10-11 01:20:25 api_server.py:177] Started engine process with PID 26
INFO 10-11 01:20:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Qwen2-Audio-7B-Instruct', speculative_config=None, tokenizer='/data/Qwen2-Audio-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Qwen2-Audio-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 01:20:30 model_runner.py:1014] Starting to load model /data/Qwen2-Audio-7B-Instruct...
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 325, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1016, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 399, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 176, in _initialize_model
return build_model(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 161, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen.py", line 876, in __init__
self.transformer = QWenModel(config, cache_config, quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen.py", line 564, in __init__
config.hidden_size,
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 202, in __getattribute__
return super().__getattribute__(key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Qwen2AudioConfig' object has no attribute 'hidden_size'
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
Articles in this series:
1. LLM inference framework selection survey
2. TensorRT-LLM & Triton Server deployment walkthrough
3. vLLM LLM inference engine survey
4. vLLM inference engine performance analysis and benchmarking
5. vLLM LLM deployment issue log
6. Triton Inference Server architecture