【AI】Deploying Qwen3-8B-FP8 with the vLLM API server on Ubuntu 22.04 (RTX 4060 Ti 16GB)

Download the model

# Very important: keep modelscope up to date, otherwise incompatibility errors are likely
pip install modelscope -U
cd /data/ai/models
modelscope download --model Qwen/Qwen3-8B-FP8 --local_dir ./Qwen3-8B-FP8
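
After the download finishes, it is worth confirming that the safetensors shards and config files actually landed in the target directory (a quick sanity check; the exact file list depends on the model release):

ls -lh /data/ai/models/Qwen3-8B-FP8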

Install vLLM

Create a virtual environment

mkdir vllm
cd vllm/
python -m venv venv
source venv/bin/activate
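
Once the environment is activated the shell prompt gains a (venv) prefix; an optional check that the interpreter and pip now resolve inside the venv:

which python && python -V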

Install vLLM

# Install the vLLM framework and ModelScope
pip install modelscope vllm -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com

# Build prerequisites for the FlashAttention optimization module
# Install system-level build tools
sudo apt-get install build-essential python3-dev
# Install Python build tools
pip install setuptools wheel ninja -i https://mirrors.aliyun.com/pypi/simple/
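
Note that the startup log below shows vLLM using its own Flash Attention backend, so building the standalone package is optional. If you do want to compile flash-attn on top of the tools above, a commonly used invocation is the following (hedged example; compilation can take a long time on this class of hardware):

pip install flash-attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com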

# Update the Transformers library
pip install --upgrade transformers -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
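
A quick way to confirm which versions ended up in the environment (the startup log below was produced with vLLM 0.8.5.post1):

python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"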

Start the vLLM OpenAI-compatible API server

vllm serve /data/ai/models/Qwen3-8B-FP8 \
--served-model-name Qwen3-8B-FP8 \
--port 8000 \
--dtype auto \
--gpu-memory-utilization 0.8 \
--max-model-len 4096 \
--tensor-parallel-size 1
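
Once the log reports "Application startup complete.", a minimal liveness check against the /health route (listed in the route table below) looks like this:

curl -i http://localhost:8000/health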

Startup log

(venv) yeqiang@yeqiang-Default-string:/data/ai/vllm$ vllm serve /data/ai/models/Qwen3-8B-FP8 --served-model-name Qwen3-8B-FP8 --port 8000 --dtype auto --gpu-memory-utilization 0.8 --max-model-len 4096 --tensor-parallel-size 1
INFO 05-06 20:48:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-06 20:48:14 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-06 20:48:14 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='/data/ai/models/Qwen3-8B-FP8', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/ai/models/Qwen3-8B-FP8', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.8, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['Qwen3-8B-FP8'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7f2d48275000>)
INFO 05-06 20:48:18 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
INFO 05-06 20:48:18 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 05-06 20:48:18 [fp8.py:63] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 05-06 20:48:20 [__init__.py:239] Automatically detected platform cuda.
INFO 05-06 20:48:22 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='/data/ai/models/Qwen3-8B-FP8', speculative_config=None, tokenizer='/data/ai/models/Qwen3-8B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen3-8B-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-06 20:48:22 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fdf82584790>
INFO 05-06 20:48:22 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-06 20:48:22 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-06 20:48:22 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-06 20:48:22 [gpu_model_runner.py:1329] Starting to load model /data/ai/models/Qwen3-8B-FP8...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.98it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.75it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.78it/s]

INFO 05-06 20:48:23 [loader.py:458] Loading weights took 1.18 seconds
WARNING 05-06 20:48:23 [kv_cache.py:128] Using Q scale 1.0 and prob scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure Q/prob scaling factors are available in the fp8 checkpoint.
INFO 05-06 20:48:23 [gpu_model_runner.py:1347] Model loading took 8.8011 GiB and 1.314728 seconds
INFO 05-06 20:48:30 [backends.py:420] Using cache directory: /home/yeqiang/.cache/vllm/torch_compile_cache/075128b044/rank_0_0 for vLLM's torch.compile
INFO 05-06 20:48:30 [backends.py:430] Dynamo bytecode transform time: 6.33 s
INFO 05-06 20:48:33 [backends.py:136] Cache the graph of shape None for later use
INFO 05-06 20:48:53 [backends.py:148] Compiling a graph for general shape takes 22.83 s
WARNING 05-06 20:48:54 [fp8_utils.py:431] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/ai/vllm/venv/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=4096,device_name=NVIDIA_GeForce_RTX_4060_Ti,dtype=fp8_w8a8,block_shape=[128,128].json
WARNING 05-06 20:48:56 [fp8_utils.py:431] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/ai/vllm/venv/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=4096,K=4096,device_name=NVIDIA_GeForce_RTX_4060_Ti,dtype=fp8_w8a8,block_shape=[128,128].json
WARNING 05-06 20:48:56 [fp8_utils.py:431] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/ai/vllm/venv/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=24576,K=4096,device_name=NVIDIA_GeForce_RTX_4060_Ti,dtype=fp8_w8a8,block_shape=[128,128].json
WARNING 05-06 20:48:56 [fp8_utils.py:431] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/ai/vllm/venv/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=4096,K=12288,device_name=NVIDIA_GeForce_RTX_4060_Ti,dtype=fp8_w8a8,block_shape=[128,128].json
INFO 05-06 20:49:28 [monitor.py:33] torch.compile takes 29.15 s in total
INFO 05-06 20:49:29 [kv_cache_utils.py:634] GPU KV cache size: 11,184 tokens
INFO 05-06 20:49:29 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 2.73x
INFO 05-06 20:49:54 [gpu_model_runner.py:1686] Graph capturing finished in 25 secs, took 2.61 GiB
INFO 05-06 20:49:54 [core.py:159] init engine (profile, create kv cache, warmup model) took 90.51 seconds
INFO 05-06 20:49:54 [core_client.py:439] Core engine process 0 ready.
WARNING 05-06 20:49:54 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-06 20:49:54 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-06 20:49:54 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-06 20:49:54 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-06 20:49:54 [launcher.py:28] Available routes are:
INFO 05-06 20:49:54 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-06 20:49:54 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-06 20:49:54 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-06 20:49:54 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-06 20:49:54 [launcher.py:36] Route: /health, Methods: GET
INFO 05-06 20:49:54 [launcher.py:36] Route: /load, Methods: GET
INFO 05-06 20:49:54 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-06 20:49:54 [launcher.py:36] Route: /version, Methods: GET
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /score, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-06 20:49:54 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [201874]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Verify basic service status

yeqiang@yeqiang-Default-string:/data/ai/vllm$ curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen3-8B-FP8","object":"model","created":1746535967,"owned_by":"vllm","root":"/data/ai/models/Qwen3-8B-FP8","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-9c2faa75985d4efabc3ddf63942c3f04","object":"model_permission","created":1746535967,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

GPU status
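
To watch GPU utilization and VRAM usage while the server is running (with --gpu-memory-utilization 0.8, the fp8 weights plus KV cache should occupy roughly 80% of the 16GB card), a simple way is:

watch -n 1 nvidia-smi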
