SGLang-related parameters

Latest recommended article published on 2025-04-08 11:10:02

强化学习曾小健

Article tags: robotics

Article link: https://blog.csdn.net/sinat_37574187/article/details/141691277

SGLANG_USE_MODELSCOPE=true python -m sglang.bench_latency --model-path /data/hub/LLM-Research/Meta-Llama-3___1-70B-Instruct-AWQ-INT4 --batch 128 --input-len 256 --output-len 32 --tp 8 --mem-fraction-static 0.9 --dtype float16 --context-length 512 --enable-p2p-check --quantization Marlin
usage: bench_latency.py [-h] [--model-path MODEL_PATH] [--tokenizer-path TOKENIZER_PATH]
[--host HOST] [--port PORT] [--additional-ports [ADDITIONAL_PORTS ...]]
[--tokenizer-mode {auto,slow}] [--skip-tokenizer-init]
[--load-format {auto,pt,safetensors,npcache,dummy}]
[--dtype {auto,half,float16,bfloat16,float,float32}]
[--kv-cache-dtype {auto,fp8_e5m2}] [--trust-remote-code] [--is-embedding]
[--context-length CONTEXT_LENGTH]
[--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,squeezellm,bitsandbytes}]
[--served-model-name SERVED_MODEL_NAME] [--chat-template CHAT_TEMPLATE]
[--mem-fraction-static MEM_FRACTION_STATIC]
[--max-running-requests MAX_RUNNING_REQUESTS] [--max-num-reqs MAX_NUM_REQS]
[--max-total-tokens MAX_TOTAL_TOKENS]
[--chunked-prefill-size CHUNKED_PREFILL_SIZE]
[--max-prefill-tokens MAX_PREFILL_TOKENS]
[--schedule-policy {lpm,random,fcfs,dfs-weight}]
[--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
[--tensor-parallel-size TENSOR_PARALLEL_SIZE]
[--stream-interval STREAM_INTERVAL] [--random-seed RANDOM_SEED]
[--log-level LOG_LEVEL] [--log-level-http LOG_LEVEL_HTTP] [--log-requests]
[--show-time-cost] [--api-key API_KEY] [--file-storage-pth FILE_STORAGE_PTH]
[--data-parallel-size DATA_PARALLEL_SIZE]
[--load-balance-method {round_robin,shortest_queue}]
[--nccl-init-addr NCCL_INIT_ADDR] [--nnodes NNODES] [--node-rank NODE_RANK]
[--disable-flashinfer] [--disable-flashinfer-sampling]
[--disable-radix-cache] [--disable-regex-jump-forward]
[--disable-cuda-graph] [--disable-cuda-graph-padding] [--disable-disk-cache]
[--disable-custom-all-reduce] [--enable-mixed-chunk]
[--enable-torch-compile] [--enable-p2p-check] [--enable-mla]
[--triton-attention-reduce-in-fp32] [--efficient-weight-load]
[--run-name RUN_NAME] [--batch-size BATCH_SIZE [BATCH_SIZE ...]]
[--input-len INPUT_LEN [INPUT_LEN ...]]
[--output-len OUTPUT_LEN [OUTPUT_LEN ...]]
[--result-filename RESULT_FILENAME] [--correctness-test] [--cut-len CUT_LEN]
[--graph-sql GRAPH_SQL] [--graph-filename GRAPH_FILENAME]
bench_latency.py: error: argument --quantization: invalid choice: 'Marlin' (choose from 'awq', 'fp8', 'gptq', 'marlin', 'gptq_marlin', 'awq_marlin', 'squeezellm', 'bitsandbytes')
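
The error on the last line comes from the flag value, not from the model or the hardware: every accepted value in the list is lowercase, and argparse compares `choices` case-sensitively, so `Marlin` is rejected where `marlin` would pass. Rerunning the same command with `--quantization marlin` should get past argument parsing. Because the checkpoint path ends in AWQ-INT4, `awq` or `awq_marlin` from the same list may be a better fit, but that is an assumption about the checkpoint, not something the output above confirms. The sketch below reproduces the behaviour with a stand-in parser. `build_parser` and `accepts` are hypothetical helpers written for illustration and are not part of SGLang. It also shows how `type=str.lower` could make such a flag case-insensitive.

```python
import argparse

# Accepted values copied from the error message above.
QUANT_CHOICES = [
    "awq", "fp8", "gptq", "marlin", "gptq_marlin",
    "awq_marlin", "squeezellm", "bitsandbytes",
]


def build_parser(normalize_case: bool) -> argparse.ArgumentParser:
    """Stand-in parser (hypothetical, not SGLang's real argument definitions)."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "--quantization",
        choices=QUANT_CHOICES,
        # argparse applies `type` before checking `choices`, so lowercasing
        # here lets "Marlin" match the lowercase entry "marlin".
        type=str.lower if normalize_case else str,
    )
    return parser


def accepts(parser: argparse.ArgumentParser, value: str) -> bool:
    """Return True if the parser accepts `value` for --quantization."""
    try:
        parser.parse_args(["--quantization", value])
        return True
    except SystemExit:
        # argparse prints the usage text and exits with status 2 on an invalid choice.
        return False


if __name__ == "__main__":
    strict = build_parser(normalize_case=False)
    lenient = build_parser(normalize_case=True)
    print("strict, 'marlin':", accepts(strict, "marlin"))
    print("strict, 'Marlin':", accepts(strict, "Marlin"))
    print("lenient, 'Marlin':", accepts(lenient, "Marlin"))
```

With the strict parser, the capitalised value triggers the same kind of "invalid choice" message seen above. The lenient variant only illustrates one way a CLI could tolerate mixed case; with the tool as shown, the value has to be typed in lowercase.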