Benchmarking LLM API Performance with llmperf
llmperf is a tool for evaluating the performance of LLM APIs.
Official repository: https://github.com/ray-project/llmperf
1. Installation and Setup
The scripts require a Python 3 environment, so install Python 3 on the client machine before testing; this article uses Python 3.8.
# Create a Python virtual environment
python -m venv llmperf-env
source llmperf-env/bin/activate
# Install dependencies
pip install torch tensorflow
# Unset any proxy settings
unset http_proxy
unset https_proxy
# Install the tool
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .
Download the tokenizer from Hugging Face and place the downloaded files under the llmperf directory in the layout shown below:
https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main
$ ls
analyze-token-benchmark-results.ipynb LICENSE.txt NOTICE.txt pyproject.toml requirements-dev.txt src
hf-internal-testing llm_correctness.py pre-commit.sh README.md result_outputs token_benchmark_ray.py
$ tree hf-internal-testing/
hf-internal-testing/
└── llama-tokenizer
├── gitattributes
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
# Modify token_benchmark_ray.py so that the tokenizer is loaded from the local path:
62     tokenizer = LlamaTokenizerFast.from_pretrained(
63         "./hf-internal-testing/llama-tokenizer"
64     )
2. Running the Tests
llmperf supports both load testing and correctness testing.
2.1 LLM API Load Testing
llmperf is compatible with different types of LLM APIs, including 'openai', 'anthropic', 'litellm', and others.
OpenAI Compatible APIs:
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1"
python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
Test case: taking a locally deployed model as an example, we benchmark the baichuan2-13B-Chat model.
# Set the LLM API key and URL; this test uses the OpenAI-compatible API type.
export OPENAI_API_KEY=EMPTY
export OPENAI_API_BASE=http://10.210.18.41:32576/v1
# Run the script, setting the model, the token-related parameters, the number of concurrent requests, and so on.
python token_benchmark_ray.py \
--model "baichuan2-13B-Chat" \
--mean-input-tokens 100 \
--stddev-input-tokens 50 \
--mean-output-tokens 500 \
--stddev-output-tokens 100 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
2024-04-26 17:19:19,470 INFO worker.py:1749 -- Started a local Ray instance.
100%|████████████████████████████████████| 2/2 [00:11<00:00, 5.67s/it]
Results for token benchmark for baichuan2-13B-Chat queried with the openai api.
inter_token_latency_s
p25 = 0.08192225433145225
p50 = 0.08256251340393315
p75 = 0.08320277247641406
p90 = 0.08358692791990259
p95 = 0.08371497973439876
p99 = 0.08381742118599571
mean = 0.08256251340393315
min = 0.08128199525897137
max = 0.08384303154889494
stddev = 0.001810926127469796
ttft_s
p25 = 0.1733547967596678
p50 = 0.18000664250575937
p75 = 0.18665848825185094
p90 = 0.19064959569950587
p95 = 0.1919799648487242
p99 = 0.19304426016809884
mean = 0.18000664250575937
min = 0.16670295101357624
max = 0.1933103339979425
stddev = 0.018814260937872945
end_to_end_latency_s
p25 = 3.06721269452828
p50 = 5.121514831029344
p75 = 7.175816967530409
p90 = 8.408398249431048
p95 = 8.81925867673126
p99 = 9.147947018571431
mean = 5.121514831029344
min = 1.0129105580272153
max = 9.230119104031473
stddev = 5.810443885303661
request_output_throughput_token_per_s
p25 = 9.643341004971674
p50 = 10.401396010428488
p75 = 11.159451015885304
p90 = 11.614284019159394
p95 = 11.765895020250756
p99 = 11.887183821123847
mean = 10.401396010428488
min = 8.88528599951486
max = 11.917506021342119
stddev = 2.1441033394836757
number_input_tokens
p25 = 99.5
p50 = 103.0
p75 = 106.5
p90 = 108.6
p95 = 109.3
p99 = 109.86
mean = 103.0
min = 96
max = 110
stddev = 9.899494936611665
number_output_tokens
p25 = 34.25
p50 = 59.5
p75 = 84.75
p90 = 99.9
p95 = 104.94999999999999
p99 = 108.99
mean = 59.5
min = 9
max = 110
stddev = 71.4177848998413
Number Of Errored Requests: 0
Overall Output Throughput: 10.486871057088438
Number Of Completed Requests: 2
Completed Requests Per Minute: 10.574996023954728
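As a quick sanity check on these numbers, the per-request output throughput is simply the number of output tokens divided by the end-to-end latency. With only two completed requests, the min/max rows above are the two requests themselves, so the relationship can be verified directly (values copied from the output above):
# One request returned ~9 tokens in ~1.01 s, the other ~110 tokens in ~9.23 s.
requests = [(9, 1.0129105580272153), (110, 9.230119104031473)]
for tokens, latency in requests:
    # Prints ~8.89 and ~11.92 tokens/s, matching request_output_throughput_token_per_s min/max.
    print(tokens / latency)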
# Results written to the result_outputs directory
ls -alh baichuan2-13B-Chat_100_500_*
-rw-r--r-- 1 root root 1.9K Apr 26 18:00 baichuan2-13B-Chat_100_500_individual_responses.json
-rw-r--r-- 1 root root 4.4K Apr 26 18:00 baichuan2-13B-Chat_100_500_summary.json
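The *_summary.json file contains the aggregated metrics printed above, while *_individual_responses.json records each request separately. A small sketch for inspecting the summary (file name taken from the run above):
import json

# Dump every aggregated metric recorded by the benchmark run.
with open("result_outputs/baichuan2-13B-Chat_100_500_summary.json") as f:
    summary = json.load(f)
for key, value in sorted(summary.items()):
    print(f"{key}: {value}")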
Explanation of the test script parameters:
python token_benchmark_ray.py --help
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2024-04-26 17:58:32,799 INFO worker.py:1749 -- Started a local Ray instance.
usage: token_benchmark_ray.py [-h] --model MODEL [--mean-input-tokens MEAN_INPUT_TOKENS] [--stddev-input-tokens STDDEV_INPUT_TOKENS] [--mean-output-tokens MEAN_OUTPUT_TOKENS]
[--stddev-output-tokens STDDEV_OUTPUT_TOKENS] [--num-concurrent-requests NUM_CONCURRENT_REQUESTS] [--timeout TIMEOUT]
[--max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS] [--additional-sampling-params ADDITIONAL_SAMPLING_PARAMS] [--results-dir RESULTS_DIR] [--llm-api LLM_API]
[--metadata METADATA]
Run a token throughput and latency benchmark.
options:
-h, --help show this help message and exit
--model MODEL The model to use for this load test.
--mean-input-tokens MEAN_INPUT_TOKENS
The mean number of tokens to send in the prompt for the request. (default: 550) # Mean number of prompt tokens per request; default 550. (See the sketch after this listing for how the per-request count is drawn.)
--stddev-input-tokens STDDEV_INPUT_TOKENS
The standard deviation of number of tokens to send in the prompt for the request. (default: 150) # Standard deviation of the prompt token count per request; default 150.
--mean-output-tokens MEAN_OUTPUT_TOKENS
The mean number of tokens to generate from each llm request. This is the max_tokens param for the completions API. Note that this is not always the number of tokens returned.
(default: 150) # Mean number of tokens to generate per LLM request; this becomes the max_tokens parameter of the completions API and is not always the number of tokens actually returned; default 150.
--stddev-output-tokens STDDEV_OUTPUT_TOKENS
The stdandard deviation on the number of tokens to generate per llm request. (default: 80) # Standard deviation of the number of tokens generated per request; default 80.
--num-concurrent-requests NUM_CONCURRENT_REQUESTS
The number of concurrent requests to send (default: 10) # Number of concurrent requests to send; default 10.
--timeout TIMEOUT The amount of time to run the load test for. (default: 90) # Duration of the load test; default 90 seconds.
--max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS
The number of requests to complete before finishing the test. Note that its possible for the test to timeout first. (default: 10) # Number of requests to complete before the test ends; if the timeout is hit first, the test stops early.
--additional-sampling-params ADDITIONAL_SAMPLING_PARAMS
Additional sampling params to send with the each request to the LLM API. (default: {}) No additional sampling params are sent.
--results-dir RESULTS_DIR
The directory to save the results to. (`default: `) No results are saved) # Directory for the result output
--llm-api LLM_API The name of the llm api to use. Can select from ['openai', 'anthropic', 'litellm'] (default: openai) # LLM API type
--metadata METADATA A comma separated list of metadata to include in the results, e.g. name=foo,bar=1. These will be added to the metadata field of the results.
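Conceptually, the mean/stddev token options control a per-request draw: each request's prompt length and requested output length are sampled around the configured mean with the configured spread, which is why the run above (mean 100, stddev 50 for input) produced prompts of 96 and 110 tokens. A rough sketch of the idea, not llmperf's exact code:
import random

def sample_token_count(mean: int, stddev: int) -> int:
    # Draw a positive token count around the mean (illustrative sketch only).
    while True:
        n = int(random.gauss(mean, stddev))
        if n > 0:
            return n

print([sample_token_count(100, 50) for _ in range(5)])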
2.2 LLM API Correctness Testing
OpenAI Compatible APIs:
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://console.endpoints.anyscale.com/m/v1
python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"
Practical example: again taking the locally deployed baichuan2-13B-Chat model.
export OPENAI_API_KEY=EMPTY
export OPENAI_API_BASE=http://10.210.18.41:32576/v1
python llm_correctness.py \
> --model "baichuan2-13B-Chat" \
> --max-num-completed-requests 150 \
> --timeout 600 \
> --num-concurrent-requests 10 \
> --results-dir "result_outputs"
2024-04-26 18:24:10,240 INFO worker.py:1749 -- Started a local Ray instance.
0%| | 0/150 [00:00<?, ?it/s](raylet) WARNING: 16 PYTHON worker processes have been started on node: 424e0f50137d0212cb68607fc3306ab347751605fbc7973d6f0a9db1 with address: 172.16.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
100%|███████████████████████████████████| 150/150 [02:12<00:00, 1.13it/s]
Mismatched and errored requests.
mismatched request: 900197, expected: 9197
mismatched request: 5000, 300, 53, expected: 5353
mismatched request: 404407, expected: 4447
mismatched request: 200811, expected: 2811
...
Results for llm correctness test for baichuan2-13B-Chat queried with the openai api.
Errors: 0, Error rate: 0.0
Mismatched: 109, Mismatch rate: 0.7266666666666667
Completed: 150
Completed without errors: 150
# Results written to the result_outputs directory
ls baichuan2-13B-Chat_correctness_* -alh
-rw-r--r-- 1 root root 140K Apr 26 18:26 baichuan2-13B-Chat_correctness_individual_responses.json
-rw-r--r-- 1 root root 405 Apr 26 18:26 baichuan2-13B-Chat_correctness_summary.json
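Judging from the mismatch lines above, the correctness test expects the model to return a specific number as digits and flags any response whose digits differ from the expected value. A hedged sketch of that kind of check, not necessarily llm_correctness.py's exact logic:
import re

def digits_match(response_text: str, expected: int) -> bool:
    # Strip everything except digits and compare with the expected number.
    digits = re.sub(r"[^0-9]", "", response_text)
    return digits != "" and int(digits) == expected

print(digits_match("5000, 300, 53", 5353))  # False -> counted as a mismatch above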
3. Online Example
A test case available online: https://huggingface.co/datasets/ssong1/llmperf-bedrock
It compares different providers side by side, with the parameters set as follows:
- Total number of requests: 100
- Concurrency: 1
- Prompt token length: 1024
- Expected output length: 1024
- Model tested: claude-instant-v1-100k
python token_benchmark_ray.py \
--model bedrock/anthropic.claude-instant-v1 \
--mean-input-tokens 1024 \
--stddev-input-tokens 0 \
--mean-output-tokens 1024 \
--stddev-output-tokens 100 \
--max-num-completed-requests 100 \
--num-concurrent-requests 1 \
--llm-api litellm
Using LLMPerf, a series of LLM inference providers were benchmarked. The analysis focuses on evaluating their performance, reliability, and efficiency against the following key metrics:
- Output token throughput: the average number of output tokens returned per second. This metric matters for applications that need high throughput, such as summarization and translation, and is easy to compare across models and providers.
- Time to first token (TTFT): how long it takes the LLM to return its first token. TTFT is especially important for streaming applications such as chatbots.
Time to first token (seconds): for streaming applications, TTFT is the time until the LLM returns its first token.
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
|---|---|---|---|---|---|---|---|---|---|
| bedrock | claude-instant-v1 | 1.21 | 1.29 | 1.12 | 2.19 | 1.17 | 1.27 | 1.89 | 2.17 |
Output token throughput (tokens/second): measured as the average number of output tokens returned per second. Results were collected by sending 100 requests to each LLM inference provider, and the mean output token throughput was computed over those 100 requests. A higher value indicates a higher-throughput LLM inference provider.
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
|---|---|---|---|---|---|---|---|---|---|
| bedrock | claude-instant-v1 | 65.64 | 65.98 | 16.05 | 110.38 | 57.29 | 75.57 | 99.73 | 106.42 |
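The percentile columns in these tables (and in the per-metric output earlier) are ordinary percentiles over the per-request samples, so they can be recomputed from raw results with numpy. A small sketch using made-up per-request TTFT samples:
import numpy as np

ttft_s = np.array([1.12, 1.17, 1.21, 1.27, 1.89, 2.19])  # hypothetical per-request TTFTs
for q in (25, 50, 75, 95, 99):
    print(f"p{q} = {np.percentile(ttft_s, q):.2f}")
print(f"mean = {ttft_s.mean():.2f}, min = {ttft_s.min():.2f}, max = {ttft_s.max():.2f}")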