Testing LLM API Performance with llmperf

llmperf is a tool for evaluating the performance of LLM APIs.

Official repository: https://github.com/ray-project/llmperf

1. Installation and Setup

The scripts require a Python 3 environment, so install Python 3 on the client machine before testing. Python 3.8 is used in this article.

# Create a Python virtual environment
python -m venv llmperf-env
source llmperf-env/bin/activate

# Install dependencies
pip install torch tensorflow

# Unset any proxy settings
unset http_proxy
unset https_proxy

# Install llmperf
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .

Download the tokenizer from Hugging Face and place the downloaded files under the llmperf directory in the layout shown below:

https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main
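
One convenient way to fetch these files is with the huggingface_hub package (a minimal sketch; assumes `pip install huggingface_hub` and network access to huggingface.co):

from huggingface_hub import snapshot_download

# Download the tokenizer repo into ./hf-internal-testing/llama-tokenizer inside the llmperf folder
snapshot_download(
    repo_id="hf-internal-testing/llama-tokenizer",
    local_dir="hf-internal-testing/llama-tokenizer",
)

After downloading, the llmperf directory should look like this: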

$ ls
analyze-token-benchmark-results.ipynb  LICENSE.txt         NOTICE.txt     pyproject.toml  requirements-dev.txt  src
hf-internal-testing                    llm_correctness.py  pre-commit.sh  README.md       result_outputs        token_benchmark_ray.py

$ tree hf-internal-testing/
hf-internal-testing/
└── llama-tokenizer
    ├── gitattributes
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── tokenizer.model

# Modify token_benchmark_ray.py so the tokenizer is loaded from the local path:
    tokenizer = LlamaTokenizerFast.from_pretrained(
        "./hf-internal-testing/llama-tokenizer"
    )
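
A quick way to confirm the local tokenizer files load correctly (a small check, assuming transformers is installed in the same environment):

from transformers import LlamaTokenizerFast

# Load the tokenizer from the local path and tokenize a sample string
tokenizer = LlamaTokenizerFast.from_pretrained("./hf-internal-testing/llama-tokenizer")
print(tokenizer.encode("hello llmperf"))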

2. Running Tests

llmperf supports both load testing and correctness testing.

2.1 LLM API Load Test

llmperf is compatible with several types of LLM APIs, including 'openai', 'anthropic', 'litellm', and others.

OpenAI Compatible APIs:

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1"

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'

Test case: taking a locally deployed model as an example, benchmark the baichuan2-13B-Chat model.

# Set the API key and base URL for the LLM API; this test uses the OpenAI-compatible API type.
export OPENAI_API_KEY=EMPTY
export OPENAI_API_BASE=http://10.210.18.41:32576/v1
# Run the script, setting the model, token-related parameters, concurrency, etc.
python token_benchmark_ray.py \
--model "baichuan2-13B-Chat" \
--mean-input-tokens 100 \
--stddev-input-tokens 50 \
--mean-output-tokens 500 \
--stddev-output-tokens 100 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
2024-04-26 17:19:19,470 INFO worker.py:1749 -- Started a local Ray instance.
100%|████████████████████████████████████| 2/2 [00:11<00:00,  5.67s/it]
Results for token benchmark for baichuan2-13B-Chat queried with the openai api.

inter_token_latency_s
    p25 = 0.08192225433145225
    p50 = 0.08256251340393315
    p75 = 0.08320277247641406
    p90 = 0.08358692791990259
    p95 = 0.08371497973439876
    p99 = 0.08381742118599571
    mean = 0.08256251340393315
    min = 0.08128199525897137
    max = 0.08384303154889494
    stddev = 0.001810926127469796
ttft_s
    p25 = 0.1733547967596678
    p50 = 0.18000664250575937
    p75 = 0.18665848825185094
    p90 = 0.19064959569950587
    p95 = 0.1919799648487242
    p99 = 0.19304426016809884
    mean = 0.18000664250575937
    min = 0.16670295101357624
    max = 0.1933103339979425
    stddev = 0.018814260937872945
end_to_end_latency_s
    p25 = 3.06721269452828
    p50 = 5.121514831029344
    p75 = 7.175816967530409
    p90 = 8.408398249431048
    p95 = 8.81925867673126
    p99 = 9.147947018571431
    mean = 5.121514831029344
    min = 1.0129105580272153
    max = 9.230119104031473
    stddev = 5.810443885303661
request_output_throughput_token_per_s
    p25 = 9.643341004971674
    p50 = 10.401396010428488
    p75 = 11.159451015885304
    p90 = 11.614284019159394
    p95 = 11.765895020250756
    p99 = 11.887183821123847
    mean = 10.401396010428488
    min = 8.88528599951486
    max = 11.917506021342119
    stddev = 2.1441033394836757
number_input_tokens
    p25 = 99.5
    p50 = 103.0
    p75 = 106.5
    p90 = 108.6
    p95 = 109.3
    p99 = 109.86
    mean = 103.0
    min = 96
    max = 110
    stddev = 9.899494936611665
number_output_tokens
    p25 = 34.25
    p50 = 59.5
    p75 = 84.75
    p90 = 99.9
    p95 = 104.94999999999999
    p99 = 108.99
    mean = 59.5
    min = 9
    max = 110
    stddev = 71.4177848998413
Number Of Errored Requests: 0
Overall Output Throughput: 10.486871057088438
Number Of Completed Requests: 2
Completed Requests Per Minute: 10.574996023954728

# Result files under the result_outputs directory
ls -alh baichuan2-13B-Chat_100_500_*
-rw-r--r-- 1 root root 1.9K Apr 26 18:00 baichuan2-13B-Chat_100_500_individual_responses.json
-rw-r--r-- 1 root root 4.4K Apr 26 18:00 baichuan2-13B-Chat_100_500_summary.json
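
The summary JSON contains the same metrics that are printed to the console. A minimal sketch for inspecting it programmatically (the file name below matches the run above; key names may vary between llmperf versions):

import json

# Load the load-test summary produced above and print every recorded metric
with open("result_outputs/baichuan2-13B-Chat_100_500_summary.json") as f:
    summary = json.load(f)

for key, value in summary.items():
    print(f"{key}: {value}")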

Explanation of the test script parameters:

python token_benchmark_ray.py --help
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2024-04-26 17:58:32,799 INFO worker.py:1749 -- Started a local Ray instance.
usage: token_benchmark_ray.py [-h] --model MODEL [--mean-input-tokens MEAN_INPUT_TOKENS] [--stddev-input-tokens STDDEV_INPUT_TOKENS] [--mean-output-tokens MEAN_OUTPUT_TOKENS]
                              [--stddev-output-tokens STDDEV_OUTPUT_TOKENS] [--num-concurrent-requests NUM_CONCURRENT_REQUESTS] [--timeout TIMEOUT]
                              [--max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS] [--additional-sampling-params ADDITIONAL_SAMPLING_PARAMS] [--results-dir RESULTS_DIR] [--llm-api LLM_API]
                              [--metadata METADATA]

Run a token throughput and latency benchmark.

options:
  -h, --help            show this help message and exit
  --model MODEL         The model to use for this load test.
  --mean-input-tokens MEAN_INPUT_TOKENS
                        The mean number of tokens to send in the prompt for the request. (default: 550)
  --stddev-input-tokens STDDEV_INPUT_TOKENS
                        The standard deviation of number of tokens to send in the prompt for the request. (default: 150)
  --mean-output-tokens MEAN_OUTPUT_TOKENS
                        The mean number of tokens to generate from each llm request. This is the max_tokens param for the completions API. Note that this is not always the number of tokens returned.
                        (default: 150)
  --stddev-output-tokens STDDEV_OUTPUT_TOKENS
                        The stdandard deviation on the number of tokens to generate per llm request. (default: 80)
  --num-concurrent-requests NUM_CONCURRENT_REQUESTS
                        The number of concurrent requests to send (default: 10)
  --timeout TIMEOUT     The amount of time to run the load test for. (default: 90)
  --max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS
                        The number of requests to complete before finishing the test. Note that its possible for the test to timeout first. (default: 10)
  --additional-sampling-params ADDITIONAL_SAMPLING_PARAMS
                        Additional sampling params to send with the each request to the LLM API. (default: {}) No additional sampling params are sent.
  --results-dir RESULTS_DIR
                        The directory to save the results to. (`default: `) No results are saved)
  --llm-api LLM_API     The name of the llm api to use. Can select from ['openai', 'anthropic', 'litellm'] (default: openai)
  --metadata METADATA   A comma separated list of metadata to include in the results, e.g. name=foo,bar=1. These will be added to the metadata field of the results.

2.2 LLM API Correctness Test

According to the llmperf README, the correctness test asks the model to convert a number written out in words back into digits, then checks whether the response contains the expected number; responses that do not are counted as mismatches.

OpenAI Compatible APIs:

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://console.endpoints.anyscale.com/m/v1

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"

A practical example, again using the locally deployed baichuan2-13B-Chat model:

export OPENAI_API_KEY=EMPTY
export OPENAI_API_BASE=http://10.210.18.41:32576/v1
python llm_correctness.py \
> --model "baichuan2-13B-Chat" \
> --max-num-completed-requests 150 \
> --timeout 600 \
> --num-concurrent-requests 10 \
> --results-dir "result_outputs"
2024-04-26 18:24:10,240 INFO worker.py:1749 -- Started a local Ray instance.
  0%|                                                                                                                                                                             | 0/150 [00:00<?, ?it/s](raylet) WARNING: 16 PYTHON worker processes have been started on node: 424e0f50137d0212cb68607fc3306ab347751605fbc7973d6f0a9db1 with address: 172.16.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
100%|███████████████████████████████████| 150/150 [02:12<00:00,  1.13it/s]
Mismatched and errored requests.
    mismatched request: 900197, expected: 9197
    mismatched request: 5000, 300, 53, expected: 5353
    mismatched request: 404407, expected: 4447
    mismatched request: 200811, expected: 2811
    ...

Results for llm correctness test for baichuan2-13B-Chat queried with the openai api.
Errors: 0, Error rate: 0.0
Mismatched: 109, Mismatch rate: 0.7266666666666667
Completed: 150
Completed without errors: 150

# Result files under the result_outputs directory
ls baichuan2-13B-Chat_correctness_* -alh
-rw-r--r-- 1 root root 140K Apr 26 18:26 baichuan2-13B-Chat_correctness_individual_responses.json
-rw-r--r-- 1 root root  405 Apr 26 18:26 baichuan2-13B-Chat_correctness_summary.json

3. Online Benchmark Example

A publicly available test case: https://huggingface.co/datasets/ssong1/llmperf-bedrock

It compares different providers side by side with the following parameter settings:

  • Total number of requests: 100
  • Concurrency: 1
  • Prompt token length: 1024
  • Expected output length: 1024
  • Model tested: claude-instant-v1-100k

python token_benchmark_ray.py \
    --model bedrock/anthropic.claude-instant-v1 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 100 \
    --max-num-completed-requests 100 \
    --num-concurrent-requests 1 \
    --llm-api litellm

Using LLMPerf, we benchmarked a range of LLM inference providers. The analysis focuses on evaluating their performance, reliability, and efficiency against the following key metrics:

  1. Output token throughput, the average number of output tokens returned per second. This metric matters for applications that need high throughput, such as summarization and translation, and is easy to compare across models and providers.
  2. Time to first token (TTFT), the time it takes the LLM to return its first token. TTFT is especially important for streaming applications such as chatbots (see the measurement sketch after this list).
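
To make the TTFT definition concrete, here is a minimal sketch that measures it by hand against an OpenAI-compatible endpoint (assumes the openai Python package >= 1.0; the model name and environment variables follow the earlier examples and are placeholders):

import os
import time
from openai import OpenAI

# Point the client at the OpenAI-compatible endpoint configured earlier
client = OpenAI(base_url=os.environ["OPENAI_API_BASE"], api_key=os.environ["OPENAI_API_KEY"])

start = time.perf_counter()
stream = client.chat.completions.create(
    model="baichuan2-13B-Chat",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
    max_tokens=32,
)
for chunk in stream:
    # TTFT is the time until the first streamed chunk arrives
    print(f"TTFT: {time.perf_counter() - start:.3f}s")
    break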

Time to first token (seconds): for streaming applications, TTFT is the time before the LLM returns its first token.

| Framework | Model             | Median | Mean | Min  | Max  | P25  | P75  | P95  | P99  |
| bedrock   | claude-instant-v1 | 1.21   | 1.29 | 1.12 | 2.19 | 1.17 | 1.27 | 1.89 | 2.17 |

Output token throughput (tokens/s): output token throughput is measured as the average number of output tokens returned per second. The results were collected by sending 100 requests to each LLM inference provider and computing the mean output token throughput over those 100 requests. A higher output token throughput indicates a higher-throughput LLM inference provider.

| Framework | Model             | Median | Mean  | Min   | Max    | P25   | P75   | P95   | P99    |
| bedrock   | claude-instant-v1 | 65.64  | 65.98 | 16.05 | 110.38 | 57.29 | 75.57 | 99.73 | 106.42 |