LLM Stress Testing

Introduction

Stress testing a large language model (LLM) is a key way to evaluate its performance and stability under heavy load. The goal is to simulate the high-concurrency request patterns of real-world usage, probe the system's limits, and uncover potential bottlenecks and stability problems, so that the model can run reliably and efficiently in production.

Why Stress-Test an LLM

  • Performance evaluation: stress testing reveals the model's response time and throughput when it handles a large volume of requests, showing how the system performs under heavy load and where the potential bottlenecks are.
  • System stability: simulating heavy load verifies that the model does not crash or degrade badly when traffic spikes.
  • Scalability: stress testing helps assess whether the system scales as load increases, ideally close to linearly, and whether new bottlenecks appear while scaling out.
  • Resource utilization: analyzing CPU, GPU, memory, and other resource usage under load makes it possible to tune resource allocation and improve efficiency (see the monitoring sketch after this list).
  • Finding latent problems: heavy load exposes issues such as memory leaks, connection timeouts, and thread deadlocks, providing concrete input for optimization.
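
As a minimal sketch of the resource-utilization point above: the snippet below polls GPU and CPU usage once per second while a stress test runs in another process. It assumes the pynvml (nvidia-ml-py) and psutil packages are installed and an NVIDIA GPU is present; the device index and sampling interval are illustrative, not part of the original post.

import time
import psutil   # CPU / RAM usage
import pynvml   # NVIDIA GPU usage via NVML

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0; change if needed

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(
            f"GPU {util.gpu}% | GPU mem {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB "
            f"| CPU {psutil.cpu_percent()}% | RAM {psutil.virtual_memory().percent}%"
        )
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()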

Dataset

Identical inputs may hit the model's cache and make inference look faster than it really is. To evaluate more accurately what the model can do on the current hardware, we can take an open dataset from the internet and randomly sample a portion of it as test input, as sketched below.
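
The full implementation, including tokenization and length filtering, is in util.py further below; as a minimal illustration of the idea, random sampling from a ShareGPT-style JSON file (the file name matches the dataset downloaded in the next step) can look like this:

import json
import random

# Load the ShareGPT file and randomly pick 100 prompts so that repeated
# benchmark runs do not keep reusing (and caching) the same inputs.
with open('ShareGPT_V3_unfiltered_cleaned_split.json') as f:
    dataset = json.load(f)

dataset = [d for d in dataset if len(d["conversations"]) >= 2]
prompts = [d["conversations"][0]["value"] for d in random.sample(dataset, 100)]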

Downloading the dataset

# Repository
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main
# If the site is unreachable, download from the mirror instead
https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main

# Download (pick one of the two URLs)
curl -L -o ShareGPT_V3_unfiltered_cleaned_split.json https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

curl -L -o ShareGPT_V3_unfiltered_cleaned_split.json https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Environment variables you may need to set when running the benchmarks
export HF_ENDPOINT=https://hf-mirror.com    # route Hugging Face Hub traffic through the mirror
export HF_HUB_CACHE=/data/huggingface       # where downloaded models/tokenizers are cached

Dataset format

The dataset is a single JSON file whose content looks roughly like this:

[
  {
    "id": "i6IyJda_0",
    "conversations": [
      {
        "from": "human",
        "value": "How to tell if a customer segment is well segmented? In 3 bullet points."
      },
      {
        "from": "gpt",
        "value": "1. Homogeneity: The segment should consist of customers who share similar characteristics and behaviors.\n2. Distinctiveness: The segment should be different from other segments in terms of their characteristics
 and behaviors.\n3. Stability: The segment should remain relatively stable over time and not change drastically. The characteristics and behaviors of customers within the segment should not change significantly."
      }
    ]
  },
  {
    "id": "A5AbcES_0",
    "conversations": [
      {
        "from": "human",
        "value": "In Java, I want to replace string like \"This is a new {object} at {place}\" with a Map, {object: \"student\", \"point 3, 4\"}, and get a result \"This is a new student at point 3, 4\". How can I do?"
      },
      {
        "from": "gpt",
        "value": "You can use the `String.format()` method in Java to replace placeholders in a string with values from a map. Here's an example code snippet that demonstrates how you can achieve this:\n```java\nimport java.util.HashMap;\nimport java.util.Map;\n\npublic class StringReplaceExample {\n    public static void main(String[] args) {\n        String input = \"This is a new {object} at {place}\";\n        Map<String, String> replacements = new HashMap<>();\n        replacements.put(\"object\", \"student\");\n        replacements.put(\"place\", \"point 3, 4\");\n\n        String output = replacePlaceholders(input, replacements);\n        System.out.println(output);\n    }\n\n    public static String replacePlaceholders(String input, Map<String, String> replacements) {\n        for (Map.Entry<String, String> entry : replacements.entrySet()) {\n            String placeholder = \"{\" + entry.getKey() + \"}\";\n            String replacement = entry.getValue();\n            input = input.replace(placeholder, replacement);\n        }\n        return input;\n    }\n}\n```\nIn this example, we define the input string as \"This is a new {object} at {place}\", and create a `HashMap` called `replacements` that maps the placeholders \"object\" and \"place\" to their corresponding values \"student\" and \"point 3, 4\". We then pass these values to the `replacePlaceholders()` method, which iterates over the entries in the `replacements` map and replaces each placeholder in the input string with its corresponding value using the `replace()` method. Finally, the `replacePlaceholders()` method returns the modified string, which is printed to the console.\n\nThe output of this program will be:\n```csharp\nThis is a new student at point 3, 4\n```\nNote that you can modify the `replacements` map to include additional placeholders and their corresponding values, and the `replacePlaceholders()` method will automatically replace them in the input string."
      }
    ]
  }
]

benchmark

Benchmarking is a key way to evaluate the performance of a large language model (LLM). benchmark_latency and benchmark_serving cover two important aspects of evaluating a model, especially once it is deployed and handling real traffic. Their shared goal is to confirm that the model responds quickly and keeps serving requests well under different conditions.

benchmark_latency

Latency is the time from when the input is sent to the model until the model returns its result. In model evaluation, benchmark_latency mainly looks at:

  • Inference time: how long a single inference takes, including model loading, data pre-processing, the inference itself, and post-processing.
  • Throughput: the number of requests handled per second; for the sequential script below this is simply the inverse of the average latency.
  • Consistency: whether latency stays stable under different load levels, or shows noticeable jitter and latency spikes.

The script below sends the sampled requests one at a time to an OpenAI-compatible chat-completions endpoint and reports total time, throughput, and per-token latencies:

import aiohttp
import asyncio
import json
import logging
import time
from typing import List, Tuple

import numpy as np

from util import sample_requests, get_tokenizer

logger = logging.getLogger(__name__)
# Tuple[prompt_len, completion_tokens, request_latency_in_seconds]
REQUEST_LATENCY: List[Tuple[int, int, float]] = []

# Replace with your API key, endpoint, and deployed model UID
API_KEY = 'your_api_key'
API_URL = 'http://localhost:80/v1/chat/completions'
MODEL_UID = 'qwen1.5-chat-7b'

HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}


async def send_request(session, payload, prompt_len):
    request_start_time = time.time()
    async with session.post(API_URL, json=payload, headers=HEADERS) as response:
        if response.status == 200:
            result = await response.json()
            completion_tokens = result["usage"]["completion_tokens"]
            request_end_time = time.time()
            request_latency = request_end_time - request_start_time
            REQUEST_LATENCY.append((prompt_len, completion_tokens, request_latency))
            return result
        else:
            return {'error': response.status, 'message': await response.text()}


async def benchmark(
    input_requests: List[Tuple[str, int, int]],
) -> None:
    async with aiohttp.ClientSession() as session:
        for idx, request in enumerate(input_requests):
            prompt, prompt_len, output_len = request
            payload = {
                'model': MODEL_UID,
                "n": 1,
                "temperature": 0,
                "top_p": 1.0,
                'messages': [{"role": "user", "content": prompt}],
                'max_tokens': 8192
            }
            response = await send_request(session, payload, prompt_len)
            print(f"Response {idx + 1}: {json.dumps(response, ensure_ascii=False, indent=2)}")


def main():
    logger.info("Preparing for benchmark.")
    dataset_path = r'ShareGPT_V3_unfiltered_cleaned_split.json'
    tokenizer_name_or_path = 'qwen/Qwen1.5-7B-Chat'
    num_request = 10
    tokenizer = get_tokenizer(tokenizer_name_or_path)
    input_requests = sample_requests(dataset_path, num_request, tokenizer)

    logger.info("Benchmark starts.")
    benchmark_start_time = time.time()
    asyncio.run(benchmark(input_requests))
    benchmark_end_time = time.time()
    benchmark_time = benchmark_end_time - benchmark_start_time

    print(f"Total time: {benchmark_time:.2f} s")
    print(f"Throughput: {len(REQUEST_LATENCY) / benchmark_time:.2f} requests/s")
    avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
    print(f"Average latency: {avg_latency:.2f} s")
    avg_per_token_latency = np.mean(
        [
            latency / (prompt_len + output_len)
            for prompt_len, output_len, latency in REQUEST_LATENCY
        ]
    )
    print(f"Average latency per token: {avg_per_token_latency:.2f} s")
    avg_per_output_token_latency = np.mean(
        [latency / output_len for _, output_len, latency in REQUEST_LATENCY]
    )
    print("Average latency per output token: " f"{avg_per_output_token_latency:.2f} s")


if __name__ == '__main__':
    main()
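
Assuming this script and util.py (shown later in this post) are saved in the same directory and the API_KEY, API_URL, and MODEL_UID constants have been adjusted to the actual deployment, running it with a plain python command prints each response followed by the total time, request throughput, and average per-token latencies.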

benchmark_serving

Serving means deploying and running the model in production to handle real user requests. benchmark_serving focuses on the model's overall performance and stability in that setting, including:

  • Scalability: whether the system can scale effectively and keep performing well as load increases.
  • Reliability: how dependable the system is, including its stability under heavy load or abnormal conditions.
  • Resource utilization: CPU, GPU, memory, and other resource usage, making sure resources are used efficiently while performance stays high.
  • Latency under load: how latency behaves under highly concurrent requests.

The script below drives the same endpoint with a fixed number of concurrent worker tasks pulling requests from a shared queue, then reports request throughput, latencies, and output-token throughput:

import aiohttp
import asyncio
import json
import logging
import time
from typing import List, Tuple

import numpy as np

from util import sample_requests, get_tokenizer

logger = logging.getLogger(__name__)
# Tuple[prompt_len, completion_tokens, request_latency_in_seconds]
REQUEST_LATENCY: List[Tuple[int, int, float]] = []

# Replace with your API key, endpoint, and deployed model UID
API_KEY = 'your_api_key'
API_URL = 'http://localhost:80/v1/chat/completions'
MODEL_UID = 'qwen1.5-chat-7b'

HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}


async def send_request(session, payload, prompt_len):
    request_start_time = time.time()
    async with session.post(API_URL, json=payload, headers=HEADERS) as response:
        if response.status == 200:
            result = await response.json()
            completion_tokens = result["usage"]["completion_tokens"]
            request_end_time = time.time()
            request_latency = request_end_time - request_start_time
            REQUEST_LATENCY.append((prompt_len, completion_tokens, request_latency))
            return result
        else:
            return {'error': response.status, 'message': await response.text()}


class BenchMarkRunner:

    def __init__(
        self,
        requests: List[Tuple[str, int, int]],  # prompt, prompt_len, completion_len
        concurrency: int,
    ):
        self.concurrency = concurrency
        self.requests = requests
        self.request_left = len(requests)
        self.request_queue = asyncio.Queue(concurrency or 100)

    async def run(self):
        tasks = []
        for i in range(self.concurrency):
            tasks.append(asyncio.create_task(self.worker()))
        for req in self.requests:
            await self.request_queue.put(req)
        # When all requests are done, most workers will stay blocked on self.request_queue, but at least one worker will exit
        await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

    async def worker(self):
        timeout = aiohttp.ClientTimeout(total=5 * 60)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            while self.request_left > 0:
                prompt, prompt_len, completion_len = await self.request_queue.get()
                payload = {
                    'model': MODEL_UID,
                    "n": 1,
                    "temperature": 0,
                    "top_p": 1.0,
                    'messages': [{"role": "user", "content": prompt}],
                    'max_tokens': 8192
                }
                response = await send_request(session, payload, prompt_len)
                self.request_left -= 1
                print(f"Response {len(self.requests) - self.request_left}: {json.dumps(response, ensure_ascii=False, indent=2)}")


def main():
    dataset_path = r'ShareGPT_V3_unfiltered_cleaned_split.json'
    tokenizer_name_or_path = 'qwen/Qwen1.5-7B-Chat'
    num_request = 100
    concurrency = 10
    logger.info("Preparing for benchmark.")
    tokenizer = get_tokenizer(tokenizer_name_or_path)
    input_requests = sample_requests(dataset_path, num_request, tokenizer)

    logger.info("Benchmark starts.")
    benchmark_start_time = time.time()
    asyncio.run(BenchMarkRunner(input_requests, concurrency).run())
    benchmark_end_time = time.time()
    benchmark_time = benchmark_end_time - benchmark_start_time

    print(f"Total time: {benchmark_time:.2f} s")
    print(f"Throughput: {len(REQUEST_LATENCY) / benchmark_time:.2f} requests/s")
    avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
    print(f"Average latency: {avg_latency:.2f} s")
    avg_per_token_latency = np.mean(
        [
            latency / (prompt_len + output_len)
            for prompt_len, output_len, latency in REQUEST_LATENCY
        ]
    )
    print(f"Average latency per token: {avg_per_token_latency:.2f} s")
    avg_per_output_token_latency = np.mean(
        [latency / output_len for _, output_len, latency in REQUEST_LATENCY]
    )
    print(f"Average latency per output token: {avg_per_output_token_latency:.2f} s")
    token_throughput = sum(output_len for _, output_len, _ in REQUEST_LATENCY) / benchmark_time
    print(f"Output token throughput: {token_throughput:.2f} tokens/s")


if __name__ == '__main__':
    main()
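
To use the serving benchmark for the scalability question above, a common pattern is to sweep the concurrency level and watch where throughput stops improving. Below is a minimal sketch, assuming the script above is saved as benchmark_serving.py (a hypothetical file name) so that BenchMarkRunner and REQUEST_LATENCY can be imported:

import asyncio
import time

import benchmark_serving as bs   # hypothetical module name for the script above
from util import get_tokenizer, sample_requests

def sweep(levels=(1, 5, 10, 20, 40)):
    tokenizer = get_tokenizer('qwen/Qwen1.5-7B-Chat')
    requests = sample_requests('ShareGPT_V3_unfiltered_cleaned_split.json', 100, tokenizer)
    for concurrency in levels:
        bs.REQUEST_LATENCY.clear()   # reset the shared metrics list between runs
        start = time.time()
        asyncio.run(bs.BenchMarkRunner(requests, concurrency).run())
        elapsed = time.time() - start
        print(f"concurrency={concurrency}: "
              f"{len(bs.REQUEST_LATENCY) / elapsed:.2f} req/s in {elapsed:.1f} s")

if __name__ == '__main__':
    sweep()

Throughput typically grows with concurrency until the hardware saturates and then flattens while latency keeps rising; the knee of that curve is a practical upper bound for the deployment.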

Appendix

Fixed number of requests, all sent at once

import asyncio
import aiohttp
import json

# Replace with your API key and endpoint
API_KEY = 'your_api_key'
API_URL = 'http://localhost:80/v1/chat/completions'

HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

# Example request body; adjust it to your deployment. The prompt asks for a very
# long essay so that each request keeps the model generating for a while.
REQUEST_BODY = {
    'model': 'qwen1.5-chat-7b',
    'messages': [{'role': 'user', 'content': '写一段8千字的作文'}],
    'max_tokens': 8192
}

async def fetch(session, payload):
    async with session.post(API_URL, json=payload, headers=HEADERS) as response:
        if response.status == 200:
            result = await response.json()
            return result
        else:
            return {'error': response.status, 'message': await response.text()}

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(20):  # create 20 concurrent tasks
            tasks.append(fetch(session, REQUEST_BODY))
        responses = await asyncio.gather(*tasks)
        for idx, response in enumerate(responses):
            print(f"Response {idx + 1}: {json.dumps(response, ensure_ascii=False, indent=2)}")

if __name__ == '__main__':
    asyncio.run(main())

Fixed concurrency and fixed number of requests

import asyncio
import aiohttp
import json

# Replace with your API key and endpoint
API_KEY = 'your_api_key'
API_URL = 'https://api.yourservice.com/v1/chat/completions'

HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

# Example request body; adjust it to your deployment. The prompt asks for a very
# long essay so that each request keeps the model generating for a while.
REQUEST_BODY = {
    'model': 'qwen1.5-chat-7b',
    'messages': [{'role': 'user', 'content': '写一段8千字的作文'}],
    'max_tokens': 8192
}

async def fetch(session, payload, semaphore, idx):
    async with semaphore:
        async with session.post(API_URL, json=payload, headers=HEADERS) as response:
            if response.status == 200:
                result = await response.json()
                print(f"Response {idx + 1}: {json.dumps(result, ensure_ascii=False, indent=2)}")
                return result
            else:
                error_message = {'error': response.status, 'message': await response.text()}
                print(f"Response {idx + 1} Error: {json.dumps(error_message, ensure_ascii=False, indent=2)}")
                return error_message

async def main():
    semaphore = asyncio.Semaphore(20)  # limit concurrency to 20
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(500):  # create 500 requests
            tasks.append(fetch(session, REQUEST_BODY, semaphore, i))
        responses = await asyncio.gather(*tasks)
        return responses

if __name__ == '__main__':
    asyncio.run(main())

util.py

Provides helpers for loading the tokenizer and sampling requests from the dataset.

import json
import logging
import random
from typing import TYPE_CHECKING, List, Tuple

from transformers import AutoTokenizer, PreTrainedTokenizerFast

logger = logging.getLogger(__name__)
if TYPE_CHECKING:
    from transformers import PreTrainedTokenizerBase
# A fast LLaMA tokenizer with the pre-processed `tokenizer.json` file.
_FAST_LLAMA_TOKENIZER = "hf-internal-testing/llama-tokenizer"


def get_tokenizer(
        tokenizer_name: str,
        *args,
        tokenizer_mode: str = "auto",
        trust_remote_code: bool = False,
        **kwargs,
) -> "PreTrainedTokenizerBase":
    """Gets a tokenizer for the given model name via Huggingface."""
    if tokenizer_mode == "slow":
        if kwargs.get("use_fast", False):
            raise ValueError("Cannot use the fast tokenizer in slow tokenizer mode.")
        kwargs["use_fast"] = False

    if (
            "llama" in tokenizer_name.lower()
            and kwargs.get("use_fast", True)
            and tokenizer_name != _FAST_LLAMA_TOKENIZER
    ):
        logger.info(
            "For some LLaMA-based models, initializing the fast tokenizer may "
            "take a long time. To eliminate the initialization time, consider "
            f"using '{_FAST_LLAMA_TOKENIZER}' instead of the original "
            "tokenizer."
        )
    try:
        tokenizer = AutoTokenizer.from_pretrained(
            tokenizer_name, *args, trust_remote_code=trust_remote_code, **kwargs
        )
    except TypeError as e:
        # The LLaMA tokenizer causes a protobuf error in some environments.
        err_msg = (
            "Failed to load the tokenizer. If you are using a LLaMA-based "
            f"model, use '{_FAST_LLAMA_TOKENIZER}' instead of the original "
            "tokenizer."
        )
        raise RuntimeError(err_msg) from e
    except ValueError as e:
        # If the error pertains to the tokenizer class not existing or not
        # currently being imported, suggest using the --trust-remote-code flag.
        if not trust_remote_code and (
                "does not exist or is not currently imported." in str(e)
                or "requires you to execute the tokenizer file" in str(e)
        ):
            err_msg = (
                "Failed to load the tokenizer. If the tokenizer is a custom "
                "tokenizer not yet available in the HuggingFace transformers "
                "library, consider setting `trust_remote_code=True` in LLM "
                "or using the `--trust-remote-code` flag in the CLI."
            )
            raise RuntimeError(err_msg) from e
        else:
            raise e

    if not isinstance(tokenizer, PreTrainedTokenizerFast):
        logger.warning(
            "Using a slow tokenizer. This might cause a significant "
            "slowdown. Consider using a fast tokenizer instead."
        )
    return tokenizer


def sample_requests(
        dataset_path: str,
        num_requests: int,
        tokenizer: "PreTrainedTokenizerBase",
) -> List[Tuple[str, int, int]]:
    # Load the dataset
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Filter out the conversations with less than 2 turns.
    dataset = [data for data in dataset if len(data["conversations"]) >= 2]
    # Only keep the first two turns of each conversation.
    dataset = [
        (data["conversations"][0]["value"], data["conversations"][1]["value"])
        for data in dataset
    ]
    # Tokenize the prompts and completions(input_msg, input_token_len, output_token_len).
    tokenized_dataset = []
    prompts = [prompt for prompt, _ in dataset]
    prompt_token_ids = tokenizer(prompts).input_ids
    completions = [completion for _, completion in dataset]
    completion_token_ids = tokenizer(completions).input_ids
    for i in range(len(dataset)):
        output_len = len(completion_token_ids[i])
        tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))
    # Filter out too long sequences.
    filtered_dataset: List[Tuple[str, int, int]] = []
    for prompt, prompt_token_ids, output_len in tokenized_dataset:
        prompt_len = len(prompt_token_ids)
        if prompt_len < 4 or output_len < 4:
            # Prune too short sequences.
            continue
        if prompt_len > 1024 or prompt_len + output_len > 2048:
            # Prune too long sequences.
            continue
        filtered_dataset.append((prompt, prompt_len, output_len))
    # Sample the requests.
    sampled_requests = random.sample(filtered_dataset, num_requests)
    return sampled_requests

curl

curl -i http://localhost:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4",
    "max_tokens": 8192,
    "messages": [
      {"role": "user", "content": "hello"}
    ]
  }'

# Call the endpoint 10 times in a loop
for i in {1..10}
do
  curl -i http://localhost:80/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
      "model": "gpt-4",
	  "max_tokens": "8192",
      "messages": [
        {"role": "user", "content": "hello"}
      ]
    }'
done

Health check

curl -i http://localhost:80/v1/chat/completions -H "Content-Type:application/json" -d '{"model":"qwen2-72b-fp8-instruct","max_tokens":1,"messages":[{"role":"user","content":"this is a health check, please return ok"}]}'