LLM Benchmarks

LLM papers very often list a menagerie of performance benchmarks to showcase their "breakthroughs", while we readers very likely know little about the specifics of each particular test suite.

There, then, lies a danger of being misled by authors who show only the tests tailored to their strengths. We should understand what is actually being presented in the swarm of LLM papers, so that we can better judge their quality and make sure the insights we draw from them are not distorted by the lens of presentation.

Sources

[1] a short listing of popular LLM tests:

https://luv-bansal.medium.com/benchmarking-llms-how-to-evaluate-language-model-performance-b5d061cc8679

[2] a longer listing:

Decoding 21 LLM Benchmarks: What You Need to Know

[3] a guide for benchmarking your AI project:

https://towardsdatascience.com/llm-evals-setup-and-the-metrics-that-matter-2cc27e8e35f3

LLM Benchmarks

Coding

HumanEval: LLM Benchmark for Code Generation [1]

HumanEval is the quintessential evaluation tool for measuring the performance of LLMs in code generation tasks.

HumanEval consists of the HumanEval dataset and the pass@k metric used to evaluate LLM performance. This hand-crafted dataset of 164 programming challenges with unit tests, together with the novel metric designed to assess the functional correctness of the generated code, has reshaped how we measure the performance of LLMs on code generation tasks.

HumanEval Dataset

The HumanEval dataset consists of 164 handwritten programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software-interview questions. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.

Pass@k Metric

The pass@k metric is designed to evaluate the functional correctness of generated code samples. It is defined as the probability that at least one of the top k generated code samples for a problem passes the unit tests. This mirrors how human developers judge code correctness: does it pass the unit tests?
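For concreteness, here is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper: generate n >= k samples per problem, count the number c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes. The numbers in the usage line are made up for illustration.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate for one problem: n samples generated,
        c of them pass the unit tests, k samples 'drawn'."""
        if n - c < k:
            # every size-k subset must contain at least one passing sample
            return 1.0
        # 1 - probability that all k drawn samples are failures,
        # i.e. 1 - C(n-c, k) / C(n, k) computed as a stable product
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # e.g. 200 samples for a problem, 23 of them pass: estimate pass@10
    print(pass_at_k(n=200, c=23, k=10))

The per-problem estimates are then averaged over all problems in the benchmark to report a single pass@k number.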

You can look at the HumanEval leaderboard here, where many code-finetuned language models are benchmarked; at the time of writing, GPT-4 holds the top position.

HumanEval Benchmark (Code Generation) | Papers With Code

MBPP (Mostly Basic Python Programming) [1]

The MBPP benchmark measures the ability of an LLM to synthesize short Python programs from natural-language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks designed to be solvable by entry-level programmers, covering programming fundamentals, standard-library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases to check functional correctness. You can find the MBPP dataset on Hugging Face.
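As a quick way to inspect the tasks, a minimal loading sketch with the Hugging Face datasets library (this assumes the dataset is published on the Hub under the id "mbpp"; splits and field names may differ between configurations, so check the dataset card):

    from datasets import load_dataset

    # assumed dataset id "mbpp"; adjust the split/config per the dataset card
    mbpp = load_dataset("mbpp", split="test")

    print(mbpp.column_names)   # inspect the available fields
    print(mbpp[0])             # one task: description, reference solution, tests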

==> MBPP states the coding problem in natural language, while HumanEval prompts with a function to be completed, with a comment/docstring stating the problem, sometimes involving mathematical expressions; hence the latter can claim to test more than simple coding ability.

==> in terms of coding difficulty, these two seem similar, roughly the "easy" level on LeetCode.

CodeXGLUE (General Language Understanding Evaluation benchmark for CODE) [2]

CodeXGLUE tests how well LLMs can understand and work with code. It measures code intelligence across different types of tasks, including:

  • Code-to-Code: Fixing code errors, finding duplicate code, etc.

  • Text-to-Code: Searching for code using natural language

  • Code-to-Text: Explaining what code does

  • Text-to-Text: Translating technical documentation

Here’s the full list of included tasks, datasets, programming languages, etc.:

CodeXGLUE table of tasks, datasets, and languages

Source: GitHub

NLP

Chatbot Arena (Chat) [1]

Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program to automatically evaluate the response quality.

Chatbot Arena is a benchmark platform for large language models (LLMs) that features anonymous, randomised battles between two LLMs in a crowdsourced manner. The LLMs are then ranked using the Elo rating system, which is widely used in chess.

The Arena contains popular open-source LLMs; a user can chat with two anonymous models side by side and then vote for whichever response is better. This crowdsourced way of collecting data represents some use cases of LLMs in the wild. You can try out Chatbot Arena at https://arena.lmsys.org and help rank LLMs by voting for the best response.

Chatbot Arena Elo is based on 42K anonymous votes from Chatbot Arena, aggregated using the Elo rating system.

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
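To make the rating mechanism concrete, here is a minimal sketch of a single Elo update after one head-to-head vote. The 400-point scale and K = 32 are the usual chess-style defaults; the actual leaderboard's aggregation settings may differ.

    def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
        """One Elo update after a vote between model A and model B.
        winner is 'a', 'b', or 'tie'."""
        expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))  # P(A beats B)
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        new_a = r_a + k * (score_a - expected_a)
        new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # e.g. two models both at 1000, A wins the vote: A gains 16 points, B loses 16
    print(elo_update(1000.0, 1000.0, "a"))

Repeated over thousands of crowdsourced votes, these updates converge to a ranking of the models.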

MT Bench (Mixed QA) [1]

MT-bench is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models.

MT-Bench is a carefully curated benchmark that includes 80 high-quality, multi-turn questions. These questions are tailored to assess the conversation flow and instruction-following capabilities of models in multi-turn dialogues. They include both common use cases and challenging instructions meant to distinguish between chatbots.

MT Bench Dataset

MT-Bench identifies 8 primary categories of user prompts: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM), and Knowledge II (humanities/social science), with 10 hand-crafted multi-turn questions per category, yielding 80 two-turn questions (160 turns) in total.

Evaluating the Chatbots' Answers

It is always difficult to evaluate free-form language answers. Human evaluation is usually preferred as the gold standard, but it is notoriously slow and expensive to run for every LLM, so the authors use GPT-4 as a judge to grade the chatbots' answers. This approach is explained in the paper "Judging LLM-as-a-Judge" and in the blog post where they evaluate Vicuna using GPT-4 as a judge.

ARC Benchmark (Reasoning MC) [1]

The AI2 Reasoning Challenge (ARC) is intended to be a more demanding "knowledge and reasoning" test, requiring far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

==> of course we cannot be sure there is any underlying human-like reasoning capacity in current ML-based AI; in fact I'd argue we can be fairly certain there isn't any, given the obvious lack of full propositional logical reasoning ability. We are simply testing for reasoning-like competence, which is crucial for using LLMs in simple tasks and human interactions.

The ARC Dataset

The ARC dataset contains 7,787 non-diagram, 4-way multiple-choice science questions designed for 3rd- through 9th-grade-level standardized tests.

==> the multiple-choice format is the key difference from MT-Bench, which requires conversational responses.

These questions, derived from numerous sources and targeting various knowledge types (e.g., spatial, experimental, algebraic, process, factual, structural, definition, and purpose), are split into an Easy Set and a Challenge Set.

The difficulty of the task can be gauged by the fact that when top neural models from the SQuAD and SNLI tasks were tested against the ARC benchmark, none were able to significantly outperform a random baseline.

Leaderboard
ARC maintains its own leaderboard here, and this benchmark is also part of the Hugging Face Open LLM Leaderboard.

MultiNLI (Multi-Genre Natural Language Inference) (Reasoning MC) [2]

This benchmark contains 433K sentence pairs—premise and hypothesis—across many distinct “genres” of written and spoken English data. It tests an LLM’s ability to assign the correct label to the hypothesis statement based on what it can infer from the premise.

Sample questions (correct answer bolded)

Sample MultiNLI question
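Since the sample figure is not reproduced here, an invented item in the same format (not an actual dataset entry) shows the structure; the model must label the hypothesis as entailment, neutral, or contradiction given the premise:

    # Illustrative MultiNLI-style item (invented for illustration, not from the dataset)
    example = {
        "premise": "The tour guide led the visitors through the old cathedral.",
        "hypothesis": "The visitors were inside a religious building.",
        "label": "entailment",   # other possible labels: "neutral", "contradiction"
    }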

WinoGrande (Reasoning MC) [2]

https://arxiv.org/pdf/1907.10641.pdf

WinoGrande’s a massive set of 44,000 problems based on the Winograd Schema Challenge. They take the form of nearly-identical sentence pairs with two possible answers. The right answer changes based on a trigger word. This tests the ability of LLMs to properly grasp context based on natural language processing.
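An illustrative pair in the classic Winograd-schema style (the well-known trophy/suitcase example, shown here for format only and not claimed to be an actual WinoGrande item): flipping the trigger word flips which noun the pronoun refers to.

    # Classic Winograd-schema-style pair, for format illustration only
    pair = [
        {"sentence": "The trophy doesn't fit in the suitcase because it is too large.",
         "options": ["the trophy", "the suitcase"], "answer": "the trophy"},
        {"sentence": "The trophy doesn't fit in the suitcase because it is too small.",
         "options": ["the trophy", "the suitcase"], "answer": "the suitcase"},
    ]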

HellaSwag (Reasoning QA) [1]

The HellaSwag benchmark tests commonsense reasoning about physical situations: the language model must complete a sentence by choosing, among 4 options, the one continuation that follows from common sense. It consists of questions that are trivial for humans (accuracy around 95%) but that state-of-the-art models struggle to answer (accuracy around 48%). The dataset was constructed through Adversarial Filtering (AF), a data collection paradigm that increases the difficulty of the dataset.

Adversarial Filtering (AF)

Adversarial Filtering (AF) is the data collection paradigm used to create the HellaSwag dataset. The key idea behind AF is to produce a dataset that remains adversarial for any arbitrary split of the training and test sets. It requires a generator of negative candidates (wrong answers that violate commonsense reasoning) and a series of discriminators that iteratively select an adversarial set of machine-generated wrong answers. The insight behind AF is to scale up the length and complexity of the dataset examples.
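A rough, heavily simplified sketch of the AF loop, under assumed interfaces: the generator and discriminator below are random stand-ins rather than the authors' models, and only the keep-the-hard-negatives iteration is shown.

    import random

    def generate_endings(context, n=4):
        # stand-in for a language-model generator of candidate (wrong) endings
        return [f"{context} ... generated ending #{random.randint(0, 999)}" for _ in range(n)]

    def train_discriminator(pool):
        # stand-in for training a real-vs-generated classifier on the current pool;
        # returns a scorer giving each ending a pseudo-probability of being human-written
        return lambda context, ending: random.random()

    def adversarial_filter(contexts, rounds=3, n_endings=4):
        pool = {c: generate_endings(c, n_endings) for c in contexts}
        for _ in range(rounds):
            score = train_discriminator(pool)
            for c, endings in pool.items():
                # keep only endings the current discriminator mistakes for human-written,
                # then top the pool back up with fresh generations for the next round
                hard = [e for e in endings if score(c, e) > 0.5]
                pool[c] = hard + generate_endings(c, n_endings - len(hard))
        return pool

    print(adversarial_filter(["A man is standing on a ladder."]))

The surviving machine-generated endings are the "hard" negatives that end up as the wrong options in the final dataset.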

HellaSwag's questions are segments of video captions (describing some event in the physical world). A caption segment provides an initial context for the LLM; each context is then followed by four options for completing it, only one of which is consistent with commonsense reasoning.

Leaderboard
The HellaSwag benchmark is part of the Hugging Face Open LLM Leaderboard.

BIG-Bench (Beyond the Imitation Game) (Reasoning QA) [2]

This one’s a holistic and collaborative benchmark with a set of 204 tasks ranging from language to math to biology to societal issues. It is designed to probe LLMs' ability for multi-step reasoning.

Sample task:

Sample BIG-Bench task

GSM8K (Reasoning QA) [2]

https://arxiv.org/pdf/2110.14168v1.pdf

GSM8K (Grade School Math 8K) is a set of 8.5K grade-school math problems. Each takes two to eight steps to solve using basic arithmetic operations. The questions are easy enough for a bright middle schooler to solve and are useful for testing LLMs' ability to work through multi-step math problems.

Sample question:

Sample GSM8K question
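Since the sample figure is not reproduced here, an invented problem in the same spirit (not from the dataset): "A baker makes 4 trays of 12 muffins each. She sells 30 muffins in the morning and half of the remainder in the afternoon. How many muffins are left?" Working through it: 4 x 12 = 48 baked; 48 - 30 = 18 left after the morning; half of 18 is 9 sold in the afternoon, so 18 - 9 = 9 muffins remain. That is three or four basic steps, exactly the kind of chain a model must carry out without slipping.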

Massive Multitask Language Understanding (Mixed MC) [1]

The MMLU benchmark measures a model's multitask accuracy. The test covers 57 tasks, including elementary mathematics, US history, computer science, law, and more, at varying depths from elementary to advanced professional level. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.

==> it's multiple choice

MMLU Dataset

The MMLU dataset is a proposed test for measuring massive multitask language understanding. It consists of 15,908 questions across 57 tasks covering various branches of knowledge, including the humanities, social sciences, hard sciences (STEM), and other areas that are important for some people to learn. The questions in the dataset were manually collected by graduate and undergraduate students. They are split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set has 1,540 questions, and the test set has 14,079 questions. For scoring, MMLU averages each model's performance per category (humanities, social science, STEM, and others) and then averages these four scores for a final score.
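A minimal sketch of that scoring scheme, i.e. a macro average over the four category groups; this assumes per-question results have already been bucketed by category and is illustrative rather than the official evaluation script.

    from collections import defaultdict

    def mmlu_final_score(results):
        """results: iterable of (category, is_correct) pairs, where category is
        one of 'humanities', 'social science', 'STEM', 'other'.
        Accuracy is computed per category, then the category accuracies are averaged."""
        buckets = defaultdict(list)
        for category, is_correct in results:
            buckets[category].append(1.0 if is_correct else 0.0)
        per_category = {c: sum(v) / len(v) for c, v in buckets.items()}
        return sum(per_category.values()) / len(per_category)

    # tiny toy run: macro average over the categories present
    toy = [("STEM", True), ("STEM", False), ("humanities", True),
           ("social science", True), ("other", False)]
    print(mmlu_final_score(toy))

The macro average keeps a small category (e.g. "other") from being drowned out by categories with many more questions.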

Leaderboard
MMLU maintains its own leaderboard here, and this benchmark is also part of the Hugging Face Open LLM Leaderboard.

TruthfulQA (1-shot) (Chat QA) [1]

The TruthfulQA benchmark measures whether a language model is truthful when generating answers to questions. Questions in the dataset are constructed so that humans might give incorrect answers because of false beliefs or misconceptions. To perform well, models must avoid generating false answers learned from imitating human texts.

==> basically things along the lines of "10 most commonly held misconceptions".

Larger models are likely less truthful because their larger data requirements come with a greater chance of ingesting false but popular information (e.g., opinionated, belief-based content, conspiracy theories, etc.). This contrasts with other NLP tasks, where performance improves with model size.

TruthfulQA Dataset

The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. It is believed that more truthful LLMs can foster significant advancements in fields like “medicine, law, science, and engineering”.

TriviaQA (Chat QA) [2]

TriviaQA is a reading comprehension test with 950K questions from a wide range of sources like Wikipedia. It's quite challenging because the answers aren't always straightforward and there’s a lot of context to sift through. It includes both human-verified and computer-generated questions.

Sample question (correct answer bolded):

Sample TriviaQA question

Sidenote: HuggingFace Leaderboard

Open LLM Leaderboard by HuggingFace aims to track, rank and evaluate open LLMs and chatbots. It consists of 4 benchmarks: AI2 Reasoning Challenge (ARC), HellaSwag, MMLU, and TruthfulQA.

==> now (2024.04.08) it actually features 2 more: WinoGrande and GSM8K

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

recall:

ARC, reasoning MC, across multiple fields;

HellaSwag, reasoning QA, focused on (physical) common sense

MMLU, Mixed MC, across multiple fields, more focused on knowledge than reasoning compared to ARC, but can be seen mostly as an extension to ARC

TruthfulQA, chat QA, mainly checking bias and training data quality

WinoGrande, reasoning MC, testing subtle understanding of natural language

GSM8K, reasoning QA, for simple math.

==> notably missing a coding benchmark; HumanEval is usually the go-to for coding benchmarking.
