LLM Evaluation: KIMI and qwen as Examples

Latest recommended article published on 2025-04-14 17:51:44

ylclll

Tags: nlp python ai

Article link: https://blog.csdn.net/ylclll/article/details/140741653

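Following the official kimi or qwen documentation, you can apply for your own key and call the corresponding model through its API. To evaluate a model you could use an existing framework such as opencompass, but after setting up the environment locally I could not get it to produce any results, and I found other people reporting the same problem, which is still unresolved. So here I take the MMLU dataset as an example and measure the model's accuracy directly.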

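1. Loading the dataset

The dataset can be downloaded from the Hugging Face website. If getting around the firewall is a hassle, you can instead point the environment variable at a mirror site inside China, for example: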

export HF_ENDPOINT=https://hf-mirror.com

Once this is set, downloads of both pretrained models and datasets are looked up on the mirror site.
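If you would rather set the mirror from inside the script than from the shell, a minimal sketch is below. It assumes the variable is set before `datasets` or `huggingface_hub` is imported, because as far as I know the endpoint is read when the library is first imported. The helper name `use_hf_mirror` is my own and not part of any library.

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com"):
    """Point Hugging Face downloads at a mirror for the current process."""
    os.environ["HF_ENDPOINT"] = endpoint
    return os.environ["HF_ENDPOINT"]


# Assumption: this runs before `datasets` / `huggingface_hub` is imported.
use_hf_mirror()
```

Unlike `export`, this only affects the current Python process, so other shells and scripts keep their own setting.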

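Browsing the dataset page on the site also shows the MMLU format:

Each record has the fields question, subject, choices, and answer. Note that answer is given as an integer from 0 to 3, corresponding to options A, B, C, D.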
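To make the index convention concrete, here is a small sketch that maps the 0 to 3 answer index to its option letter and to the text of the correct choice. The record below is a made-up example in the same shape as the dataset, not real MMLU data, and the helper names are my own.

```python
LETTERS = "ABCD"


def index_to_letter(answer):
    """Map the dataset's 0-3 answer index to its option letter."""
    return LETTERS[answer]


def answer_text(example):
    """Return the text of the correct choice for one record."""
    return example["choices"][example["answer"]]


# Hypothetical record with the same fields as the dataset (not real MMLU data)
example = {
    "question": "Which organelle produces most of a cell's ATP?",
    "subject": "high_school_biology",
    "choices": ["Ribosome", "Mitochondrion", "Golgi apparatus", "Lysosome"],
    "answer": 1,
}

print(index_to_letter(example["answer"]), answer_text(example))
```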

The code is as follows:

from datasets import load_dataset

subtask_name = "high_school_biology"  # one MMLU subtask
mmlu_dataset = load_dataset("cais/mmlu", subtask_name)
subtask_dataset = mmlu_dataset["test"]  # evaluate on the test split only

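2. Calling the API

Python sample code is available on each model's website and in its documentation. Here we take kimi as the example:

Go to the kimi website: kimi

Apply for an API key. The documentation gives the following code for calling the API: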

from openai import OpenAI
 
client = OpenAI(
    api_key="MOONSHOT_API_KEY", # replace MOONSHOT_API_KEY with the API key you obtained from the Kimi open platform
    base_url="https://api.moonshot.cn/v1",
)
 
completion = client.chat.completions.create(
    model = "moonshot-v1-8k",
    messages = [
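        {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI. You are proficient in Chinese and English conversations. You provide users with safe, helpful, and accurate answers. You will reject any questions involving terrorism, racism, or explicit and violent content. Moonshot AI is a proper noun and must not be translated into other languages."},
        {"role": "user", "content": "Hello, my name is Li Lei. What is 1+1?"}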
    ],
    temperature = 0.3,
)
 
# The API returns the reply message from the Kimi model (role=assistant)
print(completion.choices[0].message.content)

Install the required library:

pip install --upgrade 'openai>=1.0'

After replacing the placeholder with the key you applied for, you can run the code as a test.
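Rather than pasting the key into the source file, you may prefer to read it from an environment variable. The sketch below assumes you have exported a variable named `MOONSHOT_API_KEY` yourself. That variable name and the helper `load_api_key` are my own choices for illustration.

```python
import os


def load_api_key(env_var="MOONSHOT_API_KEY"):
    """Read the API key from an environment variable and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The client can then be created with `OpenAI(api_key=load_api_key(), base_url="https://api.moonshot.cn/v1")`, which keeps the key out of any file you might share.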

3. Evaluation

Here I take a shortcut: loop over every example in the test split and pull out question, choices, and answer from each one for scoring later.

for example in subtask_dataset:
    question = example['question']
    choices = example['choices']
    answer = example['answer']

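The scoring logic is also a shortcut: I hand-edited the prompt sent to the model so that it answers only with the correct option chosen from the choices, then compare the test-set answer with the model's reply. If they match, the example counts as answered correctly. You can also print every reply and judge it by hand.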
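The comparison described here is an exact match on the answer text, so a stray period or quote in the reply makes a correct answer count as wrong. A slightly more forgiving sketch is below. It is a variation, not the comparison the scripts in this post use, and the names `normalize` and `matches_gold` are my own.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def matches_gold(reply, choices, answer):
    """True if the model's reply equals the correct choice after normalization."""
    return normalize(reply) == normalize(choices[answer])


# Made-up choices for illustration (not real MMLU data)
choices = ["Ribosome", "Mitochondrion", "Golgi apparatus", "Lysosome"]
print(matches_gold("'Mitochondrion.'", choices, 1))
```

This is still strict: a reply that wraps the answer in a full sentence will not match, which is why printing the replies and checking them by hand remains useful.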

The code is as follows:

for example in subtask_dataset:
    question = example['question']
    choices = example['choices']
    answer = example['answer']

    try:
        completion = client.chat.completions.create(
            model="moonshot-v1-8k",
            messages=[
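                {
                    "role": "system",
                    "content": "You are Kimi, an AI assistant provided by Moonshot AI. You only need to answer with the correct option based on the passage and the options. Do not add any other words."
                },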
                {
                    "role": "user",
                    "content": f"Question: {question}\nChoices:\n{choices}\nAnswer:"
                },
            ],
            temperature=0.3,
        )

        answer_kimi = completion.choices[0].message.content.strip().lower()  # strip surrounding whitespace and lowercase

        correct_answer_text = choices[answer].strip().lower().split()  # strip, lowercase, and split into a list of words

        is_correct = correct_answer_text == answer_kimi.split()
        if is_correct:
            n += 1

    except RateLimitError as e:
        print(f"Rate limit exceeded: {e}")
        time.sleep(10)  # wait 10 seconds; the request is not retried, so this example counts as incorrect

    except Exception as e:
        print(f"An error occurred: {e}")
    a += 1
    print(a)
    acc_now = n / a
    print(f"当前Accuracy: {acc_now * 100:.2f}%")
    time.sleep(30)  # pause 30 seconds after every request

Because kimi limits how often the API can be called, the loop pauses after each evaluated example.
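Note that the loop sleeps after a rate limit error but then moves on to the next question, so the failed question is never asked again. If you want a real retry, a minimal sketch with exponential backoff is below. `call_with_retry` is a hypothetical helper of my own, and the delay values are arbitrary starting points rather than anything taken from the kimi documentation.

```python
import time


def call_with_retry(request, retry_on=(Exception,), max_attempts=5,
                    base_delay=10.0, sleep=time.sleep):
    """Call request() and retry on the given exceptions, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
```

In the kimi script you would wrap the `client.chat.completions.create(...)` call in a `lambda` and pass `retry_on=(RateLimitError,)`, so that only rate limit errors trigger a retry.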

qwen works the same way, and most models called through an API follow roughly the same logic. The complete code is as follows:

kimi.py

from datasets import load_dataset
from openai import OpenAI, RateLimitError
import time

client = OpenAI(
    api_key="key",
    base_url="https://api.moonshot.cn/v1",
)
subtask_name = "high_school_biology"
mmlu_dataset = load_dataset("cais/mmlu", subtask_name)
subtask_dataset = mmlu_dataset["test"]
num = len(subtask_dataset)
n = 0
a = 0
for example in subtask_dataset:
    question = example['question']
    choices = example['choices']
    answer = example['answer']

    try:
        completion = client.chat.completions.create(
            model="moonshot-v1-8k",
            messages=[
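                {
                    "role": "system",
                    "content": "You are Kimi, an AI assistant provided by Moonshot AI. You only need to answer with the correct option based on the passage and the options. Do not add any other words."
                },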
                {
                    "role": "user",
                    "content": f"Question: {question}\nChoices:\n{choices}\nAnswer:"
                },
            ],
            temperature=0.3,
        )

        answer_kimi = completion.choices[0].message.content.strip().lower()  # strip surrounding whitespace and lowercase

        correct_answer_text = choices[answer].strip().lower().split()  # strip, lowercase, and split into a list of words

        is_correct = correct_answer_text == answer_kimi.split()
        if is_correct:
            n += 1

    except RateLimitError as e:
        print(f"Rate limit exceeded: {e}")
        time.sleep(10)  # wait 10 seconds; the request is not retried, so this example counts as incorrect

    except Exception as e:
        print(f"An error occurred: {e}")
    a += 1
    print(a)
    acc_now = n / a
    print(f"当前Accuracy: {acc_now * 100:.2f}%")
    time.sleep(30)  # pause 30 seconds after every request

# Compute the final accuracy over the whole test split
acc = n / num
print(f"Accuracy: {acc * 100:.2f}%")

qwen.py

import random
from http import HTTPStatus
from datasets import load_dataset
import dashscope
# dashscope SDK version >= 1.14.0 is recommended
from dashscope import Generation

dashscope.api_key = 'key'
subtask_name = "high_school_biology"
mmlu_dataset = load_dataset("cais/mmlu", subtask_name)
subtask_dataset = mmlu_dataset["test"]

num = len(subtask_dataset)
n = 0
a = 0
for example in subtask_dataset:
    question = example['question']
    choices = example['choices']
    answer = example['answer']

    messages = [{'role': 'system', 'content': 'You are a helpful assistant.You only need to answer the correct answer according to the passage and the content of the options. Do not add other words.You can only choose your answer from the options'},
                {'role': 'user', 'content': f"Question: {question}\nChoices:\n{choices}\nAnswer:"}]
    response = Generation.call(model="qwen-turbo",
                            messages=messages,
                            # Set the random seed; if it is not set, the seed defaults to 1234
                            seed=random.randint(1, 10000),
                            temperature=0.8,
                            top_p=0.8,
                            top_k=50,
                            # Return the output in "message" format
                            result_format='message')
    if response.status_code != HTTPStatus.OK:
        # The request failed: report it and count this example as incorrect
        print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
            response.request_id, response.status_code,
            response.code, response.message
        ))
        a += 1
        continue
    words = response['output']['choices'][0]
    correct_answer_text = choices[answer].strip().lower().split()
    answer_qwen = words['message']['content'].strip().lower().split()

    # correct_answer_text and answer_qwen are already lists of words
    correct_answer_words = []
    answer_qwen_words = []

    for word in correct_answer_text:
        # Clean each word by removing unwanted characters, then add it to the new list
        cleaned_word = word.strip("'")  # strip single quotes
        correct_answer_words.append(cleaned_word)

    for word in answer_qwen:
        # Clean the model's words the same way
        cleaned_word = word.strip("'")
        answer_qwen_words.append(cleaned_word)

    # correct_answer_words and answer_qwen_words now hold the cleaned words
    print(correct_answer_words)
    print(answer_qwen_words)
    is_correct = correct_answer_words == answer_qwen_words
    if is_correct:
        n += 1
    a += 1
    print(a)
    acc_now = n / a
    print(f"当前Accuracy: {acc_now * 100:.2f}%")
    # time.sleep(30)  # pause 30 seconds after every request (disabled)
acc = n / num
print(f"Accuracy: {acc * 100:.2f}%")
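
One thing to be aware of in both scripts: `{choices}` is inserted into the prompt as a Python list, and the reply is compared against the full answer text. A possible variation, which is not what the scripts above do, is to label the options A to D and compare letters instead. The sketch below assumes the model has been told to reply with a single letter. The helper names are my own, and the letter extraction is a simple heuristic that can be fooled by replies that start with the article "A".

```python
import re

LETTERS = "ABCD"


def format_prompt(question, choices):
    """Build a prompt with the options labelled A-D, one per line."""
    lines = [f"Question: {question}", "Choices:"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_letter(reply):
    """Return the first standalone A-D letter in the reply, or None if there is none."""
    match = re.search(r"\b([ABCD])\b", reply.strip().upper())
    return match.group(1) if match else None
```

Scoring then becomes `extract_letter(reply) == LETTERS[answer]`, using the same 0 to 3 answer index from the dataset.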