实现一个简洁的代码模型评测框架（以Qwen2.5-coder Humaneval为例）

本文链接：https://blog.csdn.net/weixin_43013480/article/details/142364196

代码大模型在评测时主要用到的指标就是pass@1了，特别是在Humaneval测试集上。各个模型在其发布时也都给出了指标，使用的一些开源代码评测框架感觉都比较复杂，所以准备做一个简洁的评测框架，下面我们进行全部代码流程的构建。

整体框架可以分四部分，Generate、Execution、Evaluation、base_utils模块。构建完四部分后进行组合即可完成一个简洁的代码模型评测框架。为了便于扩展更多的评测集，所以本框架中每个任务都新建了一个py文件，文件中的类继承自base_utils中的TaskUtils基类，根据不同评测集的不同方法进行前处理与后处理。

代码模型评测框架简介

框架已上传至github：代码评测框架。

快速开始

通过bash文件可以快速开始：

MODELS_PATH="/qwen"
LOGS_PATH="./logs.jsonl"
OUT_PATH='./out.jsonl'
METRIC_PATH='./metric.json'
DATA_FILE='./dataset/humaneval_python.jsonl'


CUDA_VISIBLE_DEVICES=0 python main.py \
    --model_name_or_path "$MODELS_PATH" \
    --task_name "humaneval" \
    --save_logs_path "$LOGS_PATH" \
    --output_path "$OUT_PATH" \
    --do_sample false \
    --top_p 0.95 \
    --max_new_tokens 1024 \
    --evaluate_only false \
    --torch_dtype "bf16" \
    --save_metrics_path $METRIC_PATH \
    --data_file $DATA_FILE

评测生成文件

框架保存了评测具体结果以供分析，模型评测完成后主要生成三个文件：

1、out.jsonl: 模型输出，在评测数据的基础上新增字段如下：

output:模型接收prompt后产生的原本输出
generation: 经过提取代码部分及添加测试后的输出

2、logs.jsonl: 评测测试用例运行的信息

3、metric.json: 评测结果指标

新增评测

如若想要新增评测任务，可以继承base_utils中的类进行相关设置，然后在task文件夹下创建相关文件进行继承。最后在main.py文件的TASKS中添加即可。

框架构建

下面我们开始对框架进行从零的代码实现：

1、base_utils

首先是base_utils模块，该模块主要是为了便于后续添加不同任务，所以构建这样一个utils基类。每次添加一个评测集都需要继承此类，并重写类的方法以适配当前的评测。
基类的代码如下所示，主要就是三个函数。基本上不同的评测集的区别就在于如何构建prompt、如何提取、如何评测。

class TaskUtils:
    def __init__(self):
        self.IMPORT_HELPER = {
            "python": [
                "import math",
                "import re",
                "import sys",
                "import copy",
                "import datetime",
                "import itertools",
                "import collections",
                "import heapq",
                "import functools",
                "import hashlib",
                "import numpy",
                "import numpy as np",
                "import string",
                "from typing import *",
                "from collections import *",
            ]
        }

    @staticmethod
    def build_instruction(example):
        """
        根据模型构建合适的指令
        """
        return example['prompt']

    @staticmethod
    def generation_code_process(example):
        """
        对生成的代码提取函数部分 及 设置import、添加test用例等操作
        """
        pass

    @staticmethod
    def evaluate_function(input_file, args):
        """
        最终评测的方法，输入为保存的生成jsonl文件
        """
        pass

2、Generate

第二部分是模型Generate部分的构建，本部分的主要内容就是构建一些模型接收输入后输出部分，比较简单，可以采用常规的model.generate形式实现，并且有了第一部分我们构建好的三个函数，就可以直接应用到Generate部分：

主要代码如下所示

def generate_one(example, tokenizer, model, args, task):
    prompt = task.build_instruction(example)
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

    stop_id = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.convert_tokens_to_ids(
        "<|EOT|>")
    assert isinstance(stop_id, int), "Invalid tokenizer, EOT id not found"

    outputs = model.generate(
        inputs,
        max_new_tokens=args.max_new_tokens,
        do_sample=args.do_sample,
        top_p=args.top_p,
        temperature=args.temperature,
        pad_token_id=stop_id,
        eos_token_id=stop_id
    )

    output = tokenizer.decode(outputs[0][:], skip_special_tokens=True)
    example['output'] = output

    return task.generation_code_process(example)

3、Evaluate

第三部分就是构建evaluate，该部分主要是对模型生成的结果进行一个评分，根据不同任务的不同计算方式，可以进行评分。代码模型中最常见的评分指标就是pass@1了，下面是其代码示例：由于pass@1是需要代码运行的，所以我们要构建一个execution模块来运行生成的代码，此部分我们放到第四部分进行构建

def evaluate_functional_correctness(
        input_file: str = None,
        n_workers: int = 32,
        timeout: float = 3.0,
        k: int = 1,
        save_logs_path='./logs.jsonl'
):
    """
    Evaluates the functional correctness of a model.
    """
    sample_jsonl = stream_jsonl_all(input_file)

    with ThreadPoolExecutor(max_workers=n_workers) as executor:

        futures = []
        n_samples = 0
        results = defaultdict(list)

        print("Reading samples...")
        for sample in tqdm(sample_jsonl):
            task_id = sample["task_id"]
            if sample["generation"] is None:
                continue
            args = (sample['generation'], task_id, timeout)
            future = executor.submit(check_correctness, *args)
            futures.append(future)
            n_samples += 1

        print("Running test suites...")
        for future in tqdm(as_completed(futures), total=len(futures)):
            result = future.result()
            results[result["task_id"]].append(result)
    # Calculate pass@k.
    total, correct, logs = [], [], []
    for result in results.values():
        passed = [r["passed"] for r in result]
        res = [{r['task_id']: r["result"]} for r in result]
        logs.append(res)
        total.append(len(passed))
        correct.append(sum(passed))
    total = np.array(total)
    correct = np.array(correct)

    pass_at_k = {f"pass@{k}": estimate_pass_at_k(total, correct, k).mean()}

    with open(save_logs_path, 'w', encoding='utf-8') as fw:
        for ex in logs:
            fw.write(json.dumps(ex) + '\n')
        print(f"execute logs were saved at {save_logs_path}")

    return pass_at_k


def estimate_pass_at_k(
        num_samples,
        num_correct,
        k: int
) -> np.ndarray:
    """
    Estimates pass@k and returns them in an array.
    """

    def estimator(n: int, c: int, k: int) -> float:
        """
        Calculates 1 - comb(n - c, k) / comb(n, k).
        """
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    assert len(num_samples) == len(num_correct)
    num_samples_it = iter(num_samples)

    return np.array([estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)])

4、Execution

此部分比较常规，各个开源的评测框架中基本都是一样的，大概是200多行代码运行代码。感兴趣的可以去框架中的execution.py进行查看,下面我就贴一个其中比较有代表性的函数：通过multiprocessing对代码进行运行，最终返回通过的数量及运行结果

def check_correctness(check_program, task_id, timeout=3):
    manager = multiprocessing.Manager()
    result = manager.list()

    p = multiprocessing.Process(target=unsafe_execute, args=(check_program, result, timeout))
    p.start()
    p.join(timeout=timeout + 1)
    if p.is_alive():
        p.kill()

    if not result:
        result.append("timed out")

    return {
        "task_id": task_id,
        "passed": result[0] == "passed",
        "result": result[0]
            }

评测Qwen2.5-coder Humaneval

接下来通过本框架评测一下Qwen2.5-coder-7B在Humaneval上的表现，由于输入是简单的Humaneval prompt直接输入，并没有对模型做一些适配工作，且代码提取规则也比较少，所以最终的评测结果为：80.5。
与官方给出的86差了一些，大概是三四道题的差距吧，应该是prompt设置与输出代码提取的问题。
下面我们看一下框架的输出文件：

首先是metrics文件，保存了相关参数及最终得分。在这里插入图片描述