OpenCompass: Evaluating Large Language Models
Docs: https://github.com/InternLM/tutorial/blob/main/opencompass/opencompass_tutorial.md
Video: https://www.bilibili.com/video/BV1Gg4y1U7uc/
Repo: https://github.com/open-compass/opencompass
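Three questions about LLM evaluation
Why evaluate large language models
In the HF model evaluations, InternLM2-7B achieved the best results among models of the same parameter scale.
What to evaluate
How to evaluate large language models
Different types of models call for different evaluation schemes. For example, a base model needs extra instructions added to the prompt, while a chat model can be evaluated directly in a human-dialogue format.
By type, evaluation can be divided into objective and subjective evaluation. Even in objective evaluation, model responses can vary in form, so the answer can be extracted from the response with regular expressions.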
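As an illustration of regex-based answer extraction for multiple-choice questions, here is a minimal sketch. The function name `extract_choice` and the pattern are assumptions made for this example and are not taken from OpenCompass.

```python
import re
from typing import Optional


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter in a free-form response.

    Hypothetical helper for illustration. A letter counts as standalone
    when it is not directly adjacent to another ASCII letter.
    """
    pattern = rf"(?<![A-Za-z])([{re.escape(options)}])(?![A-Za-z])"
    match = re.search(pattern, response)
    return match.group(1) if match else None


print(extract_choice("The answer is B."))
print(extract_choice("(D) because the other options are wrong"))
print(extract_choice("not sure"))
```

A pattern this simple has obvious limits: a response beginning with the article "A" would be read as option A, so a real evaluation would likely need a stricter, dataset-specific pattern.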
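Subjective evaluation can rely on human raters to compare the outputs of different models, or use a model (such as GPT-4) for automated scoring. There are also models built specifically for subjective evaluation, such as JudgeLM.
Prompt engineering can be used to measure how sensitive a model is to its prompt: ask the same question with different prompts and expect the model to answer correctly every time. If some variants produce wrong answers, the model is relatively sensitive to the prompt.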
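The prompt-sensitivity check described above can be sketched as follows. Everything here is hypothetical: `toy_model` stands in for a real model call, and `prompt_accuracy` is a name chosen for this example only.

```python
from typing import Callable, Sequence


def prompt_accuracy(model: Callable[[str], str],
                    prompts: Sequence[str],
                    expected: str) -> float:
    """Fraction of prompt variants for which the model returns the expected answer."""
    if not prompts:
        raise ValueError("at least one prompt variant is required")
    hits = sum(model(p).strip() == expected for p in prompts)
    return hits / len(prompts)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a model: it only answers correctly
    # when the prompt contains the word "capital".
    return "Paris" if "capital" in prompt.lower() else "unknown"


variants = [
    "What is the capital of France?",
    "France's capital city is?",
    "Which city hosts the French government?",
]
score = prompt_accuracy(toy_model, variants, "Paris")
print(f"{score:.2f}")
```

A score well below 1.0 across paraphrases of the same question suggests the model is sensitive to prompt wording.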
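LLM evaluation frameworks
OpenCompass leaderboard: https://opencompass.org.cn/leaderboard-llm
Multimodal leaderboard: https://opencompass.org.cn/leaderboard-multimodal
Many challenges remain. For data contamination, for example, dedicated tools have been developed that can test whether a given model is contaminated on a given dataset.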
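Hands-on practice
Evaluating a model in OpenCompass usually involves the following stages: configuration -> inference -> evaluation -> visualization.
Configuration: this is the starting point of the workflow. You configure the whole evaluation process and choose the models and datasets to evaluate. You can also choose the evaluation strategy and compute backend, and define how results are displayed.
Inference and evaluation: in this stage, OpenCompass runs inference and evaluation on the models and datasets in parallel. Inference has the model produce outputs from the datasets, and evaluation measures how well those outputs match the reference answers. Both are split into multiple "tasks" that run concurrently to improve efficiency, but note that when compute resources are limited this strategy may make the evaluation slower.
Visualization: once evaluation finishes, OpenCompass organizes the results into readable tables and saves them as CSV and TXT files. You can also enable Lark (Feishu) status reporting to receive timely evaluation status reports in the Lark client.
The rest of this section demonstrates the basic usage of OpenCompass by evaluating InternLM on the C-Eval benchmark.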
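C-Eval introduction
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It contains 13,948 multiple-choice questions covering 52 different disciplines and four difficulty levels. Website: https://cevalbenchmark.com/index_zh.html
Dataset examples: https://cevalbenchmark.com/static/explore.html
Listing supported datasets and models
# List all configs related to internlm and ceval
python tools/list_configs.py internlm ceval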
+--------------------------+--------------------------------------------------------+
| Model | Config Path |
|--------------------------+--------------------------------------------------------|
| hf_internlm_20b | configs/models/hf_internlm/hf_internlm_20b.py |
| hf_internlm_7b | configs/models/hf_internlm/hf_internlm_7b.py |
| hf_internlm_chat_20b | configs/models/hf_internlm/hf_internlm_chat_20b.py |
| hf_internlm_chat_7b | configs/models/hf_internlm/hf_internlm_chat_7b.py |
| hf_internlm_chat_7b_8k | configs/models/hf_internlm/hf_internlm_chat_7b_8k.py |
| hf_internlm_chat_7b_v1_1 | configs/models/hf_internlm/hf_internlm_chat_7b_v1_1.py |
| internlm_7b | configs/models/internlm/internlm_7b.py |
| ms_internlm_chat_7b_8k | configs/models/ms_internlm/ms_internlm_chat_7b_8k.py |
+--------------------------+--------------------------------------------------------+
+----------------------------+------------------------------------------------------+
| Dataset | Config Path |
|----------------------------+------------------------------------------------------|
| ceval_clean_ppl | configs/datasets/ceval/ceval_clean_ppl.py |
| ceval_gen | configs/datasets/ceval/ceval_gen.py |
| ceval_gen_2daf24 | configs/datasets/ceval/ceval_gen_2daf24.py |
| ceval_gen_5f30c7 | configs/datasets/ceval/ceval_gen_5f30c7.py |
| ceval_ppl | configs/datasets/ceval/ceval_ppl.py |
| ceval_ppl_578f8d | configs/datasets/ceval/ceval_ppl_578f8d.py |
| ceval_ppl_93e5ce | configs/datasets/ceval/ceval_ppl_93e5ce.py |
| ceval_zero_shot_gen_bd40ef | configs/datasets/ceval/ceval_zero_shot_gen_bd40ef.py |
+----------------------------+------------------------------------------------------+
Launching the evaluation
The following command evaluates the InternLM-Chat-7B model on the C-Eval dataset:
python run.py --datasets ceval_gen --hf-path /share/temp/model_repos/internlm-chat-7b/ --tokenizer-path /share/temp/model_repos/internlm-chat-7b/ --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True --model-kwargs trust_remote_code=True device_map='auto' --max-seq-len 2048 --max-out-len 16 --batch-size 4 --num-gpus 1 --debug
Command breakdown
--datasets ceval_gen \
--hf-path /share/temp/model_repos/internlm-chat-7b/ \ # HuggingFace model path
--tokenizer-path /share/temp/model_repos/internlm-chat-7b/ \ # HuggingFace tokenizer path (can be omitted if it is the same as the model path)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments for building the tokenizer
--model-kwargs device_map='auto' trust_remote_code=True \ # Arguments for building the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 16 \ # Maximum number of tokens to generate
--batch-size 4 \ # Batch size
--num-gpus 1 \ # Number of GPUs required to run the model
--debug
When the run finishes, you will see:
dataset version metric mode opencompass.models.huggingface.HuggingFace_model_repos_internlm-chat-7b
---------------------------------------------- --------- ------------- ------ -------------------------------------------------------------------------
ceval-computer_network db9ce2 accuracy gen 31.58
ceval-operating_system 1c2571 accuracy gen 36.84
ceval-computer_architecture a74dad accuracy gen 28.57
ceval-college_programming 4ca32a accuracy gen 32.43
ceval-college_physics 963fa8 accuracy gen 26.32
ceval-college_chemistry e78857 accuracy gen 16.67
ceval-advanced_mathematics ce03e2 accuracy gen 21.05
ceval-probability_and_statistics 65e812 accuracy gen 38.89
ceval-discrete_mathematics e894ae accuracy gen 18.75
ceval-electrical_engineer ae42b9 accuracy gen 35.14
ceval-metrology_engineer ee34ea accuracy gen 50
ceval-high_school_mathematics 1dc5bf accuracy gen 22.22
ceval-high_school_physics adf25f accuracy gen 31.58
ceval-high_school_chemistry 2ed27f accuracy gen 15.79
ceval-high_school_biology 8e2b9a accuracy gen 36.84
ceval-middle_school_mathematics bee8d5 accuracy gen 26.32
ceval-middle_school_biology 86817c accuracy gen 61.9
ceval-middle_school_physics 8accf6 accuracy gen 63.16
ceval-middle_school_chemistry 167a15 accuracy gen 60
ceval-veterinary_medicine b4e08d accuracy gen 47.83
ceval-college_economics f3f4e6 accuracy gen 41.82
ceval-business_administration c1614e accuracy gen 33.33
ceval-marxism cf874c accuracy gen 68.42
ceval-mao_zedong_thought 51c7a4 accuracy gen 70.83
ceval-education_science 591fee accuracy gen 58.62
ceval-teacher_qualification 4e4ced accuracy gen 70.45
ceval-high_school_politics 5c0de2 accuracy gen 26.32
ceval-high_school_geography 865461 accuracy gen 47.37
ceval-middle_school_politics 5be3e7 accuracy gen 52.38
ceval-middle_school_geography 8a63be accuracy gen 58.33
ceval-modern_chinese_history fc01af accuracy gen 73.91
ceval-ideological_and_moral_cultivation a2aa4a accuracy gen 63.16
ceval-logic f5b022 accuracy gen 31.82
ceval-law a110a1 accuracy gen 25
ceval-chinese_language_and_literature 0f8b68 accuracy gen 30.43
ceval-art_studies 2a1300 accuracy gen 60.61
ceval-professional_tour_guide 4e673e accuracy gen 62.07
ceval-legal_professional ce8787 accuracy gen 39.13
ceval-high_school_chinese 315705 accuracy gen 63.16
ceval-high_school_history 7eb30a accuracy gen 70
ceval-middle_school_history 48ab4a accuracy gen 59.09
ceval-civil_servant 87d061 accuracy gen 53.19
ceval-sports_science 70f27b accuracy gen 52.63
ceval-plant_protection 8941f9 accuracy gen 59.09
ceval-basic_medicine c409d6 accuracy gen 47.37
ceval-clinical_medicine 49e82d accuracy gen 40.91
ceval-urban_and_rural_planner 95b885 accuracy gen 45.65
ceval-accountant 002837 accuracy gen 26.53
ceval-fire_engineer bc23f5 accuracy gen 22.58
ceval-environmental_impact_assessment_engineer c64e2d accuracy gen 64.52
ceval-tax_accountant 3a5e3c accuracy gen 34.69
ceval-physician 6e277d accuracy gen 40.82
ceval-stem - naive_average gen 35.09
ceval-social-science - naive_average gen 52.79
ceval-humanities - naive_average gen 52.58
ceval-other - naive_average gen 44.36
ceval-hard - naive_average gen 23.91
ceval - naive_average gen 44.16
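The `naive_average` rows are unweighted means of the per-subject accuracies. As a quick check, the sketch below recomputes the `ceval-hard` row from the table above. The grouping of eight subjects is an assumption based on C-Eval's published "hard" subset and is not stated in the output itself.

```python
# Per-subject accuracies copied from the results table above.
# Assumption: ceval-hard consists of these eight subjects.
hard_scores = {
    "advanced_mathematics": 21.05,
    "discrete_mathematics": 18.75,
    "probability_and_statistics": 38.89,
    "college_chemistry": 16.67,
    "college_physics": 26.32,
    "high_school_mathematics": 22.22,
    "high_school_chemistry": 15.79,
    "high_school_physics": 31.58,
}


def naive_average(scores: dict) -> float:
    """Unweighted mean: every subject counts equally, regardless of question count."""
    return sum(scores.values()) / len(scores)


print(round(naive_average(hard_scores), 2))  # 23.91, matching the ceval-hard row
```

Because the mean is unweighted, a subject with few questions influences the aggregate as much as one with many.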
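However, when I ran this experiment, some tasks failed and produced no evaluation results.
Besides configuring an experiment on the command line, OpenCompass also lets users write the complete experiment configuration in a config file and run it directly with run.py. Config files are organized in Python format and must include the datasets and models fields. Examples: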
python run.py configs/eval_demo.py
python run.py --models hf_llama_7b --datasets base_medium
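For orientation, a config file of the kind passed to run.py might look like the sketch below. The module paths and variable names are assumptions, recalled from the demo config shipped with OpenCompass around the time of this tutorial, and may differ in other versions. The only hard requirement stated above is that the file defines `datasets` and `models`.

```python
# Sketch of a config such as configs/eval_demo.py (paths below are assumptions).
from mmengine.config import read_base

with read_base():
    # Reuse dataset and model definitions from other config files.
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

# The two required fields.
datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```

This fragment is meant to be loaded by run.py rather than executed on its own.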
Visualizing evaluation results
After evaluation completes, the results table is printed as follows:
dataset version metric mode opt350m opt125m
--------- --------- -------- ------ --------- ---------
siqa e78df3 accuracy gen 21.55 12.44
winograd b6c7ed accuracy ppl 51.23 49.82
All run outputs are written to the work directory, outputs/default/ in the listing below, with the following structure:
outputs/default/
├── 20200220_120000
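├── 20230220_183030 # One folder per experiment
│ ├── configs # Dumped config files kept for the record. If different experiments are re-run in the same experiment folder, multiple configs may be kept
│ ├── logs # Log files for the inference and evaluation stages
│ │ ├── eval
│ │ └── infer
│ ├── predictions # Inference results for each task
│ ├── results # Evaluation results for each task
│ └── summary # Summarized evaluation results for a single experiment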
├── ...
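Since each experiment folder is named with a `YYYYMMDD_HHMMSS` timestamp, the most recent run can be found by sorting folder names. The helper below is a hypothetical convenience sketch, not part of OpenCompass, and assumes the layout shown above.

```python
import re
from pathlib import Path
from typing import Optional

# Folder names such as 20230220_183030 (assumed format: YYYYMMDD_HHMMSS).
TIMESTAMP = re.compile(r"^\d{8}_\d{6}$")


def latest_experiment(work_dir: str) -> Optional[Path]:
    """Return the newest timestamped experiment folder in work_dir, or None.

    Fixed-width timestamps sort chronologically when sorted as strings.
    """
    runs = sorted(
        p for p in Path(work_dir).iterdir()
        if p.is_dir() and TIMESTAMP.match(p.name)
    )
    return runs[-1] if runs else None
```

For example, `latest_experiment("outputs/default")` would return the path of the newest run, whose `summary` subfolder holds the aggregated results.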