OpenCompass evaluation notes — Tutorial/docs/L1/OpenCompass/readme.md (camp3, InternLM/Tutorial)
The role of evaluation
- Identify the domains a large model is good at
- Test domain-specific knowledge and establish a baseline
- Use a model to provide alignment supervision without human labeling
- Model selection and release gating
Challenges in evaluation
- Incomplete coverage: must adapt to many domains and different application scenarios
- Data contamination: massive training corpora inevitably overlap with test data
- Cost: human scoring requires labor fees; model-based scoring requires compute
- Robustness: sensitive to prompt wording; unstable across repeated sampling
What OpenCompass supports
OpenCompass evaluates models according to their type, applying different protocols to base models and chat models. On the data-I/O side, it can evaluate through an API, or by loading model weights (pth, hf) locally.
Objective vs. subjective evaluation
Objective evaluation only checks whether the model produces the correct answer; subjective evaluation requires a model or a human to score the output (e.g., asking GPT to rate two poems).
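As an illustration of the objective case, scoring a multiple-choice benchmark like C-Eval can be as simple as exact-match accuracy. This is a minimal sketch, not the OpenCompass API; the function name is invented here:

```python
# Sketch of objective scoring: exact-match accuracy over predicted
# option letters. Purely illustrative; not OpenCompass code.
def exact_match_accuracy(predictions, references):
    """Percentage of predictions that match the reference answer."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

print(round(exact_match_accuracy(["A", "c", "B"], ["A", "C", "D"]), 2))  # → 66.67
```

Subjective evaluation, by contrast, has no such mechanical check; the score comes from a judge model or a human rater.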
Long-context evaluation (needle in a haystack)
A statement unrelated to the surrounding text is inserted at a random position in a long document, testing whether the model can recall information buried in a long context. (Author's note: this seems to nudge the model toward attending to anomalous passages.)
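A toy version of the construction (illustrative only; OpenCompass's actual needle-in-a-haystack implementation differs) might look like:

```python
import random

def build_haystack(filler_sentences, needle, seed=0):
    """Insert an unrelated 'needle' sentence at a random position in a
    list of filler sentences; return the full text and the position."""
    rng = random.Random(seed)
    pos = rng.randrange(len(filler_sentences) + 1)
    sentences = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(sentences), pos

filler = ["The sky was clear that day."] * 1000
needle = "The secret passcode is 7421."
context, pos = build_haystack(filler, needle)
# The model is then prompted with `context` plus a retrieval question,
# e.g. "What is the secret passcode?", and judged on whether it recalls 7421.
```

Varying the insertion depth and the context length yields the familiar depth-vs-length heatmap used in such tests.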
The OpenCompass community
Provides leaderboards, a voting community, and evaluation tools.
Experiments
Overview
Evaluating a model in OpenCompass typically involves the following stages: configuration -> inference -> evaluation -> visualization.
- Configuration: the starting point of the whole workflow. You configure the evaluation run by choosing the models and datasets to evaluate; you can also pick the evaluation strategy and compute backend, and define how results are displayed.
- Inference & evaluation: OpenCompass runs inference and evaluation over the models and datasets in parallel. Inference has the model generate outputs from the dataset; evaluation measures how well those outputs match the reference answers. Both are split into many concurrently running "tasks" for efficiency.
- Visualization: once evaluation finishes, OpenCompass collects the results into a readable table and saves it as CSV and TXT files.
OpenCompass directory structure
- configs - evaluation config files
- data - evaluation corpora. Each subdirectory holds one benchmark dataset, containing questions, (options), a list of acceptable answers, and so on, in formats such as jsonl and csv. Worth browsing a few; there does not appear to be a unified schema, so each dataset needs its own adapter.
- opencompass - the framework's Python package (source code)
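To make the "each dataset needs its own adapter" point concrete, here is a hypothetical multiple-choice record in jsonl (the field names are invented for illustration; every real dataset uses its own schema) and a tiny adapter mapping it onto a common shape:

```python
import json

# A made-up jsonl record, not from any actual OpenCompass dataset:
raw_line = ('{"question": "1+1=?", "A": "1", "B": "2", "C": "3", "D": "4", '
            '"answer": "B"}')

def adapt_mcq(line):
    """Per-dataset adapter: map a dataset-specific jsonl record
    onto a common {prompt, options, reference} shape."""
    rec = json.loads(line)
    return {
        "prompt": rec["question"],
        "options": {k: rec[k] for k in "ABCD"},
        "reference": rec["answer"],
    }

sample = adapt_mcq(raw_line)
print(sample["reference"])  # → B
```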
Run the evaluation command
export MKL_SERVICE_FORCE_INTEL=1
# or:
export MKL_THREADING_LAYER=GNU
python run.py \
    --datasets ceval_gen \
    --models hf_internlm2_chat_1_8b \
    --debug
Here `--datasets` selects the benchmark to run and `--models` selects the model config to evaluate.
This command looks up the corresponding config file configs/models/hf_internlm/<model_name>.py, which specifies the tokenizer, model weights, batch size, sequence length, start tokens, and so on; it also looks for the dataset under data/<dataset_name> to run the test on.
The command above is equivalent to the following Python script:
from mmengine.config import read_base

with read_base():
    from .datasets.ceval.ceval_gen import ceval_datasets
    from .models.hf_internlm.hf_internlm2_chat_1_8b import models as hf_internlm2_chat_1_8b_models

datasets = ceval_datasets
models = hf_internlm2_chat_1_8b_models
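For reference, a model config such as configs/models/hf_internlm/hf_internlm2_chat_1_8b.py typically contains roughly the following. This is a sketch from memory, not the file's actual contents; field names and values can differ between OpenCompass versions:

```python
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='internlm2-chat-1.8b-hf',        # name shown in result tables
        path='internlm/internlm2-chat-1_8b',  # HF weights
        tokenizer_path='internlm/internlm2-chat-1_8b',
        max_seq_len=2048,                     # context length
        max_out_len=100,                      # generation budget
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```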
Results
dataset version metric mode internlm2-chat-1.8b-hf
---------------------------------------------- --------- ------------- ------ ------------------------
ceval-computer_network db9ce2 accuracy gen 36.84
ceval-operating_system 1c2571 accuracy gen 47.37
ceval-computer_architecture a74dad accuracy gen 19.05
ceval-college_programming 4ca32a accuracy gen 35.14
ceval-college_physics 963fa8 accuracy gen 31.58
ceval-college_chemistry e78857 accuracy gen 37.5
ceval-advanced_mathematics ce03e2 accuracy gen 26.32
ceval-probability_and_statistics 65e812 accuracy gen 50
ceval-discrete_mathematics e894ae accuracy gen 37.5
ceval-electrical_engineer ae42b9 accuracy gen 32.43
ceval-metrology_engineer ee34ea accuracy gen 58.33
ceval-high_school_mathematics 1dc5bf accuracy gen 11.11
ceval-high_school_physics adf25f accuracy gen 42.11
ceval-high_school_chemistry 2ed27f accuracy gen 52.63
ceval-high_school_biology 8e2b9a accuracy gen 31.58
ceval-middle_school_mathematics bee8d5 accuracy gen 26.32
ceval-middle_school_biology 86817c accuracy gen 76.19
ceval-middle_school_physics 8accf6 accuracy gen 57.89
ceval-middle_school_chemistry 167a15 accuracy gen 80
ceval-veterinary_medicine b4e08d accuracy gen 60.87
ceval-college_economics f3f4e6 accuracy gen 38.18
ceval-business_administration c1614e accuracy gen 33.33
ceval-marxism cf874c accuracy gen 73.68
ceval-mao_zedong_thought 51c7a4 accuracy gen 66.67
ceval-education_science 591fee accuracy gen 55.17
ceval-teacher_qualification 4e4ced accuracy gen 63.64
ceval-high_school_politics 5c0de2 accuracy gen 47.37
ceval-high_school_geography 865461 accuracy gen 47.37
ceval-middle_school_politics 5be3e7 accuracy gen 80.95
ceval-middle_school_geography 8a63be accuracy gen 83.33
ceval-modern_chinese_history fc01af accuracy gen 56.52
ceval-ideological_and_moral_cultivation a2aa4a accuracy gen 78.95
ceval-logic f5b022 accuracy gen 54.55
ceval-law a110a1 accuracy gen 29.17
ceval-chinese_language_and_literature 0f8b68 accuracy gen 39.13
ceval-art_studies 2a1300 accuracy gen 54.55
ceval-professional_tour_guide 4e673e accuracy gen 62.07
ceval-legal_professional ce8787 accuracy gen 52.17
ceval-high_school_chinese 315705 accuracy gen 42.11
ceval-high_school_history 7eb30a accuracy gen 65
ceval-middle_school_history 48ab4a accuracy gen 86.36
ceval-civil_servant 87d061 accuracy gen 46.81
ceval-sports_science 70f27b accuracy gen 47.37
ceval-plant_protection 8941f9 accuracy gen 54.55
ceval-basic_medicine c409d6 accuracy gen 73.68
ceval-clinical_medicine 49e82d accuracy gen 45.45
ceval-urban_and_rural_planner 95b885 accuracy gen 41.3
ceval-accountant 002837 accuracy gen 32.65
ceval-fire_engineer bc23f5 accuracy gen 32.26
ceval-environmental_impact_assessment_engineer c64e2d accuracy gen 41.94
ceval-tax_accountant 3a5e3c accuracy gen 36.73
ceval-physician 6e277d accuracy gen 38.78
ceval-stem - naive_average gen 42.54
ceval-social-science - naive_average gen 58.97
ceval-humanities - naive_average gen 56.42
ceval-other - naive_average gen 44.68
ceval-hard - naive_average gen 36.09
ceval - naive_average gen 49.09
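The summary rows (ceval-stem, ceval-hard, etc.) are "naive averages": unweighted means of the per-subject accuracies, ignoring how many questions each subject has. A minimal sketch with made-up scores:

```python
def naive_average(scores):
    """Unweighted mean, as in the ceval-* summary rows above."""
    scores = list(scores)
    return sum(scores) / len(scores)

# Three hypothetical per-subject accuracies:
print(naive_average([40.0, 60.0, 80.0]))  # → 60.0
```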
Judging by the community leaderboard (OpenCompass 司南 evaluation-set community), InternLM_1.8B performs roughly on par with Llama-30B. Evidently, parameter count and training quality together determine the final performance.