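Basic task (completing this task clears the level)
Use OpenCompass to evaluate the internlm2-chat-1.8b model on the ceval dataset, document the reproduction process, and take screenshots.
Environment setup
Create the dev machine and conda environment
On the dev machine creation page, select the Cuda11.7-conda image and 10% A100 as the GPU.
Installation: GPU environment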
conda create -n opencompass python=3.10
conda activate opencompass
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Note: be sure to cd into /root first
cd /root
git clone -b 0.2.4 https://github.com/open-compass/opencompass
cd opencompass
pip install -e .
apt-get update
apt-get install cmake
pip install -r requirements.txt
pip install protobuf
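Before moving on, it can help to confirm that the installation succeeded. The following is a minimal sketch, not part of the original tutorial: `check_packages` is a hypothetical helper that only checks whether the named packages are importable in the active conda environment, and it reports CUDA availability only when torch is present.

```python
import importlib.util


def check_packages(names):
    """Return {package_name: bool} telling whether each package is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_packages(["torch", "opencompass"])
    for name, ok in status.items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
    if status["torch"]:
        # Imported lazily so the script still runs when torch is absent.
        import torch
        print("CUDA available:", torch.cuda.is_available())
```

If either package is reported as missing, re-run the corresponding install step inside the opencompass environment.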
Data preparation
Evaluation datasets
Extract the evaluation datasets into /root/opencompass/data/. (Note: when cloning opencompass above, be sure to clone it under /root.)
cp /share/temp/datasets/OpenCompassData-core-20231110.zip /root/opencompass/
unzip OpenCompassData-core-20231110.zip
A data folder will then appear under the opencompass directory.
InternLM and C-Eval configuration files
List all configurations related to InternLM and C-Eval:
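To double-check the extraction, a small sketch like the one below lists the sub-folders of the data directory. `list_dataset_dirs` is a hypothetical helper introduced for illustration; the path is the one used in this tutorial, and the exact folder names inside the archive are not assumed here.

```python
from pathlib import Path


def list_dataset_dirs(data_root):
    """Return the sorted names of sub-directories under data_root ([] if missing)."""
    root = Path(data_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    found = list_dataset_dirs("/root/opencompass/data")
    print(found if found else "data folder is missing or empty")
```

An empty result usually means the archive was unzipped somewhere other than /root/opencompass.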
python tools/list_configs.py internlm ceval
You will see:
+----------------------------------------+----------------------------------------------------------------------+
| Model | Config Path |
|----------------------------------------+----------------------------------------------------------------------|
| hf_internlm2_1_8b | configs/models/hf_internlm/hf_internlm2_1_8b.py |
| hf_internlm2_20b | configs/models/hf_internlm/hf_internlm2_20b.py |
| hf_internlm2_7b | configs/models/hf_internlm/hf_internlm2_7b.py |
| hf_internlm2_base_20b | configs/models/hf_internlm/hf_internlm2_base_20b.py |
| hf_internlm2_base_7b | configs/models/hf_internlm/hf_internlm2_base_7b.py |
| hf_internlm2_chat_1_8b | configs/models/hf_internlm/hf_internlm2_chat_1_8b.py |
| hf_internlm2_chat_1_8b_sft | configs/models/hf_internlm/hf_internlm2_chat_1_8b_sft.py |
| hf_internlm2_chat_20b | configs/models/hf_internlm/hf_internlm2_chat_20b.py |
| hf_internlm2_chat_20b_sft | configs/models/hf_internlm/hf_internlm2_chat_20b_sft.py |
| hf_internlm2_chat_20b_with_system | configs/models/hf_internlm/hf_internlm2_chat_20b_with_system.py |
| hf_internlm2_chat_7b | configs/models/hf_internlm/hf_internlm2_chat_7b.py |
| hf_internlm2_chat_7b_sft | configs/models/hf_internlm/hf_internlm2_chat_7b_sft.py |
| hf_internlm2_chat_7b_with_system | configs/models/hf_internlm/hf_internlm2_chat_7b_with_system.py |
| hf_internlm2_chat_math_20b | configs/models/hf_internlm/hf_internlm2_chat_math_20b.py |
| hf_internlm2_chat_math_20b_with_system | configs/models/hf_internlm/hf_internlm2_chat_math_20b_with_system.py |
| hf_internlm2_chat_math_7b | configs/models/hf_internlm/hf_internlm2_chat_math_7b.py |
| hf_internlm2_chat_math_7b_with_system | configs/models/hf_internlm/hf_internlm2_chat_math_7b_with_system.py |
| hf_internlm_20b | configs/models/hf_internlm/hf_internlm_20b.py |
| hf_internlm_7b | configs/models/hf_internlm/hf_internlm_7b.py |
| hf_internlm_chat_20b | configs/models/hf_internlm/hf_internlm_chat_20b.py |
| hf_internlm_chat_7b | configs/models/hf_internlm/hf_internlm_chat_7b.py |
| hf_internlm_chat_7b_8k | configs/models/hf_internlm/hf_internlm_chat_7b_8k.py |
| hf_internlm_chat_7b_v1_1 | configs/models/hf_internlm/hf_internlm_chat_7b_v1_1.py |
| internlm_7b | configs/models/internlm/internlm_7b.py |
| lmdeploy_internlm2_chat_20b | configs/models/hf_internlm/lmdeploy_internlm2_chat_20b.py |
| lmdeploy_internlm2_chat_7b | configs/models/hf_internlm/lmdeploy_internlm2_chat_7b.py |
| ms_internlm_chat_7b_8k | configs/models/ms_internlm/ms_internlm_chat_7b_8k.py |
+----------------------------------------+----------------------------------------------------------------------+
The dataset configurations are listed in a second table:
+-------------------------------+-----------------------------------------------------------------+
| Dataset | Config Path |
|-------------------------------+-----------------------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_c13365 | configs/datasets/cmmlu/cmmlu_gen_c13365.py |
| cmmlu_ppl | configs/datasets/cmmlu/cmmlu_ppl.py |
| cmmlu_ppl_041cbf | configs/datasets/cmmlu/cmmlu_ppl_041cbf.py |
| cmmlu_ppl_8b9c76 | configs/datasets/cmmlu/cmmlu_ppl_8b9c76.py |
| mmlu_clean_ppl | configs/datasets/mmlu/mmlu_clean_ppl.py |
| mmlu_contamination_ppl_810ec6 | configs/datasets/contamination/mmlu_contamination_ppl_810ec6.py |
| mmlu_gen | configs/datasets/mmlu/mmlu_gen.py |
| mmlu_gen_23a9a9 | configs/datasets/mmlu/mmlu_gen_23a9a9.py |
| mmlu_gen_4d595a | configs/datasets/mmlu/mmlu_gen_4d595a.py |
| mmlu_gen_5d1409 | configs/datasets/mmlu/mmlu_gen_5d1409.py |
| mmlu_gen_79e572 | configs/datasets/mmlu/mmlu_gen_79e572.py |
| mmlu_gen_a484b3 | configs/datasets/mmlu/mmlu_gen_a484b3.py |
| mmlu_ppl | configs/datasets/mmlu/mmlu_ppl.py |
| mmlu_ppl_ac766d | configs/datasets/mmlu/mmlu_ppl_ac766d.py |
| mmlu_zero_shot_gen_47e2c0 | configs/datasets/mmlu/mmlu_zero_shot_gen_47e2c0.py |
+-------------------------------+-----------------------------------------------------------------+
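The listing above is essentially a keyword search over the config files. The sketch below illustrates the idea with a hypothetical helper, `find_configs`; it is not the implementation of tools/list_configs.py, only an assumption-level approximation that matches keywords against file names under a config folder.

```python
from pathlib import Path


def find_configs(config_root, keywords):
    """Return sorted (name, path) pairs for .py files whose name contains any keyword."""
    hits = []
    for path in Path(config_root).rglob("*.py"):
        if any(keyword in path.stem for keyword in keywords):
            hits.append((path.stem, path.as_posix()))
    return sorted(hits)


if __name__ == "__main__":
    for name, path in find_configs("configs", ["internlm", "ceval"]):
        print(f"{name:45s} {path}")
```

Run it from /root/opencompass so that the relative configs path resolves.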
Launch the evaluation (10% A100, 8 GB)
Evaluate using command-line arguments
Open hf_internlm2_chat_1_8b.py under configs/models/hf_internlm/ in the opencompass folder and paste in the following code:
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='internlm2-1.8b-hf',
        path="/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b",
        tokenizer_path='/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b',
        model_kwargs=dict(
            trust_remote_code=True,
            device_map='auto',
        ),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
            trust_remote_code=True,
        ),
        max_out_len=100,
        min_out_len=1,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
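The tokenizer settings padding_side='left' and truncation_side='left' control which end of a prompt is padded or cut. The sketch below imitates that behaviour on plain lists of token ids; `left_truncate_and_pad` is a hypothetical illustration, not the Hugging Face tokenizer code, and pad id 0 is an arbitrary assumption.

```python
def left_truncate_and_pad(token_ids, max_len, pad_id=0):
    """Keep the last max_len tokens, then pad on the left up to max_len."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    # truncation_side='left': drop tokens from the start, keep the tail.
    kept = token_ids[-max_len:]
    # padding_side='left': pad tokens go in front of the real tokens.
    return [pad_id] * (max_len - len(kept)) + kept


if __name__ == "__main__":
    print(left_truncate_and_pad([1, 2, 3, 4, 5], 3))
    print(left_truncate_and_pad([7, 8], 4))
```

Left padding keeps the end of every prompt aligned at the right edge of the batch, so generated tokens follow the prompt directly, which is generally what a decoder-only model needs for batched generation.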
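Once OpenCompass is installed and the datasets are prepared as described above, the following command evaluates the InternLM2-Chat-1.8B model on the C-Eval dataset. Because OpenCompass launches evaluation tasks in parallel by default, it is worth starting the first run in --debug mode to check for problems. In --debug mode, tasks are executed sequentially and their output is printed in real time.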
# Environment variable configuration
export MKL_SERVICE_FORCE_INTEL=1
# or
export MKL_THREADING_LAYER=GNU
python run.py --datasets ceval_gen --models hf_internlm2_chat_1_8b --debug
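The same launch can be scripted from Python. This is a minimal sketch under stated assumptions: `build_eval_command` and `build_env` are hypothetical helpers, and they use only the flags and the environment variable shown above.

```python
import os


def build_eval_command(datasets, models, debug=True):
    """Assemble the run.py argument list used in this tutorial."""
    cmd = ["python", "run.py", "--datasets", datasets, "--models", models]
    if debug:
        cmd.append("--debug")
    return cmd


def build_env(base_env=None):
    """Copy the environment and set the MKL variable from the step above."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_SERVICE_FORCE_INTEL"] = "1"
    return env


if __name__ == "__main__":
    cmd = build_eval_command("ceval_gen", "hf_internlm2_chat_1_8b")
    print(" ".join(cmd))
    # To actually launch the evaluation (run from the repository root):
    #   import subprocess
    #   subprocess.run(cmd, cwd="/root/opencompass", env=build_env(), check=True)
```

The subprocess call is left commented out so the script can be inspected safely before starting a long evaluation.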
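Command breakdown (the trailing comments are annotations only; run the one-line form above):
python run.py \
--datasets ceval_gen \ # dataset configuration
--models hf_internlm2_chat_1_8b \ # model configuration
--debug
If everything is working, you should see the following on screen: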
[2024-08-09 16:48:07,016] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
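After the evaluation finishes, a summary like the one below is printed. Note that this sample log lists MMLU subsets (lukaemon_mmlu_*), which matches the mmlu_gen command analyzed after it; a ceval_gen run reports C-Eval subsets instead.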
07/22 22:45:02 - OpenCompass - INFO - Partitioned into 57 tasks.
07/22 22:45:04 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_college_biology]: {'accuracy': 51.388888888888886}
07/22 22:45:05 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_college_chemistry]: {'accuracy': 34.0}
07/22 22:45:07 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_college_computer_science]: {'accuracy': 41.0}
07/22 22:45:08 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_college_mathematics]: {'accuracy': 32.0}
07/22 22:45:10 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_college_physics]: {'accuracy': 29.411764705882355}
07/22 22:45:11 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_electrical_engineering]: {'accuracy': 44.13793103448276}
07/22 22:45:12 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_astronomy]: {'accuracy': 48.026315789473685}
07/22 22:45:14 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_anatomy]: {'accuracy': 45.925925925925924}
07/22 22:45:15 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_abstract_algebra]: {'accuracy': 31.0}
07/22 22:45:16 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_machine_learning]: {'accuracy': 32.142857142857146}
07/22 22:45:18 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_clinical_knowledge]: {'accuracy': 51.320754716981135}
07/22 22:45:19 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_global_facts]: {'accuracy': 24.0}
07/22 22:45:20 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_management]: {'accuracy': 62.13592233009708}
07/22 22:45:22 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_nutrition]: {'accuracy': 48.36601307189542}
07/22 22:45:23 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_marketing]: {'accuracy': 65.8119658119658}
07/22 22:45:25 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_professional_accounting]: {'accuracy': 35.1063829787234}
07/22 22:45:26 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_geography]: {'accuracy': 56.060606060606055}
07/22 22:45:27 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_international_law]: {'accuracy': 49.586776859504134}
07/22 22:45:29 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_moral_scenarios]: {'accuracy': 24.46927374301676}
07/22 22:45:30 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_computer_security]: {'accuracy': 63.0}
07/22 22:45:32 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_microeconomics]: {'accuracy': 48.319327731092436}
07/22 22:45:33 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_professional_law]: {'accuracy': 31.095176010430247}
07/22 22:45:35 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_medical_genetics]: {'accuracy': 54.0}
07/22 22:45:36 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_professional_psychology]: {'accuracy': 42.48366013071895}
07/22 22:45:37 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_jurisprudence]: {'accuracy': 50.0}
07/22 22:45:39 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_world_religions]: {'accuracy': 60.81871345029239}
07/22 22:45:40 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_philosophy]: {'accuracy': 49.19614147909968}
07/22 22:45:41 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_virology]: {'accuracy': 37.34939759036144}
07/22 22:45:43 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_chemistry]: {'accuracy': 35.960591133004925}
07/22 22:45:44 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_public_relations]: {'accuracy': 53.63636363636364}
07/22 22:45:46 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_macroeconomics]: {'accuracy': 45.64102564102564}
07/22 22:45:47 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_human_sexuality]: {'accuracy': 54.19847328244275}
07/22 22:45:48 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_elementary_mathematics]: {'accuracy': 29.894179894179896}
07/22 22:45:50 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_physics]: {'accuracy': 34.437086092715234}
07/22 22:45:51 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_computer_science]: {'accuracy': 38.0}
07/22 22:45:52 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_european_history]: {'accuracy': 58.18181818181818}
07/22 22:45:54 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_business_ethics]: {'accuracy': 42.0}
07/22 22:45:55 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_moral_disputes]: {'accuracy': 43.641618497109825}
07/22 22:45:57 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_statistics]: {'accuracy': 40.27777777777778}
07/22 22:45:58 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_miscellaneous]: {'accuracy': 55.172413793103445}
07/22 22:45:59 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_formal_logic]: {'accuracy': 26.984126984126984}
07/22 22:46:01 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_government_and_politics]: {'accuracy': 60.62176165803109}
07/22 22:46:02 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_prehistory]: {'accuracy': 46.2962962962963}
07/22 22:46:04 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_security_studies]: {'accuracy': 55.10204081632652}
07/22 22:46:05 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_biology]: {'accuracy': 56.12903225806451}
07/22 22:46:07 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_logical_fallacies]: {'accuracy': 55.828220858895705}
07/22 22:46:08 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_world_history]: {'accuracy': 65.40084388185655}
07/22 22:46:10 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_professional_medicine]: {'accuracy': 45.588235294117645}
07/22 22:46:11 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_mathematics]: {'accuracy': 21.48148148148148}
07/22 22:46:12 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_college_medicine]: {'accuracy': 43.35260115606936}
07/22 22:46:14 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_us_history]: {'accuracy': 51.9607843137255}
07/22 22:46:15 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_sociology]: {'accuracy': 64.6766169154229}
07/22 22:46:16 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_econometrics]: {'accuracy': 31.57894736842105}
07/22 22:46:18 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_high_school_psychology]: {'accuracy': 65.5045871559633}
07/22 22:46:19 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_human_aging]: {'accuracy': 48.4304932735426}
07/22 22:46:20 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_us_foreign_policy]: {'accuracy': 69.0}
07/22 22:46:22 - OpenCompass - INFO - Task [opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b/lukaemon_mmlu_conceptual_physics]: {'accuracy': 32.340425531914896}
dataset version metric mode opencompass.models.huggingface.HuggingFace_Shanghai_AI_Laboratory_internlm2-chat-1_8b
------------------------------------------------- --------- ---------------- ------ ---------------------------------------------------------------------------------------
lukaemon_mmlu_college_biology caec7d accuracy gen 51.39
lukaemon_mmlu_college_chemistry 520aa6 accuracy gen 34
lukaemon_mmlu_college_computer_science 99c216 accuracy gen 41
lukaemon_mmlu_college_mathematics 678751 accuracy gen 32
lukaemon_mmlu_college_physics 4f382c accuracy gen 29.41
lukaemon_mmlu_electrical_engineering 770ce3 accuracy gen 44.14
lukaemon_mmlu_astronomy d3ee01 accuracy gen 48.03
lukaemon_mmlu_anatomy 72183b accuracy gen 45.93
lukaemon_mmlu_abstract_algebra 2db373 accuracy gen 31
lukaemon_mmlu_machine_learning 0283bb accuracy gen 32.14
lukaemon_mmlu_clinical_knowledge cb3218 accuracy gen 51.32
lukaemon_mmlu_global_facts ab07b6 accuracy gen 24
lukaemon_mmlu_management 80876d accuracy gen 62.14
lukaemon_mmlu_nutrition 4543bd accuracy gen 48.37
lukaemon_mmlu_marketing 7394e3 accuracy gen 65.81
lukaemon_mmlu_professional_accounting 444b7f accuracy gen 35.11
lukaemon_mmlu_high_school_geography 0780e6 accuracy gen 56.06
lukaemon_mmlu_international_law cf3179 accuracy gen 49.59
lukaemon_mmlu_moral_scenarios f6dbe2 accuracy gen 24.47
lukaemon_mmlu_computer_security ce7550 accuracy gen 63
lukaemon_mmlu_high_school_microeconomics 04d21a accuracy gen 48.32
lukaemon_mmlu_professional_law 5f7e6c accuracy gen 31.1
lukaemon_mmlu_medical_genetics 881ef5 accuracy gen 54
lukaemon_mmlu_professional_psychology 221a16 accuracy gen 42.48
lukaemon_mmlu_jurisprudence 001f24 accuracy gen 50
lukaemon_mmlu_world_religions 232c09 accuracy gen 60.82
lukaemon_mmlu_philosophy 08042b accuracy gen 49.2
lukaemon_mmlu_virology 12e270 accuracy gen 37.35
lukaemon_mmlu_high_school_chemistry ae8820 accuracy gen 35.96
lukaemon_mmlu_public_relations e7d39b accuracy gen 53.64
lukaemon_mmlu_high_school_macroeconomics a01685 accuracy gen 45.64
lukaemon_mmlu_human_sexuality 42407c accuracy gen 54.2
lukaemon_mmlu_elementary_mathematics 269926 accuracy gen 29.89
lukaemon_mmlu_high_school_physics 93278f accuracy gen 34.44
lukaemon_mmlu_high_school_computer_science 9965a5 accuracy gen 38
lukaemon_mmlu_high_school_european_history eefc90 accuracy gen 58.18
lukaemon_mmlu_business_ethics 1dec08 accuracy gen 42
lukaemon_mmlu_moral_disputes a2173e accuracy gen 43.64
lukaemon_mmlu_high_school_statistics 8f3f3a accuracy gen 40.28
lukaemon_mmlu_miscellaneous 935647 accuracy gen 55.17
lukaemon_mmlu_formal_logic cfcb0c accuracy gen 26.98
lukaemon_mmlu_high_school_government_and_politics 3c52f9 accuracy gen 60.62
lukaemon_mmlu_prehistory bbb197 accuracy gen 46.3
lukaemon_mmlu_security_studies 9b1743 accuracy gen 55.1
lukaemon_mmlu_high_school_biology 37b125 accuracy gen 56.13
lukaemon_mmlu_logical_fallacies 9cebb0 accuracy gen 55.83
lukaemon_mmlu_high_school_world_history 048e7e accuracy gen 65.4
lukaemon_mmlu_professional_medicine 857144 accuracy gen 45.59
lukaemon_mmlu_high_school_mathematics ed4dc0 accuracy gen 21.48
lukaemon_mmlu_college_medicine 38709e accuracy gen 43.35
lukaemon_mmlu_high_school_us_history 8932df accuracy gen 51.96
lukaemon_mmlu_sociology c266a2 accuracy gen 64.68
lukaemon_mmlu_econometrics d1134d accuracy gen 31.58
lukaemon_mmlu_high_school_psychology 7db114 accuracy gen 65.5
lukaemon_mmlu_human_aging 82a410 accuracy gen 48.43
lukaemon_mmlu_us_foreign_policy 528cfe accuracy gen 69
lukaemon_mmlu_conceptual_physics 63588e accuracy gen 32.34
mmlu-humanities - naive_average gen 47.19
mmlu-stem - naive_average gen 38.98
mmlu-social-science - naive_average gen 53.9
mmlu-other - naive_average gen 47.13
mmlu - naive_average gen 45.85
mmlu-weighted - weighted_average gen 44.3
07/22 22:46:22 - OpenCompass - INFO - write summary to /root/opencompass/outputs/default/20240722_181832/summary/summary_20240722_181832.txt
07/22 22:46:22 - OpenCompass - INFO - write csv to /root/opencompass/outputs/default/20240722_181832/summary/summary_20240722_181832.csv
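The CSV summary can be post-processed with a few lines of Python. The sketch below assumes the file has a header row with the columns shown in the table above (dataset, version, metric, mode, then one column per model); `load_scores` is a hypothetical helper, and the two sample rows are copied from the log above.

```python
import csv
import io


def load_scores(csv_text, score_column=None):
    """Map dataset name -> score, skipping rows whose score is not numeric."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: the last column holds the scores when none is named.
    column = score_column or reader.fieldnames[-1]
    scores = {}
    for row in reader:
        try:
            scores[row["dataset"]] = float(row[column])
        except (TypeError, ValueError):
            continue
    return scores


if __name__ == "__main__":
    sample = (
        "dataset,version,metric,mode,model\n"
        "lukaemon_mmlu_college_biology,caec7d,accuracy,gen,51.39\n"
        "lukaemon_mmlu_college_chemistry,520aa6,accuracy,gen,34\n"
    )
    for name, score in load_scores(sample).items():
        print(f"{name}: {score}")
```

For a real run, read the text from the summary_*.csv path printed at the end of the log and pass it to load_scores.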
Command breakdown (the trailing comments are annotations, not part of the command):
python run.py \
--datasets mmlu_gen \
--hf-path /share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b \ # HuggingFace model path
--tokenizer-path /share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b \ # HuggingFace tokenizer path (can be omitted if it is the same as the model path)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # arguments for building the tokenizer
--model-kwargs device_map='auto' trust_remote_code=True \ # arguments for building the model
--max-seq-len 1024 \ # maximum sequence length the model accepts
--max-out-len 16 \ # maximum number of tokens to generate
--batch-size 2 \ # batch size
--num-gpus 1 \ # number of GPUs required to run the model
--debug
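Options such as --tokenizer-kwargs and --model-kwargs take a list of key=value pairs. The sketch below shows one way such pairs could be turned into a Python dict; `parse_kwargs` is a hypothetical helper written for illustration and is not OpenCompass's own argument parser. Note that the shell strips the quotes, so padding_side='left' reaches the program as padding_side=left.

```python
import ast


def parse_kwargs(pairs):
    """Convert ['key=value', ...] into a dict, parsing Python literals when possible."""
    result = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        try:
            # Handles True/False, numbers, quoted strings, and similar literals.
            result[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Bare words such as left or auto are kept as plain strings.
            result[key] = raw
    return result


if __name__ == "__main__":
    print(parse_kwargs(["padding_side=left", "trust_remote_code=True"]))
    print(parse_kwargs(["device_map=auto", "trust_remote_code=True"]))
```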
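Evaluate using a configuration file
In addition to configuring an experiment from the command line, OpenCompass also lets users write the complete experiment configuration in a config file and run it directly with run.py. Config files are written in Python and must include the datasets and models fields. The configuration for this test lives in the configs folder. It pulls in the required dataset and model configurations through the inheritance mechanism and combines them into the datasets and models fields in the required format. Run the following commands to create eval_tutorial_demo.py in the configs folder: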
cd /root/opencompass/configs
touch eval_tutorial_demo.py
Open eval_tutorial_demo.py and paste in the following code:
from mmengine.config import read_base

with read_base():
    from .datasets.ceval.ceval_gen import ceval_datasets
    from .models.hf_internlm.hf_internlm2_chat_1_8b import models as hf_internlm2_chat_1_8b_models

datasets = ceval_datasets
models = hf_internlm2_chat_1_8b_models
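Because datasets is an ordinary Python list of config dicts, it can be filtered before it is assigned, for example to try a quick run on a few subsets first. The sketch below is a hedged illustration: `select_datasets` is a hypothetical helper, and the abbr values in the demo list are stand-ins, assuming each dataset config carries an abbr key.

```python
def select_datasets(datasets, keyword):
    """Keep only the dataset configs whose 'abbr' contains the keyword."""
    return [d for d in datasets if keyword in d.get("abbr", "")]


if __name__ == "__main__":
    # Stand-in entries; in the config file this would be ceval_datasets.
    demo = [
        {"abbr": "ceval-computer_network"},
        {"abbr": "ceval-college_physics"},
    ]
    print(select_datasets(demo, "computer"))
```

In eval_tutorial_demo.py the equivalent would be a list comprehension over ceval_datasets written in place of the plain assignment.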
When running the task, we then only need to pass the path of the config file to run.py:
cd /root/opencompass
python run.py configs/eval_tutorial_demo.py --debug
If everything is working, you should see the following on screen:
[2024-08-09 16:48:07,016] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
After the evaluation finishes, you will see: