Evaluating Large Models with OpenCompass
Three questions about evaluation: Why, What, and How
Why
What
There are many evaluation tasks, including ones for vertical (domain-specific) fields.
How
Evaluation is either objective or subjective. Subjective evaluation is scored either by human judges or by a judge model.
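As a rough illustration of objective evaluation on multiple-choice questions, the sketch below extracts the first option letter from each model output and compares it with the reference answer. The helper names `first_option` and `accuracy` are hypothetical and this is not OpenCompass's own post-processing; it only shows the idea of scoring against a fixed answer key.

```python
import re


def first_option(text: str, options: str = "ABCD") -> str:
    """Return the first standalone option letter found in text, or '' if none."""
    match = re.search(rf"\b([{options}])\b", text)
    return match.group(1) if match else ""


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions whose extracted option equals the reference."""
    if not references:
        return 0.0
    correct = sum(first_option(p) == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Toy model outputs and gold answers (made up for illustration).
preds = ["The answer is B.", "C", "I think A is right", "no idea"]
golds = ["B", "C", "D", "A"]
print(accuracy(preds, golds))  # 50.0
```

A simple regular expression like this can misfire (for example, on the English article "A"), which is one reason prompt design and answer extraction matter for objective scores.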
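Prompt engineering
Mainstream evaluation frameworks
OpenCompass capability framework
- Model layer
- Capability layer
- Method layer
- Tool layer
Supports a wide range of models
The evaluation pipeline is designed to split the work into multiple independently executed tasks, which maximizes the use of compute resources.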
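The idea of splitting an evaluation into independent tasks can be sketched as follows. This is a minimal illustration, assuming one task per (model, dataset) pair and a thread pool as the executor; `partition`, `run_task`, and `run_all` are hypothetical names, and OpenCompass's real partitioners and runners are more elaborate.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def partition(models, datasets):
    """Split the evaluation into one independent task per (model, dataset) pair."""
    return list(product(models, datasets))


def run_task(task):
    """Placeholder for a real inference-and-scoring job."""
    model, dataset = task
    return f"{model}/{dataset}: done"


def run_all(models, datasets, max_workers=4):
    """Run all tasks concurrently; results come back in task order."""
    tasks = partition(models, datasets)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))


results = run_all(["m1", "m2"], ["d1", "d2", "d3"])
print(len(results))  # 6
```

Because the tasks share no state, they can be scheduled on whatever workers or GPUs are free, which is what lets the pipeline keep the hardware busy.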
Outputs comparison results of large-model capabilities
Frontier exploration
Exploratory directions include:
- Multimodal
- Law
- Medicine
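Challenges
Hands-on practice
Create the development environment and prepare the datasets
List the supported datasets:
Launch the evaluation
Objective evaluation
The main entry point is run.py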
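Key arguments of the script:
- --datasets: the dataset(s) to evaluate
- --hf-path: path to the HuggingFace model files
- --tokenizer-path: path to the tokenizer
- --max-seq-len: maximum sequence length the model reads in
- --max-out-len: maximum length of the model output; usually set small for objective questions
- --debug: debug mode; prints the whole process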
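To make the arguments above concrete, the sketch below assembles a run.py command line from them. The dataset name `ceval_gen`, the model directory, and the length values are placeholder assumptions chosen for illustration; `build_run_command` is a hypothetical helper, not part of OpenCompass.

```python
import shlex


def build_run_command(dataset, model_dir, max_seq_len=2048, max_out_len=16, debug=True):
    """Assemble the argument list for run.py from the options listed above."""
    cmd = [
        "python", "run.py",
        "--datasets", dataset,
        "--hf-path", model_dir,         # model files
        "--tokenizer-path", model_dir,  # here the tokenizer lives next to the model
        "--max-seq-len", str(max_seq_len),
        "--max-out-len", str(max_out_len),
    ]
    if debug:
        cmd.append("--debug")
    return cmd


# Placeholder values; replace them with your own dataset and model path.
cmd = build_run_command("ceval_gen", "/path/to/internlm2-chat-7b")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the OpenCompass repository root, or the printed string can be pasted into a shell.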
Subjective evaluation
The main file is eval_subjective_alignbench.py
When editing the file, pay attention to fields such as model and max_out_len.
Final results:
python tools/list_configs.py internlm ceval
20240122_153109
tabulate format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset version metric mode opencompass.models.huggingface.HuggingFace_model_repos_internlm2-chat-7b
---------------------------------------------- --------- ------------- ------ --------------------------------------------------------------------------
ceval-computer_network db9ce2 accuracy gen 47.37
ceval-operating_system 1c2571 accuracy gen 57.89
ceval-computer_architecture a74dad accuracy gen 38.1
ceval-college_programming 4ca32a accuracy gen 18.92
ceval-college_physics 963fa8 accuracy gen 5.26
ceval-college_chemistry e78857 accuracy gen 0
ceval-advanced_mathematics ce03e2 accuracy gen 0
ceval-probability_and_statistics 65e812 accuracy gen 11.11
ceval-discrete_mathematics e894ae accuracy gen 18.75
ceval-electrical_engineer ae42b9 accuracy gen 18.92
ceval-metrology_engineer ee34ea accuracy gen 50
ceval-high_school_mathematics 1dc5bf accuracy gen 0
ceval-high_school_physics adf25f accuracy gen 31.58
ceval-high_school_chemistry 2ed27f accuracy gen 26.32
ceval-high_school_biology 8e2b9a accuracy gen 26.32
ceval-middle_school_mathematics bee8d5 accuracy gen 21.05
ceval-middle_school_biology 86817c accuracy gen 66.67
ceval-middle_school_physics 8accf6 accuracy gen 52.63
ceval-middle_school_chemistry 167a15 accuracy gen 80
ceval-veterinary_medicine b4e08d accuracy gen 39.13
ceval-college_economics f3f4e6 accuracy gen 29.09
ceval-business_administration c1614e accuracy gen 30.3
ceval-marxism cf874c accuracy gen 84.21
ceval-mao_zedong_thought 51c7a4 accuracy gen 70.83
ceval-education_science 591fee accuracy gen 62.07
ceval-teacher_qualification 4e4ced accuracy gen 77.27
ceval-high_school_politics 5c0de2 accuracy gen 21.05
ceval-high_school_geography 865461 accuracy gen 47.37
ceval-middle_school_politics 5be3e7 accuracy gen 38.1
ceval-middle_school_geography 8a63be accuracy gen 58.33
ceval-modern_chinese_history fc01af accuracy gen 65.22
ceval-ideological_and_moral_cultivation a2aa4a accuracy gen 89.47
ceval-logic f5b022 accuracy gen 13.64
ceval-law a110a1 accuracy gen 37.5
ceval-chinese_language_and_literature 0f8b68 accuracy gen 47.83
ceval-art_studies 2a1300 accuracy gen 66.67
ceval-professional_tour_guide 4e673e accuracy gen 82.76
ceval-legal_professional ce8787 accuracy gen 30.43
ceval-high_school_chinese 315705 accuracy gen 21.05
ceval-high_school_history 7eb30a accuracy gen 75
ceval-middle_school_history 48ab4a accuracy gen 68.18
ceval-civil_servant 87d061 accuracy gen 38.3
ceval-sports_science 70f27b accuracy gen 63.16
ceval-plant_protection 8941f9 accuracy gen 68.18
ceval-basic_medicine c409d6 accuracy gen 57.89
ceval-clinical_medicine 49e82d accuracy gen 45.45
ceval-urban_and_rural_planner 95b885 accuracy gen 58.7
ceval-accountant 002837 accuracy gen 34.69
ceval-fire_engineer bc23f5 accuracy gen 12.9
ceval-environmental_impact_assessment_engineer c64e2d accuracy gen 38.71
ceval-tax_accountant 3a5e3c accuracy gen 42.86
ceval-physician 6e277d accuracy gen 51.02
ceval-stem - naive_average gen 30.5
ceval-social-science - naive_average gen 51.86
ceval-humanities - naive_average gen 54.34
ceval-other - naive_average gen 46.53
ceval-hard - naive_average gen 11.63
ceval - naive_average gen 43.04
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------
csv format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset,version,metric,mode,opencompass.models.huggingface.HuggingFace_model_repos_internlm2-chat-7b
ceval-computer_network,db9ce2,accuracy,gen,47.37
ceval-operating_system,1c2571,accuracy,gen,57.89
ceval-computer_architecture,a74dad,accuracy,gen,38.10
ceval-college_programming,4ca32a,accuracy,gen,18.92
ceval-college_physics,963fa8,accuracy,gen,5.26
ceval-college_chemistry,e78857,accuracy,gen,0.00
ceval-advanced_mathematics,ce03e2,accuracy,gen,0.00
ceval-probability_and_statistics,65e812,accuracy,gen,11.11
ceval-discrete_mathematics,e894ae,accuracy,gen,18.75
ceval-electrical_engineer,ae42b9,accuracy,gen,18.92
ceval-metrology_engineer,ee34ea,accuracy,gen,50.00
ceval-high_school_mathematics,1dc5bf,accuracy,gen,0.00
ceval-high_school_physics,adf25f,accuracy,gen,31.58
ceval-high_school_chemistry,2ed27f,accuracy,gen,26.32
ceval-high_school_biology,8e2b9a,accuracy,gen,26.32
ceval-middle_school_mathematics,bee8d5,accuracy,gen,21.05
ceval-middle_school_biology,86817c,accuracy,gen,66.67
ceval-middle_school_physics,8accf6,accuracy,gen,52.63
ceval-middle_school_chemistry,167a15,accuracy,gen,80.00
ceval-veterinary_medicine,b4e08d,accuracy,gen,39.13
ceval-college_economics,f3f4e6,accuracy,gen,29.09
ceval-business_administration,c1614e,accuracy,gen,30.30
ceval-marxism,cf874c,accuracy,gen,84.21
ceval-mao_zedong_thought,51c7a4,accuracy,gen,70.83
ceval-education_science,591fee,accuracy,gen,62.07
ceval-teacher_qualification,4e4ced,accuracy,gen,77.27
ceval-high_school_politics,5c0de2,accuracy,gen,21.05
ceval-high_school_geography,865461,accuracy,gen,47.37
ceval-middle_school_politics,5be3e7,accuracy,gen,38.10
ceval-middle_school_geography,8a63be,accuracy,gen,58.33
ceval-modern_chinese_history,fc01af,accuracy,gen,65.22
ceval-ideological_and_moral_cultivation,a2aa4a,accuracy,gen,89.47
ceval-logic,f5b022,accuracy,gen,13.64
ceval-law,a110a1,accuracy,gen,37.50
ceval-chinese_language_and_literature,0f8b68,accuracy,gen,47.83
ceval-art_studies,2a1300,accuracy,gen,66.67
ceval-professional_tour_guide,4e673e,accuracy,gen,82.76
ceval-legal_professional,ce8787,accuracy,gen,30.43
ceval-high_school_chinese,315705,accuracy,gen,21.05
ceval-high_school_history,7eb30a,accuracy,gen,75.00
ceval-middle_school_history,48ab4a,accuracy,gen,68.18
ceval-civil_servant,87d061,accuracy,gen,38.30
ceval-sports_science,70f27b,accuracy,gen,63.16
ceval-plant_protection,8941f9,accuracy,gen,68.18
ceval-basic_medicine,c409d6,accuracy,gen,57.89
ceval-clinical_medicine,49e82d,accuracy,gen,45.45
ceval-urban_and_rural_planner,95b885,accuracy,gen,58.70
ceval-accountant,002837,accuracy,gen,34.69
ceval-fire_engineer,bc23f5,accuracy,gen,12.90
ceval-environmental_impact_assessment_engineer,c64e2d,accuracy,gen,38.71
ceval-tax_accountant,3a5e3c,accuracy,gen,42.86
ceval-physician,6e277d,accuracy,gen,51.02
ceval-stem,-,naive_average,gen,30.50
ceval-social-science,-,naive_average,gen,51.86
ceval-humanities,-,naive_average,gen,54.34
ceval-other,-,naive_average,gen,46.53
ceval-hard,-,naive_average,gen,11.63
ceval,-,naive_average,gen,43.04
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------
raw format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-------------------------------
Model: opencompass.models.huggingface.HuggingFace_model_repos_internlm2-chat-7b
ceval-computer_network: {'accuracy': 47.368421052631575}
ceval-operating_system: {'accuracy': 57.89473684210527}
ceval-computer_architecture: {'accuracy': 38.095238095238095}
ceval-college_programming: {'accuracy': 18.91891891891892}
ceval-college_physics: {'accuracy': 5.263157894736842}
ceval-college_chemistry: {'accuracy': 0.0}
ceval-advanced_mathematics: {'accuracy': 0.0}
ceval-probability_and_statistics: {'accuracy': 11.11111111111111}
ceval-discrete_mathematics: {'accuracy': 18.75}
ceval-electrical_engineer: {'accuracy': 18.91891891891892}
ceval-metrology_engineer: {'accuracy': 50.0}
ceval-high_school_mathematics: {'accuracy': 0.0}
ceval-high_school_physics: {'accuracy': 31.57894736842105}
ceval-high_school_chemistry: {'accuracy': 26.31578947368421}
ceval-high_school_biology: {'accuracy': 26.31578947368421}
ceval-middle_school_mathematics: {'accuracy': 21.052631578947366}
ceval-middle_school_biology: {'accuracy': 66.66666666666666}
ceval-middle_school_physics: {'accuracy': 52.63157894736842}
ceval-middle_school_chemistry: {'accuracy': 80.0}
ceval-veterinary_medicine: {'accuracy': 39.130434782608695}
ceval-college_economics: {'accuracy': 29.09090909090909}
ceval-business_administration: {'accuracy': 30.303030303030305}
ceval-marxism: {'accuracy': 84.21052631578947}
ceval-mao_zedong_thought: {'accuracy': 70.83333333333334}
ceval-education_science: {'accuracy': 62.06896551724138}
ceval-teacher_qualification: {'accuracy': 77.27272727272727}
ceval-high_school_politics: {'accuracy': 21.052631578947366}
ceval-high_school_geography: {'accuracy': 47.368421052631575}
ceval-middle_school_politics: {'accuracy': 38.095238095238095}
ceval-middle_school_geography: {'accuracy': 58.333333333333336}
ceval-modern_chinese_history: {'accuracy': 65.21739130434783}
ceval-ideological_and_moral_cultivation: {'accuracy': 89.47368421052632}
ceval-logic: {'accuracy': 13.636363636363635}
ceval-law: {'accuracy': 37.5}
ceval-chinese_language_and_literature: {'accuracy': 47.82608695652174}
ceval-art_studies: {'accuracy': 66.66666666666666}
ceval-professional_tour_guide: {'accuracy': 82.75862068965517}
ceval-legal_professional: {'accuracy': 30.434782608695656}
ceval-high_school_chinese: {'accuracy': 21.052631578947366}
ceval-high_school_history: {'accuracy': 75.0}
ceval-middle_school_history: {'accuracy': 68.18181818181817}
ceval-civil_servant: {'accuracy': 38.297872340425535}
ceval-sports_science: {'accuracy': 63.1578947368421}
ceval-plant_protection: {'accuracy': 68.18181818181817}
ceval-basic_medicine: {'accuracy': 57.89473684210527}
ceval-clinical_medicine: {'accuracy': 45.45454545454545}
ceval-urban_and_rural_planner: {'accuracy': 58.69565217391305}
ceval-accountant: {'accuracy': 34.69387755102041}
ceval-fire_engineer: {'accuracy': 12.903225806451612}
ceval-environmental_impact_assessment_engineer: {'accuracy': 38.70967741935484}
ceval-tax_accountant: {'accuracy': 42.857142857142854}
ceval-physician: {'accuracy': 51.02040816326531}
ceval-stem: {'ceval-computer_network': 47.368421052631575, 'ceval-operating_system': 57.89473684210527, 'ceval-computer_architecture': 38.095238095238095, 'ceval-college_programming': 18.91891891891892, 'ceval-college_physics': 5.263157894736842, 'ceval-college_chemistry': 0.0, 'ceval-advanced_mathematics': 0.0, 'ceval-probability_and_statistics': 11.11111111111111, 'ceval-discrete_mathematics': 18.75, 'ceval-electrical_engineer': 18.91891891891892, 'ceval-metrology_engineer': 50.0, 'ceval-high_school_mathematics': 0.0, 'ceval-high_school_physics': 31.57894736842105, 'ceval-high_school_chemistry': 26.31578947368421, 'ceval-high_school_biology': 26.31578947368421, 'ceval-middle_school_mathematics': 21.052631578947366, 'ceval-middle_school_biology': 66.66666666666666, 'ceval-middle_school_physics': 52.63157894736842, 'ceval-middle_school_chemistry': 80.0, 'ceval-veterinary_medicine': 39.130434782608695, 'naive_average': 30.50061705625207}
ceval-social-science: {'ceval-college_economics': 29.09090909090909, 'ceval-business_administration': 30.303030303030305, 'ceval-marxism': 84.21052631578947, 'ceval-mao_zedong_thought': 70.83333333333334, 'ceval-education_science': 62.06896551724138, 'ceval-teacher_qualification': 77.27272727272727, 'ceval-high_school_politics': 21.052631578947366, 'ceval-high_school_geography': 47.368421052631575, 'ceval-middle_school_politics': 38.095238095238095, 'ceval-middle_school_geography': 58.333333333333336, 'naive_average': 51.86291158931812}
ceval-humanities: {'ceval-modern_chinese_history': 65.21739130434783, 'ceval-ideological_and_moral_cultivation': 89.47368421052632, 'ceval-logic': 13.636363636363635, 'ceval-law': 37.5, 'ceval-chinese_language_and_literature': 47.82608695652174, 'ceval-art_studies': 66.66666666666666, 'ceval-professional_tour_guide': 82.75862068965517, 'ceval-legal_professional': 30.434782608695656, 'ceval-high_school_chinese': 21.052631578947366, 'ceval-high_school_history': 75.0, 'ceval-middle_school_history': 68.18181818181817, 'naive_average': 54.340731439412956}
ceval-other: {'ceval-civil_servant': 38.297872340425535, 'ceval-sports_science': 63.1578947368421, 'ceval-plant_protection': 68.18181818181817, 'ceval-basic_medicine': 57.89473684210527, 'ceval-clinical_medicine': 45.45454545454545, 'ceval-urban_and_rural_planner': 58.69565217391305, 'ceval-accountant': 34.69387755102041, 'ceval-fire_engineer': 12.903225806451612, 'ceval-environmental_impact_assessment_engineer': 38.70967741935484, 'ceval-tax_accountant': 42.857142857142854, 'ceval-physician': 51.02040816326531, 'naive_average': 46.533350138807684}
ceval-hard: {'ceval-advanced_mathematics': 0.0, 'ceval-discrete_mathematics': 18.75, 'ceval-probability_and_statistics': 11.11111111111111, 'ceval-college_chemistry': 0.0, 'ceval-college_physics': 5.263157894736842, 'ceval-high_school_mathematics': 0.0, 'ceval-high_school_chemistry': 26.31578947368421, 'ceval-high_school_physics': 31.57894736842105, 'naive_average': 11.627375730994151}
ceval: {'ceval-computer_network': 47.368421052631575, 'ceval-operating_system': 57.89473684210527, 'ceval-computer_architecture': 38.095238095238095, 'ceval-college_programming': 18.91891891891892, 'ceval-college_physics': 5.263157894736842, 'ceval-college_chemistry': 0.0, 'ceval-advanced_mathematics': 0.0, 'ceval-probability_and_statistics': 11.11111111111111, 'ceval-discrete_mathematics': 18.75, 'ceval-electrical_engineer': 18.91891891891892, 'ceval-metrology_engineer': 50.0, 'ceval-high_school_mathematics': 0.0, 'ceval-high_school_physics': 31.57894736842105, 'ceval-high_school_chemistry': 26.31578947368421, 'ceval-high_school_biology': 26.31578947368421, 'ceval-middle_school_mathematics': 21.052631578947366, 'ceval-middle_school_biology': 66.66666666666666, 'ceval-middle_school_physics': 52.63157894736842, 'ceval-middle_school_chemistry': 80.0, 'ceval-veterinary_medicine': 39.130434782608695, 'ceval-college_economics': 29.09090909090909, 'ceval-business_administration': 30.303030303030305, 'ceval-marxism': 84.21052631578947, 'ceval-mao_zedong_thought': 70.83333333333334, 'ceval-education_science': 62.06896551724138, 'ceval-teacher_qualification': 77.27272727272727, 'ceval-high_school_politics': 21.052631578947366, 'ceval-high_school_geography': 47.368421052631575, 'ceval-middle_school_politics': 38.095238095238095, 'ceval-middle_school_geography': 58.333333333333336, 'ceval-modern_chinese_history': 65.21739130434783, 'ceval-ideological_and_moral_cultivation': 89.47368421052632, 'ceval-logic': 13.636363636363635, 'ceval-law': 37.5, 'ceval-chinese_language_and_literature': 47.82608695652174, 'ceval-art_studies': 66.66666666666666, 'ceval-professional_tour_guide': 82.75862068965517, 'ceval-legal_professional': 30.434782608695656, 'ceval-high_school_chinese': 21.052631578947366, 'ceval-high_school_history': 75.0, 'ceval-middle_school_history': 68.18181818181817, 'ceval-civil_servant': 38.297872340425535, 'ceval-sports_science': 63.1578947368421, 
'ceval-plant_protection': 68.18181818181817, 'ceval-basic_medicine': 57.89473684210527, 'ceval-clinical_medicine': 45.45454545454545, 'ceval-urban_and_rural_planner': 58.69565217391305, 'ceval-accountant': 34.69387755102041, 'ceval-fire_engineer': 12.903225806451612, 'ceval-environmental_impact_assessment_engineer': 38.70967741935484, 'ceval-tax_accountant': 42.857142857142854, 'ceval-physician': 51.02040816326531, 'naive_average': 43.043391430358646}
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
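The naive_average rows in the summary above are consistent with plain unweighted means of the per-subject accuracies. The sketch below uses values copied from the raw output to check ceval-hard, and to show that the overall ceval score matches the mean over all 52 subjects rather than the mean of the four category averages. The variable names are illustrative and this is not OpenCompass code.

```python
from statistics import fmean

# Per-subject accuracies of the ceval-hard group, copied from the raw output above.
ceval_hard = {
    "ceval-advanced_mathematics": 0.0,
    "ceval-discrete_mathematics": 18.75,
    "ceval-probability_and_statistics": 11.11111111111111,
    "ceval-college_chemistry": 0.0,
    "ceval-college_physics": 5.263157894736842,
    "ceval-high_school_mathematics": 0.0,
    "ceval-high_school_chemistry": 26.31578947368421,
    "ceval-high_school_physics": 31.57894736842105,
}
hard_avg = fmean(ceval_hard.values())

# (category average, number of subjects) taken from the raw output above.
categories = {
    "ceval-stem": (30.50061705625207, 20),
    "ceval-social-science": (51.86291158931812, 10),
    "ceval-humanities": (54.340731439412956, 11),
    "ceval-other": (46.533350138807684, 11),
}
total_subjects = sum(n for _, n in categories.values())
# Weighting each category by its subject count equals averaging all subjects.
overall = sum(avg * n for avg, n in categories.values()) / total_subjects
mean_of_categories = fmean(avg for avg, _ in categories.values())

print(round(hard_avg, 2))            # 11.63, matches ceval-hard
print(round(overall, 2))             # 43.04, matches ceval
print(round(mean_of_categories, 2))  # 45.81, not the reported overall score
```

Since STEM contributes 20 of the 52 subjects and has the lowest category average, it pulls the overall score below the simple mean of the four categories.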