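This session covers the OpenCompass (司南) evaluation system in two parts: a general overview and a hands-on coding walkthrough. An evaluation system is an important tool in model development: it measures model performance, guides the direction of research, and helps address large-model safety issues. Evaluation faces challenges in comprehensiveness, data contamination, cost, and robustness. OpenCompass supports multiple evaluation methods, both objective and subjective, and aims to build a scientific and fair evaluation system for large models. Through community collaboration its evaluation tools are updated continuously, and it has become an important evaluation platform in China, covering many models and application scenarios.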
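Evaluation is essential in model development: it not only identifies the best-performing models but also guides subsequent research. Specialized evaluation is particularly necessary in vertical domains such as healthcare and finance, to make sure models can be applied effectively there.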
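Evaluating a model cannot rely on simple rules alone; it has to combine subjective and objective methods. Long-text evaluation tests a model's comprehension and recall, and how the model performs when unrelated information is inserted into the context is especially telling.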
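One common way to run this kind of long-text check is a "needle in a haystack" test: a key fact is buried in unrelated filler text and the model is asked to recall it. The sketch below only builds such a prompt and scores a reply by substring match. The function names, the filler sentences, and the passphrase are all hypothetical, and this is not necessarily how OpenCompass implements its long-context evaluation.

```python
def build_haystack_prompt(filler, needle, question, depth, n_sentences=200):
    """Build a long prompt with `needle` inserted at a relative `depth` (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Cycle through the filler sentences until the context is long enough.
    sentences = [filler[i % len(filler)] for i in range(n_sentences)]
    # Insert the key fact at the requested relative position.
    position = round(depth * len(sentences))
    sentences.insert(position, needle)
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def is_recalled(model_answer, expected):
    """Count a reply as correct if it contains the expected string (case-insensitive)."""
    return expected.lower() in model_answer.lower()


# Hypothetical filler text and key fact, for illustration only.
filler = [
    "The weather report mentioned light rain in the afternoon.",
    "A new bakery opened on the corner last week.",
    "The train to the coast leaves every thirty minutes.",
]
needle = "The secret passphrase is blue-lantern-42."
prompt = build_haystack_prompt(
    filler, needle, "What is the secret passphrase?", depth=0.5
)
```

Sweeping `depth` and `n_sentences` over a grid shows at which context lengths and positions the model starts to lose the inserted fact.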
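In a comprehensive evaluation, a model's capabilities receive both objective and subjective scores, which are combined into a visual ranking. The results show where models still have room to improve and where Chinese models have an edge in Chinese-language scenarios, which helps move the field forward.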
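The next generation of large-model evaluation focuses mainly on API-oriented evaluation systems and on building dynamic evaluation. With automated construction strategies, it becomes possible to build complex agent evaluation systems that analyze model performance, where capabilities come from, and how well they generalize.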
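The video explains how to run a model evaluation from the command line and from configuration files, in this order: dataset preparation, model preparation, task execution. Following these steps, users can complete an evaluation efficiently and watch the output in real time.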
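The command-line route can be sketched as follows. This helper only assembles the argument list for OpenCompass's `run.py` and prints it; nothing is executed. The config names `ceval_gen` and `hf_internlm2_chat_1_8b` are assumptions based on the referenced tutorial, not something recorded in these notes; the log below reports the model as `internlm2-1.8b-hf`, so the model config actually used may differ.

```python
import shlex


def build_eval_command(datasets, models, debug=True):
    """Return the argv list for an OpenCompass command-line run (not executed here)."""
    cmd = ["python", "run.py", "--datasets", *datasets, "--models", *models]
    if debug:
        # --debug runs tasks one by one and prints their output in real time.
        cmd.append("--debug")
    return cmd


# Assumed config names; replace them with the ones available in your checkout.
cmd = build_eval_command(["ceval_gen"], ["hf_internlm2_chat_1_8b"])
print(shlex.join(cmd))
```

The printed command is meant to be run from the root of the OpenCompass repository, after the dataset and the model weights have been prepared.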
Dataset and model preparation:
The run ended up taking 29970.61 s (about 8.3 hours), which was heartbreaking...
08/17 01:26:18 - OpenCompass - INFO - time elapsed: 29970.61s
08/17 01:26:23 - OpenCompass - INFO - Partitioned into 52 tasks.
08/17 01:26:26 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-computer_network]: {'accuracy': 47.368421052631575}
08/17 01:26:28 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-operating_system]: {'accuracy': 47.368421052631575}
08/17 01:26:30 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-computer_architecture]: {'accuracy': 23.809523809523807}
08/17 01:26:32 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-college_programming]: {'accuracy': 27.027027027027028}
08/17 01:26:34 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-college_physics]: {'accuracy': 42.10526315789473}
08/17 01:26:36 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-college_chemistry]: {'accuracy': 37.5}
08/17 01:26:38 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-advanced_mathematics]: {'accuracy': 26.31578947368421}
08/17 01:26:40 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-probability_and_statistics]: {'accuracy': 22.22222222222222}
08/17 01:26:42 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-discrete_mathematics]: {'accuracy': 25.0}
08/17 01:26:44 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-electrical_engineer]: {'accuracy': 27.027027027027028}
08/17 01:26:46 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-metrology_engineer]: {'accuracy': 54.166666666666664}
08/17 01:26:48 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_mathematics]: {'accuracy': 22.22222222222222}
08/17 01:26:50 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_physics]: {'accuracy': 42.10526315789473}
08/17 01:26:52 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_chemistry]: {'accuracy': 52.63157894736842}
08/17 01:26:54 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_biology]: {'accuracy': 26.31578947368421}
08/17 01:26:56 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_mathematics]: {'accuracy': 36.84210526315789}
08/17 01:26:58 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_biology]: {'accuracy': 80.95238095238095}
08/17 01:27:00 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_physics]: {'accuracy': 47.368421052631575}
08/17 01:27:02 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_chemistry]: {'accuracy': 80.0}
08/17 01:27:04 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-veterinary_medicine]: {'accuracy': 43.47826086956522}
08/17 01:27:06 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-college_economics]: {'accuracy': 32.72727272727273}
08/17 01:27:08 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-business_administration]: {'accuracy': 39.39393939393939}
08/17 01:27:10 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-marxism]: {'accuracy': 68.42105263157895}
08/17 01:27:12 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-mao_zedong_thought]: {'accuracy': 70.83333333333334}
08/17 01:27:15 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-education_science]: {'accuracy': 55.172413793103445}
08/17 01:27:17 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-teacher_qualification]: {'accuracy': 59.09090909090909}
08/17 01:27:19 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_politics]: {'accuracy': 57.89473684210527}
08/17 01:27:20 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_geography]: {'accuracy': 47.368421052631575}
08/17 01:27:22 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_politics]: {'accuracy': 76.19047619047619}
08/17 01:27:25 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_geography]: {'accuracy': 75.0}
08/17 01:27:27 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-modern_chinese_history]: {'accuracy': 52.17391304347826}
08/17 01:27:29 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-ideological_and_moral_cultivation]: {'accuracy': 73.68421052631578}
08/17 01:27:31 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-logic]: {'accuracy': 31.818181818181817}
08/17 01:27:33 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-law]: {'accuracy': 29.166666666666668}
08/17 01:27:35 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-chinese_language_and_literature]: {'accuracy': 47.82608695652174}
08/17 01:27:37 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-art_studies]: {'accuracy': 42.42424242424242}
08/17 01:27:39 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-professional_tour_guide]: {'accuracy': 51.724137931034484}
08/17 01:27:41 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-legal_professional]: {'accuracy': 34.78260869565217}
08/17 01:27:43 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_chinese]: {'accuracy': 36.84210526315789}
08/17 01:27:45 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-high_school_history]: {'accuracy': 65.0}
08/17 01:27:47 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-middle_school_history]: {'accuracy': 86.36363636363636}
08/17 01:27:50 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-civil_servant]: {'accuracy': 42.5531914893617}
08/17 01:27:51 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-sports_science]: {'accuracy': 52.63157894736842}
08/17 01:27:53 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-plant_protection]: {'accuracy': 40.909090909090914}
08/17 01:27:56 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-basic_medicine]: {'accuracy': 68.42105263157895}
08/17 01:27:58 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-clinical_medicine]: {'accuracy': 36.36363636363637}
08/17 01:28:00 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-urban_and_rural_planner]: {'accuracy': 52.17391304347826}
08/17 01:28:02 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-accountant]: {'accuracy': 36.734693877551024}
08/17 01:28:04 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-fire_engineer]: {'accuracy': 38.70967741935484}
08/17 01:28:06 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-environmental_impact_assessment_engineer]: {'accuracy': 51.61290322580645}
08/17 01:28:08 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-tax_accountant]: {'accuracy': 36.734693877551024}
08/17 01:28:10 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-physician]: {'accuracy': 42.857142857142854}
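For post-processing, the per-task lines above can be turned into a dictionary with a short script. This is a minimal sketch that assumes every result line has the shape `Task [<model>/<dataset>]: {'accuracy': ...}` as in this log; `parse_task_results` is a hypothetical helper, not part of OpenCompass.

```python
import ast
import re

# Matches "Task [model/dataset]: {...}" anywhere in a log line.
TASK_LINE = re.compile(
    r"Task \[(?P<model>[^/\]]+)/(?P<dataset>[^\]]+)\]: (?P<result>\{.*\})"
)


def parse_task_results(log_text):
    """Map each dataset name to its accuracy, skipping lines that are not task results."""
    results = {}
    for line in log_text.splitlines():
        match = TASK_LINE.search(line)
        if match is None:
            continue
        # The result is printed as a Python dict literal, so literal_eval can read it.
        metrics = ast.literal_eval(match.group("result"))
        results[match.group("dataset")] = metrics["accuracy"]
    return results


# Three lines copied from the log above.
sample = """\
08/17 01:26:23 - OpenCompass - INFO - Partitioned into 52 tasks.
08/17 01:26:26 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-computer_network]: {'accuracy': 47.368421052631575}
08/17 01:26:36 - OpenCompass - INFO - Task [internlm2-1.8b-hf/ceval-college_chemistry]: {'accuracy': 37.5}
"""
scores = parse_task_results(sample)
```

Using `ast.literal_eval` rather than `eval` keeps the parser limited to plain literals, which is all these log lines contain.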
dataset                                         version    metric         mode      internlm2-1.8b-hf
----------------------------------------------  ---------  -------------  ------  -------------------
ceval-computer_network                          db9ce2     accuracy       gen                   47.37
ceval-operating_system                          1c2571     accuracy       gen                   47.37
ceval-computer_architecture                     a74dad     accuracy       gen                   23.81
ceval-college_programming                       4ca32a     accuracy       gen                   27.03
ceval-college_physics                           963fa8     accuracy       gen                   42.11
ceval-college_chemistry                         e78857     accuracy       gen                    37.5
ceval-advanced_mathematics                      ce03e2     accuracy       gen                   26.32
ceval-probability_and_statistics                65e812     accuracy       gen                   22.22
ceval-discrete_mathematics                      e894ae     accuracy       gen                      25
ceval-electrical_engineer                       ae42b9     accuracy       gen                   27.03
ceval-metrology_engineer                        ee34ea     accuracy       gen                   54.17
ceval-high_school_mathematics                   1dc5bf     accuracy       gen                   22.22
ceval-high_school_physics                       adf25f     accuracy       gen                   42.11
ceval-high_school_chemistry                     2ed27f     accuracy       gen                   52.63
ceval-high_school_biology                       8e2b9a     accuracy       gen                   26.32
ceval-middle_school_mathematics                 bee8d5     accuracy       gen                   36.84
ceval-middle_school_biology                     86817c     accuracy       gen                   80.95
ceval-middle_school_physics                     8accf6     accuracy       gen                   47.37
ceval-middle_school_chemistry                   167a15     accuracy       gen                      80
ceval-veterinary_medicine                       b4e08d     accuracy       gen                   43.48
ceval-college_economics                         f3f4e6     accuracy       gen                   32.73
ceval-business_administration                   c1614e     accuracy       gen                   39.39
ceval-marxism                                   cf874c     accuracy       gen                   68.42
ceval-mao_zedong_thought                        51c7a4     accuracy       gen                   70.83
ceval-education_science                         591fee     accuracy       gen                   55.17
ceval-teacher_qualification                     4e4ced     accuracy       gen                   59.09
ceval-high_school_politics                      5c0de2     accuracy       gen                   57.89
ceval-high_school_geography                     865461     accuracy       gen                   47.37
ceval-middle_school_politics                    5be3e7     accuracy       gen                   76.19
ceval-middle_school_geography                   8a63be     accuracy       gen                      75
ceval-modern_chinese_history                    fc01af     accuracy       gen                   52.17
ceval-ideological_and_moral_cultivation         a2aa4a     accuracy       gen                   73.68
ceval-logic                                     f5b022     accuracy       gen                   31.82
ceval-law                                       a110a1     accuracy       gen                   29.17
ceval-chinese_language_and_literature           0f8b68     accuracy       gen                   47.83
ceval-art_studies                               2a1300     accuracy       gen                   42.42
ceval-professional_tour_guide                   4e673e     accuracy       gen                   51.72
ceval-legal_professional                        ce8787     accuracy       gen                   34.78
ceval-high_school_chinese                       315705     accuracy       gen                   36.84
ceval-high_school_history                       7eb30a     accuracy       gen                      65
ceval-middle_school_history                     48ab4a     accuracy       gen                   86.36
ceval-civil_servant                             87d061     accuracy       gen                   42.55
ceval-sports_science                            70f27b     accuracy       gen                   52.63
ceval-plant_protection                          8941f9     accuracy       gen                   40.91
ceval-basic_medicine                            c409d6     accuracy       gen                   68.42
ceval-clinical_medicine                         49e82d     accuracy       gen                   36.36
ceval-urban_and_rural_planner                   95b885     accuracy       gen                   52.17
ceval-accountant                                002837     accuracy       gen                   36.73
ceval-fire_engineer                             bc23f5     accuracy       gen                   38.71
ceval-environmental_impact_assessment_engineer  c64e2d     accuracy       gen                   51.61
ceval-tax_accountant                            3a5e3c     accuracy       gen                   36.73
ceval-physician                                 6e277d     accuracy       gen                   42.86
ceval-stem                                      -          naive_average  gen                   40.59
ceval-social-science                            -          naive_average  gen                   58.21
ceval-humanities                                -          naive_average  gen                   50.16
ceval-other                                     -          naive_average  gen                   45.43
ceval-hard                                      -          naive_average  gen                   33.76
ceval                                           -          naive_average  gen                   47.03
08/17 01:28:11 - OpenCompass - INFO - write summary to /root/opencompass/outputs/default/20240816_170558/summary/summary_20240816_170558.txt
08/17 01:28:11 - OpenCompass - INFO - write csv to /root/opencompass/outputs/default/20240816_170558/summary/summary_20240816_170558.cs
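The `naive_average` rows in the summary are plain unweighted means of the per-subject accuracies, so each subject counts equally regardless of how many questions it contains. The sketch below reproduces the `ceval-hard` row from the rounded table values. Which eight subjects belong to the hard subset is an assumption here (it follows the C-Eval Hard subject list), although the result does match the reported 33.76.

```python
from statistics import mean

# Accuracies copied from the summary table (already rounded to two decimals there).
# Assumed membership of the "hard" subset: eight math, physics and chemistry subjects.
hard_subset = {
    "ceval-advanced_mathematics": 26.32,
    "ceval-discrete_mathematics": 25.0,
    "ceval-probability_and_statistics": 22.22,
    "ceval-college_chemistry": 37.5,
    "ceval-college_physics": 42.11,
    "ceval-high_school_mathematics": 22.22,
    "ceval-high_school_chemistry": 52.63,
    "ceval-high_school_physics": 42.11,
}

# Unweighted mean over subjects, rounded the same way as the table.
hard_average = round(mean(hard_subset.values()), 2)
print(hard_average)  # 33.76
```

The same calculation over the first 20 rows of the table gives the `ceval-stem` value of 40.59, and over all 52 subjects the overall `ceval` value of 47.03.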
Reference video: https://www.bilibili.com/video/BV1RM4m1279j/?vd_source=d5e90f8fa067b4804697b319c7cc88e4
Reference documentation: https://github.com/InternLM/Tutorial/blob/camp3/docs/L1/OpenCompass/readme.md#%E5%90%AF%E5%8A%A8%E8%AF%84%E6%B5%8B-10-a100-8gb-%E8%B5%84%E6%BA%90