Leaderboard - C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
Results for each subject category and the overall average on the test set are shown below. The results come from either zero-shot or few-shot prompting. Note that few-shot is not necessarily better than zero-shot; for example, in our own runs zero-shot performed better for many instruction-tuned models. Where we evaluated a model in both the zero-shot and few-shot settings, we report the setting with the higher overall average accuracy. (Model details, including the prompting format, can be viewed by clicking on each model.)
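As a minimal sketch of the selection rule above (not the C-Eval team's code), the following Python compares a zero-shot run and a few-shot run by their average accuracy over subjects and reports the higher-scoring setting. The subject names and scores are hypothetical placeholders, and the unweighted mean over subjects is an assumption about how the overall average is computed.

```python
# Sketch: pick the reported setting (zero-shot vs. few-shot) by overall average accuracy.
# Subject names and accuracy values below are hypothetical placeholders.
from statistics import mean

zero_shot = {"advanced_mathematics": 41.2, "law": 55.0, "modern_chinese_history": 60.3}
few_shot = {"advanced_mathematics": 43.8, "law": 52.1, "modern_chinese_history": 58.9}

def overall_average(per_subject: dict[str, float]) -> float:
    # Assumption: the leaderboard "Avg" is an unweighted mean over subjects.
    return mean(per_subject.values())

# Report whichever setting has the higher overall average accuracy.
setting, scores = max(("zero-shot", zero_shot), ("few-shot", few_shot),
                      key=lambda pair: overall_average(pair[1]))
print(f"Reported setting: {setting} ({overall_average(scores):.1f})")
```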
You are welcome to submit your model's test results to C-Eval at any time (either zero-shot or few-shot evaluation is fine). Click here to submit your results; they will not be made public on the leaderboard unless you request it.
(Note: * indicates that the model was evaluated by the C-Eval team, while other results are obtained through user submissions.)
| # | Model | Creator | Submission Date | Avg | Avg (Hard) | STEM | Social Science | Humanities | Others |
|---|-------|---------|-----------------|-----|------------|------|----------------|------------|--------|
| 0 | ChatGLM2 | Tsinghua & Zhipu.AI | 2023/6/25 | 71.1 | 50.0 | 64.4 | 81.6 | 73.7 | 71.3 |
| 1 | GPT-4* | OpenAI | 2023/5/15 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| 2 | SenseChat | SenseTime | 2023/6/20 | 66.1 | 45.1 | 58.0 | 78.4 | 67.2 | 68.8 |
| 3 | AiLMe-100B v1 | APUS | 2023/7/19 | 65.2 | 55.3 | 65.4 | 72.3 | 62.4 | 61.1 |
| 4 | InternLM | SenseTime & Shanghai AI Laboratory (equal contribution) | 2023/6/1 | 62.7 | 46.0 | 58.1 | 76.7 | 64.6 | 56.4 |
| 5 | Instruct-DLM-v2 | DeepLang AI | 2023/7/2 | 56.8 | 37.4 | 50.3 | 71.1 | 59.1 | 53.4 |
| 6 | DFM2.0 | AISpeech & SJTU | 2023/7/10 | 55.4 | 38.3 | 47.5 | 64.6 | 58.7 | 58.2 |
| 7 | ChatGPT* | OpenAI | 2023/5/15 | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| 8 | Claude-v1.3* | Anthropic | 2023/5/15 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| 9 | TeleChat-E | China Telecom Corporation Ltd. | 2023/7/4 | 54.2 | 41.5 | 51.1 | 63.1 | 53.8 | 52.3 |
| 10 | CPM | ModelBest | 2023/7/5 | 54.1 | 37.5 | 47.2 | 62.7 | 58.4 | 54.8 |
| 11 | Baichuan-13B | Baichuan | 2023/7/9 | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 |
| 12 | DLM-v2 | DeepLang AI | 2023/7/2 | 53.5 | 35.3 | 47.0 | 64.7 | 56.4 | 52.1 |
| 13 | InternLM-7B | Shanghai AI Laboratory & SenseTime | 2023/7/5 | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| 14 | ChatGLM2-6B | Tsinghua & Zhipu.AI | 2023/6/24 | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| 15 | EduChat | ECNU | 2023/7/18 | 49.3 | 33.1 | 43.5 | 59.3 | 53.7 | 46.6 |
| 16 | SageGPT | 4Paradigm Inc. | 2023/6/21 | 49.1 | 39.1 | 46.6 | 54.6 | 45.8 | 51.8 |
| 17 | AndesLM-13B | AndesLM | 2023/6/18 | 46.0 | 29.7 | 38.1 | 61.0 | 51.0 | 41.9 |
| 18 | Claude-instant-v1.0* | Anthropic | 2023/5/15 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| 19 | WestlakeLM-19B | Westlake University and Westlake Xinchen (Scietrain) | 2023/6/18 | 44.6 | 34.9 | 41.6 | 51.0 | 44.3 | 44.5 |
| 20 | bloomz-mt-176B* | BigScience | 2023/5/15 | 44.3 | 30.8 | 39.0 | 53.0 | 47.7 | 42.7 |
| 21 | 玉言 | Fuxi AI Lab, NetEase | 2023/6/20 | 44.3 | 30.6 | 39.2 | 54.5 | 46.4 | 42.2 |
| 22 | GLM-130B* | Tsinghua | 2023/5/15 | 44.0 | 30.7 | 36.7 | 55.8 | 47.7 | 43.0 |
| 23 | baichuan-7B | Baichuan | 2023/6/14 | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| 24 | CubeLM-13B | CubeLM | 2023/6/12 | 42.5 | 27.9 | 36.0 | 52.4 | 45.8 | 41.8 |
| 25 | Chinese-Alpaca-33B | Cui, Yang, and Yao | 2023/6/7 | 41.6 | 30.3 | 37.0 | 51.6 | 42.3 | 40.3 |
| 26 | Chinese-Alpaca-Plus-13B | Cui, Yang, and Yao | 2023/6/5 | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| 27 | ChatGLM-6B* | Tsinghua & Zhipu.AI | 2023/5/15 | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| 28 | LLaMA-65B* | Meta | 2023/5/15 | 38.8 | 31.7 | 37.8 | 45.6 | 36.1 | 37.1 |
| 29 | Chinese LLaMA-13B* | Cui et al. | 2023/5/15 | 33.3 | 27.3 | 31.6 | 37.2 | 33.6 | 32.8 |
| 30 | MOSS* | Fudan | 2023/5/15 | 33.1 | 28.4 | 31.6 | 37.0 | 33.4 | 32.1 |
| 31 | Chinese Alpaca-13B* | Cui et al. | 2023/5/15 | 30.9 | 24.4 | 27.4 | 39.2 | 32.5 | 28.0 |