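Model Evaluation with OpenCompass
Preface
OpenCompass is an open-source framework for evaluating large language models. It aims to be comprehensive, efficient, and extensible, bringing together dataset management, model configuration, and evaluation analysis so that researchers and developers can get started quickly and stay focused on model design and innovation.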
1. Environment Setup
# Create and activate a virtual environment
conda create --name opencompass python=3.10 -y
conda activate opencompass
# Install OpenCompass from source
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
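After the editable install, it can help to confirm that the package is visible from the active environment. The following is a minimal sketch using only the standard library. The helper name installed_version is made up for this example, and the distribution name "opencompass" is assumed to match the repository's package name.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "opencompass" is an assumed distribution name, see the note above.
    version = installed_version("opencompass")
    if version is None:
        print("opencompass is not installed in this environment")
    else:
        print(f"opencompass {version} is installed")
```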
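2. Data Preparation
Download the data into the opencompass directory and unzip it there. Do not rename or move anything; unzipping produces a data directory.
# Download the dataset archive; it unpacks to data/
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip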
unzip OpenCompassData-core-20240207.zip
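On Windows, wget and unzip may not be available in the shell. The same download-and-extract step could be sketched in Python with the standard library. The URL is the one given above; the function names extract_archive and download_and_extract are hypothetical helpers for this example.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

DATA_URL = (
    "https://github.com/open-compass/opencompass/releases/download/"
    "0.2.2.rc1/OpenCompassData-core-20240207.zip"
)


def extract_archive(zip_path: Path, dest_dir: Path) -> List[str]:
    """Unpack zip_path into dest_dir and return the top-level entries it contained."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = archive.namelist()
    return sorted({name.split("/")[0] for name in names})


def download_and_extract(url: str = DATA_URL, dest_dir: Path = Path(".")) -> List[str]:
    """Download the dataset archive into dest_dir and unpack it there."""
    zip_path = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, zip_path)  # network access happens here
    return extract_archive(zip_path, dest_dir)
```

Calling download_and_extract() from inside the opencompass directory should leave the data directory in place, matching the manual steps.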
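A look at the evaluation data: the gsm8k directory contains four jsonl files. The excerpt below is from the first file; each record is a question-answer pair.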
{"question": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", "answer": "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\n#### 18"}
{"question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?", "answer": "It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3"}
{"question": "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?", "answer": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"}
{"question": "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?", "answer": "He sprints 3*3=<<3*3=9>>9 times\nSo he runs 9*60=<<9*60=540>>540 meters\n#### 540"}
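In each record, the reference answer ends with a line of the form "#### <number>". A minimal sketch of reading such records and pulling out that final value might look like this; the helper names are illustrative and not part of OpenCompass, and the sample is one of the records shown above.

```python
import json
from typing import Dict, Iterable, List


def final_answer(answer_text: str) -> str:
    """Return the text after the last '####' marker, with thousands separators removed."""
    return answer_text.rsplit("####", 1)[-1].strip().replace(",", "")


def load_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


sample = [
    '{"question": "A robe takes 2 bolts of blue fiber and half that much white fiber. '
    'How many bolts in total does it take?", '
    '"answer": "It takes 2/2=<<2/2=1>>1 bolt of white fiber\\n'
    'So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\\n#### 3"}',
]

for record in load_records(sample):
    print(final_answer(record["answer"]))  # prints 3
```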
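3. Downloading Models
Download the models from the ModelScope community: one DeepSeek model and one Qwen model.
# download.py: fetch both models from ModelScope
from modelscope import snapshot_download
deepseek_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B')
qwen_dir = snapshot_download('Qwen/Qwen1.5-1.8B-Chat')
By default the downloaded files go to the .cache directory of the current user (on Windows, under the user folder on the C: drive). To choose another location, pass the extra argument cache_dir='path/to/download/dir' to snapshot_download.
On Linux, the default location when running as root is /root/.cache.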
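The cache_dir option mentioned above can be wrapped in a small helper that downloads both models into one chosen directory. This is a sketch under assumptions: download_models and its fetch parameter are hypothetical names introduced here so that the helper can be tried without network access, and the real call is modelscope's snapshot_download with its cache_dir argument.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

MODEL_IDS = (
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "Qwen/Qwen1.5-1.8B-Chat",
)


def download_models(
    model_ids: Iterable[str],
    cache_dir: str,
    fetch: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Download each model into cache_dir and map model id -> local directory."""
    if fetch is None:
        # Imported lazily so the helper can be exercised without modelscope installed.
        from modelscope import snapshot_download as fetch
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    return {model_id: fetch(model_id, cache_dir=cache_dir) for model_id in model_ids}
```

For example, download_models(MODEL_IDS, cache_dir="./models") would place both models under a local models directory (that path is only an illustration). The returned directories are the values to put into the model config files later.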
Download complete (Qwen model):
Downloading [model.safetensors]: 100%|███████████████████████████████████████████| 3.42G/3.42G [1:12:28<00:00, 845kB/s]
Processing 1 items: 100%|████████████████████████████████████████████████████████| 1.00/1.00 [1:12:31<00:00, 4.35ks/it]
2025-02-06 11:20:18,570 - modelscope - INFO - Download model 'Qwen/Qwen1.5-1.8B-Chat' successfully.
2025-02-06 11:20:18,571 - modelscope - INFO - Creating symbolic link [C:\Users\gyton\.cache\modelscope\hub\Qwen\Qwen1.5-1.8B-Chat].
(opencompass) PS G:\damoxing\opencompass>
Downloading the DeepSeek model:
(opencompass) PS G:\damoxing\opencompass> python .\download.py
Downloading Model to directory: C:\Users\gyton\.cache\modelscope\hub\deepseek-ai\DeepSeek-R1-Distill-Qwen-1.5B
2025-02-06 00:56:53,900 - modelscope - INFO - Got 9 files, start to download ...
Downloading [configuration.json]: 100%|███████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 220B/s]
Downloading [README.md]: 100%|████████████████████████████████████████████████████| 18.5k/18.5k [00:00<00:00, 55.0kB/s]
Downloading [generation_config.json]: 100%|█████████████████████████████████████████████| 181/181 [00:00<00:00, 519B/s]
Downloading [LICENSE]: 100%|██████████████████████████████████████████████████████| 1.04k/1.04k [00:00<00:00, 2.89kB/s]
Downloading [config.json]: 100%|██████████████████████████████████████████████████████| 679/679 [00:00<00:00, 1.82kB/s]
Downloading [tokenizer_config.json]: 100%|████████████████████████████████████████| 2.99k/2.99k [00:00<00:00, 8.78kB/s]
Downloading [figures/benchmark.jpg]: 100%|███████████████████████████████████████████| 759k/759k [00:05<00:00, 132kB/s]
Downloading [tokenizer.json]: 100%|████████████████████████████████████████████████| 6.71M/6.71M [00:21<00:00, 328kB/s]
Downloading [model.safetensors]: 100%|███████████████████████████████████████████| 3.31G/3.31G [1:43:02<00:00, 575kB/s]
Processing 9 items: 100%|██████████████████████████████████████████████████████████| 9.00/9.00 [1:43:02<00:00, 687s/it]
2025-02-06 02:39:56,779 - modelscope - INFO - Download model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B' successfully.
2025-02-06 02:39:56,779 - modelscope - INFO - Creating symbolic link [C:\Users\gyton\.cache\modelscope\hub\deepseek-ai\DeepSeek-R1-Distill-Qwen-1.5B].
(opencompass) PS G:\damoxing\opencompass>
4. Evaluation
Replace the model after --models with your own model, and use gsm8k as the dataset.
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
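Passing the local model path directly to --models fails. The model cannot be found because --models is matched against config file names, so a config file has to be edited by hand.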
```python
(opencompass) PS G:\damoxing\opencompass> opencompass --models C:\Users\gyton\.cache\modelscope\hub\deepseek-ai\DeepSeek-R1-Distill-Qwen-1___5B --datasets demo_gsm8k_chat_gen
signal.SIGALRM is not available on this platform
signal.SIGALRM is not available on this platform
02/06 09:00:03 - OpenCompass - INFO - Loading demo_gsm8k_chat_gen: C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\configs\./datasets\demo\demo_gsm8k_chat_gen.py
Traceback (most recent call last):
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\gyton\anaconda3\envs\opencompass\Scripts\opencompass.exe\__main__.py", line 7, in <module>
sys.exit(main())
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\cli\main.py", line 227, in main
cfg = get_config_from_arg(args)
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\utils\run.py", line 168, in get_config_from_arg
for model in match_cfg_file(models_dir, [model_arg]):
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\utils\run.py", line 70, in match_cfg_file
raise ValueError(err_msg)
ValueError: The provided pattern matches 0 or more than one config. Please verify your pattern and try again. You may use tools/list_configs.py to list or locate the configurations.
+----------------------------------------------------------------------------------+
| Not matched patterns |
|----------------------------------------------------------------------------------|
| C:\Users\gyton\.cache\modelscope\hub\deepseek-ai\DeepSeek-R1-Distill-Qwen-1___5B |
+----------------------------------------------------------------------------------+
(opencompass) PS G:\damoxing\opencompass>
```
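The fix is to edit a model config file. Look for DeepSeek under the configs directory. There is no config for the downloaded 1.5B model, so pick one with a similar name (here hf_deepseek_7b_chat.py) and change its path.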
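The error message above points to tools/list_configs.py for locating configs. As a rough stand-in, the lookup can also be sketched with the standard library. The function find_model_configs is a hypothetical helper, and the models/<family>/<name>.py layout is assumed from the paths printed in the logs.

```python
from pathlib import Path
from typing import List


def find_model_configs(configs_dir: Path, keyword: str) -> List[str]:
    """Return config names (file stems) under configs_dir/models that contain keyword."""
    models_dir = configs_dir / "models"
    return sorted(
        path.stem
        for path in models_dir.rglob("*.py")
        if keyword.lower() in path.stem.lower()
    )
```

Called as find_model_configs(Path("opencompass/configs"), "deepseek") from the repository root, this should list names that can be passed to --models, such as hf_deepseek_7b_chat.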
Before:
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='deepseek-7b-chat-hf',
        path='deepseek-ai/deepseek-llm-7b-chat',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
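After:
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        # abbr is left unchanged, so results are reported under this label
        abbr='deepseek-7b-chat-hf',
        # Raw string: in a normal string literal, '\U' in a Windows path is a syntax error
        path=r'C:\Users\gyton\.cache\modelscope\hub\deepseek-ai\DeepSeek-R1-Distill-Qwen-1___5B',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]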
Then edit the Qwen config file in the same way:
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='qwen1.5-1.8b-chat-hf',
        # Local model directory on the Linux server
        path='/root/autodl-tmp/llm/Qwen/Qwen1___5-1___8B-Chat',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['<|im_end|>', '<|im_start|>'],
    )
]
The configs directory already contains a config for this model (hf_qwen1_5_1_8b_chat.py), so only the path needs to change.
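Since the config now points at a local directory, a quick check that the directory really holds the downloaded files can save a failed run. The sketch below is only an illustration: missing_model_files is a hypothetical helper, and the expected file names are taken from the download log above rather than from any official requirement.

```python
from pathlib import Path
from typing import List

# Files seen in the download log above; not an exhaustive or official list.
EXPECTED_FILES = ("config.json", "tokenizer_config.json", "model.safetensors")


def missing_model_files(model_dir: str) -> List[str]:
    """List expected files absent from model_dir (all of them if it does not exist)."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from missing_model_files('/root/autodl-tmp/llm/Qwen/Qwen1___5-1___8B-Chat') would suggest the path in the config is usable. On Windows, pass the path as a raw string (r'C:\...') so that backslashes are not treated as escapes.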
Running the same command locally (on Windows) raised the following error:
(opencompass) PS G:\damoxing\opencompass> opencompass --models hf_deepseek_7b_chat --datasets demo_gsm8k_chat_gen
signal.SIGALRM is not available on this platform
signal.SIGALRM is not available on this platform
02/06 09:21:28 - OpenCompass - INFO - Loading demo_gsm8k_chat_gen: C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\configs\./datasets\demo\demo_gsm8k_chat_gen.py
02/06 09:21:28 - OpenCompass - INFO - Loading hf_deepseek_7b_chat: C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\configs\./models\deepseek\hf_deepseek_7b_chat.py
02/06 09:21:28 - OpenCompass - INFO - Loading example: C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\configs\./summarizers\example.py
02/06 09:21:28 - OpenCompass - INFO - Current exp folder: outputs\default\20250206_092128
02/06 09:21:28 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
02/06 09:21:28 - OpenCompass - INFO - Partitioned into 1 tasks.
0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\gyton\anaconda3\envs\opencompass\Scripts\opencompass.exe\__main__.py", line 7, in <module>
sys.exit(main())
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\cli\main.py", line 308, in main
runner(tasks)
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\runners\base.py", line 39, in __call__
status_list = list(status) # change into list format
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\concurrent\futures\_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\concurrent\futures\_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\concurrent\futures\_base.py", line 451, in result
return self.__get_result()
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\concurrent\futures\_base.py", line 403, in __get_result
raise self._exception
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Users\gyton\anaconda3\envs\opencompass\lib\site-packages\opencompass\runners\local.py", line 166, in submit
assert len(gpus) >= num_gpus
AssertionError
0%| | 0/1 [00:00<?, ?it/s]
(opencompass) PS G:\damoxing\opencompass>
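After moving to a server, both evaluations ran normally. The failed assertion, len(gpus) >= num_gpus, means fewer GPUs were visible than the num_gpus=1 requested in run_cfg, so the local failure was probably caused by PyTorch not detecting a CUDA GPU, for example because of a mismatched torch build.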
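Before launching an evaluation, the GPU condition behind that assertion can be checked directly. This is a minimal sketch, not OpenCompass's own logic: the function names are made up for the example, and it only asks PyTorch how many CUDA devices it can see.

```python
def visible_gpu_count() -> int:
    """Return how many CUDA devices PyTorch can see; 0 if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()


def can_run(num_gpus_required: int) -> bool:
    """Approximate the failed check: enough visible GPUs for the config's num_gpus."""
    return visible_gpu_count() >= num_gpus_required


if __name__ == "__main__":
    print(f"visible GPUs: {visible_gpu_count()}")
    print(f"enough for num_gpus=1: {can_run(1)}")
```

If this reports zero GPUs on a machine that has one, reinstalling a CUDA-enabled PyTorch build that matches the driver is a reasonable first thing to try.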
Evaluation results on the server
DeepSeek results:
(opencompass) root@autodl-container-dec7479388-77891342:~/autodl-tmp/opencompass# opencompass --models hf_deepseek_7b_chat --datasets demo_gsm8k_chat_gen
02/06 13:25:12 - OpenCompass - INFO - Loading demo_gsm8k_chat_gen: /root/autodl-tmp/opencompass/opencompass/configs/./datasets/demo/demo_gsm8k_chat_gen.py
02/06 13:25:12 - OpenCompass - INFO - Loading hf_deepseek_7b_chat: /root/autodl-tmp/opencompass/opencompass/configs/./models/deepseek/hf_deepseek_7b_chat.py
02/06 13:25:12 - OpenCompass - INFO - Loading example: /root/autodl-tmp/opencompass/opencompass/configs/./summarizers/example.py
02/06 13:25:12 - OpenCompass - INFO - Current exp folder: outputs/default/20250206_132512
02/06 13:25:12 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
02/06 13:25:12 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLInfer[deepseek-7b-chat-hf/demo_gsm8k] on GPU 0
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:10<00:00, 130.55s/it]
02/06 13:27:22 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLEval[deepseek-7b-chat-hf/demo_gsm8k] on CPU
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00, 9.24s/it]
dataset     version    metric    mode      deepseek-7b-chat-hf
----------  ---------  --------  ------  ---------------------
demo_gsm8k  1d7fe4     accuracy  gen                     40.62
02/06 13:27:32 - OpenCompass - INFO - write summary to /root/autodl-tmp/opencompass/outputs/default/20250206_132512/summary/summary_20250206_132512.txt
02/06 13:27:32 - OpenCompass - INFO - write csv to /root/autodl-tmp/opencompass/outputs/default/20250206_132512/summary/summary_20250206_132512.csv
The markdown format results is as below:
| dataset | version | metric | mode | deepseek-7b-chat-hf |
|----- | ----- | ----- | ----- | -----|
| demo_gsm8k | 1d7fe4 | accuracy | gen | 40.62 |
Qwen results:
(opencompass) root@autodl-container-dec7479388-77891342:~/autodl-tmp# opencompass --models hf_qwen1_5_1_8b_chat --datasets demo_gsm8k_chat_gen
02/06 13:28:33 - OpenCompass - INFO - Loading demo_gsm8k_chat_gen: /root/autodl-tmp/opencompass/opencompass/configs/./datasets/demo/demo_gsm8k_chat_gen.py
02/06 13:28:33 - OpenCompass - INFO - Loading hf_qwen1_5_1_8b_chat: /root/autodl-tmp/opencompass/opencompass/configs/./models/qwen/hf_qwen1_5_1_8b_chat.py
02/06 13:28:33 - OpenCompass - INFO - Loading example: /root/autodl-tmp/opencompass/opencompass/configs/./summarizers/example.py
02/06 13:28:33 - OpenCompass - INFO - Current exp folder: outputs/default/20250206_132833
02/06 13:28:33 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
02/06 13:28:33 - OpenCompass - INFO - Try to load the data from /root/.cache/opencompass/./data/gsm8k/
02/06 13:28:33 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLInfer[qwen1.5-1.8b-chat-hf/demo_gsm8k] on GPU 0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:32<00:00, 92.35s/it]
02/06 13:30:06 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLEval[qwen1.5-1.8b-chat-hf/demo_gsm8k] on CPU
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.15s/it]
dataset     version    metric    mode      qwen1.5-1.8b-chat-hf
----------  ---------  --------  ------  ----------------------
demo_gsm8k  1d7fe4     accuracy  gen                      26.56
02/06 13:30:16 - OpenCompass - INFO - write summary to /root/autodl-tmp/outputs/default/20250206_132833/summary/summary_20250206_132833.txt
02/06 13:30:16 - OpenCompass - INFO - write csv to /root/autodl-tmp/outputs/default/20250206_132833/summary/summary_20250206_132833.csv
The markdown format results is as below:
| dataset | version | metric | mode | qwen1.5-1.8b-chat-hf |
|----- | ----- | ----- | ----- | -----|
| demo_gsm8k | 1d7fe4 | accuracy | gen | 26.56 |
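DeepSeek-R1-Distill-Qwen-1.5B scores 40.62 on demo_gsm8k, well above the 26.56 of Qwen1.5-1.8B-Chat. The DeepSeek column is labeled deepseek-7b-chat-hf only because the abbr field of the borrowed config was left unchanged.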
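To compare runs without reading the logs by eye, the markdown tables printed at the end of each run can be parsed. The sketch below assumes a single data row with the score in the last column, as in the two tables above; parse_markdown_scores is a hypothetical helper, not an OpenCompass function.

```python
from typing import Dict


def parse_markdown_scores(table: str) -> Dict[str, float]:
    """Map the model column of an OpenCompass-style markdown table to its score.

    Assumes a single data row and that the last column holds the model's score.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
    ]
    header, data = rows[0], rows[2]  # rows[1] is the |-----| separator
    return {header[-1]: float(data[-1])}


deepseek_table = """
| dataset | version | metric | mode | deepseek-7b-chat-hf |
|----- | ----- | ----- | ----- | -----|
| demo_gsm8k | 1d7fe4 | accuracy | gen | 40.62 |
"""
qwen_table = """
| dataset | version | metric | mode | qwen1.5-1.8b-chat-hf |
|----- | ----- | ----- | ----- | -----|
| demo_gsm8k | 1d7fe4 | accuracy | gen | 26.56 |
"""

scores = {**parse_markdown_scores(deepseek_table), **parse_markdown_scores(qwen_table)}
gap = scores["deepseek-7b-chat-hf"] - scores["qwen1.5-1.8b-chat-hf"]
print(f"gap: {gap:.2f} points")  # prints "gap: 14.06 points"
```

The two scores are consistent with 26 and 17 correct answers out of 64 questions (26/64 = 40.625%, 17/64 = 26.5625%). That is an inference from the numbers, not something the logs state, but it suggests the demo dataset is small and the gap should be read with that in mind.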
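5. Summary
As an open-source framework, OpenCompass provides a complete workflow for evaluating large language models. It combines dataset management, configurable model definitions, and evaluation, and it is highly extensible and flexible. With OpenCompass, developers can carry out language model evaluation work more efficiently. With more time, you can also download larger models with tens or hundreds of billions of parameters. You can evaluate your own fine-tuned models, using public benchmark datasets or custom datasets.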