Today I stumbled across an article from three years ago, titled "A single NVIDIA 3080 takes on an 18-billion-parameter model: this domestic open-source project has gone wild." I clicked through, was impressed, and went straight off to look for the GitHub repo. Then I checked the date: published in 2022. How had I never heard of it?
Maybe I'm just out of the loop, but something this remarkable should have been on my radar, or at the very least PaddlePaddle and MindSpore should have borrowed from it by now.
My initial take: PyTorch itself is not particularly fast; it is built for quick development rather than quick training. That leaves DeepSpeed and Colossal-AI room to speed things up, on the order of 50%-200%.
Colossal-AI also uses techniques such as multi-dimensional tensor parallelism to reduce per-GPU memory pressure, aiming to train models roughly 10x larger in parameter count. My feeling is that for companies training large models, memory pressure is less conspicuous than compute pressure, and domestic AI frameworks such as PaddlePaddle and MindSpore have adopted similar techniques, which may be why this striking project is not widely known (that is, compared with DeepSeek or Manus, it is not being hyped every single day).
Still, a project like this has a clear reason to exist: it lets projects on the PyTorch track train faster and fit larger models.
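To make that concrete, here is a minimal sketch of how Colossal-AI wraps an ordinary PyTorch training step through its Booster/plugin interface, as I understand the documented API. Everything in it is a placeholder (a toy linear model, the TorchDDPPlugin choice); the point is just that the loop stays plain PyTorch while the plugin decides the parallel and memory strategy:
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# Must be launched via `colossalai run` / torchrun so the distributed env vars exist.
# (Older Colossal-AI versions required colossalai.launch_from_torch(config={}).)
colossalai.launch_from_torch()

# Toy model/optimizer/criterion, purely for illustration.
model = torch.nn.Linear(32, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# The plugin determines the strategy; swap in e.g. GeminiPlugin for memory-heavy models.
booster = Booster(plugin=TorchDDPPlugin())
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

# From here on it is an ordinary training step, run under the chosen strategy.
x = torch.randn(8, 32).cuda()
y = torch.randint(0, 10, (8,)).cuda()
loss = criterion(model(x), y)
booster.backward(loss, optimizer)
optimizer.step()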
GitHub repo: https://github.com/hpcaitech/ColossalAI
Clone URL: https://github.com/hpcaitech/ColossalAI.git
Hands-on
Installation
pip install colossalai
Or install from source
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
BUILD_EXT=1 pip install .
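A quick way to confirm the install worked (just an import check; the BUILD_EXT=1 step pre-builds the CUDA kernels, and as far as I know they are otherwise built at runtime when first needed):
# Minimal post-install check: the package imports and reports a version.
import colossalai
print(colossalai.__version__)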
Testing on Kaggle
Create a notebook
Download the source code
git clone https://github.com/hpcaitech/ColossalAI.git
Install the dependencies
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
Run training
When training starts, the CIFAR-10 dataset is downloaded automatically.
cd ColossalAI/examples/images/resnet && colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32
Training log: P100, 1 process
Epoch [1/80]: 100%|██████████| 500/500 [00:24<00:00, 20.83it/s, loss=1.38]
Epoch [2/80]: 100%|██████████| 500/500 [00:23<00:00, 21.55it/s, loss=1.13]
Epoch [3/80]: 89%|████████▉ | 447/500 [00:20<00:02, 21.20it/s, loss=1.15]
Training log: dual T4, 1 process
Epoch [1/80]: 100%|██████████| 500/500 [00:27<00:00, 18.05it/s, loss=1.38]
Epoch [2/80]: 100%|██████████| 500/500 [00:27<00:00, 18.50it/s, loss=1.15]
Epoch [3/80]: 100%|██████████| 500/500 [00:26<00:00, 18.69it/s, loss=0.952]
Training log: dual T4, 2 processes
Epoch [1/80]: 100%|██████████| 250/250 [00:21<00:00, 11.48it/s, loss=1.31]
Epoch [2/80]: 100%|██████████| 250/250 [00:20<00:00, 12.20it/s, loss=1.12]
Epoch [3/80]: 100%|██████████| 250/250 [00:19<00:00, 12.53it/s, loss=0.938]
Reading the logs, the P100 is actually a bit faster per iteration than a single T4, but with two T4s running data-parallel (250 steps per process instead of 500) the wall-clock time per epoch comes out shortest, as the quick estimate below shows.
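A back-of-the-envelope check of the per-epoch wall-clock time, using the steps-per-epoch and it/s numbers read straight off the logs above (rough values, not a proper benchmark):
# Estimate seconds per epoch = steps per epoch / iterations per second (values from the logs).
configs = {
    "P100, 1 process": (500, 21.0),
    "dual T4, 1 process": (500, 18.4),
    "dual T4, 2 processes": (250, 12.1),
}
for name, (steps, ips) in configs.items():
    print(f"{name}: ~{steps / ips:.0f} s/epoch")
# Roughly 24 s, 27 s and 21 s respectively.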
Large-model test (not successful)
Download a DeepSeek model with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
# Specify the model name
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Quick generation test
input_text = "Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
# Decode the output
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
Download the model directly with huggingface-cli
!cd ColossalAI/examples/inference/llama && huggingface-cli download --resume-download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir .
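The same download can also be driven from Python with huggingface_hub, which is handy inside a notebook cell (the target directory below just mirrors the CLI command above):
from huggingface_hub import snapshot_download

# Pull the whole model repo into the example directory (downloads resume if interrupted).
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    local_dir="ColossalAI/examples/inference/llama",
)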
Inference
But I could not get inference working, and I don't understand why: with something like colossalai/grok-1-pytorch written as the model name, the model files just can't be fetched...
I wanted to look at examples on Kaggle, but the search only turned up 6 notebooks, one of which was my own. So maybe the real user base isn't as large as the GitHub star count suggests?
The commands I used:
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m $PATH_MODEL --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m HuggingFaceTB/SmolLM-135M-Instruct --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m "colossalai/llama-7b" --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 2 llama_generation.py -m lmsys/vicuna-7b-v1.5 --max_length 128 --tp_size 2
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 2 llama_generation.py -m vicuna --max_length 128 --tp_size 2
!cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m "HuggingFaceTB/SmolLM-135M-Instruct" --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py --drafter HuggingFaceTB/SmolLM-135M-Instruct --max_length 128
I tested several models: lmsys/vicuna-7b-v1.5, colossalai/llama-7b, HuggingFaceTB/SmolLM-135M-Instruct, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. Not a single one worked.
Using an absolute path didn't work either (note that -m below points at a single weight shard; the script presumably expects the model directory, e.g. .../llama/vicuna, rather than an individual .bin file; a quick sanity check follows the directory listing below):
!cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m /kaggle/working/ColossalAI/examples/inference/llama/vicuna/pytorch_model-00001-of-00002.bin --max_length 128
The directory does contain the files:
!cd ColossalAI/examples/inference/llama && ls vicuna -la
total 13161652
drwxr-xr-x 3 root root 4096 Mar 10 08:23 .
drwxr-xr-x 3 root root 4096 Mar 10 08:23 ..
drwxr-xr-x 3 root root 4096 Mar 10 08:22 .cache
-rw-r--r-- 1 root root 615 Mar 10 08:22 config.json
-rw-r--r-- 1 root root 162 Mar 10 08:22 generation_config.json
-rw-r--r-- 1 root root 1519 Mar 10 08:22 .gitattributes
-rw-r--r-- 1 root root 9976634558 Mar 10 08:23 pytorch_model-00001-of-00002.bin
-rw-r--r-- 1 root root 3500315539 Mar 10 08:22 pytorch_model-00002-of-00002.bin
-rw-r--r-- 1 root root 26788 Mar 10 08:22 pytorch_model.bin.index.json
-rw-r--r-- 1 root root 1965 Mar 10 08:22 README.md
-rw-r--r-- 1 root root 438 Mar 10 08:22 special_tokens_map.json
-rw-r--r-- 1 root root 749 Mar 10 08:22 tokenizer_config.json
-rw-r--r-- 1 root root 499723 Mar 10 08:22 tokenizer.model
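Before blaming the Colossal-AI script, one sanity check worth doing (a sketch only, and memory-hungry for a 7B model on Kaggle): load the ./vicuna directory with plain transformers to confirm the checkpoint itself is intact. Note that it is the directory containing config.json that gets passed, not an individual .bin shard:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pass the model *directory* (the one with config.json and the weight shards).
local_path = "ColossalAI/examples/inference/llama/vicuna"

tokenizer = AutoTokenizer.from_pretrained(local_path)
# torch_dtype="auto" keeps the weights in their stored precision instead of upcasting to fp32.
model = AutoModelForCausalLM.from_pretrained(local_path, torch_dtype="auto")
print(model.config.model_type)  # expected: "llama" for vicuna-7b-v1.5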
Fine-tuning (not attempted yet)
Summary
It is genuinely good work and well worth learning from.
Debugging
Training error
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 train.py -c ./ckpt-fp32 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Got it: no GPU was selected. Open Settings from the menu, choose Accelerator → P100, and attach the GPU.
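From inside the notebook this can be confirmed directly; it also tells you how many GPUs are visible, which --nproc_per_node must not exceed:
import torch

# False / 0 means no accelerator is attached to the Kaggle session.
print(torch.cuda.is_available())
print(torch.cuda.device_count())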
Error again
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 train.py -c ./ckpt-fp32 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Change the parameter: replace run --nproc_per_node 2 with run --nproc_per_node 1.
Training can proceed now.
Next, switch the accelerator to dual T4 so that 2 processes can run (--nproc_per_node 2).
Test passed.
Error when testing a large model
Root Cause (first observed failure):
[0]:
  time : 2025-03-10_06:34:56
  host : c19ec5c03aea
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 202)
  error_file: <N/A>
  traceback : To enable traceback see: Error Propagation — PyTorch 2.6 documentation
============================================================
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 -m vicuna llama_generation.py --max_length 128 --tp_size 2 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Couldn't get it to work, and I don't understand why. (One thing that stands out in the log above: the generated torchrun command shows -m vicuna placed before llama_generation.py, so torchrun may have treated -m as its own "run a module" flag instead of passing it to the script. That might be related, but I didn't confirm it.)
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m $PATH_MODEL --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m HuggingFaceTB/SmolLM-135M-Instruct --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 1 llama_generation.py -m "colossalai/llama-7b" --max_length 128
# !cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 2 llama_generation.py -m lmsys/vicuna-7b-v1.5 --max_length 128 --tp_size 2
!cd ColossalAI/examples/inference/llama && colossalai run --nproc_per_node 2 llama_generation.py -m vicuna --max_length 128 --tp_size 2
I tried many different model names and also downloaded the model into a local vicuna directory:
!cd ColossalAI/examples/inference/llama && huggingface-cli download --resume-download lmsys/vicuna-7b-v1.5 --local-dir ./vicuna
None of it worked.
rakeshchow202/deepseek-r1-llama-8b-pth