1. Download the code from GitHub
2. Create a virtual environment
The server runs Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-76-generic x86_64), with an A100 GPU (NVIDIA-SMI 510.54; Driver Version: 510.54; CUDA Version: 11.6).
conda create -n GLM python=3.8
3. Install the dependencies with pip
pip install -i https://pypi.org/simple -r requirements.txt
# For users in mainland China, use the Aliyun mirror (TUNA and other mirrors have had sync problems recently):
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
4. Unzip fewshot-data.zip
Then run the following command:
bash finetune/finetune_visualglm.sh
Running it produced an error:
The message indicates that the GPU driver is too old to work with the installed PyTorch build. Either update the GPU driver to match PyTorch, or install a PyTorch build that matches the driver.
Run:
conda list
to check the PyTorch version in the current conda environment:
The matching PyTorch and CUDA versions are listed below:
Since our CUDA version is 11.6, we install the corresponding PyTorch build.
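The pairings used in this walkthrough can be sketched as a small lookup. Note this is only a partial table covering the versions mentioned here, not the full official compatibility matrix:

```python
# Partial PyTorch-to-CUDA build pairings used in this walkthrough; see the
# official PyTorch "previous versions" page for the complete matrix.
TORCH_CUDA_BUILDS = {
    "1.8.2": ["11.1"],
    "1.9.0": ["11.1"],
    "1.13.1": ["11.6", "11.7"],
}

def builds_for(cuda_version):
    """Return the PyTorch versions above that ship a build for this CUDA version."""
    return [torch_v for torch_v, cudas in TORCH_CUDA_BUILDS.items()
            if cuda_version in cudas]

print(builds_for("11.6"))  # ['1.13.1']
```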
Go to the PyTorch website:
# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
After the installation succeeded, running the fine-tuning command again produced the following error:
Traceback (most recent call last):
File "/opt/conda/envs/GLM/bin/deepspeed", line 3, in <module>
from deepspeed.launcher.runner import main
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/ops/__init__.py", line 6, in <module>
from . import adam
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/ops/adam/__init__.py", line 6, in <module>
from .cpu_adam import DeepSpeedCPUAdam
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 8, in <module>
from deepspeed.utils import logger
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/utils/__init__.py", line 10, in <module>
from .groups import *
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/utils/groups.py", line 28, in <module>
from deepspeed import comm as dist
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/comm/__init__.py", line 7, in <module>
from .comm import *
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 34, in <module>
from deepspeed.utils import timer, get_caller_func
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/utils/timer.py", line 31, in <module>
class CudaEventTimer(object):
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/utils/timer.py", line 33, in CudaEventTimer
def __init__(self, start_event: get_accelerator().Event, end_event: get_accelerator().Event):
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/deepspeed/accelerator/real_accelerator.py", line 145, in get_accelerator
torch.mps.current_allocated_memory()
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/torch/mps/__init__.py", line 102, in current_allocated_memory
return torch._C._mps_currentAllocatedMemory()
AttributeError: module 'torch._C' has no attribute '_mps_currentAllocatedMemory'
This error is likely an incompatibility between DeepSpeed and the installed PyTorch version: in this build, the torch._C module has no _mps_currentAllocatedMemory attribute, so part of DeepSpeed fails to import. Downgrade PyTorch:
# CUDA 11.1
pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
Running the fine-tuning command again produced the following error:
Traceback (most recent call last):
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/requests/compat.py", line 11, in <module>
import chardet
ModuleNotFoundError: No module named 'chardet'
This error is caused by a missing module named chardet, a Python library for detecting character encodings, commonly used for encoding-related tasks. The fix is to install the missing module:
pip install chardet
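A quick sanity check that the module imports and works. The input bytes here are just an illustrative example; detect() returns a dict describing its best guess:

```python
import chardet

# detect() inspects a byte string and returns a dict with 'encoding',
# 'confidence', and (in recent versions) 'language' keys.
result = chardet.detect(b"hello, world")
print(result["encoding"], result["confidence"])
```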
After the installation, running the fine-tuning again failed with:
Unzipping /root/.sat_models/visualglm-6b.zip...
[2024-04-14 13:17:20,087] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
INFO:sat:[RANK 0] building FineTuneVisualGLMModel model ...
Traceback (most recent call last):
File "finetune_visualglm.py", line 178, in <module>
model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 215, in from_pretrained
return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 207, in from_pretrained_base
model = get_model(args, cls, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 412, in get_model
model = model_cls(args, params_dtype=params_dtype, **kwargs)
File "finetune_visualglm.py", line 13, in __init__
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
File "/root/HYJ/VisualGLM-6B-main/model/visualglm.py", line 34, in __init__
self.add_mixin("eva", ImageMixin(args))
File "/root/HYJ/VisualGLM-6B-main/model/visualglm.py", line 18, in __init__
self.model = BLIP2(args.eva_args, args.qformer_args)
File "/root/HYJ/VisualGLM-6B-main/model/blip2.py", line 56, in __init__
self.vit = EVAViT(EVAViT.get_args(**eva_args))
File "/root/HYJ/VisualGLM-6B-main/model/blip2.py", line 21, in __init__
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/official/vit_model.py", line 111, in __init__
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 92, in __init__
self.transformer = BaseTransformer(
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/transformer.py", line 464, in __init__
self.word_embeddings = torch.nn.Embedding(vocab_size, hidden_size, dtype=params_dtype, device=device)
TypeError: __init__() got an unexpected keyword argument 'dtype'
Traceback (most recent call last):
File "finetune_visualglm.py", line 178, in <module>
model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 215, in from_pretrained
return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 207, in from_pretrained_base
model = get_model(args, cls, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 412, in get_model
model = model_cls(args, params_dtype=params_dtype, **kwargs)
File "finetune_visualglm.py", line 13, in __init__
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
File "/root/HYJ/VisualGLM-6B-main/model/visualglm.py", line 34, in __init__
self.add_mixin("eva", ImageMixin(args))
File "/root/HYJ/VisualGLM-6B-main/model/visualglm.py", line 18, in __init__
self.model = BLIP2(args.eva_args, args.qformer_args)
File "/root/HYJ/VisualGLM-6B-main/model/blip2.py", line 56, in __init__
self.vit = EVAViT(EVAViT.get_args(**eva_args))
File "/root/HYJ/VisualGLM-6B-main/model/blip2.py", line 21, in __init__
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/official/vit_model.py", line 111, in __init__
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/base_model.py", line 92, in __init__
self.transformer = BaseTransformer(
File "/opt/conda/envs/GLM/lib/python3.8/site-packages/sat/model/transformer.py", line 464, in __init__
self.word_embeddings = torch.nn.Embedding(vocab_size, hidden_size, dtype=params_dtype, device=device)
TypeError: __init__() got an unexpected keyword argument 'dtype'
[2024-04-14 13:17:27,363] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40318
[2024-04-14 13:17:27,370] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40319
[2024-04-14 13:17:27,371] [ERROR] [launch.py:322:sigkill_handler] ['/opt/conda/envs/GLM/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=1', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '4', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
The pretrained weights download from the host successfully, but model initialization fails. Specifically, in this PyTorch version the torch.nn.Embedding constructor does not accept a dtype keyword argument.
The official PyTorch documentation:
Embedding — PyTorch 1.9.0 documentation
shows that the dtype keyword argument is supported from version 1.9.0, so upgrade PyTorch to 1.9.0:
# CUDA 11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
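After the upgrade, the factory keyword arguments can be sanity-checked directly; this is the same call shape that failed above. The sizes here are small illustrative values, not the model's real vocabulary or hidden sizes:

```python
import torch

# On PyTorch >= 1.9, nn.Embedding accepts the dtype (and device) factory
# kwargs that sat's BaseTransformer passes; on 1.8.x this same call raises
# TypeError: __init__() got an unexpected keyword argument 'dtype'.
emb = torch.nn.Embedding(100, 64, dtype=torch.float16)
print(emb.weight.dtype)  # torch.float16
```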
This raised a new error:
The error shows that line 118 of finetune_visualglm.py calls an encode method that the FakeTokenizer object does not have. encode normally converts text into model-ready input; the likely cause is that the FakeTokenizer was not initialized correctly or simply does not implement encode.
Locate the roughly 14 GB visualglm-6b.zip that was downloaded and unzip it. The extracted files include a model_config.json, which sets args.tokenizer_type='THUDM/chatglm-6b'; replacing that value with the local tokenizer path fixes the problem.
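That edit can also be scripted. The helper below is a minimal sketch; the commented-out paths are placeholders for wherever the archive was extracted and wherever the local chatglm-6b tokenizer lives:

```python
import json

def point_tokenizer_to_local(config_path, local_tokenizer_path):
    """Rewrite tokenizer_type in model_config.json to a local path."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Point tokenizer_type at the local copy instead of the HF repo id.
    config["tokenizer_type"] = local_tokenizer_path
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)

# Placeholder paths -- adjust to your own layout:
# point_tokenizer_to_local("./visualglm-6b/model_config.json",
#                          "/path/to/local/chatglm-6b")
```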
At this point the parameters load, but the following error appears:
Download chatglm-6b (THUDM/chatglm-6b at main (huggingface.co)) into the folder that contains visualglm-6b:
Running again produces the following error:
The installed transformers version probably does not match, leaving the tokenizer object with inconsistent attributes. Downgrade transformers:
pip install transformers==4.33.2 -i https://mirrors.aliyun.com/pypi/simple/
Running again raised another error:
Reinstall DeepSpeed:
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_FUSED_ADAM=1 pip3 install .
Before this rebuild, make sure the CUDA toolchain (nvcc) is on the PATH:
export PATH=/usr/local/cuda/bin:$PATH
Then run the fine-tuning command again; this time it succeeds!