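Preparation
Download the model. Here it is downloaded from the ModelScope community.
Download with Git
Make sure Git LFS is installed correctly.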
git lfs install
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3.1-8B-Instruct.git
Download the dataset
https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Extract the archive and place it in the same directory as the model files.
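The extracted dataset stores each review as a separate .txt file, grouped into pos and neg folders under train and test. The fine-tuning command later in this post reads a field named text, so it can help to see the reviews as records with that field. The following is a minimal sketch, not the author's method: the function names and the output file name are illustrative, and whether a given trl version accepts this JSONL file directly is an assumption to verify.

```python
import json
from pathlib import Path


def read_imdb_split(root, split="train"):
    """Collect labelled reviews from <root>/<split>/{neg,pos}/*.txt."""
    records = []
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # sorted() keeps the order stable across runs and file systems
        for path in sorted(folder.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            records.append({"text": text, "label": label})
    return records


def write_jsonl(records, out_path):
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, write_jsonl(read_imdb_split("aclImdb"), "imdb_train.jsonl") would write the labelled training reviews, assuming the archive unpacked into a directory named aclImdb.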
Install the required packages
pip install trl
pip install bitsandbytes
pip install accelerate
pip install peft
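Because these four packages are released independently, it is worth recording which versions were actually installed before starting a long run. A small sketch using only the standard library (the helper name installed_versions is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The four packages installed above
for name, ver in installed_versions(["trl", "bitsandbytes", "accelerate", "peft"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```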
Fine-tuning from the command line
Generate the config file
# accelerate config
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]:no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:3
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]: yes
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
Fine-tuning
Line-by-line form (one option per line)
trl sft \
--model_name_or_path LLM-Research/Meta-Llama-3___1-8B-Instruct \
--dataset_name aclImdb_v1 \
--dataset_text_field text \
--load_in_4bit \
--use_peft \
--max_seq_length 512 \
--learning_rate 0.001 \
--per_device_train_batch_size 2 \
--output_dir ./sft-imdb-llama3-8b \
--logging_steps 10
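When the same command is run repeatedly with different hyperparameters, it can be convenient to keep the options in one place and assemble the argument list in code. The sketch below only rebuilds the command shown above and prints it; the helper name build_sft_command is illustrative, and the option names are copied from the command rather than checked against any particular trl version.

```python
import shlex


def build_sft_command(options, flags=()):
    """Assemble the argument list for `trl sft` from options and bare flags."""
    argv = ["trl", "sft"]
    for key, value in options.items():
        argv += [f"--{key}", str(value)]
    for flag in flags:
        argv.append(f"--{flag}")
    return argv


options = {
    "model_name_or_path": "LLM-Research/Meta-Llama-3___1-8B-Instruct",
    "dataset_name": "aclImdb_v1",
    "dataset_text_field": "text",
    "max_seq_length": 512,
    "learning_rate": 0.001,
    "per_device_train_batch_size": 2,
    "output_dir": "./sft-imdb-llama3-8b",
    "logging_steps": 10,
}
flags = ["load_in_4bit", "use_peft"]

argv = build_sft_command(options, flags)
print(shlex.join(argv))
# To actually launch training: subprocess.run(argv, check=True)
```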
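This fine-tuning run used a single A100 GPU and took 61 hours; the duration is given for reference only.
Reference: https://blog.csdn.net/zhujiahui622/article/details/138308088
Problems encountered
1. Warnings appear; they can be ignored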
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
The first warning can be resolved by installing the missing package:
apt install libaio-dev
2. Fine-tuning fails with an error
W0730 06:27:45.490000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963586 closing signal SIGTERM
W0730 06:27:45.493000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963587 closing signal SIGTERM
W0730 06:27:45.495000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963589 closing signal SIGTERM
E0730 06:27:45.713000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 3963588) of binary: /home/test/anaconda3/bin/python
Traceback (most recent call last):
File "/home/test/anaconda3/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
multi_gpu_launcher(args)
File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
distrib_run.run(args)
File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/scripts/sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-30_06:27:45
host : node20
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3963588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[06:27:46] TRL - SFT failed on ! See the logs above for further details. cli.py:67
Traceback (most recent call last):
File "/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/cli.py", line 58, in main
subprocess.run(
File "/home/test/anaconda3/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/scripts/sft.py', '--model_name_or_path', 'LLM-Research/Meta-Llama-3___1-8B-Instruct', '--dataset_name', '/imdb', '--dataset_text_field', 'text', '--load_in_4bit', '--use_peft', '--max_seq_length', '512', '--learning_rate', '0.001', '--per_device_train_batch_size', '2', '--output_dir', './sft-imdb-llama3-8b', '--logging_steps', '10']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/test/anaconda3/bin/trl", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/cli.py", line 68, in main
raise ValueError("TRL CLI failed! Check the traceback above..") from exc
ValueError: TRL CLI failed! Check the traceback above..
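Re-running accelerate config to regenerate the configuration file and then retrying resolved the error. The traceback passes through multi_gpu_launcher and reports local_rank 2, so the failing run had been launched in multi-GPU mode, unlike the single-GPU configuration shown above.
If you are not sure what an option does, leave its value unchanged; the settings shown above are known to run.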
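Since the failing run went through multi_gpu_launcher, a quick way to confirm what accelerate will do before retrying is to inspect the saved configuration. The sketch below makes two assumptions that should be verified locally: that the file lives at ~/.cache/huggingface/accelerate/default_config.yaml, and that a non-distributed setup is stored as distributed_type: 'NO'. It uses a deliberately simple parser for top-level key: value lines so that no YAML library is required.

```python
from pathlib import Path

# Assumed default location of the file written by `accelerate config`
DEFAULT_CONFIG = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"


def read_flat_yaml(text):
    """Parse top-level `key: value` lines; nested blocks and comments are skipped."""
    settings = {}
    for line in text.splitlines():
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def is_non_distributed(settings):
    """True when the config asks for a single process with no distributed launcher."""
    return settings.get("distributed_type") == "NO"


if DEFAULT_CONFIG.exists():
    cfg = read_flat_yaml(DEFAULT_CONFIG.read_text(encoding="utf-8"))
    for key in ("distributed_type", "num_processes", "gpu_ids", "mixed_precision"):
        print(f"{key}: {cfg.get(key)}")
    print("non-distributed:", is_non_distributed(cfg))
else:
    print("No config found at", DEFAULT_CONFIG)
```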
Fine-tuning with LLaMA-Factory
Install LLaMA-Factory
https://github.com/hiyouga/LLaMA-Factory
Fine-tuning with the web UI
Fine-Tuning with LLaMA Board GUI
llamafactory-cli webui
Select the model, dataset, and fine-tuning parameters, then start training.
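A custom dataset generally has to be prepared in a format LLaMA-Factory understands and registered before it shows up in the dataset selector. The sketch below shows one possible framing, not necessarily what was used here: it converts review records of the assumed shape {"text": ..., "label": 0 or 1} into Alpaca-style instruction/input/output entries and builds a matching entry for data/dataset_info.json. The instruction wording, the dataset name imdb_sft, and the file name are all hypothetical, and the exact registration format should be checked against the LLaMA-Factory version in use.

```python
import json

# Hypothetical prompt text; choose wording that suits your task
INSTRUCTION = "Classify the sentiment of this movie review as positive or negative."


def to_alpaca(records):
    """Convert {"text", "label"} records into Alpaca-style examples."""
    return [
        {
            "instruction": INSTRUCTION,
            "input": record["text"],
            "output": "positive" if record["label"] == 1 else "negative",
        }
        for record in records
    ]


def dataset_info_entry(name, file_name):
    """Build the entry to merge into LLaMA-Factory's data/dataset_info.json."""
    return {name: {"file_name": file_name}}


sample = [{"text": "A wonderful film.", "label": 1}]
print(json.dumps(to_alpaca(sample), indent=2))
print(json.dumps(dataset_info_entry("imdb_sft", "imdb_sft.json"), indent=2))
```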
References:
https://www.cnblogs.com/hlgnet/articles/18148788
https://blog.csdn.net/u010438035/article/details/140326826?spm=1001.2014.3001.5502