[Supervised Fine-Tuning (SFT) of Meta's Open-Source LLM Meta-Llama-3.1]

Latest recommended article published on 2025-03-19 15:22:13

ziwend

Category: Large Models. Tags: llama, language models, artificial intelligence, natural language processing, python, open-source software

Article link: https://blog.csdn.net/u010438035/article/details/140920310

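Preparation

Download the model; here it is downloaded from the ModelScope community.

Download with Git
Make sure Git LFS is installed correctly before cloning.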

git lfs install
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3.1-8B-Instruct.git

Download the dataset

https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Extract the archive and place it in the same directory as the model files.
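
The fine-tuning command later passes `--dataset_text_field text`, so each training record needs a `text` field. The extracted archive stores one review per `.txt` file under `aclImdb/train/pos`, `aclImdb/train/neg`, and so on. If that raw layout is not picked up directly, one option is to flatten a split into a JSONL file first. The sketch below is only an illustration: the function name `imdb_to_jsonl` and the output file name are my own choices, not part of the original workflow.

```python
import json
from pathlib import Path


def imdb_to_jsonl(root, split, out_path):
    """Write one JSON object per review with "text" and "label" fields.

    Assumes the extracted layout <root>/<split>/{pos,neg}/*.txt.
    Returns the number of reviews written.
    """
    root = Path(root)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for label in ("pos", "neg"):
            # sorted() keeps the output order reproducible across runs
            for path in sorted((root / split / label).glob("*.txt")):
                text = path.read_text(encoding="utf-8").strip()
                record = {"text": text, "label": label}
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count
```

For example, `imdb_to_jsonl("aclImdb", "train", "imdb_train.jsonl")` would write the labeled training reviews to a single file; both paths are hypothetical and depend on where the archive was extracted.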

Install the required packages

pip install trl
pip install bitsandbytes
pip install accelerate
pip install peft
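
Before moving on, it can help to confirm that all four packages actually landed in the active environment and to note their versions, since the `trl` command-line flags have changed between releases. A minimal sketch using only the standard library (the helper name `installed_versions` is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


packages = ["trl", "bitsandbytes", "accelerate", "peft"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```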

Fine-tuning from the command line

Generate the config file

# accelerate config
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]:no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:3
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]: yes
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
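
The answers above are saved to a YAML file that later launches read. To double-check that the saved settings really describe a single-process, single-GPU run, something like the following can be used. It is a rough sketch under two assumptions that should be verified locally: that the file lives at the default path `~/.cache/huggingface/accelerate/default_config.yaml`, and that it uses the top-level keys `distributed_type`, `num_processes`, `gpu_ids` and `mixed_precision`. The parser is deliberately naive and only reads flat `key: value` lines; running `accelerate env` prints the same information without any code.

```python
from pathlib import Path

# Assumed default location of the file written by `accelerate config`.
DEFAULT_CONFIG = (
    Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
)

# Assumed key names; check them against the actual file.
KEYS = ("distributed_type", "num_processes", "gpu_ids", "mixed_precision")


def read_top_level(yaml_text):
    """Parse flat `key: value` lines; indented, list and comment lines are skipped."""
    settings = {}
    for line in yaml_text.splitlines():
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def summarize(yaml_text, keys=KEYS):
    """Return only the settings that matter for a single-GPU run."""
    settings = read_top_level(yaml_text)
    return {k: settings.get(k) for k in keys}


if DEFAULT_CONFIG.exists():
    print(summarize(DEFAULT_CONFIG.read_text(encoding="utf-8")))
else:
    print(f"no config found at {DEFAULT_CONFIG}")
```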

Fine-tuning

Line-by-line form (one argument per line)

trl sft \
--model_name_or_path LLM-Research/Meta-Llama-3___1-8B-Instruct \
--dataset_name aclImdb_v1 \
--dataset_text_field text \
--load_in_4bit \
--use_peft \
--max_seq_length 512 \
--learning_rate 0.001 \
--per_device_train_batch_size 2 \
--output_dir ./sft-imdb-llama3-8b \
--logging_steps 10
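
For repeated runs it may be more convenient to keep the arguments in one place and assemble the command from a script. The sketch below only wraps the exact flags shown above; `SFT_ARGS`, `build_command` and `run` are names I made up for illustration. The model and dataset paths are copied verbatim and may need adjusting, for instance `git clone` creates a directory named `Meta-Llama-3.1-8B-Instruct`.

```python
import shlex
import subprocess

# Values copied from the command above; None marks a bare switch.
SFT_ARGS = {
    "model_name_or_path": "LLM-Research/Meta-Llama-3___1-8B-Instruct",
    "dataset_name": "aclImdb_v1",
    "dataset_text_field": "text",
    "load_in_4bit": None,
    "use_peft": None,
    "max_seq_length": 512,
    "learning_rate": 0.001,
    "per_device_train_batch_size": 2,
    "output_dir": "./sft-imdb-llama3-8b",
    "logging_steps": 10,
}


def build_command(args):
    """Turn the argument dict into the argv list for `trl sft`."""
    cmd = ["trl", "sft"]
    for name, value in args.items():
        cmd.append(f"--{name}")
        if value is not None:
            cmd.append(str(value))
    return cmd


def run(args, dry_run=True):
    """Print the command and, unless dry_run is set, execute it."""
    cmd = build_command(args)
    print(shlex.join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(cmd, check=True)


run(SFT_ARGS)  # dry run: only prints the command
```

Calling `run(SFT_ARGS, dry_run=False)` would start the actual training.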

This fine-tuning run used a single A100 GPU and took 61 hours; the duration is given for reference.

Reference: https://blog.csdn.net/zhujiahui622/article/details/138308088

Problems encountered

1. Warnings are printed; they can be ignored

 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible

The first warning can be resolved by installing the corresponding package:
apt install libaio-dev

2. Fine-tuning fails with an error

W0730 06:27:45.490000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963586 closing signal SIGTERM
W0730 06:27:45.493000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963587 closing signal SIGTERM
W0730 06:27:45.495000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963589 closing signal SIGTERM
E0730 06:27:45.713000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 3963588) of binary: /home/test/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/test/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/scripts/sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-30_06:27:45
  host      : node20
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3963588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[06:27:46] TRL - SFT failed on ! See the logs above for further details.                                                                                                                                                                                               cli.py:67
Traceback (most recent call last):
  File "/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/cli.py", line 58, in main
    subprocess.run(
  File "/home/test/anaconda3/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/scripts/sft.py', '--model_name_or_path', 'LLM-Research/Meta-Llama-3___1-8B-Instruct', '--dataset_name', '/imdb', '--dataset_text_field', 'text', '--load_in_4bit', '--use_peft', '--max_seq_length', '512', '--learning_rate', '0.001', '--per_device_train_batch_size', '2', '--output_dir', './sft-imdb-llama3-8b', '--logging_steps', '10']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/test/anaconda3/bin/trl", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/cli.py", line 68, in main
    raise ValueError("TRL CLI failed! Check the traceback above..") from exc
ValueError: TRL CLI failed! Check the traceback above..
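
The summary above only reports which worker died (rank 2, exit code 1) and that no error file or traceback was captured; the underlying Python exception would have to be found earlier in that worker's own output. The log also shows the run going through `multi_gpu_launcher` with `local_rank: 2`, which suggests several processes were started, so it is worth comparing this against the saved accelerate config. When the full output is saved to a file, the "Root Cause" block can be pulled out with a small helper. This is only a sketch based on the log format shown here; the function name `root_cause` is hypothetical.

```python
import re

# Field names as they appear in the "Root Cause" block of the log above.
FIELD = re.compile(
    r"^\s*(time|host|rank|exitcode|error_file|traceback)\s*:\s*(.*)$"
)


def root_cause(log_text):
    """Return the fields listed under 'Root Cause (first observed failure)'."""
    fields = {}
    in_block = False
    for line in log_text.splitlines():
        if line.startswith("Root Cause"):
            in_block = True
            continue
        if in_block:
            if line.startswith("===="):
                break  # the closing rule ends the block
            match = FIELD.match(line)
            if match:
                fields[match.group(1)] = match.group(2).strip()
    return fields
```

A call such as `root_cause(open("train.log").read())` (with `train.log` standing in for wherever the output was saved) returns a dict with the host, rank and exit code of the first failing worker.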