May Day Llama 3 Super Class | Practice Notes: Fine-Tuning Llama3 for Multimodal Image Understanding with XTuner

王依博

Last modified: 2024-05-06 16:37:07

First published: 2024-05-06 16:05:51

Article link: https://blog.csdn.net/qq_45078025/article/details/138496202

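Summary (auto-generated by CSDN): This post describes how to fine-tune LLaVA with XTuner on top of an existing Llama3-8B-Instruct model, covering environment and model preparation, data processing, and the fine-tuning process, and shows that the fine-tuned model answers an image question that the pretrained one could not.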

Starting from Llama3-8B-Instruct and the Image Projector pretrained by the XTuner team, fine-tune your own multimodal image-text understanding model, LLaVA.

Course documentation: docs/llava.md in the SmartFlowAI/Llama3-Tutorial repository on GitHub.

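Environment, Model, and Data Preparation

1. Environment preparation

Reuse the environment, XTuner, and Llama3-Tutorial that were set up in the earlier lessons.

2. Model preparation

Llama3 weights: the Llama3-8B-Instruct model that was soft-linked in the earlier lessons.
Visual Encoder weights: the openai/clip-vit-large-patch14-336 weights that LLaVA requires (attached via a soft link).
Image Projector weights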
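
The notes say the weights were attached via soft links but do not show the commands. A minimal sketch follows, assuming the shared weights live under a directory held in `SHARE_DIR`. That location and the subpaths under it are hypothetical and need to be adjusted to the actual machine. Only the link names under `~/model` correspond to paths used later in this post.

```shell
# Hypothetical source location of the shared weights; adjust to your machine.
SHARE_DIR="${SHARE_DIR:-/root/share/new_models}"
# Directory the later commands read from (e.g. ~/model/llama3-llava-iter_2181.pth).
MODEL_DIR="${MODEL_DIR:-$HOME/model}"

mkdir -p "$MODEL_DIR"

# Visual Encoder: link the CLIP weights instead of copying them.
# -s: symbolic link, -f: replace an existing link, -n: do not follow an existing link
ln -sfn "$SHARE_DIR/openai/clip-vit-large-patch14-336" "$MODEL_DIR/clip-vit-large-patch14-336"

# Image Projector: the pretrained checkpoint that fine-tuning starts from.
ln -sfn "$SHARE_DIR/xtuner/llama3-llava-iter_2181.pth" "$MODEL_DIR/llama3-llava-iter_2181.pth"

# Show where each link points.
ls -l "$MODEL_DIR"
```

A soft link avoids duplicating several gigabytes of weights, but it breaks silently if the source path is wrong, so the `ls -l` output is worth checking before training.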

3. Data preparation

Fine-tuning

1. Launching training

Use XTuner to launch LLaVA training based on Llama3:

xtuner train ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py --work-dir ~/llama3_llava_pth --deepspeed deepspeed_zero2

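The first attempt failed because DeepSpeed has to be installed first; after installing it, I retried.

A 30% share of an A100 did not seem to have enough memory, so I retried with offload added, and training started successfully.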
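
The two retries above can be sketched as a small wrapper. This is an illustration under assumptions, not the exact commands the author ran: `train_llava` is a hypothetical helper, and "offload" is assumed to mean XTuner's built-in `deepspeed_zero2_offload` preset in place of `deepspeed_zero2`.

```shell
# Paths copied from the training command in this post.
CONFIG="$HOME/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py"
WORK_DIR="$HOME/llama3_llava_pth"

# train_llava is a hypothetical helper, not part of XTuner.
# Usage: train_llava <deepspeed-preset>
train_llava() {
  preset="${1:-deepspeed_zero2}"
  # Install DeepSpeed only if the current environment cannot import it yet.
  python -c "import deepspeed" >/dev/null 2>&1 || pip install deepspeed
  xtuner train "$CONFIG" --work-dir "$WORK_DIR" --deepspeed "$preset"
}
```

Calling `train_llava deepspeed_zero2_offload` would then relaunch the run with optimizer state offloaded to CPU memory, which trades some speed for lower GPU memory use.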

Training took roughly 4 hours.

Convert both the original image projector and the fine-tuned image projector to HuggingFace format:

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \
  ~/model/llama3-llava-iter_2181.pth \
  ~/llama3_llava_pth/pretrain_iter_2181_hf

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \
  ~/llama3_llava_pth/iter_1200.pth \
  ~/llama3_llava_pth/iter_1200_hf

2. Comparing results

Check how the models perform.

Question 1: Describe this image. Question 2: What is the equipment in the image?

Pretrained model

Fine-tuned model

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/iter_1200_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg
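
Only the command for the fine-tuned projector is shown above. A sketch of running the same comparison against both projectors follows, assuming the pretrained one is queried with the identical command and only `--llava` changed to the `pretrain_iter_2181_hf` directory produced by the conversion step. `chat_llava` is a hypothetical wrapper, not an XTuner command.

```shell
export MKL_SERVICE_FORCE_INTEL=1

# chat_llava is a hypothetical helper: same flags as the command above,
# with the converted projector directory passed as the first argument.
chat_llava() {
  xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
    --visual-encoder /root/model/clip-vit-large-patch14-336 \
    --llava "$1" \
    --prompt-template llama3_chat \
    --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg
}
```

With this in place, `chat_llava /root/llama3_llava_pth/pretrain_iter_2181_hf` would start a session with the pretrained projector and `chat_llava /root/llama3_llava_pth/iter_1200_hf` one with the fine-tuned projector, so the same two questions can be asked in each.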

The original (pretrained) model could not answer the second question; after fine-tuning, the model could.
