Llama3实践教程（InternStudio 版）

Hello2Bonjour

已于 2024-05-16 13:34:58 修改

阅读量1.1k

点赞数 27

文章标签： llama

于 2024-05-07 10:38:23 首次发布

本文链接：https://blog.csdn.net/Hello2Bonjour/article/details/138522446

版权

Llama3实践教程（InternStudio 版）

本实践教程包括：

Llama 3 Web Demo 部署
XTuner 微调 Llama3 个人小助手认知
LMDeploy 高效部署 Llama3 实践
XTuner 微调 Llama3 图片理解多模态
Llama 3 Agent 能力体验+微调

一、Llama 3 Web Demo 部署

前提：本人使用InternStudio平台的30%A100完成以下实践

1.1 环境配置

conda create -n llama3 python=3.10
conda activate llama3
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

在这里插入图片描述

1.2 下载模型

新建文件夹

mkdir -p ~/model
cd ~/model

软链接 InternStudio 中的模型

ln -s /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct ~/model/Meta-Llama-3-8B-Instruct

在这里插入图片描述

1.3 Web Demo 部署

cd ~
git clone https://github.com/SmartFlowAI/Llama3-Tutorial

安装 XTuner 时会自动安装其他依赖

cd ~
git clone -b v0.1.18 https://github.com/InternLM/XTuner
cd XTuner
pip install -e .

运行 web_demo.py

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \
  ~/model/Meta-Llama-3-8B-Instruct

在这里插入图片描述

Vscode转发端口配置后，打开localhost+转发端口，等待大模型加载成功后即可进行问答
在这里插入图片描述

二、XTuner 微调 Llama3 个人小助手认知

2.1 自我认知训练数据集准备

激活llama3环境后，运行训练数据集脚本

cd ~/Llama3-Tutorial
python tools/gdata.py

以上脚本在生成了 ~/Llama3-Tutorial/data/personal_assistant.json 数据文件格式如下所示：
在这里插入图片描述

2.2 XTuner配置文件准备

已经修改好了configs/assistant/llama3_8b_instruct_qlora_assistant.py

数据集地址修改

在这里插入图片描述

2.3 训练模型

cd ~/Llama3-Tutorial

# 开始训练,使用 deepspeed 加速，A100 40G显存 耗时24分钟
xtuner train configs/assistant/llama3_8b_instruct_qlora_assistant.py --work-dir /root/llama3_pth

# Adapter PTH 转 HF 格式
xtuner convert pth_to_hf /root/llama3_pth/llama3_8b_instruct_qlora_assistant.py \
  /root/llama3_pth/iter_500.pth \
  /root/llama3_hf_adapter

# 模型合并
export MKL_SERVICE_FORCE_INTEL=1
xtuner convert merge /root/model/Meta-Llama-3-8B-Instruct \
  /root/llama3_hf_adapter\
  /root/llama3_hf_merged

在这里插入图片描述

2.4 推理验证

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \
  /root/llama3_hf_merged

此时已经拥有了 SmartFlowAI 打造的人工智能助手。
在这里插入图片描述

三、LMDeploy 高效部署 Llama3 实践

3.1 环境，模型准备

3.1.1 环境配置 lmdeploy

# 如果你是InternStudio 可以直接使用
# studio-conda -t lmdeploy -o pytorch-2.1.2
# 初始化环境
conda create -n lmdeploy python=3.10
conda activate lmdeploy
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

安装lmdeploy最新版。

pip install -U lmdeploy[all]

软链接 InternStudio 中的模型在1.2中已经进行，此处可不进行

3.2 LMDeploy Chat CLI 工具

直接在终端运行

conda activate lmdeploy
lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct

运行结果是：
在这里插入图片描述

3.3 LMDeploy模型量化(lite)

3.3.1 设置最大KV Cache缓存大小

3.3.1.1 实验一设置KV缓存占比0.8

模型在运行时的显存占用可分为：模型参数本身占用的显存、KV Cache占用的显存，以及中间运算结果占用的显存。LMDeploy的KV Cache管理器可以通过设置–cache-max-entry-count参数，控制KV缓存占用剩余显存的最大比例。默认的比例为0.8。

下面调整--cache-max-entry-count参数进行对比实验。

首先保持不加该参数（默认0.8），运行 Llama3-8b 模型。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.8

终端运行查看显存占用情况

# 如果你是InternStudio 就使用
# studio-smi
nvidia-smi

在这里插入图片描述

此时模型的占用为23133M。

3.3.1.2 实验二设置KV缓存占比0.5

下面，改变--cache-max-entry-count参数，设为0.5。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.5

在这里插入图片描述

看到显存占用明显降低，变为20509M。
在这里插入图片描述

3.3.1.3 实验三设置KV缓存占比0.01

把--cache-max-entry-count参数设置为0.01，约等于禁止KV Cache占用显存。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.01

在这里插入图片描述

此时显存占用仅为40388 - 24183 = 16205M（此时有其他应程序占用显存），此时模型推理速度降低。

3.3.2 使用W4A16量化

仅需执行一条命令，就可以完成模型量化工作。

lmdeploy lite auto_awq \
   /root/model/Meta-Llama-3-8B-Instruct \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/model/Meta-Llama-3-8B-Instruct_4bit

量化工作结束后，新的HF模型被保存到Meta-Llama-3-8B-Instruct_4bit目录。下面使用Chat功能运行W4A16量化后的模型。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct_4bit --model-format awq

将KV Cache比例再次调为0.01，查看显存占用情况。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct_4bit --model-format awq --cache-max-entry-count 0.01

在这里插入图片描述

可以看到，显存占用变为30734 - 24183 = 6551MB，明显降低。

3.3.3 在线量化 KV

自 v0.4.0 起，LMDeploy KV 量化方式有原来的离线改为在线。并且，支持两种数值精度 int4、int8。

3.4 LMDeploy服务（serve）

在前面的章节，我们都是在本地直接推理大模型，这种方式成为本地部署。在生产环境下，我们有时会将大模型封装为 API 接口服务，供客户端访问。

3.4.1 启动API服务器

通过以下命令启动API服务器，推理Meta-Llama-3-8B-Instruct模型：

lmdeploy serve api_server \
    /root/model/Meta-Llama-3-8B-Instruct \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

其中，model-format、quant-policy这些参数是与量化推理模型一致的；server-name和server-port表示API服务器的服务IP与服务端口；tp参数表示并行数量（GPU数量）。
在这里插入图片描述

通过运行以上指令，我们成功启动了API服务器，请勿关闭该窗口，后面我们要新建客户端连接该服务。
你也可以直接打开http://{host}:23333查看接口的具体使用说明，如下图所示。
在这里插入图片描述

这一步由于Server在远程服务器上，所以本地需要做一下ssh转发才能直接访问。在你本地打开一个cmd窗口，输入命令如下：

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 你的ssh端口号

然后打开浏览器，访问http://127.0.0.1:23333。

3.4.2 命令行客户端连接API服务器

在“4.1”中，我们在终端里新开了一个API服务器。
本节中，我们要新建一个命令行客户端去连接API服务器。首先通过VS Code新建一个终端：
激活conda环境

conda activate lmdeploy

运行命令行客户端：

lmdeploy serve api_client http://localhost:23333

运行后，可以通过命令行窗口直接与模型对话

3.4.3 网页客户端连接API服务器

服务器端不关闭，打开另一个终端执行以下。

# 安装gradio
pip install gradio==3.50.2
# 打开conda环境
conda activate lmdeploy
# 启动Gradio网页客户端
lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

访问地址http://127.0.0.1:6006
在这里插入图片描述

四、XTuner 微调 Llama3 图片理解多模态

4.1 数据准备

按照 https://github.com/InternLM/Tutorial/blob/camp2/xtuner/llava/xtuner_llava.md 中的教程来准备微调数据。

执行以下代码：

cd ~
git clone https://github.com/InternLM/tutorial -b camp2
python ~/tutorial/xtuner/llava/llava_data/repeat.py \
  -i ~/tutorial/xtuner/llava/llava_data/unique_data.json \
  -o ~/tutorial/xtuner/llava/llava_data/repeated_data.json \
  -n 200

4.2 训练启动

官方已经为大家准备好可一键启动的配置文件，主要修改了模型路径、对话模板以及数据路径。

修改llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py文件中的batchsize为4。
使用如下指令以启动训练：

xtuner train ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py --work-dir ~/llama3_llava_pth --deepspeed deepspeed_zero2_offload

4.3 格式转换

在训练好后，将原始 image projector 和我们微调得到的 image projector 都转换为 HuggingFace 格式

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \
  ~/model/llama3-llava-iter_2181.pth \
  ~/llama3_llava_pth/pretrain_iter_2181_hf

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \
  ~/llama3_llava_pth/iter_300.pth \
  ~/llama3_llava_pth/iter_300_hf

4.4 效果对比

在转换完成后，可以在命令行体验微调前后模型的效果。

问题1：Describe this image.
问题2：What is the equipment in the image?

图片如下所示：
在这里插入图片描述

Pretrain 模型

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/pretrain_iter_2181_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

在这里插入图片描述

此时可以看到，Pretrain 模型只会为图片打标签，并不能回答问题。

Finetune 后模型

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/iter_300_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

在这里插入图片描述

经过 Finetune 后，我们可以发现，模型已经可以根据图片回答我们的问题了。

五、Llama 3 Agent 能力体验+微调

使用基于 Lagent 的 Web Demo 来直观体验一下 Llama3 模型在 ReAct 范式下的智能体能力。我们让它使用 ArxivSearch 工具来搜索 InternLM2 的技术报告。使用 XTuner 在 Agent-FLAN 数据集上微调 Llama3-8B-Instruct，以让 Llama3-8B-Instruct 模型获得智能体能力。

5.1 准备工作

5.1.1 环境配置

如果在前面的实践中已经配置好了环境，这里选择直接执行 conda activate llama3 以进入环境。

最后， clone 本教程仓库。

cd ~
git clone https://github.com/SmartFlowAI/Llama3-Tutorial

5.1.2 模型准备

在微调开始前，我们首先来准备 Llama3-8B-Instruct 模型权重。

mkdir -p ~/model
cd ~/model
ln -s /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct .

5.1.3 数据集准备

由于 HuggingFace 上的 Agent-FLAN 数据集暂时无法被 XTuner 直接加载，因此我们首先要下载到本地，然后转换成 XTuner 直接可用的格式。

已经在 InternStudio 上准备好了一份转换好的数据，可以直接通过如下脚本准备好：

cd ~
cp -r /root/share/new_models/internlm/Agent-FLAN .
chmod -R 755 Agent-FLAN

在 SmartFlowAI/Llama3-Tutorial 仓库中已经准备好了相关转换脚本。

python ~/Llama3-Tutorial/tools/convert_agentflan.py ~/Agent-FLAN/data

转换好的数据位于 ~/Agent-FLAN/data_converted

5.2 微调训练启动

使用如下指令以启动训练：

export MKL_SERVICE_FORCE_INTEL=1
xtuner train ~/Llama3-Tutorial/configs/llama3-agentflan/llama3_8b_instruct_qlora_agentflan_3e.py --work-dir ~/llama3_agent_pth --deepspeed deepspeed_zero2

在训练完成后，我们将权重转换为 HuggingFace 格式，并合并到原权重中。

# 转换权重
xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-agentflan/llama3_8b_instruct_qlora_agentflan_3e.py \
    ~/llama3_agent_pth/iter_18516.pth \
    ~/llama3_agent_pth/iter_18516_hf

由于训练时间太长，官方也为大家准备已经训练好且转换为 HuggingFace 格式的权重，可以直接使用。路径位于 /share/new_models/agent-flan/iter_2316_hf。

如果要使用已经训练好的权重，可以使用如下指令合并权重：

export MKL_SERVICE_FORCE_INTEL=1
xtuner convert merge /root/model/Meta-Llama-3-8B-Instruct \
    /share/new_models/agent-flan/iter_2316_hf \
    ~/llama3_agent_pth/merged

5.3 Lagent Web Demo

在微调前后启动 Web Demo 以观察效果。

首先先来安装 lagent。

pip install lagent

然后使用如下指令启动微调前 Web Demo：

streamlit run ~/Llama3-Tutorial/tools/agent_web_demo.py /root/model/Meta-Llama-3-8B-Instruct

在这里插入图片描述

使用如下指令启动微调前 Web Demo：

streamlit run ~/Llama3-Tutorial/tools/agent_web_demo.py /root/llama3_agent_pth/merged

在这里插入图片描述

好像并没有太成功！

Hello2Bonjour

关注

27
点赞
踩
11

收藏

觉得还不错? 一键收藏
4
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

Llama3实践教程（InternStudio 版）

Llama3实践教程（InternStudio 版）

一、Llama 3 Web Demo 部署

1.1 环境配置

1.2 下载模型

1.3 Web Demo 部署

二、XTuner 微调 Llama3 个人小助手认知

2.1 自我认知训练数据集准备

2.2 XTuner配置文件准备

数据集地址修改

2.3 训练模型

2.4 推理验证

三、LMDeploy 高效部署 Llama3 实践

3.1 环境，模型准备

3.1.1 环境配置 lmdeploy

3.2 LMDeploy Chat CLI 工具

3.3 LMDeploy模型量化(lite)

3.3.1 设置最大KV Cache缓存大小

3.3.1.1 实验一 设置KV缓存占比0.8

3.3.1.2 实验二 设置KV缓存占比0.5

3.3.1.3 实验三 设置KV缓存占比0.01

3.3.2 使用W4A16量化

3.3.3 在线量化 KV

3.4 LMDeploy服务（serve）

3.4.1 启动API服务器

3.4.2 命令行客户端连接API服务器

3.4.3 网页客户端连接API服务器

四、XTuner 微调 Llama3 图片理解多模态

4.1 数据准备

4.2 训练启动

4.3 格式转换

4.4 效果对比

五、Llama 3 Agent 能力体验+微调

5.1 准备工作

5.1.1 环境配置

5.1.2 模型准备

5.1.3 数据集准备

5.2 微调训练启动

5.3 Lagent Web Demo

3.3.1.1 实验一设置KV缓存占比0.8

3.3.1.2 实验二设置KV缓存占比0.5

3.3.1.3 实验三设置KV缓存占比0.01