五一卷羊驼三 Llama3（机智流作业）

最新推荐文章于 2024-08-22 21:31:21 发布

Esirn

最新推荐文章于 2024-08-22 21:31:21 发布

阅读量791

点赞数 9

文章标签： llama 人工智能 python

本文链接：https://blog.csdn.net/Esirn/article/details/138700822

版权

本作业基于Llama3-Tutorial（Llama 3 超级课堂），使用VSCode远程ssh连接InternStudio 开发机，配置过程可见于文档，此处省略。

1 Llama 3 本地 Web Demo 部署

环境配置：

conda create -n llama3 python=3.10
conda activate llama3
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

下载模型（软链接 InternStudio）

mkdir -p ~/model
cd ~/model
ln -s /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct ~/model/Meta-Llama-3-8B-Instruct

Web Demo 部署

安装 XTuner （自动安装其他依赖），注意pip install -e .[all]这条指令直接加[all]这个参数。

cd ~
git clone -b v0.1.18 https://github.com/InternLM/XTuner
cd XTuner
pip install -e .[all]
streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \
  ~/model/Meta-Llama-3-8B-Instruct

在vscode设置好端口转发，然后在浏览器进入 localhost:8501即可。
图1 终端命令
图2 浏览器

2 使用 XTuner 完成小助手认知微调

自我认知训练数据集准备

可以先修改~/Llama3-Tutorial/tools/gdata.py 部分，然后运行

cd ~/Llama3-Tutorial
python tools/gdata.py

训练模型

直接使用已经修改好了的配置文件configs/assistant/llama3_8b_instruct_qlora_assistant.py (主要修改了模型路径和对话模板)

# 开始训练,使用 deepspeed 加速，A100 24G显存
xtuner train configs/assistant/llama3_8b_instruct_qlora_assistant.py --work-dir /root/llama3_pth

# Adapter PTH 转 HF 格式
xtuner convert pth_to_hf /root/llama3_pth/llama3_8b_instruct_qlora_assistant.py \
  /root/llama3_pth/iter_500.pth \
  /root/llama3_hf_adapter

# 模型合并
export MKL_SERVICE_FORCE_INTEL=1
xtuner convert merge /root/model/Meta-Llama-3-8B-Instruct \
  /root/llama3_hf_adapter\
  /root/llama3_hf_merged

推理验证

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \
  /root/llama3_hf_merged

此时 Llama3 拥有了他是 da哒哒 打造的人工智能助手的认知。
图3 终端

图4 浏览器

3 使用 LMDeploy 成功部署 Llama 3 模型

3.1 环境，模型准备

# 初始化环境
conda create -n lmdeploy python=3.10
conda activate lmdeploy
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
# 安装lmdeploy最新版
pip install -U lmdeploy[all]

Llama3 的下载部分，此前已经软链接，此处省略。

3.2. LMDeploy Chat CLI 工具

conda activate lmdeploy
lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct

3.3. LMDeploy模型量化(lite)

主要介绍如何对模型进行量化，包括 KV8量化和W4A16量化。

3.3.1 设置最大KV Cache缓存大小

模型在运行时，占用的显存可大致分为三部分：模型参数本身占用的显存、KV Cache占用的显存，以及中间运算结果占用的显存。LMDeploy的KV Cache管理器可以通过设置--cache-max-entry-count参数，控制KV缓存占用剩余显存的最大比例。默认的比例为0.8。

看一下调整参数的效果。首先保持不加该参数（默认0.8），运行 Llama3-8b 模型。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/

懒得新建一个终端运行nvidia-smi ，可以直接看开发机“资源监控”
0.8
此时模型的占用为23112M。下面，改变--cache-max-entry-count参数设为0.5。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.5

0.5

看到显存占用明显降低，变为20488M。最后把--cache-max-entry-count参数设置为0.01，约等于禁止KV Cache占用显存。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.01

在这里插入图片描述
然后与模型对话，可以看到，此时显存占用仅为16200M，代价是会降低模型推理速度。

3.3.2 使用W4A16量化

仅需执行一条命令，就可以完成模型量化工作。

lmdeploy lite auto_awq \
   /root/model/Meta-Llama-3-8B-Instruct \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/model/Meta-Llama-3-8B-Instruct_4bit

运行时间较长，请耐心等待。量化工作结束后，新的HF模型被保存到Meta-Llama-3-8B-Instruct_4bit目录。下面使用Chat功能运行W4A16量化后的模型。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct_4bit --model-format awq

在这里插入图片描述

将KV Cache比例再次调为0.01，查看显存占用情况。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct_4bit --model-format awq --cache-max-entry-count 0.01

在这里插入图片描述
可以看到，显存占用变为6546MB，明显降低。

3.4 LMDeploy服务（serve）

3.4.1 启动API服务器

前面的都是在本地直接推理大模型，也就是本地部署。在生产环境下，我们有时会将大模型封装为 API 接口服务，供客户端访问。

lmdeploy serve api_server \
    /root/model/Meta-Llama-3-8B-Instruct \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

做好23333的端口转发，然后在浏览器访问 localhost:23333
在这里插入图片描述

3.4.2 命令行客户端连接API服务器

新建一个命令行客户端去连接API服务器。

conda activate lmdeploy
lmdeploy serve api_client http://localhost:23333

运行后，可以通过命令行窗口直接与模型对话
在这里插入图片描述

3.4.3 网页客户端连接API服务器

在客户端的终端用Ctrl + C中断客户端，不用关闭终端，因为等等还要运行Gradio。
运行之前确保自己的gradio版本低于4.0.0，否则可能等等会报错：
在这里插入图片描述
可以先用pip show gradio，如果版本不对，再用pip install gradio==3.50.2。
使用Gradio作为前端，启动网页客户端。

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

打开浏览器，访问地址http://127.0.0.1:6006（之前做好了端口转发），然后就可以与模型进行对话了！
在这里插入图片描述

3.5 推理速度

使用 LMDeploy 在 A100（24G）推理 Llama3。

# 克隆仓库
cd ~
git clone https://github.com/InternLM/lmdeploy.git
# 下载测试数据
cd /root/lmdeploy
wget https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

执行 benchmark 命令，24G显存不够用，需要调低--cache-max-entry-count，我直接从0.8调到了0.2，

python benchmark/profile_throughput.py \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    /root/model/Meta-Llama-3-8B-Instruct \
    --cache-max-entry-count 0.2 \
    --concurrency 256 \
    --model-format hf \
    --quant-policy 0 \
    --num-prompts 10000

结果很慢：

concurrency: 256
elapsed_time: 3432.928s

first token latency(s)(min, max, ave): 1.234, 132.587, 73.308
per-token latency(s) percentile(50, 75, 95, 99): [0.031, 0.044, 0.27, 0.53]

number of prompt tokens: 2238358
number of completion tokens: 2005438
token throughput (completion token): 584.177 token/s
token throughput (prompt + completion token): 1236.203 token/s
RPS (request per second): 2.913 req/s
RPM (request per minute): 174.778 req/min

4 多模态 Llava 微调和部署

模型准备

准备 Llama3-8B-Instruct 权重，之前已经软链接，此处省略。
接下来准备 Llava 所需要的 Visual Encoder 权重和 Image Projector 部分权重。

mkdir -p ~/model
cd ~/model
# 准备 Visual Encoder 权重。
ln -s /root/share/new_models/openai/clip-vit-large-patch14-336 .
# 准备 Image Projector 部分权重
ln -s /root/share/new_models/xtuner/llama3-llava-iter_2181.pth .

数据准备

cd ~
git clone https://github.com/InternLM/tutorial -b camp2
python ~/tutorial/xtuner/llava/llava_data/repeat.py \
  -i ~/tutorial/xtuner/llava/llava_data/unique_data.json \
  -o ~/tutorial/xtuner/llava/llava_data/repeated_data.json \
  -n 200

微调过程

使用可以一键启动的配置文件，主要是修改好了模型路径、对话模板以及数据路径。24G显存不够用，根据课堂的oom教程，改一下参数：

# 原指令
xtuner train ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py --work-dir ~/llama3_llava_pth --deepspeed deepspeed_zero2
# 修改后指令
llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py --work-dir ~/llama3_llava_pth --deepspeed deepspeed_zero2_offload

在训练好之后，将原始 image projector 和微调得到的 image projector 都转换为 HuggingFace 格式。

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \
  ~/model/llama3-llava-iter_2181.pth \
  ~/llama3_llava_pth/pretrain_iter_2181_hf

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \
  ~/llama3_llava_pth/iter_1200.pth \
  ~/llama3_llava_pth/iter_1200_hf

展示

根据图片提问：
oph
在转换完成后，在命令行简单体验一下微调后模型的效果。

问题1：Describe this image.
问题2：What is the equipment in the image?

Pretrain 模型

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/pretrain_iter_2181_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

微调前

从上图可以看到，微调前的 Pretrain 模型只会为图片打标签，并不能回答问题。

Finetune 后模型

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/iter_1200_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

微调后
经过 Finetune 后，模型已经可以根据图片回答我们的问题了。（但依然不会in Chinese）

* LMDeploy运行Llava-Llama-3

在lmdeploy的conda环境下安装依赖：

# 
pip install git+https://github.com/haotian-liu/LLaVA.git

运行touch /root/pipeline_llava.py 新建一个文件，复制下列代码进去

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('xtuner/llava-llama-3-8b-v1_1-hf',
                chat_template_config=ChatTemplateConfig(model_name='llama3'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response.text)

运行python pipeline_llava.py结果为：
在这里插入图片描述

5 Llama3 工具调用能力训练

体验一下 Llama3 模型在 ReAct 范式下的智能体能力。Llama3-8B-Instruc无法成功调用 ArxivSearch 工具来搜索 InternLM2 的技术报告。原因在于它输出了 query=InternLM2 Technical Report 而非 {'query': 'InternLM2 Technical Report'}，这也就导致了 ReAct 在解析工具输入参数时发生错误，进而导致调用工具失败。

使用 XTuner 在 Agent-FLAN 数据集上微调 Llama3-8B-Instruct，以让 Llama3-8B-Instruct 模型获得智能体能力。

5.1 准备数据集

conda activate llama3
cd ~
cp -r /root/share/new_models/internlm/Agent-FLAN .
chmod -R 755 Agent-FLAN
python ~/Llama3-Tutorial/tools/convert_agentflan.py ~/Agent-FLAN/data

5.2 合并权重

由于训练时间太长，机智流准备好了已经训练好且转换为 HuggingFace 格式的权重，可以直接使用。路径位于 /share/new_models/agent-flan/iter_2316_hf。如果要使用已经训练好的权重，可以使用如下指令合并权重：

export MKL_SERVICE_FORCE_INTEL=1
xtuner convert merge /root/model/Meta-Llama-3-8B-Instruct \
    /share/new_models/agent-flan/iter_2316_hf \
    ~/llama3_agent_pth/merged

在这里插入图片描述

5.3 Llama3+Agent-FLAN ReAct Demo

合并权重后，再次使用 Web Demo 体验一下它的智能体能力

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \
  ~/llama3_agent_pth/merged

因为在微调前后都需要启动 Web Demo 以观察效果，因此将 Web Demo 部分单独拆分出来。首先先来安装 lagent。

pip install lagent

然后我们使用如下指令启动 Web Demo：

streamlit run ~/Llama3-Tutorial/tools/agent_web_demo.py 微调前/后 LLaMA3 模型路径

微调前 LLaMA3 路径：/root/model/Meta-Llama-3-8B-Instruct
微调后 LLaMA3 路径：/root/llama3_agent_pth/merged

最终指令：

# 微调前
streamlit run ~/Llama3-Tutorial/tools/agent_web_demo.py /root/model/Meta-Llama-3-8B-Instruct
# 微调后
streamlit run ~/Llama3-Tutorial/tools/agent_web_demo.py /root/llama3_agent_pth/merged

在这里插入图片描述