LLM大模型教程：LLM大模型推理加速: mlc-llm 教程，将qwen-7b 部署到手机上

最新推荐文章于 2024-12-18 10:06:14 发布

AGI大模型学习

最新推荐文章于 2024-12-18 10:06:14 发布

阅读量2.2k

点赞数 40

文章标签：智能手机 AI大模型人工智能产品经理大模型教程

本文链接：https://blog.csdn.net/2401_84495872/article/details/141994075

版权

MLC-LLM 是一种高性能通用部署解决方案，允许使用具有编译器加速功能的本机 API 来本机部署任何大型语言模型。该项目的使命是让每个人都能利用机器学习编译技术在每个人的设备上本地开发、优化和部署人工智能模型。

	AMD GPU	NVIDIA GPU	Apple GPU	Intel GPU
Linux / Win	✅ Vulkan, ROCm	✅ Vulkan, CUDA	N/A	✅ Vulkan
macOS	✅ Metal (dGPU)	N/A	✅ Metal	✅ Metal (iGPU)
Web Browser	✅ WebGPU and WASM
iOS / iPadOS	✅ Metal on Apple A-series GPU
Android	✅ OpenCL on Adreno GPU		✅ OpenCL on Mali GPU

环境安装

conda create --name mlc python=3.11
conda activate mlc

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
python -c "import mlc_chat; print(mlc_chat)"

模型转换

模型转换分为两步：

转换模型权重
生成mlc chat 的配置

下面以qwen模型为例，暂时不支持qwen2

转换模型权重

mlc_chat convert_weight /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc

参数列表

–CONFIG

It can be one of the following:

Path to a HuggingFace model directory that contains a config.json or
Path to config.json in HuggingFace format, or
The name of a pre-defined model architecture.

–quantization QUANTIZATION_MODE

可选项 q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.推荐使用q4f16_1

–model-type MODEL_TYPE

Model architecture such as “llama”. If not set, it is inferred from config.json

–device DEVICE

The device used to do quantization such as “cuda” or “cuda:0”. Will detect from local available GPUs if not specified.

–source SOURCE

The path to original model weight, infer from config if missing.

–source-format SOURCE_FORMAT

The format of source model weight, infer from config if missing.

–output OUTPUT

The output directory to save the quantized model weight. Will create params_shard_*.bin and ndarray-cache.json in this directory.

生成mlc chat 的配置

mlc_chat gen_config /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 --conv-template chatml \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc

注意conv-template的值参照github.com/mlc-ai/mlc-…，

如果不包含你的模型也可以自定义，但是要从源代码重新编译mlc

gen_config的参数列表

–CONFIG

It can be one of the following:

Path to a HuggingFace model directory that contains a config.json or
Path to config.json in HuggingFace format, or
The name of a pre-defined model architecture.

–quantization QUANTIZATION_MODE

–model-type MODEL_TYPE

–conv-template CONV_TEMPLATE

–context-window-size CONTEXT_WINDOW_SIZE

最大句子长度

–output OUTPUT

其他不重要的参数没有列出来

运行mlc

from mlc_chat import ChatModule

cm = ChatModule(model="/home/chuan/models/qwen/Qwen1___5-7B-Chat/mlc")
print(cm.generate("hello"))

把qwen模型编译成android app

我已经编译好了一个版本的app, 欢迎下载试用，需要科学上网

github.com/night-is-yo…

首先安装android studio，下载ndk、cmake,如图所示：

设置环境变量

export ANDROID_NDK=/home/chuan/Android/Sdk/ndk/26.2.11394342
export TVM_NDK_CC=/home/chuan/Android/Sdk/ndk/26.2.11394342/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android34-clang
export TVM_HOME=/home/chuan/github/mlc-llm/3rdparty/tvm
export JAVA_HOME=/home/chuan/tools/jdk-17.0.10

下载mlc ,并编译模型

git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/
MODEL_NAME=/home/chuan/models/qwen/Qwen-7B-Chat
QUANTIZATION=q4f16_1

mlc_chat convert_weight $MODEL_NAME --quantization $QUANTIZATION -o $MODEL_NAME/mlc
mlc_chat gen_config $MODEL_NAME --quantization $QUANTIZATION \
  --conv-template chatml --context-window-size 768 -o $MODEL_NAME/mlc
mlc_chat compile $MODEL_NAME/mlc/mlc-chat-config.json \
    --device android -o $MODEL_NAME/mlc/Qwen-7B-Chat-${QUANTIZATION}-android.tar

将模型上传到huggingface

git clone https://huggingface.co/chuan-niy/qwen_q4f16_1
git config user.name chuan-niy
git config user.email 1500546481@qq.com

cd qwen_q4f16_1
cp /home/chuan/models/qwen/Qwen-7B-Chat/mlc/* ./
git add . && git commit -m "Add qwen model weights for android"
git push origin main

编译android 库

cd ./android/library
vim ./src/main/assets/app-config.json

{
  "model_list": [
    {
      "model_url": "https://huggingface.co/chuan-niy/qwen_q4f16_1",
      "model_lib": "qwen_q4f16_1",
      "estimated_vram_bytes": 4348727787,
      "model_id": "Qwen-7B-Chat-hf-q4f16_1"
    }
  ],
  "model_lib_path_for_prepare_libs": {
    "Qwen-7B-Chat-hf-q4f16_1": "/home/chuan/models/qwen/Qwen-7B-Chat/mlc/Qwen-7B-Chat-q4f16_1-android.tar"
  }
}

在CMakeLists.txt中添加以下信息

vi CMakeLists.txt

set(JAVA_AWT_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_JVM_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
set(JAVA_INCLUDE_PATH2 "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_AWT_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")

需要修改#ifdef TVM4J_ANDROID代码的地方

vi mlc-llm/3rdparty/tvm/jvm/native/src/main/native/org_apache_tvm_native_c_api.cc

#ifdef TVM4J_ANDROID
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#else
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);

最后运行编译