LLaMA 3.2 1B模型部署到RK3588香橙派5plus开发板上的完整指南

最新推荐文章于 2025-03-10 13:37:59 发布

白开水�518

最新推荐文章于 2025-03-10 13:37:59 发布

阅读量3.4k

点赞数 67

文章标签： llama python nlp 边缘计算语言模型 arm开发

本文链接：https://blog.csdn.net/qq_67803307/article/details/142851898

版权

大家好！在9月25日，Meta发布了LLaMA 3.2模型，其中包括了专为边缘设备设计的1B和3B小参数模型。我最近在参加一个人工智能类的学科竞赛，因此决定尝试在边缘设备上部署LLaMA 3.2 1B模型。我选择的是RK3588芯片，使用的是香橙派5plus开发板，配备了16G的运行内存，处理速度非常快。最终成功部署，回答速度和质量都相当不错。以下是我的部署过程，希望对大家有所帮助。

更详细视频可以在b站查看，链接放在评论区吧

系统下载与烧录

首先，我下载了Ubuntu系统。由于官方版本烧录失败，我选择了第三方的系统镜像。

香橙派5plus从ssd启动Ubuntu_香橙派5plus镜像-CSDN博客

烧录软件我使用了balenaEtcher https://etcher.balena.io/#download-etcher，大家也可以选择其他烧录软件Orange Pi 5 Plus - Wiki-Orange Pi。烧录完成后，将卡插入开发板，进行初始化设置，包括选择语言和设置密码。

模型下载

系统安装好后，我创建了一个项目文件夹来存放模型和代码。我选择了Huggingface的镜像网站HF-mirror来下载LLaMA 3.2的instruct版本模型Models - Hugging Face。这里需要注意的是，我最初下载错了模型，下载了base版本，导致模型无法正确回复问题。后来发现是模型下载错误，并且注意力掩码没有设置。大家在下载时一定要注意这些问题。

# 下载git，克隆模型

sudo apt update
sudo apt install git
sudo apt install curl
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git lfs install
git clone https://hf-mirror.com/unsloth/Llama-3.2-1B-Instruct

环境配置

在开发板上配置环境与普通电脑有所不同，因为开发板使用的是ARM64系统。我们需要下载pip，并创建虚拟环境。

# 创建虚拟环境

python3 -m venv llama

# 激活虚拟环境

source llama/bin/activate

也可以去下载anaconda，参考这个链接超详细Ubuntu安装Anaconda步骤+Anconda常用命令_ubuntu 安装anaconda-CSDN博客

我使用的是Python 3.12版本。接下来，我们需要下载PyTorch和transformer库。由于直接使用pip下载PyTorch较慢，我选择了从网页下载后离线安装。这里需要注意的是，要确保版本对应。arm64下安装pytorch,torchvision,torchaudio以及它们之间的版本对应关系_torch-1.11.0对应的torchaudio版本-CSDN博客

# 下载PyTorch

pip install /home/YourName/Downloads/torch-2.4.1-cp312-cp312-manylinux2014_aarch64.whl
pip install /home/YourName/Downloads/torchvision-0.19.1-cp312-cp312-manylinux2014_aarch64.whl

# 下载transformer

pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple

代码实现

创建一个新文件

touch llama-chat.py

代码部分主要包括导入模型、分词器、设置参数和多轮聊天的设计。在测试时，我遇到了注意力掩码未设置的问题，这会影响模型对输入内容的理解。此外，我还发现了束搜索的问题，当设置参数为1时，启用束搜索是无效的。因此，我将修改num_beams参数设置，以测试束搜索和贪心搜索的效果区别。

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_directory):
    """
    从本地目录加载模型和分词器。
    """
    tokenizer = AutoTokenizer.from_pretrained(model_directory)
    model = AutoModelForCausalLM.from_pretrained(model_directory)
    # model.half()
    return model, tokenizer

def generate_text(model, tokenizer, prompt, max_new_tokens):
    """
    使用提供的模型和分词器生成文本。
    """
    # 编码输入提示
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']


    # 生成文本
    output = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=2,
        repetition_penalty=1.5,
        top_k=50,
        top_p=0.95,
        temperature=1.0,
        do_sample=True,
        num_return_sequences=1,
        num_beams=2,  # 设置 num_beams > 1
        early_stopping=True
    )

    # 解码生成的文本
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

def chat_with_model(model, tokenizer, max_new_tokens):
    """
    与模型进行聊天对话。
    """
    print("Chat with the model! Type 'exit' to end the conversation.")
    prompt = "You are an Intelligent Traffic Rules Q&A Assistant, and when user ask you questions, you will provide me with traffic knowledge.Next, user will ask you questions, please answer them.\n"
    
    once_input = input("User:")
    
    if once_input.lower() == 'exit':
        print("Assistant: Goodbye! Stay safe on the roads!")
        exit()
        
    input_to_model = prompt + "\nUser:" + once_input + "\nAssistant"
    
    response = generate_text(model, tokenizer, input_to_model, max_new_tokens)

    while True:
        print(response)
        user_input = input("User: ")
        if user_input.lower() == 'exit':
            print("Assistant: Goodbye! Stay safe on the roads!")
            break
        input_to_model = response + "\n"+ "\nUser:" + user_input +"\nAssistant"
        # 更新对话历史
        # 生成模型的回答
        response = generate_text(model, tokenizer, input_to_model, max_new_tokens)

def main():
    # model_directory = "/home/zhineng/llama3.2-1B/Llama-3.2-1B"  # 替换为你的本地模型目录
    # model_directory = "/home/zhineng/llama3.2-1B/Llama-3.2-1B-FP8"
    model_directory = "/home/zhineng/llama3.2-1B/Llama-3.2-1B-Instruct"
    max_new_tokens = 100  # 生成新 token 的最大数量

    # 加载模型和分词器
    model, tokenizer = load_model_and_tokenizer(model_directory)

    # 确保模型处于评估模式
    model.eval()

    # 开始聊天对话
    chat_with_model(model, tokenizer, max_new_tokens)

if __name__ == "__main__":
    main()

# 运行推理代码