Running Llama-3.1-Minitron-4B with the Transformers Library

棂兰 | Lynlane

Last modified: 2024-09-10 13:01:28


First published: 2024-09-02 22:32:18

Article link: https://blog.csdn.net/Lynlane/article/details/141807358


My laptop is a relic from eight years ago and can barely run an 8B model, so I decided to try something smaller...

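Introduction to Llama-3.1-Minitron 4B

Llama-3.1-Minitron 4B is a compact language model derived from the Llama-3.1 8B model through structured weight pruning and knowledge distillation.

It comes in two base variants, Width-Base and Depth-Base. The model files are available on Hugging Face or on its mirror, HF-Mirror.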

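Why not run it with Ollama

The reason is simple: my computer can't manage it... I originally tried to manage the model with Ollama, but at run time it threw an error I can't resolve for now. It looks like the results cannot be read, either because CPU resources are insufficient or because the model conversion went wrong. I plan to retry with a lower quantization precision after military training.

So for now I'm going back to the transformers library plus Python.

I don't need to chat with this small model interactively anyway, so it doesn't matter much.

~~Doesn't matter much, my foot. No matter how I tweak it, I can't get the model to load on the 940MX, whereas Ollama can; it can't read the port properly, but its responses are dozens of times faster.~~

Fortunately, running Llama-3.1-Minitron-4B with the transformers library is simple : )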

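1. Download the model

You can download the model files directly from Hugging Face, but HF-Mirror is recommended:

Width-Base model, cloned from HF-Mirror:

git lfs install  # set up Git LFS (it must already be installed) so large files are downloaded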
git clone https://hf-mirror.com/nvidia/Llama-3.1-Minitron-4B-Width-Base

Or the Depth-Base model, cloned from HF-Mirror:

git lfs install  # set up Git LFS (it must already be installed) so large files are downloaded
git clone https://hf-mirror.com/nvidia/Llama-3.1-Minitron-4B-Depth-Base

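You only need one of the two; downloading both just takes longer.

If the clone succeeds, you can move on to running the model;

if the clone fails or files are missing, download the files manually and verify them.

Manual download locations:

File pages for the two models (mirror):

nvidia/Llama-3.1-Minitron-4B-Width-Base at main · HF Mirror

nvidia/Llama-3.1-Minitron-4B-Depth-Base at main · HF Mirror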
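If git clone keeps failing, fetching the repository from Python is another route. This is a sketch of my own, not part of the workflow above: it assumes the `huggingface_hub` package is installed, uses its `snapshot_download` function, and points it at the mirror through the `HF_ENDPOINT` environment variable. The helper names `repo_id_for` and `download_model` and the target folder are made up for illustration.

```python
import os

# Repository names as they appear in the clone URLs above
VARIANTS = {
    "width": "nvidia/Llama-3.1-Minitron-4B-Width-Base",
    "depth": "nvidia/Llama-3.1-Minitron-4B-Depth-Base",
}


def repo_id_for(variant: str) -> str:
    """Map 'width' or 'depth' to the repository ID (hypothetical helper)."""
    try:
        return VARIANTS[variant.lower()]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None


def download_model(variant: str, target_dir: str) -> str:
    """Download one variant into target_dir and return the local path."""
    # Set the endpoint before importing huggingface_hub,
    # because the library reads HF_ENDPOINT when it is imported.
    os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=repo_id_for(variant), local_dir=target_dir)


# Example call (assumed folder name; uncomment to actually download):
# download_model("depth", "./Llama-3.1-Minitron-4B-Depth-Base")
```

The files still deserve the same verification as a manual download, since an interrupted transfer can leave a shard incomplete.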

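Verification steps:

Open the model's file page and click one of the links highlighted in the red box (in the original screenshot),

scroll down, find the checksum information for the key files, and note it down.

In a command prompt, change to the directory containing those files and compute their hashes:

cd [directory containing the files to verify]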
certutil -hashfile model-00001-of-00002.safetensors SHA256
certutil -hashfile model-00002-of-00002.safetensors SHA256

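The files are large, so be patient while the hashes are computed.

If the returned hashes match the values you noted earlier, the download is correct; otherwise, download the files again and re-verify.

For files that list no hash but do list a size, you can check the file's size in bytes in its Properties dialog instead.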
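`certutil` only exists on Windows, so here is a rough cross-platform equivalent using Python's standard `hashlib`. It is a minimal sketch: the function names `sha256_of` and `verify_file` are my own, and the expected hash or size has to be the value you noted from the file page.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_sha256: str = "", expected_size: int = -1) -> bool:
    """Check a file against a recorded hash, or against a byte size if no hash is listed."""
    if expected_sha256:
        return sha256_of(path) == expected_sha256.strip().lower()
    if expected_size >= 0:
        return os.path.getsize(path) == expected_size
    raise ValueError("provide expected_sha256 or expected_size")


# Example call (paste the hash you noted from the file page):
# verify_file("model-00001-of-00002.safetensors", expected_sha256="...")
```

Reading in 1 MiB chunks matters here because each safetensors shard is several gigabytes, which an 8-year-old laptop should not try to hold in memory at once.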

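2. Run with the transformers library

First, install the transformers and torch libraries if you don't already have them. The script below loads the model with device_map="auto", which also requires the accelerate package.

pip install transformers torch accelerate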
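To see what is still missing before running anything heavy, a quick check like the following can help. It is only a sketch: `missing_packages` is a helper name I made up, and `accelerate` is in the list on the assumption that the model is loaded with `device_map="auto"`, which transformers only supports when accelerate is installed.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# accelerate is listed on the assumption that device_map="auto" is used when loading
required = ["transformers", "torch", "accelerate"]
missing = missing_packages(required)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```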

Then you can run the model.

Remember to change the path in the code to your own first.

# Not the official example; I wrote this one myself (looks up proudly)
# The official example is on the model's release page
from transformers import AutoTokenizer, LlamaForCausalLM
import torch


# Model path
# This is mine; change it to your own model path, and remember the doubled backslashes ↓
model_path = "C:\\Users\\LingL\\llama.cpp\\models\\Llama-3.1-Minitron-4B-Depth-Base"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_path)
dtype = torch.bfloat16  # load weights in bfloat16 (half precision) to save memory
# device_map="auto" lets Hugging Face decide where to place the model (needs accelerate)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map="auto")


# Report whether CUDA is available; the model itself was already placed by device_map
print("-------------------------------------\n")
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name()}")
else:
    # CUDA is unavailable, so fall back to the CPU
    print("CUDA is not available, using CPU.")
# Send inputs to the device the model actually ended up on
device = model.device
print("-------------------------------------\n")



# Feed one prompt to the model and return the decoded output
def run_test(input_text):
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids)  # single unpadded prompt, so attend to every token
    with torch.no_grad():
        output = model.generate(input_ids,
                                attention_mask=attention_mask,  # pass the attention mask explicitly
                                max_length=50,  # cap on prompt + generated tokens; must exceed the prompt length
                                repetition_penalty=1.2,  # penalize repetition
                                pad_token_id=tokenizer.eos_token_id,
                                num_return_sequences=1,  # return a single sequence
                                do_sample=False)  # disable sampling (greedy decoding) to reduce randomness
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Prompts, fed to the model in order
input_texts = [
    # 1. Don't forget the commas; otherwise two adjacent prompts are joined into a single prompt
    # 2. Each prompt must be shorter than the max_length limit set above
    "Explain the theory of relativity: It is",  # text completion
    "1,1,2,3,5,8,13,21,34,55,",                 # sequence reasoning
    "The Sky is",                               # free association
]


# Run the tests and print the results
for text in input_texts:
    print(f"\n▶Input: {text}\n▶Output: {run_test(text)}\n")
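The two notes in the prompt list are easy to get wrong, so here is a small standalone illustration that needs no model. Python concatenates adjacent string literals, which is why a missing comma silently merges two prompts; and since `max_length` counts the prompt as well as the generated tokens, the room left for new tokens is the difference between the two. `new_token_budget` is a hypothetical helper, not a transformers function.

```python
# Adjacent string literals are concatenated by Python,
# so a missing comma merges two prompts into one.
without_comma = [
    "Explain the theory of relativity: It is",
    "1,1,2,3,5,8,13,21,34,55,"   # <- no comma after this literal
    "The Sky is",
]
with_comma = [
    "Explain the theory of relativity: It is",
    "1,1,2,3,5,8,13,21,34,55,",
    "The Sky is",
]
print(len(without_comma), len(with_comma))  # prints: 2 3
print(repr(without_comma[1]))               # prints: '1,1,2,3,5,8,13,21,34,55,The Sky is'


def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when max_length also counts the prompt (hypothetical helper)."""
    return max(0, max_length - prompt_tokens)


print(new_token_budget(prompt_tokens=12, max_length=50))  # prints: 38
```

If you would rather not think about prompt length at all, `generate` also accepts `max_new_tokens`, which limits only the newly generated tokens.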

Sample results for reference:

▶Input: Explain the theory of relativity: It is
▶Output: Explain the theory of relativity: It is a theory of physics that explains how space and time are related to each other. It was developed by Albert Einstein in 1905 and has since been confirmed by numerous experiments.


▶Input: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55,
▶Output: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976


▶Input: The sky is
▶Output: The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling
# You may want to keep an eye on the model's mental state
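The last output loops even though `repetition_penalty=1.2` is set, so it may help to see roughly what that parameter does. The following is my own simplified, pure-Python rendering of the commonly described rule (scores of tokens that already appeared are divided by the penalty if positive and multiplied by it if negative); the real implementation in transformers works on tensors, and `apply_repetition_penalty` is just an illustrative name.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Down-weight tokens that already appeared (simplified, pure-Python sketch).

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so a seen token always becomes less likely.
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Toy vocabulary of 4 tokens; tokens 0 and 2 were already generated
logits = [3.0, 1.0, -0.5, 0.2]
print(apply_repetition_penalty(logits, seen_token_ids=[0, 2, 0]))
```

With a mild penalty like 1.2, a token whose score is far ahead of the rest can still win under greedy decoding, which would be consistent with the loop seen above.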

Some additional basic reasoning tests

Inferring simple algebraic addition (m=0, n=5, o=m+n, o=5): performs well

▶Input: a=1, b=2, c=a+b, c=3. d=1, e=5, f=d+e, f=6. g=1, h=3, i=g+h, i=4. j=2, k=3, l=j+k, l=5. m=0, n=5, o=m+n, o=
▶Output: a=1, b=2, c=a+b, c=3. d=1, e=5, f=d+e, f=6. g=1, h=3, i=g+h, i=4. j=2, k=3, l=j+k, l=5. m=0, n=5, o=m+n, o=5. p=1, q=2, r=p+q, r=3. s=1, t=2, u=s+t, u=3.

Inferring the number that corresponds to a letter (letters map to 1 through 26 in alphabetical order, so e=5): performs well

▶Input: a=1,z=26,e=?
▶Output: a=1,z=26,e=? (a=1,z=26,e=5) (a=1,z

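Inferring the next letters (following keyboard key order, with a comma after every 3 letters): performs well on the grouping, although after "jkl, zxc" the output repeats earlier groups instead of following strict key order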

▶Input: qwe, rty, uio, pas, dfg, hjk, lzx, cvb, nmq, wer, tyu, iop, asd, fgh, j
▶Output: qwe, rty, uio, pas, dfg, hjk, lzx, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, ty

Inferring the Fibonacci sequence: performs well

▶Input: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55,
▶Output: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976
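Checking such a long list of numbers by eye is tedious, so here is a small sketch for checking it mechanically. `follows_fibonacci_rule` is a helper I made up; it only tests that each number equals the sum of the two before it, and the strings passed in are the beginning and the end of the output shown above.

```python
def follows_fibonacci_rule(text: str) -> bool:
    """True if every number after the first two equals the sum of the two before it."""
    numbers = [int(part) for part in text.split(",") if part.strip()]
    return len(numbers) >= 3 and all(
        numbers[i] == numbers[i - 1] + numbers[i - 2] for i in range(2, len(numbers))
    )


# Beginning and end of the output shown above (middle omitted here for brevity)
print(follows_fibonacci_rule("1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233"))   # prints: True
print(follows_fibonacci_rule("1134903170, 1836311903, 2971215073, 4807526976"))  # prints: True
```

Pasting the full output string into the function checks every term at once.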