[Llama3 Source Code] Pitfall when running llama3-implemented-from-scratch: Incorrect padding

甄天真学AI

Last modified 2024-06-19 19:17:26

First published 2024-06-18 22:07:42

Article link: https://blog.csdn.net/OldButSimple/article/details/139784728

Error: Incorrect padding

Problem description

To study the naklecha/llama3-from-scratch code, you first need to download the Meta-Llama-3-8B model files. The repository's README says:

in this file, i implemented llama3 from scratch, one tensor and matrix multiplication at a time. also, im going to load tensors directly from the model file that meta provided for llama3, you need to download the weights before running this file. here is the offical link to download the weights: https://llama.meta.com/llama-downloads/

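Following the README note above and the download method from the linked tutorial on using Llama 3 and downloading its models, I downloaded the Meta-Llama-3-8B model files and then ran the following code: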

from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import torch
import json
import matplotlib.pyplot as plt

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"
special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

tokenizer.decode(tokenizer.encode("hello world!"))
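The message in the title, `Incorrect padding`, is the text of the `binascii.Error` that Python's `base64` module raises when its input is not valid Base64. `load_tiktoken_bpe` expects every non-empty line of the tokenizer file to be a Base64-encoded token followed by an integer rank, and it Base64-decodes the first field. A plausible cause, which the text above does not confirm, is therefore that the downloaded `tokenizer.model` is not in that format. The sketch below is a standalone diagnostic that uses only the standard library; the helper name `check_tiktoken_bpe_file` is hypothetical and is not part of tiktoken. It is also stricter than tiktoken's own parsing, because it rejects non-Base64 characters instead of ignoring them.

```python
import base64
import binascii
from pathlib import Path


def check_tiktoken_bpe_file(path):
    """Return a list of (line_number, problem) tuples for lines that are not
    '<base64 token> <integer rank>'. An empty list means every line parsed.

    Hypothetical helper for diagnosis only; it is not part of tiktoken.
    """
    problems = []
    raw = Path(path).read_bytes()
    for number, line in enumerate(raw.splitlines(), start=1):
        if not line:
            continue  # blank lines are ignored
        parts = line.split()
        if len(parts) != 2:
            problems.append((number, f"expected 2 fields, found {len(parts)}"))
            continue
        token, rank = parts
        try:
            # validate=True also rejects characters outside the Base64 alphabet
            base64.b64decode(token, validate=True)
        except binascii.Error as exc:
            problems.append((number, f"bad base64: {exc}"))
            continue
        if not rank.isdigit():
            problems.append((number, "rank is not an integer"))
    return problems


if __name__ == "__main__":
    # Path taken from the snippet above; adjust it to your download location.
    tokenizer_file = Path("Meta-Llama-3-8B/tokenizer.model")
    if tokenizer_file.exists():
        found = check_tiktoken_bpe_file(tokenizer_file)
        print(f"{len(found)} problematic line(s)")
        for number, problem in found[:5]:
            print(f"  line {number}: {problem}")
    else:
        print(f"{tokenizer_file} not found")
```

If the helper reports problems on the very first lines, the file is probably not a tiktoken BPE file at all, and it is worth re-checking which `tokenizer.model` was downloaded before debugging the rest of the script.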