Building LLaMa3 by Hand from Scratch: The Project That Earned 5.7K Stars!

Latest recommended article published on 2024-06-28 14:08:17

程序锅锅


Article link: https://blog.csdn.net/qq_35054222/article/details/139123494


Hi everyone, this is 程序锅.

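A month ago, Meta released the open-source llama3 family of large models, which reportedly outperforms industry SOTA models on several key benchmarks and leads across the board on code generation tasks.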

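As with earlier releases, the llama series offers limited support for Chinese, so developers have started deploying it locally and training it on Chinese-language data.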

Here is a project I recommend:

Project URL: https://github.com/naklecha/llama3-from-scratch

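The project is a from-scratch implementation of llama3, with very detailed explanations of techniques such as attention matrix multiplication across multiple heads, positional encoding, and tokenization.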

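Once you have worked through the project, you will not only understand the llama3 network structure well, but also be able to apply the same approach to analyzing the architectures of other open-source large models.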

Below are a few of the topics the project covers.

1. How to convert text into tokens:

import torch

# `tokenizer` is assumed to be the tokenizer object loaded earlier in the project.
prompt = "the answer to the ultimate question of life, the universe, and everything is "
# Prepend 128000, the id of the <|begin_of_text|> special token, to the encoded prompt.
tokens = [128000] + tokenizer.encode(prompt)
print(tokens)
tokens = torch.tensor(tokens)
# Decode each id on its own to see how the prompt was split into tokens.
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]
print(prompt_split_as_tokens)

Output:

[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']
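
To make the text-to-token round trip easier to follow without downloading the real tokenizer, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `TOY_VOCAB`, `BOS_ID`, `encode`, and `decode` are hypothetical names, the vocabulary has six made-up entries, and greedy longest-match is a simplification rather than the BPE algorithm the real tokenizer uses.

```python
from typing import List

# Toy vocabulary (hypothetical): the real llama3 vocabulary is far larger.
TOY_VOCAB = {
    "<|begin_of_text|>": 0,
    "the": 1,
    " answer": 2,
    " is": 3,
    " ": 4,
    " 42": 5,
}
BOS_ID = TOY_VOCAB["<|begin_of_text|>"]  # plays the role of 128000 above
ID_TO_TEXT = {token_id: text for text, token_id in TOY_VOCAB.items()}

# Ordinary (non-special) pieces, longest first, so longer matches win.
ORDINARY_PIECES = sorted(
    (piece for piece in TOY_VOCAB if not piece.startswith("<|")),
    key=len,
    reverse=True,
)


def encode(text: str) -> List[int]:
    """Greedy longest-match encoding (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        for piece in ORDINARY_PIECES:
            if text.startswith(piece, pos):
                ids.append(TOY_VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return ids


def decode(ids: List[int]) -> str:
    """Map ids back to their text pieces and join them."""
    return "".join(ID_TO_TEXT[i] for i in ids)


prompt = "the answer is "
tokens = [BOS_ID] + encode(prompt)              # prepend the begin-of-text id
prompt_split_as_tokens = [decode([t]) for t in tokens]
print(tokens)
print(prompt_split_as_tokens)
```

As in the project snippet, the trailing space in the prompt becomes its own token, and decoding the ids one at a time shows where the text was split.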

2. How to convert tokens into embeddings:

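# `vocab_size`, `dim`, and `model` (the loaded weights) are assumed to be defined earlier in the project.
embedding_layer = torch.nn.Embedding(vocab_size, dim)
# Overwrite the randomly initialized table with the pretrained token embeddings.
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
# Look up one vector per token id and cast to bfloat16; the values are not yet normalized.
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
# The shape is [number of tokens, dim]; the prompt above has 17 tokens.
token_embeddings_unnormalized.shape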
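
An embedding layer is essentially a lookup table: each token id selects one row of a `vocab_size` by `dim` weight matrix. The sketch below shows that lookup with plain Python lists so it runs without PyTorch or the model weights. The sizes, the random values, and the names `embedding_table` and `embed` are assumptions for illustration only, not the project's code.

```python
import random
from typing import List

vocab_size = 6  # toy value (hypothetical); the real vocabulary is far larger
dim = 4         # toy embedding width (hypothetical)

# Stand-in for the pretrained weights: one row of `dim` floats per token id.
random.seed(0)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(vocab_size)
]


def embed(token_ids: List[int]) -> List[List[float]]:
    """Return one embedding row per token id (a plain table lookup)."""
    return [embedding_table[token_id] for token_id in token_ids]


tokens = [0, 1, 2, 3, 4]
token_embeddings = embed(tokens)

# The result has shape [number of tokens, dim].
print(len(token_embeddings), len(token_embeddings[0]))  # prints: 5 4
```

The project snippet does the same thing at full scale: it copies the pretrained `tok_embeddings.weight` matrix into the table, then indexes it with the token ids.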