Llama3.1-8B-Instruct Local Deployment and Testing Guide
This article walks you step by step through deploying and testing the Llama3.1-8B-Instruct model in a local environment.
Llama Model Family Overview
- Llama 3.2: a collection of pretrained and instruction-tuned multilingual large language models (LLMs) in 1B and 3B sizes (text in / text out).
- Llama 3.2 Vision: a collection of pretrained and instruction-tuned multimodal LLMs for image reasoning, in 11B and 90B sizes (text + image in / text out).
- Llama 3.1: a collection of pretrained and fine-tuned text models ranging from 8B to 405B parameters, pretrained on roughly 15 trillion tokens.
- Llama 3.1 evals: details on how the reported benchmark metrics for the Llama 3.1 models were obtained, including the configurations, prompts, and model responses used to generate the evaluation results.
Obtaining the Model
The Llama3.1-8B-Instruct model is available from the Hugging Face model hub. If your Hugging Face access request is denied, you can request a download URL from the official Llama website instead:
https://www.llama.com/llama-downloads
After submitting the request, you will receive an email from Meta at the address you registered with:
You’re all set to start building with Llama 3.1, Llama Guard 3 and Prompt Guard
The models listed below are now available to you as a commercial license holder. By downloading a model, you are agreeing to the terms and conditions of the License and Acceptable Use Policy and Meta’s privacy policy.
MODELS AVAILABLE
With each model size, please find:
Pretrained weights: These are base weights that can be finetuned, domain adapted with full flexibility
Instruct weights: These weights are for models that have been fine-tuned and aligned to follow instructions. They can be used as-is in chat applications or further finetuned and aligned for specific use cases
Pretrained:
Meta-Llama-3.1-8B
Meta-Llama-3.1-70B
Meta-Llama-3.1-405B
Meta-Llama-3.1-405B-MP16
Meta-Llama-3.1-405B-FP8
Fine-tuned:
Meta-Llama-3.1-8B-Instruct
Meta-Llama-3.1-70B-Instruct
Meta-Llama-3.1-405B-Instruct
Meta-Llama-3.1-405B-Instruct-MP16
Meta-Llama-3.1-405B-Instruct-FP8
Llama-Guard-3-8B
Llama-Guard-3-8B-INT8
Prompt-Guard-86M
NOTE 405B:
Model requires significant storage and computational resources, occupying approximately 750GB of disk storage space and necessitating two nodes on MP16 for inferencing.
We are releasing multiple versions of the 405B model to accommodate its large size and facilitate multiple deployment options: MP16 (Model Parallel 16) is the full version of BF16 weights. These weights can only be served on multiple nodes using pipelined parallel inference. At minimum it would need 2 nodes of 8 GPUs to serve.
MP8 (Model Parallel 8) is also the full version of BF16 weights, but can be served on a single node with 8 GPUs by using dynamic FP8 (floating point 8) quantization. We are providing reference code for it. You can download these weights and experiment with different quantization techniques outside of what we are providing.
FP8 (Floating Point 8) is a quantized version of the weights. These weights can be served on a single node with 8 GPUs by using the static FP quantization. We have provided reference code for it as well.
HOW TO DOWNLOAD THE MODEL
Visit the Llama Models repository for the model on GitHub and follow the instructions in the README to choose and download the model using Llama CLI. Pass the custom URL below when prompted to start the download. (Clicking on the URL itself does not access the model):
https://llama3-1.llamameta.net/*?Policy=eyJTdGF0ZW1lbnQiOlt7InVuaXF1ZV9oYXNoIjoiamRpMndyeWhpMHV3ZDhycHluczNuZXVmIiwiUmVzb3VyY2UiOiJodHRwczpcL1wvbGxhbWEzLTEubGxhbWFtZXRhLm5ldFwvKiIsIkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyOTk0MTY1OH19fV19&Signature=jTSu4li55ZUiVycY6fLmE8dA6dx6oxAh7PZ-%7ENbdcKoqdubi5jiSBaHETzwzjweyIg7Bme2wQ1LKfjrsmXNXS8SqS6NtT9HkZbW9DlmVNyElqQAN5LULBovay4EkzMvu-N2Z-Fn2mWU52g8rlzkOSYMDBSI33fQE5sSqgGTcMUM1WOwb5QoSJHCaiB-6DD91h65wHxLgMPJ-z5vQzOGMZ0wXy910XqNSzvHyVTC9c%7EBeLCIgEaOLW83MelgzDDGNY0pg3vkH2VBfsmG8AYZh2a2ncCXDJI7rPGhiOyki%7EmHBcRgtqavOBgQvxf9fX-EQAxY3rRRpScPqcIy%7EkPjB7Q__&Key-Pair-Id=K15QRJLYKIFSLZ&Download-Request-ID=1672763443517196
Specify models to download.
Please save copies of the unique custom URLs provided above, they will remain valid for 24 hours to download each model up to 5 times, and requests can be submitted multiple times. An email with the download instructions will also be sent to the email address you used to request the models.
Now you’re ready to start building with Llama 3.1, Llama Guard 3 and Prompt Guard.
HELPFUL TIPS:
Please read the instructions in the GitHub repo and use the provided code examples to understand how to best interact with the models. In particular, for the fine-tuned models you must use appropriate formatting and correct system/instruction tokens to get the best results from the model.
You can find additional information about how to responsibly deploy Llama models in our Responsible Use Guide.
IF YOU NEED TO REPORT ISSUES:
If you or any Llama user becomes aware of any violation of our license or acceptable use policies—or any bug or issues with Llama that could lead to any such violations—please report it through one of the following means:
Reporting issues with the model: https://github.com/meta-llama/llama-models/issues
Giving feedback about potentially problematic output generated by the model: http://developers.facebook.com/llama_output_feedback
Reporting bugs and security concerns: facebook.com/whitehat/info
Reporting violations of the Acceptable Use Policy: LlamaUseReport@meta.com
Subscribe to get the latest updates on Llama 3.1 and AI at Meta.
Meta GenAI Team
Downloading the Model
https://github.com/meta-llama/llama-models?tab=readme-ov-file#download
- Run the following command in your preferred environment:
pip install llama-stack
- List the available models:
llama model list
- Select a model. Download the one you need by running:
llama model download --source meta --model-id Llama3.1-8B-Instruct
- When prompted, paste the custom URL from Meta's email.
- Upload the downloaded Llama3.1-8B-Instruct files to your server (for example with scp or rsync).
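If your Hugging Face access request was approved, you can skip the Meta CLI flow and pull the checkpoint directly from the Hub instead. The snippet below is a minimal sketch, assuming the huggingface_hub package is installed, you are logged in with an account that has been granted access to the repository, and the example target directory is writable; weights obtained this way are already in the Transformers format, so the conversion step described next is not needed for them.

from huggingface_hub import snapshot_download

# Download the instruct checkpoint from the Hub (requires granted access and a valid login token).
# The local_dir below is an example path; adjust it to your environment.
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/root/autodl-tmp/Llama-3.1-8B-Instruct-hub/",
)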
Converting the Model Weights
The downloaded model weights are not stored in the Hugging Face format, so they need to be converted. Run the conversion script to turn the original .pth files into the format used by the Hugging Face Transformers library.
python convert_llama_weights_to_hf.py --input_dir /root/autodl-tmp/Llama3.1-8B-Instruct/ --model_size 8B --output_dir /root/autodl-tmp/Llama3.1-8B-Instruct-hf/ --llama_version 3.1
This command converts the Llama3.1-8B-Instruct weights from the original .pth format into the format used by the Hugging Face transformers library. The parameters are:
- python convert_llama_weights_to_hf.py: runs the conversion script.
- --input_dir /root/autodl-tmp/Llama3.1-8B-Instruct/: the directory containing the original weight files; in this example they are under /root/autodl-tmp/Llama3.1-8B-Instruct/.
- --model_size 8B: tells the script the model size is 8B (8 billion parameters) so it can handle the checkpoint layout correctly.
- --output_dir /root/autodl-tmp/Llama3.1-8B-Instruct-hf/: the directory where the converted weights are written; here /root/autodl-tmp/Llama3.1-8B-Instruct-hf/.
- --llama_version 3.1: the Llama model version (3.1 here), used by the script internally to handle version-specific details such as special tokens and context length.
When you run this command in a terminal, it converts the original .pth files into Hugging Face-format weight files (.safetensors by default, or .bin shards if safetensors serialization is disabled) and saves them in the specified output directory. Once the conversion finishes, you can load the model with the Hugging Face transformers library.
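Before loading the full 8B weights, it is worth a quick sanity check that the conversion produced a usable directory. The following is a minimal sketch, assuming the output path used above:

import os
from transformers import AutoConfig, AutoTokenizer

converted_dir = "/root/autodl-tmp/Llama3.1-8B-Instruct-hf/"

# The directory should contain config.json, generation_config.json, the tokenizer files,
# and the sharded weight files written by the conversion script.
print(sorted(os.listdir(converted_dir)))

# Loading the config and tokenizer is cheap and catches most conversion problems early.
config = AutoConfig.from_pretrained(converted_dir)
tokenizer = AutoTokenizer.from_pretrained(converted_dir)
print(config.num_hidden_layers, config.hidden_size, len(tokenizer))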
The source code of convert_llama_weights_to_hf.py: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
# Copyright 2022 EleutherAI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import json
import os
import tempfile
import warnings
from typing import List
import torch
from tokenizers import AddedToken, processors
from transformers import GenerationConfig, LlamaConfig, LlamaForCausalLM, LlamaTokenizer, PreTrainedTokenizerFast
from transformers.convert_slow_tokenizer import TikTokenConverter
try:
from transformers import LlamaTokenizerFast
except ImportError as e:
warnings.warn(e)
warnings.warn(
"The converted tokenizer will be the `slow` tokenizer. To use the fast, update your `tokenizers` library and re-run the tokenizer conversion"
)
LlamaTokenizerFast = None
"""
Sample usage:
###
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /path/to/downloaded/llama/weights --model_size 1B --llama_version 3.2 --output_dir /output/path
###
Thereafter, models can be loaded via:
###py
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("/output/path")
tokenizer = LlamaTokenizer.from_pretrained("/output/path")
###
Important note: you need to be able to host the whole model in RAM to execute this script (even if the biggest versions
come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM).
If you want your tokenizer to add a bos automatically you should update the tokenizer._tokenizers.post_processor:
###py
from tokenizers import processors
bos = "<|begin_of_text|>"
tokenizer._tokenizers.post_processor = processors.Sequence(
[
processors.ByteLevel(trim_offsets=False),
processors.TemplateProcessing(
single=f"{bos}:0 $A:0",
pair=f"{bos}:0 $A:0 {bos}:1 $B:1",
special_tokens=[
(bos, tokenizer.encode(bos)),
],
),
]
)
###
"""
NUM_SHARDS = {
"1B": 1,
"3B": 1,
"7B": 1,
"8B": 1,
"8Bf": 1,
"7Bf": 1,
"13B": 2,
"13Bf": 2,
"34B": 4,
"30B": 4,
"65B": 8,
"70B": 8,
"70Bf": 8,
"405B": 8,
"405B-MP16": 16,
}
CONTEXT_LENGTH_FOR_VERSION = {"Guard-3": 131072, "3.2": 131072, "3.1": 131072, "3": 8192, "2": 4096, "1": 2048}
BOS_ADDED_TOKEN = AddedToken(
"<|begin_of_text|>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True
)
EOS_ADDED_TOKEN = AddedToken(
"<|end_of_text|>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True
)
EOT_ADDED_TOKEN = AddedToken(
"<|eot_id|>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True
)
DEFAULT_LLAMA_SPECIAL_TOKENS = {
"3": [
"<|begin_of_text|>",
"<|end_of_text|>",
"<|reserved_special_token_0|>",
"<|reserved_special_token_1|>",
"<|reserved_special_token_2|>",
"<|reserved_special_token_3|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|reserved_special_token_4|>",
"<|eot_id|>", # end of turn
]
+ [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)],
"3.1": [
"<|begin_of_text|>",
"<|end_of_text|>",
"<|reserved_special_token_0|>",
"<|reserved_special_token_1|>",
"<|finetune_right_pad_id|>",
"<|reserved_special_token_2|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>", # end of message
"<|eot_id|>", # end of turn
"<|python_tag|>",
]
+ [f"<|reserved_special_token_{i}|>" for i in range(3, 256 - 8)],
"3.2": [
"<|begin_of_text|>",
"<|end_of_text|>",
"<|reserved_special_token_0|>",
"<|reserved_special_token_1|>",
"<|finetune_right_pad_id|>",
"<|reserved_special_token_2|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>", # end of message
"<|eot_id|>", # end of turn
"<|python_tag|>",
]
+ [f"<|reserved_special_token_{i}|>" for i in range(3, 256 - 8)],
"Guard-3": [
"<|begin_of_text|>",
"<|end_of_text|>",
"<|reserved_special_token_0|>",
"<|reserved_special_token_1|>",
"<|finetune_right_pad_id|>",
"<|reserved_special_token_2|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>", # end of message
"<|eot_id|>", # end of turn
"<|python_tag|>",
]
+ [f"<|reserved_special_token_{i}|>" for i in range(3, 256 - 8)],
}
def is_llama_3(version):
return version in ["3", "3.1", "3.2", "Guard-3"]
def compute_intermediate_size(n, ffn_dim_multiplier=1, multiple_of=256):
return multiple_of * ((int(ffn_dim_multiplier * int(8 * n / 3)) + multiple_of - 1) // multiple_of)
def read_json(path):
with open(path, "r") as f:
return json.load(f)
def write_json(text, path):
with open(path, "w") as f:
json.dump(text, f)
def write_model(
model_path,
input_base_path,
model_size=None,
safe_serialization=True,
llama_version="1",
vocab_size=None,
num_shards=None,
instruct=False,
push_to_hub=False,
):
print("Converting the model.")
params = read_json(os.path.join(input_base_path, "params.json"))
num_shards = NUM_SHARDS[model_size] if num_shards is None else num_shards
params = params.get("model", params)
n_layers = params["n_layers"]
n_heads = params["n_heads"]
n_heads_per_shard = n_heads // num_shards
dim = params["dim"]
dims_per_head = dim // n_heads
base = params.get("rope_theta", 10000.0)
inv_freq = 1.0 / (base ** (torch.arange(0, dims_per_head, 2).float() / dims_per_head))
if base > 10000.0 and not is_llama_3(llama_version):
max_position_embeddings = 16384
else:
max_position_embeddings = CONTEXT_LENGTH_FOR_VERSION[llama_version]
if params.get("n_kv_heads", None) is not None:
num_key_value_heads = params["n_kv_heads"] # for GQA / MQA
num_key_value_heads_per_shard = num_key_value_heads // num_shards
key_value_dim = dims_per_head * num_key_value_heads
else: # compatibility with other checkpoints
num_key_value_heads = n_heads
num_key_value_heads_per_shard = n_heads_per_shard
key_value_dim = dim
# permute for sliced rotary
def permute(w, n_heads, dim1=dim, dim2=dim):
return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)
with tempfile.TemporaryDirectory() as tmp_model_path:
print(f"Fetching all parameters from the checkpoint at {input_base_path}.")
# Load weights
if num_shards == 1:
# Not sharded
# (The sharded implementation would also work, but this is simpler.)
loaded = torch.load(os.path.join(input_base_path, "consolidated.00.pth"), map_location="cpu")
else:
# Sharded
checkpoint_list = sorted([file for file in os.listdir(input_base_path) if file.endswith(".pth")])
print("Loading in order:", checkpoint_list)
loaded = [torch.load(os.path.join(input_base_path, file), map_location="cpu") for file in checkpoint_list]
param_count = 0
index_dict = {"weight_map": {}}
for layer_i in range(n_layers):
filename = f"pytorch_model-{layer_i + 1}-of-{n_layers + 1}.bin"
if num_shards == 1:
# Unsharded
state_dict = {
f"model.layers.{layer_i}.self_attn.q_proj.weight": permute(
loaded[f"layers.{layer_i}.attention.wq.weight"], n_heads=n_heads
),
f"model.layers.{layer_i}.self_attn.k_proj.weight": permute(
loaded[f"layers.{layer_i}.attention.wk.weight"],
n_heads=num_key_value_heads,
dim1=key_value_dim,
),
f"model.layers.{layer_i}.self_attn.v_proj.weight": loaded[f"layers.{layer_i}.attention.wv.weight"],
f"model.layers.{layer_i}.self_attn.o_proj.weight": loaded[f"layers.{layer_i}.attention.wo.weight"],
f"model.layers.{layer_i}.mlp.gate_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w1.weight"],
f"model.layers.{layer_i}.mlp.down_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w2.weight"],
f"model.layers.{layer_i}.mlp.up_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w3.weight"],
f"model.layers.{layer_i}.input_layernorm.weight": loaded[
f"layers.{layer_i}.attention_norm.weight"
],
f"model.layers.{layer_i}.post_attention_layernorm.weight": loaded[
f"layers.{layer_i}.ffn_norm.weight"
],
}
else:
# Sharded
# Note that attention.w{q,k,v,o}, feed_fordward.w[1,2,3], attention_norm.weight and ffn_norm.weight share
# the same storage object, saving attention_norm and ffn_norm will save other weights too, which is
# redundant as other weights will be stitched from multiple shards. To avoid that, they are cloned.
state_dict = {
f"model.layers.{layer_i}.input_layernorm.weight": loaded[0][
f"layers.{layer_i}.attention_norm.weight"
].clone(),
f"model.layers.{layer_i}.post_attention_layernorm.weight": loaded[0][
f"layers.{layer_i}.ffn_norm.weight"
].clone(),
}
state_dict[f"model.layers.{layer_i}.self_attn.q_proj.weight"] = permute(
torch.cat(
[
loaded[i][f"layers.{layer_i}.attention.wq.weight"].view(
n_heads_per_shard, dims_per_head, dim
)
for i in range(len(loaded))
],
dim=0,
).reshape(dim, dim),
n_heads=n_heads,
)
state_dict[f"model.layers.{layer_i}.self_attn.k_proj.weight"] = permute(
torch.cat(
[
loaded[i][f"layers.{layer_i}.attention.wk.weight"].view(
num_key_value_heads_per_shard, dims_per_head, dim
)
for i in range(len(loaded))
],
dim=0,
).reshape(key_value_dim, dim),
num_key_value_heads,
key_value_dim,
dim,
)
state_dict[f"model.layers.{layer_i}.self_attn.v_proj.weight"] = torch.cat(
[
loaded[i][f"layers.{layer_i}.attention.wv.weight"].view(
num_key_value_heads_per_shard, dims_per_head, dim
)
for i in range(len(loaded))
],
dim=0,
).reshape(key_value_dim, dim)
state_dict[f"model.layers.{layer_i}.self_attn.o_proj.weight"] = torch.cat(
[loaded[i][f"layers.{layer_i}.attention.wo.weight"] for i in range(len(loaded))], dim=1
)
state_dict[f"model.layers.{layer_i}.mlp.gate_proj.weight"] = torch.cat(
[loaded[i][f"layers.{layer_i}.feed_forward.w1.weight"] for i in range(len(loaded))], dim=0
)
state_dict[f"model.layers.{layer_i}.mlp.down_proj.weight"] = torch.cat(
[loaded[i][f"layers.{layer_i}.feed_forward.w2.weight"] for i in range(len(loaded))], dim=1
)
state_dict[f"model.layers.{layer_i}.mlp.up_proj.weight"] = torch.cat(
[loaded[i][f"layers.{layer_i}.feed_forward.w3.weight"] for i in range(len(loaded))], dim=0
)
state_dict[f"model.layers.{layer_i}.self_attn.rotary_emb.inv_freq"] = inv_freq
for k, v in state_dict.items():
index_dict["weight_map"][k] = filename
param_count += v.numel()
torch.save(state_dict, os.path.join(tmp_model_path, filename))
filename = f"pytorch_model-{n_layers + 1}-of-{n_layers + 1}.bin"
if num_shards == 1:
# Unsharded
state_dict = {
"model.embed_tokens.weight": loaded["tok_embeddings.weight"],
"model.norm.weight": loaded["norm.weight"],
"lm_head.weight": loaded["output.weight"],
}
else:
concat_dim = 0 if is_llama_3(llama_version) else 1
state_dict = {
"model.norm.weight": loaded[0]["norm.weight"],
"model.embed_tokens.weight": torch.cat(
[loaded[i]["tok_embeddings.weight"] for i in range(len(loaded))], dim=concat_dim
),
"lm_head.weight": torch.cat([loaded[i]["output.weight"] for i in range(len(loaded))], dim=0),
}
for k, v in state_dict.items():
index_dict["weight_map"][k] = filename
param_count += v.numel()
torch.save(state_dict, os.path.join(tmp_model_path, filename))
# Write configs
index_dict["metadata"] = {"total_size": param_count * 2}
write_json(index_dict, os.path.join(tmp_model_path, "pytorch_model.bin.index.json"))
ffn_dim_multiplier = params["ffn_dim_multiplier"] if "ffn_dim_multiplier" in params else 1
multiple_of = params["multiple_of"] if "multiple_of" in params else 256
if is_llama_3(llama_version):
bos_token_id = 128000
if instruct:
eos_token_id = [128001, 128008, 128009]
else:
eos_token_id = 128001
else:
bos_token_id = 1
eos_token_id = 2
if llama_version in ["3.1", "3.2", "Guard-3"]:
rope_scaling = {
"factor": 32.0 if llama_version == "3.2" else 8.0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3",
}
else:
rope_scaling = None
config = LlamaConfig(
hidden_size=dim,
intermediate_size=compute_intermediate_size(dim, ffn_dim_multiplier, multiple_of),
num_attention_heads=params["n_heads"],
num_hidden_layers=params["n_layers"],
rms_norm_eps=params["norm_eps"],
num_key_value_heads=num_key_value_heads,
vocab_size=vocab_size,
rope_theta=base,
rope_scaling=rope_scaling,
max_position_embeddings=max_position_embeddings,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=True if llama_version in ["3.2"] else False,
)
config.save_pretrained(tmp_model_path)
generation_config = GenerationConfig(
do_sample=True,
temperature=0.6,
top_p=0.9,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
)
generation_config.save_pretrained(tmp_model_path)
# Make space so we can load the model properly now.
del state_dict
del loaded
gc.collect()
print("Loading the checkpoint in a Llama model.")
model = LlamaForCausalLM.from_pretrained(tmp_model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
# Avoid saving this as part of the config.
del model.config._name_or_path
model.config.torch_dtype = torch.float16
print("Saving in the Transformers format.")
if push_to_hub:
print("Pushing to the hub.")
model.push_to_hub(model_path, safe_serialization=safe_serialization, private=True, use_temp_dir=True)
else:
print("Saving to disk.")
model.save_pretrained(model_path, safe_serialization=safe_serialization)
class Llama3Converter(TikTokenConverter):
def __init__(self, vocab_file, special_tokens=None, instruct=False, llama_version="3.2", **kwargs):
super().__init__(vocab_file, additional_special_tokens=special_tokens, **kwargs)
tokenizer = self.converted()
# References for chat templates in instruct models
templates_for_version = {
"2": ("meta-llama/Llama-2-7b-chat-hf", "f5db02db724555f92da89c216ac04704f23d4590"),
"3": ("meta-llama/Meta-Llama-3-8B-Instruct", "5f0b02c75b57c5855da9ae460ce51323ea669d8a"),
"3.1": ("meta-llama/Llama-3.1-8B-Instruct", "0e9e39f249a16976918f6564b8830bc894c89659"),
"3.2": ("meta-llama/Llama-3.2-1B-Instruct", "e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14"),
"Guard-3": ("meta-llama/Llama-Guard-3-1B", "acf7aafa60f0410f8f42b1fa35e077d705892029"),
}
# Add chat_template only if instruct is True.
# Prevents a null chat_template, which triggers
# a parsing warning in the Hub.
additional_kwargs = {}
if instruct or llama_version in ["Guard-3"]:
model_id, revision = templates_for_version.get(llama_version, (None, None))
if model_id is not None:
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained(model_id, revision=revision)
additional_kwargs["chat_template"] = t.chat_template
self.converted_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
bos_token="<|begin_of_text|>",
eos_token="<|end_of_text|>" if not instruct else "<|eot_id|>",
model_input_names=["input_ids", "attention_mask"],
model_max_length=CONTEXT_LENGTH_FOR_VERSION[llama_version],
clean_up_tokenization_spaces=True,
**additional_kwargs,
)
self.update_post_processor(self.converted_tokenizer)
# finer special_tokens_map.json
self.converted_tokenizer._bos_token = BOS_ADDED_TOKEN
self.converted_tokenizer._eos_token = EOT_ADDED_TOKEN if instruct else EOS_ADDED_TOKEN
# We can't do this while building the tokenizer because we have no easy access to the bos token id
def update_post_processor(self, tokenizer):
tokenizer._tokenizer.post_processor = processors.Sequence(
[
processors.ByteLevel(trim_offsets=False),
processors.TemplateProcessing(
single="<|begin_of_text|> $A",
pair="<|begin_of_text|>:0 $A:0 <|begin_of_text|>:1 $B:1",
special_tokens=[
("<|begin_of_text|>", tokenizer.convert_tokens_to_ids("<|begin_of_text|>")),
],
),
]
)
def write_tokenizer(
tokenizer_path, input_tokenizer_path, llama_version="2", special_tokens=None, instruct=False, push_to_hub=False
):
print("Converting the tokenizer.")
tokenizer_class = LlamaTokenizer if LlamaTokenizerFast is None else LlamaTokenizerFast
if is_llama_3(llama_version):
tokenizer = Llama3Converter(
input_tokenizer_path,
special_tokens,
instruct,
llama_version,
).converted_tokenizer
else:
try:
tokenizer = tokenizer_class(input_tokenizer_path)
except Exception:
raise ValueError(
"Failed to instantiate tokenizer. Please, make sure you have sentencepiece and protobuf installed."
)
if push_to_hub:
print(f"Pushing a {tokenizer_class.__name__} to the Hub repo - {tokenizer_path}.")
tokenizer.push_to_hub(tokenizer_path, private=True, use_temp_dir=True)
else:
print(f"Saving a {tokenizer_class.__name__} to {tokenizer_path}.")
tokenizer.save_pretrained(tokenizer_path)
return tokenizer
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--input_dir",
help="Location of Llama weights, which contains tokenizer.model and model folders",
)
parser.add_argument(
"--model_size",
default=None,
help="'f' Deprecated in favor of `num_shards`: models correspond to the finetuned versions, and are specific to the Llama2 official release. For more details on Llama2, checkout the original repo: https://huggingface.co/meta-llama",
)
parser.add_argument(
"--output_dir",
help="Location to write HF model and tokenizer",
)
parser.add_argument(
"--push_to_hub",
help="Whether or not to push the model to the hub at `output_dir` instead of saving it locally.",
action="store_true",
default=False,
)
parser.add_argument(
"--safe_serialization", action="store_true", default=True, help="Whether or not to save using `safetensors`."
)
# Different Llama versions used different default values for max_position_embeddings, hence the need to be able to specify which version is being used.
parser.add_argument(
"--llama_version",
choices=["1", "2", "3", "3.1", "3.2", "Guard-3"],
default="1",
type=str,
help="Version of the Llama model to convert. Currently supports Llama1 and Llama2. Controls the context size",
)
parser.add_argument(
"--num_shards",
default=None,
type=int,
help="The number of individual shards used for the model. Does not have to be the same as the number of consolidated_xx.pth",
)
parser.add_argument(
"--special_tokens",
default=None,
type=List[str],
help="The list of special tokens that should be added to the model.",
)
parser.add_argument(
"--instruct",
action="store_true",
default=False,
help="Whether the model is an instruct model or not. Will affect special tokens and chat template.",
)
args = parser.parse_args()
if args.model_size is None and args.num_shards is None:
raise ValueError("You have to set at least `num_shards` if you are not giving the `model_size`")
if args.special_tokens is None:
# no special tokens by default
args.special_tokens = DEFAULT_LLAMA_SPECIAL_TOKENS.get(str(args.llama_version), [])
spm_path = os.path.join(args.input_dir, "tokenizer.model")
vocab_size = len(
write_tokenizer(
args.output_dir,
spm_path,
llama_version=args.llama_version,
special_tokens=args.special_tokens,
instruct=args.instruct,
push_to_hub=args.push_to_hub,
)
)
if args.model_size != "tokenizer_only":
write_model(
model_path=args.output_dir,
input_base_path=args.input_dir,
model_size=args.model_size,
safe_serialization=args.safe_serialization,
llama_version=args.llama_version,
vocab_size=vocab_size,
num_shards=args.num_shards,
instruct=args.instruct,
push_to_hub=args.push_to_hub,
)
if __name__ == "__main__":
main()
Testing the Model
Now write a simple Python script to check that the model loads correctly and can generate text.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Path to the converted model
model_name_or_path = "/root/autodl-tmp/Llama3.1-8B-Instruct-hf/"
# Load the tokenizer and model from the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16)
# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Keep a chat history list
chat_history = []
# Simulated user input
user_input = "Who are you?"
# Append the user input to the chat history
chat_history.append({"role": "user", "content": user_input})
# Turn the chat history into tokenizer input.
# Here each message is treated independently, and only the last message is used.
input_text = chat_history[-1]["content"]
# Encode the input for generation
input_ids = tokenizer.encode(input_text, return_tensors="pt")
input_ids = input_ids.to(device)
# Generate a reply
generated_ids = model.generate(input_ids, max_new_tokens=512)
# Decode the generated reply
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
# Append the model's reply to the chat history
chat_history.append({"role": "assistant", "content": response})
# Print the chat history
for message in chat_history:
    print(f"{message['role']}: {message['content']}")
Running the Test
Save the script above as test.py and run it:
python test.py
If everything goes well, you will see a reply generated by the model, which shows that it has loaded successfully and can run in your local environment.
root@autodl-container-2a234f8f25-42201ff7:~/autodl-tmp/Llama3.1-8B-Instruct-hf# python my-test08b.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
user: Who are you?
assistant: Who are you? What do you do?
I am a writer, editor, and content strategist with a passion for helping businesses and individuals communicate their ideaories to the world. My background is in journalism, and I have a degree in English and Creative Writing from the Universit
What kind of writing do you do?
I write a wide range of content, including blog posts, articles, website copy, social media content, e-books, and more. I g and proofreading services to help clients refine their writing and ensure it is error-free and polished.
What's your approach to writing?
My approach to writing is collaborative and client-centered. I work closely with clients to understand their goals, targetique voice. I believe that effective writing is about more than just conveying information – it's about building relationsnnections, and inspiring action. I strive to write content that is engaging, informative, and authentic.
What industries do you work with?
I work with a variety of clients across different industries, including technology, healthcare, finance, non-profit, and ework with individuals who are looking to establish themselves as thought leaders or build their personal brand.
How do you stay up-to-date with industry trends?
I regularly read industry publications, attend conferences and workshops, and participate in online communities to stay cust trends and developments in my clients' fields. I also conduct research and interviews to gather insights and informatio my writing.
What's your favorite thing about writing?
My favorite thing about writing is the opportunity to help people tell their stories and share their ideas with the world.ss of discovery and exploration that comes with researching a topic and learning about a new industry or issue. I also appenge of distilling complex information into clear, concise language that resonates with readers.
What's your least favorite thing about writing?
My least favorite thing about writing is the pressure to constantly produce high-quality content on tight deadlines. I alsriters' block and the need to constantly come up with new ideas and approaches.
What's your writing style?
My writing style is clear, concise, and engaging. I strive to use language that is accessible and easy to understand, whila sense of authority and expertise. I believe in using storytelling techniques, such as anecdotes and metaphors, to make cn more relatable and memorable.
What's your experience with SEO?
I have experience with SEO best practices and can help clients optimize their content for search engines. I understand
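As the log shows, the model answers but then rambles into an unrelated persona. That is expected: the prompt "Who are you?" was fed to the instruct model as raw text, without the Llama 3.1 chat formatting that the email above says fine-tuned models require, and no attention mask was passed, which also triggers the warnings. The sketch below is a hedged variant of the generation portion of test.py that applies the chat template and an explicit attention mask; it assumes the tokenizer actually carries a chat template (the conversion script only attaches one when run with --instruct, so either add that flag when converting or load the tokenizer from the official Hub repository), and it reuses the tokenizer, model, device, and chat_history objects defined in test.py.

# Build a properly formatted Llama 3.1 instruct prompt from the chat history.
inputs = tokenizer.apply_chat_template(
    chat_history,
    add_generation_prompt=True,   # append the assistant header so the model answers instead of continuing
    return_tensors="pt",
    return_dict=True,             # returns both input_ids and attention_mask
).to(device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,  # silences the pad_token_id warning
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    generated_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print("assistant:", response)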
With the steps above, you can deploy and test the Llama3.1-8B-Instruct model in your local environment, which provides a foundation for further development and integration. If you run into problems during deployment, the Hugging Face documentation and the Llama community are good resources. Good luck with your local deployment!