In Depth: A Test Report on Fine-Tuning Claude 3 Haiku on Amazon Bedrock

因_果_律

Published 2024-08-07 18:25:12


Article link: https://blog.csdn.net/rralucard123/article/details/140997678


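Introduction

On July 10, 2024, fine-tuning for Anthropic Claude 3 Haiku became available in preview on Amazon Bedrock. This article walks through the steps for fine-tuning Claude 3 Haiku and shares the evaluation results for the fine-tuned model.

Benefits of fine-tuning an LLM

Fine-tuning lets an LLM acquire domain-specific or new knowledge. Compared with RAG (Retrieval-Augmented Generation), this avoids inserting reference material into the prompt, which minimizes input tokens and can therefore reduce the cost and latency of API calls. Because reference information no longer has to be stored and retrieved externally, it can also cut the cost of managing an external database and the time spent on retrieval. (On the other hand, combining fine-tuning with RAG may improve accuracy further.)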

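Steps and what is verified

Apply for access
Create the datasets
Upload the datasets to S3
Run the fine-tuning job
Purchase Provisioned Throughput
Run the fine-tuned model
Evaluate the model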

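Applying for access

As of July 27, 2024, fine-tuning Claude 3 Haiku on Amazon Bedrock requires submitting a request to AWS Support. When creating the support ticket, select "Bedrock" as the service and "Models" as the category.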

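Creating the datasets

The goal of this test is to have Claude 3 Haiku acquire domain knowledge about Amazon Bedrock, so we prepared a dataset specifically for fine-tuning. The dataset consists of question-answer pairs drawn from official AWS documentation. The following sections describe what we considered for this test, the test strategy, and how the datasets were prepared and created.

Datasets considered

Typical publicly available Japanese datasets include databricks-dolly-15k-ja and databricks-dolly-15k-ja-gozaru. The latter is an unusual dataset intended to make an LLM end its answers with "ござる" (gozaru), an archaic-sounding sentence ending. Given Claude 3 Haiku's capabilities, however, a similar effect can be achieved with a system prompt alone, without fine-tuning. Fine-tuning on this dataset would therefore make the effect of fine-tuning hard to see.

For that reason, the goal of this test is to have Claude 3 Haiku acquire domain knowledge rather than learn an output format. Specifically, we prepared a dataset to teach Claude 3 Haiku about Amazon Bedrock, knowledge that is probably not included in its pre-training data.

In addition, the official AWS blog recommends starting with a small, high-quality dataset (50 to 100 records) to get the best results from fine-tuning Claude 3 Haiku. Following that recommendation, this test also used fewer than 100 records.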

Fine-tune Anthropic’s Claude 3 Haiku in Amazon Bedrock to boost model accuracy and quality | AWS Machine Learning Blog

databricks-dolly-15k is a dataset of 15,000 instruction-response pairs published by Databricks. databricks-dolly-15k-ja-gozaru takes databricks-dolly-15k-ja, the Japanese translation of databricks-dolly-15k, and replaces the endings of the responses with "ござる". It is often used to test LLM fine-tuning.

Training data used

For this test, the fine-tuning training data is the Amazon Bedrock FAQs dataset used in the AWS Machine Learning Blog post "Improve RAG accuracy with fine-tuned embedding models on Amazon SageMaker". The dataset is published in the following repository.

https://github.com/aws-samples/fine-tune-embedding-models-on-sagemaker/blob/main/sentence-transformer/multiple-negatives-ranking-loss/training.json

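The dataset was built from the Amazon Bedrock FAQs and stores 85 question-answer pairs in JSON format. Part of the dataset is shown below. In the JSON, the key "sentence1" holds the question and "sentence2" holds the answer.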

[
  {
    "sentence1": "What is Amazon Bedrock and its key features?",
    "sentence2": "Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models along with a broad set of capabilities for building generative AI applications, simplifying development with security, privacy, and responsible AI features."
  },
  {
    "sentence1": "How can I get started with using Amazon Bedrock?",
    "sentence2": "With the serverless experience of Amazon Bedrock, you can quickly get started by navigating to the service in the AWS console and trying out the foundation models in the playground or creating and testing an agent."
  }
]

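Because the dataset contains only 85 question-answer pairs, which is not many, we decided not to split it into training and validation data and instead created the validation data separately.

We chose this dataset because its quality is high and its license poses no problems. It also covers Amazon Bedrock, a specific domain that is expected to be absent from Claude 3 Haiku's pre-training data, which makes it well suited as training material.

Creating the validation data

To create the validation data, we had Claude 3 Opus generate it from the following official AWS documentation. We converted the documentation page to PDF and used the document chat feature and JSON mode of the Amazon Bedrock Converse API, which made it relatively easy to produce a high-quality QA dataset in JSON format.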

What is Amazon Bedrock? - Amazon Bedrock

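I generated 32 question-answer pairs with the following code.

The work is split between tool_config.py, which configures tool use, and create_val_dataset.py, which creates the validation data; only tool_config.py is reproduced below. tool_config.py specifies a JSON array whose items have "question" and "answer" keys and instructs the model to generate 32 pairs. Note that because the tool is used only for JSON mode, no actual tool implementation is defined.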

class ToolConfig:
    tool_name = "QA_dataset_generator"
    no_of_dataset = 32

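    # The prompt below is written in Japanese. In English it reads roughly:
    # "Based on the given document, create a validation dataset for LLM
    # fine-tuning by generating question/answer pairs about Amazon Bedrock.
    # Rules: always generate {no_of_dataset} pairs; answer in English; answer
    # in JSON; create diverse questions and answers about Amazon Bedrock."
    description = f"""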
    与えられるドキュメントに基づいて、LLMのFine-Tuning用のValidationデータセットを作成します。
    具体的には、ドキュメントの内容を利用し、Amazon Bedrockに関する質問文と回答文のペアを生成します。

    <example>
    question: What is Amazon Bedrock and its key features?
    answer: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models along with a broad set of capabilities for building generative AI applications, simplifying development with security, privacy, and responsible AI features.
    </example>

    <rules>
    - 必ず{no_of_dataset}個の質問文と回答文のペアを生成すること。
    - 英語で回答すること。
    - JSON形式で回答すること。
    - Amazon Bedrockについて、多様な質問と回答を作成すること。
    </rules>
    """

    tool_definition = {
        "toolSpec": {
            "name": tool_name,
            "description": description,
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "description": f"Validationデータ用の質問文と回答文のセット。必ず{no_of_dataset}個生成すること。",
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "question": {
                                        "type": "string",
                                        "description": "Validationデータ用の質問文。",
                                    },
                                    "answer": {
                                        "type": "string",
                                        "description": "Validationデータ用の回答文。",
                                    },
                                },
                                "required": ["question", "answer"],
                            },
                        },
                    },
                    "required": ["dataset"],
                }
            },
        }
    }

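The script saves the 32 validation QA pairs to an external JSON file. Part of the generated validation data is shown below; the records follow the QA format requested in the prompt.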

[
  {
    "question": "What is Amazon Bedrock?",
    "answer": "Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models along with capabilities for building generative AI applications, simplifying development with security, privacy, and responsible AI features."
  },
  {
    "question": "What can you do with Amazon Bedrock?",
    "answer": "With Amazon Bedrock, you can experiment with and evaluate top foundation models for your use cases, privately customize them with your own data using techniques like fine-tuning and retrieval augmented generation, and build agents that execute tasks using your enterprise systems and data sources."
  }
]
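The source of create_val_dataset.py is not reproduced in this article. As a rough sketch of how such a script might call the Converse API with a PDF document and force the tool from tool_config.py (JSON mode), something like the following could work. The file names, the model ID, the prompt text, and the helper names build_messages and extract_dataset are assumptions made for illustration, not the author's actual code.

```python
import json


def build_messages(pdf_bytes: bytes, prompt: str) -> list:
    """Build a Converse API user message that attaches a PDF document."""
    return [
        {
            "role": "user",
            "content": [
                {"text": prompt},
                {
                    "document": {
                        "name": "bedrock-doc",  # hypothetical document name
                        "format": "pdf",
                        "source": {"bytes": pdf_bytes},
                    }
                },
            ],
        }
    ]


def extract_dataset(response: dict) -> list:
    """Pull the generated QA pairs out of the toolUse block of a Converse response."""
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]["dataset"]
    raise ValueError("The response contains no toolUse block")


def main() -> None:
    import boto3  # imported here so the helpers above work without AWS access
    from tool_config import ToolConfig  # the class shown earlier

    with open("bedrock-doc.pdf", "rb") as f:  # hypothetical input path
        pdf_bytes = f.read()

    client = boto3.client("bedrock-runtime", region_name="us-west-2")
    response = client.converse(
        modelId="anthropic.claude-3-opus-20240229-v1:0",  # assumed model ID
        messages=build_messages(pdf_bytes, "Create the validation dataset from this document."),
        toolConfig={
            "tools": [ToolConfig.tool_definition],
            # Forcing the tool makes the model answer in the tool's JSON schema.
            "toolChoice": {"tool": {"name": ToolConfig.tool_name}},
        },
    )
    with open("validation.json", "w") as f:  # hypothetical output path
        json.dump(extract_dataset(response), f, indent=2, ensure_ascii=False)
```

Calling main() would run the request. Keeping the request building and response parsing in separate functions makes it possible to check them without invoking the model.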

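Dataset format (preprocessing)

To fine-tune Claude 3 Haiku, the training and validation data must first be converted to JSON Lines (JSONL) format. Specifically, each JSON record holds the system prompt, the user prompt, and the LLM's response.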

{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}

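For this test, preprocessing was done with the code shown below. When running the Python script, command-line arguments specify the system prompt, the input file (training or validation data), the output file, and the JSON keys used in the input file.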

import argparse
import json


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--system-prompt",
        type=str,
        default="You are a high-performance QA assistant that responds to questions concisely, accurately, and appropriately.",
    )
    parser.add_argument(
        "--input-file",
        type=str,
        default="../../dataset/rawdata/validation.json",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        default="../../dataset/preprocessed/claude3_ft_validation.jsonl",
    )
    parser.add_argument("--prompt-key", type=str, default="question")
    parser.add_argument("--completion-key", type=str, default="answer")

    return parser.parse_args()


def preprocess(args: argparse.Namespace) -> None:
    """
    Preprocess the input JSON file to the format that can be used for claude3's fine-tuning.
    Input JSON file should have the structure of a list of dictionaries.
    Below is an example of the input JSON file.
    [
        {
            "question": "What is the capital of France?",
            "answer": "Paris"
        },
        ...
    ]
    """
    with open(args.input_file, "r") as f_in, open(args.output_file, "w") as f_out:
        input_json = json.load(f_in)
        for data in input_json:
            prompt = data[args.prompt_key]
            completion = data[args.completion_key]

            new_data = {}
            new_data["system"] = args.system_prompt
            new_data["messages"] = [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]

            f_out.write(json.dumps(new_data) + "\n")


def main(args: argparse.Namespace) -> None:
    preprocess(args)
    print("Conversion completed!")


if __name__ == "__main__":
    args = get_args()
    main(args)

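For example, to preprocess the training data, run a shell script like the one below. (The argparse defaults are set for the validation data, so no arguments are needed when preprocessing the validation data.)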

#!/bin/bash

INPUT="../../dataset/rawdata/training.json"
OUTPUT="../../dataset/preprocessed/claude3_ft_training.jsonl"
PROMPT_KEY="sentence1"
COMPLETION_KEY="sentence2"

python3 preprocess.py \
    --input-file $INPUT \
    --output-file $OUTPUT \
    --prompt-key $PROMPT_KEY \
    --completion-key $COMPLETION_KEY

The code above saves the dataset in the required format to an external JSONL file. Part of the formatted training data is shown below for reference.

{"system": "You are a high-performance QA assistant that responds to questions concisely, accurately, and appropriately.", "messages": [{"role": "user", "content": "What is Amazon Bedrock and its key features?"}, {"role": "assistant", "content": "Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models along with a broad set of capabilities for building generative AI applications, simplifying development with security, privacy, and responsible AI features."}]}
{"system": "You are a high-performance QA assistant that responds to questions concisely, accurately, and appropriately.", "messages": [{"role": "user", "content": "How can I get started with using Amazon Bedrock?"}, {"role": "assistant", "content": "With the serverless experience of Amazon Bedrock, you can quickly get started by navigating to the service in the AWS console and trying out the foundation models in the playground or creating and testing an agent."}]}

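The training and validation data must meet the following requirements:

Number of records (lower and upper limits)

Training data: 32 to 10,000
Validation data: 32 to 1,000

Dataset size

Training data: up to 10 GB
Validation data: up to 1 GB

Dataset format

JSON Lines (JSONL) format
The system prompt, user prompt, and LLM response must be stored in the Claude 3 format

Because fine-tuning takes time and money, it is worth confirming that the datasets meet the requirements before starting. In this test, we checked them in advance with the following official AWS Data Validation tool.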

amazon-bedrock-samples/bedrock-fine-tuning/claude-haiku/DataValidation at main · aws-samples/amazon-bedrock-samples · GitHub
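A quick local sanity check can also catch obvious problems before any AWS tooling is involved. The sketch below checks only the record-count limits listed above and the single-turn system/messages shape used in this article. It is a simplified illustration (the name check_jsonl is made up here) and does not replace the AWS Data Validation tool.

```python
import json

# Record-count limits quoted in this article.
LIMITS = {"training": (32, 10_000), "validation": (32, 1_000)}


def check_jsonl(lines: list, kind: str) -> list:
    """Return a list of human-readable problems found in JSONL records."""
    problems = []
    low, high = LIMITS[kind]
    records = [line for line in lines if line.strip()]
    if not low <= len(records) <= high:
        problems.append(f"{kind}: {len(records)} records, expected {low} to {high}")
    for i, line in enumerate(records, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append(f"line {i}: 'messages' must be a list of objects")
        elif [m.get("role") for m in messages] != ["user", "assistant"]:
            problems.append(f"line {i}: expected one user turn followed by one assistant turn")
        elif not all(isinstance(m.get("content"), str) and m["content"] for m in messages):
            problems.append(f"line {i}: empty or non-string content")
    return problems
```

To use it, read the lines of a preprocessed file such as claude3_ft_training.jsonl and pass them in with kind set to "training" or "validation"; an empty list means no problems were found by these particular checks.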

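Uploading the datasets to S3

The training and validation data must be uploaded to an S3 bucket in the US West (Oregon) Region. The preprocessed datasets created in the previous step are published in this test's repository. If you use the data from the repository, you can upload it with the following command. Replace <your bucket> with a bucket name of your choice.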

aws s3 cp dataset/preprocessed/ s3://<your bucket>/claude3-haiku/dataset --recursive
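If you would rather upload from Python, roughly the same thing can be done with Boto3's upload_file. The following is a minimal sketch; the function name upload_directory is hypothetical, and the client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
from pathlib import Path


def upload_directory(s3_client, local_dir: str, bucket: str, prefix: str) -> list:
    """Upload every file under local_dir to s3://bucket/prefix/ and return the keys."""
    uploaded = []
    base = Path(local_dir)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            # Keep the relative layout of local_dir below the S3 prefix.
            key = f"{prefix}/{path.relative_to(base).as_posix()}"
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded
```

A call such as upload_directory(boto3.client("s3", region_name="us-west-2"), "dataset/preprocessed", "<your bucket>", "claude3-haiku/dataset") would mirror the CLI command above.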

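Running the fine-tuning job

Steps in the console

The steps for fine-tuning Claude 3 Haiku in the Amazon Bedrock console are as follows.

In the Amazon Bedrock console for the Oregon Region, choose "Custom models" under "Foundation models" in the left navigation pane.

On the right, choose [Customize model], then choose [Create Fine-tuning job].

On the fine-tuning job creation screen, choose [Select model] under [Source model].

Select Claude 3 Haiku and choose [Apply].

Enter the following on the fine-tuning job settings screen.

Fine-tuned model name: any model name
Job name: a name for the fine-tuning job
Input data: the S3 paths of the training and validation data uploaded earlier

Set the hyperparameters for the fine-tuning job. The default number of epochs is 2, but this test used 10; all other parameters were left at their defaults.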

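Hyperparameter	Description
Epochs	Number of passes over the entire training dataset (up to 10)
Batch size	Number of samples used for each update of the model parameters
Learning rate multiplier	Multiplier applied to the base learning rate
Early stopping	A technique for preventing overfitting that stops training when the validation loss has not improved for a set number of epochs
Early stopping threshold	Minimum improvement in validation loss used to decide on early stopping
Early stopping patience	Number of epochs without improvement tolerated before early stopping is triggered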

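Specify the S3 URI where the training and validation loss recorded during fine-tuning will be saved, and create a new service role. Then choose [Fine-tune model] to start the fine-tuning job.

When the fine-tuning job finishes, its status changes to Complete. In this test it finished in about 2 hours.

Choose [Models] and then the model name to see details such as the model ARN, the job ARN, and the output data location (where the per-epoch training and validation loss values are stored).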
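The article performs these steps in the console. For reference, a roughly equivalent job can be submitted with Boto3's create_model_customization_job. The sketch below is built on several assumptions that are not taken from the article: the job and model names, the S3 layout, the base model identifier, and the epochCount hyperparameter key. It also expects the ARN of an existing IAM service role, whereas the console can create one for you.

```python
def build_fine_tuning_request(bucket: str, role_arn: str) -> dict:
    """Assemble keyword arguments for bedrock.create_model_customization_job."""
    prefix = f"s3://{bucket}/claude3-haiku"
    return {
        "jobName": "claude3-haiku-ft-job",      # hypothetical job name
        "customModelName": "claude3-haiku-ft",  # hypothetical model name
        "roleArn": role_arn,
        # Assumed identifier of the fine-tunable Claude 3 Haiku base model.
        "baseModelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0:200k",
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {"s3Uri": f"{prefix}/dataset/claude3_ft_training.jsonl"},
        "validationDataConfig": {
            "validators": [{"s3Uri": f"{prefix}/dataset/claude3_ft_validation.jsonl"}]
        },
        "outputDataConfig": {"s3Uri": f"{prefix}/output"},
        # Hyperparameter values are passed as strings; 10 epochs as in this article.
        "hyperParameters": {"epochCount": "10"},
    }


def start_fine_tuning(bucket: str, role_arn: str) -> str:
    """Submit the job and return its ARN."""
    import boto3  # imported lazily so the request builder can be inspected offline

    bedrock = boto3.client("bedrock", region_name="us-west-2")
    response = bedrock.create_model_customization_job(
        **build_fine_tuning_request(bucket, role_arn)
    )
    return response["jobArn"]
```

Check the hyperparameter names and the base model identifier against the current Bedrock documentation before relying on them.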

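Observing the training and validation loss

The output data saved to S3 includes CSV files containing the training and validation loss for each epoch. Inspecting these values shows whether fine-tuning went well.

The change in training loss and validation loss during fine-tuning is shown below. Both decrease as the number of epochs increases, which confirms that training progressed properly. The validation loss also shows no improvement between epochs 5 and 7, and early stopping was triggered at epoch 7.

The CSV files record the loss for every step, whereas the figure above shows the average loss per epoch.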
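Turning step-wise values into per-epoch averages is a simple group-and-mean. A minimal sketch follows; the column names epoch_number and training_loss are assumptions about the CSV layout, so adjust them to match the header of your own output file.

```python
import csv
import io
from collections import defaultdict


def average_loss_per_epoch(csv_text: str, epoch_col: str = "epoch_number",
                           loss_col: str = "training_loss") -> dict:
    """Average the step-wise loss values within each epoch."""
    losses = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        losses[int(row[epoch_col])].append(float(row[loss_col]))
    return {epoch: sum(values) / len(values) for epoch, values in sorted(losses.items())}
```

The same function works for the validation metrics file by passing the name of its loss column as loss_col.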

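Purchasing Provisioned Throughput

To deploy the fine-tuned model, you need to purchase Provisioned Throughput. The steps for purchasing it in the console are described below.

On the [Custom models] screen of the Bedrock console, select the fine-tuned model and choose "Purchase Provisioned Throughput".

Enter a name for the provisioned throughput and select a commitment term. Because this test used it for only about an hour, we chose the hourly-billed "No commitment" option. Then choose "Purchase Provisioned Throughput".

A purchase confirmation screen appears; select the checkbox and choose [Confirm purchase].

In this test, provisioning took about 20 minutes to complete.

Selecting the provisioned throughput you created shows its ARN, which is used when running inference with Boto3.

The charge is $132 per hour, so we recommend deleting the provisioned throughput as soon as you are done with it!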
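To put the hourly rate in perspective, the arithmetic below compares a one-hour test with a provisioned throughput left running for a month. The $132 rate comes from this article; the 730 hours per month figure is an assumption (365 days × 24 hours ÷ 12 months).

```python
HOURLY_RATE_USD = 132   # no-commitment rate quoted in this article
HOURS_PER_MONTH = 730   # assumption: 365 days * 24 hours / 12 months


def provisioned_cost(hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of keeping one provisioned throughput alive for `hours`."""
    return hours * hourly_rate


print(provisioned_cost(1))                # a one-hour test like the one in this article
print(provisioned_cost(HOURS_PER_MONTH))  # forgetting to delete it for a full month
```

Under these assumptions a forgotten no-commitment deployment would cost $96,360 per month. The 6-month commitment figure of $48,180 per month cited in the summary corresponds to $66 per hour under the same 730-hour assumption.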

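Running the fine-tuned model

We run inference with the fine-tuned model using both the Amazon Bedrock Playground and the AWS SDK for Python (Boto3).

Running in the Playground

In the Amazon Bedrock console, choose [Chat] under Playgrounds and select a model. Here, choose custom models and select the model fine-tuned earlier.

You can confirm that the model name shown in the Playground is the name of the fine-tuned model. This time I entered the question "What is Knowledge Bases for Amazon Bedrock?" and pressed [Run].

The fine-tuned model's answer is shown below. The first half is appropriate, but the second half contains some hallucination (at the time of writing, Knowledge Bases for Amazon Bedrock does not support data formats such as JSON and XML).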

Knowledge Bases for Amazon Bedrock is a feature that allows you to seamlessly integrate structured and unstructured data sources into your large language model (LLM) applications built on Amazon Bedrock. This feature enables you to enhance the context and knowledge available to your models, leading to more informed and relevant responses.

With Knowledge Bases for Amazon Bedrock, you can connect your LLM applications to a variety of data sources, including databases, content management systems, enterprise applications, and file storage services. The feature supports a wide range of data formats, including structured data (e.g., CSV, Excel, SQL), unstructured data (e.g., PDF, Word, HTML), and semi-structured data (e.g., JSON, XML).

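For comparison, we asked the same question of the standard Claude 3 Haiku without fine-tuning.

Claude 3 Haiku's answer is shown below. The response indicates that Claude 3 Haiku's pre-training data does not include information about Knowledge Bases for Amazon Bedrock. Comparing it with the fine-tuned model's response confirms the effect of fine-tuning.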

Unfortunately, I do not have any specific information about "Knowledge Bases for Amazon Bedrock." Bedrock is a platform provided by Amazon, but I do not have detailed knowledge about its capabilities or features. Knowledge bases are typically collections of structured data that can be used to provide information, but without more context about how this relates to Amazon Bedrock, I cannot provide any details. My knowledge is limited, so I cannot give you a substantive answer about this particular topic. I'd suggest checking the Amazon documentation or other reliable sources to learn more about Amazon Bedrock and any associated knowledge base capabilities.

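Running with the AWS SDK for Python (Boto3)

You can also call the fine-tuned model through the API with the AWS SDK for Python (Boto3). Specifically, use the Amazon Bedrock InvokeModel API and pass the ARN of the provisioned throughput as the modelId parameter to run inference.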

import json

import boto3

model_id = "<provisioned throughput arn>"

system_prompt = "You are a high-performance QA assistant that responds to questions concisely, accurately, and appropriately."
prompt = "What can you do with Amazon Bedrock?"

client = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")

response = client.invoke_model(
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": f"{prompt}"}],
            "temperature": 0.1,
            "top_p": 0.9,
            "system": f"{system_prompt}",
        }
    ),
    modelId=model_id,
)
output = response.get("body").read().decode("utf-8")
response_body = json.loads(output)
response_text = response_body["content"][0]["text"]
print(response_text)

Running the code above produced the following output, a reasonable answer to the question "What can you do with Amazon Bedrock?".

Amazon Bedrock is a fully managed service that enables developers to build, deploy, and scale generative AI applications quickly and easily. With Amazon Bedrock, you can create generative AI applications that can generate human-like text, images, code, and other content, as well as engage in open-ended conversations and complete a variety of tasks.

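Model evaluation

To evaluate the fine-tuned model, we created an evaluation dataset and evaluated with LLM-as-a-Judge.

The evaluation dataset was created in QA format, with an expected answer prepared in advance for each of four questions. The questions used in this evaluation are: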

What can you do with Amazon Bedrock?
What is Knowledge Bases for Amazon Bedrock?
What are Agents for Amazon Bedrock?
What are Guardrails for Amazon Bedrock?

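As the baseline for comparison, we used Claude 3 Haiku before fine-tuning (hereafter, the base model).

Evaluation metric

The evaluation metric was the correctness criterion of LLM-as-a-Judge. LLM-as-a-Judge was run with the evaluate_strings method of LangChain's Scoring Evaluator, which makes the various evaluation criteria provided by LangChain available for LLM-as-a-Judge. For a quantitative evaluation, we used labeled_score_string to score each inference result from 1 to 10, with Claude 3.5 Sonnet as the judge LLM.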

Scoring Evaluator | 🦜️🔗 LangChain

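Results

The table below shows each model's average correctness over the four questions. The fine-tuned model scores higher than the base model, and its score is reasonably good in absolute terms.

Model	Correctness (average)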
Fine-tuning model	7.0 / 10.0
Base model (Claude3 Haiku)	2.5 / 10.0

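Next, to confirm the effect of fine-tuning, we look at the score for each question and at what each model answered.

Scores per question

The table below shows the correctness score for each question. The fine-tuned model scores high on questions 1, 2, and 4 but low on question 3. The base model scores low across the board.

#	Question (prompt)	Correctness (Fine-tuning model)	Correctness (Base model)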
1	What can you do with Amazon Bedrock?	8.0 / 10.0	3.0 / 10.0
2	What is Knowledge Bases for Amazon Bedrock?	8.0 / 10.0	2.0 / 10.0
3	What are Agents for Amazon Bedrock?	4.0 / 10.0	2.0 / 10.0
4	What are Guardrails for Amazon Bedrock?	8.0 / 10.0	3.0 / 10.0
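As a quick arithmetic check, the per-question scores in this table reproduce the averages reported in the results table (7.0 and 2.5).

```python
# Per-question correctness scores copied from the table above.
scores = {
    "Fine-tuning model": [8.0, 8.0, 4.0, 8.0],
    "Base model": [3.0, 2.0, 2.0, 3.0],
}

# Mean over the four questions for each model.
averages = {name: sum(values) / len(values) for name, values in scores.items()}
print(averages)
```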

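Answers in detail

The table below shows each model's answer to each question. Compared with the base model, the fine-tuned model's answers are more appropriate. Answer quality is particularly high for questions 1, 2, and 4, where correctness is high. However, some hallucination appears in the answers to questions 2 and 3. The answer to question 3 also carries less information than the others and is not an adequate answer.

Looking at the base model's answers, it responds that it "does not have that knowledge" for questions 2 and 3. Its answers to the other questions contain a lot of hallucination, and its correctness is low overall.

These results suggest that fine-tuning gave Claude 3 Haiku knowledge of Amazon Bedrock to a certain degree.

#	Question (prompt)	Fine-tuning model answer	Base model answer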
1	What can you do with Amazon Bedrock?	Amazon Bedrock is a fully managed service that provides a quick, effective, and secure path to building generative AI applications powered by foundation models from Amazon, Anthropic, Stability AI, Cohere, and other integrated providers. It simplifies the creation and deployment of cutting-edge customized models and foundational models for a wide range of natural language processing (NLP), text-to-image, and speech-to-text use cases, all while focusing on security, data privacy, and responsible AI.	Amazon Bedrock is a platform that provides developers with a set of tools and services to build, deploy, and manage conversational AI applications. Here are some of the key things developers can do with Amazon Bedrock: 1. Build conversational AI models: Bedrock provides access to pre-trained language models that can be fine-tuned for various natural language processing (NLP) tasks, such as text generation, intent detection, and entity extraction. 2. Deploy and manage models: Bedrock makes it easier to deploy and manage conversational AI models by providing a scalable, cloud-based infrastructure. Developers can use Bedrock to host, monitor, and update their models.
2	What is Knowledge Bases for Amazon Bedrock?	Knowledge Bases for Amazon Bedrock is a feature that allows you to seamlessly integrate structured and unstructured data sources into your large language model (LLM) applications built on Amazon Bedrock. This feature enables you to enhance the context and knowledge available to your models, leading to more informed and relevant responses. With Knowledge Bases for Amazon Bedrock, you can connect your LLM applications to a variety of data sources, including databases, content management systems, enterprise applications, and file storage services. The feature supports a wide range of data formats, including structured data (e.g., CSV, Excel, SQL), unstructured data (e.g., PDF, Word, HTML), and semi-structured data (e.g., JSON, XML).	Unfortunately, I do not have any specific information about "Knowledge Bases for Amazon Bedrock." Bedrock is a platform provided by Amazon, but I do not have detailed knowledge about its capabilities or features. Knowledge bases are typically collections of structured data that can be used to provide information, but without more context about how this relates to Amazon Bedrock, I cannot provide any details. My knowledge is limited, so I cannot give you a substantive answer about this particular topic. I'd suggest checking the Amazon documentation or other reliable sources to learn more about Amazon Bedrock and any associated knowledge base capabilities.
3	What are Agents for Amazon Bedrock?	Agents for Amazon Bedrock are pre-built AI agents like Claude or Exporter that can be easily added via APIs while maintaining flexibility and customization. They are ready-to-use capabilities for rapid development of conversational AI apps and workflows.	I do not have any specific information about "Agents for Amazon Bedrock". Amazon Bedrock is a machine learning platform, but I don't have details on what "Agents" may refer to in that context. My knowledge is limited, so I cannot provide a more complete answer about this. You may need to check Amazon's documentation or other resources for more information about the features and capabilities of the Amazon Bedrock platform.
4	What are Guardrails for Amazon Bedrock?	Guardrails for Amazon Bedrock is a feature that allows you to manage and mitigate potential risks associated with the use of large language models (LLMs) in various applications. Guardrails for Amazon Bedrock provides a set of controls and policies that can be applied to the input and output of the LLM to help ensure responsible and ethical use, alignment with organizational policies, and protection against potential misuse or harm. These controls can include content filtering, sensitive data detection, profanity filtering, hate speech detection, and other safeguards. With Guardrails for Amazon Bedrock, you can implement controls to prevent or mitigate the generation of inappropriate, harmful, or biased content, as well as ensure compliance with regulations and organizational policies.	Guardrails for Amazon Bedrock are a set of default security and compliance configurations that help protect Bedrock models and deployments. They are designed to provide a secure baseline for Bedrock usage and help customers adhere to best practices and industry standards. Some key features of Bedrock Guardrails include: 1. Encryption: Guardrails ensure that all data stored and transmitted by Bedrock is encrypted at rest and in transit using industry-standard encryption protocols. 2. IAM-based access control: Guardrails enforce strict access control policies using AWS Identity and Access Management (IAM), ensuring that only authorized users and services can interact with Bedrock resources.

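Observations on the fine-tuned model:

Accuracy of the answer to question 3

The fine-tuned model's answer to question 3 is shorter and less informative than its answers to the other questions. Question 3 asks about Agents for Amazon Bedrock and is no more complex than the others. Since the answers to the other questions are fairly accurate, the model presumably did not acquire much knowledge about Agents for Amazon Bedrock. The likely cause is that the fine-tuning dataset contains too little information about Agents for Amazon Bedrock.

Counting the QA pairs about Agents for Amazon Bedrock in the training data showed that only 1 of the 85 pairs covers the topic, and the term appears only 2 times. By contrast, the topics of questions 1, 2, and 4 each have at least 7 QA pairs and at least 14 occurrences. These results confirm that data on Agents for Amazon Bedrock is lacking.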
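The article does not show how these counts were obtained. One simple way to produce comparable numbers is a case-insensitive substring count over both fields of each QA pair, as in the sketch below. The function name count_mentions and the matching rule are assumptions, so the results may not exactly match the figures above.

```python
import re


def count_mentions(qa_pairs: list, keyword: str) -> tuple:
    """Return (number of QA pairs mentioning keyword, total occurrences)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    pair_count = 0
    total = 0
    for pair in qa_pairs:
        # Search the question and the answer together.
        text = f'{pair["sentence1"]} {pair["sentence2"]}'
        hits = len(pattern.findall(text))
        pair_count += hits > 0
        total += hits
    return pair_count, total
```

Loading training.json with json.load and calling count_mentions(data, "agent") would give the pair count and occurrence count for that keyword.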

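To address this, more QA pairs about Agents for Amazon Bedrock need to be added.

Answer quality

Although the fine-tuned model's answers score high on correctness, a closer look shows that they contain some hallucination. This may be because the training data is too small for the model to acquire the knowledge fully.

Preparing a higher-quality dataset should improve answer accuracy and reduce hallucination. In addition, because LLM-as-a-Judge has difficulty detecting these hallucinations accurately, human evaluation is also needed.

Output format

Claude 3 Haiku, not only as the base model in this test, often uses numbered or bulleted lists in its answers, but the fine-tuned model's answers contained none. This is probably because the training data contains no lists, so training on the QA-format data also affected the output format.

Code used for evaluation

For reference, the code used in this test is shown below. The fine-tuned model's answers, the base model's (Claude 3 Haiku) answers, and the labeled evaluation data are stored in external files, which are then loaded and evaluated with LangChain.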

import argparse
import json

from langchain.evaluation import Criteria, EvaluatorType, load_evaluator
from langchain_aws import ChatBedrock


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prediction-file",
        type=str,
        default="../../dataset/eval/fine-tuning-model_prediction.json",
    )
    parser.add_argument(
        "--label-file",
        type=str,
        default="../../dataset/eval/label.json",
    )
    return parser.parse_args()


def load_json(file_path: str) -> list:
    with open(file_path, "r") as f:
        return json.load(f)


def llm_as_a_judge(predictions: list, labels: list) -> None:
    model = ChatBedrock(
        model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
        region_name="us-east-1",
        model_kwargs={
            "temperature": 0.0,
        },
    )

    evaluator = load_evaluator(
        evaluator=EvaluatorType.LABELED_SCORE_STRING,
        criteria=Criteria.CORRECTNESS,
        llm=model,
    )

    scores = []
    for prediction, label in zip(predictions, labels):
        # print(f"Prediction: {prediction}, Label: {label}")
        eval_result = evaluator.evaluate_strings(
            prediction=prediction["answer"],
            reference=label["answer"],
            input=label["question"],
        )
        print(eval_result)
        print(eval_result["score"])
        scores.append(eval_result["score"])

    score_average = sum(scores) / len(scores)
    print(f"Average score: {score_average}")


def main(args: argparse.Namespace) -> None:
    predictions = load_json(args.prediction_file)
    labels = load_json(args.label_file)

    llm_as_a_judge(predictions, labels)


if __name__ == "__main__":
    args = get_args()
    main(args)

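Supplement: evaluation metrics

This test used the correctness criterion of LLM-as-a-Judge as its metric, but ROUGE, BLEU, and BERTScore are also common metrics for text generation tasks. BERTScore in particular computes the similarity between texts from contextualized token embeddings produced by a pre-trained BERT model and is widely used for text generation and summarization. Because we also computed BERTScore during this test, the results are shown below. The table gives the average Precision, Recall, and F1 Score for each model.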

Model	Precision	Recall	F1 Score
Fine-tuning model	0.74	0.76	0.75
Base model (Claude3 Haiku)	0.67	0.71	0.69

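Consistent with the LLM-as-a-Judge correctness results, the fine-tuned model scores higher than the base model, confirming that it generates text closer to the expected answers. For both models, recall tends to be higher than precision. This is because the answers are long and contain many expressions similar to words in the reference text (the expected answer). (Checking ROUGE-1 showed the same tendency for recall to exceed precision.)

The base model's scores, however, are not especially poor either. This is probably because, even when an answer contains a lot of hallucination, it still contains many words that appear in, or resemble words in, the reference text.

Although this test relied mainly on LLM-as-a-Judge, combining BERTScore for semantic similarity with LLM-as-a-Judge correctness for logical consistency and factual accuracy allows evaluation from more angles.

The code used to compute BERTScore is shown below.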

import argparse
import json

from bert_score import score


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prediction-file",
        type=str,
        default="../../dataset/eval/fine-tuning-model_prediction.json",
    )
    parser.add_argument(
        "--label-file",
        type=str,
        default="../../dataset/eval/label.json",
    )
    return parser.parse_args()


def load_json(file_path: str) -> list:
    with open(file_path, "r") as f:
        return json.load(f)


def get_target_sentences(qa_list: list) -> list:
    sentences = []
    for qa in qa_list:
        sentences.append(qa["answer"])
    return sentences


def calc_bert_score(cands: list, refs: list) -> tuple:
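    # lang="ja" makes bert_score load its default multilingual model; the answers
    # here are in English, so lang="en" may be a better fit if you rerun this.
    Precision, Recall, F1 = score(cands, refs, lang="ja", verbose=True)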
    return Precision.numpy().tolist(), Recall.numpy().tolist(), F1.numpy().tolist()


def bert_score(predictions: list, labels: list) -> None:
    cands = get_target_sentences(predictions)
    refs = get_target_sentences(labels)

    P, R, F1 = calc_bert_score(cands, refs)
    for p, r, f1 in zip(P, R, F1):
        print(f"precision: {p}, recall: {r}, f1_score: {f1}")

    print(f"Average precision: {sum(P) / len(P)}")
    print(f"Average recall: {sum(R) / len(R)}")
    print(f"Average f1_score: {sum(F1) / len(F1)}")


def main(args: argparse.Namespace) -> None:
    predictions = load_json(args.prediction_file)
    labels = load_json(args.label_file)

    bert_score(predictions, labels)


if __name__ == "__main__":
    args = get_args()
    main(args)

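Summary

This article described how to fine-tune Claude 3 Haiku on Amazon Bedrock and evaluated the resulting model. The training data came from the Amazon Bedrock FAQs, and the validation data was created with the document chat feature and JSON mode of the Bedrock Converse API. The model was evaluated with LLM-as-a-Judge correctness and BERTScore. The results show that fine-tuning succeeded in giving the model knowledge of Amazon Bedrock.

In terms of running costs, the Provisioned Throughput price for Claude 3 Haiku comes across as quite high, and RAG currently seems more cost-effective (even with a 6-month commitment, the cost is $48,180 per month). Still, being able to fine-tune a high-performing model like Claude 3 Haiku is very attractive, and we look forward to future improvements.

This test fine-tuned on only a small dataset; a larger dataset may improve accuracy. We hope this article serves as a useful reference, and encourage you to try it yourself.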
