LLM Fine-Tuning: The Alpaca and ShareGPT Training Dataset Formats

IT修炼家

Article link: https://blog.csdn.net/qq_42755230/article/details/142880678

1. Alpaca Format

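The Alpaca format comes from Stanford's Alpaca project, which fine-tuned Meta's open-source LLaMA model, and it is designed for instruction tuning. Each record has three parts: a task description (instruction), an optional input (input), and the expected output (output).

A typical Alpaca dataset record: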

{
    "instruction": "Summarize the following text.",
    "input": "Artificial intelligence (AI) is a rapidly growing field...",
    "output": "AI is an evolving technology that is growing quickly in various fields..."
}

Field descriptions:

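instruction: The task instruction, telling the model what operation to perform.
input: The input the task needs. If the task is open-ended or needs no explicit input, this field can be an empty string.
output: The expected output, that is, what the model should generate given the instruction and input.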
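
To make the role of the three fields concrete, here is a minimal sketch of how one record might be turned into a (prompt, target) training pair. The function names `build_prompt` and `build_example` and the `### Instruction:` / `### Input:` / `### Response:` template wording are illustrative assumptions, not a fixed standard; real training frameworks ship their own templates.

```python
from typing import Dict, Tuple


def build_prompt(record: Dict[str, str]) -> str:
    """Render an Alpaca-style record into a prompt string.

    The section headers used here are a hypothetical template.
    """
    instruction = record["instruction"].strip()
    # "input" may be absent or an empty string for open-ended tasks.
    context = record.get("input", "").strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:
        # Only emit the input section when there is something to show.
        parts.append(f"### Input:\n{context}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


def build_example(record: Dict[str, str]) -> Tuple[str, str]:
    """Return (prompt, target); target is the text the model should generate."""
    return build_prompt(record), record["output"]
```

With this layout, a record whose input is an empty string simply produces a prompt with no input section, which matches the field description above.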

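Characteristics:

Simple structure that is easy to understand.
Clearly separates the task instruction from the input content, which suits a wide range of natural language processing tasks such as text generation, translation, and summarization.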

2. ShareGPT Format

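The ShareGPT format originates from datasets of recorded conversations between users and ChatGPT and is mainly used to train dialogue systems. It is organized around multi-turn conversations, capturing the back-and-forth between a user and the AI.

A typical ShareGPT dataset record (shown here with OpenAI-style role/content keys; the original ShareGPT dumps use from/value keys with human and gpt as the speakers):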

{
    "conversations": [
        {
            "role": "user",
            "content": "What is the capital of France?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris."
        },
        {
            "role": "user",
            "content": "Can you tell me more about Paris?"
        },
        {
            "role": "assistant",
            "content": "Paris is the largest city and the capital of France. It is known for its art, culture, and history..."
        }
    ]
}
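
The relationship between the two formats can be sketched in a few lines of Python: a single-turn Alpaca record maps onto a one-exchange conversation, and a multi-turn sample can be checked for alternating speakers. The helper names `alpaca_to_conversation` and `is_well_formed` are hypothetical, and the rule that a sample must start with a user turn and strictly alternate is an assumption made for this sketch (some datasets also allow a leading system message).

```python
from typing import Any, Dict


def alpaca_to_conversation(record: Dict[str, str]) -> Dict[str, Any]:
    """Wrap an Alpaca-style record as a single user/assistant exchange."""
    user_text = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        # Fold the optional input into the user turn.
        user_text = f"{user_text}\n\n{context}"
    return {
        "conversations": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": record["output"]},
        ]
    }


def is_well_formed(sample: Dict[str, Any]) -> bool:
    """Check that turns alternate user, assistant, user, assistant, ...

    Assumption: a sample is non-empty and ends on an assistant turn.
    """
    turns = sample.get("conversations", [])
    if not turns or len(turns) % 2 != 0:
        return False
    expected = ("user", "assistant")
    return all(
        turn.get("role") == expected[i % 2]
        and isinstance(turn.get("content"), str)
        for i, turn in enumerate(turns)
    )
```

Going the other way (conversation to Alpaca) loses information once there is more than one exchange, which is why multi-turn data is usually kept in the conversation layout.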