Comparing the Formats of Pre-training Datasets and Instruction Fine-tuning Datasets

二分掌柜的

Article link: https://blog.csdn.net/flyfish1986/article/details/141288890

flyfish

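1. Pre-training datasets

Purpose: Learn general-purpose language patterns and features, building broad language understanding and generation ability.
Format:
Continuous text: A pre-training dataset usually consists of a large volume of continuous text drawn from books, articles, dialogues, and similar sources.

No explicit labels: A pre-training dataset needs no explicit input-output pairs, because the training targets come from the text itself. For example, GPT-style models need only large amounts of unlabeled text to predict the next token, while BERT-style models learn by filling in masked tokens (masked language modeling).

Data examples: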

"The quick brown fox jumps over the lazy dog."
"In a faraway land, there lived a wise old man."

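From the surrounding context, the model learns to predict the next word and, in the process, picks up the structure of whole sentences.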
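To make the "no explicit labels" point concrete, here is a minimal sketch of how training targets can be derived from raw text alone. It assumes a toy whitespace split in place of a real subword tokenizer, and the helper names `next_token_pairs` and `masked_example` are hypothetical, introduced only for illustration.

```python
# Toy illustration: real models use subword tokenizers, not str.split().

def next_token_pairs(text):
    """Build (context, target) pairs for next-token prediction."""
    tokens = text.split()
    # For every position i >= 1, the context is everything before i
    # and the target is the token at i.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(text, position, mask="[MASK]"):
    """Hide one token and return (masked text, hidden token)."""
    tokens = text.split()
    target = tokens[position]
    masked = tokens[:position] + [mask] + tokens[position + 1:]
    return " ".join(masked), target


sentence = "The quick brown fox jumps over the lazy dog."

# Next-token prediction: the label is simply the following word.
for context, target in next_token_pairs(sentence)[:3]:
    print(" ".join(context), "->", target)

# Masked-word filling: the label is the word that was hidden.
print(masked_example(sentence, 3))
```

In both cases the "label" is a piece of the original sentence, which is why unlabeled text is sufficient for pre-training.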

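Pre-training dataset example

OpenWebText dataset: OpenWebText is an open-source pre-training dataset built to replicate OpenAI's WebText dataset. It is collected from web pages linked in Reddit posts that received a high score (at least 3 karma in the WebText recipe), with the score serving as a rough proxy for quality.
Data format:
OpenWebText consists mainly of unlabeled continuous text, where each sample is a document or passage. Its main purpose is to let the model learn general semantics and language patterns from a large amount of text.
Example: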

“Once upon a time, in a small village, there was a little girl who loved to explore the woods. She would spend hours wandering among the trees, listening to the birds singing and the leaves rustling in the wind.”

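The text in this example has no specific input-output structure. It is a continuous natural-language passage for the model to learn from during pre-training.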
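Because each sample is just a document, pre-training pipelines often concatenate documents into one token stream and cut it into fixed-length blocks. The sketch below shows that idea under stated assumptions: the whitespace split, the `<|endoftext|>` separator string, the block size, and the function name `pack_documents` are illustrative choices, not something OpenWebText itself prescribes.

```python
def pack_documents(docs, block_size, eos="<|endoftext|>"):
    """Concatenate documents with a separator and split into equal blocks."""
    stream = []
    for doc in docs:
        stream.extend(doc.split())  # toy tokenizer: split on whitespace
        stream.append(eos)          # mark the document boundary
    # Keep only complete blocks; the trailing remainder is dropped.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


docs = [
    "Once upon a time, in a small village, there was a little girl.",
    "In a faraway land, there lived a wise old man.",
]

for block in pack_documents(docs, block_size=8):
    print(block)
```

Dropping the incomplete final block keeps every training sequence the same length; padding it instead is an equally reasonable choice.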

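2. Instruction fine-tuning datasets

Purpose: Enable the model to understand and carry out specific task instructions, such as answering questions or generating text in a particular format.
Format:
Explicit input-output pairs: An instruction fine-tuning dataset usually contains an explicit input (the instruction) and an expected output (the response). The data trains the model to produce an accurate output for a given task or instruction.

Instruction-response structure: Samples are usually organized like a dialogue turn, with an "instruction" followed by a "response".

Data examples: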

{
  "instruction": "Translate the following English sentence to French.",
  "input": "The weather is nice today.",
  "output": "Il fait beau aujourd'hui."
}

Or:

{
  "instruction": "Summarize the following paragraph.",
  "input": "The quick brown fox jumps over the lazy dog in the forest...",
  "output": "A fox jumps over a lazy dog."
}
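One practical consequence of the explicit input-output structure is that a training pipeline can tell prompt tokens apart from response tokens. A common choice, though not one this article specifies, is to compute the loss only on the response. The sketch below assumes a hypothetical character-level `toy_tokenize` and a hypothetical `build_example` helper. The value -100 is the label that PyTorch's `CrossEntropyLoss` ignores by default.

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this label value by default


def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one id per character.
    return [ord(ch) for ch in text]


def build_example(record, tokenize=toy_tokenize):
    """Turn one instruction record into (input_ids, labels)."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    prompt_ids = tokenize(prompt + "\n")
    output_ids = tokenize(record["output"])
    input_ids = prompt_ids + output_ids
    # Prompt positions are masked out; only the response is supervised.
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids
    return input_ids, labels


record = {
    "instruction": "Translate the following English sentence to French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}

input_ids, labels = build_example(record)
print(len(input_ids), len(labels))
# Number of positions that contribute to the loss:
print(sum(label != IGNORE_INDEX for label in labels))
```

In a pre-training setup, by contrast, every token in the sequence would normally carry a label.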

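Instruction fine-tuning dataset example

Stanford Alpaca dataset: Stanford Alpaca is an open-source instruction fine-tuning dataset of about 52K instruction-following examples, generated with OpenAI's text-davinci-003 following the Self-Instruct approach. It is designed to teach a model to produce a specific output in response to an instruction.
Data format:
Each sample in the Stanford Alpaca dataset has three parts: an instruction (instruction), an optional input (input), and an expected output (output). The main purpose of the dataset is to teach the model to understand and carry out concrete instructions given by humans.
Example: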

{
  "instruction": "Translate the following English sentence to Spanish.",
  "input": "How are you?",
  "output": "¿Cómo estás?"
}

Or another example:

{
  "instruction": "Write a short story about a robot who wants to be a human.",
  "input": "",
  "output": "Once upon a time, there was a robot who dreamed of becoming a human. Every day, it would watch the humans around it, observing their emotions, their relationships, and their experiences. The robot wished to experience these things for itself, to feel happiness, sadness, love, and loss..."
}
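Since the input field is optional, records like the two above are usually rendered with one of two prompt templates before training. The sketch below follows the template wording used in the Stanford Alpaca repository as best recalled here, so check the repository for the exact text. The function name `format_prompt` is a hypothetical helper.

```python
import json

# Template wording modeled on the Stanford Alpaca repository (verify there).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_prompt(record):
    """Pick the template based on whether the record has a non-empty input."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(instruction=record["instruction"],
                           input=record.get("input", ""))


raw = """[
  {"instruction": "Translate the following English sentence to Spanish.",
   "input": "How are you?", "output": "¿Cómo estás?"},
  {"instruction": "Write a short story about a robot who wants to be a human.",
   "input": "", "output": "Once upon a time, there was a robot..."}
]"""

for record in json.loads(raw):
    # The model is trained to continue the prompt with record["output"].
    print(format_prompt(record))
    print("-" * 40)
```

The expected output is appended after "### Response:" during training, and left for the model to generate at inference time.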