自定义数据集进行大模型微调

最新推荐文章于 2025-05-08 11:19:22 发布

辛勤的程序猿

最新推荐文章于 2025-05-08 11:19:22 发布

阅读量1.2k

点赞数 3

分类专栏：大模型文章标签： java 前端服务器

本文链接：https://blog.csdn.net/llz19670/article/details/146310530

版权

大模型专栏收录该内容

3 篇文章

订阅专栏

1.数据集格式：

一般进行大模型数据集准备时常用的两种数据集格式：ALpaca和shareGPT

ALpaca：

[
  {
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["user instruction in the first round (optional)", "model response in the first round (optional)"],
      ["user instruction in the second round (optional)", "model response in the second round (optional)"]
    ]
  }
]

shareGPT：

{
    "conversations": [
        {
            "role": "user",
            "content": "What is the capital of France?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris."
        },
        {
            "role": "user",
            "content": "Can you tell me more about Paris?"
        },
        {
            "role": "assistant",
            "content": "Paris is the largest city and the capital of France. It is known for its art, culture, and history..."
        }
    ]
}

2.数据集标注

在数据集标注这里，我比较推荐 easy-dataset这个开源项目的使用，可以通过github仓库去部署，也可以直接下载安装包进行安装，目前该项目仅仅支持markdown格式的文本，此时自己的文本可以通过百度去搜索在线转换的工具，转换成markdown格式的文本。准备好资料后，启动easy-dataset

此时可以通过创建项目，创建一个项目，也可以直接搜索公开的数据集，我们这里使用创建新建的项目。

创建完一般是如上图所示，注意一定要引入一个大模型，通过现有的大模型进行文本的分析拆分等。

可以在上图所示的位置进行大模型api配置，其中这里推荐一个平台获取Deepseek-R1免费5百万token的api,注册链接：https://cloud.lanyun.net/#/registerPage?promoterCode=0005，注册完进入后访问MaaS平台 | GPU智算云平台文档中心去查看如何使用api。