ChatGLM Fine-Tuning: Training Set Configuration

Ccccyaaa

Published 2024-06-24 02:28:19


Article link: https://blog.csdn.net/2301_78005925/article/details/139909971


ChatGLM training dataset format:

 
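# Custom dataset
[
    {
        "instruction": "User instruction (required)",
        "input": "User input (optional)",
        "output": "Model answer (required)",
        "system": "System prompt (optional)",
        "history": [
            ["First-round instruction (optional)", "First-round answer (optional)"],
            ["Second-round instruction (optional)", "Second-round answer (optional)"]
        ]
    }
]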
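Before training, it can help to check every record against the required and optional fields listed above. The following is a minimal sketch, not part of ChatGLM or any training framework: `validate_record` and `validate_dataset` are hypothetical helper names, and the sketch assumes the file is plain JSON (that is, without the `# Custom dataset` comment line, which a JSON parser would reject).

```python
import json

REQUIRED_FIELDS = ("instruction", "output")
OPTIONAL_FIELDS = ("input", "system", "history")


def validate_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    problems = []
    # Required fields must be non-empty strings.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    # Optional text fields, when present, must still be strings.
    for field in ("input", "system"):
        if field in record and not isinstance(record[field], str):
            problems.append(f"{field} must be a string")
    # history is a list of two-element [instruction, answer] lists.
    history = record.get("history", [])
    if not isinstance(history, list):
        problems.append("history must be a list of [instruction, answer] pairs")
        history = []
    for i, turn in enumerate(history):
        ok = (isinstance(turn, list) and len(turn) == 2
              and all(isinstance(t, str) for t in turn))
        if not ok:
            problems.append(f"history[{i}] must be an [instruction, answer] pair")
    # Flag keys that are not part of the format shown above.
    unknown = set(record) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    for field in sorted(unknown):
        problems.append(f"unknown field: {field}")
    return problems


def validate_dataset(text):
    """Parse the dataset JSON text and map record index -> list of problems."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("the dataset must be a JSON array of records")
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            report[index] = problems
    return report
```

Calling `validate_dataset(open("train.json", encoding="utf-8").read())` (the file name is only an example) returns an empty dict when every record is well formed.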

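ChatGLM-6B requires the dataset to be defined as sentence pairs, that is, an input text and its target output. How the pairs are built depends on the downstream task and on how the prompt is constructed.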
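To make the idea of a sentence pair concrete, the sketch below flattens one record into a (source, target) pair. The text layout (`[Round n]`, `Q:`, `A:`) is an assumption chosen for illustration only; it is not the template that ChatGLM-6B or a particular training framework applies, and `to_sentence_pair` is a hypothetical helper name.

```python
def to_sentence_pair(record):
    """Flatten one dataset record into a (source, target) text pair.

    The layout used here is illustrative, not ChatGLM's official template.
    """
    parts = []
    # The optional system prompt goes first.
    system = record.get("system", "")
    if system:
        parts.append(system)
    # Earlier dialogue rounds, if any, follow in order.
    history = record.get("history", [])
    for round_no, (question, answer) in enumerate(history, start=1):
        parts.append(f"[Round {round_no}]\nQ: {question}\nA: {answer}")
    # The current instruction, with the optional input appended to it.
    query = record["instruction"]
    if record.get("input"):
        query = f"{query}\n{record['input']}"
    parts.append(f"Q: {query}\nA:")
    # The source is what the model reads; the target is what it should produce.
    return "\n\n".join(parts), record["output"]
```

The target is kept separate from the source so that a training loop can compute the loss on the answer tokens only, if it chooses to.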

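Our task is to extract medical terms and numerical values from medical literature, together with the correspondences between them. The output must be in JSON format so that our backend can build JSON objects from it and render tables.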
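As a rough sketch of the backend side, the function below parses a model answer and turns its outcome entries into table rows. It assumes the answer follows the fixed_data / variable_data layout used in the prompt later in this article; `output_to_rows` is a hypothetical name, and the real backend may work differently.

```python
import json


def output_to_rows(model_output):
    """Parse the model's JSON answer and return (header, rows) for a table.

    json.loads raises json.JSONDecodeError if the model did not return
    valid JSON, so callers may want to catch that and retry or log it.
    """
    data = json.loads(model_output)
    outcomes = data.get("variable_data", [])
    # Collect column names in first-seen order, because different outcomes
    # may carry different measurement keys.
    header = []
    for outcome in outcomes:
        for key in outcome:
            if key not in header:
                header.append(key)
    # Fill cells that an outcome does not define with an empty string.
    rows = [[outcome.get(key, "") for key in header] for outcome in outcomes]
    return header, rows
```

Keeping the header in first-seen order gives a stable column layout even when each outcome reports a different kind of measurement.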

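We therefore need to write the prompt into the system prompt field (system). The prompt itself appears in the dataset example at the end of this article, after a short note on prompt engineering.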

Prompt engineering:

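An instruction is a request used to retrieve information from the model. It describes what the user needs, but usually has no specific syntax or fixed format. Instructions can cover a wide range of areas, including knowledge retrieval, task execution, and natural language generation. For example, a user might want to look up the population of each of China's centrally administered municipalities.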

Fine-tuning in practice: designing the prompt:

 
# Custom dataset
[
    {
        "instruction": "User instruction (required)",
        "input": "User input (optional)",
        "output": "Model answer (required)",
        # Note: in the actual JSON file the value of "system" must be one string,
        # with inner quotes escaped as \" and line breaks written as \n.
        # It is shown unescaped here for readability.
        "system": "Please output a JSON file in the following format:
        {
            "fixed_data": {
                "total-participants": "500",
                "intervention-participants": "237",
                "control-participants": "235",
                "age": "45-65",
                "eligibility": "",
                "condition": "Type 2 Diabetes",
                "location": "Medline and Scopus",
                "ethnicity": "",
                "intervention": "probiotic yogurt",
                "control": "conventional yogurt",
                "intervention-age": "",
                "control-age": "",
                "conclusion": "The study shows significant improvements in glucose levels."
            },
            "variable_data": [
                {
                    "outcome": "fasting blood glucose",
                    "outcome-Measure": "",
                    "iv-bin-abs": "100",
                    "cv-bin-abs": "95"
                },
                {
                    "outcome": "fasting insulin",
                    "outcome-Measure": "",
                    "iv-bin-percent": "42.1",
                    "cv-bin-percent": "40.0"
                },
                {
                    "outcome": "insulin resistance",
                    "outcome-Measure": "",
                    "iv-cont-mean": "5.5",
                    "cv-cont-mean": "5.3"
                }
            ]
        }",
        "history": [
            ["First-round instruction (optional)", "First-round answer (optional)"],
            ["Second-round instruction (optional)", "Second-round answer (optional)"]
        ]
    }
]
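Because the value of system embeds a JSON template, its quotes and line breaks have to be escaped when the dataset is written to a file. One way to avoid escaping by hand is to build the records in code and let the serializer do it. The sketch below is only an illustration under that assumption: `TEMPLATE` is a shortened stand-in for the full key list in the example, and `build_system_prompt`, `build_record`, and `dump_dataset` are hypothetical helper names.

```python
import json

# Shortened template for illustration; the full key list is the one in the
# example above.
TEMPLATE = {
    "fixed_data": {"total-participants": "500", "condition": "Type 2 Diabetes"},
    "variable_data": [
        {"outcome": "fasting blood glucose", "outcome-Measure": "",
         "iv-bin-abs": "100", "cv-bin-abs": "95"}
    ],
}


def build_system_prompt(template):
    """Embed the JSON template in the instruction text as a plain string."""
    return ("Please output JSON in the following format:\n"
            + json.dumps(template, ensure_ascii=False, indent=2))


def build_record(instruction, text, answer, template=TEMPLATE):
    """Assemble one training record; `answer` is the expected JSON as a dict."""
    return {
        "instruction": instruction,
        "input": text,
        # The target is stored as a JSON string, matching what the model
        # is asked to produce.
        "output": json.dumps(answer, ensure_ascii=False),
        "system": build_system_prompt(template),
        "history": [],
    }


def dump_dataset(records):
    """Serialize the records; json.dumps escapes the quotes and newlines
    inside the system and output strings automatically."""
    return json.dumps(records, ensure_ascii=False, indent=4)
```

The string returned by `dump_dataset` can be written directly to the training file, and loading it back with `json.loads` recovers the original prompt text unchanged.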