ChatGLM training dataset format:
# Custom dataset
[
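{
"instruction": "user instruction (required)",
"input": "user input (optional)",
"output": "model response (required)",
"system": "system prompt (optional)",
"history": [
["first-round instruction (optional)", "first-round response (optional)"],
["second-round instruction (optional)", "second-round response (optional)"]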
]
}
]
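As a quick sanity check on the format above, the following is a minimal sketch of a validator for one record. It assumes only what the template states (`instruction` and `output` required, the other three fields optional); the function name `validate_record` and the error messages are hypothetical, not part of any ChatGLM tooling.

```python
REQUIRED_FIELDS = ("instruction", "output")
OPTIONAL_FIELDS = ("input", "system", "history")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = []
    # Required fields must be non-empty strings.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    # Optional text fields, when present, must be strings.
    for field in ("input", "system"):
        if field in record and not isinstance(record[field], str):
            problems.append(f"{field} must be a string")
    # history, when present, is a list of [instruction, answer] pairs.
    history = record.get("history", [])
    if not isinstance(history, list):
        problems.append("history must be a list")
    else:
        for i, turn in enumerate(history):
            is_pair = (isinstance(turn, list) and len(turn) == 2
                       and all(isinstance(t, str) for t in turn))
            if not is_pair:
                problems.append(f"history[{i}] must be an [instruction, answer] pair")
    # Flag keys that the template does not define.
    unknown = set(record) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    return problems
```

Running this over every record before training is a cheap way to catch a missing `output` or a malformed `history` entry early.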
ChatGLM-6B
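The dataset needs to be defined as sentence pairs. Doing so requires taking the downstream task into account, as well as how the prompt is constructed.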
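To make the "sentence pair" idea concrete, here is a rough sketch that flattens one record of the custom dataset format into (prompt, response) pairs. The helper name `record_to_pairs` is hypothetical, and joining `instruction` and `input` with a newline is an assumption; the training framework actually used may combine them differently.

```python
def record_to_pairs(record: dict) -> list:
    """Flatten one record into (prompt, response) pairs, oldest turn first."""
    # Earlier dialogue rounds are already stored as [instruction, answer] pairs.
    pairs = [(question, answer) for question, answer in record.get("history", [])]
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: instruction and input are joined with a newline.
        prompt = f"{prompt}\n{record['input']}"
    # The current round is the last pair.
    pairs.append((prompt, record["output"]))
    return pairs
```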
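Our task is to extract medical terms and numeric values from medical literature, together with the correspondence between them. The output must be in JSON format so that our backend can build JSON objects from it and render tables.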
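On the backend side, consuming such a reply might look like the sketch below. It assumes the reply has the `fixed_data` / `variable_data` shape used in the system prompt later in these notes; the function name `reply_to_rows` and the choice to copy only `condition` into each row are illustrative assumptions, not the actual backend code.

```python
import json


def reply_to_rows(reply: str) -> list:
    """Parse the model's JSON reply and return one flat dict per outcome row."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(data, dict):
        return []
    fixed = data.get("fixed_data", {})
    rows = []
    for item in data.get("variable_data", []):
        # Each table row carries the study-level condition plus one outcome.
        row = {"condition": fixed.get("condition", "")}
        row.update(item)
        rows.append(row)
    return rows
```

Returning an empty list on invalid JSON keeps the table renderer simple; a real backend would probably also log the failed reply.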
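To get output in this JSON format, we place the prompt in the system prompt field `system`. Our prompt is expressed as:
Prompt engineering:
An instruction is a request used to retrieve information from the model. It describes what the user needs, but usually has no specific syntax or fixed format. Instructions can cover a wide range of areas, including knowledge retrieval, task execution, and natural language generation. For example, a user may want to look up the population of each of China's direct-administered municipalities.
Fine-tuning in practice, designing the prompt:
# Custom dataset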
[
{
"instruction": "user instruction (required)",
"input": "user input (optional)",
"output": "model response (required)",
"system": "Output a JSON file in the following format:
"fixed_data": {
"total-participants": "500",
"intervention-participants": "237",
"control-participants": "235",
"age": "45-65",
"eligibility": "",
"condition": "Type 2 Diabetes",
"location": "Medline and Scopus",
"ethnicity": "",
"intervention": "probiotic yogurt",
"control": "conventional yogurt",
"intervention-age": "",
"control-age": "",
"conclusion": "The study shows significant improvements in glucose levels."
},
"variable_data": [
{
"outcome": "fasting blood glucose",
"outcome-Measure": "",
"iv-bin-abs": "100",
"cv-bin-abs": "95"
},
{
"outcome": "fasting insulin",
"outcome-Measure": "",
"iv-bin-percent": "42.1",
"cv-bin-percent": "40.0"
},
{
"outcome": "insulin resistance",
"outcome-Measure": "",
"iv-cont-mean": "5.5",
"cv-cont-mean": "5.3"
}
]
",
"history": [
["first-round instruction (optional)", "first-round response (optional)"],
["second-round instruction (optional)", "second-round response (optional)"]
]
}
]
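Note that the `system` value above contains raw double quotes and line breaks, so it has to be escaped before the file is valid JSON. One way to avoid escaping by hand is to build the record in Python and let `json.dumps` do it, as in this minimal sketch. The template is abridged from the example above, the English system prompt wording is a rendering of the original, and `build_record` and its arguments are hypothetical names.

```python
import json

# Abridged copy of the output template from the example above (sample values).
OUTPUT_TEMPLATE = {
    "fixed_data": {
        "total-participants": "500",
        "condition": "Type 2 Diabetes",
        "intervention": "probiotic yogurt",
        "control": "conventional yogurt",
    },
    "variable_data": [
        {"outcome": "fasting blood glucose", "outcome-Measure": "",
         "iv-bin-abs": "100", "cv-bin-abs": "95"},
    ],
}


def build_record(instruction: str, text: str, answer: dict) -> dict:
    """Build one training record whose system prompt embeds the JSON template."""
    system = ("Output a JSON file in the following format:\n"
              + json.dumps(OUTPUT_TEMPLATE, ensure_ascii=False, indent=2))
    return {
        "instruction": instruction,
        "input": text,
        # The target answer is stored as a JSON string, not a nested object.
        "output": json.dumps(answer, ensure_ascii=False),
        "system": system,
        "history": [],
    }


# Placeholder instruction and input; the answer reuses the template for brevity.
record = build_record("Extract the study data.", "<abstract text>", OUTPUT_TEMPLATE)
# json.dumps escapes the quotes and newlines inside "system" and "output".
dataset_json = json.dumps([record], ensure_ascii=False, indent=2)
```

Writing `dataset_json` to a file then yields a dataset that any JSON parser can load back.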