1、准备训练数据:
SFT
的数据格式有多种,例如:Alpaca格式、OpenAI格式等。
#其中Alpaca格式如下:
[
{
"instruction":"human instruction (required)",
"input":"human input (optional)",
"output":"model response (required)",
"system":"system prompt (optional)",
"history":[
[
"human instruction in the first round (optional)","model response in the first round (optional)"
],
[
"human instruction in the second round (optional)","model response in the second round (optional)"
]
]
}
]
根据以上的数据格式,我们在ModelScope的数据集找到中文医疗对话数据-Chinese-medical-dialogue符合上述格式。
# 使用git命令拉取数据集 至data目录下
git clone https://www.modelscope.cn/datasets/xiaofengalg/Chinese-medical-dialogue.git /mnt/workspace/LLaMA-Factory/data