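Convert a JSONL dataset to TSV format for use in model training:
First, look at the format of the JSONL file (the BoolQ dataset):
Each record has four keys: question, passage, idx, and label. In the file itself each record occupies a single line; it is spread over several lines here for readability.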
{
"question": "do iran and afghanistan speak the same language",
"passage": "Persian language -- Persian (/\u02c8p\u025c\u02d0r\u0292\u0259n, -\u0283\u0259n/), also known by its endonym Farsi (\u0641\u0627\u0631\u0633\u06cc f\u0101rsi (f\u0252\u02d0\u027e\u02c8si\u02d0) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.",
"idx": 0,
"label": true
}
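As a quick check of the record structure, the sketch below parses one line with json.loads. The line is a shortened stand-in for the sample above (the passage is abbreviated for illustration), and the variable names are only illustrative.

```python
import json

# A shortened record in the same shape as the BoolQ sample above;
# the passage text is abbreviated here for illustration.
line = (
    '{"question": "do iran and afghanistan speak the same language", '
    '"passage": "Persian language -- Persian ...", "idx": 0, "label": true}\n'
)

record = json.loads(line)          # one JSONL line becomes one dict
print(sorted(record.keys()))       # ['idx', 'label', 'passage', 'question']
print(type(record["label"]))       # <class 'bool'>

# JSON true/false arrive as Python bools, so int() yields 1 or 0.
label_int = int(record["label"])
print(label_int)                   # 1
```

Because the label is already a Python bool after parsing, converting it to 1 or 0 needs no string comparison.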
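The code below therefore reads the JSONL file, stores each question (joined with its passage) and each label in the lists question and label, and then writes them to the TSV file row by row.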
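import json
import csv

file_path = 'data/val.jsonl'

# Read every line of the JSONL file; each line holds one JSON record.
with open(file_path, "r", encoding="utf-8") as f:
    lines = f.readlines()

question = []
label = []
for line in lines:
    i = json.loads(line.strip("\n"))
    # Join the question and its passage into a single text field.
    question.append(i["question"] + ". " + i["passage"])
    # Map the boolean label to 1 (true) or 0 (false).
    if i["label"]:
        label_in = 1
    else:
        label_in = 0
    label.append(label_in)
print(label)

# Write a header row, then one (label, question) row per record.
# newline="" lets the csv writer control line endings itself.
with open('data/file.tsv', 'w', encoding="utf-8", newline="") as f:
    tsv_w = csv.writer(f, delimiter='\t', lineterminator='\n')
    tsv_w.writerow(['label', 'question'])
    for num in range(len(lines)):
        tsv_w.writerow([label[num], question[num]])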
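For larger files, the same conversion can be done in one pass without keeping the whole file in memory. The following is a minimal sketch under that assumption, not the code used above: the function name jsonl_to_tsv is hypothetical, and skipping blank lines is an added safeguard in case the input ends with an empty line.

```python
import csv
import json


def jsonl_to_tsv(src_path, dst_path):
    """Stream a BoolQ-style JSONL file into a two-column TSV.

    Returns the number of data rows written (header excluded).
    """
    rows = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["label", "question"])
        for line in src:
            line = line.strip()
            if not line:
                # Skip blank lines instead of failing in json.loads.
                continue
            record = json.loads(line)
            text = record["question"] + ". " + record["passage"]
            # int() maps True/False to 1/0.
            writer.writerow([int(record["label"]), text])
            rows += 1
    return rows
```

Calling jsonl_to_tsv('data/val.jsonl', 'data/file.tsv') would produce the same columns as the script above.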
The resulting TSV file looks like this:
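One way to inspect the output from Python is to read it back with csv.reader. The sketch below assumes the two-column layout (label, question) produced above; it writes a tiny stand-in file to a temporary directory so that it runs on its own, and the row content is abbreviated for illustration. Point path at 'data/file.tsv' to look at the real file.

```python
import csv
import os
import tempfile

# Stand-in file so the sketch runs on its own; replace `path` with
# 'data/file.tsv' to inspect the real output.
path = os.path.join(tempfile.mkdtemp(), "file.tsv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("label\tquestion\n")
    f.write("1\tdo iran and afghanistan speak the same language. Persian language -- ...\n")

# Read the file back using the same tab delimiter.
with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)   # first row is the header
    rows = list(reader)     # remaining rows are [label, question] pairs

print(header)
for label, text in rows[:3]:
    # Show the label and the start of the text for the first few rows.
    print(label, text[:60])
```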