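Python: Converting a JSONL File to a TSV File

This post converts a JSONL dataset to TSV format so it can be used for model training.

First, look at the format of the JSONL file (here, the BoolQ dataset):

Each record has four keys: question, passage, idx and label.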

{
 "question": "do iran and afghanistan speak the same language",
 "passage": "Persian language -- Persian (/\u02c8p\u025c\u02d0r\u0292\u0259n, -\u0283\u0259n/), also known by its endonym Farsi (\u0641\u0627\u0631\u0633\u06cc f\u0101rsi (f\u0252\u02d0\u027e\u02c8si\u02d0) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.",
 "idx": 0,
 "label": true
}
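In a JSONL file each record sits on a single line, so one line can be parsed on its own with json.loads. The following is a minimal sketch using a shortened version of the record above; the passage text is cut down purely for illustration.

```python
import json

# One JSONL line; the passage is shortened for illustration only.
line = (
    '{"question": "do iran and afghanistan speak the same language", '
    '"passage": "Persian language -- Persian, also known by its endonym Farsi, '
    'is one of the Western Iranian languages.", '
    '"idx": 0, "label": true}\n'
)

record = json.loads(line)        # JSON true/false become Python True/False
keys = sorted(record.keys())
label_as_int = int(record["label"])

print(keys)          # ['idx', 'label', 'passage', 'question']
print(label_as_int)  # 1
```

Note that json.loads tolerates the trailing newline, and that the boolean label converts cleanly to 0 or 1 with int().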

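The code below reads the JSONL file, stores the text (question plus passage) in the list question and the 0/1 label in the list label, and then writes them to the TSV file row by row.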

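import csv
import json

file_path = 'data/val.jsonl'

# Each non-empty line of a JSONL file is one JSON object.
question = []
label = []
with open(file_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing empty line
            continue
        record = json.loads(line)
        # Concatenate question and passage into a single text field.
        question.append(record["question"] + ". " + record["passage"])
        # Map the boolean label to 1 (true) or 0 (false).
        label.append(1 if record["label"] else 0)
print(label)

# Write a header row, then one "label<TAB>text" row per record.
# newline="" lets the csv module control the line endings itself.
with open('data/file.tsv', 'w', encoding="utf-8", newline="") as f:
    tsv_w = csv.writer(f, delimiter='\t', lineterminator='\n')
    tsv_w.writerow(['label', 'question'])
    for lab, text in zip(label, question):
        tsv_w.writerow([lab, text])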
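The same conversion can also be wrapped in a reusable function that writes each row as soon as it is parsed, without building intermediate lists. This is only a sketch: the function name jsonl_to_tsv, the demo file names and the two demo records are made up for illustration and are not part of the BoolQ data.

```python
import csv
import json
import os
import tempfile


def jsonl_to_tsv(src_path, dst_path):
    """Convert a BoolQ-style JSONL file to a two-column TSV.

    Returns the number of data rows written (header excluded).
    """
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["label", "question"])
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            text = record["question"] + ". " + record["passage"]
            writer.writerow([int(record["label"]), text])
            count += 1
    return count


# Tiny demo with made-up records in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo.jsonl")
dst = os.path.join(tmp_dir, "demo.tsv")
with open(src, "w", encoding="utf-8") as f:
    f.write('{"question": "is a", "passage": "A text", "idx": 0, "label": true}\n')
    f.write("\n")
    f.write('{"question": "is b", "passage": "B text", "idx": 1, "label": false}\n')

n_rows = jsonl_to_tsv(src, dst)
with open(dst, encoding="utf-8") as f:
    tsv_text = f.read()

print(n_rows)
print(tsv_text)
```

One thing to keep in mind: csv.writer wraps a field in double quotes when it contains a tab, a newline or a double quote, so whatever reads the TSV should use a CSV-aware parser rather than a plain split on tabs.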

The resulting TSV file has a header row (label, question) followed by one tab-separated row per record.
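One way to inspect the output is to read it back with csv.reader using the same tab delimiter. The following is a minimal sketch; it assumes a small stand-in file with made-up rows written to a temporary directory, rather than the real data/file.tsv.

```python
import csv
import os
import tempfile

# Stand-in for data/file.tsv, filled with made-up rows.
path = os.path.join(tempfile.mkdtemp(), "file.tsv")
with open(path, "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f, delimiter="\t", lineterminator="\n")
    w.writerow(["label", "question"])
    w.writerow([1, "is a. A text"])
    w.writerow([0, "is b. B text"])

# Read the file back; csv.reader returns every field as a string.
with open(path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))

header, data = rows[0], rows[1:]
print(header)
for label, question in data:
    print(label, question[:40])  # show only the start of long texts
```

Because every field comes back as a string, the label column needs an explicit int() conversion before it is used as a numeric target.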
