Python: Converting a JSONL File to a TSV File

Latest recommended article published 2024-06-19 09:05:31

专心致志写BUG



Article link: https://blog.csdn.net/weixin_43975374/article/details/107779950


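This post converts a JSONL dataset to TSV format so that it can be used for model training.

First, look at the format of the JSONL file (the BoolQ dataset). The record below is spread over several lines for readability; in the file itself each record occupies a single line.

Each record has four keys: question, passage, idx, and label.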

{

"question": "do iran and afghanistan speak the same language", 

 "passage": "Persian language -- Persian (/\u02c8p\u025c\u02d0r\u0292\u0259n, -\u0283\u0259n/), also known by its endonym Farsi (\u0641\u0627\u0631\u0633\u06cc f\u0101rsi (f\u0252\u02d0\u027e\u02c8si\u02d0) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.",
 
 "idx": 0, 

 "label": true

}
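As a quick check of this structure, a single JSONL line can be parsed with json.loads. This is only a minimal sketch: the sample line is a shortened stand-in written for illustration (the passage text is abbreviated), not a record copied from the dataset file.

```python
import json

# Hypothetical, shortened record in the same shape as the BoolQ example above.
sample_line = (
    '{"question": "do iran and afghanistan speak the same language", '
    '"passage": "Persian language -- Persian is one of the Western Iranian languages.", '
    '"idx": 0, "label": true}'
)

record = json.loads(sample_line)
keys = sorted(record.keys())
print(keys)                   # the four key names
print(type(record["label"]))  # JSON true is parsed as a Python bool
```

Because label arrives as a bool, it has to be mapped to 1 or 0 explicitly if the training code expects integer labels.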

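The code below reads the JSONL file, stores the data in two lists, question (the question and passage joined into one string) and label (1 for true, 0 for false), and then writes them to the TSV file row by row.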

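import csv
import json

file_path = 'data/val.jsonl'

question = []  # "question. passage" text for each record
label = []     # 1 if the record's label is true, otherwise 0
with open(file_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            # Skip blank lines (e.g. at the end of the file); json.loads("") would raise.
            continue
        record = json.loads(line)
        question.append(record["question"] + ". " + record["passage"])
        label.append(1 if record["label"] else 0)
print(label)

# newline='' lets the csv module control line endings itself.
with open('data/file.tsv', 'w', encoding="utf-8", newline='') as f:
    tsv_w = csv.writer(f, delimiter='\t', lineterminator='\n')
    tsv_w.writerow(['label', 'question'])
    # Iterate over the parsed records rather than the raw line count.
    for label_in, text in zip(label, question):
        tsv_w.writerow([label_in, text])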
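For repeated use (for example on the train and validation splits), the same steps could be wrapped in a function. The sketch below is one possible arrangement, not the original author's code: the name jsonl_to_tsv and its parameters are assumptions introduced here. It writes each row as soon as it is parsed, so the whole dataset does not have to be held in lists.

```python
import csv
import json


def jsonl_to_tsv(src_path, dst_path):
    """Convert a BoolQ-style JSONL file to a two-column TSV; return the row count.

    Hypothetical helper: assumes every record has "question", "passage" and "label".
    """
    rows = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["label", "question"])
        for line in src:
            line = line.strip()
            if not line:  # ignore blank lines
                continue
            record = json.loads(line)
            text = record["question"] + ". " + record["passage"]
            writer.writerow([int(bool(record["label"])), text])
            rows += 1
    return rows
```

With the paths used in this post, the call would be jsonl_to_tsv('data/val.jsonl', 'data/file.tsv').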

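The resulting TSV file has a header row (label, question) followed by one row per record, with the label in the first column and the combined question and passage text in the second.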
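One way to inspect the output without opening it in a spreadsheet is to read it back with csv.reader. This is a minimal sketch: the helper name read_tsv is an assumption introduced for illustration, and it expects a tab-separated file with a header row such as the one produced by the code in this post.

```python
import csv


def read_tsv(path):
    """Return (header, rows) from a tab-separated file with a header row.

    Hypothetical helper for checking the converted file.
    """
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)  # first row holds the column names
        rows = list(reader)    # remaining rows: [label, question] as strings
    return header, rows
```

Calling read_tsv('data/file.tsv') should give back the header ['label', 'question'] and one row per record; note that the labels come back as the strings '1' and '0', not as integers.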
