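Task
Convert a "column dataset" into a "row dataset",
that is, convert column-annotated data into the dataset format used for NLP sequence labeling tasks on Baidu's PaddleHub platform.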
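First, a look at the data sample:
Note 1: training set format: word \t POS tag \n
That is, each line holds one word and its part-of-speech tag, e.g. Jawa NNP
Note 2: sentences are separated from each other by \n (an empty line)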
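To make the input format concrete, here is a minimal sketch that groups a few column-format lines into sentences. The helper name `split_sentences` and the way the sample lines are arranged are made up for illustration, and the sketch assumes the two fields on each line are separated by whitespace (a tab or a space).

```python
def split_sentences(lines):
    """Group column-format lines into sentences of (word, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()  # assumption: fields are separated by a tab or a space
        if not fields:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((fields[0], fields[1]))
    if current:  # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


# Hypothetical sample: two sentences separated by an empty line.
sample = ["Namun\tCC\n", "aparat\tNN\n", "\n", "Jawa\tNNP\n"]
print(split_sentences(sample))
# [[('Namun', 'CC'), ('aparat', 'NN')], [('Jawa', 'NNP')]]
```

Each inner list is one sentence, which is the unit that ends up on a single line in the row format.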
Conversion code
from itertools import groupby

# Read the column-format file: one "word tag" pair per line, empty line between sentences
with open("Ind_train.txt", encoding="utf-8") as file:
    sig_data = file.readlines()

# Split the words and the tags into two parallel lists, keeping "\n" as the sentence separator
stack_1 = []
stack_2 = []
for i in sig_data:
    fields = i.split()  # splits on a tab or a space and drops the trailing "\n"
    if not fields:
        stack_1.append("\n")
        stack_2.append("\n")
    else:
        stack_1.append(fields[0])
        stack_2.append(fields[1])

# Group consecutive items into sentences and drop the separators
zifu_res = [list(g) for k, g in groupby(stack_1, lambda x: x == "\n") if not k]
biao_res = [list(g) for k, g in groupby(stack_2, lambda x: x == "\n") if not k]

# Build the row annotation: the words, a tab, then the tags
list_new = []
for x, y in zip(zifu_res, biao_res):
    list_new.append(" ".join(x) + "\t" + " ".join(y))
print(list_new)

# Write one sentence per line; the .txt file is created automatically if it does not exist.
# Mode "w" overwrites the file, so running the script twice does not duplicate the data.
with open("train1234.txt", "w", encoding="utf-8") as file_handle:
    for text in list_new:
        file_handle.write(text + "\n")
Example result:
Namun aparat yang berjaga langsung sigap dan menangkap kedua orang itu . CC NN PRL VB RB NN CC VB NN NN DT Z
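A quick sanity check on the converted file is to confirm that every line has as many tags as words. The sketch below does this for the example line; the helper name `parse_row` is hypothetical, and it assumes a single tab separates the words from the tags, which is what the conversion code inserts.

```python
def parse_row(line):
    """Split one row-format line into aligned (word, tag) pairs."""
    # Assumption: exactly one tab separates the words from the tags.
    words_part, tags_part = line.rstrip("\n").split("\t")
    words = words_part.split()
    tags = tags_part.split()
    if len(words) != len(tags):
        raise ValueError(f"{len(words)} words but {len(tags)} tags")
    return list(zip(words, tags))


row = ("Namun aparat yang berjaga langsung sigap dan menangkap kedua orang itu ."
       "\tCC NN PRL VB RB NN CC VB NN NN DT Z\n")
pairs = parse_row(row)
print(len(pairs))  # 12
print(pairs[0])    # ('Namun', 'CC')
```

A line that raises `ValueError` here points to a sentence whose words and tags went out of step during conversion.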