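For installation, see: doccano installation and usage tutorial
After opening doccano
Log in
Log in from the top-right corner; you can also switch the interface language to Chinese there.
Create a project
Click Create in the top-left corner.
Select Sequence Labeling.
Import the dataset
Go to Dataset.
Click Import.
(Each line of the text file is one sentence; as an example, I used a few chapter titles from Water Margin.)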
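The one-sentence-per-line input file can be prepared with a few lines of Python. This is only a minimal sketch: the helper name `write_sentences` and the output path are hypothetical, and it assumes the file is imported as plain text where each line becomes one example.

```python
from pathlib import Path


def write_sentences(sentences, path):
    """Write one sentence per line (hypothetical helper, not part of doccano).

    Newlines inside a sentence are replaced by spaces and empty entries are
    dropped, so every line of the file is exactly one example.
    Returns the number of sentences written.
    """
    cleaned = [s.replace('\n', ' ').strip() for s in sentences]
    cleaned = [s for s in cleaned if s]
    Path(path).write_text('\n'.join(cleaned) + '\n', encoding='utf-8')
    return len(cleaned)
```

For example, `write_sentences(["张天师祈禳瘟疫 洪太尉误走妖魔", "王教头私走延安府 九纹龙大闹史家村"], "chapters.txt")` writes a two-line file that can then be uploaded on the import page.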
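A successful import looks like this:
Add labels
Open Labels and add your labels.
Annotate
Click Start Annotation in the top-left corner.
Export the results
When annotation is finished, export the dataset.
Unzip the downloaded archive to get admin.jsonl.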
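Before converting, it can help to check what the export contains. The sketch below counts how many spans carry each label. It assumes every non-empty line of the file is a JSON object with a `label` list of `[start, end, label]` triples, as in the sample record shown in this post; other doccano versions may use different key names. The function name `label_counts` is hypothetical.

```python
import json
from collections import Counter


def label_counts(jsonl_path):
    """Count annotated spans per label in a JSONL export (hypothetical helper)."""
    counts = Counter()
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumed layout: "label": [[start, end, label], ...]
            for _start, _end, label in record.get('label', []):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(label_counts(sys.argv[1]))
```

A label with a count of zero, or one that does not appear at all, is a quick hint that some sentences were left unannotated.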
Convert to BIO
import json


def json2bio(fpath, output):
    """Convert a doccano JSONL export into character-level BIO format."""
    # Open the output once in 'w' mode so re-running does not append duplicates.
    with open(fpath, encoding='utf-8') as fin, open(output, 'w', encoding='utf-8') as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            # e.g. {"id":1,"text":"张天师祈禳瘟疫 洪太尉误走妖魔","label":[[0,3,"人物"],[3,5,"动作"],[8,11,"人物"],[11,13,"动作"]]}
            annotations = json.loads(line)
            text = annotations['text'].replace('\n', ' ')  # '张天师祈禳瘟疫 洪太尉误走妖魔'
            # One token per character; spaces become ',' so each output line has exactly two fields.
            all_words = list(text.replace(' ', ','))  # ['张', '天', '师', '祈', '禳', '瘟', '疫', ',', '洪', ...]
            all_label = ['O'] * len(all_words)
            for start, end, label in annotations['label']:  # e.g. [0, 3, '人物']; end is exclusive
                all_label[start] = 'B-' + label
                for idx in range(start + 1, end):
                    all_label[idx] = 'I-' + label
            # Write one "character tag" pair per line.
            for word, tag in zip(all_words, all_label):
                fout.write(word + ' ' + tag + '\n')
            fout.write('\n')  # a blank line separates sentences
The result is a BIO-format training set that can be used directly for training.
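To make the tagging scheme concrete, here is a small self-contained sketch that applies the same character-level BIO rule to the sample record from this post and prints the result. The function name `bio_tags` is hypothetical; it assumes spans are `[start, end, label]` with an exclusive end offset, as in the sample.

```python
def bio_tags(text, spans):
    """Return (character, tag) pairs for one record (hypothetical helper)."""
    # Same normalisation as the conversion step: newlines and spaces become ','.
    tokens = list(text.replace('\n', ' ').replace(' ', ','))
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = 'B-' + label          # first character of the span
        for i in range(start + 1, end):     # remaining characters of the span
            tags[i] = 'I-' + label
    return list(zip(tokens, tags))


if __name__ == "__main__":
    sample_text = "张天师祈禳瘟疫 洪太尉误走妖魔"
    sample_spans = [[0, 3, "人物"], [3, 5, "动作"], [8, 11, "人物"], [11, 13, "动作"]]
    for token, tag in bio_tags(sample_text, sample_spans):
        print(token, tag)
```

For the sample record, the output starts with `张 B-人物`, `天 I-人物`, `师 I-人物`, `祈 B-动作`, `禳 I-动作`, and the unannotated characters 瘟 and 疫 are tagged `O`.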
You can now train a model on this dataset.
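Most training code first needs the BIO file read back into sentences. The following is a minimal sketch of such a loader, assuming the file layout produced above: one `character tag` pair per line, separated by a single space, with a blank line between sentences. The function name `read_bio` is hypothetical and not tied to any particular training framework.

```python
def read_bio(path):
    """Read a BIO file into a list of (tokens, tags) pairs (hypothetical helper)."""
    sentences = []
    tokens, tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line.strip():
                # A blank line closes the current sentence.
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            # Split on the first space only: the token is a single character,
            # everything after the space is the tag.
            token, tag = line.split(' ', 1)
            tokens.append(token)
            tags.append(tag)
    if tokens:  # file did not end with a blank line
        sentences.append((tokens, tags))
    return sentences
```

Each returned pair holds the characters of one sentence and their tags in the same order, which is the shape most sequence-labeling data pipelines expect before mapping tokens and tags to integer ids.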