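1. Background
PaddleNLP project: express waybill information extraction
That project extracts structured fields from express waybills. Can the same approach be transferred to extracting specific fields from physical examination reports? This project explores that question.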
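2. Conversion code
An earlier post covered installing brat. Once brat is set up, it is used to annotate the ultrasound findings; the .ann and .txt files it produces then have to be converted into the BIO-labelled format used by PaddleNLP. The code is as follows: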
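import glob
import os
# Label mapping: converts the Chinese annotation labels into short English tags
labelmapping = {
    # "-LABEL" variants of the categories listed further down
    "肝脏超声-LABEL": "GZL",
    "胆囊超声-LABEL": "DNL",
    "胰腺超声-LABEL": "YXL",
    "脾脏超声-LABEL": "PZL",
    "肾脏超声-LABEL": "SZL",
    "子宫超声-LABEL": "ZGL",
    "颈动脉超声-LABEL": "JDL",
    "胸腔超声-LABEL": "XQL",
    "心脏超声-LABEL": "XZL",
    "甲状腺超声-LABEL": "JZL",
    "前列腺超声-LABEL": "QLL",
    "心电图-LABEL": "XDL",
    "肝脏超声": "GZ",    # liver ultrasound
    "胆囊超声": "DN",    # gallbladder ultrasound
    "胰腺超声": "YX",    # pancreas ultrasound
    "脾脏超声": "PZ",    # spleen ultrasound
    "肾脏超声": "SZ",    # kidney ultrasound
    "子宫超声": "ZG",    # uterus ultrasound
    "颈动脉超声": "JD",  # carotid artery ultrasound
    "胸腔超声": "XQ",    # thoracic cavity ultrasound
    "心脏超声": "XZ",    # cardiac ultrasound
    "甲状腺超声": "JZ",  # thyroid ultrasound
    "前列腺超声": "QL",  # prostate ultrasound
    "心电图": "XD",      # electrocardiogram
}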
# Conversion: place this script in the directory that holds the .ann and .txt
# files. It reads every pair in that directory and appends the converted
# samples to combine_ocr.
combine_ocr = []
def trans_ann2bio():
    for ann_file in sorted(glob.glob("*.ann")):
        # splitext keeps file names that contain extra dots intact
        ocr_file = os.path.splitext(ann_file)[0] + ".txt"
        # (start, end) -> English tag; brat end offsets are exclusive
        ann_map = {}
        with open(ann_file, "r", encoding="utf8") as annf:
            for ann in annf:
                fields = ann.rstrip("\n").split("\t")
                # only text-bound annotations (ids starting with "T") carry a span
                if len(fields) < 2 or not fields[0].startswith("T"):
                    continue
                # discontinuous spans ("0 2;5 7") are not supported here
                label, start, end = fields[1].split()[:3]
                ann_map[(int(start), int(end))] = labelmapping[label]
        with open(ocr_file, "r", encoding="utf8") as ocrf:
            offset = 0  # brat offsets count characters from the start of the file
            for ocr in ocrf:
                chars, labels = [], []
                for idx, ocrchr in enumerate(ocr, start=offset):
                    if ocrchr == "\n":
                        continue
                    bio_label = "O"
                    for (start, end), tag in ann_map.items():
                        if idx == start:
                            bio_label = tag + "-B"
                            break
                        if start < idx < end:
                            bio_label = tag + "-I"
                            break
                    chars.append(ocrchr)
                    labels.append(bio_label)
                offset += len(ocr)
                if chars:  # skip empty lines
                    combine_ocr.append("\x02".join(chars) + "\t" + "\x02".join(labels) + "\n")
# Run the conversion
trans_ann2bio()
# Split into training, validation and test sets (60% / 20% / 20%, in file order)
sample_num = len(combine_ocr)
train_end = round(sample_num * 0.6)
dev_end = round(sample_num * 0.8)
splits = {
    "train": combine_ocr[:train_end],
    "dev": combine_ocr[train_end:dev_end],
    "test": combine_ocr[dev_end:],
}
os.makedirs("data", exist_ok=True)  # the output directory may not exist yet
for name, samples in splits.items():
    # "w" rather than "a+", so re-running the script does not append a second header
    with open(os.path.join("data", name + ".txt"), "w", encoding="utf8") as f:
        f.write("text_a\tlabel\n")
        f.writelines(samples)
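To make the input and output formats concrete, here is a minimal in-memory sketch of the same conversion for a single annotation. The sample sentence, the one-entry LABEL_MAP, and the helper names parse_ann_line and to_bio are made up for illustration and are not part of the script above. It assumes non-overlapping, contiguous spans.

```python
SEP = "\x02"  # separator between characters and between labels

# Hypothetical one-entry subset of the label mapping, for illustration only
LABEL_MAP = {"肝脏超声": "GZ"}


def parse_ann_line(line, label_map):
    """Parse one brat text-bound annotation such as 'T1<TAB>肝脏超声 0 2<TAB>肝脏'.

    Returns (start, end, tag), or None for lines this sketch does not handle
    (relations, attributes, notes, discontinuous spans).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2 or not fields[0].startswith("T"):
        return None
    parts = fields[1].split()
    if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
        return None
    label, start, end = parts
    return int(start), int(end), label_map[label]


def to_bio(text, spans):
    """Label every character of text; span end offsets are exclusive."""
    labels = ["O"] * len(text)
    for start, end, tag in spans:
        labels[start] = tag + "-B"
        for i in range(start + 1, end):
            labels[i] = tag + "-I"
    return SEP.join(text) + "\t" + SEP.join(labels)


ann_lines = [
    "T1\t肝脏超声 0 2\t肝脏\n",          # text-bound annotation
    "#1\tAnnotatorNotes T1\tnote\n",    # annotator note, skipped
]
spans = [s for s in (parse_ann_line(l, LABEL_MAP) for l in ann_lines) if s]
sample = to_bio("肝脏形态正常", spans)
print(repr(sample))
```

The first two characters receive GZ-B and GZ-I, and every other character receives O. This is the per-character layout that the train, dev and test files contain after the header line.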
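3. Training
Because the dataset is very small, training does not work well and prediction quality is mediocre. More annotated data will be needed for later training runs, which will hopefully give better results.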
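Since the weak result is attributed to the small dataset, it may help to quantify how few entity spans each split actually contains before collecting more data. The sketch below counts spans per tag in lines of the "text_a<TAB>label" format written by the conversion script. The function name count_entities and the demo lines are hypothetical.

```python
from collections import Counter

SEP = "\x02"              # separator between characters and between labels
HEADER = "text_a\tlabel"  # header line written at the top of each split


def count_entities(lines):
    """Count entity spans per tag by counting labels that end in '-B'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line == HEADER:
            continue
        # labels never contain a tab, so split on the last one
        text, _, labels = line.rpartition("\t")
        chars = text.split(SEP)
        tags = labels.split(SEP)
        if len(chars) != len(tags):
            raise ValueError("character/label count mismatch: %r" % line)
        counts.update(tag[:-2] for tag in tags if tag.endswith("-B"))
    return counts


# Made-up demo lines in the same layout as data/train.txt
demo = [
    "text_a\tlabel\n",
    "肝\x02脏\x02正\x02常\tGZ-B\x02GZ-I\x02O\x02O\n",
    "胆\x02囊\tDN-B\x02DN-I\n",
]
print(count_entities(demo))
```

On real data the function can be given an open file, for example `with open("data/train.txt", encoding="utf8") as f: print(count_entities(f))`. A tag with only a handful of spans in the training split is a likely candidate for more annotation.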