1. NER (named entity recognition): given the BIO tags predicted by the network, how do we decode them and extract the entities?
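Approach 1: when a B tag is encountered, store any entity that is currently being accumulated. A run of consecutive I tags with no leading B is also treated as one entity, which covers the error case where the model predicted I for what should have been a B. When the type labels inside one span disagree, the span takes the most frequent type (the mode).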
# Decode BIO-format tags (approach 1)
string="我是李明,我爱中国,我来自呼和浩特"
predict=["o","o","i-per","i-per","o","o","o","b-loc","i-loc","o","o","o","o","b-per","i-loc","i-loc","i-loc"]
item = {"string": string, "entities": []}
entity_name = ""  # characters of the entity being built
flag = []         # type label of each character in the current entity
for char, tag in zip(string, predict):
    if tag[0] == "b":
        # A new entity starts: store the previous one first.
        if entity_name != "":
            x = dict((a, flag.count(a)) for a in flag)             # type -> count
            y = [k for k, v in x.items() if max(x.values()) == v]  # most frequent type(s)
            item["entities"].append({"word": entity_name, "type": y[0]})
            flag.clear()
            entity_name = ""
        entity_name += char
        flag.append(tag[2:])
    elif tag[0] == "i":
        # An I tag extends the current entity, or opens one if no B came before it.
        entity_name += char
        flag.append(tag[2:])
    else:
        # An O tag closes the current entity, if there is one.
        if entity_name != "":
            x = dict((a, flag.count(a)) for a in flag)
            y = [k for k, v in x.items() if max(x.values()) == v]
            item["entities"].append({"word": entity_name, "type": y[0]})
        flag.clear()
        entity_name = ""
# Store the entity that is still open when the sentence ends.
if entity_name != "":
    x = dict((a, flag.count(a)) for a in flag)
    y = [k for k, v in x.items() if max(x.values()) == v]
    item["entities"].append({"word": entity_name, "type": y[0]})
print(item)
{'string': '我是李明,我爱中国,我来自呼和浩特', 'entities': [{'word': '李明', 'type': 'per'}, {'word': '中国', 'type': 'loc'}, {'word': '呼和浩特', 'type': 'loc'}]}
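The majority vote over the type labels of one span can also be written with `collections.Counter`. This is only a sketch of that single step; `majority_type` is a hypothetical helper name, not something from the code above. On a tie it returns the type seen first, which matches the `y[0]` behavior on Python 3.7+ where dicts keep insertion order.

```python
from collections import Counter


def majority_type(types):
    """Return the most frequent type label; ties go to the label seen first."""
    counts = Counter(types)             # keeps first-seen order on Python 3.7+
    return max(counts, key=counts.get)  # max() returns the first maximal key


# "呼和浩特" was tagged b-per, i-loc, i-loc, i-loc, so loc wins 3 to 1.
print(majority_type(["per", "loc", "loc", "loc"]))  # loc
# A tie is resolved in favor of the label that appeared first.
print(majority_type(["per", "loc"]))  # per
```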
Approach 2: keep only the entities that start with a B tag and discard everything else. The type is again decided by the mode.
# Decode BIO-format tags (approach 2)
string="我是李明,我爱中国,我来自呼和浩特"
predict=["o","o","i-per","i-per","o","o","o","b-loc","i-loc","o","o","o","o","b-per","i-loc","i-loc","i-loc"]
item = {"string": string, "entities": []}
entity_name = ""  # characters of the entity being built
flag = []         # type label of each character in the current entity
visit = False     # True only while inside an entity that was opened by a B tag
for char, tag in zip(string, predict):
    if tag[0] == "b":
        # A new entity starts: store the previous one first.
        if entity_name != "":
            x = dict((a, flag.count(a)) for a in flag)             # type -> count
            y = [k for k, v in x.items() if max(x.values()) == v]  # most frequent type(s)
            item["entities"].append({"word": entity_name, "type": y[0]})
            flag.clear()
            entity_name = ""
        visit = True
        entity_name += char
        flag.append(tag[2:])
    elif tag[0] == "i" and visit:
        # An I tag counts only when a B tag opened the entity.
        entity_name += char
        flag.append(tag[2:])
    else:
        # An O tag, or an I tag with no preceding B, closes the current entity.
        if entity_name != "":
            x = dict((a, flag.count(a)) for a in flag)
            y = [k for k, v in x.items() if max(x.values()) == v]
            item["entities"].append({"word": entity_name, "type": y[0]})
        flag.clear()
        visit = False
        entity_name = ""
# Store the entity that is still open when the sentence ends.
if entity_name != "":
    x = dict((a, flag.count(a)) for a in flag)
    y = [k for k, v in x.items() if max(x.values()) == v]
    item["entities"].append({"word": entity_name, "type": y[0]})
print(item)
{'string': '我是李明,我爱中国,我来自呼和浩特', 'entities': [{'word': '中国', 'type': 'loc'}, {'word': '呼和浩特', 'type': 'loc'}]}
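The two approaches differ only in whether a run of I tags may open an entity, so they can be folded into one function. The following is a rough sketch under that reading, not the original implementation: the function name `decode_bio` and the parameter `require_b` are made up for illustration. `require_b=False` corresponds to approach 1 and `require_b=True` to approach 2.

```python
def decode_bio(text, tags, require_b=False):
    """Decode character-level BIO tags into a list of entities.

    require_b=False: a run of I tags without a B still forms an entity.
    require_b=True:  only spans opened by a B tag are kept.
    """
    entities = []
    chars, types = [], []  # characters and type labels of the open span

    def flush():
        # Store the open span, typed by majority vote (first seen wins ties).
        if chars:
            entities.append({"word": "".join(chars),
                             "type": max(types, key=types.count)})
            chars.clear()
            types.clear()

    for char, tag in zip(text, tags):
        prefix = tag[0].lower()
        if prefix == "b":
            flush()  # a B tag always closes the previous span
            chars.append(char)
            types.append(tag[2:])
        elif prefix == "i" and (chars or not require_b):
            chars.append(char)
            types.append(tag[2:])
        else:
            flush()  # O tag, or a stray I tag when require_b is set
    flush()  # store the span still open at the end of the text
    return entities


string = "我是李明,我爱中国,我来自呼和浩特"
predict = ["o", "o", "i-per", "i-per", "o", "o", "o", "b-loc", "i-loc",
           "o", "o", "o", "o", "b-per", "i-loc", "i-loc", "i-loc"]
print(decode_bio(string, predict))                  # keeps 李明, 中国, 呼和浩特
print(decode_bio(string, predict, require_b=True))  # keeps 中国, 呼和浩特
```

Approach 1 recovers more entities when the model misses a B tag, but it will also merge two adjacent entities whose boundary B was predicted as I. Approach 2 is stricter and drops such spans entirely.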