小黑 NER Step by Step: BIO-Encoded Prediction in NER

爱喝喜茶爱吃烤冷面的小黑黑

Last modified: 2022-03-02 14:00:05


First published: 2022-03-02 13:59:28

Article link: https://blog.csdn.net/qq_37418807/article/details/123228697


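This code validates BIO encoding and extracts entities. First, `check_bio` checks whether the input tags follow the BIO scheme: a tag outside the scheme makes it return False, while a misplaced `I-` tag is rewritten to `B-` in place. Next, `check_label` decides whether two adjacent BIO tags belong to the same entity. Finally, `format_result` builds a list of entities from the tags and the character sequence, for use in information extraction. The examples show the entities extracted from several BIO tag sequences.

Summary generated automatically by CSDN.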
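
To make the BIO scheme concrete, here is a minimal sketch that goes in the opposite direction: it turns known entity spans into a BIO tag sequence. The helper name `spans_to_bio` and the span format (1-based, inclusive `begin`/`end`, chosen to match the positions that `format_result` reports) are assumptions made for illustration; they are not part of the original code.

```python
def spans_to_bio(length, spans):
    """Build a BIO tag list of the given length from (begin, end, type) spans.

    Positions are 1-based and inclusive (an assumption for this sketch).
    """
    tags = ['O'] * length
    for begin, end, entity_type in spans:
        # The first character of an entity gets B-<type> ...
        tags[begin - 1] = 'B-' + entity_type
        # ... and every following character of the same entity gets I-<type>.
        for pos in range(begin, end):
            tags[pos] = 'I-' + entity_type
    return tags


chars = '小黑黑来自北京大学'
tags = spans_to_bio(len(chars), [(1, 3, 'person'), (6, 9, 'org')])
for char, tag in zip(chars, tags):
    print(char, tag)
```

Each entity starts with exactly one `B-` tag, continues with `I-` tags of the same type, and everything else is `O`. A sequence that breaks this pattern, for example an `I-` tag right after `O`, is what the code in this post detects and repairs.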

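def check_bio(tags):
    """
    Check whether the input tags are valid BIO encoding, repairing them in place.

    Returns False if a tag is not 'O', 'B-xxx' or 'I-xxx'.
    Otherwise these errors are fixed by rewriting the tag to 'B-xxx':
    (1) the first tag is I-xxx
    (2) an I-xxx tag follows 'O'
    (3) an I-xxx tag follows a tag of a different type
    """
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        tag_list = tag.split('-')
        if len(tag_list) != 2 or tag_list[0] not in {'B', 'I'}:
            # Illegal tag
            return False
        if tag_list[0] == 'B':
            continue
        # I-xxx ... or ... O I-xxx -> B-xxx
        elif i == 0 or tags[i-1] == 'O':
            # An I tag in first position, or right after O, becomes B
            tags[i] = 'B' + tag[1:]
        # B-xxx I-xxx or I-xxx I-xxx
        elif tags[i-1][1:] == tag[1:]:
            # Same type as the previous tag: nothing to fix
            continue
        # I-aaa I-bbb -> I-aaa B-bbb
        else:
            tags[i] = 'B' + tag[1:]
    return True
#tags = ['O','I-loc','I-person','I-person']
#check_bio(tags)
#print(tags)  # ['O', 'B-loc', 'B-person', 'I-person']
#tags = ['I-loc','B-loc','O','I-person']
#check_bio(tags)
#print(tags)  # ['B-loc', 'B-loc', 'O', 'B-person']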

# Decide whether two adjacent BIO tags belong to one continuous entity
def check_label(front_label, follow_label):
    if not follow_label:
        raise Exception('follow_label should not be None')
    # NULL xxxx: there is no previous tag (start of the sequence)
    if not front_label:
        return True
    # xxxx B-aaa
    if follow_label.startswith('B-'):
        return False
    # I-aaa I-aaa or B-aaa I-aaa
    # Compare the full type, so that e.g. 'B-xorg' is not matched by 'I-org'
    if follow_label.startswith('I-') and front_label[:2] in ('B-', 'I-') and front_label[2:] == follow_label[2:]:
        return True
    # B-aaa B-bbb or O O or I-aaa B-bbb or I-aaa I-bbb
    return False

#labels = ['O','O']
#print(check_label(*labels))  # False
#labels = ['B-person','I-person']
#print(check_label(*labels))  # True
def format_result(chars, tags):
    entities = []
    entity = []
    for index, (char, tag) in enumerate(zip(chars, tags)):
        entity_continue = check_label(tags[index - 1] if index > 0 else None, tag)
        # If this tag does not continue the previous one, close the current group
        if not entity_continue and entity:
            entities.append(entity)
            entity = []
        entity.append([index, char, tag, entity_continue])
    if entity:
        entities.append(entity)
    entities_result = []
    for entity in entities:
        # Only groups starting with B- are entities; begin/end are 1-based and inclusive
        if entity[0][2].startswith('B-'):
            entities_result.append(
                {
                    'begin': entity[0][0] + 1,
                    'end': entity[-1][0] + 1,
                    'words': ''.join([char for _, char, _, _ in entity]),
                    'type': entity[0][2].split('-')[1]
                }
            )
    return entities_result
tags = ['O','O','B-person','I-person','I-person','O','O','B-org','I-org','I-org','I-org','O']
chars = '干！小黑黑来自北京大学。'
print(format_result(chars, tags))
tags = ['O','O','I-person','I-person','I-person','O','O','B-org','I-org','I-org','I-org','O']
print('Non-strict BIO tags:', tags)
print(format_result(chars, tags))
print('check_bio result (tags repaired in place):', check_bio(tags))
print(format_result(chars, tags))

Output:

[{'begin': 3, 'end': 5, 'words': '小黑黑', 'type': 'person'}, {'begin': 8, 'end': 11, 'words': '北京大学', 'type': 'org'}]
Non-strict BIO tags: ['O', 'O', 'I-person', 'I-person', 'I-person', 'O', 'O', 'B-org', 'I-org', 'I-org', 'I-org', 'O']
[{'begin': 8, 'end': 11, 'words': '北京大学', 'type': 'org'}]
check_bio result (tags repaired in place): True
[{'begin': 3, 'end': 5, 'words': '小黑黑', 'type': 'person'}, {'begin': 8, 'end': 11, 'words': '北京大学', 'type': 'org'}]
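
As the output shows, `format_result` silently drops an entity that starts with `I-` unless `check_bio` has rewritten the tag list first, and `check_bio` modifies the list it is given. As a hedged alternative, the sketch below does the repair and the extraction in a single pass without mutating its input. The function name `extract_entities` is hypothetical, and the policy of treating an orphan `I-` tag as the start of a new entity is an assumption that mirrors what `check_bio` does; it is not the author's implementation.

```python
def extract_entities(chars, tags):
    """Extract entities from BIO tags in one pass, leaving `tags` untouched.

    An I- tag that does not continue an entity of the same type is treated
    as if it were B- (assumed policy, mirroring check_bio's repair).
    Positions are 1-based and inclusive, like format_result.
    """
    entities = []
    current = None  # the entity being built, or None when outside one
    for index, (char, tag) in enumerate(zip(chars, tags), start=1):
        prefix, _, entity_type = tag.partition('-')
        if prefix == 'I' and current and current['type'] == entity_type:
            # Same type as the open entity: extend it by one character.
            current['end'] = index
            current['words'] += char
            continue
        # Any other tag closes the entity in progress.
        if current:
            entities.append(current)
            current = None
        # B-xxx, or an orphan I-xxx, opens a new entity; 'O' and unknown tags do not.
        if prefix in ('B', 'I') and entity_type:
            current = {'begin': index, 'end': index, 'words': char, 'type': entity_type}
    if current:
        entities.append(current)
    return entities


chars = '干！小黑黑来自北京大学。'
tags = ['O', 'O', 'I-person', 'I-person', 'I-person', 'O', 'O',
        'B-org', 'I-org', 'I-org', 'I-org', 'O']
result = extract_entities(chars, tags)
print(result)
```

On the non-strict tag sequence from the example above, this returns both the `person` and the `org` entity, and `tags` still contains the original `I-person` at the third position. Unlike `check_bio`, it does not report tags outside the BIO scheme; it simply treats them like `O`.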
