A Summary of Common Tagging Schemes for Sequence Labeling Tasks

Latest recommended article published on 2024-04-29 00:38:22

W_Yeee

Tags: python, natural language processing, machine learning

Article link: https://blog.csdn.net/weixin_48592695/article/details/126004598

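Sequence labeling: sequence labeling is a fundamental NLP task in which every element of a sequence is assigned a label. Typical sequence labeling tasks include Chinese word segmentation and named entity recognition (NER).

Every element must receive a label. Typically one label marks the beginning of an entity and another marks its middle or end. In NER, for example, the most widely used scheme is BIO.

The common tagging schemes are listed below: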

1. BIO tagging scheme:

B-begin: the beginning of an entity

I-inside: the middle or end of an entity

O-outside: not part of any entity
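To make the BIO definitions concrete, here is a minimal sketch that turns entity spans into BIO tags. The function name `spans_to_bio`, the `(start, end, type)` span format with an exclusive end index, and the sample sentence are all assumptions made for illustration; they do not come from any particular dataset or library.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per token; each span is (start, end, type), end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        # The first token of the entity gets B-, the remaining tokens get I-
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags


tokens = ["Barack", "Obama", "visited", "New", "York", "City", "today"]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

Note that the last token of an entity is tagged I-, just like the middle tokens, so the end of an entity is only visible from the tag that follows it.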

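2. BIOES tagging scheme:

B-begin: the beginning of an entity

I-inside: the middle of an entity

O-outside: not part of any entity

E-end: the end of an entity

S-single: a single element that is an entity by itself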
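The difference from BIO is easiest to see by converting one into the other. The following is a minimal sketch, assuming well-formed BIO input with tags of the form `B-TYPE`, `I-TYPE`, or `O`; the function name `bio_to_bioes` is hypothetical.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert well-formed BIO tags to BIOES by looking one tag ahead."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, ent_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type
        continues = next_tag == f"I-{ent_type}"
        if prefix == "B":
            bioes.append(tag if continues else f"S-{ent_type}")
        else:
            bioes.append(tag if continues else f"E-{ent_type}")
    return bioes


bio = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-ORG"]
bioes = bio_to_bioes(bio)
print(bioes)
```

A B- tag with no continuation becomes S-, and the last I- tag of an entity becomes E-, which gives the model an explicit signal for entity boundaries.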

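3. BMES tagging scheme:

B-begin: the beginning of an entity

M-middle: the middle of an entity

O-outside: not part of any entity

E-end: the end of an entity

S-single: a single element that is an entity by itself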
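BMES is also commonly used for Chinese word segmentation, where each character is tagged by its position inside a word. The sketch below assumes the sentence is already split into words; the function name `words_to_bmes` and the sample sentence are made up for illustration. In segmentation every character belongs to some word, so the O tag does not appear; for NER the same letters are usually combined with an entity type, as in `B-PER`.

```python
from typing import List


def words_to_bmes(words: List[str]) -> List[str]:
    """Return one BMES tag per character for an already segmented sentence."""
    tags = []
    for word in words:
        if len(word) == 1:
            # A one-character word is tagged S
            tags.append("S")
        else:
            # First character B, last character E, everything in between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


words = ["我", "喜欢", "自然语言", "处理"]
bmes_tags = words_to_bmes(words)
for char, tag in zip("".join(words), bmes_tags):
    print(char, tag)
```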

Overall, the different tagging schemes perform about the same on many tasks.

Below is code that converts BMES labels to the BIO scheme:

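import json
import logging
import os
from os.path import join

from tqdm import trange

# Assumption: the original post does not show how `logger` is created,
# so the standard-library logging module is used here.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_lines(path, encoding='utf8'):
    """Read a text file and return its lines with surrounding whitespace stripped."""
    with open(path, 'r', encoding=encoding) as f:
        return [line.strip() for line in f]


def write_lines(lines, path, encoding='utf8'):
    """Write each item of `lines` to `path`, one item per line."""
    with open(path, 'w', encoding=encoding) as f:
        for line in lines:
            f.write('{}\n'.format(line))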


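def bmes_to_json(bmes_file, json_file):
    """
    Convert a BMES-format file into a JSON Lines file. Each output line holds the
    `text` and `label` lists of one sentence, with labels converted to the BIO scheme.
    Args:
        bmes_file: input path; assumed to hold one "token label" pair per line,
            with a blank line between sentences
        json_file: output path
    """
    texts = []
    with open(bmes_file, 'r', encoding='utf8') as f:
        lines = f.readlines()
    # A trailing empty line makes sure the last sentence is emitted as well
    lines.append('')
    words = []
    labels = []
    for idx in trange(len(lines)):
        line = lines[idx].strip()

        if not line:
            # A blank line ends the current sentence; repeated blank lines are skipped
            if words:
                assert len(words) == len(labels), (len(words), len(labels))
                sample = {'text': words, 'label': labels}
                texts.append(json.dumps(sample, ensure_ascii=False))
            words = []
            labels = []
        else:
            # First field is the token, last field is its label
            parts = line.split()
            word, label = parts[0], parts[-1]
            # BMES -> BIO: M- and E- become I-, S- becomes B-
            if label.startswith(('M-', 'E-')):
                label = 'I-' + label[2:]
            elif label.startswith('S-'):
                label = 'B-' + label[2:]
            words.append(word)
            labels.append(label)

    with open(json_file, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write("{}\n".format(text))


if __name__ == '__main__':
    # Generate the JSON files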
    data_names = ['msra']
    path = '../datasets'
    for data_name in data_names:
        logger.info('processing dataset:{}'.format(data_name))
        files = os.listdir(join(path, data_name))
        for file in files:
            file = join(path, data_name, file)
            data_type = os.path.basename(file).split('.')[0]
            out_path = join(path, data_name, data_type+'.json')